data | ||
img | ||
.gitignore | ||
clean.ipynb | ||
eda.ipynb | ||
model.ipynb | ||
readme.md |
Analyzing Fuel Economy
Predicting MPG of a vehicle using a linear regression model. Success will be evaluated using r-squared and root mean square error scores over time as well as testing with some unseen, futuristic, and very different data compared to the training set.
If you're curious about what goes into fuel economy, this should provide some insight on how it all works.
Contents
- Clean - Quick look to find any missing values, data of wrong types. Make sure data is in ranges that make sense
- EDA - Investigate outliers and other items of interest. Manufacture some new features
- Model - Model and make predictions. Introduce some outside data
Libs
- pandas
- numpy
- seaborn
- os
- matplotlib
- Ipython
- sklearn
The Data
Using https://archive-beta.ics.uci.edu/ml/datasets/auto+mpg
-
Title: Auto-Mpg Data
-
Sources: (a) Origin: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. (c) Date: July 7, 1993
-
Past Usage:
- See 2b (above)
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
-
Relevant Information:
This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".
"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)
-
Number of Instances: 398
-
Number of Attributes: 9 including the class attribute
-
Attribute Information:
- mpg: continuous
- cylinders: multi-valued discrete
- displacement: continuous
- horsepower: continuous
- weight: continuous
- acceleration: continuous
- model year: multi-valued discrete
- origin: multi-valued discrete
- car name: string (unique for each instance)
-
Missing Attribute Values: horsepower has 6 missing values
Summary
A bit on engines:
- A most basic description of an engine is that it's an air pump
- Horsepower = (Torque * RPM) / 5252
- Torque peak is where an engine is operating most efficiently as far as air flow, applied science in action. (Fluid dynamics, resonance)
- Operating above or below the torque peak reduces efficiency and efficiency == fuel economy
- Torque peaks normally occur below 5252rpm, and horsepower peaks above that, so long as the engine can actually rev that high. On a dyno sheet (measuring torque and horsepower vs rpm) you'll see the torque/horsepower lines cross at 5252rpm
- As an engine spins faster, the power output increases until combustion is so inefficient and it produces so little torque that spinning faster produces no more power, if it holds together that long
Basically an engine that makes lots of power at high rpm but relatively little low end torque (mazda rotary), is going to have poor fuel economy because it spends most of its time outside of its efficiency range during normal driving. In contrast, diesel engines typically turn lower rpms and create all kinds of torque down low. So not only do they start off making more torque but they are less likely to stray very far from torque peak during normal driving. This is also why horsepower numbers on a diesel appear low, because they can't rev as high. There's more to it than this but this should be enough to provide context.
Features
From the original features I chose:
- mpg: target
- number of cylinders
- engine displacement (in cubic inches)
- Horsepower
- Total weight of vehicle
From these I then calculated:
- Efficiency: HP per cubic inch - as this increases so does MPG
- Load: cubic inches per lb of weight - metric of how hard the engine has to work compared to its size. Engines that work hard use more fuel and a small engine working really hard can use more fuel than a big engine not doing much
- Bore_size: cubic inches per cylinder - best attempt to describe cylinder bore diameter which gives insight on torque curve
- Grunt: bore size divided by efficiency - an attempt to describe the power curve of an engine, or more specifically the presence/absence of low rpm torque output
All of these features are continuous.
Model
Into the model:
- Horsepower
- Displacement
- Number of cylinders
- Weight
Which then gets turned into:
- Horsepower
- Bore_size
- Grunt
- Load
Then sent through:
- Quantile Transformer
- Linear Regression