sweissblaug: January 2016

Introduction:

My previous post discussed methods to uncover the effect of a particular explanatory variable on a response variable in machine learning models. These methods work by changing the variable of interest and measuring how much the output changes while keeping all other variables constant. However, the assumption that we can hold the other variables constant is almost surely incorrect. In most applications, changing one variable changes the distributions of other covariates as well.

In this post I will show the effect of a change in values by simulating a dataset, keeping only variables of interest fixed while allowing all other variables to change. I will do this by simulating data points from a method described here. I'll compare this method with existing techniques.

The topic I'm going to explore is whether turbocharged engines actually improve fuel economy. Over the past few years, turbocharging has been increasingly used to reduce engine size while increasing power and fuel economy and reducing emissions. It has been hailed as the solution to increasing power while decreasing fuel consumption. A common theme among proponents is that one can achieve V-6 power by turbocharging I-4 engines and thereby achieve I-4 fuel consumption. However, real-world driving has suggested that a turbocharged option doesn't necessarily reduce consumption; it merely gives you the option to reduce consumption if you don't use all the power.

Data:
The car data was scraped from the Car and Driver website and includes 416 car reviews with "observed mpg" along with 15 other variables, including weight, horsepower, torque, and zero-to-60 time.

I'll compare observed fuel economy for a turbo I-4 and a naturally aspirated V-6 with the same 0 to 60 time (6.5 seconds). The two regimes are:

Turbo Regime: {Engine Breathing=Turbo, Engine_Type=I-4, Zero.60=6.5, Date=Most recent}

Naturally Aspirated Regime: {Engine Breathing=Naturally Aspirated, Engine_Type=V-6, Zero.60=6.5, Date=Most recent}

I picked these examples because changing from a V-6 to an I-4 is a very common choice and a 0-60 time of 6.5 seconds is fairly standard for a mid-range family sedan.

My previous post discussed the methodology of holding everything else constant. Here I will predict each observation twice, once under each regime (turbo I-4 and naturally aspirated V-6), and compare the predictions:

f(mpg.observed | other covariates, Turbo Regime) - f(mpg.observed | other covariates, Naturally Aspirated Regime)

where f() is approximated by a machine learning method, in this case a random forest. Below is a histogram of the results.

Using this method, the mean change in observed mpg is -0.028. The changes also vary widely relative to that mean (the 1st quartile is -0.077 and the 3rd quartile is 0.025). These results suggest that turbos don't have much of an effect on fuel consumption.
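The predict-twice comparison can be sketched as follows. This is a minimal Python illustration, not the code used for the post: `toy_model`, its coefficients, and the `Weight` values are made-up stand-ins for a fitted random forest and the scraped data, and the Date variable is left out of the regimes for brevity.

```python
# Regime settings taken from the post (Date omitted in this sketch).
TURBO = {"Engine Breathing": "Turbo", "Engine_Type": "I-4", "Zero.60": 6.5}
NATURAL = {"Engine Breathing": "Naturally Aspirated", "Engine_Type": "V-6", "Zero.60": 6.5}


def predict_under(model, row, regime):
    """Predict for a copy of `row` with the regime variables overwritten."""
    return model({**row, **regime})


def regime_differences(model, rows, regime_a, regime_b):
    """Per-observation f(. | regime_a) - f(. | regime_b), other covariates held fixed."""
    return [predict_under(model, r, regime_a) - predict_under(model, r, regime_b)
            for r in rows]


def toy_model(row):
    # Hypothetical stand-in for a fitted model; the coefficients are invented.
    mpg = 40.0 - 0.004 * row["Weight"]
    if row["Engine Breathing"] == "Turbo":
        mpg -= 0.3
    if row["Engine_Type"] == "V-6":
        mpg -= 0.5
    return mpg


# Made-up observations, only to exercise the functions.
rows = [
    {"Weight": 3200.0, "Engine Breathing": "Naturally Aspirated", "Engine_Type": "V-6", "Zero.60": 6.8},
    {"Weight": 3600.0, "Engine Breathing": "Turbo", "Engine_Type": "I-4", "Zero.60": 6.1},
]

diffs = regime_differences(toy_model, rows, TURBO, NATURAL)
print(sum(diffs) / len(diffs))
```

Because `Weight` is copied unchanged into both predictions, any effect that runs through weight cancels out of the difference, which is the limitation the rest of this post addresses.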

However, it's clear that changing the engine from an I-4 turbo to a naturally aspirated V-6 could change other variables as well. For instance, I-4 engines are lighter than V-6 engines (2 fewer cylinders, after all), so it's possible that changing from the Naturally Aspirated Regime to the Turbo Regime decreases fuel consumption through other channels.

From this post, we can simulate the distributions:

P(mpg.observed, other covariates | Turbo Regime) and P(mpg.observed, other covariates | Naturally Aspirated Regime). We can then see the differences between the distributions with these simulations.

Below is a pairs plot of several variables under each of the different regimes.

The mean observed mpg is 25.57 under the I-4 Turbo regime and 24.29 under the V-6 regime, indicating that I-4 turbos are, on average, 1.28 mpg more fuel efficient than similarly performing V-6s. It would appear that this is due, at least partially, to the lower weight of turbo cars. That is, turbo engines are associated with lighter cars (by 150 pounds on average), and that reduces fuel consumption.

Conclusion:
This post compared existing methods of predicting changes with one that simulates the distribution under different conditions. The resulting distribution allows the other covariates to change by the expected amount given a change in the variables of interest. The existing methods found essentially no change in average fuel consumption because the gain in fuel efficiency isn't directly caused by turbocharging an engine but comes through other channels, like a decrease in weight.

Code and Data

Introduction:

Generating data usually requires a variance-covariance matrix and is therefore restricted to a linear relationship between the variables. However, a linear assumption can miss important non-linear relationships. This post uses quantile random forests to simulate an arbitrary dataset.

MCMC Sampling:
MCMC sampling is a method that can simulate from any multivariate density as long as one has the full conditionals. A full conditional in this case is a model of one particular variable given all the other variables. More concretely, suppose we have a dataset with n variables (x1, x2, ... xn). The estimated full conditionals in this case are:

f(x1 | x2, x3...xn)

f(x2 | x1, x3...xn)

f(xn | x1, x2...x(n-1))

where f() is a machine learning algorithm of choice that can provide a distribution of values. In this case I use quantregForest() for continuous variables and randomForest() for categorical variables.
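To make the structure of the full conditionals concrete, here is a small Python sketch that only lists which model would be fit for each variable of the iris data. It does not fit anything; the learner names are the R functions mentioned above, and the `full_conditional_specs` helper is a hypothetical name introduced for illustration.

```python
# Iris columns and their types.
IRIS_COLUMNS = {
    "Sepal.Length": "continuous",
    "Sepal.Width": "continuous",
    "Petal.Length": "continuous",
    "Petal.Width": "continuous",
    "Species": "categorical",
}


def full_conditional_specs(columns):
    """Return (target, predictors, learner) for every full conditional."""
    specs = []
    for target, kind in columns.items():
        # Each variable is modelled from all of the other variables.
        predictors = [c for c in columns if c != target]
        learner = "quantregForest" if kind == "continuous" else "randomForest"
        specs.append((target, predictors, learner))
    return specs


for target, predictors, learner in full_conditional_specs(IRIS_COLUMNS):
    print(f"{learner}: {target} ~ {' + '.join(predictors)}")
```

With n variables this gives n models, one per variable.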

Once the models are built, the algorithm is as follows:

- Choose a random observation as the starting point
- For each of N iterations:
  - For each variable:
    - Sample a proposal value from the predicted distribution and compare its likelihood with that of the current value. Accept the proposal with probability min(p(proposal)/p(current), 1)
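The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code linked below: the conditional distribution of a variable is approximated by the values of its k nearest rows (a stand-in for quantregForest), the likelihood p() is a Gaussian kernel density over those values, and all variables are assumed continuous and on comparable scales. The names `nearest_values`, `kernel_density`, and `simulate` are hypothetical.

```python
import math
import random


def nearest_values(data, state, j, k):
    """Values of variable j among the k rows closest to `state` on all other variables."""
    def dist(row):
        return sum((row[i] - state[i]) ** 2 for i in range(len(state)) if i != j)
    return [row[j] for row in sorted(data, key=dist)[:k]]


def kernel_density(value, samples, bandwidth):
    """Gaussian kernel density estimate at `value` (stand-in for the model's likelihood)."""
    total = sum(math.exp(-0.5 * ((value - s) / bandwidth) ** 2) for s in samples)
    return total / (len(samples) * bandwidth * math.sqrt(2 * math.pi))


def simulate(data, n_iter, k=10, bandwidth=0.5, seed=0):
    rng = random.Random(seed)
    state = list(rng.choice(data))               # random observation as starting point
    draws = []
    for _ in range(n_iter):                      # for each iteration
        for j in range(len(state)):              # for each variable
            cond = nearest_values(data, state, j, k)
            proposal = rng.choice(cond)          # sample from the estimated conditional
            p_new = kernel_density(proposal, cond, bandwidth)
            p_cur = kernel_density(state[j], cond, bandwidth)
            # Accept with probability min(p(proposal) / p(current), 1).
            if p_cur == 0.0 or rng.random() < min(p_new / p_cur, 1.0):
                state[j] = proposal
        draws.append(tuple(state))
    return draws


# Toy non-linear data (made up): y is roughly x squared.
toy_rng = random.Random(1)
toy_data = []
for _ in range(200):
    x = toy_rng.uniform(-2.0, 2.0)
    toy_data.append((x, x * x + toy_rng.gauss(0.0, 0.1)))

for draw in simulate(toy_data, n_iter=5):
    print(draw)
```

In this sketch every proposed value is one that was actually observed, so the simulated points can only recombine existing values; a model that returns a full predictive distribution would not have that restriction.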

Code can be found here.

Results:

A random forest model was built to test whether the simulated data could be distinguished from the original data. The resulting KS test (out of sample) was insignificant, so we fail to reject the null hypothesis that the model can't distinguish between original and simulated data.
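One way such a check could look is sketched below in Python. It assumes the classifier produces an out-of-sample score for each row and that the KS test compares the scores of original rows with those of simulated rows; that reading of the procedure is my assumption. The p-value here comes from a permutation test rather than the usual asymptotic KS formula, and the scores are random placeholders, not results from the post.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def permutation_p_value(a, b, n_perm=500, seed=0):
    """Share of label shuffles whose KS statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Placeholder scores: what a classifier with no signal might output.
score_rng = random.Random(42)
scores_original = [score_rng.random() for _ in range(50)]
scores_simulated = [score_rng.random() for _ in range(50)]

print(ks_statistic(scores_original, scores_simulated))
print(permutation_p_value(scores_original, scores_simulated))
```

A large p-value means the score distributions of the two groups look alike, which is what an insignificant test indicates.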

While that is a good result and shows that a model can't distinguish between simulated and original data, visually we can see a difference between the two datasets shown below.

Original Iris Data

Simulated Iris Data Set

The simulated dataset shows sharper boundaries between points that are not found in the original dataset. For example, the scatterplot of Sepal.Length against Sepal.Width shows a sharp boundary at a Sepal.Length of approximately 5 in the simulated data, but the transition is more gradual in the original data.

Conclusion:
This post discussed a method to simulate an arbitrary dataset. While I cannot build a model to distinguish the two datasets, visually they are distinguishable.

***Updated with code from Dmytro Perepolkin - Thanks!


Monday, January 11, 2016

Are Turbocharged Engines More Fuel Efficient?

Sunday, January 3, 2016

Sampling Arbitrary data
