Thursday, May 12, 2016

When the Predictions are more accurate than the Response

The methodology I propose is to use the original classifications to build a model and then use that model to re-classify the paragraphs. The simulation below suggests this method increases classification performance.
##Assume there are two topics. Each paragraph has data - in this case a random normal draw - and the higher the value, the higher the probability it belongs to a particular topic. This could be thought of as a word frequency

##create documents - each document consists of 100 paragraphs 
documents=lapply(1:100, function(x) rnorm(100))

##create topics for each paragraph - 100 documents with 100 paragraphs each
paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))

## a document consists of many paragraphs; the document is assigned its most common paragraph topic
document_classes=(sapply(paragraph_classes, function(x) sum(x>0) )>50)

unlisted_documents=unlist(documents)
unlisted_class=rep(document_classes, each=100)

##the original correct classification rate is around 54%
table(unlist(paragraph_classes),unlisted_class)
##    unlisted_class
##     FALSE TRUE
##   0  2643 2353
##   1  2257 2747
sum(diag(table(unlist(paragraph_classes),unlisted_class)))/length(unlisted_class)
## [1] 0.539
##build a model - each paragraph's response variable is the class originally assigned to its document
model=glm(unlisted_class~unlisted_documents, family="binomial")


##the predicted classification using the model on the data - the classification rate is roughly 14 percentage points higher than the original classification rate
predicted_values=scale(predict(model))
table(predicted_values>0,unlist(paragraph_classes)>.5)
##        
##         FALSE TRUE
##   FALSE  3386 1647
##   TRUE   1610 3357
sum(diag(table(predicted_values>0,unlist(paragraph_classes)>.5)))/length(unlisted_class)
## [1] 0.6743

Monday, January 11, 2016

Are Turbocharged Engines More Fuel Efficient?

Introduction:

My previous post discussed methods to uncover the effect of a particular explanatory variable on a response variable in machine learning models. These work by changing the variable of interest and measuring how much the output changes while keeping all other variables constant. However, the assumption that we can hold the other variables constant is almost surely incorrect. In most applications, changing one variable changes the distributions of the other covariates as well.

In this post I will show the effect of a change in values by simulating a dataset, keeping only the variables of interest fixed while allowing all other variables to change. I will do this by simulating data points using the method described here. I'll compare this method with existing techniques.

The topic I'm going to explore is whether turbocharged engines actually reduce fuel consumption. Over the past few years turbocharging has increasingly been used to reduce engine size while increasing power, improving fuel economy, and reducing emissions. It has been hailed as the solution to increasing power while decreasing fuel consumption. A common theme among proponents is that one can achieve V-6 power by turbocharging an I-4 engine while retaining I-4 fuel consumption. However, real-world driving has suggested that a turbocharged option doesn't necessarily reduce consumption; it merely gives you the option to reduce consumption if you don't use all the power.

Data:
The car data was scraped from the Car and Driver website and includes 416 car reviews with "observed mpg" along with 15 other variables, including weight, horsepower, torque, zero-to-60 time, etc.

I'll compare the observed fuel economy of a turbo I-4 and a naturally aspirated V-6 with the same 0-to-60 time (6.5 seconds). The two regimes are:

Turbo Regime: {Engine Breathing=Turbo, Engine_Type=I-4, Zero.60=6.5, Date=Most recent}
Naturally Aspirated Regime: {Engine Breathing=Naturally Aspirated, Engine_Type=V-6, Zero.60=6.5, Date=Most recent}

I picked these examples because downsizing from a V-6 to an I-4 is a very common choice, and a 0-60 time of 6.5 seconds is fairly standard for a mid-range family sedan.

My previous post discussed the methodology of holding everything else constant. In this case I predict each observation twice, once under each regime (turbo I-4 and naturally aspirated V-6), and compare the predictions:


f(mpg.observed | other covariates, Turbo Regime) - f(mpg.observed | other covariates, Naturally Aspirated Regime)

Where f() is approximated by a machine learning method, in this case a random forest. Below is a histogram of the results.
Using this method, the mean change in observed mpg is -0.028, and there is wide variation in the changes (1st quartile -0.077, 3rd quartile 0.025). These results suggest that turbocharging, by itself, does not have much of an effect on fuel consumption.
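
As a concrete illustration, below is a minimal, runnable sketch of this predict-twice-and-difference idea. It uses the built-in mtcars data as a stand-in for the scraped Car and Driver data, comparing 4- versus 6-cylinder engines at a fixed quarter-mile time rather than the exact turbo and naturally aspirated regimes above.

library(randomForest)

set.seed(1)
rf=randomForest(mpg~., data=mtcars)

##score every car twice - once under each "regime" - holding all other columns at their observed values
regime_a=transform(mtcars, cyl=4, qsec=17)
regime_b=transform(mtcars, cyl=6, qsec=17)

##per-observation change in predicted mpg between the two regimes
mpg_change=predict(rf, regime_a)-predict(rf, regime_b)
summary(mpg_change)
hist(mpg_change)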

However, it's clear that changing the engine from an I-4 turbo to a naturally aspirated V-6 changes other variables as well. For instance, I-4 engines are lighter than V-6 engines (two fewer cylinders, after all), so it's possible that moving between the Turbo and Naturally Aspirated regimes changes fuel consumption through other channels.

Using the method from this post, we can simulate the distributions:

P(mpg.observed, other covariates | Turbo Regime) and P(mpg.observed, other covariates | Naturally Aspirated Regime). We can then see the differences between the distributions with these simulations.

Below is a pairs plot of several variables under each of the different regimes.




The mean observed mpg under the I-4 Turbo regime is 25.57 versus 24.29 under the V-6 regime, indicating that I-4 turbos are, on average, 1.28 mpg more fuel efficient than similarly performing V-6s. It appears this is due, at least partially, to the weight decrease that comes with the turbo engine. That is, turbo engines are associated with lighter cars (by about 150 pounds on average), and that reduces fuel consumption.

Conclusion:
This post compared an existing method of predicting changes with one that simulates the joint distribution under different conditions. The resulting distribution allows the other covariates to change by an expected amount given a change in the variables of interest. The existing method found essentially no change in average fuel consumption because the gain in fuel efficiency isn't directly caused by turbocharging the engine but comes through other channels, like the decrease in weight.


Code and Data

Sunday, January 3, 2016

Sampling Arbitrary data

Introduction:
Simulating data usually requires specifying a variance-covariance matrix, which restricts the relationships between variables to be linear. However, a linear assumption can miss important non-linear relationships. This post uses quantile regression forests to simulate an arbitrary dataset.

MCMC Sampling:
MCMC sampling can simulate from any multivariate density as long as one has the full conditionals. The full conditionals in this case are models of a particular variable given all of the other variables. To make this concrete, suppose we have a dataset with n variables (x1, x2, ..., xn). The estimated full conditionals are:

f(x1 | x2, x3, ..., xn)
f(x2 | x1, x3, ..., xn)
.
.
f(xn | x1, x2, ..., x(n-1))

Where f() is a machine learning algorithm of choice that can provide a distribution of values; in this case I use quantregForest() for continuous variables and randomForest() for categorical variables.

Once the models are built, the algorithm is as follows:
- Choose a random observation as the starting point
- For each of N iterations:
     - For each variable:
           - Sample a proposal value from the predicted conditional distribution and compare its likelihood with the current value. Accept the proposal with probability min(p(proposal)/p(current), 1)
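
To make the loop above concrete, here is a small, self-contained sketch of the sampler on the numeric columns of iris. It is only an illustration of the idea, not the linked code: the conditional density p() is approximated crudely from predicted quantiles (a Laplace-style density centred at the predicted median), and only continuous variables are handled.

library(quantregForest)

dat=iris[,1:4]
vars=names(dat)

##fit one full conditional per variable: f(x_j | all other variables)
models=lapply(vars, function(v) quantregForest(x=dat[,setdiff(vars,v),drop=FALSE], y=dat[[v]]))
names(models)=vars

##crude stand-in for the conditional density p(), built from predicted quantiles
approx_dens=function(model, x_other, value){
  qs=as.numeric(predict(model, x_other, what=c(.1,.5,.9)))
  spread=(qs[3]-qs[1])/2+1e-6
  exp(-abs(value-qs[2])/spread)/spread
}

n_iter=200
current=dat[sample(nrow(dat),1),]  ##random observation as starting point
sims=current[rep(1,n_iter),]

for(i in 1:n_iter){
  for(v in vars){
    x_other=current[,setdiff(vars,v),drop=FALSE]
    ##sample a proposal from the predicted conditional distribution (a random quantile)
    proposal=as.numeric(predict(models[[v]], x_other, what=runif(1)))
    ##accept the proposal with probability min(p(proposal)/p(current),1)
    if(runif(1)<min(approx_dens(models[[v]],x_other,proposal)/approx_dens(models[[v]],x_other,current[[v]]),1)) current[[v]]=proposal
  }
  sims[i,]=current
}
head(sims)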

Code can be found here.

Results:
A random forest model was built to test whether the simulated data could be distinguished from the original data. The resulting KS test on out-of-sample predictions was insignificant, so we cannot reject the null that the predicted probabilities for the original and simulated data come from the same distribution - that is, the model can't distinguish between them.
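
Continuing the toy sampler sketch above, one way to run this kind of check is to label the original and simulated rows, fit a random forest to tell them apart, and compare the out-of-sample predicted probabilities for the two groups with a KS test. This mirrors the idea described here rather than reproducing the linked code exactly.

library(randomForest)

combined=rbind(cbind(dat, source=0), cbind(sims, source=1))
combined$source=factor(combined$source)

##hold out half the rows so the check is out of sample
test=sample(nrow(combined), nrow(combined)/2)
rf_check=randomForest(source~., data=combined[-test,])
probs=predict(rf_check, combined[test,], type="prob")[,"1"]

##if the classifier cannot separate them, the predicted probabilities
##for the two groups should come from the same distribution
ks.test(probs[combined$source[test]=="0"], probs[combined$source[test]=="1"])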

While that is a good result and shows that a model can't distinguish between simulated and original data, visually we can see a difference between the two datasets, shown below.

Original Iris Data

 Simulated Iris Data Set

The simulated dataset shows sharper boundaries between points that are not found in the original dataset. For example, the scatterplot of Sepal.Length against Sepal.Width shows a sharp boundary at approximately Sepal.Length = 5 in the simulated data, while it is more gradual in the original data.


Conclusion:
This post discussed a method to simulate an arbitrary dataset. While I cannot build a model that distinguishes the two datasets, visually they are distinguishable.

***Updated with code from Dmytro Perepolkin - Thanks!

Sunday, December 6, 2015

Beyond Beta: Relationships between partialPlot(), ICEbox(), and predcomps()

Machine Learning models generally outperform standard regression models in terms of predictive performance. However, these models tend to be poor at explaining how they arrive at a particular result. This post will discuss three methods used to peek inside "black box" models and the connections between them.

Ultimately the methods discussed here adhere to a similar principle: change the input and see how the output changes.

Suppose we have a model y=f(u,v) and we're interested in the effect of u on the output, keeping v constant. These methods change u while keeping the other variables, v, constant. For a particular observation x_i, predict at every unique value of u while keeping the rest of x_i fixed:

for q in (unique values of u):
    fhat(q) = predict(q, v_i)

Now we have, for one observation, the predicted value of y over the set of unique values of u, keeping everything else constant.
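
Here is a minimal, self-contained sketch of that loop. The data and model here are stand-ins (a simulated two-variable logistic regression), not the example discussed below, but the mechanics are the same: replicate one observation over every unique value of the variable of interest and predict.

set.seed(1)
df=data.frame(Price=runif(200,50,150), Size=rnorm(200))
df$y=rbinom(200,1,plogis(-5+.04*df$Price+.5*df$Size))
fit=glm(y~Price+Size, family="binomial", data=df)

i=1  ##a single observation
u_values=sort(unique(df$Price))

##replicate observation i once per unique Price, holding Size at its observed value
grid=df[rep(i,length(u_values)),]
grid$Price=u_values
grid$fhat=predict(fit, newdata=grid, type="response")

plot(grid$Price, grid$fhat, type="l", xlab="Price", ylab="predicted probability for observation i")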

Below is a plot of these predicted values, changing u while keeping the other variables, v, constant. (The code was taken from David Chudzicki's Blog Post.) It is a logistic regression with two variables. (This example was chosen to demonstrate the connections between the functions described here; one generally wouldn't use these techniques with a logistic regression. They are more appropriate for complex ensemble or hierarchical models.) The plot shows the prediction if we keep one variable constant while the other changes. In this case the variable Price changes while everything else is held constant.


If we repeat this procedure for every single observation, we get what the ICEbox package does. Below are two plots: 1) ICE plots using the predcomps code calculations and ggplot, and 2) ICE plots using the ice() function:
 

The above plots show that the two approaches produce essentially the same curves.

partialPlot() in randomForest takes the mean prediction at each unique value; partialPlot() is basically an aggregation of ICEbox. In this case the bold, colored line in the ICEbox chart is equivalent to the partialPlot() result. There are many cases where interaction effects are masked by taking the average effect, so visualizing ICE curves is more informative than partialPlot().
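
Continuing the toy example above: averaging the ICE curves at each unique value of Price gives the partial dependence curve, i.e. the kind of aggregate that partialPlot() plots.

##one ICE curve per observation - each column is the prediction path for one row of df
ice=sapply(1:nrow(df), function(i){
  grid=df[rep(i,length(u_values)),]
  grid$Price=u_values
  predict(fit, newdata=grid, type="response")
})

##averaging the ICE curves at each unique Price gives the partial dependence curve
partial_dependence=rowMeans(ice)
plot(u_values, partial_dependence, type="l", xlab="Price", ylab="average prediction")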

The goal of predcomps() isn't necessarily to visualize how variables are related; it's to find something similar to a "beta" coefficient for complex models. That is, it takes the average of (change in prediction) / (change in the variable of interest, u). Following the example from the first plot, this means it's the average slope of all predicted, unique data points relative to the original data point. The resulting graph is shown below.

Above we see the original data point has a price of 104. predcomps calculates the weighted average slope with respect to this point (the weight is based on the Mahalanobis distance between the points), shown as the lines labeled factor 0. To obtain the overall predcomps "beta" it does this for all observations (or as many as selected by the user).
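
A rough sketch of this calculation for the toy example above is below. The weighting here (an inverse function of the Mahalanobis distance computed on the other covariates) illustrates the idea; it is not the exact weighting scheme the predcomps package uses.

##weighted average slope of the prediction with respect to Price, anchored at observation i
apc_for_obs=function(i){
  others=setdiff(1:nrow(df), i)
  d_price=df$Price[others]-df$Price[i]
  ##change in prediction when Price moves to another observation's value but Size stays at observation i's value
  d_pred=predict(fit, newdata=transform(df[others,], Size=df$Size[i]), type="response")-
         predict(fit, newdata=df[i,], type="response")
  ##closer observations (in the other covariates) get more weight
  w=1/(mahalanobis(df[others,"Size",drop=FALSE], center=df$Size[i], cov=var(df["Size"]))+1)
  sum(w*d_pred/d_price)/sum(w)
}

##averaging over all observations gives a single "beta"-like summary for Price
mean(sapply(1:nrow(df), apc_for_obs))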

This post started by predicting over the range of the variable of interest u for a single observation, creating a single-observation plot. Doing this for many observations gives us ICE plots, aggregating those together gives us a partial plot, and taking the (weighted) slope with respect to the original data point gives us the predcomps beta.

Code


Resources:
predcomps website
Average Predictive Comparisons Paper
Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation