Sunday, December 6, 2015

Beyond Beta: Relationships between partialPlot(), ICEbox(), and predcomps()

Machine learning models generally outperform standard regression models in terms of predictive performance. However, these models tend to be poor at explaining how they arrive at a particular result. This post discusses three methods used to peek inside "black box" models and the connections between them.

Ultimately, the methods discussed here adhere to a similar principle: change the input and see how the output changes.

Suppose we have a model y=f(u,v) and we're interested in the effect of u on the output of the function while keeping v constant. These methods change u while holding the other variables, v, fixed. For a particular observation x_i = (u_i, v_i), predict with every unique value of u while keeping v_i constant.

for q in (unique values in u):
    predict y = f(q, v_i)

Now we have a set of pairs (q, predicted y): the predicted value of y at each unique value of u, keeping everything else constant.
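The loop described above can be sketched in a few lines of Python. This is only an illustration: the model `f`, its coefficients, and the toy data are hypothetical stand-ins, not the model or data used in this post.

```python
import math

def f(u, v):
    """Hypothetical stand-in model: a logistic function of two inputs."""
    return 1.0 / (1.0 + math.exp(-(0.5 - 0.8 * u + 1.2 * v)))

# Toy data, made up for illustration: each row is (u, v).
data = [(1.0, 0.2), (2.0, 0.5), (3.0, 0.1), (2.0, 0.9)]

def single_curve(model, data, i):
    """Predict for observation i at every unique value of u, holding v_i fixed."""
    v_i = data[i][1]
    unique_u = sorted({u for u, _ in data})
    return [(q, model(q, v_i)) for q in unique_u]

curve = single_curve(f, data, 0)
for q, y_hat in curve:
    print(f"u={q:.1f}  predicted y={y_hat:.3f}")
```

Each pair in `curve` is one point on the single-observation plot: the x value is a candidate u, and the y value is the prediction with that observation's other variables left untouched.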

Below is a plot that shows what happens when we plot these predicted values, changing u while keeping the other variables, v, constant. (The code was taken from David Chudzicki's blog post.) The model is a logistic regression with two variables. (The example was chosen to demonstrate the connections between the functions described here. One generally wouldn't use these techniques with a logistic regression; they are more appropriate for complex ensemble or hierarchical models.) The plot shows what the prediction would be if we kept one variable constant while the other variable changed. In this case the variable Price changes while everything else is held constant.

If we repeat this procedure for every single observation, we can see what the ICEbox package does. Below are two plots: 1) ICE plots using the predcomps code calculations and ggplot, and 2) ICE plots using the ice() function:

The above plots show that the two functions have similar calculations.

partialPlot() in randomForest takes the mean of the predictions at each unique value. partialPlot() is basically an aggregated version of the ICEbox curves. In this case the bold, colored line in the ICEbox chart is the equivalent of the partialPlot() results. There are many cases where interaction effects are masked by taking the average effect, so visualizing ICE curves is more informative than partialPlot().
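To make the masking point concrete, here is a minimal Python sketch. The model and data are hypothetical and chosen so that the effect of u flips sign with v; it does not reproduce the internals of ice() or partialPlot(), only the idea that the partial dependence curve is the pointwise average of the individual curves.

```python
def model(u, v):
    """Hypothetical model with a pure interaction: the sign of v flips the effect of u."""
    return u * v

# Toy data, made up for illustration: each row is (u, v).
data = [(0.0, -1.0), (1.0, 1.0), (2.0, -1.0), (3.0, 1.0)]
grid = sorted({u for u, _ in data})

# One ICE curve per observation: vary u over the grid, keep that row's v.
ice = [[model(q, v) for q in grid] for _, v in data]

# Partial dependence: average the ICE curves at each grid value.
pd_curve = [sum(col) / len(col) for col in zip(*ice)]

for row in ice:
    print("ICE:", row)
print("PD: ", pd_curve)
```

Half of the individual curves rise with u and half fall, yet the averaged curve is flat, which is the kind of interaction an averaged plot alone would hide.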

The goal of predcomps() isn't necessarily to visualize how variables are related; it's to find something similar to a "beta" coefficient for complex models. That is, it takes the average of (change in prediction) / (change in the variable of interest, u). Following the example from the first plot, this means it's the average slope from the original data point to each of the predicted points at the other unique values of u. The resulting graph is shown below.

Above we see that the original data point has a price of 104. predcomps calculates the weighted average slope with respect to this point (lines shown as factor 0); the weights are based on the Mahalanobis distance between the points. To obtain the overall predcomps "beta", it does this for all observations (or as many as the user selects).
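A rough Python sketch of this weighted-slope idea follows. It uses the ratio form of the average predictive comparison (weighted sum of prediction changes divided by weighted sum of input changes). The weight function 1 / (1 + d^2), the one-dimensional scaled distance, the linear model, and the toy data are all assumptions made for illustration; predcomps itself may differ in its details.

```python
import statistics

def model(u, v):
    """Hypothetical linear model, so the true 'beta' for u is known to be 2."""
    return 2.0 * u + 3.0 * v

# Toy data, made up for illustration: each row is (u, v).
data = [(100.0, 1.0), (104.0, 2.0), (110.0, 1.5), (120.0, 3.0)]

def average_predictive_comparison(model, data):
    """Weighted average slope of the prediction with respect to u."""
    var_v = statistics.pvariance([v for _, v in data])
    num = den = 0.0
    for u_i, v_i in data:
        for u_j, v_j in data:
            if u_j == u_i:
                continue  # no change in u, so no slope to measure
            # Squared distance between the v's, scaled by the variance of v
            # (the one-dimensional analogue of a Mahalanobis distance).
            d2 = (v_i - v_j) ** 2 / var_v
            w = 1.0 / (1.0 + d2)  # assumed weight: closer v's count more
            sign = 1.0 if u_j > u_i else -1.0
            # Move u from u_i to u_j while holding v at v_i.
            num += w * (model(u_j, v_i) - model(u_i, v_i)) * sign
            den += w * (u_j - u_i) * sign
    return num / den

apc = average_predictive_comparison(model, data)
print(f"predcomps-style beta for u: {apc:.3f}")
```

Because the stand-in model is linear in u, the result recovers the coefficient on u, which is a handy sanity check before applying the same calculation to a model where no such coefficient exists.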

This post started by predicting n samples from one observation to create a single-observation plot over the range of interest of u. Repeating this over many observations gives us ICE plots. Aggregating the curves together gives us a partialPlot, and finding the (weighted) slope with respect to the original data gives us the predcomps beta.


predcomps website
Average Predictive Comparisons Paper
Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation

Wednesday, October 7, 2015

Slate Middle East Graph #2

I recently saw that Slate redid its Mideast graphic and thought it would be good to redo my previous analysis. From the original Slate graphic it's hard to make much sense of what's happening and who's friends with whom.

From the graph below we can see three (more or less) clearly defined groups. Nusra and ISIS are Jihadists and have no friends. There is a well-defined Sunni group of Syrian Rebels, Turkey, and Saudi Arabia and the Gulf States. Finally, there appears to be a Shia cluster that includes Syria, Iran, Hezbollah, Iraq, and Russia.

The Kurds appear to be in the Shia cluster above, but after removing ISIS and Nusra (since they have no allies), we can see that the Kurds and US are sort of outliers in this fight between Shia and Sunnis.

So there seem to be three defined groups (Jihadists, Shia, and Sunni) fighting it out, with the Kurds in the crossfire, America supporting elements of both Shia and Sunni, and Russia supporting only Shia elements.

Thursday, April 2, 2015

WW1 Monthly Casualties by Fronts and Belligerents

I've been reading a few books on WW1 and wanted to see a time series plot of battle casualties/POWs by country to get a better understanding of how the conflict fits together. I couldn't find any database for military casualties in WW1, but Wikipedia does include casualty statistics for each battle. I wrote some code to scrape and parse it, but the data was extremely messy and I had to go through each battle to obtain proper statistics. The dataset I've compiled is incomplete, as it only includes major battles.

Below is a time series plot of casualties for each front, colored by country. The left side of each front shows the Central Powers and the right side shows the Allied Powers.

Without going through the entire war, I'll just observe a few points:

  • Offensives were seasonal (no major offensives during winter). 
  • In 1915 the Western Front wasn't the main focus of the war. During this time the Allies launched the Gallipoli Campaign, the Italian Front opened, and the Great Retreat took place in Russia.
  • Russia experienced a lower battle exchange rate than England/France.
  • There were 2 dramatically lopsided battles
  • No major battles on the Eastern Front after mid-1917 (after the Russian Revolution, Russia made peace with Germany)
  • France had no major battles in 1917 (French Mutiny)
  • Large numbers of casualties were experienced up to the end of the conflict.
  • American casualties were very light compared with other countries, but America's impact was far greater than the numbers suggest. It was the inevitability of defeat following America's entry into the conflict that caused Germany to sue for peace, not defeat in the field itself.

Sunday, March 22, 2015

Western Front Battle Exchange Rates



The Western Front in World War One is considered a war of outdated tactics combined with the brutal efficiency of technology. And while the war saw the introduction of several innovations (e.g. tanks and airplanes), the general routine of heavy artillery bombardments followed by men "going over the top" to meet barbed wire and an entrenched enemy remained the defining (and very much repeated) way to fight in this war. Within this set of tactics, were there advancements made in terms of fighting capability throughout the war?

In this post I'll develop a hierarchical model to estimate the battle exchange rate at a particular point in time. A battle exchange rate is the ratio of casualty figures for a particular country and battle. There were of course many advancements in tactics throughout the war. Whether these had any effect is uncertain. I figure that if we can see changes in battle exchange rates over time, then we can conclude that changes in tactics had an effect.


The data includes all major battles of WW1 on the Western Front, taken from Wikipedia pages. Since the analysis does not include all battles, the statistics are incomplete.

Below are monthly casualty figures for the four major belligerents:

One can see the seasonality of fighting throughout the war. Major operations weren't planned for winter months.

The opening months of the war involved high casualties on both the French and German sides, when the front was still fluid. By the end of 1914 the front was established, and the next 3 years brought little change to it. At the beginning of 1918, fresh from victory on the Eastern Front, the Germans went for broke on a final offensive to defeat France and England before the US arrived en masse. It failed, and the Germans, bowing to the inevitable, started surrendering in large numbers in the wake of the Hundred Days Offensive the Allies launched in the latter half of 1918. An armistice went into effect in November 1918 and effectively ended military confrontations on the Western Front.

Below are cumulative casualties:

While the number of US casualties is a fraction of those of the other belligerents, it should be noted that US involvement was far more impactful than the casualty numbers suggest. The Germans didn't so much lose WW1 on the field of battle as surrender to the inevitability of America's far superior manpower and materiel (an action which was not repeated in WW2).


Each battle has a start date, an end date, and casualty figures for the respective combatants. Each belligerent j is assumed to have a battle exchange rate coefficient at time t, called beta(j,t). This follows a random walk: beta(j,t) ~ normal(beta(j,t-1), sigmabeta(j)). There is also a battle error term for each battle k and country j, which is added to beta(j,t). The battle error terms are constrained to sum to zero across all battles.
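As a rough illustration of the structure just described, the Python sketch below simulates one country's random walk and a set of sum-to-zero battle errors. It is not the fitted model: the starting value, the standard deviations, the number of periods, and the battle months are all made-up assumptions, and sigmabeta is treated here as a standard deviation.

```python
import random

def simulate_beta(n_periods, beta0, sigma_beta, rng):
    """Random walk: beta[t] ~ normal(beta[t-1], sigma_beta)."""
    beta = [beta0]
    for _ in range(n_periods - 1):
        beta.append(rng.gauss(beta[-1], sigma_beta))
    return beta

def sum_to_zero(raw):
    """Center raw battle errors so they sum to zero across battles."""
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical settings for a single country j.
beta = simulate_beta(n_periods=52, beta0=0.0, sigma_beta=0.1, rng=rng)

# Hypothetical battles, identified by the month index in which they occur.
battle_months = [3, 10, 20, 35]
battle_err = sum_to_zero([rng.gauss(0.0, 0.2) for _ in battle_months])

# Exchange rate for each battle: time component plus battle-specific error.
battle_rate = [beta[t] + e for t, e in zip(battle_months, battle_err)]
print(battle_rate)
```

Centering the raw errors is just one simple way to impose the sum-to-zero constraint in a simulation; a fitted model would typically build the constraint into the parameterization instead.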

Below is the output of simulated states of beta(j,t):

The Allies were consistently worse fighters than the Germans. This isn't surprising considering that 1) the Germans were defending more than attacking and 2) they chose the best ground to dig their trenches. Belgium's low casualty exchange rate is due to the surrender of the forts of Liege and Namur. Interestingly, England had a worse exchange rate than France. This is possibly because the English did more night raids than the French and expected more from their troops. America had the highest exchange rate, although this might be because the US was only present in the latter half of 1918, when the Germans started to surrender, and not necessarily because the Americans were better fighters than the English or French.

The most obvious feature of this graph is the relative decline of German fighting power from the end of 1915 to the middle of 1917. I attribute this to Germany relocating many of its troops elsewhere during this period (to the Eastern Front and to prop up Austria-Hungary's fighting ability in the Italian and Balkan theaters).

It's difficult to tell whether tactics really affected the outcome. The most this analysis has shown is the marked decrease in the fighting quality of the Germans relative to the Allies.

To be done:
The model assumes that the time error and battle error components are independent across countries. This is obviously not true. One can see that the time components of France and England are correlated. Likewise, if both England and France fight in the same battle, they should be similarly disadvantaged or advantaged. To capture these effects, correlated error terms would be necessary.