Thursday, April 2, 2015

WW1 Monthly Casualties by Fronts and Belligerents


I've been reading a few books on WW1 and wanted to see a time series plot of battle casualties/POWs by country to get a better understanding of how the conflict fits together. I couldn't find any database of WW1 military casualties, but Wikipedia does include casualty statistics for each battle. I wrote some code to scrape and parse those pages, but the data was extremely messy and I had to go through each battle by hand to obtain proper statistics. The dataset I've compiled is incomplete, as it only includes major battles.
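For anyone curious about the scraping step, here is a minimal sketch of the approach (not my exact code): it pulls the infobox from a single battle's Wikipedia page with rvest. The URL and the final cleanup are illustrative assumptions.

library(rvest)

# Minimal sketch: grab the infobox table from one battle's Wikipedia page.
# The URL below is just an example; every page still needs manual cleanup.
url  <- "https://en.wikipedia.org/wiki/Battle_of_Verdun"
page <- read_html(url)

# Wikipedia battle infoboxes are rendered as a table with class "infobox"
infobox <- html_table(html_node(page, "table.infobox"), fill = TRUE)

# The casualty figures sit in the rows below the "Casualties and losses"
# header; in practice the formatting varies wildly between battles,
# which is why each battle had to be checked by hand.
print(infobox)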

Below is a time series plot of casualties for each front, colored by country. The left side of each front shows the Central Powers and the right side the Allied Powers.






Without going through the entire war, I'll just observe a few points:

  • Offensives were seasonal (no major offensives during winter).
  • The Western Front wasn't the main focus in 1915. During this time the Allies launched the Gallipoli Campaign, the Italian Front opened, and Russia carried out the Great Retreat.
  • Russia experienced a lower battle exchange rate than England or France.
  • There were two dramatically lopsided battles.
  • There were no major battles on the Eastern Front after mid-1917 (after the Russian Revolution, Russia made peace with Germany).
  • France fought no major battles in 1917 (the French mutinies).
  • Large numbers of casualties were sustained right up to the end of the conflict.
  • American casualties were very light compared with other countries, but America's impact was far greater than the numbers suggest. It was the inevitability of defeat once America entered the conflict that caused Germany to sue for peace, not defeat in the field itself.






Sunday, March 22, 2015

Western Front Battle Exchange Rates

Introduction:

Github

The Western Front in World War One is considered a war of outdated tactics combined with the brutal efficiency of technology. And while the war saw the introduction of several innovations (e.g. tanks and airplanes), the general routine of heavy artillery bombardments followed by men "going over the top" to meet barbed wire and an entrenched enemy remained the defining (and much repeated) way to fight. Within this set of tactics, were there advancements in fighting capability throughout the war?


In this post I'll develop a hierarchical model to estimate the battle exchange rate at a particular point in time. A battle exchange rate is the ratio of casualty statistics between opposing sides for a particular country and battle. There were of course many advancements in tactics throughout the war; whether these had any effect is unclear. I figure that if we can see changes in battle exchange rates over time, then we can conclude that changes in tactics had an effect.


Data:

The data includes all major WW1 battles on the Western Front, taken from Wikipedia pages. Since the analysis does not include every battle, the statistics are incomplete.

Below are monthly casualty figures for the four major belligerents:


One can see the seasonality of the fighting throughout the war. Major operations weren't planned for the winter months.

The opening months of the war involved high casualties on both the French and German sides while the front was still fluid. By the end of 1914 the front was established, and the next three years brought little change to it. In the beginning of 1918, fresh from victory on the Eastern Front, the Germans went for broke with a final offensive to defeat France and England before the US arrived en masse. It failed, and the Germans, bowing to the inevitable, started surrendering in large numbers in the wake of the Hundred Days Offensive the Allies launched in the latter half of 1918. An armistice went into effect in November 1918 and effectively ended military confrontation on the Western Front.


Below are cumulative casualties:

While the number of US casualties is a fraction of those of the other belligerents, US involvement was far more impactful than the casualty numbers suggest. The Germans didn't so much lose WW1 on the field of battle as surrender to the inevitability of America's far superior men and materiel (an action which was not repeated in WW2).

Analysis:

Each battle has a start date, an end date, and casualty figures for the respective combatants. There is assumed to be a battle exchange rate coefficient for belligerent j at time t, called beta(j,t). This follows a random walk: beta(j,t) ~ normal(beta(j,t-1), sigma_beta(j)). There is also a battle-specific error term for each battle k and country j, which is added to beta(j,t). These battle errors are constrained to sum to zero across all battles.
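A minimal sketch of this kind of state-space model in Stan (assumed structure, not my exact code; for simplicity the battle error here is shared across countries within a battle rather than being country-specific):

library(rstan)

model_code <- "
data {
  int<lower=1> T;                      // number of time periods
  int<lower=1> J;                      // number of belligerents
  int<lower=1> K;                      // number of battles
  int<lower=1> N;                      // battle-country observations
  int<lower=1,upper=T> t_idx[N];
  int<lower=1,upper=J> j_idx[N];
  int<lower=1,upper=K> k_idx[N];
  vector[N] y;                         // observed log exchange rate
}
parameters {
  matrix[J, T] beta;                   // exchange-rate coefficient over time
  vector<lower=0>[J] sigma_beta;       // random-walk scale per belligerent
  vector[K-1] delta_free;              // battle errors (last one fixed by constraint)
  real<lower=0> sigma_y;
}
transformed parameters {
  // sum-to-zero constraint across battles
  vector[K] delta = append_row(delta_free, -sum(delta_free));
}
model {
  for (j in 1:J) {
    beta[j, 1] ~ normal(0, 1);
    for (t in 2:T)
      beta[j, t] ~ normal(beta[j, t - 1], sigma_beta[j]);  // random walk
  }
  delta_free ~ normal(0, 1);
  sigma_beta ~ cauchy(0, 1);
  sigma_y ~ cauchy(0, 1);
  for (n in 1:N)
    y[n] ~ normal(beta[j_idx[n], t_idx[n]] + delta[k_idx[n]], sigma_y);
}
"
# fit <- stan(model_code = model_code, data = stan_data)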

Below is the output of simulated states of beta(j,t):






The Allies were consistently worse fighters than the Germans. This isn't surprising considering that: 1) the Germans were defending more than attacking, and 2) they chose the best ground to dig their trenches. Belgium's low exchange rate is due to the surrender of the forts of Liège and Namur. Interestingly, England had a worse exchange rate than France. This is possibly because the English carried out more night raids than the French and expected more from their troops. America had the highest exchange rate, although this might be because the US only fought in the latter half of 1918, when the Germans had started to surrender, not necessarily because its troops were better fighters than the English or French.

The most obvious feature of this graph is the relative decline of German fighting power from the end of 1915 to the middle of 1917. I attribute this to Germany relocating many of its troops elsewhere during this period (to the Eastern Front, and to propping up Austria-Hungary's fighting ability in the Italian and Balkan theaters).

Conclusion:
It's difficult to tell whether tactics really affected the outcome. The most this analysis has shown is the marked decrease in the fighting quality of the Germans relative to the Allies.

To be done:
The model assumes independent time-error and battle-error components among the countries. This is obviously not true. One can see that the time components of France and England are correlated. Likewise, if England and France fight in the same battle, they should be similarly disadvantaged or advantaged. In order to capture these effects, correlated error terms would be necessary.


Tuesday, October 7, 2014

Predicting Monthly Car Sales: The Residuals are the Story

I'll produce predictions for US car sales by manufacturer every month. There are already several blogs that do a great job describing the industry and its sales; Autoblog by the Numbers and Counting Cars are two worth mentioning.

Unlike their analyses, I'll try to focus on the residuals (the stuff I can't predict) to tell the story. To highlight the difference, I think it's instructive to look at what Autoblog mentioned.

The Autoblog article (link above) highlights Mitsubishi for increasing sales. However, my prediction for Mitsubishi sales was almost exactly what the sales turned out to be; in essence, given this model, we didn't learn much. On the other hand, Land Rover and Jaguar had the largest residuals (in percentage terms), and Land Rover and Acura had the largest deviance (residual / variance). I think these results are more telling because we didn't predict them correctly; something might have changed.
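As a rough sketch of how I read the two columns in the table below (the brands, numbers, and predictive variances here are made up, and my actual definition may differ in detail):

# Hypothetical predicted/actual monthly sales and an assumed predictive variance
predicted <- c(BrandA = 500, BrandB = 230, BrandC = 160)
actual    <- c(BrandA = 575, BrandB = 231, BrandC = 130)
pred_var  <- c(BrandA = 0.10, BrandB = 0.08, BrandC = 0.12)  # variance on the log scale (assumed)

log_ratio <- log(predicted / actual)      # residual in (log) percentage terms
deviance  <- abs(log_ratio) / pred_var    # residual scaled by its variance

round(cbind(log_ratio, deviance), 3)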

I'll publish this on a new blog: datAutomotive.

Here is a Shiny App For Car Sales, and below are graphs / tables of my current analysis and future predictions.







Brand | Predicted Values 9/14 | Actual Values 9/14 | log(Predicted/Actual) | Deviance
Acura | 488.036 | 576.333 | 0.166 | 1.731
Audi | 632.252 | 621.542 | -0.017 | 0.326
BMW | 1,061.329 | 1,066.083 | 0.004 | 0.032
Buick | 649.648 | 727.750 | 0.114 | 0.571
Cadillac | 568.021 | 576.208 | 0.014 | 0.089
Chevrolet | 6,284.594 | 6,411.375 | 0.020 | 0.135
Chrysler | 1,058.392 | 1,199.208 | 0.125 | 0.710
Dodge | 1,892.465 | 1,834.167 | -0.031 | 0.172
Ford | 7,374.669 | 7,177.542 | -0.027 | 0.285
GMC | 1,602.260 | 1,594.542 | -0.005 | 0.029
Honda | 4,434.780 | 4,349.625 | -0.019 | 0.174
Hyundai | 2,291.306 | 2,333.750 | 0.018 | 0.271
Infiniti | 315.217 | 326.542 | 0.035 | 0.234
Jaguar | 37.834 | 47.583 | 0.229 | 0.908
Jeep | 2,491.378 | 2,301.292 | -0.079 | 0.853
Kia | 1,853.734 | 1,692.833 | -0.091 | 1.115
Land.Rover | 159.975 | 129.417 | -0.212 | 1.816
Lexus | 1,011.967 | 910.500 | -0.106 | 1.189
Lincoln | 285.903 | 302.375 | 0.056 | 0.394
Mazda | 1,074.272 | 999.167 | -0.072 | 0.743
Mercedes.Benz | 1,186.220 | 1,230.125 | 0.036 | 0.364
Mini | 196.975 | 175.792 | -0.114 | 0.626
Mitsubishi | 231.582 | 231.583 | 0.00000 | 0.00002
Nissan | 4,176.995 | 3,963.250 | -0.053 | 0.516
Porsche | 156.496 | 150.292 | -0.040 | 0.408
Subaru | 1,727.597 | 1,729.875 | 0.001 | 0.026
Toyota | 6,928.472 | 6,059.458 | -0.134 | 1.309
Volkswagen | 1,123.412 | 1,083.167 | -0.036 | 0.407
Volvo | 161.531 | 194.458 | 0.186 | 1.070




Brand | Predicted Values for 10/14
Acura | 521.878
Audi | 590.125
BMW | 1,051.492
Buick | 666.165
Cadillac | 533.468
Chevrolet | 5,693.145
Chrysler | 1,097.288
Dodge | 1,493.426
Ford | 6,694.012
GMC | 1,572.525
Honda | 4,049.582
Hyundai | 1,998.896
Infiniti | 262.475
Jaguar | 40.422
Jeep | 2,132.652
Kia | 1,650.236
Land.Rover | 160.884
Lexus | 944.946
Lincoln | 272.964
Mazda | 894.234
Mercedes.Benz | 1,184.991
Mini | 183.743
Mitsubishi | 193.982
Nissan | 3,629.955
Porsche | 158.540
Subaru | 1,694.661
Toyota | 5,895.316
Volkswagen | 923.242
Volvo | 153.691

Wednesday, August 6, 2014

Predicting Monthly Car Sales for Brands in US: First Step

I've set out to produce monthly forecasts of car sales by brand in the US. So far I've built a SUTSE dynamic linear model (code on Github) and created a Shiny app (http://sweiss.shinyapps.io/carvis/) as a prototype (no predictions yet).
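For a flavor of the dynamic linear model machinery, here's a minimal univariate local-level sketch with the dlm package (the actual model is a multivariate SUTSE that pools information across brands; the sales series here is simulated, purely for illustration):

library(dlm)

# 'sales' is a hypothetical monthly sales series for a single brand
sales <- ts(rpois(36, lambda = 2000), frequency = 12)

# Local-level model: observation variance dV, state (level) variance dW
build_local_level <- function(theta) {
  dlmModPoly(order = 1, dV = exp(theta[1]), dW = exp(theta[2]))
}

fit  <- dlmMLE(sales, parm = c(0, 0), build = build_local_level)
mod  <- build_local_level(fit$par)
filt <- dlmFilter(sales, mod)

# One-step-ahead forecast for next month
dlmForecast(filt, nAhead = 1)$f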

Basically, I want to combine my interest in cars with my interest in stats.

My current roadblock is how slowly the code runs, but I intend to have a speedy implementation using RStan by next month.

If anyone has tips or suggestions, please comment!






Predictions for September:
Brand | Predicted Sales
Acura | 13764.75
Audi | 15929.24
BMW | 26445.80
Buick | 23725.69
Cadillac | 16722.01
Chevrolet | 204636.69
Chrysler | 26336.27
Dodge | 51082.20
Fiat | 4408.64
Ford | 228880.56
GMC | 50361.17
Honda | 143487.90
Hyundai | 67112.10
Infiniti | 10366.93
Jaguar | 1162.72
Jeep | 60577.05
Kia | 53863.38
Land.Rover | 4736.08
Lexus | 30614.06
Lincoln | 7766.23
Mazda | 29544.19
Mercedes.Benz | 28693.43
Mini | 5455.48
Mitsubishi | 6706.69
Nissan | 125118.79
Porsche | 3914.65
Ram | 39194.54
Smart | 1258.65
Subaru | 46186.21
Toyota | 213761.40
Volkswagen | 31581.44
Volvo | 4809.14






Wednesday, July 23, 2014

Mideast Graph 3: Slate Middle East Friendship

Slate recently published a great infographic about Middle East relationships. It shows the relationships (Friend, Enemy, or Complicated) between each pair of 13 countries / organizations. One drawback of the chart is that it doesn't show whether countries / organizations cluster with one another. This post will attempt to rectify that drawback.

As in a previous post, I used an eigendecomposition of the Laplacian matrix and plotted the two eigenvectors with the smallest eigenvalues against each other. The graphs below follow the same methodology and you can find the code here. To weight the network matrix, I assigned values of 1, -1, and 0 for Friend, Enemy, and Complicated.
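Here is a minimal sketch of that embedding step on a tiny made-up relationship matrix (the real analysis uses the full 13-node Slate matrix, and the signed degree/Laplacian construction below is one standard choice, not necessarily the exact one in my code):

# Toy signed adjacency matrix: 1 = Friend, -1 = Enemy, 0 = Complicated
nodes <- c("A", "B", "C", "D")
W <- matrix(c( 0,  1, -1,  0,
               1,  0, -1, -1,
              -1, -1,  0,  1,
               0, -1,  1,  0), nrow = 4, byrow = TRUE,
            dimnames = list(nodes, nodes))

# Signed Laplacian: degree uses absolute weights so it stays positive
D <- diag(rowSums(abs(W)))
L <- D - W

# Eigenvectors associated with the two smallest eigenvalues give a 2-D layout
e <- eigen(L)
coords <- e$vectors[, (ncol(W) - 1):ncol(W)]   # eigen() sorts eigenvalues in decreasing order

plot(coords, type = "n", xlab = "eigenvector 1", ylab = "eigenvector 2")
text(coords, labels = nodes)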

Below is the resulting graph, with red lines indicating Enemy and blue lines indicating Friend.




The above graph shows three possible clusters. Turkey, the Palestinian Authority, and Hamas are close to each other in one corner. In the top right are the United States, Egypt, and Israel. Iraq, Syria, Hezbollah, and Iran form the third group.

Since both ISIS and Al-Qaida have no Friend relationships, and the goal was to see whether there are clusters of friends / enemies, I removed these organizations and redid the analysis. Below is the result.


Here we can see a much clearer distinction between relationship clusters. The United States, Israel, and Egypt form one, with Saudi Arabia possibly a 'member' of that group. Iraq, Syria, Hezbollah, and Iran form a well-defined cluster, which makes sense as they are all dominated by Shias. Finally, there is Turkey, the Palestinian Authority, and Hamas. I was surprised to see Turkey so close to the Palestinian Authority and Hamas, but given that Turkey has been a supporter of the Palestinians now and in the past, maybe I shouldn't be. It is also surprising that no Arab countries are close to the Palestinians.

The Mideast conflict is something of a three-way standoff, with the Shias in one corner, the Sunnis friendly to the US / Israel in another, and Turkey, trying to gain influence in a region it once controlled, in the final corner.


Github

Thursday, May 22, 2014

didYouMean() Function: Using Google to correct errors in Strings

A function that takes a string as input and returns the "Did you mean..." or "Showing results for..." suggestion from google.com. Good for misspelled names or locations.


library(RCurl)
## If on Windows you might need:
## options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
didYouMean=function(input){
  # Build the Google search URL (spaces become "+")
  input=gsub(" ", "+", input)
  doc=getURL(paste("https://www.google.com/search?q=",input,"/", sep=""))

  # Locate the "Did you mean" / "Showing results for" blocks in the raw HTML
  dym=gregexpr(pattern='Did you mean',doc)
  srf=gregexpr(pattern='Showing results for',doc)

  if(length(dym[[1]])>1){
    # Extract the suggested query from the link following "Did you mean"
    doc2=substring(doc,dym[[1]][1],dym[[1]][1]+1000)
    s1=gregexpr("?q=",doc2)
    s2=gregexpr("/&",doc2)
    new.text=substring(doc2,s1[[1]][1]+2,s2[[1]][1]-1)
    return(gsub("[+]"," ",new.text))
  }
  else if(srf[[1]][1]!=-1){
    # Same extraction, but from the "Showing results for" block
    doc2=substring(doc,srf[[1]][1],srf[[1]][1]+1000)
    s1=gregexpr("?q=",doc2)
    s2=gregexpr("/&",doc2)
    new.text=substring(doc2,s1[[1]][1]+2,s2[[1]][1]-1)
    return(gsub("[+]"," ",new.text))
  }
  # No suggestion found: return the original input unchanged
  else return(gsub("[+]"," ",input))
}

So didYouMean("gorecge washington") returns "george washington"


It works well with misspelled company names, nouns, or phrases. For example, suppose you're doing text analysis on Twitter and a customer raves about Carlsberg beer. The only problem is that he's enjoying their product while tweeting (something that happens only rarely, I'm sure) and wrote "clarsburg gprou". Not to worry!

> didYouMean("clarsburg gprou")
[1] "carlsberg group"

Or suppose you have a 3 phase plan for profits. This can help you get there!

didYouMean("clletc nuderpants")
[1] "collect underpants"

Saturday, May 17, 2014

Modelling This Time is Different: Corrected

New PDF
New Code


I made two errors in my previous post.

The first is that I put probability inside the utility function. Generally this is a no-no: expected utility should be E[utility] = sum over outcomes i of P(outcome i) * U(outcome i), with probabilities entering only through the expectation. I therefore changed it to a simpler maximization problem (no Lagrange multipliers necessary) where the individual maximizes E[Profits].

The second problem has to do with maximizing under uncertainty about the parameters. Since I sampled from the joint posterior distribution of the unknown parameters, I had a number of draws from that distribution. What I did in the previous analysis was maximize over the decision for each simulated draw individually, and then average these maximized results to get what I thought was the optimal value. In general this does not give the optimal decision; I should have maximized the expectation over all draws simultaneously. Basically, I computed E[max over s of f(s,p)] instead of max over s of E[f(s,p)].
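A tiny illustration of the difference, with a made-up profit function and made-up posterior draws, purely to show that the two quantities differ:

set.seed(1)

# Hypothetical posterior draws of an unknown parameter p
p_draws <- rnorm(1000, mean = 0.5, sd = 0.3)

# Hypothetical profit as a function of a decision s and parameter p
f <- function(s, p) -(s - p)^2 + p

s_grid <- seq(-1, 2, by = 0.01)

# Wrong: maximize for each draw, then average the maxima
wrong <- mean(sapply(p_draws, function(p) max(f(s_grid, p))))

# Right: average the objective over draws first, then maximize over s
right <- max(sapply(s_grid, function(s) mean(f(s, p_draws))))

c(E_max = wrong, max_E = right)   # E[max_s f] is larger than max_s E[f]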

In general, the results are superficially similar to my original analysis. Even if the results are largely the same, it's best to describe my mistakes upfront and avoid awkward questions later.