Saturday, June 27, 2020

The mr_uplift package in R: A Practitioner's Guide to Trade-Offs in Uplift Models

Wednesday, May 6, 2020

COVID Death Rates: Is the Data Correct?

Getting correct data on COVID-19 cases is important for up-to-date information on how the disease is progressing. It's also necessary to create models and make accurate forecasts.

However, I'm starting to think most of the charts we see for COVID cases over time are incorrect. The reason is that these datasets of aggregate statistics by state are built by scraping the changes in 'headline' numbers - the total count of COVID cases reported each day. Scraping this way only records when new deaths were reported, not when they actually happened.
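To make the distinction concrete, here is a toy illustration (made-up numbers, not real Colorado data) of how differencing a scraped cumulative total assigns backfilled deaths to the report date:

```
# Toy example: deaths tallied by the date they occurred vs. the daily
# changes in a scraped cumulative 'headline' total that gets backfilled
actual_deaths  <- c(5, 6, 7, 8, 9)      # deaths by date of death
headline_total <- c(5, 11, 11, 11, 35)  # cumulative total as scraped each day
scraped_series <- diff(c(0, headline_total))

scraped_series  # 5 6 0 0 24 -- a spurious spike on the day of the backfill
```

Both series sum to 35 deaths, but the scraped one shows a one-day spike that never happened.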

For instance, take a look at deaths by COVID for Colorado in the NY Times infographic here. One can see a dramatic spike of over 100 deaths in CO on April 24. However, this spike is due to 'backfilling' - reporting on previously unreported cases all at once. I believe this methodology permeates other data sources, including the automatic graph on Google, IHME, and covidtracking.

The official data can be found on Colorado's covid19 website and looks like a smooth, somewhat symmetric curve. Below is a chart comparing the two datasets.


This appears to be a known issue, but these sources still seem comfortable publishing the incorrect charts. This can lead to erroneous conclusions both for people reading the charts and for anyone trying to do an analysis with them.

It looks like the CDC is modeling on similar data: they use the variable "Daily change in cumulative COVID-19 death counts" rather than "Daily Deaths" (see slide 4). This may also be one reason that forecasting models are not doing so well.

Thursday, March 12, 2020

Can Trade with China Predict COVID-19 Cases? Part 2


In my last post I discussed whether exports from China can explain current COVID-19 rates. There were some criticisms of the analysis.

The first was that I assumed some sort of causal relationship. I probably shouldn't have asked whether trade can 'explain' COVID-19 rates; what I meant was whether trade can 'predict' the differences in COVID-19 rates. The mechanism itself seems obvious - increased contact should increase the probability of spreading disease. Of course the dollar amount of trade is an imperfect measure - I'd expect $1 million of food exports to have more of a chance of carrying disease than $1 million of phone exports - but it seems like a decent proxy for 'connectivity'.

The biggest critique was that I didn't take into account population sizes. I thought that was a fair critique, so I ran the analysis again and reproduced charts similar to those in the previous post. Feel free to comment.


Below is a histogram of log per capita COVID-19 cases. The distribution looks far more symmetric once population sizes are taken into account.






Below is a scatterplot of log(COVID-19 cases / population) ~ log(Chinese exports) for each country. It looks like a fairly strong relationship. There are several small-country outliers that reduce the r^2 of the regression to ~.24, though this is still an increase in r^2 relative to the model that does not account for population.
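For reference, a minimal sketch of the histogram and this regression is below; the data frame and column names (cases, population, china_exports) are hypothetical stand-ins for the joined case and trade data, not the exact code used.

```
# Hypothetical columns: cases, population, china_exports per country
df <- na.omit(data.frame(
  log_cases_pc = log(covid$cases / covid$population),
  log_exports  = log(covid$china_exports)
))

hist(df$log_cases_pc)  # the per-capita histogram above

fit <- lm(log_cases_pc ~ log_exports, data = df)
summary(fit)$r.squared  # ~.24 in this post

plot(df$log_exports, df$log_cases_pc)  # the scatterplot
abline(fit)
```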




I ran a similar random forest model using all sectors of trade. Below is the predicted vs. actual plot.


The relationship now has an r^2 of ~.55, an increase over both the previously posted model and the univariate regression above. Iran is still the largest outlier; Bahrain and Iceland are now outliers as well.

Below is an updated importance plot. Nickel and sunflower seeds are still highly predictive of COVID-19 cases. I'm sure many of these correlations (not causations!) are geographic in nature, though some exports, like 'articles of gut', may be more likely to carry disease. The best way to treat the graph below is as a rough exploratory device, not a basis for grand conclusions.
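A sketch of the all-sector random forest is below, assuming a wide data frame trade_wide with log per-capita cases plus one column per export category (the names are mine, not the actual code):

```
library(randomForest)

# One row per country: log per-capita cases + export category columns
rf <- randomForest(log_cases_pc ~ ., data = trade_wide, ntree = 500)

# Out-of-bag predicted vs. actual (predict() with no newdata returns
# OOB predictions for randomForest)
plot(predict(rf), trade_wide$log_cases_pc,
     xlab = "OOB predicted", ylab = "actual")
abline(0, 1)

varImpPlot(rf, n.var = 20)  # the importance plot
```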


The conclusions don't dramatically change from the previous analysis. If anything, the results have gotten stronger.

github

Tuesday, March 10, 2020

Can Trade Explain COVID-19 Cases?


I've updated some results. Find them here.

TLDR: I've found an association between the number of people who tested positive for COVID-19 in a country and that country's imports from China. In addition, certain industries are particularly correlated with COVID-19 rates. Iran is still an outlier even taking this information into account.

Intro:

Coronavirus (COVID-19) rates of infection outside of China have not been distributed uniformly across countries. As of 3/9/2020, Italy (9,172), South Korea (7,578), and Iran (7,161) have a disproportionate number of cases relative to other countries such as the US (605). Below is a histogram showing the distribution of cases across countries; it's clearly right-skewed and non-uniform.

What causes such an unequal distribution of COVID-19 cases among countries? I assume some degree of 'connectivity' between countries would be correlated with these rates. In particular, I assume an increase in imports from China will lead to more cases in the importing country.

To investigate this I looked at data from https://oec.world/en/resources/data/, which includes bilateral trade data between countries broken down into hundreds of categories of products. I'll look at a few things: 1) are the outliers of Iran, Italy, and South Korea explained by trade with China? 2) are some categories of traded products more correlated with rates of infection? and 3) if so, which products?


1) Are Chinese Exports Correlated with COVID-19 Rates? 

Below is a plot of COVID-19 rates by total Chinese exports to each country in 2017 (the latest data available). One can see a fairly strong association between the two variables. However, there are a few outliers: notably, South Korea, Italy, and Iran are still above what the trend predicts, along with smaller countries and territories like San Marino.

Running a random forest of log(COVID-19 cases) ~ log(total Chinese exports) yields an out-of-bag (OOB) r^2 of .12. So while the two variables appear highly correlated, there are enough outliers to decrease the 'explained' variance in infection rates.
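A minimal sketch of this univariate model, again with hypothetical column names:

```
library(randomForest)

rf_uni <- randomForest(log_cases ~ log_exports, data = df, ntree = 500)

# OOB r^2, equivalent to the '% Var explained' that randomForest prints
1 - mean((predict(rf_uni) - df$log_cases)^2) / var(df$log_cases)  # ~.12
```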




2) Are there product categories among Chinese Exports that are more correlated with COVID-19? 

To test this I ran a random forest of log(COVID-19 cases) on all export amounts broken out by category. This is a very wide dataset, with 97 countries (observations) and over 1,200 categories (explanatory variables). Even with this wide dataset the model produced an OOB r^2 of .49, meaning it captured much of the variance missed by the previous univariate model.

Below is a scatterplot of COVID-19 rates against these OOB random forest predictions. San Marino no longer looks like a large outlier, but Italy and South Korea still do, and Iran looks like a much larger outlier.


3) Which product categories are more correlated with COVID-19? 
To answer this question I set up a model of the same form as the random forest in 2) but used a lasso regression to perform variable selection. Below are the scaled coefficients kept by the lasso.
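A sketch of the lasso step, assuming X is the country-by-category export matrix and y is log COVID-19 cases (not the exact code used):

```
library(glmnet)

# alpha = 1 gives the lasso; scaling X makes coefficients comparable
cv_fit <- cv.glmnet(x = scale(X), y = y, alpha = 1)

# Coefficients kept (nonzero) at the cross-validated lambda
coefs <- coef(cv_fit, s = "lambda.min")
coefs[as.numeric(coefs) != 0, , drop = FALSE]
```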



One can see that Chinese nickel powder and sunflower seed exports are among the most correlated with COVID-19 rates in the importing country.

I looked at these and couldn't really construct a story behind them. For instance, the 7th most important feature had to do with swords and bayonets. I have no idea what that means :).

Conclusion:
It appears trade is highly correlated with COVID-19 rates. In addition, the evidence suggests some industries are more correlated with COVID-19 rates than others (given the increase in r^2 between 1) and 2)).

However, it's still not clear why countries like Iran have such high rates. Perhaps the data is stale since it's from 2017, or maybe Iran-China trade is underreported in this dataset due to US sanctions.



Wednesday, January 1, 2020

What Were the IRA's Facebook Objectives in the 2016 Election?


The Internet Research Agency (IRA), funded by friends of Russian intelligence, used social media to try to influence the 2016 US election, and did so in an elaborate and systematic fashion. While the number of purchased ads and the money spent on Facebook were small, significant resources were devoted to the endeavor as a whole.

Its overall objectives were to explicitly help the more non-traditional candidates in the 2016 presidential campaign (p. 23 of the Mueller Report) and, more broadly, to sow distrust in the American populace towards its institutions (p. 4 of the Mueller Report). It used a variety of tools to do so, such as bots, fake Twitter accounts, and advertising on Facebook.

However, it's unclear what the IRA optimized when they made these Facebook advertisements. Part of the issue is the difference between the outputs and the outcomes of these campaigns (terminology taken from Hostile Social Manipulation from RAND). Outputs are the observable metrics that can be tied to a Facebook campaign (e.g. likes, impressions, etc.), while outcomes are the desired changes in public opinion the advertisements are trying to achieve. Since the outcomes are unobservable, they would have had to use the outputs of these campaigns as proxies. What were their objectives while looking at these proxies?
Looking into this is interesting for a couple of reasons. The first is that learning about the IRA's objectives might tell us how to better combat these campaigns in the future; there is no reason to think the IRA or other state actors will stop attempting to manipulate populations through social media, so the problem will only get bigger and any research might help. Another reason is that, as a data scientist myself, I might learn something about how to advertise more effectively.
This analysis will attempt to answer what objectives the IRA was optimizing with regard to the information they had available for Facebook ads: Clicks, Impressions, and Costs. I've found evidence that the IRA adapted its ad placement to be more cost-effective in terms of click-through rates, suggesting they tried to optimize some sort of Clicks/Costs metric.

General Idea of Analysis

The general idea is to think like a data scientist. If I were trying to optimize something, I would look at the performance of previous campaigns and make future decisions according to my objectives. The decisions in this case are what kind of ads to deploy to which people. While I do have the text associated with each ad, I will ignore that information for now and focus only on whom the ad targeted. (Facebook allows a marketer to target based on just about any demographic, interest, or behavior; focusing on targeting should make the problem more tractable.)
I try to simulate this activity in two steps for a given time period.
1  -  The first is to select a period of time and build ridge regressions for the five available metrics (Clicks, Costs, Impressions, plus Clicks/Costs and Impressions/Costs, which I added because I thought they'd be useful) using ad features (targets like 'users between ages of 18-65' or 'liked MLK') as explanatory variables. Each explanatory variable therefore gets 5 regression coefficients.
2  -  If they were updating ads based on previous information, I'd expect to find correlations between previous performance (the estimated coefficients from step 1) and future decisions. I therefore treat these regression coefficients as explanatory variables in a second step, regressing them on the number of times each feature occurred in ads in the subsequent period. A rough sketch of both steps follows this list.
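Here is that sketch; ads is a hypothetical data frame with one row per ad, a date, the metric columns, and 0/1 columns for each target feature (all names here are mine, not the actual code).

```
library(glmnet)

two_step <- function(ads, t0, features, lookback = 28, lookahead = 7) {
  past   <- ads[ads$date >= t0 - lookback & ads$date <= t0, ]
  future <- ads[ads$date >  t0 & ads$date <= t0 + lookahead, ]
  X <- as.matrix(past[, features])

  # Step 1: one ridge regression (alpha = 0) per metric
  metrics <- c("clicks", "costs", "impressions",
               "clicks_cost", "impressions_cost")
  coefs <- sapply(metrics, function(m) {
    fit <- cv.glmnet(X, past[[m]], alpha = 0)
    as.numeric(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
  })

  # Step 2: how often each feature appears in the next week's ads
  data.frame(coefs, counts = colSums(future[, features]))
}
```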
An elaboration of this method and what it is trying to capture is in order. The IRA may explore different subsets of people to target ads to and adapt as they gain more information. They can see for themselves which features are positively or negatively correlated with each metric, which is what step one tries to recreate. The IRA would then make decisions about future ads based on that information.
Suppose the IRA was interested in maximizing impressions without regard to costs or clicks. In the second step we would then expect a positive relationship between subsequent usage and features with large positive impressions coefficients, but no relationship with the costs or clicks coefficients. That's the theory, anyway.
I do not expect this to be the 'true' data generating process. All I'm trying to do is construct features that roughly capture the efficacy of ad targeting and subsequent decision making. To some degree this is a hacky way of attempting Inverse Reinforcement Learning.

Data And EDA

The US government and Facebook found and released ads that were purchased by the IRA. The data I used was found at https://russian-ira-facebook-ads.datasettes.com/, where the metadata has been cleaned up very nicely. However, only ads paid for in rubles (yes, they didn't hide their tracks too well) have cost data, so I subsetted the data to those observations. In addition, I focused on the 2016 election and its immediate aftermath, using dates between 2016-01-01 and 2017-02-01. This left 1,296 ads in the dataset.
Below is a pairs plot of all output metrics on a log scale.
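The plot can be reproduced with something like the following (column names hypothetical):

```
# Pairs plot of the three metrics on a log scale; +1 guards against log(0)
pairs(log(ads[, c("clicks", "impressions", "costs")] + 1))
```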

Pairs Plot of Metrics






One thing to note is the frequency of 'horizontal' bands of points for costs. Since these cluster around integer values, they are most likely due to the 'max spend' limits Facebook allows purchasers to set.
The highest correlation is between Impressions and Clicks. The correlation between Impressions and Costs is also high, most likely because the IRA chose to be charged by impressions rather than clicks (the other option).

For targeting data I used only those targets with more than 10 observations, which left 240 targets. Below is a plot of the most common ones.






After some generic ones (NewsFeed, Desktop, English), one can see they focused heavily on African Americans. In a previous blog post I mentioned how these really took off in the summer and fall of 2016, right after Manafort gave the Russians some as-yet-undisclosed polling data (but that's another story).

Results

For a particular date, step 1) uses ads deployed in the 28 days up to and including that date. For step 2) I then look at the 7 days following that date and count the number of times each topic was deployed in an ad. I repeat this calculation weekly for dates between 2016-01-01 and 2017-02-01. There are 57 time periods and 240 features, which combined give 13,680 'observations' of regression coefficients and subsequent counts. The choice of a 28-day lookback for step 1) and a 7-day look-forward for step 2) was fairly arbitrary; I tried other windows and got largely similar results.
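Using the hypothetical two_step() helper sketched earlier, the rolling calculation looks roughly like:

```
weeks <- seq(as.Date("2016-01-01"), as.Date("2017-02-01"), by = "7 days")

panel <- do.call(rbind, lapply(weeks, function(w) {
  out <- two_step(ads, w, features)
  out$feature <- features
  out$week    <- w
  out
}))

nrow(panel)  # 57 weeks x 240 features = 13680 rows
```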

Below is what one time period's coefficients look like for Costs and Clicks.


This was during the week of the 2016 election. Each point is a topic, with its estimated regression coefficient on each axis; a point's size is how many times that topic was used in the subsequent week. For example, targeting users with "United States" was associated with higher costs and more clicks, and it was one of the most commonly used targets in the subsequent 7-day period. Conversely, 'Pan Africanism' is associated with relatively low costs and clicks.

I included an x=y line for clarity. One can see that most observations are above this line, indicating that some sort of clicks-per-cost metric is important.


We can see all time periods and coefficients in the pairs plot below.
The first five variables are the estimated coefficients for all targets across all time periods. The final column is each coefficient's number of occurrences in the following week. The highest correlation among the estimated features is between Costs and Impressions, adding further evidence that the ads were charged by impressions and not clicks.
We also see that Costs and Impressions are negatively correlated with the counts, while Clicks are positively correlated. I take this to mean the IRA was interested in maximizing its non-paying metric of clicks while minimizing costs (and consequently impressions). Finally, the Clicks/Costs metric is slightly more correlated with the counts than Impressions/Costs.
For these reasons I ran a regression of log(counts + 1) (logged due to skewness in the data) on the costs, clicks, and clicks/costs coefficients while dropping the impressions features. I did so because there is too much collinearity between impressions and costs, and I think they tracked the same thing.
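The regression itself is roughly the following, assuming the hypothetical panel's coefficient columns have been renamed to match the output below (coefs_costs, coefs_clicks, coefs_clicks_cost, counts_topics_current):

```
fit <- lm(log(counts + 1) ~ scale(coefs_costs) + scale(coefs_clicks) +
            scale(coefs_clicks_cost) +
            scale(log(counts_topics_current + 1)) + factor(week),
          data = panel)
summary(fit)
```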

Below is the regression output for those features, along with time dummies and the number of times the topic was targeted during the current period (time dummies excluded from the output for length).


Observations: 13680

                                         Est.   S.E.   t val.      p
(Intercept)                             -0.26   0.05    -5.35   0.00
scale(coefs_costs)                      -0.08   0.02    -3.46   0.00
scale(coefs_clicks)                      0.08   0.02     3.19   0.00
scale(coefs_clicks_cost)                 0.02   0.02     1.06   0.29
scale(log(counts_topics_current + 1))    0.64   0.01    99.51   0.00

*date dummies removed for readability
As expected, there is a negative relationship with costs and a positive one with clicks, and both are statistically significant.
To look further into this relationship I built a random forest model with the same features as above (but without clicks/costs, since the model should be able to construct that interaction itself). To see whether the estimated coefficients actually 'add' anything, I built two random forests: one with only date and current counts, and one with those features plus the costs and clicks coefficients. The second model increases out-of-sample % variance explained by ~1.2 points, explaining ~57.7% of the variance. So the added predictability is small but there!
Below is a partial dependence plot of the costs and clicks coefficient explanatory variables.
It shows that the IRA was focused on minimizing costs while maximizing clicks, in broad agreement with the regression output above.
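A sketch of the two forests and the partial dependence plots, reusing the hypothetical panel columns from the regression sketch above:

```
library(randomForest)

panel$log_counts         <- log(panel$counts + 1)
panel$log_counts_current <- log(panel$counts_topics_current + 1)
panel$week_num           <- as.numeric(panel$week)

base_fit <- randomForest(log_counts ~ week_num + log_counts_current,
                         data = panel)
full_fit <- randomForest(log_counts ~ week_num + log_counts_current +
                           coefs_costs + coefs_clicks, data = panel)

# OOB % variance explained for each model (last element of rsq)
tail(base_fit$rsq, 1); tail(full_fit$rsq, 1)

# Partial dependence of predicted counts on each coefficient
partialPlot(full_fit, panel, x.var = "coefs_costs")
partialPlot(full_fit, panel, x.var = "coefs_clicks")
```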

Conclusions


Overall, it appears the IRA did try to maximize clicks on their Facebook ads while minimizing costs. I've described these results to some friends and got the response 'well, yeah, that makes sense'. It's not a very surprising result... it's almost banal. But it does underscore the fact that they were using techniques similar to those I might use to help a business... it's just that they were trying to f*** with my country.

This analysis does not answer whether the IRA changed public opinion or had a decisive influence and 'hacked' the 2016 election; I don't think there is enough public information to answer that question. (Side rant: I bet Facebook could come up with a decent analysis of that. They know who saw the IRA's ads and posts, know similar people who didn't see them, and know where those users probably voted. Couldn't Facebook look at precinct-level results and do some sort of ecological inference?)

But even if they didn't hack the election, it seems they did some decent analysis of Facebook ads that goes above and beyond what Facebook gives its ad purchasers. Facebook does not provide performance indicators at the target level, yet the evidence presented here suggests the IRA analyzed target-level attributes anyway. In a sense they might have hacked Facebook, and that gave me an idea.

github code