Saturday, June 27, 2020
Wednesday, May 6, 2020
Getting correct data on covid-19 cases is important for obtaining up-to-date information on how the disease is progressing. It's also necessary to create models and make accurate forecasts.
However, I'm starting to think most of the charts we see for covid cases over time are incorrect. The reason is that these datasets of aggregate statistics by state are built by scraping data and looking at the changes in 'headline' numbers - the running total of covid cases reported each day. However, scraping the data this way only accounts for when new deaths were reported and not when they happened.
For instance, take a look at deaths by covid for Colorado at the NY Times infographic here. One can see a dramatic spike of over 100 deaths in CO on April 24. However, this is due to 'backfilling', i.e. reporting on previously unreported cases all at once. I believe this methodology permeates other data sources, including the automatic graph on Google, IHME, and covidtracking.
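To make the mechanism concrete, here is a minimal sketch in Python. The records are made up for illustration (they are not Colorado's numbers). It compares the "daily change in the headline total" that a scraper would compute with the count of deaths by the day they actually happened.

```python
from datetime import date

# Hypothetical death records: (date_of_death, date_reported).
# Three of them are reported late, all on April 24 (backfilled).
records = [
    (date(2020, 4, 20), date(2020, 4, 21)),
    (date(2020, 4, 20), date(2020, 4, 24)),  # backfilled
    (date(2020, 4, 21), date(2020, 4, 22)),
    (date(2020, 4, 21), date(2020, 4, 24)),  # backfilled
    (date(2020, 4, 22), date(2020, 4, 23)),
    (date(2020, 4, 22), date(2020, 4, 24)),  # backfilled
    (date(2020, 4, 23), date(2020, 4, 24)),
]

days = [date(2020, 4, d) for d in range(20, 25)]


def headline_total(day):
    """Cumulative deaths that had been reported by the end of `day`."""
    return sum(1 for _, reported in records if reported <= day)


# What a scraper sees: the headline total each day, then its daily change.
totals = [headline_total(d) for d in days]
scraped_daily = [later - earlier for earlier, later in zip(totals, totals[1:])]

# What actually happened: deaths counted by date of death.
actual_daily = [sum(1 for died, _ in records if died == d) for d in days]

print("headline totals:       ", totals)
print("scraped daily change:  ", scraped_daily)  # covers April 21 to 24
print("deaths by date of death:", actual_daily)  # covers April 20 to 24
```

Both series add up to the same total, but the scraped series piles the late reports onto April 24 while the date-of-death series stays flat.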
The official data can be found on the covid19 colorado website and looks like a smooth and somewhat symmetric curve. Below is a chart comparing the two datasets.
This appears to be a known issue, but they still seem OK with publishing these incorrect charts. This can lead to erroneous conclusions, both for people looking at these charts and for anyone trying to do an analysis with them.
It looks like the CDC is modeling on similar data. They use the variable "Daily change in cumulative COVID-19 death counts", not "Daily Deaths" (see slide 4). It may also be a reason that forecasting models are not doing so well.
Thursday, March 12, 2020
In my last post I discussed whether exports from China can explain current COVID-19 rates. There were some criticisms of the analysis.
The first was that I assumed some sort of causal relationship. I probably shouldn't have said that trade can 'explain' COVID-19 rates. What I meant was whether trade can 'predict' the differences in COVID-19 rates. The mechanism itself seems obvious - increased contact should increase the probability of spreading diseases. Of course the $ amount of trade is imperfect - I expect $1 million of food exports to have more of a chance of carrying diseases than $1 million of phone exports - but it seems like a decent proxy for 'connectivity'.
The biggest critique was that I didn't take into account population sizes. I thought that was a decent critique, so I have run the results again and reproduced charts similar to the earlier ones. Feel free to comment.
Below is a histogram of log per capita COVID-19 cases. It seems far more symmetric once population sizes are taken into account.
Below is a scatterplot of log(covid-19 cases / population) ~ log(chinese exports) for each country. It looks like a fairly strong relationship. There are several small-country outliers that reduce the r^2 of the regression to ~ .24. This is an increase in r^2 relative to not taking population into account.
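For readers who want to reproduce this kind of number, here is a rough sketch in Python of the r^2 of a univariate regression of log per capita cases on log exports. The rows below are invented placeholders, not the data from this post, and the function name is my own.

```python
import math

# Hypothetical rows: (country, cases, population, chinese_exports_usd).
# These numbers are made up purely to exercise the code.
rows = [
    ("A", 9000, 60_000_000, 3.0e10),
    ("B", 7500, 51_000_000, 1.0e11),
    ("C", 600, 330_000_000, 4.8e11),
    ("D", 40, 5_000_000, 4.0e9),
    ("E", 15, 20_000_000, 2.0e9),
    ("F", 300, 80_000_000, 7.0e10),
]


def r_squared(x, y):
    """r^2 of a simple least-squares line fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


log_exports = [math.log(exports) for _, _, _, exports in rows]
log_per_capita = [math.log(cases / pop) for _, cases, pop, _ in rows]

r2 = r_squared(log_exports, log_per_capita)
print(f"r^2 = {r2:.2f}")
```

For a one-variable regression with an intercept, r^2 is just the squared correlation between the two logged series, which is what the function computes.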
I ran a similar random forest model with all sectors of trade. Below is the predicted vs. actual plot.
The r^2 is now ~ .55, which is an increase over both what was previously posted and the univariate regression above. Iran is still the largest outlier. Bahrain and Iceland are also now outliers.
Below is an updated importance plot. Nickel and sunflower seeds are still highly predictive of COVID-19 cases. I'm sure many of these correlations (not causations!) are geographic in nature. But some exports may be more likely to carry diseases, like 'articles of gut'. I think the best way to treat the graph below is as a rough exploratory device, not something meant to draw grand conclusions.
The conclusions don't dramatically change from the previous analysis. If anything the results have gotten stronger.
Tuesday, March 10, 2020
I've updated some results. Find them here
TLDR: I've found an association between number of people that tested positive for COVID-19 in a country and imports from China. In addition there are particular industries that are particularly correlated with COVID-19 rates. Iran is still an outlier taking this information into account.
Coronavirus (COVID-19) rates of infection outside of China have not applied to countries uniformly. As of 3/9/2020 Italy (9,172), South Korea (7,578), and Iran (7,161) have a disproportionate number of cases relative to other countries such as the US (605). Below is a histogram showing the distribution of cases in each country and it's clear it is right skewed and non-uniform.
What causes such an unequal distribution of COVID-19 cases among countries? I assume some degree of 'connectivity' between countries would be correlated with these rates. In particular I assume an increase in imports from China will lead to more cases in the importing country.
To investigate this I looked at data from https://oec.world/en/resources/data/. This includes bilateral trade data between countries and is broken down into hundreds of different categories of products. I'll look at a few things: 1) are the outliers of Iran, Italy, and South Korea explained by trade with China? 2) are some categories of traded products more correlated with rates of infection? and 3) if so, which products are more correlated?
1) Are Chinese Exports Correlated with COVID-19 Rates?
Below is a plot of COVID-19 rates by total Chinese exports for each country in 2017 (the latest data available). One can see a fairly strong association between these two variables. However there are a few outliers. Notably South Korea, Italy, and Iran are still above expected along with smaller countries and territories like San Marino.
Running a random forest of log(COVID-19-rates)~log(total_chinese_exports) yields an out-of-bag (OOB) r^2 of .12. So while the two variables appear to be strongly associated in the plot, there are enough outliers to decrease the 'explained' variance in infection rates.
2) Are there product categories among Chinese Exports that are more correlated with COVID-19?
To test this I ran a random forest of log(COVID-19-rates) on all export amounts broken out by category. This is a very wide data set with 97 countries (observations) and over 1200 categories (explanatory variables). Even this wide dataset produced an OOB r^2 of .49, meaning that it captured much of the variance not captured by the previous univariate model.
Below is a scatterplot of COVID-19 rates by these OOB random forest predictions. San Marino does not look like a large outlier anymore but Italy and South Korea do. Iran looks like a much larger outlier.
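Here is a minimal sketch of the OOB idea using scikit-learn in Python, which is an assumption on my part for illustration and not necessarily the tooling behind the numbers above. The data is synthetic: 97 rows like the post, but only 50 made-up "categories" to keep it fast, with the first one driving the response.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the wide trade table (NOT the OEC data):
# 97 "countries" by 50 "product categories" of log export amounts.
n_countries, n_categories = 97, 50
log_exports = rng.normal(size=(n_countries, n_categories))

# Made-up response: driven by the first category plus noise.
log_rates = 2.0 * log_exports[:, 0] + rng.normal(scale=0.5, size=n_countries)

# oob_score=True scores each row using only trees that did not see it.
forest = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
forest.fit(log_exports, log_rates)

oob_r2 = forest.oob_score_           # out-of-bag r^2
oob_pred = forest.oob_prediction_    # out-of-bag prediction per row

# The row with the largest OOB residual is the biggest "outlier".
residuals = log_rates - oob_pred
largest_outlier = int(np.argmax(np.abs(residuals)))

print(f"OOB r^2: {oob_r2:.2f}")
print("row with largest OOB residual:", largest_outlier)
```

Because every prediction comes from trees that never saw that row, the OOB r^2 works as a built-in holdout estimate, which matters when there are far more columns than rows.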
3) Which product categories are more correlated with COVID-19?
To answer this question I set up a model of the same form as the random forest in 2) but used a lasso regression to apply variable selection. Below are the scaled coefficients kept by the lasso regression.
One can see Chinese nickel powder and sunflower exports are more correlated with COVID-19 rates in the importing country.
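As an illustration of how lasso does variable selection, here is a small Python sketch with scikit-learn on synthetic data. The library choice, the penalty `alpha=0.5`, and the data are all assumptions for the example and not the settings used for the chart above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in (NOT the OEC data): 97 rows, 30 candidate features,
# of which only the first two actually drive the response.
X = rng.normal(size=(97, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=97)

# Scale features so the coefficients are comparable across categories.
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty shrinks weak coefficients to exactly zero.
# alpha=0.5 is an arbitrary illustrative choice.
model = Lasso(alpha=0.5).fit(X_scaled, y)

# Keep only features with a non-zero coefficient, largest magnitude first.
selected = [(j, coef) for j, coef in enumerate(model.coef_) if coef != 0.0]
selected.sort(key=lambda pair: abs(pair[1]), reverse=True)

for j, coef in selected:
    print(f"feature {j}: scaled coefficient {coef:+.2f}")
```

In practice the penalty would be tuned (for example by cross-validation) instead of fixed by hand, and with so few rows the set of surviving features can change a lot between runs.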
I looked at these and couldn't really make a story out of them. For instance the 7th most important feature had to do with swords and bayonets. I have no idea what this means :).
Conclusion: It appears trade is highly correlated with COVID-19 rates. In addition, the evidence suggests that some industries are more correlated with COVID-19 rates than others (due to the increase in r^2 between 1) and 2)).
However, it's still not clear why countries like Iran have such high rates. Perhaps the data is stale since it is from 2017, or maybe Iran-China trade is underreported in this dataset due to US sanctions.