Getting correct data on covid-19 cases is important for obtaining up-to-date information on how the disease is progressing. It's also necessary to create models and make accurate forecasts.
However, I'm starting to think most of the charts we see for covid cases over time are incorrect. The reason is that these datasets of aggregate statistics by state scrape data and look at the changes in 'headline' numbers: how many total cases of covid are reported by day. However, scraping the data only accounts for when new deaths were reported, not when they happened.
For instance, take a look at deaths by covid for Colorado at the NY Times infographic here. One can see a dramatic spike of over 100 deaths in CO on April 24. However, this is due to 'backfilling', that is, reporting on previously unreported cases. I believe this methodology permeates other data sources, including the automatic graph on Google, IHME, and covidtracking.
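To make the mechanism concrete, here is a minimal Python sketch using made-up numbers (not Colorado's actual data). It assumes each death record carries both a date of death and a date it was reported; the `records` list and both function names are hypothetical and exist only for illustration. It compares what a scraper sees (the daily change in the cumulative headline total) with a tally by actual date of death.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (date_of_death, date_reported). All values are made up.
records = (
    [(date(2020, 4, 20), date(2020, 4, 21))] * 3
    + [(date(2020, 4, 21), date(2020, 4, 22))] * 4
    + [(date(2020, 4, 22), date(2020, 4, 23))] * 3
    # Backfill: older deaths that were all first reported on April 24.
    + [(date(2020, 4, 10), date(2020, 4, 24))] * 5
    + [(date(2020, 4, 12), date(2020, 4, 24))] * 5
    + [(date(2020, 4, 23), date(2020, 4, 24))] * 2
)


def daily_change_in_cumulative(records):
    """What a scraper sees: the jump in the headline total on each report date."""
    reported_per_day = Counter(reported for _, reported in records)
    changes = {}
    previous_total = 0
    running_total = 0
    for day in sorted(reported_per_day):
        running_total += reported_per_day[day]  # headline cumulative count
        changes[day] = running_total - previous_total
        previous_total = running_total
    return changes


def deaths_by_date_of_death(records):
    """The tally a health department can build once backfilled records arrive."""
    return dict(Counter(died for died, _ in records))


scraped = daily_change_in_cumulative(records)
actual = deaths_by_date_of_death(records)

for day in sorted(set(scraped) | set(actual)):
    print(day, "scraped:", scraped.get(day, 0), "by date of death:", actual.get(day, 0))
```

Both series add up to the same total, but the scraped series piles every backfilled death onto the day it was reported, while the date-of-death series spreads them back to the days they occurred.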
The official data can be found on the covid19 colorado website and looks like a smooth and somewhat symmetric curve. Below is a chart comparing the two datasets.
This appears to be a known issue, but they still seem OK with publishing these incorrect charts. This can lead to erroneous conclusions, both for people looking at these charts and for anyone trying to do an analysis with them.
It looks like the CDC is modeling on similar data. They use the variable "Daily change in cumulative COVID - 19 death counts", not "Daily Deaths" (see slide 4). This may also be a reason that forecasting models are not doing so well.
Hi. Just wanted to say that this is not just a well known phenomenon, but an expectation in surveillance. The data are not necessarily incorrect; they are correct at the time of report and then changed as new information comes in. Those who work in surveillance are well aware, but you are correct in thinking that others may jump to premature conclusions. For example, policy makers are looking for a downward trend in the previous 14 days. But these are the wrong 14 days: they should be examining day -21 to day -7, since the most recent 7 days will be subject to the phenomenon you have pointed out.
Thanks.
Rich Rothenberg rrothenberg@gsu.edu (use this if you wish to reply)
Hey Rich,
Thanks for the comment but I think we're talking about two different issues.
You're saying that the official data will be underreported for a few weeks until they collect the data. That makes sense. Each data point will eventually be included in the correct 'date of death' tally, and one can be fairly confident the deaths from official sources are accurate after 2-3 weeks.
But the point I'm making is that most of these time series charts (like the NY Times) just report the change in cumulative numbers at that point in time. So if there is a huge backfill, like for CO on 4/24, then the numbers won't necessarily be accurate for that day. And since they don't go back and correct that number, it won't ever be correct using this process.
Do you know of any data sources that aggregate it with correct backfills?
Thanks,
Sam