Predicting an epidemic with faulty data

As COVID-19 spread from China, it was hard to predict how and where it would spread, but I had a hypothesis.

Jonny Axelsson
Mar 17, 2020

In January and February countries across the world started reporting cases of COVID-19, but how many cases were unreported? Places like South Korea started extensive testing, finding cases that might take weeks to reach the hospitals, if at all. At the same time places like Iran hid cases for political reasons.

If all you know is the daily number of cases, deaths and recoveries, and not, for example, how many people had been hospitalised or tested, or the age and health of the patients, can you make an informed guess about how badly each country is going to be affected, even when every country has its own reporting strategy?

You can’t, but my guess was that, of those numbers, the reported number of deaths was the most honest. It wouldn’t be right either: in both Wuhan and Iran there were reports of deaths being hidden, and elsewhere too cases could be reported as e.g. pneumonia, multiple organ failure, or simply death at home. But compared to confirmed cases, which depend so much on strategy, political motivations, and the health system, death is pretty honest.

Hypothesis: Countries with an unusually high ratio of dead patients to cases would have a wider spread of COVID-19 than their reported number of cases would indicate.

So I made a spreadsheet, copy/pasting data from Wikipedia (which in turn got them from Johns Hopkins), to test that hypothesis, beginning March 6. I took all the data in case I wanted to test other approaches later. Feel free to do so yourself.

Test: Select countries with at least 1 death, or with at least 50 cases and no deaths. Exclude China as an outlier (its outbreak started about 40 days earlier). Then sort by death rate (#dead/#cases).
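For anyone who would rather script this than use a spreadsheet, here is a minimal Python sketch of that selection step. The `Row` type, the function name and the sample rows are my own placeholders for illustration (the rows are made up, not real figures), and the "50 cases" rule is read as applying to countries with no deaths.

```python
from typing import NamedTuple


class Row(NamedTuple):
    country: str
    cases: int
    deaths: int


def select_and_rank(rows, min_cases_without_deaths=50, exclude=("China",)):
    """Keep countries with >= 1 death, or >= 50 cases and no deaths,
    drop excluded countries, and sort by death rate, highest first."""
    selected = [
        r for r in rows
        if r.country not in exclude
        and r.cases > 0
        and (r.deaths >= 1 or r.cases >= min_cases_without_deaths)
    ]
    return sorted(selected, key=lambda r: r.deaths / r.cases, reverse=True)


# Made-up placeholder rows, NOT real figures.
rows = [
    Row("A", 100, 5),         # death rate 5%
    Row("B", 60, 0),          # no deaths, but at least 50 cases: kept
    Row("C", 10, 0),          # no deaths and fewer than 50 cases: dropped
    Row("D", 20, 2),          # death rate 10%
    Row("China", 80000, 3000),  # excluded as outlier
]
ranked = select_and_rank(rows)
print([r.country for r in ranked])
```

With real data, `rows` would be filled from the March 6 column of the spreadsheet.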

Prediction: Those with a high death rate would end up with higher-than-average growth in cases later. The world cases on March 16 (excluding China) were 5.11 times higher than on March 6. So countries with a high death rate should have grown by more than that, and countries with a low death rate by less.
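The comparison against the world figure can be written as a one-line ratio. This is only a sketch: the function name is hypothetical, the 5.11 default is the world factor quoted above, and the example numbers are made up.

```python
def relative_growth(cases_start, cases_end, world_factor=5.11):
    """Country growth factor divided by the world growth factor.

    A value above 1 means the country grew faster than the world
    (excluding China) over the same period, below 1 means slower.
    """
    if cases_start <= 0:
        raise ValueError("cases_start must be positive")
    return (cases_end / cases_start) / world_factor


# Made-up example: 100 cases growing to 1022 is twice the world factor.
print(round(relative_growth(100, 1022), 2))
```

Under the hypothesis, the countries at the top of the death rate ranking should mostly score above 1 here, and those at the bottom below 1.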

Outcome: No such thing. There is a slight correlation with a higher-than-average death rate (on March 6 it was 1.9% for the world outside mainland China), but this effect was drowned out by other factors, primarily that Asia (outside the Philippines) had far fewer confirmed cases than would be expected, and Europe (together with the US and the Philippines) had far more.

25 COVID19 countries with death rate March 6 (red) compared with confirmed cases March 16 (indigo)
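I only eyeballed the correlation from the chart. One way to put a number on it would be a rank correlation between the March 6 death rate and the later case growth. The sketch below is an assumption about how one might do that, not what I did in the spreadsheet: it computes a Spearman coefficient in plain Python, with tied values sharing an average rank.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Here `xs` would be the death rates on March 6 and `ys` the growth in cases up to March 16, one pair per country. A value near 0 would match what the chart suggests.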

Discussion: The impact of relatively random outbreaks seems larger than that of deaths, even though deaths should be a relatively early indicator of outbreaks and community spread. Think of South Korea before Daegu, Iran before Qom, Italy and Europe before where-it-broke-out. But this should already be apparent in the numbers, and it is hard to see that it is.

Policies obviously may also have mattered. On March 6 Italy and South Korea looked very similar by the numbers. On March 16 they do not. It doesn’t seem that we can pick up policy from a snapshot of case-to-death ratios. Perhaps we could from a sequence of numbers. E.g. numbers that are unchanged for a period of time and then suddenly jump might indicate laxer policies?

Age matters greatly for this disease, but again Italy and South Korea are fairly similar here. Younger Iran, however, should be worse off with the same number of deaths, since a younger population would be expected to have fewer deaths per case. This may be bad news for the even younger continent of Africa.

Finally, timing. Back on March 6, all of 11 days ago, COVID-19 hadn’t truly broken out in public yet (except in the hot spots of South Korea, Iran and Italy), and the numbers were so small that a few cases could skew them. Maybe the story will be different from March 6 to March 26? But this too diminishes the value of reported deaths as an early indicator of spread.

To really see if there is a valuable correlation or not, a more thorough analysis will be necessary, but this test is discouraging.

What about death rate to death rate?

If a country is lax or lying on March 6, wouldn’t it be the same 10 days later? What about comparing the (presumably more honest) death rates of March 16 and March 6 instead? Well, it is basically the same story, just with some more fluctuations.

Comparing death rates March 6 to March 16 (both red).

tl;dr Summary

There are a lot of discrepancies between countries in reported COVID-19 statistics, more than I would expect by chance. But if there is a way to spot and compensate for poor data, a simple death-to-case ratio is unlikely to be it.

What now?

The data are up there, for instance in that spreadsheet I made. Maybe you have an idea. If you do: Go for it!
