Many statisticians, epidemiologists, economists and data scientists have registered serious reservations about the reported coronavirus case-counts. Comparing countries and states using those case-counts seems inappropriate when every nation or state has adopted different testing strategies and protocols. Estimating the prevalence of COVID-19 from these data is a hopeless exercise, and several groups have recently argued for estimating the number of truly infected cases from mortality rates instead. In this project, we aim to (a) posit a conceptual mathematical framework that characterizes the simultaneous effects of sampling bias, misclassification/imperfection of the test, and heterogeneity in the reproductive number on the estimation of the prevalence rate, (b) review current testing strategies in some of the countries where we have testing data, and (c) provide guidelines for testing strategy and disease surveillance that may help track the pulse of the epidemic, identify disease-free areas, and identify disease outbreaks.
This project includes the code needed to reproduce the results. This includes (A) sourcing both US and world testing data, (B) algorithmic development, and (C) application of the models to the cleaned datasets. If you use this code, please cite the paper using the following BibTeX entry:
@article{dempsey:2020,
author = {Du, Jiacong and Dempsey, Walter and Mukherjee, Bhramar},
title = {The Hypothesis of Testing: Paradoxes arising out of reported coronavirus case-counts},
journal = {arXiv},
year = {2020}}
To run the code, follow these steps:
- Dependencies: all code is developed in Python using Anaconda.
- The Anaconda environment can be installed using covid.yml. See here for instructions on creating the environment. Open the Anaconda shell, navigate to the GitHub repository, and run:
conda env create -f covid.yml
- Datasets and exploratory data analysis
- World testing data are accessed here and country population totals are accessed here
- US testing data are accessed here and US population totals are accessed here. Population totals for AS, GU, MP, and VI are extracted from here
- Exploratory data analysis is presented as a set of IPython notebooks. Descriptive statistics are used to inform the priors of the measurement-error models used in the analysis phase
- The methods directory contains all algorithms for estimation under selection bias, measurement error, and heterogeneity. The algorithms are implemented in PyMC3.
- All evaluation functions can be found in the evaluation directory. In particular, we perform posterior predictive checks to confirm model fit to the data.
- The final report can be found in the write-up directory
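For readers unfamiliar with the posterior predictive checks mentioned in the evaluation step, the following is a minimal sketch of the idea. It is not the repository's evaluation code: it replaces the PyMC3 models with a conjugate Beta-Binomial model so that it runs with the standard library alone, and the function name, the flat Beta(1, 1) prior, and the example counts are assumptions made for illustration.

```python
import random


def posterior_predictive_pvalue(positives, tests, n_draws=2000,
                                prior_a=1.0, prior_b=1.0, seed=0):
    """Share of replicated datasets whose positive count is at least the
    observed count. Values near 0 or 1 suggest the model fits poorly."""
    rng = random.Random(seed)
    # Conjugate update: Beta prior + Binomial likelihood gives a Beta posterior.
    post_a = prior_a + positives
    post_b = prior_b + (tests - positives)
    at_least_observed = 0
    for _ in range(n_draws):
        # Draw a positivity rate from the posterior.
        rate = rng.betavariate(post_a, post_b)
        # Simulate a replicated dataset of the same size under that rate.
        replicated = sum(rng.random() < rate for _ in range(tests))
        if replicated >= positives:
            at_least_observed += 1
    return at_least_observed / n_draws


# Hypothetical data: 30 positive results out of 200 tests.
print(posterior_predictive_pvalue(positives=30, tests=200))
```

With a full model the same comparison would be made between the observed data and datasets simulated from posterior draws of all model parameters, typically using several summary statistics rather than a single count.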