
covid-19's People

Contributors

adamlerer, gilesvangruisen


covid-19's Issues

Puerto Rico specifically filtered out?

I spent some time looking at the data to see why Puerto Rico was not being included in the final results despite being in the underlying data, only to realize that we're not in there because you'd specifically removed us.

Why?

Case Count Mismatch with COVID Tracking Project: WA

The rt.live site cites The COVID Tracking Project as the source for case data; however, there are a few cases where the data on the COVID Tracking Project do not appear to match the case estimates displayed on rt.live or in the rt.csv file.

Specifically, for Washington State, the rt.csv file lists 202 new cases for April 26th and 342 cases for April 25th. The daily.csv from the COVID tracking project lists the positive cases as stuck at 13,521 from April 25th through April 27th. The increases on either side of this data gap from the COVID tracking project match exactly with the case increases in the rt.csv, but I am not sure where the values for the 26th and 27th came from. Is there documentation around how these numbers were generated, and why they may differ from the COVID Tracking Project results?

Loading Patient Info no longer works with latest data

The date validation logic below:

# Convert both to datetimes
patients.Confirmed = pd.to_datetime(
    patients.Confirmed, format='%d.%m.%Y')
patients.Onset = pd.to_datetime(
    patients.Onset, format='%d.%m.%Y')

# Only keep records where confirmed >= onset
patients = patients[patients.Confirmed >= patients.Onset]

fails because of some invalid dates in the latest version of the data.

Only the data up to May 13 works.

Further, the data file is now also gzipped because of GitHub size limits, and the notebook needs to be updated to handle this.
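A minimal workaround sketch, assuming the goal is simply to skip the bad records: errors='coerce' turns unparseable date strings into NaT instead of raising, and pandas can read the gzipped file directly (the filename below is a placeholder).

import pandas as pd

# pandas infers gzip compression from the .gz extension
patients = pd.read_csv('latestdata.csv.gz')

# errors='coerce' maps invalid dates to NaT instead of raising
patients.Confirmed = pd.to_datetime(
    patients.Confirmed, format='%d.%m.%Y', errors='coerce')
patients.Onset = pd.to_datetime(
    patients.Onset, format='%d.%m.%Y', errors='coerce')

# Drop the bad records, then keep rows where confirmed >= onset
patients = patients.dropna(subset=['Confirmed', 'Onset'])
patients = patients[patients.Confirmed >= patients.Onset]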

Overshoot on WA, maybe could apply signal processing techniques?

The WA chart zooming past 1 and bouncing off 0 reminds me very much of the sort of overshoot you get in signal processing (where you go past the target level), which is typically followed by ringing (where you oscillate around the target level for a while before settling down). I'm not well-versed enough in the science to see exactly where/how this might be applied in your model, but the standard fix is to attenuate any of the high-frequency inputs (e.g. with a low-pass filter). Perhaps you can solicit input from someone with relevant experience; I think it could help make these more accurate. It seems particularly relevant if trying to make these usable for policy, since you really don't want knee-jerk reactions to artifacts that aren't really part of the data.
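A minimal sketch of the low-pass idea using SciPy (the filter order and cutoff below are illustrative guesses, not tuned values):

import numpy as np
from scipy.signal import butter, filtfilt

def low_pass(series, cutoff=0.15, order=2):
    # cutoff is a fraction of the Nyquist frequency
    b, a = butter(order, cutoff)
    # filtfilt runs the filter forward and backward, so the smoothed
    # series has no phase lag relative to the original
    return filtfilt(b, a, series)

daily_cases = np.random.poisson(200, size=60).astype(float)  # stand-in data
smoothed = low_pass(daily_cases)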

Wrong number of args

In Realtime Rt mcmc.ipynb there is a line which reads
models[state] = create_and_run_model(grp.droplevel(0))

However, the function create_and_run_model expects two arguments
def create_and_run_model(name, state)

Hence, I think the line should read
models[state] = create_and_run_model(state, grp.droplevel(0))

Rt estimates early in outbreaks are low

Most estimates of R0 for COVID-19 are north of 2 (e.g. the Los Alamos paper on Wuhan).

I would expect the Rt estimates for most states to begin around R0 (before any responses were made), and then decrease as societal and personal behavior changed.

In epiforecasts.io, this is the case: the 50% credible interval is above 2 for the first week in each of the states they highlight, and the median prediction on March 9 in New Jersey, Illinois, and Pennsylvania is around 2.

In the current rt.live model, all point estimates for Rt are below 2, including on March 9.

To diagnose this, it may be worth simulating some data for which Rt begins at 2.5 and then decreases steadily over a month to 1 or below; a sketch follows below. There are myriad potential causes, including the value of sigma used in the Brownian motion, and edge effects. Besides, a simulation or two will make a good sanity check for the model in general!
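A minimal simulation sketch along those lines, assuming the notebook's relationship lambda_t = k_{t-1} * exp(GAMMA * (R_t - 1)) with GAMMA = 1/7:

import numpy as np

GAMMA = 1 / 7  # reciprocal of a 7-day serial interval, as in the notebook
rng = np.random.default_rng(42)

r_t = np.linspace(2.5, 1.0, 30)  # Rt declines linearly over 30 days

cases = [20]  # arbitrary seed count
for r in r_t:
    lam = cases[-1] * np.exp(GAMMA * (r - 1))
    cases.append(rng.poisson(lam))

# Run `cases` through get_posteriors() and check whether the recovered
# point estimates start near 2.5 and track the linear decline.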

Very cool to see all the progress in the last week, btw.

How do you adjust for changes in testing volumes?

I didn't see an adjustment for testing volume. Aren't you overestimating Rt without one, since your daily sample size is growing? Can you run this against hospitalizations, a much more consistent data input?

Cannot do slice indexing

I'm getting the following error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.datetimes.DatetimeIndex'> with these indexers [[0]] of <class 'numpy.ndarray'>
I tried multiple states but that didn't help. Please let me know if this is something simple. Thank you!

Worth attempting to handle the undercounting/testing issue?

First, I want to say this is a wonderful project @k-sys - thank you very much for doing this. As I think everyone is aware, however, there are issues with calculating R in real time based on confirmed case counts, because of limits on testing capacity. R could be falling dramatically, but if testing is only capable of capturing 5% of actual cases, you may not see a measurable decrease in confirmed case counts, as changes in testing capacity swamp the case counts.

As a result, tracking R based on confirmed cases is mostly just tracking testing capacity. As an example, here's the chart of New York's R based on the current approach.

[image: New York Rt under the current approach]

That spike around March 18 coincides perfectly with when New York massively ramped up their testing capacity, going from an average of about 1,200 tests a day over the preceding 5 days to about 11,000 a day (almost a 10x increase).

One simple fix is to switch to counting deaths, which should be more reliable, but this will introduce a lag of about 2 weeks into the data, which obviously defeats the purpose of tracking R in real time.

As an alternative, I'd like to propose another method I've been experimenting with which seems to give more sensible results, albeit at the cost of introducing some editorial judgment into the data. This approach is to:

  1. Using a 14-day lagging period between deaths and confirmed case counts, and some estimate of IFR (e.g. 0.5%), back into an estimate of how many 'true' cases there were up to 14 days ago.

  2. Within the most recent 14-day window, estimate the number of new cases that would be coming in if testing were flat within that window, by tracking changes in the percentage of tests coming back positive. So if, using Step 1 above, we estimate that New York had 100K new cases on April 7th at a 43% positive rate, and the positive rate dropped to 41% on April 8th, we would estimate approximately 95.3K cases on April 8th. Repeat daily. (A sketch of both steps follows below.)

Note that while IFR estimates are heavily debated, this approach is actually not sensitive to the IFR estimate, since R tracks case growth rather than total cases. It is very sensitive to the 14-day window, however.
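A minimal sketch of the two steps, assuming a daily DataFrame with deaths, positive, and total_tests columns (all names and constants here are illustrative):

import pandas as pd

LAG = 14      # assumed days between case confirmation and death
IFR = 0.005   # illustrative infection fatality rate

def estimate_true_cases(df):
    # Step 1: new deaths LAG days from now imply the 'true' new cases
    # today, so shift the death curve back and divide by IFR
    true_cases = df['deaths'].diff().shift(-LAG) / IFR

    # Step 2: within the final LAG-day window, scale the last backed-out
    # estimate by the day-over-day change in test positivity
    positivity = df['positive'].diff() / df['total_tests'].diff()
    last = true_cases.last_valid_index()
    est = true_cases.loc[last]
    for prev, day in zip(df.loc[last:].index, df.loc[last:].index[1:]):
        est *= positivity.loc[day] / positivity.loc[prev]
        true_cases.loc[day] = est
    return true_cases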

Using this approach generates a chart like this for New York:

[image: New York Rt under the proposed approach]

For context, the NBA season and most mass events were suspended around March 11th. Many businesses also went work-from-home around then, although formal shelter-in-place orders were not issued until later. I find this chart somewhat more plausible-looking, especially for earlier in March.

I'm using a Kalman filter here to smooth out differences in daily testing percentages. This approach could be further improved by using a Gaussian distribution to estimate 'true' cases in the past using deaths, rather than assuming a flat 14-day window.

It may also be possible to create a better estimate of 'true' cases within the 14-day window, rather than assuming they vary linearly as a function of the percentage of tests coming back positive (e.g., by adding features about the number of tests performed on a given day, or hospitalizations recorded some number of days ago, etc.). I have experimented with some GBDT models like this, but have not been impressed with the results so far. @kpelechrinis has suggested using a Gompertz curve to project deaths forward as well, which performs well out-of-sample, but is still using fundamentally lagging data.

I appreciate there are possible downsides to this approach, namely that this will turn this into an opinionated model of real-time R, rather than just observing some ground truth. But I think there's also value to trying to account for these testing issues.

I don't know if that's out of the scope of the vision of this project however. So I'm interested in getting thoughts as to whether this would be worth trying to incorporate.

Day 2 formula

Quite interesting approach. Thank you.
Can someone please explain the formula for day 2? It's a bit confusing how to obtain
$\sum_{R_1} P(R_1|k_1)\, P(R_2|R_1)\, \mathcal{L}(k_2|R_2)$
Thanks :)

Adjusted onset not being used for sampling

Great notebooks. I was reviewing the notebook because I'm planning to use it for some modelling, and I found this block of code:

def create_and_run_model(name, state):
    confirmed = state.positive.diff().dropna()
    onset = confirmed_to_onset(confirmed, p_delay)
    adjusted, cumulative_p_delay = adjust_onset_for_right_censorship(onset, p_delay)
    return MCMCModel(name, onset, cumulative_p_delay).run()

The adjusted series, which is the onset series adjusted for right censorship, is computed, but it is not the one passed to MCMCModel later. Is there a reason for this, or is it a typo?
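If it is a typo, the presumably intended call (a guess based on the variable names, not a confirmed fix) would be:

    return MCMCModel(name, adjusted, cumulative_p_delay).run()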

How to set "cutoff" parameter?

In the notebook, cutoff=25 is used as the default, and in case of len(smoothed)==0, cutoff is set to 10.
What I'd like to know is how these values were chosen, and whether there are any problems with using smaller values (e.g. cutoff=1).

June 25 model status

Is there an ETA as to when the updated notebook with the latest model will be published?

hdi question

Great work, thanks for doing this.
My question is about the HDI, which purports to show the uncertainty in the derived Rt.
It looks perfectly reasonable to me, but if we look at the country data run through the same process here:

and we look specifically at New Zealand, which has a high testing rate and low infection rate (due to lockdown and cluster management), then we see the HDI increasing.

[image: New Zealand Rt, where the HDI seems wrong]

This seems to mean we might be drawing the wrong conclusions about uncertainty.

Is there a way to factor in the testing per head of population here?
Is that even the right thing to do?
Or is the low number of positive cases screwing up something else?

Wrong application of Bayesian Inference

First disclaimer: I'm not a statistician or epidemiologist, I'm an engineer with a good feel for system modeling.

Second disclaimer: you're doing some great work and trying to get real-time estimates for Rt is extremely important.

I think you are making 1 assumption that is causing the statistics to behave badly - that you can take an approach that models a static R and with a small tweak make it model a time-varying Rt. Specifically the assumption that "we use yesterday's prior $P(R_{t-1})$ to estimate today's prior $P(R_t)$." This assumption only works for a static R, which is why your initial approach converged to a value of R=1. The initial approach of windowing or the new approach of adding Gaussian noise try to work around this issue without fixing it.

The underlying equation we're trying to estimate is: $\lambda = k_{t-1}e^{\gamma(R_0-NPI_t-1)}$

Where $R_t = R_0 - NPI_t$ and the value of $NPI_t$ is changing every day.

One traditional way to estimate a time-varying value like this would be a Kalman filter, which you mention in issue #19. However, a Kalman filter can't be applied directly, since it depends on matrices expressing a system of linear-quadratic equations, while our equation of interest has an exponent. I'm thinking maybe a particle filter / sequential Monte Carlo approach is appropriate here? I'm going to try working something up.

I think this, in addition to using 7-day smoothed data, is why the 95% confidence bands are so narrow when there's still a lot of obvious statistical noise.
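As a starting point, a minimal particle-filter sketch of the idea, assuming the notebook's Poisson observation model with GAMMA = 1/7 (all constants are illustrative, and cases are assumed to be integer daily counts):

import numpy as np
from scipy.stats import poisson

GAMMA = 1 / 7
N_PARTICLES = 5000

def particle_filter_rt(cases, drift=0.1, seed=0):
    rng = np.random.default_rng(seed)
    particles = rng.uniform(0, 6, N_PARTICLES)  # broad initial prior on Rt
    estimates = []
    for t in range(1, len(cases)):
        # propagate: random-walk step, clipped to keep Rt non-negative
        particles = np.clip(particles + rng.normal(0, drift, N_PARTICLES), 0, None)
        # weight: Poisson likelihood of today's count given each particle
        lam = cases[t - 1] * np.exp(GAMMA * (particles - 1))
        weights = poisson.pmf(cases[t], np.maximum(lam, 1e-9)) + 1e-300
        weights /= weights.sum()
        # resample particles in proportion to their weights
        particles = particles[rng.choice(N_PARTICLES, N_PARTICLES, p=weights)]
        estimates.append(particles.mean())
    return np.array(estimates)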

Nebraska missing a non-lockdown label in Standings section

no_lockdown = [
    'North Dakota', 'ND',
    'South Dakota', 'SD',
    'Nebraska', 'NB',
    'Iowa', 'IA',
    'Arkansas','AR'
]

The proper abbreviation for Nebraska is NE, not NB. It's listed as NE in all graphs, but doesn't show up as "No Lockdown" due to this issue.

Stability in NY estimates

I'm trying to get a handle on how to compare estimates from day to day. Here is the delta between yesterday's estimate for NYC and today's from the website rt.live.

Yesterday:
[screenshot: rt.live NYC estimate, April 18, 6:26 PM]

Today:
[screenshot: rt.live NYC estimate, April 19, 1:14 PM]

My intuition is that the model is local (modulo the 9-day Gaussian smoothing) and forward-looking, so adding one day of data shouldn't change the error bars (e.g. around March 22), nor shift the endpoint so dramatically. A large change in the inferred sigma could cause the expansion of the error bars.

Was there a change in data source?

`highest_density_interval` fails on the widest interval

The highest_density_interval function fails to handle the case when the interval to return should be the whole range.

E.g.

foo = pd.Series([1/2, 1/3, 1/6])
highest_density_interval(foo, p=0.9)

fails. The problem is that total_p doesn't have an element corresponding to the total sum of all probabilities; it's a vectorized version of an off-by-one error.

I'll send a PR with a fix.
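For reference, a sketch of one possible fix (the actual PR may differ): prepend a zero to the cumulative sum so that intervals starting at the very first bin become candidates.

import numpy as np
import pandas as pd

def highest_density_interval(pmf, p=.9):
    # prepend 0 so intervals starting at the first bin are candidates
    cumsum = np.concatenate([[0.], np.cumsum(pmf.values)])
    # total_p[i, j] = probability mass of pmf[i:j]
    total_p = cumsum[None, :] - cumsum[:, None]
    lows, highs = (total_p > p).nonzero()
    best = (highs - lows).argmin()  # narrowest interval with mass > p
    return pd.Series([pmf.index[lows[best]], pmf.index[highs[best] - 1]],
                     index=[f'Low_{p*100:.0f}', f'High_{p*100:.0f}'])

foo = pd.Series([1/2, 1/3, 1/6])
highest_density_interval(foo, p=0.9)  # now returns the whole range (0, 2)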

Gaussian Prior vs Gamma Prior

Instead of modelling $R_t$, could we model $\lambda_t$ over time? The idea is that since $k \mid \lambda \sim \mathrm{Poisson}(\lambda)$, we could use the natural Gamma distribution as the prior to the Poisson distribution to quickly update the $\lambda_t$'s over time. We could then back out $R_t$ using your relationship $\lambda_t = k_{t-1} e^{\gamma (R_t - 1)}$.

This also raises the question of whether or not the Gaussian distribution is the natural choice of prior for $R_t \mid R_{t-1}$. It seems like some transform of a Gamma distribution, along the lines of the relationship between $\lambda_t$ and $R_t$, would be more reasonable. What do you think?
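A minimal sketch of the conjugate Gamma-Poisson update being suggested (prior parameters are illustrative; note that without a forgetting factor this treats lambda as static, so a discounting scheme would be needed to make it genuinely time-varying):

import numpy as np

GAMMA = 1 / 7  # reciprocal of the serial interval, as in the notebook

def gamma_poisson_rt(cases, alpha=1.0, beta=0.1):
    # lambda ~ Gamma(alpha, beta) with k ~ Poisson(lambda) gives the
    # posterior Gamma(alpha + k, beta + 1) after observing k
    rts = []
    for t in range(1, len(cases)):
        alpha, beta = alpha + cases[t], beta + 1
        lam = alpha / beta  # posterior mean of lambda_t
        # invert lambda_t = k_{t-1} * exp(GAMMA * (R_t - 1))
        rts.append(1 + np.log(lam / cases[t - 1]) / GAMMA)
    return rts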

Question about Rt vs simple rate of growth in cases

I'm just wondering what this gives us that a simple rate of growth in cases doesn't. E.g., the interval ranges seem to be extremely narrow and therefore not predictive of the future spread. A simple calculation of (total cases today)/(total cases yesterday) seems to be a fairly accurate predictor of case growth over the next few days. I haven't done the calculation, but looking at WA data, this Rt model is out of sync with the actual daily number of total cases. E.g. Rt went high for WA a few days ago, yet the actual number of cases has still been growing slowly, by about 1-2% a day, the lowest of all 50 states except VT.

Generalized Additive Models and SIR definition of reproductive number

Hey guys!

Firstly, thank you very much for sharing your work and raising awareness of Rt. We already forked the code and started our own repository and website for applying Bettencourt & Ribeiro's model to Brazilian states: code / website

I was thinking about some issues that you guys and other people raised, such as:

  • Choice of smoothness function
  • Calibrating sigma
  • Low uncertainty

And tried to model Rt with a completely different methodology, using the SIR model definition of the reproductive number:

Rt = (case growth rate) / (recovery rate)

Modeling the case growth rate with a vanilla ML model, such as a Generalized Additive Model, and using a recovery rate of 1 / (14 days).

Code and deep dive in this notebook
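For readers without the notebook handy, a minimal sketch of the idea using pyGAM (names and smoothing defaults are illustrative):

import numpy as np
from pygam import LinearGAM, s

RECOVERY_RATE = 1 / 14

def gam_rt(cases):
    # fit a smooth trend to log cases; its derivative is the growth rate
    t = np.arange(len(cases)).reshape(-1, 1)
    gam = LinearGAM(s(0)).fit(t, np.log(cases))
    growth = np.gradient(gam.predict(t))  # d/dt of log(cases)
    return growth / RECOVERY_RATE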

Results are similar, but I'm not entirely convinced that its a good approach. Would love to hear your feedback on this :)

Keep up the awesome work! Thank you very much!

Translation From Confirmed Date to Onset Date May Not Be Correct

I am worried about the translation from confirmed date backwards to onset date in the new Realtime Rt MCMC.ipynb notebook. A few points that are worrisome first:

  • If I try to translate the "Adjusted Onset" curve forward again to check against the "Confirmed" curve, they do not match (in fact, the forward-translated Adjusted Onset curve is even smoother and flatter than the Adjusted Onset curve---since it will have been convolved yet again with a broad kernel).

  • The broad distribution of confirmed dates given an onset date acts as a smoothing operation that "goes the wrong way." Consider a single burst of onset at day 0 (a "delta function" in the onset): since there are a range of delays between each patient's onset and confirmation, the confirmation curve will be a copy of the delay distribution (scaled by the number of onsets in the burst). (To be technical: the delay distribution is the Green's function for the onset -> confirmation process.). A sharp peak in onset smooths out into a broad range of confirmation dates. So, if anything, the onset time series should be sharper or less smooth than the confirmation time series, but your curves go the opposite way.

I think you should be deconvolving by the onset->confirmation delay distribution (which, in general, is a mathematically ill-posed problem, unless the distribution is of a special class). To make it well-posed, you should be inferring the distribution of onsets conditional on the confirmation observations (by sampling over variables that are onset rates, which get convolved with your distribution of delays, and then matched to confirmation observations), subject to some prior (probably something smoothing like the Gaussian-increments prior you've employed for estimating Rt).
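A tiny illustration of the second bullet, with toy numbers:

import numpy as np

p_delay = np.array([0.05, 0.15, 0.25, 0.25, 0.15, 0.10, 0.05])  # toy delay pmf
onsets = np.zeros(20)
onsets[5] = 100  # a delta-function burst of onsets on day 5

confirmations = np.convolve(onsets, p_delay)[:20]
# `confirmations` is a scaled copy of p_delay starting on day 5: the sharp
# onset peak smooths into a broad confirmation curve. Recovering onsets
# from confirmations therefore requires deconvolution, not another
# convolution.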

Prior for GAMMA

I really like the general approach, but treating GAMMA (the reciprocal of the serial interval) as a known constant produces posteriors that are too narrow. There's clearly some uncertainty there, as indicated in the cited paper and others estimating the serial interval (e.g., https://www.medrxiv.org/content/10.1101/2020.04.17.20053157v1). It would be great to replace GAMMA with a formal prior distribution, perhaps inverse gamma or inverse lognormal, to reflect that uncertainty.
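A minimal PyMC3 sketch of the suggestion (the lognormal parameters are illustrative placeholders, not fitted values):

import numpy as np
import pymc3 as pm

with pm.Model():
    # prior on the serial interval (days), centered near 7
    serial_interval = pm.Lognormal('serial_interval', mu=np.log(7), sigma=0.2)
    gamma = pm.Deterministic('gamma', 1 / serial_interval)
    # ... gamma then replaces the GAMMA constant in the
    # exp(gamma * (r_t - 1)) term of the likelihood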

Highest Density Interval

This is just a comment regarding the highest density interval.

If the posterior was calculated using MCMC chains, the pymc3.stats.hpd function from the PyMC3 package could have been used. Although I don't think it'll work for the pmf directly, there might be a way to tweak it.

Feel free to close this without a response.

Feature request: use a binomial distribution to take the number of tests into account

Hi,
Awesome work! Thank you for such a valuable tool!

Since covidtracking.com has data on testing, I added it to the model. I used a binomial distribution instead of a Poisson. This change helps with variability in testing. Here is the result of the Binomial model for NY
[image: Rt for NY under the Binomial model]

As you can see, the model has less variance, and I think it's more accurate. I put this together in a notebook here.
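For readers who can't open the notebook, a minimal sketch of the binomial-likelihood idea in PyMC3 (toy stand-in data; the linked notebook is the real implementation):

import numpy as np
import pymc3 as pm

tests = np.array([1000, 1200, 1500, 1600, 2000])  # daily tests (toy data)
positives = np.array([150, 170, 200, 210, 240])   # daily positives (toy data)

with pm.Model():
    # random walk on the log-odds of a test coming back positive
    log_odds = pm.GaussianRandomWalk('log_odds', sigma=0.05, shape=len(tests))
    rate = pm.Deterministic('rate', pm.math.invlogit(log_odds))
    # observed positives are Binomial in the number of tests, so days
    # with more tests carry proportionally more information
    pm.Binomial('obs', n=tests, p=rate, observed=positives)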

I hope this is useful.

Let me know if you want me to clarify anything, or work on how to incorporate this change into the new pymc3 notebook.

TypeError when ML = 0 in highest_density_interval

Hi,
I used the JHU data to compute the Rt, and France seems not to fit :). A test file is attached.
At 2020-04-17 the ML is 0, and the highest_density_interval function throws the error shown below. A hotfix for the highest_density_interval function could be:

...
for i, value in enumerate(cumsum):
    for j, high_value in enumerate(cumsum[i+1:]):
        if (high_value - value > p) and (not best or j < best[1] - best[0]):
            best = (i, i + j + 1)
            break

# --> hotfix: bail out when no interval was found
if best is None:
    return pd.Series()

low = pmf.index[best[0]]
high = pmf.index[best[1]]
...

###################

TypeError                                 Traceback (most recent call last)
~/Projekte/_work/covid-19/Realtime-R0_jhu.py in <module>
    510         return pd.Series([low, high], index=[f'Low_{p*100:.0f}', f'High_{p*100:.0f}'])
    511
--> 512 hdis = highest_density_interval_tmp(posteriors, p=.9)
    513
    514 most_likely = posteriors.idxmax().rename('ML')

~/Projekte/_work/covid-19/Realtime-R0_jhu.py in highest_density_interval_tmp(pmf, p)
    495     # If we pass a DataFrame, just call this recursively on the columns
    496     if(isinstance(pmf, pd.DataFrame)):
--> 497         return pd.DataFrame([highest_density_interval_tmp(pmf[col], p=p) for col in pmf],
    498                             index=pmf.columns)
    499

~/Projekte/_work/covid-19/Realtime-R0_jhu.py in <listcomp>(.0)
    495     # If we pass a DataFrame, just call this recursively on the columns
    496     if(isinstance(pmf, pd.DataFrame)):
--> 497         return pd.DataFrame([highest_density_interval_tmp(pmf[col], p=p) for col in pmf],
    498                             index=pmf.columns)
    499

~/Projekte/_work/covid-19/Realtime-R0_jhu.py in highest_density_interval_tmp(pmf, p)
    506                 break
    507
--> 508     low = pmf.index[best[0]]
    509     high = pmf.index[best[1]]
    510     return pd.Series([low, high], index=[f'Low_{p*100:.0f}', f'High_{p*100:.0f}'])

TypeError: 'NoneType' object is not subscriptable
################

test.csv.zip

Missing License

Thoughts about open source licensing this? I hope yes? I don't mind opening a PR but wasn't sure which license you had in mind, perhaps MIT?

Unable to find the specified module.

I'm using Python 3.8.2 on Windows. I hit this problem when running data through the MCMC model.

Traceback (most recent call last):
  File "c:/Users/nicma/Desktop/PythonScripts/Tesi/main.py", line 362, in <module>
    models[state] = create_and_run_model(state, grp.droplevel(0))
  File "c:/Users/nicma/Desktop/PythonScripts/Tesi/main.py", line 348, in create_and_run_model
    return MCMCModel(name, onset, cumulative_p_delay).run()
  File "c:/Users/nicma/Desktop/PythonScripts/Tesi/main.py", line 262, in run
    step_size = pm.HalfNormal('step_size', sigma=.03)
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pymc3\distributions\distribution.py", line 46, in __new__
    dist = cls.dist(*args, **kwargs)
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pymc3\distributions\distribution.py", line 57, in dist
    dist.__init__(*args, **kwargs)
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pymc3\distributions\continuous.py", line 835, in __init__
    self.mean = tt.sqrt(2 / (np.pi * self.tau))
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\tensor\var.py", line 233, in __rmul__
    return theano.tensor.basic.mul(other, self)
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\op.py", line 669, in __call__
    thunk = node.op.make_thunk(node, storage_map, compute_map,
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\op.py", line 954, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map,
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\op.py", line 857, in make_c_thunk
    outputs = cl.make_thunk(input_storage=node_input_storage,
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cc.py", line 1215, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.compile(
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cc.py", line 1153, in compile
    thunk, module = self.cthunk_factory(error_storage,
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cc.py", line 1623, in cthunk_factory
    module = get_module_cache().module_from_key(
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cmodule.py", line 1189, in module_from_key
    module = lnk.compile_cmodule(location)
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cc.py", line 1520, in compile_cmodule
    module = c_compiler.compile_str(
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cmodule.py", line 2405, in compile_str
    return dlimport(lib_filename)
  File "C:\Users\nicma\AppData\Local\Programs\Python\Python38-32\lib\site-packages\theano\gof\cmodule.py", line 317, in dlimport
    rval = __import__(module_name, {}, {}, [module_name])
ImportError: DLL load failed while importing mb63002e4cb3b72a8678617030d8b841c60d1b77d8eb3a2052c9a8bc46863bcd5: Impossibile trovare il modulo specificato. (Italian: "The specified module could not be found.")

Can you help me?

Tunisia effective progression rate

Thank you very much for the links and codes. I had to smooth the data (mostly the zeros in the daily data) to overcome errors in the calculation of the optimums. Simulations give R_0 = 2.73 for Tunisia and R_t = 0.4 as of May 16, 2020.

[figure: simulated Rt curve for Tunisia]

Feature request: overlay mortality trends (daily increase data)

First, thank you guys for putting all the effort into this. My stats/python is a bit rusty, but if I can be in a position to help, I'll still try my best. I hope to be helpful and I hope to not impede the great progress you're making.

Feature request: The grand theory is that Rt is a measure of how well things are being contained. The theory would suggest that the "No Shelter in Place" states ND, NE, IA, and AR would be among the worse off (as of this writing, 04/22/2020), although SD is doing pretty well! (just the nature of being spread out, I guess). I'd be curious to know: does the death trend match up with that theory?

Error Issue

I tried to run the code, but when I got to cell [27] it gave me an error.

Any idea/suggestion? Thanks in advance

[screenshot: code error]

Weekend reporting effect

There is a structural issue in the data because weekend cases are often reported on Monday. You can see the structure in the data. You might get more accuracy if you take it into account.
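A minimal sketch of one way to estimate and remove a day-of-week reporting effect, assuming a date-indexed pandas Series of daily new cases (the multiplicative normalization is an illustrative choice):

import pandas as pd

def deweekend(new_cases):
    # per-day-of-week mean, aligned back to each row
    dow_mean = new_cases.groupby(new_cases.index.dayofweek).transform('mean')
    # rescale so each weekday's average matches the overall average
    return new_cases * (new_cases.mean() / dow_mean)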

IndexError in var smoothed in get_posteriors

No data ends up in var smoothed with the example CSV attached.

Hotfix :) :

def prepare_cases(cases):
    new_cases = cases.diff()

    smoothed = new_cases.rolling(9,
        win_type='gaussian',
        min_periods=1,
        center=True).mean(std=3).round()
    
    smoothed_tmp = smoothed
    for i in range(10, 0, -1):
        idx_start = np.searchsorted(smoothed, i)
        smoothed_tmp = smoothed.iloc[idx_start:]
        original = new_cases.loc[smoothed_tmp.index]
        if len(smoothed_tmp) > 0:
            break 
        
    return original, smoothed_tmp

#########
IndexError                                Traceback (most recent call last)
~/Projekte/_work/covid-19/Realtime-R0_jhu.py in <module>
    501 # Note that we're fixing sigma to a value just for the example
    502
--> 503 posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)

~/Projekte/_work/covid-19/Realtime-R0_jhu.py in get_posteriors(sr, sigma)
    472     index=r_t_range,
    473     columns=sr.index,
--> 474     data={sr.index[0]: prior0}
    475 )
    476

~/opt/miniconda3/envs/corona/lib/python3.7/site-packages/pandas/core/indexes/extension.py in __getitem__(self, key)
    207
    208     def __getitem__(self, key):
--> 209         result = self._data[key]
    210         if isinstance(result, type(self._data)):
    211             return type(self)(result, name=self.name)

~/opt/miniconda3/envs/corona/lib/python3.7/site-packages/pandas/core/arrays/datetimelike.py in __getitem__(self, key)
    512         getitem = self._data.__getitem__
    513         if is_int:
--> 514             val = getitem(key)
    515             if lib.is_scalar(val):
    516                 # i.e. self.ndim == 1

IndexError: index 0 is out of bounds for axis 0 with size 0

test2.csv.zip

Smoothing might be done wrong

First off: this is an absolutely kick-ass and necessary project. Thank you so much for this, and for the transparency that you're putting behind this.

I have two big concerns with this initiative. The first — about color coding on the website — is addressed in a tweet here, and the obvious fix would be to adopt a color scheme on the website similar to the one used in the Matplotlib plot_rt() function, which is not nearly as binary.

The second concern is about smoothing (or rolling average, or time-series Gaussian filter; the terms are used interchangeably here), discussed below.

The problem with centering the rolling average

Try executing this code in your notebook, first with line 17 (center=True) present, and then commented out.

Slightly modified notebook cell (toggled for space)

state_name = 'NY'
# We will compute the smoothed lines, and the resulting R_ts, 6 times
# First with the data up to the latest day, then up to one day before 
# that day, and so on...
day_lags = [0,1,2,3,4,5]

# Using a different name to avoid interfere with prepare_cases()
def prepare_cases_new(cases, cutoff=25):
    # Unrelated note: new cases data are available in the COVID
    # Tracking Project in the column "positiveIncrease", so this line
    # could go, in theory.
    new_cases = cases.diff()

    smoothed = new_cases.rolling(7,
        win_type="gaussian",
        # The offending line is below.
        center=True,
        min_periods=1).mean(std=2).round()
    
    idx_start = np.searchsorted(smoothed, cutoff)
    smoothed = smoothed.iloc[idx_start:]
    original = new_cases.loc[smoothed.index]
    return original, smoothed


def prepare_cases_wrap(day_lag):
    lagidx = -day_lag if day_lag != 0 else None
    # We skip cases that are after our cutoff (or lag)
    cases = states.xs(state_name).rename(f"{state_name} cases")[:lagidx]
    # The new function is inserted here
    original, smoothed = prepare_cases_new(cases)
    return {"original": original, "smoothed": smoothed}

data = list(map(prepare_cases_wrap, day_lags))

original = data[0]["original"]
original[-21:].plot(title=f"{state_name} New Cases per Day",
                    c='k',
                    linestyle=':',
                    alpha=.5,
                    label='Actual',
                    legend=True,
                    figsize=(500/72, 300/72))

for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    smoothed[-21+day_lag:].plot(
                   label=f"Smoothed with {day_lag} day lag",
                   legend=True)

fig, ax = plt.subplots(figsize=(600/72,400/72))

# Display the Rt history with the different time cut-offs
for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
    hdis = highest_density_interval(posteriors, p=.9)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis], axis=1)
    plot_rt(result, ax, state_name)

ax.set_title(f'Real-time $R_t$ for {state_name}')
ax.set_xlim(original.index.get_level_values('date')[-1]+pd.Timedelta(days=-21),
            original.index.get_level_values('date')[-1]+pd.Timedelta(days=1))
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

Note: the graphs below only show the 3 most recent weeks, for clarity's sake. There is no change to the underlying computation otherwise.

That code will first produce the following graph:

[figure: cases_with_center]

Each colored line represents the smoothed cases made with data up to [5, 4, ..., 0] days before the latest data. The issue here is that the lines vary quite wildly.

For instance, here is what the model says about the amount of new cases on April 16:

  • On April 16 (3-day lag), the model says that there were 8930 new cases that day, an increase of 292 from April 15.
  • On April 19 (0-day lag), the model says that, in fact, there were more like 8187 cases that day, a decrease of 175 from April 15.
Code to find the precise figures (click to toggle)

for date in [16, 15, 14]:
    print(f"For April {date}:")
    for x in range(6):
        try:
            new_cases = data[x]["smoothed"][f"2020-4-{date}"]
            diff = new_cases - data[x]["smoothed"][f"2020-4-{date-1}"]
        except KeyError:
            continue
        else:
            print(f"with {x} day lag: {int(new_cases)} new cases. change: {int(diff)}")
    print(f"")

In other words, the model is re-writing history every day. That's not great for something that is meant to be tracking Rt incrementally. Especially since, in turn, the estimates for Rt vary wildly as well:

[figure: realtime_rt_with_center]

Note: which line corresponds to which time cutoff / lag should be pretty self-explanatory.

So, someone checking rt.live:

  • On April 14 (5-day lag): Everything is under control in NY state:
    • Most likely Rt is 0.55.
    • The 80% CI is 0.40 - 0.66, so anything above 1 is very far-fetched.
  • With the data for two additional days (3-day lag):
    • Most likely Rt for April 14 is now short of 1, at 0.97, something that the model deemed outside of the 80% CI just three days prior,
    • Now, the 80% CI (0.82 - 1.07) puts the previous value, 0.55, outside of it. Hell, it's even outside of the 95% CI, which I calculated: 0.79 - 1.12.

Same story for Rt for April 16:

  • the day of (3-day lag):
    • Most likely Rt at 1.23
    • 80% CI: 1.09 - 1.33 (for the record: 95% CI is 1.06-1.38)
  • On April 19:
    • Most likely Rt for April 16 is 0.86
    • 80% CI: 0.72 - 0.97 (for the record: 95% CI is 0.68-1.02)
  • Similarly, both numbers, taken 3 days apart, diverge in a way that makes their CIs look meaningless.
Code for the confidence intervals (click to toggle)

for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
    # Use .975 for 95% CI, .9 for 80%, .995 for 99%, etc.
    hdis = highest_density_interval(posteriors, p=.975)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis], axis=1)
    try:
        print(day_lag, result.loc["2020-4-14"]) # or "2020-4-16"
    except KeyError:
        pass

I don't really think you can run a model that updates every day like that. Think about the models from 538: when you look at them every day, you would expect the latest numbers to change because new polling has come in. What you would not expect is the previous numbers to have changed as well, unless the underlying model had been revised overnight.

I think that's the impression that many readers will have looking at the website day to day, and see data from previous days change all of a sudden. They will think that the model is massively tweaked all the time, when in fact it is working as intended.

The perks of not centering the rolling average

The solution is simple: remove the center=True line from the previous code. These are the results:

[figure: cases_no_center]

What changes in the COVID Tracking Project data every day at 4PM is not the data from previous days, merely the data from that specific day. The smoothing line should behave accordingly, and does so here.

[figure: realtime_rt_no_center]

As far as the Rt calculations are concerned, they are also much more stable. (I haven't done the math, but I suspect that the confidence intervals might have been incorrect before, and are correct now.)

Note: As the code listed at the top of the comment makes clear, I did not modify the sigma across runs. #23 references a change in sigma as something that might modify the curves, but it seems at first look that the impact is more minor than centering the rolling average.

Alternative

I'm not an epidemiologist, or a statistician by trade. Maybe centering the rolling average is the right thing to do, but the inevitable result is that Rts will change in a way that will be incomprehensible to readers that expect this to be an incremental dashboard.

This is all the more true given that you're using a Bayesian approach, so it's not really great to revisit priors' priors. But maybe I'm wrong! I didn't see any formal justification for the centering, though. If you do decide to go with the centering, however, then please freeze the history.

In other words, make sure that the Rt generated for a state on April 16 will be the same whether you're checking out rt.live on April 16, 17, 18, 19, and so on. That would make the charts even more unstable, and CIs would have to be revised, but this is better than seeing the history of Rts change drastically every day.

Appendix: more results

Other results with no centering

[figures: rt_no_center, rt_ranking_no_center, rt_high90_no_center]

For reference, same images generated with `center=True`

[figures: rt_with_center, rt_ranking_with_center, rt_high90_with_center]

Other real-time tracking of R_t

It's really great to see someone tackling this for the US, but there is another project tracking R_t in real time using Bayesian updating, in Europe: https://mrc-ide.github.io/covid19estimates/#/ Their model is more structured, but that has pros and cons. I really like that your model is less constrained with regard to how R_t changes over time, as behavior can change rapidly as people become more cautious, start wearing masks, etc. But you might get some other ideas about modeling from their approach.

Likelihood Notation

Excellent analysis!

The likelihood notation is a bit confusing. ℒ should indicate the "likelihood" of observing the data, given the model parameters, not the other way around. If I understand correctly, in this case, the observed data is k and the model parameter is R_t so the likelihood should be noted as ℒ (k | R_t). This is used correctly where the Poisson model is introduced, but not elsewhere.

Method for correction based on testing number

Hello there, great work. I was wondering if you could share your code or methodology for correcting for the number of tests performed in adjusting the number of cases. We are doing some similar work and are trying to compare methods used for this. Thanks! @bbolker

R_t cannot be inferred using the PyMC3 implementation of the generative model

Hi, thanks for making this work public, it is super informative and helpful.

I would like to point out that the PyMC3 example cannot recover R_t, as gamma is not an identifiable parameter. Changing the prior parameters for gamma also changes the estimates of R_t.

The reason for this is that, as currently defined, gamma has no influence on \theta and hence on the observation likelihood. In other words, the prior over gamma and the posterior will be identical. A fix would be to define the random walk directly on R_t, for example

R_t = R_{t-1} + \sigma n_t

and then map this value to theta_t,

\theta_t = \gamma * (R_t - 1)

This way the likelihood becomes an actual function of \gamma and R_t, and both gamma and R_t are identifiable.

Alternatively, if you are worried about negative values for R_t, one can constrain R_t to positive values with either a softplus or an exponential transform, e.g.

x_t = x_{t-1} + \sigma n_t
R_t = f(x_t)
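A minimal PyMC3 sketch of the suggested reparameterization, using the exponential transform (toy data; the gamma prior is an illustrative placeholder):

import numpy as np
import pymc3 as pm

onset = np.array([10., 12, 15, 20, 26, 30, 35])  # toy daily counts

with pm.Model():
    gamma = pm.Gamma('gamma', alpha=7, beta=50)  # prior centered near 1/7
    x = pm.GaussianRandomWalk('x', sigma=0.05, shape=len(onset) - 1)
    r_t = pm.Deterministic('r_t', pm.math.exp(x))  # exp keeps R_t positive
    # gamma now enters the likelihood, so it becomes identifiable
    theta = gamma * (r_t - 1)
    expected = onset[:-1] * pm.math.exp(theta)
    pm.Poisson('obs', mu=expected, observed=onset[1:])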

Color scheme.

Nice work!

Not sure if you are aware, but your color scheme was changed on a commit somewhere between looking at this yesterday, and right now.

Visualize policy changes by state

E.g. light vertical lines, color ranging from yellow (no gatherings>10) to red (retail closed) to green (businesses open). Seeing these might help people consider how these measures are working.
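A minimal matplotlib sketch of the idea (the dates, colors, and labels below are hypothetical placeholders, not real policy data):

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical policy milestones for one state
policy_dates = [
    ('2020-03-16', 'gold', 'no gatherings > 10'),
    ('2020-03-22', 'red', 'retail closed'),
]

fig, ax = plt.subplots()
# ... plot the state's Rt series on ax as usual ...
for date, color, label in policy_dates:
    ax.axvline(pd.Timestamp(date), color=color, alpha=.4, label=label)
ax.legend(loc='upper right')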

Issues When Adapted to Smaller Datasets

When attempting to adapt the code to run on localized data for two locations, the Gaussian filter cutoff rips out the data from the location with small numerical changes in cases, making Rt impossible to compute. I continue to get the following error regardless of the parameters tweaked in attempts to compensate for the smaller dataset:

/home/nbuser/anaconda3_501/lib/python3.6/site-packages/ipykernel/main.py:52: RuntimeWarning: divide by zero encountered in log

For prepare_cases, I had to set cutoff=1. Any larger value results in no data for the state 'FS' in the dataset. The larger state 'TC' does not experience this problem, and behaves normally with cutoff=2.

Any permutation of modifications to center, std, and min_periods parameters inside new_cases.rolling() does not rectify the situation.

Is there any way around this? Is the Gaussian filter not suitable for such a small dataset? What I don't understand is that I was able to run the code successfully against data that I had on Friday. Unfortunately, I am unable to attach a Jupyter Notebook with my code. Thanks for any help you can provide!

What is the impact if actual cases are several multiples of confirmed cases?

Outstanding job on this projection and tracking! This nearly brought tears to my eyes and I literally said "YES!" when I read it because so many people have been abusing this kind of model:

"Yet, today, we don't yet use $R_t$ in this way. In fact, the only real-time measure I've seen has been for Hong Kong. More importantly, it is not useful to understand $R_t$ at a national level. Instead, to manage this crisis effectively, we need a local (state, county and/or city) granularity of $R_t$."

I've been saying this for weeks! Great to see someone express it with good numbers.

I've been tracking rates of growth for both new cases and deaths. I realized last night, and am now absolutely certain, that the rate of growth in new cases has been entirely bounded by the number of tests we can conduct per day. National rates of positives have been locked in right at 20%. We're at about 700k confirmed cases today. If I want there to be 1M confirmed cases, I just need to conduct 1.5M more tests, and I guarantee that I'd have that number. Right now, at 150k tests/day capacity, it'll happen in 10 days (effective yesterday). If we doubled our capacity to 300k/day, we'd see it in five days.

What's happened is that, somewhere around March 19, the rate of infection overtook our capacity for daily tests, so there's no telling how many actual cases are now out there. Until that is corrected, our confirmed case numbers are absolutely, to use the technical term of art, bogus. The Stanford study seems to confirm this opinion.

Soo.... let's say (for example) actual incidence has grown since March 19 to 10x confirmed incidence. Does that make Rt higher or lower, and by how much?

Anyway - outstanding job and thanks for the excellent work.
