reichlab / covid19-forecast-hub

Projections of COVID-19, in standardized format

Home Page: https://covid19forecasthub.org

License: Other

Languages: R 2.40%, Shell 0.21%, JavaScript 3.08%, Python 1.97%, HTML 9.29%, Vue 0.66%, CSS 0.74%, TypeScript 0.41%, Dockerfile 0.01%, Jupyter Notebook 81.10%, SCSS 0.12%, Makefile 0.02%

Topics: covid19, forecasts, covid-19, forecast-data, covid-data, github-pages, visualization, analytics

covid19-forecast-hub's Issues

Separate forecasts from truth

I suggest we reorganize the data so that forecasts are separate from truth, e.g.

data-raw/forecasts
data-raw/truth
data-processed/forecasts
data-processed/truth

The subdirectory structure within the forecasts/ subdirectories would be the same as it is now.
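
A minimal migration sketch, assuming the layout above (in a git repository, git mv would preserve history better than a plain rename):

    # Hypothetical one-off migration: nest existing team subdirectories under
    # data-processed/forecasts/, leaving room for a sibling truth/ directory.
    # The same pattern would apply to data-raw/.
    from pathlib import Path

    root = Path("data-processed")
    forecasts = root / "forecasts"
    forecasts.mkdir(exist_ok=True)

    for team_dir in sorted(root.iterdir()):
        # Skip the new target directories themselves.
        if team_dir.is_dir() and team_dir.name not in ("forecasts", "truth"):
            team_dir.rename(forecasts / team_dir.name)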

Also, perhaps we should include the NYTimes "gold-standard" data in addition to the JHU data.

add additional validations

Some additional validations to consider (a sketch of the first two checks follows this list):

  • ensure that we are checking for all required column names as required by the repo (right now we are requiring forecast_date and target_end_date, which are not part of Zoltar). Can we require these?
  • are we validating the FIPS locations against the specific set of accepted codes, or just any string of a number between 01 and 95? I would prefer the former, so that we validate specifically against accepted FIPS codes.
  • can we institute a more complex check to ensure that people are aligning forecast_date and target_end_date correctly? I will explain more below.
  • require point estimates (exactly one point estimate per location/target tuple). We know from Katie's code that the forecast_date column is the same for the entire file (based on the filename).
  • update https://github.com/reichlab/covid19-forecast-hub/wiki/Validation-Checks
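
A minimal sketch of the column and FIPS checks with pandas; the required column set and the FIPS list here are assumptions for illustration, since the accepted values would live in the repo's own configuration:

    # Sketch only: REQUIRED_COLUMNS and VALID_FIPS are assumptions,
    # not the repo's actual validation config.
    import pandas as pd

    REQUIRED_COLUMNS = {
        "forecast_date", "target", "target_end_date",
        "location", "type", "quantile", "value",
    }
    VALID_FIPS = {"US", "01", "02", "04", "05", "06"}  # truncated for illustration

    def validate_forecast_file(path):
        df = pd.read_csv(path, dtype={"location": str})
        errors = []
        missing = REQUIRED_COLUMNS - set(df.columns)
        if missing:
            errors.append(f"missing required columns: {sorted(missing)}")
        if "location" in df.columns:
            bad = set(df["location"]) - VALID_FIPS
            if bad:
                errors.append(f"locations not in the accepted FIPS set: {sorted(bad)}")
        if {"type", "location", "target"} <= set(df.columns):
            # Flag duplicated point estimates; pairs with zero point rows would
            # need a separate cross-check against all location/target pairs.
            counts = df[df["type"] == "point"].groupby(["location", "target"]).size()
            if (counts > 1).any():
                errors.append("duplicate point estimates for some location/target pairs")
        return errors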

Move processing scripts to data-raw/ folders

Currently most of the code is in the code/ directory, which was recently organized into subdirectories. As a general principle, I suggest we move code closer to the data it is used on. For example, I suggest we move raw data processing scripts to the data-raw/ folder.

The code/ directory could still be used for functions (rather than scripts) that are used in multiple scripts.

Standardize processed data filenames

In addition to the missing fields in #66, the newest MOBS processed data file has a filename that is non-standard (for me) and is causing issues with reading processed data for the Shiny data-processing app.

Although I could update the data-reading script, I think the real issue is that we don't have a standardized filename convention for processed data. I was assuming "-" was a reserved character, such that the files are named

YYYY-MM-DD-team-model.csv

Can we set this as a standard?
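
If so, the check is easy to write; a sketch assuming "-" is reserved, so team and model names cannot contain it:

    import re

    # Proposed standard: YYYY-MM-DD-team-model.csv, with "-" reserved.
    FILENAME_RE = re.compile(r"^\d{4}-\d{2}-\d{2}-(?P<team>[^-]+)-(?P<model>[^-]+)\.csv$")

    def parse_processed_filename(name):
        m = FILENAME_RE.match(name)
        if m is None:
            raise ValueError(f"non-standard processed-data filename: {name}")
        return name[:10], m.group("team"), m.group("model")

    # parse_processed_filename("2020-04-13-CU-60contact.csv")
    #   -> ("2020-04-13", "CU", "60contact")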

Add forecast_date, target_end_date to 2020-04-13 CU data-processed/ files

The following files are missing the required fields forecast_date and target_end_date (a sketch of a one-off fix follows the list):

data-processed/CU-60contact/2020-04-13-CU-60contact.csv
data-processed/CU-70contact/2020-04-13-CU-70contact.csv
data-processed/CU-80contact/2020-04-13-CU-80contact.csv
data-processed/CU-nointerv/2020-04-13-CU-nointerv.csv
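
A hypothetical pandas fix, deriving forecast_date from the filename and target_end_date for week-ahead targets; 2020-04-13 is a Monday, so "1 wk ahead" ends on the coming Saturday, and day-ahead targets would need an analogous rule:

    from datetime import date, timedelta
    import pandas as pd

    def fix_file(path):
        # Filename convention: YYYY-MM-DD-team-model.csv
        forecast_date = date.fromisoformat(path.split("/")[-1][:10])
        df = pd.read_csv(path)
        df["forecast_date"] = forecast_date.isoformat()
        # "N wk ahead ..." targets end on the Saturday closing epiweek N;
        # for a Monday forecast date that is 5 + 7*(N-1) days later.
        weeks = df["target"].str.extract(r"^(\d+) wk ahead")[0].astype(float)
        df["target_end_date"] = [
            (forecast_date + timedelta(days=5 + 7 * (int(w) - 1))).isoformat()
            if pd.notna(w) else None
            for w in weeks
        ]
        df.to_csv(path, index=False)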

update target list

What are the next-phase targets that we want to include? We should likely phase these in slowly, to reduce the strain of creating checks, visualizations, and ensembles for new targets. Candidates are:

  • incident hospitalization demand by week/day?
  • ICU bed demand by week/day?
    ...

migrate to a clearer structure for what forecasts are made when

There are two competing priorities here:
(1) record all (or nearly all - do we really want to store every update, even if daily?) forecasts made by teams, as they make them [useful for "tracker"-like sites that want all versions and real-time updates]
(2) record forecasts made by teams that are available at a specific time, and use them to build an ensemble. Realistically, for the foreseeable future we might just want to update the ensemble once a week. [useful for standardizing our ensemble]

Here is one proposal for how to do this (a sketch of the week-assignment rule follows the list):

  • we have the data-processed directory contain all (or nearly all) forecasts from each team. no restrictions on when these forecasts are submitted.
  • each file is marked with the date the forecast was made. This would relax our current restriction that these YYYY-MM-DD dates refer only to Mondays. I'm going to refer to this date in the filename as fcast_date below.
  • we set really clear guidelines for when "1 wk ahead" means epiweek(fcast_date) and when it means epiweek(fcast_date)+1. for example, we say that if weekday(fcast_date) is Thursday, Friday or Saturday, then "1 wk ahead" means epiweek(fcast_date)+1 and otherwise epiweek(fcast_date). (I don't feel that strongly about where the threshold is for switching over. Could be Tuesday, could be Thursday.)
  • to reinforce this and avoid inadvertent errors in assignment of targets to days/weeks, we could also accept a new column name in the files that would be end_date, so files submitted with fcast_date of 2020-04-23 (thursday of EW 17) or 2020-04-27 (Monday of EW 18) would both have a "1 wk ahead" forecast with end_date of 2020-05-02 (Saturday of EW 18).
  • on Mondays at a fixed time (6pm ET?) we run an ensemble script that finds all available forecasts from a team made since the preceding Thursday (i.e. 4 days prior) and takes the most recent forecast to include in the ensemble.
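
A sketch of the proposed week-assignment rule with the Thursday threshold, assuming MMWR epiweeks that run Sunday through Saturday:

    from datetime import date, timedelta

    def one_wk_ahead_end_date(fcast_date: date) -> date:
        """Saturday that closes the '1 wk ahead' target under the proposal."""
        # Saturday closing the current epiweek (Monday=0 ... Sunday=6).
        end = fcast_date + timedelta(days=(5 - fcast_date.weekday()) % 7)
        # Thu (3), Fri (4), Sat (5): "1 wk ahead" rolls to the next epiweek.
        if fcast_date.weekday() in (3, 4, 5):
            end += timedelta(days=7)
        return end

    # Matches the example above:
    # one_wk_ahead_end_date(date(2020, 4, 23))  # Thu of EW17 -> 2020-05-02
    # one_wk_ahead_end_date(date(2020, 4, 27))  # Mon of EW18 -> 2020-05-02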

Remove Imperial ensemble forecast files from data-raw/ folder

All team forecasts should be in a subdirectory of data-raw/, but these files

https://github.com/reichlab/covid19-forecast-hub/blob/master/data-raw/2020-04-19-Imperial-ensemble1.csv
https://github.com/reichlab/covid19-forecast-hub/blob/master/data-raw/2020-04-19-Imperial-ensemble2.csv

are directly in the data-raw/ folder.

I would create a pull request, but these files differ from the files in the data-raw/Imperial subdirectory, so I'm not sure which versions should be preserved.
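
A quick comparison sketch that could help decide; the subdirectory filenames here are assumed to mirror the top-level ones:

    import pandas as pd

    for name in ("2020-04-19-Imperial-ensemble1.csv",
                 "2020-04-19-Imperial-ensemble2.csv"):
        top = pd.read_csv(f"data-raw/{name}")
        sub = pd.read_csv(f"data-raw/Imperial/{name}")
        if top.equals(sub):
            print(name, "identical")
        else:
            print(name, "differs")
            if list(top.columns) == list(sub.columns):
                # Rows present in only one of the two versions.
                print(pd.concat([top, sub]).drop_duplicates(keep=False).head())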

Add list of teams to ReadMe?

Add a subsection enumerating the teams / sources of forecasts we are planning to include, plus links to their repositories or websites.

Write plausibility checks

Write a script that does some plausibility checks for cleaned data (see the sketch after this list), e.g.:

  • no quantile crossing
  • quantiles for cumulative deaths greater than or equal to those for incident deaths
  • quantiles for cumulative deaths non-decreasing over time
  • cumulative week-ahead and corresponding day-ahead forecasts coincide

Maybe related to #13?
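
A sketch of the first and third checks, assuming the long quantile format (columns target, location, type, quantile, value) and simplifying the over-time check to the median:

    import pandas as pd

    def check_no_quantile_crossing(df):
        """Within each target/location, values must not decrease as the quantile level rises."""
        q = df[df["type"] == "quantile"]
        bad = []
        for (target, location), g in q.groupby(["target", "location"]):
            if not g.sort_values("quantile")["value"].is_monotonic_increasing:
                bad.append((target, location))
        return bad

    def check_cumulative_non_decreasing(df):
        """Cumulative-death forecasts should not decrease with horizon (simplified to the median)."""
        cum = df[df["target"].str.contains("cum death") & (df["quantile"] == 0.5)]
        horizon = cum["target"].str.extract(r"^(\d+)")[0].astype(int)
        ordered = cum.assign(h=horizon).sort_values("h")
        return ordered.groupby("location")["value"].apply(
            lambda v: bool(v.is_monotonic_increasing)
        )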
