cipriancraciun / covid19-datasets

COVID-19 derived and augmented datasets (based on JHU, NY Times, ECDC) exported as JSON, TSV, SQL, SQLite DB (plus visualizations)

Home Page: https://scratchpad.volution.ro/ciprian/eedf5eb117ec363ca4f88492b48dbcd3/

Languages: Python 54.92%, Julia 15.25%, JSONiq 1.64%, jq 28.19%
Topics: covid-19, covid-2019, 2019-ncov, data-visualization, ecdc, jhu-dataset, ny-dataset, sql, sqlite


covid19-datasets's Issues

Just by accident (when reviewing my inspection utility) I found some tiny issues...

These issues are tiny bugs at most, but they may indicate a problem with your automated adaptation of the JHU files. I came across this when I saw that the "Aruba" data in your "combined" values do not seem to have been adapted optimally. The JHU originals contain double reporting on two days, with different confirmed values. Also, the data record from 11.03 seems to be missing from your (CIP) dataset.
See the inspection forms, side by side, for your CIP data and for the JHU data.
(screenshot: aruba_base) The upper-right form is the inspection tool for the JHU serial data, and the lower-left form is for your "combined" data. In the JHU form I have also documented the uncorrected original country/province/date entries, for easier reference and for detecting errors in the conversion process itself. The records are also arranged per daily file (see the "filenr" column), which shows that the same record has been repeated by JHU across several sequential files.

The first problem occurs with the data record of 11.03, which does not appear in your dataset at all. Maybe a bug in your import routine? See here:
(screenshot: aruba_11_03)

Next is the data of 18.03. In JHU, Aruba is now reported twice, and with different values (2 and 4) in the same daily file! You've combined those to get 6 cases, which might be sensible. The next file simply repeats that double reporting, but from 20.03 onward JHU merges it into a single record. The count now increases (either from 2 or from 4) to 5, and you correctly pick up the same number 5 ---- but that means that in your dataset the numbers decrease from 6 to 5, which produces errors in graphs with logarithmic scales!

(screenshot: aruba_18_03)
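
For what it's worth, such decreases in a cumulative series are easy to flag automatically. A small pandas sketch (not part of this repository; the column names are assumptions about the export layout):

    import pandas as pd

    def find_decreases(values: pd.DataFrame, metric: str = "absolute_confirmed") -> pd.DataFrame:
        # Sort each location's time series, then flag rows where the cumulative
        # count is lower than on the previous day (like the 6 -> 5 drop above).
        values = values.sort_values(["country", "province", "date"])
        drops = values.groupby(["country", "province"])[metric].diff() < 0
        return values[drops]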

For the moment these are only small reminders; I'm not in an intense verification/checking process. Also, this stems from your data of 25.03 and might not occur in more current datasets.

If I find something more like this I'll add observations here.

Populate per100k in US County-level data

Hi,

First, thank you for what you are doing. I was about to start something similar and then realized it's already been done! Fantastic!

For the US county-level data in NYT and JHU, the per-100k fields are NULL.

Over at https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv, this population data can be found at the county level. The FIPS code would be a suitable primary key for cross-referencing into it.
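
For illustration, a hedged pandas sketch of that cross-reference (the dataset-side column names are assumptions; the lookup table does carry FIPS and Population columns, though the two FIPS representations may need normalizing, e.g. zero-padding, before the merge):

    import pandas as pd

    LOOKUP_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
                  "csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv")

    def add_per100k(counties: pd.DataFrame) -> pd.DataFrame:
        # Join county rows to the JHU lookup table on the FIPS code.
        lookup = pd.read_csv(LOOKUP_URL)[["FIPS", "Population"]]
        merged = counties.merge(lookup, left_on="fips", right_on="FIPS", how="left")
        for metric in ("confirmed", "deaths"):
            merged[f"absolute_{metric}_per100k"] = (
                merged[f"absolute_{metric}"] / merged["Population"] * 100_000)
        return merged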

Thanks!

About updating data

Good morning. My question is about giving thanks, not about any kind of error. For a long time I scanned the internet for a reliable source and a database format to use in my web application, and now I have finally found you. I am very happy to have found this repository. I would like to know whether there is a possibility that someday it will no longer be updated, or that you will no longer want to do this work, because I am using your grouped data for my analysis and it would be very good to be able to count on this incredible work from you. You have my admiration. This work you did is fantastic, and here in Brazil it will help us a lot in the analysis of COVID-19. When I finish this work, how should I reference your repository? Ah, you can close this issue once you've read it. A big hug.

Script tool for correcting province/country notation errors/ambiguities

Due to the immense data inconsistencies in referencing [country] and/or [province], I've made a scripting tool that makes it easy to define corrections on the fields of the JHU daily files. If I see misspellings in the [country] or [province] fields, I can simply add the reference and a correction to the script and re-process the combined JHU file, with the updated script, into a TSV file (readable directly by, for instance, MS Access or Excel).
The "commands" in the script file are simple lists:

// Commands for redefinition/standardizing of Country/province-notations
//       Syntax-help at end of file 
replace f1    // delete entries for province/state (=f1)
              ("None","Bavaria" = "")
         .
replace f2     // standardize entries for Country/region  (=f2)
           ("Mainland China","*Hong Kong*","Macao SAR","Macau" = "China")
           ("Viet Nam" = "Vietnam")
           ("Czech Republic"="Czechia" )
           ("Republic of Ireland" = "Ireland")
           ("Republic of Korea","Korea, South" = "South Korea")
           ("Republic of Moldova" = "Moldova")
           ("Cabo Verde" = "Cape Verde")  // don't know whether this should be done
           ("*Gambia*" = "Gambia")            // standardizes "The Gambia" and "Gambia, The"
           ("Holy See" = "Vatican City")
           ("Iran*" = "Iran")                 // delete "islamic republic"
           ("Russia*" = "Russia")             // delete "federation"
           ("occupi*" = "Palestine")          // delete "occupied..."
           ("*Bahamas*" = "Bahamas")          // "The Bahamas", "Bahamas, The"
  .
 // working at two fields at once: AND condition for pattern-testing, moving from one to the other field
replace f1;f2    // move entries for province/state (=f1) into [country] (=f2), nullify [province]
              ("Denmark","France","Netherlands" = "";$)
             .
replace f1;f2      // standardize entries for province (=f1) and for country (=f2) depending on this
            ("Cruise Ship","Diamond Princess" = "Diamond Princess cruise ship";"(Others)")
            ("Grand Princess" = "Grand Princess Cruise Ship";"(Others)")
            .

The basic idea arose from my own needs, so it is incorporated into my translation tool, which converts JHU files into (readable) TSV files (with simpler quoting rules for string fields).
So far my rewriting script is based on observations of typos/errors/mislocations up to the JHU 03-25 .csv file. If someone is interested in using this, I'll make it available for everyone; it is a Windows/Delphi32 application, and I think of it as a free tool.
Now that the scripting tool has come this far, I have more ideas on how to evolve it, but I am interested in an exchange with possible users (and of course it still has to get past the experimental/alpha phase...).
The cool thing is that the script can be refined in a collaborative manner (and I can expand the script language and concept as needed).
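
For illustration only (this is not the Delphi tool itself), the same kind of rules could be sketched in Python; the field names are assumptions, and the mappings copy a few of the rules from the script above:

    # Rough Python equivalent of a few script rules, applied to one JHU daily
    # record given as a dict with "province" and "country" keys (assumed names).
    COUNTRY_FIXES = {
        "Mainland China": "China", "Macau": "China", "Macao SAR": "China",
        "Viet Nam": "Vietnam",
        "Czech Republic": "Czechia",
        "Republic of Ireland": "Ireland",
        "Korea, South": "South Korea", "Republic of Korea": "South Korea",
        "Holy See": "Vatican City",
    }
    PROVINCES_TO_DROP = {"None", "Bavaria"}
    PROVINCES_AS_COUNTRY = {"Denmark", "France", "Netherlands"}

    def normalize(record: dict) -> dict:
        record = dict(record)
        if record.get("province") in PROVINCES_TO_DROP:
            record["province"] = ""
        if record.get("province") in PROVINCES_AS_COUNTRY:
            # move the province name into the country field, as the script does
            record["country"] = record["province"]
            record["province"] = ""
        record["country"] = COUNTRY_FIXES.get(record["country"], record["country"])
        return record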

An inspection tool (in MS Access) helps to find/locate/correct inconsistencies, whose resolution can then be incorporated into the script. See, for instance, this snapshot of my desktop while inspecting the data check
(screenshot: Daatacheck)
for the Canada entries and the day-to-day changes in naming the provinces. The province names are already adapted by the script, but we can see that the use of "Alberta", "Calgary, Alberta" and "Edmonton, Alberta" is inconsistent (the same with "Ontario" and so on). To formulate a new script command to resolve this, it helps that I appended the original field contents, before correction, to each record. The data field "Filenr" refers to the daily JHU file, provides the sorting order and, together with the "Last Update" information, helps to identify duplicates.

At the end of this post I've attached the current state of the script. For better readability all comments may be removed (a comment runs from "//" to the end of the line and can be deleted).

I'm new to GitHub and don't know the best ways to communicate here. You can always use my email: helms (at) uni-kassel.de

Current script file (attached): recode_seqfile_script.txt

Please restore combined datasets (perhaps as releases)

Hi,

I understand the issue with the combined dataset. However, this was very helpful.

https://docs.github.com/en/github/managing-large-files/distributing-large-binaries discusses options for this. Over at https://github.com/jgoerzen/covid19db I generate a dataset that aggregates from yours and some others. I use a Github Action that automatically publishes it to Github as a release; see https://github.com/jgoerzen/covid19db/blob/master/.github/workflows/build.yml

It should be fairly easy to do that here.

Also, locations-diff.tsv is important for some analysis, and it would be nice to see it back as well.

Thanks,

John

Round computed values to 3 significant digits after the decimal point

At the moment computed values are exported with all the decimals JavaScript is capable of. However, all that "precision" is pointless given how "fuzzy" the data is, and it only visually "clutters" the exported files.

Therefore rounding all values to 3 significant digits after the decimal point would be better.
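
For illustration, the rounding could be applied at export time roughly like this (a plain Python sketch; the actual export pipeline may use different types or languages):

    def round_record(record: dict, digits: int = 3) -> dict:
        # round only the computed float fields; leave counts and strings untouched
        return {key: round(value, digits) if isinstance(value, float) else value
                for key, value in record.items()}

    print(round_record({"location": "RO", "absolute_confirmed": 100,
                        "delta_confirmed_per100k": 0.5142857142857142}))
    # -> {'location': 'RO', 'absolute_confirmed': 100, 'delta_confirmed_per100k': 0.514}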

Used by bogdanvso / diseases_risk_analysing and thanks

I'm using your derived data for one of my personal pet projects (www.covid-info.live). It provides world/country statistics, top 3 by major metrics, some basic analytics related to the medical-resources situation (e.g. the correlation between the number of hospital beds and the case fatality rate), and calculates the approximate risk of being infected and of dying in a given country relative to a person's age/comorbid diseases/medical environment.

I've added your repo link to the site's disclaimer and to my repo.
Thank you for your work!

Used by jojo4u/covid-19-graphs-jo and thanks!

I'm using your derived data in my little Python graph project. It shows cumulative and daily cases/deaths per capita, since in the beginning everybody reported only absolute numbers. If you're interested, you can add https://github.com/jojo4u/covid-19-graphs-jo to the "Used By" section of your readme (I figured a pull request would be a bit overkill for such a small change).

I want to give you a big "thank you" for your effort to consolidate and augment the CSSE data. At first I used https://github.com/datasets/covid-19/ but since the infamous data change at CSSE it has been missing US state data. Your jhu/daily data includes it and also has population figures :)

JHU dataset doesn't contain the last day's data

Hi,
The last date in the jhu dataset is 2020-04-23, while JHU already provides data for 2020-04-24 on its GitHub and website.
Please force a data update / fix the updating schedule.
Thank you

Integrate ECDC dataset as alternative to the JHU dataset

ECDC (the European counterpart of the CDC) has published a dataset (at the country level), which might be a good alternative to, or cross-check for, the JHU dataset:

The actual dataset link in CSV:

(There are also JSON and Excel versions, but given the current workflow, the CSV is the best alternative.)

Rewriting the history to remove all output files (due to excessive repository size that hit GitHub's limits)

[Also in the attention of the following users that have forked my repository at various points: @amirunpri2018, @Dithn, @elektrotiko, @hmpandey, @jgoerzen, @rafaelsabino, @sbw78, @stillnotjoy.]


Update: at the moment all the original, intermediary and derived files (and plots) are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/); see the readme in the project for details.


Unfortunately, early in April I hit GitHub's 100 GiB repository limit. This happened despite all my efforts to compress the files (with a git-friendly, i.e. "synchronizable", tool like gzip --rsyncable or zstd --rsyncable), and despite all my hope that the rest of the "text-only" files would compress nicely with git's own delta-based packing algorithm.

Thus, in order to fix the issue, and start re-generating the datasets, I had to take the following measures:

  • I've rewritten the history to remove all output files (binary or text), with the exception of status.json, which contains only the latest values;
  • I've also removed the plots, which changed quite dramatically on each regeneration (and thus didn't pack nicely);
  • (none of these files will be added to this repository in the future;)

However, in the next couple of days I'll republish the output files outside of GitHub, and I'll link them in the readme.

Thus this repository will contain only:

  • the sources and scripts to process and augment the data;
  • the input files as found in the JHU / NY Times / ECDC repositories; (I've opted to keep these in case the original sources are changed or disappear; the output files can always be re-generated, but these files can't be recreated once they disappear;)

Moreover, because there are a couple of forks of this repository that contain the old history, and because that still causes trouble for GitHub due to the excessive repository size, I would kindly ask those who have forked my repository either to remove their forks, or to reset their histories to the current master (which holds the cleaned history) and push it to their GitHub fork.

If anyone needs help with how to reset their forks, please comment on this issue, and I'll provide some snippets.
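
If it helps, resetting a fork's master to the cleaned history would look roughly like this (assuming this repository is added as the upstream remote and the fork is origin; adjust branch and remote names as needed):

    git remote add upstream https://github.com/cipriancraciun/covid19-datasets.git
    git fetch upstream
    git checkout master
    git reset --hard upstream/master
    git push --force origin master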

Thanks, and sorry for the trouble (both to GitHub and the fellow users that have forked my repository)!

Introduce (approximate) *cohort views* for death rates?

Dear Ciprian - I was thinking about the change of death rates (deaths/confirmed) in a couple of the countries I've inspected.

Taking your rich time-series dataset (applause for it), I tried various parameters for the mean individual test-to-death lifespan.

Thus, in an Excel file I computed the ratios from the accumulated data, [deaths]_{d+lag} / [confirmed]_d, where d is the day index and lag is the estimated lifespan from the event of being tested to the event of death. For Germany I got a stabilizing ratio of about 6% when the lag was about 12 days, and for Italy about 24% when the lag was about 5 days.

These are crude guesses from inspecting the curves while varying the lag parameter; the most horizontal/constant form of the curve was selected as the most appropriate.

(screenshot: Cohorte_Germany)

(screenshot: Cohorte_italy)

It is a bit laborious to reconfigure this in Excel for more countries and varying parameters, plus building visually parallel curves for neighbouring lag parameters by hand. So after this little peek into the data I've paused for now.

In short, the idea is to determine such a best lag parameter and, from it, estimate the country-specific death rate per cohort, where the cohorts are approximated via the lag, i.e. the (average) lifespan after the (country-specific, governmentally granted) positive corona test.
This feels like a very good idea in principle, although it lacks a true cohort determination. But it is possibly extendable with new ideas and data.

I tried to improve the meaningfulness of such estimates by using the daily-change versions of the data instead of the accumulated data, and also by using the infected values instead of the confirmed values for the lagged day indexes. While the resulting tendencies seem to suggest the same best lag parameter, the curves oscillate more, and are thus possibly more irritating to the reader. Again, maybe more insight could be gained by checking additional countries to see whether this indeed yields approximately consistent rules for a general approach...
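
For reference, the lagged ratio described above can be sketched in a few lines of pandas (the column names are assumptions; lag is in days):

    import pandas as pd

    def lagged_death_ratio(country: pd.DataFrame, lag: int) -> pd.Series:
        # ratio deaths[d + lag] / confirmed[d] over the cumulative series
        country = country.sort_values("date").reset_index(drop=True)
        deaths_shifted = country["absolute_deaths"].shift(-lag)
        return deaths_shifted / country["absolute_confirmed"]

    # inspect a range of lags and pick the one giving the most constant curve
    # ratios = {lag: lagged_death_ratio(germany, lag) for lag in range(0, 21)}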

Add `how to cite` and `used by` section to the readme

As the title says, add two sections to the readme:

  • "how to cite" -- giving short snippets on how to cite this repository and the original work;
  • "used by" -- giving a list of sites that use this derived dataset;

Observations on `v1-*.tsv` -- proposal to repair inconsistencies

Hi Ciprian -
I've downloaded your nice file and imported it into Excel and MS Access.
With the query tools in MS Access I've looked a bit at the consistency of the country/province issues that I had already seen in the original JHU datasets. Something like:

  • the first four JHU files report data under "Germany";"Bavaria", while later files omit the [province], although the records are clearly continuations of the first four. This has carried over into your dataset. "Bavaria" should simply be nulled so that all the "Germany" records fall under the same key;
  • "Aruba" is found in [country], but is also found under [country];[province] = "Netherlands";"Aruba". This should be made consistent;
  • and some more (UK; St Barthelemy with and without "France", and with a different character code for the first "e"; ...).

I have just been peeking into the dataset, with no rigorous protocol so far. If you don't think this matters for real use of the dataset I'm fine with that, but I can also try to contribute a more detailed protocol of (possible/suspected) issues.

Sqlite3 schema suggestions

Hi,

There are a few schema suggestions I would like to make.

  1. When the values in the absolute_* and delta_* columns are zero, put a zero there rather than a NULL. Both columns should be defined NOT NULL in the schema.

  2. Removing the FIPS codes from the county-related rows makes it more difficult to cross-reference with other sources. It would be great to have those still present, or to have a separate locations table that could be cross-referenced via the location_key.

  3. A more relational model could be useful; for instance, there wouldn't necessarily be a reason to duplicate all the factbook entries on every row when they could be correlated with a separate factbook table at select time.

On another topic, there doesn't appear to be a row for every day in the table, although this is present in the source material. It appears rows are omitted when all of the delta_* values would be zero. Although that reduces storage space, it makes it substantially more difficult to perform analyses using WHERE date = ... style clauses. For instance, summing cases over a set of counties would normally be possible with a single SELECT, but it can't be done here, since the rows are omitted.
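
To illustrate suggestions 2 and 3, here is a minimal sqlite3 sketch of such a relational layout (hypothetical table and column names, not the actual export schema):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE locations (
            location_key TEXT PRIMARY KEY,
            country      TEXT NOT NULL,
            province     TEXT,
            county       TEXT,
            fips         TEXT          -- kept for cross-referencing other sources
        );
        CREATE TABLE daily_values (
            location_key       TEXT NOT NULL REFERENCES locations (location_key),
            date               TEXT NOT NULL,
            absolute_confirmed INTEGER NOT NULL DEFAULT 0,  -- 0 instead of NULL
            delta_confirmed    INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (location_key, date)
        );
    """)

    # Summing cases over a set of locations for a given day only works reliably
    # when a row exists for every location/date combination:
    rows = db.execute("""
        SELECT l.country, SUM(d.absolute_confirmed)
        FROM daily_values AS d
        JOIN locations AS l USING (location_key)
        WHERE d.date = '2020-04-01'
        GROUP BY l.country
    """).fetchall()
    print(rows)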

Change country order and colors in visualization based on number of confirmed cases

At the moment the order of countries in the plots is "hard-coded". However when there are more than a few countries, the colors start to "seem" similar.

A solution would be to order the countries based on total confirmed cases, and assign colors based on that order. (The countries would then form a gradient from "red" for the hardest hit, through "blue" and "purple" for the less affected ones.)
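
A rough matplotlib sketch of that idea (assuming a data frame with one row per country and date and a cumulative confirmed column; none of this is the repository's actual plotting code):

    import matplotlib.pyplot as plt
    import pandas as pd

    def plot_ordered(values: pd.DataFrame) -> None:
        # Order countries by their latest total confirmed cases, descending.
        totals = values.groupby("country")["confirmed"].max().sort_values(ascending=False)
        # Reversed "coolwarm" runs from red (hardest hit) towards blue.
        colors = plt.cm.coolwarm_r(
            [i / max(len(totals) - 1, 1) for i in range(len(totals))])
        for (country, _), color in zip(totals.items(), colors):
            series = values[values["country"] == country]
            plt.plot(series["date"], series["confirmed"], label=country, color=color)
        plt.yscale("log")
        plt.legend()
        plt.show()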

nytimes data no longer updating

Hi,

Thanks for this resource! It looks, by the way, like the New York Times data is no longer updating. Last update was 4 days ago. Are you able to restore that data?

Thanks!

Add simple "current situation" file with only the latest values per each location.

From: CSSEGISandData/COVID-19#1250 (comment)

Unfortunately it looks way too complex for my needs and, like I said, I'm looking for something very simple, i.e.: countryName, totalCases, totalDeceased and totalRecovered.

From: CSSEGISandData/COVID-19#1281 (comment)

Thought I would submit a request here if I may.
Is there any way to create a simple JSON dataset along the lines of, for example
"country": "Australia"
"cases": 4860
"deaths": 20
"recovered": 244

Replace all `values.json` and `values.tsv` files with `gzip`-ed variants

At the moment many of the values.* dataset files are approaching the 100 MiB limit which is enforced by GitHub.

All these files will be committed only in compressed format from now on.


As a consequence the following actions have been taken with regard to the exports folder:

  • replaced values.json and values.tsv with values.json.gz and values.tsv.gz (i.e. gzip-ed variants);
  • removed values.tsv and status.json from the root of the JHU exports folder (these are in fact copies of the daily files of the same name);
  • removed all values.txt files (they were mainly used for internal debugging, and the values.tsv files contain exactly the same data);

For the moment the values.tsv files are still kept in the exports folder, but will soon be removed in favor of the values.tsv.gz.
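
Note that the gzip-ed exports can be read directly, without decompressing them to disk first; a minimal standard-library sketch (file name assumed):

    import csv
    import gzip

    with gzip.open("values.tsv.gz", mode="rt", newline="") as stream:
        for row in csv.DictReader(stream, delimiter="\t"):
            ...  # each row is a dict keyed by the TSV header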

Add United Kingdom?

Thanks for the very informative graphs. Is it possible to get the UK added to the graphs?

Thanks

Add support for SQL file ready for ClickHouse

In addition to #13 and #14, which add support for the most common SQL relational databases, ClickHouse is another useful SQL-like database that is especially well suited to quick queries and statistical processing.

Add `day_index_peak_*` and `peakpct_*` metrics

For each of the four metrics (confirmed, recovered, deaths and infected) compute the following two values:

  • day_index_peak_*, which records, for each row, how many days before or after that metric's peak the row falls;
  • peakpct_*, which divides the current value by the peak value;
  • the peak should be computed on the daily delta relative to the previous day, not on the cumulative value (otherwise the peak day would always be the last day). A rough sketch of the computation follows.
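
A hedged pandas sketch of the two metrics for a single location's time series (the column names are assumptions, not the repository's actual schema):

    import pandas as pd

    def add_peak_metrics(series: pd.DataFrame, metric: str = "confirmed") -> pd.DataFrame:
        series = series.sort_values("date").reset_index(drop=True)
        delta = series[metric].diff().fillna(series[metric])  # daily delta, not the cumulative value
        peak_index = delta.idxmax()  # the day with the largest daily increase
        peak_value = delta.max()
        series[f"day_index_peak_{metric}"] = series.index - peak_index
        series[f"peakpct_{metric}"] = delta / peak_value if peak_value else 0.0
        return series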
