cipriancraciun / covid19-datasets

COVID-19 derived and augmented datasets (based on JHU, NY Times, ECDC) exported as JSON, TSV, SQL, SQLite DB (plus visualizations)

Home Page: https://scratchpad.volution.ro/ciprian/eedf5eb117ec363ca4f88492b48dbcd3/

Languages: Python 54.92%, Julia 15.25%, JSONiq 1.64%, jq 28.19%
Topics: covid-19, covid-2019, 2019-ncov, data-visualization, ecdc, jhu-dataset, ny-dataset, sql, sqlite


covid19-datasets's Issues

Just by accident (when reviewing my inspection utility) I found some tiny issues...

These issues are tiny bugs at most, but they may indicate a problem with your automated adaptation of the JHU files. I came across this when I saw that the "Aruba" data in your "combined" values do not seem to have been adapted optimally. The JHU originals contain double reporting on two days, with different confirmed values. Also, the data record from 11.03 seems to be missing from your (CIP) dataset.
See the inspection forms, side by side, for your CIP data and for the JHU data.
(screenshot: aruba_base) The upper-right form is the inspection tool for the JHU serial data, and the lower-left form is for your "combined" data. In the JHU form I have also documented the uncorrected original country/province/date entries, for easier reference and for detecting errors in the conversion process itself. The records are also arranged per daily file (see the "filenr" column), which shows that the same record has been repeated by JHU across several sequential files.

The first problem occurs with the data record of 11.03, which does not appear in your dataset at all. Maybe a bug in your import routine? See here:
(screenshot: aruba_11_03)

Next is the data of 18.03. In JHU, Aruba is now reported twice, and with different values (2 and 4) in the same daily file! You've combined those to get 6 cases, which might be sensible. The next file simply repeats that double reporting, but from 20.03 onward JHU merges it into a single record. The count now increases (either from 2 or from 4) to 5, and you correctly pick up the same number 5 ---- but that means that in your dataset the numbers decrease from 6 to 5, which produces errors in graphs with logarithmic scales!

(screenshot: aruba_18_03)
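
For what it's worth, such decreases in a cumulative series are easy to flag automatically. A small pandas sketch (not part of this repository; the column names are assumptions about the export layout):

    import pandas as pd

    def find_decreases(values: pd.DataFrame, metric: str = "absolute_confirmed") -> pd.DataFrame:
        # Sort each location's time series, then flag rows where the cumulative
        # count is lower than on the previous day (like the 6 -> 5 drop above).
        values = values.sort_values(["country", "province", "date"])
        drops = values.groupby(["country", "province"])[metric].diff() < 0
        return values[drops]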

For the moment these are only small reminders; I'm not in an intense verification/checking process. Also, this stems from your data of 25.03 and might not occur in more current datasets.

If I find something more like this I'll add observations here.

Populate per100k in US County-level data

Hi,

First, thank you for what you are doing. I was about to start something similar and then realized it's already been done! Fantastic!

For the US county-level data in NYT and JHU, the per-100k fields are NULL.

Over at https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv, this population data can be found at the county level. The FIPS code would be a suitable primary key for cross-referencing into it.
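
For illustration, a hedged pandas sketch of that cross-reference (the dataset-side column names are assumptions; the lookup table does carry FIPS and Population columns, though the two FIPS representations may need normalizing, e.g. zero-padding, before the merge):

    import pandas as pd

    LOOKUP_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
                  "csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv")

    def add_per100k(counties: pd.DataFrame) -> pd.DataFrame:
        # Join county rows to the JHU lookup table on the FIPS code.
        lookup = pd.read_csv(LOOKUP_URL)[["FIPS", "Population"]]
        merged = counties.merge(lookup, left_on="fips", right_on="FIPS", how="left")
        for metric in ("confirmed", "deaths"):
            merged[f"absolute_{metric}_per100k"] = (
                merged[f"absolute_{metric}"] / merged["Population"] * 100_000)
        return merged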

Thanks!

About updating data

Good morning. My question is about giving thanks, not about any kind of error. For a long time I scanned the internet for a reliable source and a database format to use in my web application, and now I have finally found you. I am very happy to have found this repository. I would like to know whether there is a possibility that someday it will no longer be updated, or that you will no longer want to do this work, because I am using your grouped data for my analysis and it would be very good to be able to count on this incredible work from you. You have my admiration. This work you did is fantastic, and here in Brazil it will help us a lot in the analysis of COVID-19. When I finish this work, how should I reference your repository? Ah, you can close this issue once you've read it. A big hug.

Script tool for correcting province/country notation errors/ambiguities

Due to the immense data inconsistencies in referencing [country] and/or [province], I've made a scripting tool that makes it easy to define corrections on the fields of the JHU daily files. If I see misspellings in the [country] or [province] fields, I can simply add the reference and a correction to the script and re-process the combined JHU file, with the updated script, into a TSV file (readable directly by, for instance, MS Access or Excel).
The "commands" in the script file are simple lists:

// Commands for redefinition/standardizing of Country/province-notations
//       Syntax-help at end of file 
replace f1    // delete entries for province/state (=f1)
              ("None","Bavaria" = "")
         .
replace f2     // standardize entries for Country/region  (=f2)
           ("Mainland China","*Hong Kong*","Macao SAR","Macau" = "China")
           ("Viet Nam" = "Vietnam")
           ("Czech Republic"="Czechia" )
           ("Republic of Ireland" = "Ireland")
           ("Republic of Korea","Korea, South" = "South Korea")
           ("Republic of Moldova" = "Moldova")
           ("Cabo Verde" = "Cape Verde")  // don't know whether this should be done
           ("*Gambia*" = "Gambia")            // standardizes "The Gambia" and "Gambia, The"
           ("Holy See" = "Vatican City")
           ("Iran*" = "Iran")                 // delete "islamic republic"
           ("Russia*" = "Russia")             // delete "federation"
           ("occupi*" = "Palestine")          // delete "occupied..."
           ("*Bahamas*" = "Bahamas")          // "The Bahamas", "Bahamas, The"
  .
 // working at two fields at once: AND condition for pattern-testing, moving from one to the other field
replace f1;f2    // move entries for province/state (=f1) into [country] (=f2), nullify [province]
              ("Denmark","France","Netherlands" = "";$)
             .
replace f1;f2      // standardize entries for province (=f1) and for country (=f2) depending on this
            ("Cruise Ship","Diamond Princess" = "Diamond Princess cruise ship";"(Others)")
            ("Grand Princess" = "Grand Princess Cruise Ship";"(Others)")
            .

The basic idea arose from my own needs, so it is incorporated into my translation tool, which converts JHU files into (readable) TSV files (with simpler quoting rules for string fields).
So far my rewriting script is based on observations of typos/errors/mislocations up to the JHU 03-25 .csv file. If someone is interested in using this, I'll make it available for everyone; it is a Windows/Delphi32 application, and I think of it as a free tool.
Now that the scripting tool has come this far, I have more ideas on how to evolve it, but I am interested in an exchange with possible users (and of course it still has to get past the experimental/alpha phase...).
The cool thing is that the script can be refined in a collaborative manner (and I can expand the script language and concept as needed).
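
For illustration only (this is not the Delphi tool itself), the same kind of rules could be sketched in Python; the field names are assumptions, and the mappings copy a few of the rules from the script above:

    # Rough Python equivalent of a few script rules, applied to one JHU daily
    # record given as a dict with "province" and "country" keys (assumed names).
    COUNTRY_FIXES = {
        "Mainland China": "China", "Macau": "China", "Macao SAR": "China",
        "Viet Nam": "Vietnam",
        "Czech Republic": "Czechia",
        "Republic of Ireland": "Ireland",
        "Korea, South": "South Korea", "Republic of Korea": "South Korea",
        "Holy See": "Vatican City",
    }
    PROVINCES_TO_DROP = {"None", "Bavaria"}
    PROVINCES_AS_COUNTRY = {"Denmark", "France", "Netherlands"}

    def normalize(record: dict) -> dict:
        record = dict(record)
        if record.get("province") in PROVINCES_TO_DROP:
            record["province"] = ""
        if record.get("province") in PROVINCES_AS_COUNTRY:
            # move the province name into the country field, as the script does
            record["country"] = record["province"]
            record["province"] = ""
        record["country"] = COUNTRY_FIXES.get(record["country"], record["country"])
        return record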

An inspection tool (in MS Access) helps to find/locate/correct inconsistencies, whose resolution can then be incorporated into the script. See, for instance, this snapshot of my desktop while inspecting the data check
(screenshot: Daatacheck)
for the Canada entries and the day-to-day changes in naming the provinces. The province names are already adapted by the script, but we can see that the use of "Alberta", "Calgary, Alberta" and "Edmonton, Alberta" is inconsistent (the same with "Ontario" and so on). To formulate a new script command to resolve this, it helps that I appended the original field contents, before correction, to each record. The data field "Filenr" refers to the daily JHU file, provides the sorting order and, together with the "Last Update" information, helps to identify duplicates.

At the end of this post I've attached the current state of the script. For better readability all comments may be removed (a comment runs from "//" to the end of the line and can be deleted).

I'm new to GitHub and don't know the best ways to communicate here. You can always use my email: helms (at) uni-kassel.de

Current script file (attached): recode_seqfile_script.txt

Please restore combined datasets (perhaps as releases)

Hi,

I understand the issue with the combined dataset. However, this was very helpful.

https://docs.github.com/en/github/managing-large-files/distributing-large-binaries discusses options for this. Over at https://github.com/jgoerzen/covid19db I generate a dataset that aggregates from yours and some others. I use a Github Action that automatically publishes it to Github as a release; see https://github.com/jgoerzen/covid19db/blob/master/.github/workflows/build.yml

It should be fairly easy to do that here.

Also, locations-diff.tsv is important for some analysis, and it would be nice to see it back as well.

Thanks,

John

Round computed values to 3 significant digits after the decimal point

At the moment computed values are exported with all the decimals JavaScript is capable of. However, all that "precision" is pointless given how "fuzzy" the data is, and it only visually "clutters" the exported files.

Therefore rounding all values to 3 significant digits after the decimal point would be better.
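
For illustration, the rounding could be applied at export time roughly like this (a plain Python sketch; the actual export pipeline may use different types or languages):

    def round_record(record: dict, digits: int = 3) -> dict:
        # round only the computed float fields; leave counts and strings untouched
        return {key: round(value, digits) if isinstance(value, float) else value
                for key, value in record.items()}

    print(round_record({"location": "RO", "absolute_confirmed": 100,
                        "delta_confirmed_per100k": 0.5142857142857142}))
    # -> {'location': 'RO', 'absolute_confirmed': 100, 'delta_confirmed_per100k': 0.514}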

Used by bogdanvso / diseases_risk_analysing and thanks

I'm using your derived data for one of my personal pet projects (www.covid-info.live). It provides world/country statistics, top 3 by major metrics, some basic analytics related to the medical-resources situation (e.g. the correlation between the number of hospital beds and the case fatality rate), and calculates the approximate risk of being infected and of dying in a given country relative to a person's age/comorbid diseases/medical environment.

I've added your repo link to the site's disclaimer and to my repo.
Thank you for your work!

Used by jojo4u/covid-19-graphs-jo and thanks!

I'm using your derived data in my little Python graph project. It shows cumulative and daily cases/deaths per capita, since in the beginning everybody reported only absolute numbers. If you're interested, you can add https://github.com/jojo4u/covid-19-graphs-jo to the "Used By" section of your readme (I figured a pull request would be a bit overkill for such a small change).

I want to give you a big "thank you" for your effort to consolidate and augment the CSSE data. At first I used https://github.com/datasets/covid-19/ but since the infamous data change at CSSE it has been missing US state data. Your jhu/daily data includes it and also has population figures :)

JHU dataset doesn't contain the last day's data

Hi,
The last date in the jhu dataset is 2020-04-23, while JHU already provides data for 2020-04-24 on its GitHub and website.
Please force a data update / fix the updating schedule.
Thank you

Integrate ECDC dataset as alternative to the JHU dataset

ECDC (the European counterpart of the CDC) has published a dataset (at the country level), which might be a good alternative to, or cross-check for, the JHU dataset:

The actual dataset link in CSV:

(There are also JSON and Excel versions, but given the current workflow, the CSV is the best alternative.)

Rewriting the history to remove all output files (due to excessive repository size that hit GitHub's limits)

[Also in the attention of the following users that have forked my repository at various points: @amirunpri2018, @Dithn, @elektrotiko, @hmpandey, @jgoerzen, @rafaelsabino, @sbw78, @stillnotjoy.]


Update: at the moment all the original, intermediary and derived files (and plots) are available at (https://data.volution.ro/ciprian/f8ae5c63a7cccce956f5a634a79a293e/); see the readme in the project for details.


Unfortunately, early in April I hit GitHub's 100 GiB repository limit. This happened despite all my efforts to compress the files (with a git-friendly, i.e. "synchronizable", tool like gzip --rsyncable or zstd --rsyncable), and despite all my hope that the rest of the "text-only" files would compress nicely with git's own delta-based packing algorithm.

Thus, in order to fix the issue, and start re-generating the datasets, I had to take the following measures:

  • I've rewritten the history to remove all output files (binary or text), with the exception of status.json, which contains only the latest values;
  • I've also removed the plots, which changed quite dramatically on each regeneration (and thus didn't pack nicely);
  • (none of these files will be added to this repository in the future;)

However, in the next couple of days I'll republish the output files outside of GitHub, and I'll link them in the readme.

Thus this repository will contain only:

  • the sources and scripts to process and augment the data;
  • the input files as found in the JHU / NY Times / ECDC repositories; (I've opted to keep these in case the original sources are changed or disappear; the output files can always be re-generated, but these files can't be recreated once they disappear;)

Moreover, because there are a couple of forks of this repository that contain the old history, and because that still causes trouble for GitHub due to the excessive repository size, I would kindly ask those who have forked my repository either to remove their forks, or to reset their histories to the current master (which holds the cleaned history) and push it to their GitHub fork.

If anyone needs help with how to reset their forks, please comment on this issue, and I'll provide some snippets.
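
If it helps, resetting a fork's master to the cleaned history would look roughly like this (assuming this repository is added as the upstream remote and the fork is origin; adjust branch and remote names as needed):

    git remote add upstream https://github.com/cipriancraciun/covid19-datasets.git
    git fetch upstream
    git checkout master
    git reset --hard upstream/master
    git push --force origin master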

Thanks, and sorry for the trouble (both to GitHub and the fellow users that have forked my repository)!

Introduce (approximate) *cohort views* for death rates?

Dear Ciprian - I was thinking about the change of death rates (deaths/confirmed) in a couple of the countries I've inspected.

Taking your rich time-series dataset (applause for it), I tried various parameters for the mean individual test-to-death lifespan.

Thus, in an Excel file I computed the ratios from the accumulated data, [deaths]_{d+lag} / [confirmed]_d, where d is the day index and lag is the estimated lifespan from the event of being tested to the event of death. For Germany I got a stabilizing ratio of about 6% when the lag was about 12 days, and for Italy about 24% when the lag was about 5 days.

These are crude guesses from inspecting the curves while varying the lag parameter; the most horizontal/constant form of the curve was selected as the most appropriate.

(screenshot: Cohorte_Germany)

(screenshot: Cohorte_italy)

It is a bit laborious to reconfigure this in Excel for more countries and varying parameters, plus building visually parallel curves for neighbouring lag parameters by hand. So after this little peek into the data I've paused for now.

In short, the idea is to determine such a best lag parameter and, from it, estimate the country-specific death rate per cohort, where the cohorts are approximated via the lag, i.e. the (average) lifespan after the (country-specific, governmentally granted) positive corona test.
This feels like a very good idea in principle, although it lacks a true cohort determination. But it is possibly extendable with new ideas and data.

I tried to improve the meaningfulness of such estimates by using the daily-change versions of the data instead of the accumulated data, and also by using the infected values instead of the confirmed values for the lagged day indexes. While the resulting tendencies seem to suggest the same best lag parameter, the curves oscillate more, and are thus possibly more irritating to the reader. Again, maybe more insight could be gained by checking additional countries to see whether this indeed yields approximately consistent rules for a general approach...
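
For reference, the lagged ratio described above can be sketched in a few lines of pandas (the column names are assumptions; lag is in days):

    import pandas as pd

    def lagged_death_ratio(country: pd.DataFrame, lag: int) -> pd.Series:
        # ratio deaths[d + lag] / confirmed[d] over the cumulative series
        country = country.sort_values("date").reset_index(drop=True)
        deaths_shifted = country["absolute_deaths"].shift(-lag)
        return deaths_shifted / country["absolute_confirmed"]

    # inspect a range of lags and pick the one giving the most constant curve
    # ratios = {lag: lagged_death_ratio(germany, lag) for lag in range(0, 21)}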

Add `how to cite` and `used by` section to the readme

As the title says, add two sections to the readme:

  • "how to cite" -- giving short snippets on how to cite this repository and the original work;
  • "used by" -- giving a list of sites that use this derived dataset;

Observations on `v1-*.tsv` -- proposal to repair inconsistencies

Hi Ciprian -
I've downloaded your nice file and imported it into Excel and MS Access.
With the query tools in MS Access I've looked a bit at the consistency of the country/province issues that I had already seen in the original JHU datasets. Something like:

  • the first four JHU files report data under "Germany";"Bavaria", while later files omit the [province], although the records are clearly continuations of the first four. This has carried over into your dataset. "Bavaria" should simply be nulled so that all the "Germany" records fall under the same key;
  • "Aruba" is found in [country], but is also found under [country];[province] = "Netherlands";"Aruba". This should be made consistent;
  • and some more (UK; St Barthelemy with and without "France", and with a different character code for the first "e"; ...).

I have just been peeking into the dataset, with no rigorous protocol so far. If you don't think this matters for real use of the dataset I'm fine with that, but I can also try to contribute a more detailed protocol of (possible/suspected) issues.

Sqlite3 schema suggestions

Hi,

There are a few schema suggestions I would like to make.

  1. When the values in the absolute_* and delta_* columns are zero, put a zero there rather than a NULL. Both columns should be defined NOT NULL in the schema.

  2. Removing the FIPS codes from the county-related rows makes it more difficult to cross-reference with other sources. It would be great to have those still present, or to have a separate locations table that could be cross-referenced via the location_key.

  3. A more relational model could be useful; for instance, there wouldn't necessarily be a reason to duplicate all the factbook entries on every row when they could be correlated with a separate factbook table at select time.

On another topic, there doesn't appear to be a row for every day in the table, although this is present in the source material. It appears rows are omitted when all of the delta_* values would be zero. Although that reduces storage space, it makes it substantially more difficult to perform analyses using WHERE date = ... style clauses. For instance, summing cases over a set of counties would normally be possible with a single SELECT, but it can't be done here, since the rows are omitted.
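
To illustrate suggestions 2 and 3, here is a minimal sqlite3 sketch of such a relational layout (hypothetical table and column names, not the actual export schema):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE locations (
            location_key TEXT PRIMARY KEY,
            country      TEXT NOT NULL,
            province     TEXT,
            county       TEXT,
            fips         TEXT          -- kept for cross-referencing other sources
        );
        CREATE TABLE daily_values (
            location_key       TEXT NOT NULL REFERENCES locations (location_key),
            date               TEXT NOT NULL,
            absolute_confirmed INTEGER NOT NULL DEFAULT 0,  -- 0 instead of NULL
            delta_confirmed    INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (location_key, date)
        );
    """)

    # Summing cases over a set of locations for a given day only works reliably
    # when a row exists for every location/date combination:
    rows = db.execute("""
        SELECT l.country, SUM(d.absolute_confirmed)
        FROM daily_values AS d
        JOIN locations AS l USING (location_key)
        WHERE d.date = '2020-04-01'
        GROUP BY l.country
    """).fetchall()
    print(rows)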

Change country order and colors in visualization based on number of confirmed cases

At the moment the order of countries in the plots is "hard-coded". However when there are more than a few countries, the colors start to "seem" similar.

A solution would be to order the countries based on total confirmed cases, and assign colors based on that order. (The countries would then form a gradient from "red" for the hardest hit, through "blue" and "purple" for the less affected ones.)
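
A rough matplotlib sketch of that idea (assuming a data frame with one row per country and date and a cumulative confirmed column; none of this is the repository's actual plotting code):

    import matplotlib.pyplot as plt
    import pandas as pd

    def plot_ordered(values: pd.DataFrame) -> None:
        # Order countries by their latest total confirmed cases, descending.
        totals = values.groupby("country")["confirmed"].max().sort_values(ascending=False)
        # Reversed "coolwarm" runs from red (hardest hit) towards blue.
        colors = plt.cm.coolwarm_r(
            [i / max(len(totals) - 1, 1) for i in range(len(totals))])
        for (country, _), color in zip(totals.items(), colors):
            series = values[values["country"] == country]
            plt.plot(series["date"], series["confirmed"], label=country, color=color)
        plt.yscale("log")
        plt.legend()
        plt.show()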

nytimes data no longer updating

Hi,

Thanks for this resource! It looks, by the way, like the New York Times data is no longer updating. Last update was 4 days ago. Are you able to restore that data?

Thanks!

Add simple "current situation" file with only the latest values per each location.

From: CSSEGISandData/COVID-19#1250 (comment)

Unfortunately it looks way too complex for my needs and, like I said, I'm looking for something very simple, i.e.: countryName, totalCases, totalDeceased and totalRecovered.

From: CSSEGISandData/COVID-19#1281 (comment)

Thought I would submit a request here if I may.
Is there any way to create a simple JSON dataset along the lines of, for example
"country": "Australia"
"cases": 4860
"deaths": 20
"recovered": 244

Replace all `values.json` and `values.tsv` files with `gzip`-ed variants

At the moment many of the values.* dataset files are approaching the 100 MiB limit which is enforced by GitHub.

All these files will be committed only in compressed format from now on.


As a consequence the following actions have been taken with regard to the exports folder:

  • replaced values.json and values.tsv with values.json.gz and values.tsv.gz (i.e. gzip-ed variants);
  • removed values.tsv and status.json from the root of the JHU exports folder (these are in fact copies of the daily files of the same name);
  • removed all values.txt files (they were mainly used for internal debugging, and the values.tsv files contain exactly the same data);

For the moment the values.tsv files are still kept in the exports folder, but will soon be removed in favor of the values.tsv.gz.
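
Note that the gzip-ed exports can be read directly, without decompressing them to disk first; a minimal standard-library sketch (file name assumed):

    import csv
    import gzip

    with gzip.open("values.tsv.gz", mode="rt", newline="") as stream:
        for row in csv.DictReader(stream, delimiter="\t"):
            ...  # each row is a dict keyed by the TSV header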

Add United Kingdom?

Thanks for the very informative graphs. Is it possible to get the UK added to the graphs?

Thanks

Add support for SQL file ready for ClickHouse

In addition to #13 and #14, which add support for the most common SQL relational databases, ClickHouse is another useful SQL-like database that is especially well suited to quick queries and statistical processing.

Add `day_index_peak_*` and `peakpct_*` metrics

For each of the four metrics (confirmed, recovered, deaths and infected) compute the following two values:

  • day_index_peak_*, which records, for each row, how many days before or after that metric's peak the row falls;
  • peakpct_*, which divides the current value by the peak value;
  • the peak should be computed on the daily delta relative to the previous day, not on the cumulative value (otherwise the peak day would always be the last day). A rough sketch of the computation follows.
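
A hedged pandas sketch of the two metrics for a single location's time series (the column names are assumptions, not the repository's actual schema):

    import pandas as pd

    def add_peak_metrics(series: pd.DataFrame, metric: str = "confirmed") -> pd.DataFrame:
        series = series.sort_values("date").reset_index(drop=True)
        delta = series[metric].diff().fillna(series[metric])  # daily delta, not the cumulative value
        peak_index = delta.idxmax()  # the day with the largest daily increase
        peak_value = delta.max()
        series[f"day_index_peak_{metric}"] = series.index - peak_index
        series[f"peakpct_{metric}"] = delta / peak_value if peak_value else 0.0
        return series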
