covid19-eu-zh / covid19-eu-data


80.0 7.0 20.0 13.34 GB

Automated Data Collection: COVID-19/SARS-COV-2 Cases in EU by Country, State/Province/Local Authorities, and Date

Python 0.08% HTML 99.92% Shell 0.01%

covid-19 sars-cov-2 coronavirus covid19-data dataset

covid19-eu-data's People

glacier-ice chiangkim diegosiqueir4 gena satellitetjc nullnotfound fagan2888 jcrow06 gutengzczy leviticusmb johnharveymath datumorphism dataherb wassilyhan maxkotz17 kausalflow butzk juliusjulius simmieyungie

covid19-eu-data's Issues

Netherlands Cities Encoding Issue

Some data should be merged for:

Noardeast-Fryslân
Noardeast-FryslÃ¢n

and:

SÃºdwest-FryslÃ¢n
Súdwest Fryslân
Súdwest-Fryslân

Cheers,
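
The garbled variants above look like UTF-8 text that was decoded as Latin-1 somewhere along the way. A minimal sketch of how such names could be repaired before merging, assuming that is indeed the cause; `repair_mojibake` is a hypothetical helper, not something that exists in this repo:

```python
def repair_mojibake(name: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged if it does not apply."""
    try:
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Correctly encoded names such as "Fryslân" are not valid UTF-8 once
        # re-encoded as Latin-1, so they land here and are kept as-is.
        return name


names = [
    "Noardeast-Fryslân",
    "Noardeast-FryslÃ¢n",
    "SÃºdwest-FryslÃ¢n",
    "Súdwest-Fryslân",
]
# After repair, the duplicates collapse into one spelling per municipality.
merged = sorted({repair_mojibake(n) for n in names})
```

The variant written with a space instead of a hyphen is a different kind of inconsistency and would still need a separate mapping.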

Metadata Update

Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 data frame https://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/YWVN3B
Would you be willing to help with the maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for the medical science community.

Fix date format to ISO date

Hi, and thanks for this great effort. There is one very confusing issue with your data: it uses a date format that looks almost exactly like ISO dates, but with the months and days swapped.

This is an extremely confusing date format if on purpose, or simply a bug if not.

The right ISO 8601 format can be found on Wikipedia: https://en.wikipedia.org/wiki/ISO_8601

It looks like this: 2020-04-13T00:55
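
A minimal sketch of how existing values could be converted, assuming the stored strings have the shape `YYYY-DD-MMTHH:MM` (the exact format in the dataset is an assumption here, and `swap_day_month` is a hypothetical helper):

```python
from datetime import datetime


def swap_day_month(stamp: str) -> str:
    """Parse a 'YYYY-DD-MMTHH:MM' string and return it as ISO 8601 'YYYY-MM-DDTHH:MM'."""
    parsed = datetime.strptime(stamp, "%Y-%d-%mT%H:%M")
    return parsed.isoformat(timespec="minutes")


# The swapped form of the example date above.
fixed = swap_day_month("2020-13-04T00:55")
```

Note that a value whose day is 12 or lower parses fine under either format, so the conversion can only be applied safely to files known to use the swapped order.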

DE RKI data by district -- help on how to map between districts and nuts_3

I've implemented in R a function to download the data (see download.state() and download.raw())

However, I cannot find a reasonable way to map between districts (IdLandkreise) and nuts_3

If anyone knows how to do it I'll convert the code to python to be included here.

SE data has 2 columns of 'cases/100k pop.'

cases/100k pop. of DE is wrong

Decimal problem

Regulate Administrative Divisions

Why

Ref: #27

We have been using a sum row to represent the whole country. It is a bizarre name amongst the state/region/province names.

Solutions

Use proper geo levels: NUTS or LAU

Instead of using different names for geo locations in different countries, we could use NUTS levels. Since we are dealing with Europe, this

country	NUTS1	NUTS2	NUTS3
...	...	...	...

Or we use a more general name

country	administrative_level_1	administrative_level_2	administrative_level_3
...	...	...	...

The key to this change is the missing data. We have to deal with missing data properly.

In IT, there are cases unassigned to regions. We are using missing values in the province column to represent them.

If the numbers are unassigned, we indicate this with a keyword such as MISSING.
For statistics of higher-level admin divisions, we use missing values in the lower-level columns.

Here are some examples.

country	NUTS1	NUTS2	NUTS3	cases
DE	NRW			123

indicates that the number of cases for NRW is 123.

country	NUTS1	NUTS2	NUTS3	cases
DE	MISSING			123

indicates that the number of cases not assigned to any state is 123.

country	NUTS1	NUTS2	NUTS3	cases
DE				123

indicates that the number of cases for DE as a whole is 123.
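
The three cases above could be read back like this. It is only a sketch of the proposed convention, assuming tab-separated files with exactly these column names and the keyword `MISSING`; `describe` is a hypothetical helper:

```python
import csv
import io

# The three example rows from above, tab-separated.
SAMPLE = (
    "country\tNUTS1\tNUTS2\tNUTS3\tcases\n"
    "DE\tNRW\t\t\t123\n"
    "DE\tMISSING\t\t\t123\n"
    "DE\t\t\t\t123\n"
)


def describe(row):
    """Say what the 'cases' value of a row refers to under the proposed convention."""
    filled = [row[level] for level in ("NUTS1", "NUTS2", "NUTS3") if row[level]]
    if not filled:
        # All NUTS columns empty: the row is the country total.
        return f"total for {row['country']}"
    if filled[-1] == "MISSING":
        # Cases not assigned to any division below the parent.
        parent = filled[-2] if len(filled) > 1 else row["country"]
        return f"unassigned within {parent}"
    return f"total for {filled[-1]}"


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
labels = [describe(row) for row in rows]
```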

References

In the EU, we have NUTS and LAU
1. NUTS: three levels
  1. NUTS1
  2. NUTS2
  3. NUTS3
Administrative Divisions of Countries ("Statoids")
1. Three levels:
  1. Primary
  2. Secondary
  3. Tertiary
Wikipedia has a page showing the divisions: https://en.wikipedia.org/wiki/List_of_administrative_divisions_by_country
1. Three levels
  1. First-level
  2. Second-level
  3. Third-level

Cron job based pipeline management is failing

We chose to use time slots to manage pipelines because it was the simplest working model in the beginning. But this is failing as the number of workflows increases.

Why is it failing

The workflows are almost independent of each other. The data scraping steps of the workflows are not interfering with each other but the data integration steps are.

As new data is downloaded, each workflow has to commit and push the data back to the repo. When several workflows do this at the same time, their git pushes clash.

How serious is it

Not so serious at the moment, as we have already arranged the workflows into different time slots. The clashes happen mostly when a new commit is pushed, which triggers all the workflows at once.

Solution

Remove the push on master trigger

Combined Workflows

Instead of having each country as a workflow, we combine them into one or a few workflows.

No badges for each country.
Fail as a cluster.

Better Time slots

Rearrange all the workflows.

Use pipeline management system

We use a pipeline management system to manage the pipelines. luigi?

No badges for each country.

Storage size reached Github Hard Limit

We have to remove some of the data and reinitialize.

Wales' cases is the string 'cases' in some files

Hello everyone,

I found that some files for daily Wales data have a row where the column cases has the value "cases":
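
A minimal sketch of how such rows could be filtered out when reading a file. It assumes a comma-separated file with a `cases` column and guesses that the stray row is a repeated header line; the sample rows and the other column names are made up for illustration:

```python
import csv
import io


def drop_non_numeric_cases(text):
    """Keep only the rows whose 'cases' field is a plain non-negative integer."""
    rows = csv.DictReader(io.StringIO(text))
    return [row for row in rows if row["cases"].strip().isdigit()]


# Made-up sample in which the header line is repeated as a data row.
sample = (
    "country,authority,cases\n"
    "Wales,Cardiff,10\n"
    "country,authority,cases\n"
    "Wales,Swansea,7\n"
)
clean = drop_non_numeric_cases(sample)
```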

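Data error in dataset/covid-19-fr.csv

Rows 9 and 31 of the CSV file contain incorrect data: the corresponding authority is shown as sum, but these seem to be the figures for some region instead?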

For Netherlands: country level death and hospitalization numbers?

The city numbers are incomplete: they don't add up to the national total, and no deaths or hospitalizations are reported at the city level. Given the low test rate, deaths and hospitalizations are also likely to be more meaningful than the number of confirmed cases. Perhaps it would be a good idea to add the national numbers, which can be found at https://www.rivm.nl/nieuws/actuele-informatie-over-coronavirus.

Netherlands cities should merge

Hello again,
I believe these Netherlands cities should be merged, though I don't know which spelling is correct:

's-Gravenhage

and

s-Gravenhage

Cheers

RKI

Hey, they probably changed the RKI page again, pandas seems to fail to read the page properly..:

Traceback (most recent call last):
  File "covid19-eu-data/scripts/download_de.py", line 84, in <module>
    cov_de.workflow()
  File "covid19-eu-data\scripts\utils.py", line 100, in workflow
    self.extract_table()
  File "covid19-eu-data/scripts/download_de.py", line 42, in extract_table
    self.df = req_dfs[0][["Bundesland", "Fälle", "Todesfälle"]]
  File "lib\site-packages\pandas\core\frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "lib\site-packages\pandas\core\indexing.py", line 1551, in _get_listlike_indexer
    self._validate_read_indexer(
  File "lib\site-packages\pandas\core\indexing.py", line 1645, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['Todesfälle', 'Fälle'] not in index"

Process finished with exit code 1

covid-19-fr.csv sum is wrong

The numbers in lines 80 and 81 seem to be wrong.

IT

I've added a simple implementation to pull IT data. I made too many changes in my fork to open a PR, but you can integrate it from here: https://github.com/gena/covid19-eu-data/blob/master/scripts/download_it.py. It simply pulls the data from another repo, but it's handy to have all the data in one place.

NL data are bad after March 31

File format change?

Hungary, Slovenia, North Macedonia covid-19 data

Hungary: https://koronavirus.gov.hu/
Slovenia: https://www.gov.si/en/topics/coronavirus-disease-covid-19/
North Macedonia: https://gdi-sk.maps.arcgis.com/apps/opsdashboard/index.html?fbclid=IwAR0Dd9MY7njiNtDkPpPt8R2SeD4pW_6TO12axwKrT4CcegckY4P4Ezt43f4#/2096bd4b051b42948ac3f5747e80c3a5

AT bug with "healed" persons

Hey,

the healed persons in AT are counted/added as "cases", which creates wrong data.

This dataset is Tirol:

[2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 2, 32, 2, 37, 2, 57, 2, 81, 2, 109]

(You may see the issue... 27, 2, 32, 2, 37, 2 - the 2 is probably the healed person count)
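
As a stopgap, the stray values could be filtered out of the series. This is only a heuristic sketch, assuming the case counts are cumulative and therefore never decrease; the real fix belongs in the scraper, and `drop_decreases` is a hypothetical helper:

```python
def drop_decreases(series):
    """Drop every value that is lower than the last value kept so far."""
    cleaned = []
    for value in series:
        if cleaned and value < cleaned[-1]:
            # A cumulative count cannot go down, so treat this entry as noise.
            continue
        cleaned.append(value)
    return cleaned


tirol = [2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 2, 32, 2, 37, 2, 57, 2, 81, 2, 109]
cleaned = drop_decreases(tirol)
```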

Regards
SeaLife

Poland files have space in case numbers

For example 1 019 instead of 1019
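
A minimal sketch of a tolerant parser for such values, assuming the separators are ordinary or non-breaking spaces; `parse_count` is a hypothetical helper:

```python
def parse_count(raw: str) -> int:
    """Parse an integer that may contain whitespace as a thousands separator."""
    # str.split() without arguments splits on any Unicode whitespace,
    # which also covers the non-breaking space (U+00A0).
    return int("".join(raw.split()))


value = parse_count("1 019")
```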

Files:

England didn't run for several days due to change of layout

What is happening

England webpage changed and we did not receive data for several days.

How to fix

Fix England
Generate a chart to show the dates in the dataset of each country and show it on Readme.md
Setup an alert if we didn't get the data by the end of the day.

ECDC report data do not have the dates aligned

The dates for different countries in the ECDC report are different.

Fix:

Use the full dataset instead.

Big issues with Poland

Hi there,
the data for Poland seem to be cumulative counts before 2020-11-23 and daily counts afterwards (cases, deaths, tests).
The nuts_2 column is also inconsistent: the same regions are referenced by different strings due to encoding issues.
It would be great to fix that by standardizing this file.
That would be really helpful as this is the only source for Poland at regional level I could find!
Many thanks,
Emanuele

Austria: Better data sources available

https://www.data.gv.at/covid-19/ lists better data sets down to NUTS 3 instead of NUTS 2. An example would be https://www.data.gv.at/katalog/dataset/4b71eb3d-7d55-4967-b80d-91a3f220b60c. There's also data about hospitalizations, age and more.

Please (also) store more granular data from better sources than the current one by NUTS 2.

France cases should be renamed

France is a very strange country, as it doesn't publish cases data. It provides hospitalizations and deaths, but not cases. I don't know which column the script is scraping, but it should be renamed to something else.

I also cross-checked with JHU and it is showing a very different number, so definitely not cases.

License

Hello, thanks for all the great work in this repo! Can you please provide an explicit license for the data? Preferably, something with as few restrictions as possible...

HTML files in ./cache/daily/ sub folders are invalid for Windows Users

Files in Windows cannot contain any of the following characters:
\ / : * ? " < > |

Because of this, Windows users will get this error while trying to clone the repo:
error: invalid path 'cache/daily/at/2020-03-25T15:00:00.html'

Is this something that can be fixed?
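
One possible direction, sketched under the assumption that the cache files may be renamed: replace the forbidden characters when the file name is built. `windows_safe_name` is a hypothetical helper, not something that exists in the repo:

```python
import re

# The characters Windows does not allow in file names, as listed above.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def windows_safe_name(filename: str) -> str:
    """Replace every character that Windows forbids in a file name with '-'."""
    return _FORBIDDEN.sub("-", filename)


safe = windows_safe_name("2020-03-25T15:00:00.html")
```

The helper is meant for the file name only, not the full path, since it would also replace the `/` separators.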

Poland: wrong date starting on 2021-02-03

Starting on 2021-02-03, day and month are swapped in dataset/covid-19-pl.csv and daily/pl/. For example, the date appears as 2021-03-02 instead of 2021-02-03.
Additionally there is no data for 2021-02-01.

RKI changed its page layout slightly

https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html
A column for the number of deaths was added,
and the page was updated once more at 21:30.

Time format changed.

The date field has changed in this dataset from a proper datetime format to a date text field. Is there any chance that might get reverted? Even a ...T00:00:00Z append would help for those building on this data. This would make this dataset consistent with the others in the series.

County data

Could you please provide county data? I would be very grateful!