covid19-eu-zh / covid19-eu-data
Automated Data Collection: COVID-19/SARS-COV-2 Cases in EU by Country, State/Province/Local Authorities, and Date
Some data should be merged for:
and:
Cheers,
Hello,
Your dataset was added to CoronaWhy (https://www.coronawhy.org/) Data Lake on Dataverse as a piece of common COVID-19 data frame https://datasets.coronawhy.org/dataset.xhtml?persistentId=doi:10.5072/FK2/YWVN3B
Would you be willing to help with the maintenance of your dataset in Dataverse, e.g. adding the relevant metadata and keeping the dataset up-to-date? That will help to make the dataset findable and accessible for the medical science community.
Hi, and thanks for this great effort. There is one very confusing issue with your data: it uses a date format that is almost exactly like ISO dates, but with the months and days swapped.
This is an extremely confusing date format if on purpose, or simply a bug if not.
The right ISO 8601 format can be found on Wikipedia: https://en.wikipedia.org/wiki/ISO_8601
It looks like this: 2020-04-13T00:55
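A minimal sketch of how a consumer could work around this until the files are fixed. It assumes the affected timestamps are written as YYYY-DD-MMTHH:MM; the function name `swap_day_month` is made up for illustration.

```python
from datetime import datetime


def swap_day_month(stamp: str) -> str:
    """Reinterpret a YYYY-DD-MMTHH:MM stamp as ISO 8601 (YYYY-MM-DDTHH:MM)."""
    # Assumption: the input really stores the day before the month.
    parsed = datetime.strptime(stamp, "%Y-%d-%mT%H:%M")
    return parsed.strftime("%Y-%m-%dT%H:%M")


print(swap_day_month("2020-13-04T00:55"))  # 2020-04-13T00:55
```

Note that a stamp whose day is 12 or lower is also a valid ISO date, so the swap cannot be detected from the string alone; it has to be known per file.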
I've implemented in R a function to download the data (see download.state() and download.raw())
However, I cannot find a reasonable way to map between districts (IdLandkreise) and nuts_3
If anyone knows how to do it, I'll convert the code to Python to be included here.
Decimal problem
Ref: #27
We have been using a sum row to represent the whole country. It is a bizarre name amongst the state/region/province names.
Instead of using different names for geo locations in different countries, we could use NUTS levels. Since we are dealing with Europe, this
country | NUTS1 | NUTS2 | NUTS3 |
---|---|---|---|
... | ... | ... | ... |
Or we use a more general name
country | administrative_level_1 | administrative_level_2 | administrative_level_3 |
---|---|---|---|
... | ... | ... | ... |
The key to this change is the missing data. We have to deal with missing data properly.
In IT, there are cases unassigned to regions. We are using missing values in the province column to represent them.
missing
Here are some examples.
country | NUTS1 | NUTS2 | NUTS3 | cases |
---|---|---|---|---|
DE | NRW | | | 123 |
indicates that the number of cases for NRW is 123.
country | NUTS1 | NUTS2 | NUTS3 | cases |
---|---|---|---|---|
DE | MISSING | | | 123 |
indicates that the number of cases not assigned to any state is 123.
country | NUTS1 | NUTS2 | NUTS3 | cases |
---|---|---|---|---|
DE | | | | 123 |
indicates that the number of cases for DE as a whole is 123.
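The three cases above could be told apart in code roughly as follows. This is only a sketch of the proposal: the `MISSING` marker and the NUTS column names come from the examples, while `row_kind` and the labels it returns are made up for illustration.

```python
from typing import Dict, Optional

MISSING = "MISSING"  # marker for cases not assigned to any region
LEVELS = ("NUTS1", "NUTS2", "NUTS3")


def row_kind(row: Dict[str, Optional[str]]) -> str:
    """Classify a row as 'region', 'unassigned' or 'country_total'."""
    values = [(row.get(level) or "").strip() for level in LEVELS]
    if MISSING in values:
        return "unassigned"      # cases that belong to no region
    if not any(values):
        return "country_total"   # all NUTS columns empty: whole country
    return "region"


rows = [
    {"country": "DE", "NUTS1": "NRW", "NUTS2": "", "NUTS3": "", "cases": "123"},
    {"country": "DE", "NUTS1": "MISSING", "NUTS2": "", "NUTS3": "", "cases": "123"},
    {"country": "DE", "NUTS1": "", "NUTS2": "", "NUTS3": "", "cases": "123"},
]
kinds = [row_kind(row) for row in rows]
print(kinds)  # ['region', 'unassigned', 'country_total']
```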
We chose to use time slots to manage pipelines because it was the simplest working model in the beginning. But this is failing as the number of workflows increases.
The workflows are almost independent of each other. The data scraping steps of the workflows are not interfering with each other but the data integration steps are.
As new data is downloaded, we have to commit and push the data back to the repo, and the git pushes of different workflows clash.
This is not so serious at the moment, as we have already arranged the workflows into different time slots. The clashes happen mostly when a new commit is pushed, which triggers all the workflows.
Instead of having each country as a workflow, we combine them into one or a few workflows.
Rearrange all the workflows.
We use a pipeline management system to manage the pipelines. Luigi?
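Independently of the options above, a workflow could retry its push after rebasing on whatever the other workflows pushed in the meantime. The following is a hedged sketch, not what the repo currently does; `push_with_retry` and its parameters are hypothetical, and it assumes the workflows touch different files so that the rebase does not produce conflicts.

```python
import subprocess
import time
from typing import Callable, Sequence


def _git(args: Sequence[str]) -> int:
    """Run a git command in the current directory and return its exit code."""
    return subprocess.run(["git", *args]).returncode


def push_with_retry(
    run: Callable[[Sequence[str]], int] = _git,
    attempts: int = 3,
    wait_seconds: float = 5.0,
) -> bool:
    """Rebase onto the remote and push, retrying if another push got in first."""
    for attempt in range(attempts):
        # Replay local commits on top of what other workflows pushed, then push.
        if run(["pull", "--rebase"]) == 0 and run(["push"]) == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(wait_seconds)
    return False
```

The `run` parameter exists only so the logic can be exercised without a real repository; in a workflow the default would be used.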
We have to remove some of the data and reinitialize.
Hello everyone,
I found that some files for daily Wales data have a row where the column `cases` has the value "cases":
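Until the files are cleaned, a reader could skip such rows while loading. A minimal sketch, assuming only that the column is named `cases`; the other column names and the sample values below are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def drop_header_rows(text: str) -> List[Dict[str, str]]:
    """Parse CSV text and drop rows whose 'cases' cell is the literal 'cases'."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row.get("cases") != "cases"]


# Hypothetical sample with a header line repeated in the middle of the data.
sample = (
    "country,authority,cases\n"
    "WLS,Cardiff,12\n"
    "country,authority,cases\n"
    "WLS,Swansea,7\n"
)
clean = drop_header_rows(sample)
print([row["cases"] for row in clean])  # ['12', '7']
```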
The data in rows 9 and 31 of the CSV file is wrong: the corresponding authority is shown as sum, but it seems it should be the data of some region?
The city numbers are incomplete in the sense that they don't add up to the national total, and no deaths or hospitalizations are reported at the city level. Given the low test rate, deaths and hospitalizations are likely to be more meaningful than the number of confirmed cases. Perhaps it is a good idea to add the national numbers, which can be found at https://www.rivm.nl/nieuws/actuele-informatie-over-coronavirus.
Hello again,
I believe these Netherlands cities should merge, don't know which one is correct though:
's-Gravenhage
and
s-Gravenhage
Cheers
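One way to merge the two spellings is an alias table applied when the data is loaded. This sketch assumes the spelling with the leading apostrophe is the one to keep; `ALIASES` and `canonical_city` are hypothetical names.

```python
# Assumption: the apostrophe form is the canonical spelling.
ALIASES = {"s-Gravenhage": "'s-Gravenhage"}


def canonical_city(name: str) -> str:
    """Map known spelling variants of a city name to one canonical form."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)


print(canonical_city("s-Gravenhage"))  # 's-Gravenhage
```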
Hey, they probably changed the RKI page again; pandas seems to fail to read the page properly:
Traceback (most recent call last):
  File "covid19-eu-data/scripts/download_de.py", line 84, in <module>
    cov_de.workflow()
  File "covid19-eu-data\scripts\utils.py", line 100, in workflow
    self.extract_table()
  File "covid19-eu-data/scripts/download_de.py", line 42, in extract_table
    self.df = req_dfs[0][["Bundesland", "Fälle", "Todesfälle"]]
  File "lib\site-packages\pandas\core\frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "lib\site-packages\pandas\core\indexing.py", line 1551, in _get_listlike_indexer
    self._validate_read_indexer(
  File "lib\site-packages\pandas\core\indexing.py", line 1645, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['Todesfälle', 'Fälle'] not in index"

Process finished with exit code 1
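A check before the column selection would turn this `KeyError` into a clearer message when the page layout changes. This is a sketch only: the three column names come from the traceback, while `missing_columns` is a hypothetical helper that is not part of the repo's scripts.

```python
from typing import Iterable, List

REQUIRED = ["Bundesland", "Fälle", "Todesfälle"]


def missing_columns(available: Iterable[object], required: Iterable[str] = REQUIRED) -> List[str]:
    """Return the required column names that are absent from `available`."""
    present = {str(name).strip() for name in available}
    return [name for name in required if name not in present]


# Intended use inside extract_table(), before indexing the DataFrame:
#     missing = missing_columns(req_dfs[0].columns)
#     if missing:
#         raise ValueError(f"RKI table layout changed, missing columns: {missing}")
print(missing_columns(["Bundesland", "Fälle"]))  # ['Todesfälle']
```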
The numbers in lines 80 and 81 seem to be wrong.
I've added a simple implementation to pull IT data. I made too many changes in my fork to open a PR, but you can integrate it from here: https://github.com/gena/covid19-eu-data/blob/master/scripts/download_it.py. It simply pulls the data from another repo, but it's handy to have all data in one place.
Hey,
the healed persons in AT are counted as "cases", which creates wrong data.
This dataset is Tirol:
[2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 2, 32, 2, 37, 2, 57, 2, 81, 2, 109]
(You may see the issue... 27, 2, 32, 2, 37, 2 - the 2 is probably the healed person count)
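Such interleaved values can be spotted mechanically if one assumes that the cumulative case count never decreases. The sketch below drops every value that falls below the last accepted one; `drop_regressions` is a made-up name, and the rule would also hide genuine downward corrections.

```python
from typing import List


def drop_regressions(series: List[int]) -> List[int]:
    """Keep only values that do not fall below the last accepted value."""
    cleaned: List[int] = []
    for value in series:
        # Assumption: cumulative counts are non-decreasing, so a drop is noise.
        if cleaned and value < cleaned[-1]:
            continue
        cleaned.append(value)
    return cleaned


tirol = [2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 2, 32, 2, 37, 2, 57, 2, 81, 2, 109]
print(drop_regressions(tirol))
# [2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 32, 37, 57, 81, 109]
```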
Regards
SeaLife
For example 1 019
instead of 1019
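A small parsing helper could strip the separator before converting. This sketch assumes the separator is some kind of whitespace (a plain or a non-breaking space); `parse_count` is a hypothetical name.

```python
import re


def parse_count(text: str) -> int:
    """Convert a count such as '1 019' to an int by removing all whitespace."""
    # \s matches Unicode whitespace in str patterns, including non-breaking spaces.
    return int(re.sub(r"\s+", "", text))


print(parse_count("1 019"))  # 1019
```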
Files:
England webpage changed and we did not receive data for several days.
The dates for different countries in the ECDC report are different.
Fix:
Hi there,
the data for Poland seem to be cumulative counts before 2020-11-23 and daily counts afterwards (`cases`, `deaths`, `tests`).
The `nuts_2` column is also mixed, with the same regions referenced by different strings due to encoding issues.
It would be great to fix that by standardizing this file.
That would be really helpful as this is the only source for Poland at regional level I could find!
Many thanks,
Emanuele
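The cumulative/daily mix described above could be normalized to cumulative counts along these lines. This is a sketch under stated assumptions: values dated before 2020-11-23 are already cumulative, values from that date on are daily increments, and the function is applied to one region and one column at a time. The name `to_cumulative` and the sample numbers are made up.

```python
from datetime import date
from typing import List, Tuple

SWITCH = date(2020, 11, 23)  # first day assumed to hold daily counts


def to_cumulative(rows: List[Tuple[date, int]], switch: date = SWITCH) -> List[Tuple[date, int]]:
    """Turn a series that switches from cumulative to daily into cumulative."""
    result: List[Tuple[date, int]] = []
    running = 0
    for day, value in sorted(rows):
        if day < switch:
            running = value       # already a cumulative total
        else:
            running += value      # daily count: add to the last total
        result.append((day, running))
    return result


# Made-up numbers for a single region.
sample = [
    (date(2020, 11, 21), 100),
    (date(2020, 11, 22), 110),
    (date(2020, 11, 23), 5),
    (date(2020, 11, 24), 7),
]
print([total for _, total in to_cumulative(sample)])  # [100, 110, 115, 122]
```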
https://www.data.gv.at/covid-19/ lists better data sets down to NUTS 3 instead of NUTS 2. An example would be https://www.data.gv.at/katalog/dataset/4b71eb3d-7d55-4967-b80d-91a3f220b60c. There's also data about hospitalizations, age and more.
Please (also) store more granular data from better sources than the current one by NUTS 2.
France is a very strange country, as it doesn't publish cases data. It provides hospitalizations and deaths but not cases. I don't know which column the script is scraping, but it should be renamed to something else.
I also cross-checked with JHU and it is showing a very different number, so definitely not cases.
Hello, thanks for all the great work in this repo! Can you please provide an explicit license for the data? Preferably, something with as few restrictions as possible...
Files in Windows cannot contain any of the following characters:
\ / : * ? " < > |
Because of this, Windows users will get this error while trying to clone the repo:
error: invalid path 'cache/daily/at/2020-03-25T15:00:00.html'
Is this something that can be fixed?
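One possible fix is to sanitize file names before the cache files are written. A minimal sketch: the character set is the one listed above, while `windows_safe_name` and the choice of `-` as replacement are assumptions, and existing files would still need to be renamed in the repo.

```python
import re

# The characters listed above: \ / : * ? " < > |
FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def windows_safe_name(name: str, replacement: str = "-") -> str:
    """Replace characters that Windows does not allow in a file name."""
    # Apply to the file name only, not to a full path, since '/' is replaced too.
    return FORBIDDEN.sub(replacement, name)


print(windows_safe_name("2020-03-25T15:00:00.html"))  # 2020-03-25T15-00-00.html
```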
Starting on 2021-02-03, day and month are swapped in dataset/covid-19-pl.csv and daily/pl/. For example, 2021-03-02 appears instead of 2021-02-03.
Additionally there is no data for 2021-02-01.
https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html
A column for the number of deaths has been added,
and the page was updated again at 21:30.
The `date` field has changed in this dataset from a proper datetime format to a date text field. Is there any chance that might get reverted? Even a `...T00:00:00Z` append would help for those building on this data. This would make this dataset consistent with the others in the series.
Could you please provide county data? I would be very grateful!