covid19-eu-zh / covid19-eu-data

Automated Data Collection: COVID-19/SARS-COV-2 Cases in EU by Country, State/Province/Local Authorities, and Date

Python 0.08% HTML 99.92% Shell 0.01%
covid-19 sars-cov-2 coronavirus covid19-data dataset

covid19-eu-data's People

Contributors

actions-user, emptymalei

covid19-eu-data's Issues

Netherlands Cities Encoding Issue

Some data should be merged for:

  • Noardeast-Fryslân
  • Noardeast-Fryslân

and:

  • Súdwest-Fryslân
  • Súdwest Fryslân
  • Súdwest-Fryslân
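
One way to find such duplicates is to compare names by a normalized key. The following is a minimal sketch, assuming the variants differ only in Unicode composition and in space versus hyphen; `merge_key` is a hypothetical helper, not part of the repository's scripts.

```python
import re
import unicodedata


def merge_key(name: str) -> str:
    """Return a comparison key for a municipality name (hypothetical helper)."""
    # NFC composes a base letter plus a combining accent into one code point,
    # so "a" + U+0302 and the precomposed "â" compare equal.
    key = unicodedata.normalize("NFC", name).strip()
    # Assumption: runs of spaces and hyphens are interchangeable separators.
    key = re.sub(r"[\s\-]+", "-", key)
    return key.casefold()


variants = ["Su\u0301dwest-Frysla\u0302n", "Súdwest Fryslân", "Súdwest-Fryslân"]
distinct_keys = {merge_key(v) for v in variants}
```

If `distinct_keys` holds a single entry, the variants can be treated as one municipality. The key is meant for grouping only; the display name would still have to be chosen by hand.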

Cheers,

Fix date format to ISO date

Hi, and thanks for this great effort. There is one very confusing issue with your data: it uses a date format that looks almost exactly like an ISO date but has the month and day swapped.

This is an extremely confusing date format if it is on purpose, or simply a bug if not.

The right ISO 8601 format can be found on Wikipedia: https://en.wikipedia.org/wiki/ISO_8601

It looks like this: 2020-04-13T00:55
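
A possible repair, sketched under the assumption that the affected values have the layout YYYY-DD-MMTHH:MM (the exact layout in the dataset is not shown here); `swap_day_month` is a hypothetical helper name.

```python
from datetime import datetime


def swap_day_month(stamp: str) -> str:
    """Reinterpret a year-day-month timestamp as ISO 8601 (year-month-day)."""
    # Assumed input layout: YYYY-DD-MMTHH:MM, for example "2020-13-04T00:55".
    parsed = datetime.strptime(stamp, "%Y-%d-%mT%H:%M")
    # timespec="minutes" drops the seconds so the result matches the example above.
    return parsed.isoformat(timespec="minutes")
```

Values whose day is 12 or lower parse under either reading, so this should only be applied to rows that are known to be swapped.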

Regulate Administrative Divisions

Why

Ref: #27

We have been using a sum row to represent the whole country. That label looks out of place among the state/region/province names.

Solutions

Use proper geo levels: NUTS or LAU

Instead of using different names for geographic locations in different countries, we could use NUTS levels, since we are dealing with Europe.

country | NUTS1 | NUTS2 | NUTS3
...     | ...   | ...   | ...

Or we use a more general name

country | administrative_level_1 | administrative_level_2 | administrative_level_3
...     | ...                    | ...                    | ...

The key to this change is the missing data. We have to deal with missing data properly.

In IT, there are cases unassigned to regions. We are using missing values in the province column to represent them.

  1. If the numbers are unassigned, we indicate this with a keyword such as MISSING.
  2. For statistics of higher-level admin divisions, we use missing values in the lower-level columns.

Here are some examples.

country | NUTS1   | NUTS2 | NUTS3 | cases
DE      | NRW     |       |       | 123

indicates that the number of cases for NRW is 123.

country | NUTS1   | NUTS2 | NUTS3 | cases
DE      | MISSING |       |       | 123

indicates that the number of cases not assigned to any state is 123.

country | NUTS1   | NUTS2 | NUTS3 | cases
DE      |         |       |       | 123

indicates that the number of cases for DE is 123.
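
The three cases above can be told apart mechanically. This is a minimal sketch, assuming a CSV with the proposed columns and the keyword `MISSING`; the function name `describe` and the sample rows are illustrative only.

```python
import csv
import io

MISSING = "MISSING"  # assumed keyword for cases not assigned to a division
LEVELS = ("NUTS1", "NUTS2", "NUTS3")


def describe(row: dict) -> str:
    """Say what a row's case count refers to under the proposed convention."""
    values = [(row.get(level) or "").strip() for level in LEVELS]
    if MISSING in values:
        return "unassigned"
    filled = [v for v in values if v]
    # No level filled in: the count is the country total.
    # Otherwise the deepest filled level names the division.
    return filled[-1] if filled else row["country"]


sample = io.StringIO(
    "country,NUTS1,NUTS2,NUTS3,cases\n"
    "DE,NRW,,,123\n"
    "DE,MISSING,,,123\n"
    "DE,,,,123\n"
)
labels = [describe(row) for row in csv.DictReader(sample)]
```

Here `labels` comes out as `["NRW", "unassigned", "DE"]`, matching the three readings above.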

References

  1. In EU, we have NUTS and LAU
    1. NUTS: three levels
      1. NUTS1
      2. NUTS2
      3. NUTS3
  2. Administrative Divisions of Countries ("Statoids")
    1. Three levels:
      1. Primary
      2. Secondary
      3. Tertiary
  3. Wikipedia has a page showing the divisions: https://en.wikipedia.org/wiki/List_of_administrative_divisions_by_country
    1. Three levels
      1. First-level
      2. Second-level
      3. Third-level

Cron job based pipeline management is failing

We chose to use time slots to manage pipelines because it was the simplest working model in the beginning. But this is failing as the number of workflows increases.

Why is it failing

The workflows are almost independent of each other. The data scraping steps of the workflows are not interfering with each other but the data integration steps are.

As new data is downloaded, we have to commit and push it back to the repo, and the git pushes have been clashing.

How serious is it

Not so serious at the moment, as we have already arranged the workflows into different time slots. The clashing happens mostly when a new commit is pushed, which triggers all the workflows.

Solution

Remove the push on master trigger

Combined Workflows

Instead of having each country as a workflow, we combine them into one or a few workflows.

  1. No badges for each country.
  2. Fail as a cluster.
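
A combined workflow could look roughly like the sketch below. It assumes each country's scraper can be called as a plain function; `run_combined` and `fail_as_cluster` are hypothetical names, not existing code in the repository.

```python
from typing import Callable, Dict


def run_combined(scrapers: Dict[str, Callable[[], None]]) -> Dict[str, Exception]:
    """Run every scraper in turn and collect failures by country code."""
    failures: Dict[str, Exception] = {}
    for country, scrape in scrapers.items():
        try:
            scrape()
        except Exception as exc:
            # Keep going so one broken country does not block the others.
            failures[country] = exc
    return failures


def fail_as_cluster(failures: Dict[str, Exception]) -> None:
    """Raise one error for the whole workflow if any scraper failed."""
    if failures:
        raise RuntimeError("failed scrapers: " + ", ".join(sorted(failures)))
```

With all scrapers in one job, the data integration step (commit and push) would run once after `run_combined` returns, instead of once per country.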

Better Time slots

Rearrange all the workflows.

Use pipeline management system

We could use a pipeline management system to manage the pipelines. Luigi?

  1. No badges for each country.

Netherlands cities should merge

Hello again,
I believe these Netherlands city entries should be merged; I don't know which one is correct, though:

's-Gravenhage

and

s-Gravenhage

Cheers

RKI

Hey, they probably changed the RKI page again; pandas seems to fail to read the page properly:

Traceback (most recent call last):
  File "covid19-eu-data/scripts/download_de.py", line 84, in <module>
    cov_de.workflow()
  File "covid19-eu-data\scripts\utils.py", line 100, in workflow
    self.extract_table()
  File "covid19-eu-data/scripts/download_de.py", line 42, in extract_table
    self.df = req_dfs[0][["Bundesland", "Fälle", "Todesfälle"]]
  File "lib\site-packages\pandas\core\frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "lib\site-packages\pandas\core\indexing.py", line 1551, in _get_listlike_indexer
    self._validate_read_indexer(
  File "lib\site-packages\pandas\core\indexing.py", line 1645, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['Todesfälle', 'Fälle'] not in index"

Process finished with exit code 1
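
A check before the column selection would turn this `KeyError` into a clearer message. The sketch below is an assumption about how that could be done, not the project's actual fix; `missing_columns` is a hypothetical helper, and the expected names are taken from the traceback.

```python
from typing import Iterable, List

# Column names the script selects, as seen in the traceback above.
EXPECTED = ["Bundesland", "Fälle", "Todesfälle"]


def missing_columns(found: Iterable[str], expected: List[str] = EXPECTED) -> List[str]:
    """List expected column names that are absent from the scraped table."""
    present = {str(name).strip() for name in found}
    return [name for name in expected if name not in present]
```

Passing `req_dfs[0].columns` to `missing_columns` before the selection in `extract_table` would show which headers changed on the page.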

AT bug with "healed" persons

Hey,

The healed persons in AT are counted as "cases", which creates wrong data.

This is the dataset for Tirol:

[2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 2, 32, 2, 37, 2, 57, 2, 81, 2, 109]

(You may see the issue: 27, 2, 32, 2, 37, 2. The 2 is probably the healed person count.)
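
Rows like these can be flagged automatically. This is a minimal sketch, assuming the case counts are meant to be cumulative and therefore never decrease; `suspicious_drops` is a hypothetical helper.

```python
from typing import List, Sequence


def suspicious_drops(counts: Sequence[int]) -> List[int]:
    """Return positions where a cumulative count falls below its running maximum."""
    drops = []
    running_max = None
    for position, value in enumerate(counts):
        if running_max is not None and value < running_max:
            drops.append(position)  # a cumulative count should never go down
        else:
            running_max = value
    return drops


tirol = [2, 2, 3, 4, 4, 7, 7, 8, 8, 8, 9, 16, 27, 2, 32, 2, 37, 2, 57, 2, 81, 2, 109]
flagged = suspicious_drops(tirol)
```

For the Tirol series, `flagged` is `[13, 15, 17, 19, 21]`, which are exactly the positions of the stray 2s.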

Regards
SeaLife

England didn't run for several days due to a change of layout

What is happening

The England webpage changed and we did not receive data for several days.

How to fix

  • Fix England
  • Generate a chart to show the dates in the dataset of each country and show it on Readme.md
  • Set up an alert if we didn't get the data by the end of the day.
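
The alert could start from a check like the one below. It is a sketch under the assumption that the newest date per country is already known; `stale_countries` and its `max_age_days` parameter are hypothetical.

```python
from datetime import date
from typing import Dict, List


def stale_countries(latest: Dict[str, date], today: date, max_age_days: int = 1) -> List[str]:
    """Return country codes whose newest data point is older than allowed."""
    return sorted(
        country
        for country, last_seen in latest.items()
        if (today - last_seen).days > max_age_days
    )
```

A scheduled job could call this at the end of the day and send a notification whenever the returned list is not empty.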

Big issues with Poland

Hi there,
the data for Poland seem to be cumulative counts before 2020-11-23 and daily counts afterwards (cases, deaths, tests).
The nuts_2 column is also inconsistent: the same regions are referenced with different strings due to encoding issues.
It would be great to fix that by standardizing this file.
That would be really helpful as this is the only source for Poland at regional level I could find!
Many thanks,
Emanuele
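
The mixed series could be brought back to cumulative counts along these lines. This is a sketch for a single region and a single column, assuming 2020-11-23 is the first day reported as a daily count; `to_cumulative` is a hypothetical helper.

```python
from typing import List, Tuple

CUTOVER = "2020-11-23"  # assumed first date that holds a daily count


def to_cumulative(rows: List[Tuple[str, int]], cutover: str = CUTOVER) -> List[Tuple[str, int]]:
    """Convert a series that switches from cumulative to daily counts at `cutover`."""
    result = []
    total = 0
    for day, value in sorted(rows):  # ISO dates sort correctly as strings
        if day < cutover:
            total = value   # value is already cumulative
        else:
            total += value  # daily increment on top of the last total
        result.append((day, total))
    return result
```

The dates must already be valid ISO strings for the comparison to work, so any day/month swap has to be repaired first.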

France cases should be renamed

France is a very strange country, as it doesn't publish case data. It provides hospitalizations and deaths but not cases. I don't know which column the script is scraping, but it should be renamed to something else.

I also cross-checked with JHU and it is showing a very different number, so definitely not cases.

License

Hello, thanks for all the great work in this repo! Can you please provide an explicit license for the data? Preferably, something with as few restrictions as possible...

Poland: wrong date starting on 2021-02-03

Starting on 2021-02-03, day and month are swapped in dataset/covid-19-pl.csv and daily/pl/. For example, 2021-03-02 appears instead of 2021-02-03.
Additionally, there is no data for 2021-02-01.
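
Gaps such as the missing 2021-02-01 can be listed with a short check. The sketch assumes the dates are ISO strings (YYYY-MM-DD) after any swap has been corrected; `missing_days` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Iterable, List


def missing_days(present: Iterable[str], start: str, end: str) -> List[str]:
    """List ISO dates between start and end (inclusive) that have no data."""
    seen = set(present)
    day, last = date.fromisoformat(start), date.fromisoformat(end)
    gaps = []
    while day <= last:
        if day.isoformat() not in seen:
            gaps.append(day.isoformat())
        day += timedelta(days=1)
    return gaps
```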

Time format changed.

The date field has changed in this dataset from a proper datetime format to a date text field. Is there any chance that might get reverted? Even a ...T00:00:00Z append would help for those building on this data. This would make this dataset consistent with the others in the series.

County data

Could you please provide county data? I would be very grateful!
