
Comments (4)

cipriancraciun commented on August 26, 2024

Thanks for reporting.

Regarding the Aruba case: apparently this issue was reported multiple times in the JHU repository, but no action was taken by them.

I've also noticed a major issue myself with French Polynesia that basically breaks any model for that region: for one day, all of the cases for France were reported under French Polynesia.


Now, I've thought a little bit about how to handle these inconsistencies:

  • should I manually override these inconsistencies? This would require a lot of effort on my behalf, and I'm not sure I could cover everything; therefore the actual outcome would be the same: the resulting dataset couldn't be considered any more accurate than the original JHU one;
  • should I automatically drop any rows where values go downward? Should I drop the later values that are smaller, or the earlier value that spikes? I think this would probably generate more issues than it solves;
  • should I average the values so that inconsistencies are smoothed out? I think this falls into the same category as above;
  • should I do nothing and just provide the data as-is from the original source?

And my current decision is the last one: do nothing. In the end my main goal is to provide the data in a more "usable" format; my goal is not to meddle with the data.
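For anyone who wants to at least flag the downward jumps described in the second option, a minimal detection sketch is easy to write in plain Python (the function name is mine, not part of the dataset's tooling):

```python
def find_decreases(values):
    """Return the indices where a cumulative series goes down --
    candidate reporting inconsistencies like the Aruba and
    French Polynesia cases."""
    return [i for i in range(1, len(values))
            if values[i] < values[i - 1]]
```

Flagging rather than fixing keeps the "don't meddle with the data" principle intact: the consumer decides what to do with the flagged indices.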


That being said, I am pondering introducing another dataset that takes the original data from JHU (and the others) and "smooths out" the inconsistencies.

I still have to think about the mathematical model that I would need to employ, but I could lay out the following requirements:

  • the cumulative values to date should be the same as in the original dataset (i.e. the total should be the same);
  • the absolute difference between the original data and the "smoothed" data shouldn't exceed a certain threshold, say 10%; if it goes over that, the data point is better removed;
  • the relation between confirmed, recovered and deaths should be kept within the same limit (i.e. the smoothing shouldn't generate inconsistencies in the ratios for a given day);
  • the sums over a moving window of, say, 7 days shouldn't differ by more than a similar threshold (so that any analysis done over a window larger than that won't return skewed results);
  • it should be easy to implement, requiring nothing more than a moving window over the values and basic arithmetic operations (because I need to implement it in Python / jq and don't want to drag in complex mathematical libraries).

How can I achieve this? I don't know; perhaps you have a suggestion.

(My first thought is a moving-window average.)
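A moving-window average of the daily deltas can be made to satisfy the first requirement (unchanged cumulative total) with a final rescaling step. This is only a sketch of that idea, using pure Python and basic arithmetic as the last requirement asks; it is not the dataset's actual algorithm:

```python
def smooth_series(values, window=7):
    """Smooth a cumulative series with a centered moving average of
    its daily deltas, then rescale the deltas so that the cumulative
    total stays exactly the same as in the original data."""
    # daily deltas; the first delta is the first value itself
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    half = window // 2
    smoothed = []
    for i in range(len(deltas)):
        lo, hi = max(0, i - half), min(len(deltas), i + half + 1)
        smoothed.append(sum(deltas[lo:hi]) / (hi - lo))
    # rescale so the smoothed deltas sum to the original total
    total, smoothed_total = sum(deltas), sum(smoothed)
    if smoothed_total:
        smoothed = [d * total / smoothed_total for d in smoothed]
    # rebuild the cumulative series
    out, acc = [], 0.0
    for d in smoothed:
        acc += d
        out.append(acc)
    return out
```

The per-point 10% threshold and the confirmed/recovered/deaths ratio constraint would still need separate checks on top of this.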


(I'll post another reply about the other issue.)

from covid19-datasets.

cipriancraciun commented on August 26, 2024

Regarding the missing data: checking the latest files (as of April 18th), the first Aruba case is on March 13th. It appears correctly in my dataset, i.e. as in the JHU files. Thus it might be an issue that was corrected later on. (I would strongly suggest using the latest files.)


However, you are right that there are gaps in my dataset; for example, in the case of Aruba, I have the dates 13, 17, 18, 20, 22, 24, etc. for March. (The dates in between are missing.)

But these gaps are by design: the actual values (for any of the metrics confirmed, deaths and recovered) did not change, due to one of the following reasons:

  • there were really no changes (i.e. no-one new was confirmed, recovered, etc.);
  • the data was not actually reported by that country;
  • the data was not actually collected by JHU;

However, in any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.), and especially with "deltas" and "speed".

Therefore I took the decision to drop any rows that don't actually provide any new data.
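The dropping rule described above can be sketched as follows; this is an illustration under my own assumptions about the row shape (a `location` key plus the three metric columns), not the dataset's actual pipeline code:

```python
def drop_unchanged(rows, metrics=("confirmed", "deaths", "recovered")):
    """Keep only rows where at least one metric differs from the
    previously kept row for the same location; repeated snapshots
    carry no new information and are dropped."""
    kept, last = [], {}
    for row in rows:
        key = row["location"]
        snapshot = tuple(row.get(m) for m in metrics)
        if last.get(key) != snapshot:
            kept.append(row)
            last[key] = snapshot
    return kept
```

Note that this drops a row only when all three metrics are unchanged; a change in any single metric keeps the row.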


gottfriedhelms commented on August 26, 2024

However, in any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.), and especially with "deltas" and "speed".

Therefore I took the decision to drop any rows that don't actually provide any new data.

Yes, I think this is the best decision. Only for the time-series data might it be advisable to "fill the gaps", because a user might not have an easy instrument to do this themselves (I can use MS Access with its "crosstab" function, which can produce the JHU time-series format instantly).
In general, given how the JHU team themselves have not handled the multitude of data issues, I don't think a big private effort is advisable: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will keep happening in the future...
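For users without a crosstab-style tool, the gap-filling itself is a short forward-fill over dates. A minimal sketch, assuming the sparse series is a `{date: value}` mapping (my assumption, not the dataset's actual format):

```python
from datetime import timedelta

def fill_gaps(series):
    """Forward-fill missing dates in a {date: value} mapping so the
    sparse series becomes a dense daily time-series again."""
    days = sorted(series)
    out, current = {}, None
    d = days[0]
    while d <= days[-1]:
        # carry the last seen value across missing days
        current = series.get(d, current)
        out[d] = current
        d += timedelta(days=1)
    return out
```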

Again - applause to your big work here, Ciprian!


cipriancraciun commented on August 26, 2024

In general, given how the JHU team themselves have not handled the multitude of data issues, I don't think a big private effort is advisable: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will keep happening in the future...

That is why I also imported the ECDC data (for global data) and the NY Times data (for US data) as alternatives to JHU.

Thus I would strongly advise you to also use the ECDC-derived data. (It uses exactly the same format as the JHU-derived one.)

