
Comments (4)

cipriancraciun commented on August 26, 2024

Thanks for reporting.

Regarding the Aruba case: apparently this issue was reported multiple times in the JHU repository, but no action was taken by them.

I've also noticed a major issue myself with French Polynesia that basically breaks any model for that region: for one day, all of the cases for France were reported under French Polynesia.


Now, I've thought a little bit about how to handle these inconsistencies:

  • should I manually override these inconsistencies? This would require a lot of effort on my behalf, and I'm not sure I could cover everything; therefore the actual outcome would be the same: the resulting dataset couldn't be considered any more accurate than the original JHU one;
  • should I automatically drop any rows where values go downward? Should I drop the later values that are smaller, or the earlier value that spikes? I think this would probably generate more issues than it solves;
  • should I average the values so that inconsistencies are smoothed out? I think this falls into the same category as above;
  • should I do nothing and just provide the data as-is from the original source?

And my current decision is the last one: do nothing. In the end my main goal is to provide the data in a more "usable" format; my goal is not to meddle with the data.
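For anyone who wants to at least flag the downward jumps described in the second option, a minimal detection sketch is easy to write in plain Python (the function name is mine, not part of the dataset's tooling):

```python
def find_decreases(values):
    """Return the indices where a cumulative series goes down --
    candidate reporting inconsistencies like the Aruba and
    French Polynesia cases."""
    return [i for i in range(1, len(values))
            if values[i] < values[i - 1]]
```

Flagging rather than fixing keeps the "don't meddle with the data" principle intact: the consumer decides what to do with the flagged indices.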


That being said, I am pondering introducing another dataset that takes the original data from JHU (and the others) and "smooths out" the inconsistencies.

I still have to think about the mathematical model that I would need to employ, but I could lay out the following requirements:

  • the cumulative values to date should be the same as in the original dataset (i.e. the total should be the same);
  • the absolute difference between the original data and the "smoothed" data shouldn't exceed a certain threshold, say 10%; if it goes over that, the data point is better removed;
  • the relation between confirmed, recovered and deaths should be kept within the same limit (i.e. the smoothing shouldn't generate inconsistencies in the ratios for a given day);
  • the sums over a moving window of, say, 7 days shouldn't differ by more than a similar threshold (so that any analysis done over a window larger than that won't return skewed results);
  • it should be easy to implement, requiring nothing more than a moving window over the values and basic arithmetic operations (because I need to implement it in Python / jq and don't want to drag in complex mathematical libraries).

How can I achieve this? I don't know; perhaps you have a suggestion.

(My first thought is a moving-window average.)
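A moving-window average of the daily deltas can be made to satisfy the first requirement (unchanged cumulative total) with a final rescaling step. This is only a sketch of that idea, using pure Python and basic arithmetic as the last requirement asks; it is not the dataset's actual algorithm:

```python
def smooth_series(values, window=7):
    """Smooth a cumulative series with a centered moving average of
    its daily deltas, then rescale the deltas so that the cumulative
    total stays exactly the same as in the original data."""
    # daily deltas; the first delta is the first value itself
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    half = window // 2
    smoothed = []
    for i in range(len(deltas)):
        lo, hi = max(0, i - half), min(len(deltas), i + half + 1)
        smoothed.append(sum(deltas[lo:hi]) / (hi - lo))
    # rescale so the smoothed deltas sum to the original total
    total, smoothed_total = sum(deltas), sum(smoothed)
    if smoothed_total:
        smoothed = [d * total / smoothed_total for d in smoothed]
    # rebuild the cumulative series
    out, acc = [], 0.0
    for d in smoothed:
        acc += d
        out.append(acc)
    return out
```

The per-point 10% threshold and the confirmed/recovered/deaths ratio constraint would still need separate checks on top of this.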


(I'll post another reply about the other issue.)

from covid19-datasets.

cipriancraciun commented on August 26, 2024

Regarding the missing data: checking the latest files (as of April 18th), the first Aruba case is on March 13th. It appears correctly in my dataset, i.e. as in the JHU files. Thus it might be an issue that was corrected later on. (I would strongly suggest using the latest files.)


However, you are right that there are gaps in my dataset; for example, in the case of Aruba, I have the dates 13, 17, 18, 20, 22, 24, etc. for March. (The dates in between are missing.)

But these gaps are by design: the actual values (for any of the metrics confirmed, deaths and recovered) did not change, due to one of the following reasons:

  • there were really no changes (i.e. no-one new was confirmed, recovered, etc.);
  • the data was not actually reported by that country;
  • the data was not actually collected by JHU;

However, in any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.), and especially with "deltas" and "speed".

Therefore I took the decision to drop any rows that don't actually provide any new data.
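The dropping rule described above can be sketched as follows; this is an illustration under my own assumptions about the row shape (a `location` key plus the three metric columns), not the dataset's actual pipeline code:

```python
def drop_unchanged(rows, metrics=("confirmed", "deaths", "recovered")):
    """Keep only rows where at least one metric differs from the
    previously kept row for the same location; repeated snapshots
    carry no new information and are dropped."""
    kept, last = [], {}
    for row in rows:
        key = row["location"]
        snapshot = tuple(row.get(m) for m in metrics)
        if last.get(key) != snapshot:
            kept.append(row)
            last[key] = snapshot
    return kept
```

Note that this drops a row only when all three metrics are unchanged; a change in any single metric keeps the row.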


gottfriedhelms commented on August 26, 2024

However, in any case, my reasoning is that just repeating the same values (as JHU does) is wrong, because it basically interpolates data and messes with statistics (like median, average, etc.), and especially with "deltas" and "speed".

Therefore I took the decision to drop any rows that don't actually provide any new data.

Yes, I think this is the best decision. Only for the time-series data might it be advisable to "fill the gaps", because a user might not have an easy instrument to do this themselves (I can use MS Access with its "crosstab" function, which can produce the JHU time-series format instantly).
In general, given how the JHU team themselves have not handled the multitude of data issues, I don't think a big private effort is advisable: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will keep happening in the future...
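For users without a crosstab-style tool, the gap-filling itself is a short forward-fill over dates. A minimal sketch, assuming the sparse series is a `{date: value}` mapping (my assumption, not the dataset's actual format):

```python
from datetime import timedelta

def fill_gaps(series):
    """Forward-fill missing dates in a {date: value} mapping so the
    sparse series becomes a dense daily time-series again."""
    days = sorted(series)
    out, current = {}, None
    d = days[0]
    while d <= days[-1]:
        # carry the last seen value across missing days
        current = series.get(d, current)
        out[d] = current
        d += timedelta(days=1)
    return out
```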

Again - applause to your big work here, Ciprian!


cipriancraciun commented on August 26, 2024

In general, given how the JHU team themselves have not handled the multitude of data issues, I don't think a big private effort is advisable: they seem to break the "fixed" structure and any agreement arbitrarily and erratically, they won't update their earlier documentation, and I expect this will keep happening in the future...

That is why I also imported the ECDC data (for global data) and the NY Times data (for US data) as alternatives to JHU.

Thus I would strongly advise you to also use the ECDC-derived data. (It uses exactly the same format as the JHU-derived one.)

