davidbau / covid-19-chart

Chart of current COVID-19 time series data. Enables a variety of county- state- and nation-level comparisons and data exploration.

Home Page: https://covid19chart.org/

HTML 73.81% CSS 0.90% JavaScript 24.58% Python 0.71%

chart covid-19

covid-19-chart's People

davidebbo veeps dmaymudes classicvalues

covid-19-chart's Issues

Plot ratios of deaths to confirmed cases. (discussion)

Everybody asks for this: what's the ratio of deaths to confirmed cases?

I am unconvinced that this ratio is a meaningful number. Both deaths and confirmed cases have huge sources of noise, and the time of death lags the time of onset by a couple of weeks; under exponential growth, the infection rate may have increased 30-fold during those two weeks. A ratio will amplify all these problems, and my intuition is that such a ratio may dramatically undercount the actual peril and give people a very dangerous false sense of security.

At best, the number will be almost uninterpretable.

But it is requested so often that maybe it should be an option for "advanced" mode.

Also, some analyses have suggested that "deaths divided by confirmed cases 2 weeks ago" is a better measure of something. Suggestions welcome.
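For discussion, here is a minimal sketch of the lagged ratio mentioned above. The function name, the array-per-day input shape, and the default 14-day lag are all assumptions for illustration, not code from this project.

```javascript
// Hypothetical helper: ratio of cumulative deaths on day t to cumulative
// confirmed cases `lag` days earlier. Both inputs are arrays indexed by day.
function laggedFatalityRatio(deaths, confirmed, lag = 14) {
  return deaths.map((d, t) => {
    const earlier = t - lag;
    // No ratio is defined until `lag` days of history exist,
    // or while the lagged case count is still zero.
    if (earlier < 0 || !(confirmed[earlier] > 0)) return null;
    return d / confirmed[earlier];
  });
}

// Toy data that doubles daily, with deaths trailing cases by two days.
const toyDeaths = [0, 0, 1, 2, 4];
const toyConfirmed = [10, 20, 40, 80, 160];
console.log(laggedFatalityRatio(toyDeaths, toyConfirmed, 2));
```

On the toy data the same-day ratio on the last day is 4/160, while the lagged ratio is 4/40, which illustrates how the unlagged number can undercount under exponential growth.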

Ability to show more states

LOVE what you've done! Very nice.

I may have to borrow your excellent ideas and make a version of my program that has additional views including a single graph.

I don't mind being overwhelmed with data. Rather be overwhelmed than not have enough.

Maybe provide another option to choose the number of states to graph? I don't know how my state is doing for example because it's not (yet) in the top 10.

Here are my graphs by country, for example. I don't always want to see that many, but it's nice to be able to.

Normalize location names and formats between CDS and JHU

CDS is doing an increasingly good job... To merge data sources and to support things like population normalization, we need to normalize location names and data formats between CDS and JHU and us.

Our conventions (and conversions from JHU) are encoded in lib/flatten.js (exercised in test_csse.js). Each locality is either a country, "state", or "county", and is named as a suffix of "[county], [state], [country]", except in the US, where we drop the country and just call it "[county], [state]". US states are all called by two-letter abbreviations.

And county names follow the JHU convention of not saying "County" at the end (maybe we should change this for clarity).

We should have

a set of functions to normalize locality names that can convert from CDS locations
a function to load CDS data into the same format as lib/flatten.js creates for CSSE data.
a function to merge CDS and CSSE timeseries data (and maybe note the provenance of each point meanwhile). Simple heuristic: take the maximum value reported by either source on any given day. Since reported stats are cumulative, this is robust.
then we need to change the plotting code to consume this flattened format.
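The naming convention and the max-per-day merge heuristic described above could be sketched roughly as follows. The function names, the `{ county, state, country }` input, the use of 'US' as the country label, and the `{ date: count }` series shape are assumptions for illustration; they are not the format that lib/flatten.js produces.

```javascript
// Hypothetical: build a locality name as a suffix of
// "[county], [state], [country]", dropping the country inside the US.
function localityName({ county, state, country }) {
  const parts = [county, state].filter(Boolean);
  // Outside the US, or for a whole country, the country name is kept.
  if (country !== 'US' || parts.length === 0) parts.push(country);
  return parts.join(', ');
}

// Hypothetical: merge two cumulative series keyed by date string, keeping
// the larger value on each day and recording which source supplied it.
function mergeCumulative(csse, cds) {
  const merged = {};
  const dates = new Set([...Object.keys(csse), ...Object.keys(cds)]);
  for (const date of dates) {
    const a = csse[date];
    const b = cds[date];
    if (b === undefined || (a !== undefined && a >= b)) {
      merged[date] = { value: a, source: 'CSSE' };
    } else {
      merged[date] = { value: b, source: 'CDS' };
    }
  }
  return merged;
}
```

Keeping the `source` field alongside each value is one way to note the provenance of each point while merging.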

[discussion] Add normalization option that takes area into account

Some context on Twitter: https://twitter.com/fearthecowboy/status/1248622602257489920

I'm not sure how interesting this is, but it's worth thinking about. i.e. chart the number of cases per sq km. Or maybe some fancy formula that somehow takes both area and population into account.

Thinking aloud, I'd expect that for a given population, the situation would get worse as the area gets smaller, as people are intrinsically closer together. e.g. to take extremes, DC and Alaska have similar populations but a massive size difference. Of course, things get tricky because large territories tend to have their population in big clusters. e.g. more than half of Alaska's population is in the Anchorage metro area, while extremely large areas are completely unpopulated (and sort of irrelevant).

Anyway, let's discuss whether this is a direction that might make sense.

Animate between linear and log10 view

One of the goals of the visualization is to show how the logarithmic and linear views are different sides of the same story. Ideally, we could animate between the two views when switching.

Similarly, ideally we could animate between >= threshold and fixed-date views also.

This type of transition is common in D3 - but we are not currently built on D3.

Not sure how feasible it is while staying with chartist.js.

Animated and audio visualization of rates of infection or death.

The infection rate in the U.S. is getting high enough (19821 per day, which is about one every four seconds) that it can be visualized in real time, and even listened to (especially global aggregates).

We should consider modifying the plot to communicate this:

(1) The next "point" should be estimated in real-time based on the rate from yesterday, or the last few days; as you look at the graph, it should be counting up. The unreported numbers should be shown with a dotted line to show they are estimates.

(2) Every estimated new case should be announced by an audible sound, to make it clear the significance of every individual being infected. Like the significance of a single bullet strike in a war.
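A rough sketch of the real-time estimate in (1), assuming the simplest possible model: a constant rate equal to the most recent daily count. The function names are hypothetical.

```javascript
const SECONDS_PER_DAY = 86400;

// Hypothetical: average interval between new cases at a given daily rate.
function secondsPerCase(dailyRate) {
  return SECONDS_PER_DAY / dailyRate;
}

// Hypothetical: extrapolate the running total since the last report,
// assuming the daily rate stays constant.
function estimateCurrentTotal(lastTotal, dailyRate, secondsSinceReport) {
  const newCases = Math.floor((dailyRate * secondsSinceReport) / SECONDS_PER_DAY);
  return lastTotal + newCases;
}

// At 19821 cases per day, the interval is a little over four seconds.
console.log(secondsPerCase(19821).toFixed(2));
```

A timer firing at the `secondsPerCase` interval could drive both the counter and the sound in (2); using the average of the last few days instead of yesterday's rate would only change the `dailyRate` argument.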

Data cleaning: debug missing counties

County data is shown here:
https://covid19chart.org/test_county_map.html

Many counties are shown as missing data.
To debug this we should

(1) eliminate testdata/county_dates.csv and hook up the test directly to the live feed.
(2) track down whether the missing data is due to missing fips codes, or actual absent data in JHU feed

If the data is actually missing, we should investigate whether merging feeds will solve it. The CDS and NYT feeds are independently sourced and may have data JHU does not.

Idea: plot covid research papers published by keyword.

The virus is a biological meme: about 30K bases of RNA copied from host to host, randomly mutating in a gradual process to improve its fitness. The population-wide human response to a virus is also done by spreading memes. But here each meme is a piece of information about how the virus works and how to stop it, transmitted from person to person, processed and synthesized intentionally. The dynamics of the response seem pretty different.

So far we have only plotted time series for the virus "bad guys".

It would be interesting to see time series for the "good guys": the researchers' ideas circulating about understanding the disease and possible treatments.

Here is a dataset that contains the text of 63627 covid research papers so far. https://www.semanticscholar.org/cord19/download

I have not yet seen time series visualizations of this data. We could simply plot number of cumulative papers by keyword every day. Or we could plot daily appearances of words within paper text or citations.

Questions we should be able to answer: How many papers a day are mentioning Remdesivir, HCQ, etc.? What are this week's biggest percentage gainers?
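As a starting point, cumulative papers per keyword could be computed along these lines. The `{ date, text }` record shape is a placeholder and not the actual CORD-19 schema, and plain substring matching is an assumption that would miss spelling variants.

```javascript
// Hypothetical: cumulative count of papers mentioning a keyword, by date.
// Dates are assumed to be ISO strings (YYYY-MM-DD) so they sort correctly.
function cumulativeMentions(papers, keyword) {
  const needle = keyword.toLowerCase();
  const daily = {};
  for (const { date, text } of papers) {
    if (text.toLowerCase().includes(needle)) {
      daily[date] = (daily[date] || 0) + 1;
    }
  }
  let total = 0;
  return Object.keys(daily).sort().map((date) => {
    total += daily[date];
    return { date, total };
  });
}
```

The output is one point per day on which a matching paper appeared, which is the same cumulative shape the existing case plots use.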

Selected localities should be preserved when changing settings

With the exception of a Domain change, changing settings does not affect the legend. So ideally, changing non-Domain settings should not reset the legend selection.

i.e. I should be able to select US & France, and then tweak things like scale, start, confirmed/death, etc... without losing my country selection.

Better explanations for norm and start and other hidden options.

Currently there are various options that are not clearly explained in the UI. With the current interface it is hard to expect people to

know what the "norm" menu does.
know what the ">=100" menu does.
know you can see county-level data.
know you can directly compare a county to a state to a country.
know that you can click on the colored squares.

We could arrange the select options to grow to show longer names when dropped down, which might help with issues (1) and (2). We could add more explanatory text.

Plot number of tests done.

As requested here, testing data is helpful to know:
https://twitter.com/O_2the_L/status/1245114253524307969?s=20

Some states report number of tests done, so tests done in a locality could be plotted on a time series in some cases.

Available in the CDS data, not from JHU, so depends on issue #2.

raw.githubusercontent is down.

We depend on the JHU CSSE feed being served by raw.githubusercontent.
That website is currently down.

We should not depend on this directly. We should

Load indirectly through a caching CDN. jsdelivr provides URLs that cache raw.githubusercontent. Unfortunately, we can't populate that cache now because the originals are down.
Load in parallel the CDS feed, and merge, as issue #2 describes.
Serve a backup of the data on another domain.

These are hard to fix quickly - it looks like github pages is also not propagating at the moment.

Selection of top N ignores norm=pop

We currently pick the same "top N" entities based on total count, rather than using per-population count if that's what's being graphed.
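One possible fix, sketched under the assumption that each locality carries a `total` and a `population` field (hypothetical names), is to rank by whatever quantity is being graphed:

```javascript
// Hypothetical: pick the top N localities by the plotted quantity.
// With norm === 'pop' the ranking uses the per-capita count instead of the raw total.
function topN(localities, n, norm) {
  const score = (loc) =>
    norm === 'pop' ? loc.total / loc.population : loc.total;
  // Copy before sorting so the caller's array is not reordered.
  return [...localities].sort((a, b) => score(b) - score(a)).slice(0, n);
}
```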

Add autocomplete locality-by-name picker

Use some autocomplete interface to make it possible to easily add county/state/country-level series when you don't know exactly which ones are available.

A widget like this https://autocomplete.trevoreyre.com/#/ could help.

Possible design: after a graph is plotted, we populate the widget with the locality names that are actually nonzero (i.e., not all 3000 counties at first). Selecting one will add it to the "include" list. Maybe add a reset button to clear them.

Add by-state or by-county mapping picker.

Should have a selectable map view for LHS, where an SVG U.S. map is used to show states or counties.

Should support panning and zooming to be practical for clicking counties.
Should allow adding and removing counties to the plot by clicking geographically.

When hovering on a date on the graph, ideally the timeseries stats for that date should also be shown as a choropleth.

Which data file(s) are being used to generate the stats?

I would like to add onto the desktop application I built for worldwide data to include data about the individual states.

The CSV file I'm using for my current code is this time-series file:
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

The entire dataset is in the single CSV file.

For this project's data, are you aggregating the data that is contained in the individual daily reports?
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

It seems like that's the only way to get data at a state level.

Reset button should also reset select state

Currently the reset button resets the "include" state. But it's odd to click reset and see it show a small subset of the default series.

Reset should probably reset the "selected" state also, so the plot can go back to default settings.

Add by-county rollups.

The point of covid19chart.org is to make it easy for local officials to see and compare what is happening in relevant localities. To deliver on this, we should add by-county rollups.

Ideally:

internal data should include FIPS code for a county on every row that has one. CDS and old CSSE data should be translated based on cleaned county names. lib/usa_counties.js might be useful.
initial user interface can be by typing county,ST names into the "include" box, e.g., "Middlesex,MA;Suffolk,MA".
then later we should do a D3 by-county visualization and picker.

Merge data from both JHU CSSE and CDS feeds

We should also load the data from an alternate data source so that the system is more robust.

This one looks good:
https://coronadatascraper.com/timeseries-byLocation.json

Ideally we would always load from both CSSE and CDS and merge the data (e.g., take the max cumulative for a day when both report?) And ideally we would be robust to either data source going down or publishing un-parseable data.

I have factored load_csse.js as a first step.

Add 7day as an alternative to total and delta

The daily deltas are very noisy. A 7-day sliding window delta would be a smoother signal, and probably worth adding as an option.

It would be ideal to change the way the subtraction is done so that the 1st 6 days are not left blank on the visible plot.
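A sketch of one way to do the subtraction so the first days of the visible plot are not blank: compute the windowed delta over the full history (treating days before the first record as zero) and only then trim to the visible range. The function name and the zero-before-start assumption are mine, not existing code.

```javascript
// Hypothetical: sliding-window delta over a cumulative series.
// Days before the first record are treated as a count of zero, and the
// result is trimmed to the visible range only after the subtraction.
function sevenDayDelta(cumulative, visibleStart = 0, window = 7) {
  const deltas = cumulative.map((value, t) => {
    const earlier = t - window >= 0 ? cumulative[t - window] : 0;
    return value - earlier;
  });
  return deltas.slice(visibleStart);
}
```

The zero-before-start assumption is only accurate when the stored series begins before the first case; otherwise the first `window` values overstate the change.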

Related to #12, which will work better with smoothed data.

Might be interesting to @davidebbo

Changing domain can result in an empty chart.

@bleroy reported going to https://covid19chart.org/#/?stat=7day&scale=linear&include=WA&top=0&series=deaths&start=3%2F1%2F20&ratio=&advanced=1 and getting a broken graph on a OnePlus 6T.

May need to get more details.

Hovering over legend highlights i+26 as well as i.

Hovering over legend can highlight multiple series when there are more than 26 shown. This is due to the recycling of class names in chartist.

To see the effect, visit https://covid19chart.org/#/?top=53 and hover over the legend.

Build population out of JHU CSSE table

CSSE added a population column in this table a few days ago; we should use it as our source for population.js instead of CDS

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv

Selected localities should be preserved on the query string

This would make them persistent when refreshing that page, and would allow sharing permalinks with a subset of localities selected.
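A minimal sketch of round-tripping the selection through the query string using the standard `URLSearchParams` API. The `select` parameter name and the `;` separator are assumptions for illustration.

```javascript
// Hypothetical: write the selected localities into a query string,
// leaving the other settings in the string untouched.
function encodeSelection(query, selected) {
  const params = new URLSearchParams(query);
  if (selected.length > 0) {
    params.set('select', selected.join(';'));
  } else {
    params.delete('select');
  }
  return params.toString();
}

// Hypothetical: read the selection back; an absent parameter means no selection.
function decodeSelection(query) {
  const value = new URLSearchParams(query).get('select');
  return value ? value.split(';') : [];
}
```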

Consider changing the default view away from total cases

Maybe there's no perfect answer for everybody, but I think changing the default view to

start=4%2F1%2F20&stat=7day

might make sense.

I also tend to switch to "scale=linear&norm=pop&top=25"

At least that last one shouldn't be necessary once #56 and #57 are fixed.

Apply start rule to population-normalized data (not unnormalized)

I'm not 100% sure about this, but...

When viewing population-normalized data, a >=N start rule should apply to the normalized series, not the unnormalized ones. That is, it should be interpreted as ">=N per-million".

Why?

This would make the normalized plots line up at the bottom-left, as they do in unnormalized mode.
More substantively: I assume that the role of the ">=N start rule" is to imagine giving different countries/states the same "initial condition", to see how their paths diverge from that point. When you're in the per-million view, you are adopting a model where the "condition" of a country/state is the proportion of the state with confirmed cases or deaths, so equal proportions should be your definition of equal "initial conditions".
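To make the proposal concrete, here is a sketch of how the start index might be chosen in each mode. The function name and the optional `population` argument are hypothetical.

```javascript
// Hypothetical: index of the first day that meets the ">=N" start rule.
// When a population is given, the rule is read as ">=N per million".
function startIndex(series, threshold, population) {
  const scale = population ? 1e6 / population : 1;
  return series.findIndex((count) => count * scale >= threshold);
}

const cases = [5, 50, 500, 5000];
console.log(startIndex(cases, 100));            // unnormalized rule
console.log(startIndex(cases, 100, 10000000));  // per-million rule, 10M people
```

With this rule every normalized series begins at roughly N per million, so the curves line up at the bottom-left as they do in unnormalized mode.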

I'd be happy to make the change, if you felt that it would be helpful! (I would also adjust the ">=N" options to be more appropriate to "/pop" mode, when it is selected.)

Thanks a bunch.

Chart doesn't show locality name when choosing an absolute start time

With relative starts (e.g. >= 30), the locality is displayed on the right side of each line. For some reason, this doesn't happen with absolute start times (e.g. 3/23/2020).

Fix UX for first added locality

Suggestion from Jennifer Frazier
https://twitter.com/frazierarchive/status/1245172300200067072?s=20

Hi! I'd love it if your data (county, zip code) was overlaid on the more familiar "baselines" (US, NYC, etc). Then you can do more of a comparative study. Yes - clickable boxes so you could add/toggle views.

We can do this by implementing the following state machine.

Search box. There should be three cases for the behavior of the search box, to optimize for usability after the first item is added.

If it's the first new locality added, and there are no prior legend selections, it is added and made the only selection, so the user can see the added line alone, but also 10 empty easy-to-toggle buttons for other top-10 localities that can be clicked back in easily.
If there is already a selection, we don't change it but make sure the new locality is included and added to the selection so the new line is visible.
If it's not the first added but there is no special selection, we keep everything visible by not changing the selection.

Now the only problem is that in the common case, there are lots of empty boxes for unselected things cluttering the view even when the user doesn't want them. So we also change the reset button. The reset button now has two roles:

If there is a legend selection (some empty colored boxes) it is now the "clean up" button, cleaning up the graph to focus on just the items selected in the legend.
If there is no selection (all the colored boxes are full) it is the "reset" button, going back to the original default view of top-10 localities.
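The state machine above might be modeled like this. The state shape (`top`, `included`, `selection`, where an empty `selection` means everything is visible) and the function names are my own reading of the description, not existing code.

```javascript
// Hypothetical state: { top, included, selection }.
// An empty `selection` means no special selection (all boxes full).
function addLocality(state, name) {
  const included = state.included.includes(name)
    ? state.included
    : [...state.included, name];
  let selection = state.selection;
  if (selection.length > 0) {
    // Existing selection: keep it and make the new locality visible too.
    if (!selection.includes(name)) selection = [...selection, name];
  } else if (state.included.length === 0) {
    // First added locality with no prior selection: show it alone.
    selection = [name];
  }
  // Otherwise: not the first addition and no selection, so leave all visible.
  return { ...state, included, selection };
}

function pressReset(state) {
  if (state.selection.length > 0) {
    // "Clean up": keep only the selected localities on the graph.
    return { top: 0, included: [...state.selection], selection: [] };
  }
  // "Reset": back to the default top-10 view.
  return { top: 10, included: [], selection: [] };
}
```

Representing "clean up" as `top: 0` plus an explicit include list is one interpretation; the real code may track this differently.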

Set default starting threshold at >=80 instead.

Alan Warren wrote to observe that the >=30 starting threshold visualizes some weird artifacts having to do with Diamond Princess accounting. Basically, more than 30 passengers were flown to the US in mid-Feb, and JHU starts counting these as US cases on 2/23 (arbitrarily), which means that's the date we start counting the US as passing the threshold.

I do think these US cases should be included and plotted, but I agree that that part of the plot shows the effects of a different policy regime where every case was being well-separated and scrutinized, unlike today.

Maybe it would be clearer if the default starting threshold were >=80, which skips past the time period with these differences. All the countries and states have a day where they have (logarithmically) a bit more than 80 known cases, so it's a good tightly-clustered threshold.

Experiment with log-log plot of recent changes

Idea from here, https://www.youtube.com/watch?v=54XLXg4fYsc

This idea should probably be plotted on a different webpage.

The idea of this plot is to understand whether a locality is still on the exponential domain or if we have succeeded in leaving it (or if we are resuming it). The idea is to plot log(recent changes) on the y axis versus log(less-recent changes) on the x axis. While we're on exponential growth, this will be on the line given by the exponent, but will quickly depart when not.

The youtube video described graphing log(weekly change) vs log(total cumulative) - but maybe having both sides be a delta would be able to show when (e.g., in Japan) exponential growth restarts after the society resumes social behavior too early.
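A sketch of the delta-vs-delta variant, assuming a cumulative daily series and a configurable window (7 days by default). The function name and the output shape are hypothetical.

```javascript
// Hypothetical: points for a log-log plot of the most recent window's change (y)
// against the preceding window's change (x). Windows with no change are
// skipped because their logarithm is undefined.
function logLogPoints(cumulative, window = 7) {
  const points = [];
  for (let t = 2 * window; t < cumulative.length; t++) {
    const recent = cumulative[t] - cumulative[t - window];
    const previous = cumulative[t - window] - cumulative[t - 2 * window];
    if (recent > 0 && previous > 0) {
      points.push({ x: Math.log10(previous), y: Math.log10(recent) });
    }
  }
  return points;
}
```

Under pure exponential growth the ratio of consecutive window changes is constant, so every point falls on a line of slope 1 with a fixed offset; points drifting below that line indicate the locality is leaving the exponential regime.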

Allow selection of different sources: Coronadatascraper, JHU

Thanks for the very useful tool!

Coronadatascraper has better data for California; it matches that from the local health departments. I'd love to have the option of selecting CDS as a data source (instead of JHU). I see that it has a CSV file in JHU format. I haven't compared the files in detail; I'm sure that there are some minor differences. Even so, I confess I trust CDS (at least for California) more.

thanks!

Hovering over a legend should highlight the touched series

The chart can get very crowded and difficult-to-read, esp in log threshold mode where all the time series are (informatively) drawn right on top of each other. But to see the locality that you care about in the pile, it should be made visible when hovering over the legend. (In the below - which is FL vs LA or CA vs WA?)

We should consider adding a legend hover event that:
(1) Dims all the non-hovered lines, e.g., adding an opacity < 1 to them.
(2) Brings the hovered line to z-index 10 to be in front.
(3) Makes the label on the hovered sequence more contrasty and more visible (might require looking at theme css).

Advanced option to normalize log >=100 plots by first day

Idea from Alan Warren.

Add an (advanced) mode so that the log >=100 plot can normalize all series so the first day is 1.0.

Currently whatever threshold >=80, >=100 etc you choose, the first day has a large artificial vertical offset since the daily growth rates are so high. E.g., first day in Michigan >= 100 is already 334, so the whole Michigan line floats over the others, even with a similar slope.

One solution is to normalize everything by that day, i.e., report "Total cumulative cases (log scale, normalized to first day)." This should be an option, at least in advanced mode.
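The normalization could be as simple as the following sketch (hypothetical function name), which trims the series at the threshold and divides by the first retained value:

```javascript
// Hypothetical: start the series at the first day >= threshold and
// rescale so that day's value is 1.0.
function normalizeToFirstDay(series, threshold) {
  const start = series.findIndex((v) => v >= threshold);
  if (start < 0) return [];
  const first = series[start];
  return series.slice(start).map((v) => v / first);
}

// A series whose first day at or above 100 is already 334.
console.log(normalizeToFirstDay([50, 334, 668, 1002], 100));
```

On a log scale this is a pure vertical shift, so slopes (growth rates) are unchanged and only the artificial offset disappears.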

Spanish localization

In the U.S. there is a substantial primarily Spanish-speaking community. We should factor the page to support localization, and then have a Spanish-language localized version.

The second largest non-English community is Chinese.

Selection of top N could be improved

NY is chosen as the "top state" even if a graph starts on 6/1:

https://covid19chart.org/#/?advanced=-4&start=6%2F1%2F20&top=1&stat=daily

when from

https://covid19chart.org/#/?advanced=-4&start=6%2F1%2F20&top=10&stat=daily

it clearly shouldn't be.

new metric idea: ratio of current 7-day growth vs. previous 7-day growth

Now that total cumulative cases are a less-useful metric, could it make sense to try to estimate current spread by computing the ratio between this week's growth and the previous week's?
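A sketch of what the metric might look like, assuming a cumulative daily series (the function name is hypothetical). A value above 1 would mean this week grew faster than last week, and below 1 slower.

```javascript
// Hypothetical: ratio of the latest window's growth to the previous window's growth
// at day index t. Returns null when there is not enough history or the previous
// window saw no growth.
function weeklyGrowthRatio(cumulative, t, window = 7) {
  if (t >= cumulative.length || t - 2 * window < 0) return null;
  const thisWeek = cumulative[t] - cumulative[t - window];
  const lastWeek = cumulative[t - window] - cumulative[t - 2 * window];
  return lastWeek > 0 ? thisWeek / lastWeek : null;
}
```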

After legend selection, searching for a new location results in a blank graph

Repro:
(1) Default view, US states.
(2) Select NY.
(3) Enter "RI" into the search box.
Observe: RI is unselected, no graph line.

Suggestion: maybe instead of resetting selection when domain changes, reset selection when nothing selected is included in the plot.

Add option to plot statistics in terms of population density.

We should add an option to plot statistics in terms of population density.

This might reveal, for example, the terribly high rates of infection in Colorado Springs, which are not revealed by the currently graphed stats.

The coronadatascraper feed comes with by-county population information, and so could be done as part of #2 and #3.
