cmu-delphi / delphi-epidata
An open API for epidemiological data.
Home Page: https://cmu-delphi.github.io/delphi-epidata/
License: MIT License
Currently, data ingestion accepts the following geographical types:
We currently have an indicator waiting in the wings whose ideal geographical type is HHS region. To begin serving it from the API, ingestion needs to know how to validate HHS region ids.
Further reading:
the python client integration test consists of hitting the covidcast endpoints. with #67, the covidcast integration tests will use the python client to hit the server. these are equivalent and redundant.
it might make more sense for the client integration test to validate non-source-specific behavior (e.g. handling of lists and ranges) instead of particular endpoints.
One issue raised repeatedly about the COVIDcast map is that the color scale is based on the minimum, maximum, and standard deviation of the entire signal history. If a signal has changed significantly over time, the map can hence be poorly scaled.
Figuring out how to adjust the scale dynamically when the user switches days is one problem; but before the map can do that, the covidcast_meta endpoint would have to provide metadata about a specific date or date range. For example, if we said the scale is based on the variation over the past month, we'd need to be able to request the scale for 2020-04-20 and get the variation over the preceding month.
Unfortunately this would break the caching strategy we currently use, so it also remains to be seen whether there is an efficient way to do this. I wonder whether the strategy might end up being "just stick Varnish in front of the API server" instead of a clever caching system, but I don't know what will prove best.
Hello,
This is a question about understanding the data presented for covidcast endpoint. I'm a newbie to this api and trying to understand the data.
I'm trying to read daily data for US counties (fb-survey), and the results I'm getting from the API are value and sample_size. There's not much information available about what 'value' indicates.
The website https://covidcast.cmu.edu/ shows a US map with percentages. I did not understand what percentage we are looking at, or how to get a percentage from value?
Hello there!
Thanks for this amazing package. I would like to know if I can use delphi-epidata to retrieve the original, unrevised ILI rate estimates available during a given week (at the HHS or state level).
That is, the equivalent of https://www.cdc.gov/flu/weekly/weeklyarchives2013-2014/data/senAllregt08.htm but at the HHS or state level (instead of at the national level).
Is this information available through this package?
Thanks!
One of the recent direction runs took 90 minutes to complete, which seems excessive.
I suspect we are doing something that was easy to implement but not terribly scalable, like updating the direction for all dates and all geo_ids when only a small number of them were invalidated, and perhaps also using a separate query for each time series with updates (there were 72546 of those in the 90-minute run).
If it turns out the database code isn’t the issue, we might have to think about moving the direction computation outside of epidata and into the individual indicators.
e.g. using https://fastapi.tiangolo.com/, which is a nice layer that is production-ready and offers good integrations as well as testing capabilities.
atm. each result is wrapped in a {result, epidata, message} construct. It would simplify things if there were a mode in which the array is directly returned as a flat list.
result and message could be transported using regular HTTP status codes (e.g., https://httpstatuses.com/401 not authorized) or custom HTTP response headers (e.g., a has_more flag).
Especially since county-level responses are super large.
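To make the idea concrete, here is a minimal sketch of such a flat-response mode. The code-to-status mapping below is a hypothetical illustration, not the API's actual contract:

```python
# Sketch of a flat-response mode: unwrap the {result, epidata, message}
# envelope into a bare list and carry the result code via an HTTP status
# instead. The RESULT_TO_HTTP mapping is an assumption for illustration,
# not the API's actual contract.

RESULT_TO_HTTP = {
    1: 200,    # success
    -2: 404,   # no results
    -1: 400,   # invalid request / unauthenticated
}

def unwrap(envelope):
    """Return (http_status, flat_rows) for a wrapped API response."""
    status = RESULT_TO_HTTP.get(envelope.get("result"), 500)
    rows = envelope.get("epidata") or []
    return status, rows

status, rows = unwrap({"result": 1, "epidata": [{"value": 0.5}], "message": "success"})
# status == 200, rows == [{"value": 0.5}]
```

The flat list would then be the whole response body, with result/message moved out of band.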
Hi there!
My name is Bin and I work for Verily Life Sciences. I graduated from CMU (SCS 2013) so I'm very glad to see the news today that Delphi and FB are working together on the self reported symptom survey map:
https://covid-survey.dataforgood.fb.com/
I have since played with the API listed here. My main question: is this API the same one that supports the FB map? For example, this API only gives me a subset of the zip codes:
https://delphi.cmu.edu/epidata/api.php?source=covidcast&data_source=fb-survey&signal=cli&time_type=day&geo_type=county&time_values=20200406-20200410&geo_value=*
As an example, the result does not contain Denver area (80xxx).
Thanks in advance for your help!
problem: all the db queries resulting from an api request include a LIMIT that effectively truncates the results returned when they exceed a fixed size.
impact: as the data and queries for that data grow, this may mean that customers won't always get all the data they expect, and will have no way to get what they missed. this is exacerbated by the lack of guarantees around the order of the data they get, meaning they might have holes in the data without realizing it, which could impact the validity of how they use the data.
proposal: add support for pagination to the api. example reference: https://www.allphptricks.com/create-simple-pagination-using-php-and-mysqli/
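As a rough sketch of how pagination could behave (the `page`/`per_page` parameter names and the `has_more` flag are hypothetical, not part of the current API):

```python
# Sketch of offset-based pagination over query results. The `page` and
# `per_page` parameters and the `has_more` flag are hypothetical names
# for illustration; a real implementation would also need a stable
# ORDER BY so consecutive pages never overlap or skip rows.

def paginate(rows, page, per_page=100):
    """Return one page of results plus a flag indicating more pages exist."""
    start = (page - 1) * per_page
    chunk = rows[start:start + per_page]
    has_more = start + per_page < len(rows)
    return {"epidata": chunk, "has_more": has_more}

page1 = paginate(list(range(250)), page=1, per_page=100)  # 100 rows, has_more=True
page3 = paginate(list(range(250)), page=3, per_page=100)  # 50 rows, has_more=False
```

In SQL terms, each page maps to `LIMIT per_page OFFSET (page - 1) * per_page` on an ordered query.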
The current API clients, in src/client/, are pretty generic and apply to the whole Epidata API. It would be great to build on these to have more fully-featured COVIDcast clients. Specifically they should include:
get_daily_data_df: return an R or Pandas data frame for a specific signal.
We should aim for R and Python, since those will be the most common use cases.
The API is very inconvenient for one of our users because they don’t use R or Python and they’re literally running API queries manually, then running the JSONs they find through online converters to get CSVs.
For now, we can put up a python server somewhere and have it do the transformation as a middleman, to make their workflow a little less precarious.
Long-term we should consider supporting CSV-formatted output directly. What might make it tricky is the tight integration with the rest of Epidata, because this is how api.php currently ends:
// send the response as a json object
header('Content-Type: application/json');
echo json_encode($data);
?>
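For the interim middleman server, the JSON-to-CSV step could be a small transformation like the following sketch (assuming only that responses use the documented {result, epidata, message} envelope):

```python
import csv
import io

# Sketch: turn the API's {result, epidata, message} JSON envelope into CSV
# text. Column names are taken from the first row, so this assumes all rows
# share the same keys, which holds for a single-endpoint query.

def epidata_to_csv(envelope):
    rows = envelope.get("epidata") or []
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Serving this behind a single endpoint would let the user download CSVs directly instead of round-tripping through online converters.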
This is a more serious problem with larger data imports, but even for small updates it can interact poorly with automated jobs, causing them to spuriously fail. It's not clear why this happens, and testing it outside of production may prove tricky.
It's particularly bothersome for us at the moment, because the COVIDcast indicator pipelines depend on API calls for validation; when an automation job halts due to this issue it often means losing work (though not losing data, to my knowledge), which then requires human attention to fix and re-run manually.
it is a common format for defining date data, both for parameters and return values.
filtering options:
would reduce the metadata size from around 98kB to 15kB for the current signals used in the website.
delphi-epidata/src/client/delphi_epidata.js
Lines 473 to 477 in 8921bd0
Hello, I was looking at the official https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html and my understanding is that only the final ILI values are reported.
Despite one's best efforts, sometimes there are mistakes or parsing errors. How can I check that the historical ILI values (those that are revised over time) are correct when using delphi-epidata?
I am thinking about running some manual checks for a few random dates.
I am thinking about running some manual checks for a few random dates.
Thanks again for this great API!
problem: the python client integration tests currently hit the prod api since the endpoint is hardcoded into the client.
impact: lack of isolation. the success of the tests depends on whether prod is up and working. also, the tests unnecessarily add test load and risk to prod.
proposal: make client integration tests hit the local docker web server instead of the prod api so as to provide some isolation and make things more self-contained.
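One way to do this, sketched with a stand-in class (the real Python client keeps its endpoint in a BASE_URL attribute; the local port below is an assumption for illustration):

```python
# Sketch of pointing a client at the local docker web server for
# integration tests. Client is a stand-in for a client whose endpoint
# lives in a class attribute; the test setup rewrites the attribute
# before any request is made, so no test traffic reaches prod.

class Client:
    BASE_URL = "https://delphi.cmu.edu/epidata/api.php"  # prod default

    @classmethod
    def request_url(cls, params):
        query = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
        return f"{cls.BASE_URL}?{query}"

# in the integration-test setup:
Client.BASE_URL = "http://localhost:10080/epidata/api.php"
url = Client.request_url({"source": "covidcast_meta"})
# url now targets the local server, not prod
```

A test fixture would restore the original value afterwards so other code is unaffected.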
Hello, thanks for all your work here. I was wondering if there is information on the source and nature of the consensus data source and signals in the COVIDcast API. I haven't found anything across any of the repos and websites.
UCSF runs COVID19 Citizen Science (https://covid19.eurekaplatform.org), which also collects daily syndrome data. It's a much smaller data source, but we appear to have more granular symptom data, and would like to consider contributing. I am one of the PIs (not a developer), so pardon my ignorance. Questions:
The covidcast_meta endpoint takes ~10 seconds to respond, and this latency will increase as the covidcast collection grows. To enable visualization on the web, it's possible to fetch a cached version of the metadata by using an optional parameter, cached, which takes only ~200 ms.
Compare:
Exposing the cached parameter in client libraries would be useful for developers using the programmatic interface to the API.
Some notes:
the covidcast_meta endpoint will not return cached data that's older than 75 minutes; instead it will gracefully fall back to the live, non-cached data.
Most COVIDcast signals come in two flavors: smoothed and raw.
Considering cmu-delphi/www-covidcast#TBD, it might be nice to explicitly combine raw+smoothed pairs of signals instead of publishing them separately, and provide them as an extra column in the API response (even if they are still stored separately in the database).
This would also help with cmu-delphi/covidcast-indicators#67 and cmu-delphi/covidcast-indicators#36, since it would permit us to display the raw signal in time series charts without interfering with map coloring. The raw time series does not display the misleading sawtooth pattern.
Since moving to a data versioning scheme, there is no longer any way to remove a row from COVIDcast without removing all previous versions of that row as well (so that it's as if it was never published at all). This is hazardous -- leaving the row in is inaccurate, and removing the row gives forecasters access to future-privileged information that will not match realtime usage.
We are developing a survey of different kinds of missingness and deletions that occur in the different COVIDcast sources here to help spec out an encoding system.
Some additional conversation on this is in a thread on the first set of performance fixes, but it looks like the column additions mentioned there didn't actually make it into staging this time around.
The database has a 32-character limit on the length of signal names. This is irregularly enforced, and it causes Automation to fail when the limit is transgressed.
Signal names are stored in the data_stdevs[source][signal] dictionary exactly as they are read from an ingested filename. When they are inserted into the database, they get truncated to 32 characters. The "Compute Missing/Stale Covidcast Direction" job reads signal names out of the database and expects to find them in the data_stdevs[source][signal] dictionary, but the truncated names are not listed there. The job fails with a KeyError:
Traceback (most recent call last):
File "/home/automation/.pyenv/versions/3.4.10/lib/python3.4/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/home/automation/.pyenv/versions/3.4.10/lib/python3.4/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/automation/driver/delphi/epidata/acquisition/covidcast/direction_updater.py", line 189, in <module>
main(get_argument_parser().parse_args())
File "/home/automation/driver/delphi/epidata/acquisition/covidcast/direction_updater.py", line 177, in main
update_loop_impl(database)
File "/home/automation/driver/delphi/epidata/acquisition/covidcast/direction_updater.py", line 137, in update_loop
data_stdev = data_stdevs[source][signal][geo_type]
KeyError: 'wip_confirmed_7day_avg_cumulativ'
We should either extend the character limit in the database schema (perhaps at the same time we add in issue dates), or truncate the signal name after it gets read out of the filename and before it gets loaded into data_stdevs[source][signal].
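The second option could look like this sketch, which mirrors the data_stdevs[source][signal] structure described above (the stdev value is a placeholder):

```python
# Sketch of truncating signal names at read time so in-memory keys match
# the database's 32-character column. MAX_SIGNAL_LEN mirrors the schema's
# column width; the stdev value 1.23 is a placeholder for illustration.

MAX_SIGNAL_LEN = 32

def normalize_signal(signal):
    return signal[:MAX_SIGNAL_LEN]

data_stdevs = {}
source, signal = "jhu-csse", "wip_confirmed_7day_avg_cumulative"
data_stdevs.setdefault(source, {})[normalize_signal(signal)] = 1.23

# a name read back from the database (already truncated) now matches:
lookup = data_stdevs[source]["wip_confirmed_7day_avg_cumulativ"]
```

Applying the same normalization at every boundary avoids the KeyError shown in the traceback.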
https://www.sqlalchemy.org/ provides a nice level on top of any python DB connection and there are several libraries on top of it.
I appreciate what you've done and continue to do in epi-forecasting. But it seems to me like the problem in front of us is not so much lack of information as it is lack of spine. I think epidemiologists need to join together and say that we're (our governments are) doing the exact worst thing. I mean, is there an epidemiologist out there who thinks that the current approach is appropriate? It seems to me like anyone with expertise and foresight must believe that we're either doing too much or not enough, and that a shift in either direction would be better than the status quo.
What’s the plan?
•••••••••••••••••
Plan? MY view: We MUST either step up and tighten (w/ quarantines AND/or testing) our state and/or country borders AND get serious about tracing AND other steps to wipe it out, OR allow a controlled spread, by unlocking from youngest to oldest, as that is the fastest approach to restoring some kind of normality. Politician approval rates are terrible for good reason. I HATE our empowered politicians, who are (AFAIK) ALL too spineless to do anything but think unbelievably short term. So we've got the worst of both worlds. We're not wiping out the virus AND we're destroying the economy. We're doing the one thing stupider than opening up the economy further, which is not tightening it up enough to wipe out the virus and keep it out. As Mr. Rogers said, I'm so angry I could bite. I don't hear anybody asking our president or governor or mayor what the long-term plan is!!! We/our media must ASK them, and more, and demand answers and report! Leaders and health officials seem too ignorant or power-hungry to understand how a privacy-preserving app could be very effective; their myopia is maddening!
Please forgive me if I'm unable to keep entirely to the issue system guidelines a wee bit here. I just had to get this off my chest. No, I've never done this before, and I expect I won't again. Crazy times. You can just close this. Please don't delete it. Or do, but a friendly word would really mean a lot right now. Thanks 🙏.
Some of the URL links in README.md are broken.
Hey, noticed this morning that https://delphi.cmu.edu/epidata/api.php?source=covidcast_meta is not responding, has anything changed with the way to access this endpoint?
Have been using the sample string provided at: https://cmu-delphi.github.io/delphi-epidata/api/covidcast_meta.html
Thanks!
We occasionally want to remove a span of data from the covidcast API with the following constraints:
For the moment, we're just sacrificing the first constraint and having an admin do a DELETE FROM, but it would be less strain on the rapidly-shrinking devops team if sensor groups could do this themselves through the existing ingestion infrastructure. Maybe a magic value?
This would require the API serving routine and the metadata generator to be aware of whatever we design, so that the relevant spans are excluded.
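A minimal sketch of the "magic value" idea, with a hypothetical sentinel constant chosen purely for illustration:

```python
# Sketch of the "magic value" idea: a sentinel row pushed through normal
# ingestion marks a cell as deleted, and the serving routine and metadata
# generator drop such rows before responding. DELETED_SENTINEL is a
# hypothetical choice for illustration only.

DELETED_SENTINEL = -9999999

def is_tombstone(row):
    return row.get("value") == DELETED_SENTINEL

def filter_deleted(rows):
    return [r for r in rows if not is_tombstone(r)]

rows = [
    {"geo_value": "pa", "value": 1.2},
    {"geo_value": "ny", "value": DELETED_SENTINEL},  # span removed by sensor group
]
served = filter_deleted(rows)  # only the "pa" row is served
```

Because the sentinel arrives as an ordinary versioned row, the deletion itself stays in the version history.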
None of the signals update more than once a day, so we could get a substantial performance boost in the map if we allowed caches to stay good for a few hours.
Hello,
I was wondering if it's possible to put the client code as packages on the corresponding language repositories (PyPI, npm, and CRAN). That would make using/setting them up easier than the current method, where users need to pull in code updates manually.
I am not certain of the self reporting parameters you use in this "data" gathering exercise. Usually, if the data environment is not closely controlled - the very essence of self reporting - the data collected are between nearly useless and totally useless. Let's take Navajo and Apache Counties in Arizona in mid April. We are in such a high pollen environment, we get pollen danger notifications on our smartphones. This is mostly for the respiration challenged and the allergic but it is useful to all. Symptoms include difficulty breathing, mild to severe sinus headaches, persistent dry cough, occasional mild fevers, etc. Sound familiar? How many of those people have responded with positive covid symptoms out of simple fear that their allergic response is covid 19? What we need rather than more of this near useless "data" is massive nationwide testing. Perhaps Zuckerberg could throw a billion or two at that problem. Dr. Ronald L Rabie, Los Alamos National Laboratory, Retired.
problem: the auth parameter is optional for the sensors api:
delphi-epidata/src/server/api.php
Lines 1309 to 1314 in c48af8f
however, the clients are all incorrectly requiring the auth parameter to access sensors, e.g.:
delphi-epidata/src/client/delphi_epidata.py
Lines 459 to 463 in c48af8f
impact: customers of the clients will be incorrectly restricted from using the sensors source without auth.
proposal: make auth optional for sensors in the clients.
problem: fluview_clinical queries seem to always return duplicate results, even accounting for the same release date.
from dfarrow0@ offline:
I checked, and can confirm they are duplicated in the database. I think it is not supposed to be this way. I suspect a problem with the unique key constraint.
impact: if customers don't expect duplicates, their usage of the data may be incorrect.
proposal: if the duplicates are unexpected, fix the bug causing them. if they are expected, update the documentation for this source to make it clear to customers.
example: https://delphi.midas.cs.cmu.edu/epidata/api.php?source=fluview_clinical&regions=nat&epiweeks=202001 currently returns:
{"result":1,"epidata":[{"release_date":"2020-04-13","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-16","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-11","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-14","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-12","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-15","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-10","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-14","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-11","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-15","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-13","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-16","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-10","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-14","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-11","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461},{"release_date":"2020-04-15","region":"nat","issue":202014,"epiweek":202001,"lag":13,"total_specimens":64980,"total_a":5651,"total_b":9647,"percent_positive":23.5426,"percent_a":8.69652,"percent_b":14.8461}],"message":"success"}
notice, for example, the row for release_date 2020-04-14 occurs 3 times.
Hey there!
So I can use the Python library to pull the ght data by state:
delphi_epidata.Epidata.covidcast('ght', 'smoothed_search', 'day', 'state', [delphi_epidata.Epidata.range(int(start_date), int(end_date))], 'CA')
However, pulling by county doesn't return any result:
delphi_epidata.Epidata.covidcast('ght', 'smoothed_search', 'day', 'county', [delphi_epidata.Epidata.range(int(start_date), int(end_date))], '08013')
# Previously I tried '*' to pull all county data from the fb signal, which doesn't work here either.
Could you please help with the correct API call? Thanks!
e.g., when querying for a specific geo_value, one might want to exclude the geo_value from the response to save space. like an optional field in which the user defines the list of fields to return.
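A minimal sketch of such a field projection (the `fields` parameter name is hypothetical):

```python
# Sketch of an optional `fields` parameter (hypothetical name): the user
# lists the columns to return and everything else is dropped from each
# row before serialization, shrinking large county-level responses.

def project(rows, fields=None):
    if not fields:
        return rows  # no filter requested: return rows unchanged
    return [{k: r[k] for k in fields if k in r} for r in rows]

rows = [{"geo_value": "42003", "time_value": 20200406, "value": 0.8}]
slim = project(rows, fields=["time_value", "value"])
# slim == [{"time_value": 20200406, "value": 0.8}]
```

For the single-geo_value case above, the client already knows the geo_value, so omitting it loses nothing.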
The scraper raises an exception for public health lab data saying that the header row has changed. I temporarily disabled scraping here.
Stack trace (some info stripped):
Traceback (most recent call last):
[...]
File ".../epidata/acquisition/fluview/fluview_update.py", line 541, in <module>
main()
File ".../epidata/acquisition/fluview/fluview_update.py", line 538, in main
update_from_file_public(issue, date, filename, test_mode=args.test)
File ".../epidata/acquisition/fluview/fluview_update.py", line 390, in update_from_file_public
data = [get_public_data(row) for row in rows]
File ".../epidata/acquisition/fluview/fluview_update.py", line 390, in <listcomp>
data = [get_public_data(row) for row in rows]
File ".../epidata/acquisition/fluview/fluview_update.py", line 267, in get_public_data
raise Exception('header row has changed for public health lab data.')
Exception: header row has changed for public health lab data.
Please take a look at your convenience.
problem: after ingestion of a covidcast file, the file is archived. if another file already exists in the archive with the same name, that file is overwritten, whether the ingestion succeeded or failed.
impact: while this may be acceptable for failed ingestions given the current logging, i am a little concerned about the potential silent overwrite in the case of success. for example, if you somehow get a truncated version of a file that was already ingested, having both versions archived could be useful for both determining the extent of the problem and for quickly back-filling the lost data.
proposal: perhaps consider adding a timestamp to the archived names of successfully ingested files?
Originally posted by @pedritom-amzn in #70
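The proposal above could be as simple as this sketch (the name format is an assumption for illustration):

```python
from datetime import datetime
from pathlib import Path

# Sketch of the proposal: suffix each successfully ingested file with an
# ingestion timestamp when archiving, so a later file with the same name
# can never silently overwrite an earlier archive copy. The name format
# is an assumption for illustration.

def archived_name(filename, now=None):
    now = now or datetime.utcnow()
    p = Path(filename)
    return f"{p.stem}.{now.strftime('%Y%m%d%H%M%S')}{p.suffix}"

name = archived_name("20200406_county_cli.csv", datetime(2020, 4, 7, 1, 2, 3))
# name == "20200406_county_cli.20200407010203.csv"
```

Both the original and any truncated re-upload would then coexist in the archive, supporting the back-fill scenario described above.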
problem: the api for the wiki source now requires the language parameter:
delphi-epidata/src/server/api.php
Lines 1204 to 1205 in c48af8f
however, while the python client supports it, none of the other 3 clients (R, js, coffee) do. for example:
delphi-epidata/src/client/delphi_epidata.R
Line 240 in c48af8f
impact: customers won't be able to use these three clients for accessing the wiki source.
proposal: add the language parameter as a required parameter for the wiki source in these three clients.
the covidcast integration tests should use the epidata client now that it supports the covidcast endpoints
We never mention source:
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#constructing-api-queries
The proposed tree format for multi-signal queries is really a group-by. We should consider permitting a group_by parameter that would generate this structure for any parameter that accepts multiple values (time_value certainly; geo_value might hurt).
On https://cmu-delphi.github.io/delphi-epidata/api/, the link to the definition of epiweeks ("this page") is broken.
In theory the integration tests should have caught this typo; in practice everything passed and then promptly crashed in production. We need to
There is some failure case that's not being adequately logged:
handling /common/covidcast/receiving/jhu-csse/20200828_state_deaths_7dav_cumulative_prop.csv
deaths_7dav_cumulative_prop False
archiving as failed - jhu-csse
Correct handling of a failed CSV is logged like this:
handling /common/covidcast/receiving/jhu-csse/20200828_county_deaths_incidence_num.csv
deaths_incidence_num False
invalid value for Pandas(geo_id='.0000', val='7.0', se=nan, sample_size=nan) (geo_id)
archiving as failed - jhu-csse
A Kramdown vulnerability came out over the weekend:
https://github.com/cmu-delphi/delphi-epidata/network/alert/docs/Gemfile.lock/kramdown/open
I attempted to use the auto tool to update it, but dependabot wasn't able to find its way out of a dependency conflict.
I am trying to retrieve a lot of data from the DELPHI API, but am having trouble getting all the data I am requesting. Is there a limit that I am running into? For example, I run
source("https://raw.githubusercontent.com/cmu-delphi/delphi-epidata/master/src/client/delphi_epidata.R")
res <- Epidata$fluview(regions = list("nat", "hhs1", "hhs2", "hhs3", "hhs4", "hhs5", "hhs6", "hhs7", "hhs8", "hhs9", "hhs10"),
epiweeks = list(Epidata$range(199740, 201653)),
issues = list(Epidata$range(199740, 201653)))
df <- do.call(rbind, lapply(res$epidata, rbind))
And I end up with data from only regions "nat" and "hhs1" and only until epiweek 200450. The total number of rows in the resulting dataframe is 3650.
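This looks like the server's fixed row limit (see the pagination issue above). Until pagination exists, a client-side workaround is to split the request into smaller chunks, sketched here with a stand-in `fetch` function:

```python
# Sketch of a client-side workaround for the server's row limit: split a
# large epiweek range into smaller chunks and concatenate the results.
# `fetch` stands in for the real API call; weeks are treated as plain
# integers here for illustration (real epiweeks need year-aware arithmetic).

def chunk_ranges(start, end, size):
    """Yield (lo, hi) inclusive sub-ranges covering [start, end]."""
    lo = start
    while lo <= end:
        hi = min(lo + size - 1, end)
        yield lo, hi
        lo = hi + 1

def fetch_all(fetch, start, end, size=50):
    rows = []
    for lo, hi in chunk_ranges(start, end, size):
        rows.extend(fetch(lo, hi))
    return rows

# stand-in fetch returning one row per week
rows = fetch_all(lambda lo, hi: list(range(lo, hi + 1)), 201001, 201020, size=5)
# rows covers all 20 weeks in order
```

Querying one region per request, with epiweek ranges chunked this way, keeps each response under the limit.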