This GitHub repository contains a downloadable snapshot of the National Institute of Standards and Technology (NIST) COVID-19 Data Repository, curated from the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI.
The COVID-19 Data Repository provides searchable CORD-19 data and metadata, including full text extracted from the original CORD-19 JavaScript Object Notation (JSON) files and entities identified using the en_ner_bionlp13cg_md named-entity recognition (NER) model trained on the BIONLP13CG corpus. It is built using the Configurable Data Curation System (CDCS) developed at NIST.
The purpose of this repository is to provide a platform-neutral means for bulk downloads of curated COVID-19 data. The downloadable archives are versioned using GitHub Releases, based on the Data Repository's schema and time-stamped archival dates. This makes it easier to access the latest data programmatically, or to pin a specific release as a dependency for reproducibility.
To download, head over to the releases page and select a desired release and zip-archived format, or simply download the latest JSON, XML, or CSV versions at those links directly.
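For programmatic access, the latest release can also be looked up through the GitHub REST API. The following is a minimal sketch, not an official client: the `owner` and `repo` arguments are placeholders to fill in with this repository's coordinates, and matching assets by a keyword in the file name (for example `"json"`) is an assumption about how the archives are named, so check the releases page before relying on it.

```python
import json
import urllib.request

# GitHub REST API endpoint for the most recent published release of a repository.
API_TEMPLATE = "https://api.github.com/repos/{owner}/{repo}/releases/latest"


def latest_release_url(owner, repo):
    """Build the API URL for the latest release of owner/repo."""
    return API_TEMPLATE.format(owner=owner, repo=repo)


def fetch_latest_release(owner, repo):
    """Fetch the latest release metadata as a dict (requires network access)."""
    request = urllib.request.Request(
        latest_release_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pick_asset(release, keyword):
    """Return the download URL of the first asset whose name contains keyword.

    Matching on the file name is an assumption about asset naming;
    returns None when no asset matches.
    """
    for asset in release.get("assets", []):
        if keyword.lower() in asset["name"].lower():
            return asset["browser_download_url"]
    return None


def download_latest(owner, repo, keyword, destination):
    """Download the matching asset of the latest release to destination."""
    release = fetch_latest_release(owner, repo)
    url = pick_asset(release, keyword)
    if url is None:
        raise LookupError(f"no asset matching {keyword!r} in {release.get('tag_name')}")
    urllib.request.urlretrieve(url, destination)
    return release.get("tag_name")
```

Calling `download_latest(owner, repo, "json", "cord19-latest.zip")` with the real repository coordinates would save the matching archive locally and return the release tag, which can be recorded alongside an analysis to document exactly which snapshot was used.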
To further support rapid, reproducible data science workflows, this repository builds data packages that interface directly with common statistical programming languages. These packages are used through separately installable libraries that bring together data and tools for analyzing the CORD-19 data in one convenient place:
Language | Repository |
---|---|
Python | cv-py |
More languages are certainly possible, depending on community need. Data packages can be downloaded directly from this repository's releases page, or by following the instructions in the language-specific repositories above. More information can be found in the README inside each language-specific `<lang>-interface` folder.