cord19-cdcs-nist's Introduction

Curated Archive for COVID-19 Research Challenge Dataset (cord19-cdcs-nist)

This GitHub repository contains a downloadable snapshot of the National Institute of Standards and Technology (NIST) COVID-19 Data Repository, curated from the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI.

The COVID-19 Data Repository provides searchable CORD-19 data and metadata, including full text extracted from the original CORD-19 JavaScript Object Notation (JSON) files and entities identified using the en_ner_bionlp13cg_md NER model trained on the BIONLP13CG corpus. It is built using the Configurable Data Curation System (CDCS) developed at NIST.

Downloading the Data

The purpose of this repository is to provide a platform-neutral means for bulk downloads of curated COVID-19 data. The downloadable archives are versioned using GitHub Releases, based on the Data Repository's schema and time-stamped archival dates. This makes it much easier for users to access the latest data programmatically, or to pin a specific release as a dependency for reproducibility.

To download, head over to the releases page and select the desired release and zip-archived format, or download the latest JSON, XML, or CSV version directly.
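For programmatic access, the latest release can be looked up through the GitHub REST API. The following is a minimal sketch, not the project's own tooling. The owner name `usnistgov` and the keyword used to match asset names are assumptions, since the original links and asset names are not shown here.

```python
import json
import urllib.request

# Assumption: the repository is hosted under this owner/name on GitHub.
OWNER = "usnistgov"
REPO = "cord19-cdcs-nist"


def latest_release_url(owner, repo):
    """Build the GitHub REST API URL for a repository's latest release."""
    return f"https://api.github.com/repos/{owner}/{repo}/releases/latest"


def fetch_latest_release(owner=OWNER, repo=REPO):
    """Download the latest release metadata (requires network access)."""
    with urllib.request.urlopen(latest_release_url(owner, repo)) as resp:
        return json.load(resp)


def pick_asset(release, keyword):
    """Return the download URL of the first asset whose name contains keyword.

    The keyword (e.g. "json") is a hypothetical naming convention; check the
    actual asset names on the releases page.
    """
    for asset in release.get("assets", []):
        if keyword.lower() in asset["name"].lower():
            return asset["browser_download_url"]
    return None
```

Calling `pick_asset(fetch_latest_release(), "json")` would then give a URL to download, or `None` if no asset name matches.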

Data Packages

To support rapid, reproducible data science workflows, this repository also builds data packages that interface directly with common statistics languages. They are used through separately installable libraries that bring the CORD-19 data and the tools for analyzing it together in one convenient place:

Language    Repository
Python      cv-py

More languages are certainly possible, depending on community need. Data packages can be downloaded directly from this repository's releases page, or by following the instructions in the language-specific repositories above. More information can be found in the readme inside each language-specific <lang>-interface folder.

cord19-cdcs-nist's People

Contributors

dima20899, pdessauw, rtbs-dev

Forkers

pdessauw dtrinh62

cord19-cdcs-nist's Issues

Object ID

The build currently assigns the PID as the global object index in dask, before export to parquet. The PID turns out to be non-unique.

Need to implement a globally unique index.
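One possible direction, sketched in plain Python rather than dask: first report which PIDs repeat, then assign a running integer as the index and keep the PID as an ordinary field. The field names `pid` and `object_id` are assumptions for illustration, not the archive's actual column names.

```python
from collections import Counter


def duplicate_pids(records):
    """Return the sorted list of PIDs that occur more than once."""
    counts = Counter(r["pid"] for r in records)
    return sorted(pid for pid, n in counts.items() if n > 1)


def with_unique_index(records):
    """Return copies of the records with a running integer 'object_id'.

    The original records are left unmodified; the PID stays available
    as a regular field.
    """
    return [dict(r, object_id=i) for i, r in enumerate(records)]
```

A running integer is only unique within a single build. If the index must be stable across releases, a different scheme would be needed.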

PID Cleaning

Currently every instance of pid starts with https://127.0.0.1/pid/rest/local/cdcs/, which is likely unnecessary?
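If the prefix is indeed unnecessary, stripping it is straightforward. A minimal sketch follows. The function name `clean_pid` is hypothetical, and values without the prefix are passed through unchanged.

```python
# Prefix quoted in the issue above.
PID_PREFIX = "https://127.0.0.1/pid/rest/local/cdcs/"


def clean_pid(pid, prefix=PID_PREFIX):
    """Strip the local CDCS URL prefix from a PID, if present."""
    if pid.startswith(prefix):
        return pid[len(prefix):]
    return pid
```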

Documentation

Tracking documentation needs for this repo:

  • crosslinks to related projects/repos
  • contribution guidelines
  • build guide
  • licensing

CDCS schema conflict

The current archive is missing some fields from the CDCS schema, specifically author_link.

Perhaps we should consider migrating to xmlschema to perform the data transforms at build time (with validation).

Migrate to dask bags for Python loading

Currently we flatten the underlying nested JSON into a dask dataframe, squashing nested fields into individual columns.

A dask bag would preserve the original structure, as validated by the XSD file, which could then be flattened downstream (e.g. in cv-py, via a data class, etc).
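To illustrate what downstream flattening could look like, here is a small sketch in plain Python (no dask) that turns a nested record into dotted column names. The helper `flatten` and the example field names are hypothetical and do not reflect the actual schema.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys.

    Lists and scalar values are kept as-is; only dicts are descended into.
    """
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat
```

Keeping the nested form as the stored representation means a consumer can choose its own flattening, or none at all.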

Test release

For testing purposes in other apps, it would be useful to have a much smaller "dummy" release.

E.g. in cv-py I interface with the releases page, and I find I need to automate installs from these releases at build time to verify everything works.
