CoronaWhy Common Research and Data Infrastructure

What is CoronaWhy?

CoronaWhy.org is a global volunteer organization dedicated to driving actionable insights into significant world issues using industry-leading data science, artificial intelligence and knowledge sharing. CoronaWhy was founded during the 2020 COVID-19 crisis, following a White House call to help extract valuable data from more than 50,000 coronavirus-related scholarly articles, dating back decades. Currently at over 1000 volunteers, CoronaWhy is composed of data scientists, doctors, epidemiologists, students, and various subject matter experts on everything from technology and engineering to communications and program management.

What has CoronaWhy produced so far?

Read about our creations before you start.

CoronaWhy dashboards

Task-Risk helps to identify risk factors that can increase the chance of being infected, or affect the severity or survival outcome of the infection
Task-Ties helps to explore transmission, incubation and environmental stability
Named Entity Recognition across the entire corpus of CORD-19 papers with full text
Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review

More detailed information about every dashboard is published on Kaggle.

CORD-19 preprocessing pipeline

Download the COVID-19 Open Research Dataset Challenge (CORD-19) dataset from Kaggle:

bash ./download_dataset.sh

Start Jupyter by executing

docker-compose up

The Jupyter notebook runs on port 8888. Test the CORD-19 pipeline by running these commands:

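# On the host: copy the tests into the Jupyter container and open a shell in it
docker cp ./tests covid-19-infrastructure_jupyter_1:/home/jovyan/
docker exec -it covid-19-infrastructure_jupyter_1 /bin/bash
# Inside the container: install the dependency and run the pipeline test
pip install googletrans
cd tests
python ./cord-processing.py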

The test should produce v12* files in the same folder. The file v12_sentences.json contains all extracted entities at the sentence level, corresponding to the CoronaWhy Elasticsearch collection.
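For a first look at the output, a minimal sketch along the following lines may help. It assumes that v12_sentences.json is a single JSON document (a list of records or an object). The schema is not described here, so the helper `summarize_json` (a hypothetical name) only reports the top-level shape and does not rely on any field names.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a small summary of the top-level structure of a JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)

    if isinstance(data, list):
        # Collect the keys seen in the first few records, if they are objects.
        keys = sorted(
            {key for item in data[:5] if isinstance(item, dict) for key in item}
        )
        return {"type": "list", "length": len(data), "sample_keys": keys}
    if isinstance(data, dict):
        # For an object, report up to 20 of its keys in sorted order.
        return {"type": "dict", "length": len(data), "sample_keys": sorted(data)[:20]}
    return {"type": type(data).__name__, "length": None, "sample_keys": []}
```

Calling `summarize_json("v12_sentences.json")` from the tests folder would then show how many records were produced and which keys they carry. If the file turns out to be in JSON Lines format, it would need to be parsed line by line.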

Follow all updates on our YouTube channel and the CoronaWhy GitHub

Getting Started with CoronaWhy Common infrastructure

How to access Elasticsearch and Dataverse, notebook

CoronaWhy Elasticsearch Tutorial notebook

How to Create Knowledge Graph, notebook

Dataverse Colab Connect, notebook

GitHub dataset sync with Dataverse, notebook

CoronaWhy Services

You can connect your notebooks to any of the services listed below. All services coming from CoronaWhy Labs have experimental status. Join the fight against COVID-19 if you want to help us!

Data repository

Dataverse is deployed as a data service at https://datasets.coronawhy.org. Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others.
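Dataverse installations normally expose a Search API at `/api/search`. The sketch below shows how a notebook might query it using only the standard library. Whether this particular deployment allows unauthenticated API access is an assumption, and the helper names (`build_search_url`, `extract_items`, `search_datasets`) are illustrative, not part of any CoronaWhy package.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://datasets.coronawhy.org"  # service URL named in this README


def build_search_url(query, item_type="dataset", per_page=10, base_url=BASE_URL):
    """Build a Dataverse Search API URL (GET /api/search)."""
    params = {"q": query, "type": item_type, "per_page": per_page}
    return f"{base_url}/api/search?{urlencode(params)}"


def extract_items(payload):
    """Pull (name, global_id) pairs out of a Search API response body."""
    items = payload.get("data", {}).get("items", [])
    return [(item.get("name"), item.get("global_id")) for item in items]


def search_datasets(query, timeout=30):
    """Run the search over HTTP; needs network access to the service."""
    with urlopen(build_search_url(query), timeout=timeout) as response:
        return extract_items(json.load(response))
```

For example, `search_datasets("cord-19")` would return the names and persistent identifiers of matching datasets, provided the service is reachable.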

Elasticsearch

CoronaWhy Elasticsearch holds sentence-level CORD-19 indexes and is available at CoronaWhy Search
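A full-text query against a sentence-level index could be sketched as follows, using the standard Elasticsearch `_search` endpoint and a `match` query. The host, index name, and field name below are placeholders, since the real values are not given here; substitute the ones used by the CoronaWhy cluster.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical values: the real host, index, and field names are not given here.
ES_URL = "http://localhost:9200"
INDEX = "cord19-sentences"
TEXT_FIELD = "sentence"


def build_match_query(text, size=5, field=TEXT_FIELD):
    """Build an Elasticsearch Query DSL body for a full-text match query."""
    return {"size": size, "query": {"match": {field: text}}}


def build_search_request(text, es_url=ES_URL, index=INDEX):
    """Prepare a POST request to the index's _search endpoint."""
    body = json.dumps(build_match_query(text)).encode("utf-8")
    return Request(
        f"{es_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_sources(response_body):
    """Return the _source document of every hit in a search response."""
    return [hit["_source"] for hit in response_body["hits"]["hits"]]
```

With a reachable cluster, the request can be sent with `urlopen(build_search_request("incubation period"))`, and the decoded JSON response passed to `hit_sources`.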

Available indexes:

MongoDB

The MongoDB service is deployed on mongodb.coronawhy.org and is available from CoronaWhy Labs virtual machines. Please contact our administrators if you want to use it.

Hypothesis

Our Hypothesis annotation service runs on hypothesis.labs.coronawhy.org and allows manual annotation of CORD-19 papers. Please try our Hypothesis Demo if you're interested.

OpenLink Virtuoso triplestore

We provide Virtuoso as a service with a public SPARQL endpoint. The endpoint offers an HTTP-based query service that operates on entity relationship types (relations) represented as RDF sentence collections, using the SPARQL query language. https://virtuoso.openlinksw.com

You can run a simple SPARQL query to get an overview of the triples in the CoronaWhy Knowledge Graph.
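Such an overview query could look like the sketch below, which follows the standard SPARQL Protocol (a GET request with a `query` parameter) and the SPARQL JSON results format. The endpoint URL is a placeholder, since the public endpoint address is not given here, and the function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint URL: replace it with the actual CoronaWhy SPARQL endpoint.
SPARQL_ENDPOINT = "http://localhost:8890/sparql"

# Fetch ten arbitrary triples as a quick look at the graph's contents.
OVERVIEW_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(query, endpoint=SPARQL_ENDPOINT):
    """Prepare a SPARQL Protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(query, endpoint=SPARQL_ENDPOINT, timeout=30):
    """Send the query over HTTP; needs network access to the endpoint."""
    with urlopen(build_sparql_request(query, endpoint), timeout=timeout) as response:
        return bindings_to_rows(json.load(response))
```

Calling `run_query(OVERVIEW_QUERY, endpoint=...)` with the real endpoint would return up to ten subject, predicate, and object values as plain dictionaries.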

Kibana

Kibana is deployed as a community service connected to CoronaWhy Elasticsearch at https://kibana.labs.coronawhy.org. It lets you visualize Elasticsearch data and navigate the Elastic Stack, so you can do anything from tracking query load to understanding the way requests flow through your apps. https://www.elastic.co/kibana

BEL

BEL Commons 3.0 is available as a service at https://bel.labs.coronawhy.org

BEL Commons is an environment for curating, validating, and exploring knowledge assemblies encoded in Biological Expression Language (BEL) to support elucidating disease-specific, mechanistic insight.

You can watch the introduction video and read the Corona BEL Tutorial if you want to know more.

INDRA

INDRA will be deployed as a service at https://labs.coronawhy.org/indra (in development).

INDRA (Integrated Network and Dynamical Reasoning Assembler) generates executable models of pathway dynamics from natural language (using the TRIPS and REACH parsers), and from BioPAX and BEL sources (including the Pathway Commons database and NDEx).

Geoparser

Geoparser is available as a service at https://geoparser.labs.coronawhy.org

The Geoparser is a software tool that can process information from any type of file, extract geographic coordinates, and visualize locations on a map. Users who are interested in seeing a geographical representation of information or data can choose to search for locations using the Geoparser, through a search index or by uploading files from their computer. https://github.com/nasa-jpl-memex/GeoParser

Tabula

Tabula allows you to extract data from PDF files into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. We deployed it as a CoronaWhy service available to all community members. More information is available on the Tabula website.

Teamchatviz

We use Teamchatviz to explore how communication works in our distributed team and to learn how communication shapes culture in the CoronaWhy community. https://moovel.github.io/teamchatviz/

In progress

We are working on the deployment of a Neo4j graph database.

Articles produced by CoronaWhy people

I’m an AI researcher and here’s how I fight corona by Artur Kiulian

Exploration of Document Clustering with SPECTER Embeddings by Brandon Eychaner

COVID-19 Research Papers Geolocation by Ishan Sharma
