The biotech-consulting-project from mzcolor001

Introduction

Biotrain is a querying and aggregation platform that shows the impact of medications on cancer patients. It is a data pipeline that lets researchers and doctors trace the medications patients are taking and retrieve the experiments run on them along with the corresponding sequence data. The information derived from the sequence data is displayed in a dashboard to help researchers and doctors decide on further treatment and identify which organisms are crucial for curing or preventing certain cancers.

Biotrain uses data from Persephone Biome. The data is stored in GCP buckets; a Python script cleans it, extracts the patient information, and generates JSON files for the different experiments run on the patients. MongoDB is used to aggregate the patient information, experimental runs, and analysis results. Pymongo is the MongoDB driver used to talk to the database. Dash is used to build the front-end so that users can query the information they need. Apache Airflow automates the pipeline above and sends monitoring information to the user.

Components

Data Production

The data residing in GCP is read onto the local server as files. The total size is about 30 GB, and the file hierarchy is organized as follows:

└── Persephone
    ├── metadata_sanguine_Cancer.xlsx
    ├── metadata_sanguine_Controls.xlsx
    ├── metadata_sanguine_Cancer-human.xlsx
    ├── metadata_sanguine_Controls-human.xlsx
    ├── MiniSeq Submission Sheet.xlsx
    └── wgs11N
        ├── analysis
        │   └── centrifuge
        │       ├── 05_S1
        │       │   ├── kreport.tsv.gz
        │       │   ├── report.tsv.gz
        │       │   └── hits.tsv.gz
        │       ├── ...
        │       └── 64_S45
        │           ├── kreport.tsv.gz
        │           ├── report.tsv.gz
        │           └── hits.tsv.gz
        └── trimmed
            ├── 25_S6_R2_001_val_2.fq.gz
            ├── ...
            └── 17_S3_R1_001.fastq.gz_trimming_report.txt
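
As an illustration of the cleaning and extraction step, the sketch below walks a centrifuge directory laid out as above and writes one JSON file per sample. It is not the project's actual script: the function names are hypothetical, only report.tsv.gz is read, and the rows are kept as raw lists of strings because the column layout is not described here.

```python
import csv
import gzip
import json
from pathlib import Path


def read_tsv_gz(path):
    """Read a gzip-compressed TSV file into a list of rows (lists of strings)."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))


def collect_sample_reports(centrifuge_dir):
    """Map each sample directory name (e.g. '05_S1') to its parsed report rows."""
    reports = {}
    for sample_dir in sorted(Path(centrifuge_dir).iterdir()):
        report_path = sample_dir / "report.tsv.gz"
        # Skip stray files and sample directories without a report.
        if sample_dir.is_dir() and report_path.exists():
            reports[sample_dir.name] = read_tsv_gz(report_path)
    return reports


def write_sample_json(reports, out_dir):
    """Write one JSON file per sample and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sample, rows in reports.items():
        target = out_dir / f"{sample}.json"
        target.write_text(json.dumps({"sample": sample, "rows": rows}))
        written.append(target)
    return written
```

Calling collect_sample_reports on the centrifuge directory and passing the result to write_sample_json would produce files such as 05_S1.json, which could then be loaded into the database.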

Data Storage

Schema

Because the schema is likely to change frequently, MongoDB was chosen for storage. Given the complexity of the data format, the one-to-many relationships, and the size of each record (some files exceed 2 GB), a model tree structure is adopted instead of embedded documents. Please refer to the schema below.
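
To make the difference from embedded documents concrete, here is a minimal sketch of a tree modeled with parent references, where each document stores only the id of its parent. The ids, the kind labels, and the exact shape of the tree (patient, then sample, then run, then results) are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical documents: each one points at its parent instead of being
# nested inside it, so a large record never has to live inside another one.
nodes = {
    "patient:P001": {"kind": "patient", "parent": None},
    "sample:S001": {"kind": "sample", "parent": "patient:P001"},
    "run:R001": {"kind": "experimental_run", "parent": "sample:S001"},
    "result:E001": {"kind": "experimental_result", "parent": "run:R001"},
    "analysis:A001": {"kind": "analysis_result", "parent": "run:R001"},
}


def path_to_root(node_id, nodes):
    """Follow parent references from a node up to the root; return the ids visited."""
    path = []
    current = node_id
    while current is not None:
        path.append(current)
        current = nodes[current]["parent"]
    return path


def children_of(node_id, nodes):
    """Return the ids of all documents whose parent reference is node_id."""
    return sorted(key for key, doc in nodes.items() if doc["parent"] == node_id)
```

With this layout, adding a field to one kind of document does not touch the others, and a one-to-many relationship is just several documents sharing the same parent value.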

Aggregation

The database has five collections, which hold the patient information, sample information, experimental runs, experimental results, and analysis results. Joining all five collections in a single query was attempted but did not work in this case, because MongoDB has no single-step join across multiple collections (a $lookup stage joins one collection at a time). The aggregation is instead done through this script.
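
One way a script could combine the collections is to fetch each one separately and join the documents in Python. The sketch below is not the project's script; the field names patient_id, sample_id, and run_id are assumptions, and plain lists of dicts stand in for the results of Pymongo find() calls.

```python
from collections import defaultdict


def group_by(docs, key):
    """Index a list of documents by the value of one field."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[key]].append(doc)
    return grouped


def aggregate_patient(patient, samples, runs, results, analyses):
    """Nest samples, runs, and both kinds of result under one patient document."""
    samples_by_patient = group_by(samples, "patient_id")
    runs_by_sample = group_by(runs, "sample_id")
    results_by_run = group_by(results, "run_id")
    analyses_by_run = group_by(analyses, "run_id")

    merged = dict(patient)  # copy so the input document is left untouched
    merged["samples"] = []
    for sample in samples_by_patient.get(patient["_id"], []):
        sample_view = dict(sample)
        sample_view["runs"] = []
        for run in runs_by_sample.get(sample["_id"], []):
            run_view = dict(run)
            run_view["results"] = results_by_run.get(run["_id"], [])
            run_view["analyses"] = analyses_by_run.get(run["_id"], [])
            sample_view["runs"].append(run_view)
        merged["samples"].append(sample_view)
    return merged
```

Indexing each collection once keeps the join linear in the number of documents, rather than rescanning a collection for every parent.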

Front-end

The analysis results are read from MongoDB and displayed with Dash. The user can select a patient from the dropdown to query and display that patient's information.
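
A minimal sketch of such a dropdown-driven page follows, assuming Dash 2.x. The component ids, the helper names, and the idea of showing the patient document as formatted JSON are illustrative choices, not the project's actual front-end; the patients list stands in for documents read from MongoDB.

```python
import json


def dropdown_options(patients):
    """Turn patient documents into the option list a Dash dropdown expects."""
    return [{"label": p["_id"], "value": p["_id"]} for p in patients]


def describe_patient(patient_id, patients):
    """Return the text shown for the selected patient."""
    if patient_id is None:
        return "Select a patient."
    for patient in patients:
        if patient["_id"] == patient_id:
            return json.dumps(patient, indent=2, sort_keys=True)
    return f"No record found for {patient_id}."


def build_app(patients):
    """Create a Dash app with a patient dropdown and a detail panel."""
    # Imported here so the helpers above can be used without Dash installed.
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Dropdown(id="patient-dropdown", options=dropdown_options(patients)),
        html.Pre(id="patient-detail"),
    ])

    @app.callback(Output("patient-detail", "children"),
                  Input("patient-dropdown", "value"))
    def show_patient(patient_id):
        # Re-rendered every time the dropdown selection changes.
        return describe_patient(patient_id, patients)

    return app
```

Calling build_app(patients).run(debug=True) would start the development server on recent Dash releases; older releases use run_server instead.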

Data Monitoring/Scheduling

To automate the whole process described above, Apache Airflow schedules the data transfer from GCP to the local server, manages the interface to the database, and triggers the front-end. When a scheduled run fails, a notification is sent by email.
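
The scheduling described above could be laid out roughly as follows, assuming Airflow 2.x with SMTP configured for outgoing mail. The DAG id, task names, daily schedule, email address, and placeholder task bodies are all hypothetical; the project's real DAG is not shown here.

```python
from datetime import datetime, timedelta

# Hypothetical settings: Airflow emails the listed address when a task fails.
DEFAULT_ARGS = {
    "owner": "biotrain",
    "email": ["researcher@example.com"],
    "email_on_failure": True,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def sync_from_gcp():
    """Placeholder for copying the bucket contents to the local server."""
    return "synced"


def load_into_mongodb():
    """Placeholder for cleaning the files and loading them into MongoDB."""
    return "loaded"


def refresh_dashboard():
    """Placeholder for triggering the front-end."""
    return "refreshed"


# Tasks in the order they should run.
TASKS = [
    ("sync_from_gcp", sync_from_gcp),
    ("load_into_mongodb", load_into_mongodb),
    ("refresh_dashboard", refresh_dashboard),
]


def build_dag():
    """Create the DAG and chain the tasks one after another."""
    # Imported lazily so this file can be read without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="biotrain_pipeline",
        default_args=DEFAULT_ARGS,
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",  # assumed cadence
        catchup=False,
    )
    previous = None
    for task_id, func in TASKS:
        task = PythonOperator(task_id=task_id, python_callable=func, dag=dag)
        if previous is not None:
            previous >> task  # run tasks strictly in sequence
        previous = task
    return dag
```

In a real DAG file the result of build_dag() would be assigned to a module-level variable so that the scheduler can discover it.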

Data Pipeline

Performance

Reading the data from GCP takes about 20 minutes; extracting, transforming, and displaying it on the dashboard takes about 17 seconds.

Packages

Dash
Pymongo

