
biotech-consulting-project's Introduction

Table of Contents

  1. Introduction
  2. Components
  3. Data Pipeline
  4. Performance

Introduction


Biotrain is a querying and aggregation platform for showing how medications affect patients with cancer. It is a data pipeline that lets researchers and doctors trace the medications each patient is taking, look up the experiments run on those patients, and retrieve the corresponding sequence data. The information derived from the sequence data is displayed in a dashboard, helping researchers and doctors decide on further treatment and identify which organisms are important for treating or preventing particular cancers.

Biotrain uses data from Persephone Biome to serve the application. The raw data is stored in GCP buckets, and a Python script cleans it, extracts the patient information, and generates JSON files for the different experiments run on each patient. MongoDB is used to aggregate the patient information, experimental runs, and analysis results per patient, with PyMongo as the driver for talking to the database. Dash is used to build the front-end so users can query the information they need, and Apache Airflow automates the whole pipeline and sends monitoring notifications to the user.
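As an illustration of the cleaning step, the sketch below reads one of the metadata spreadsheets, drops incomplete rows, and writes one JSON file per experimental run; the column names and output layout are assumptions for illustration, not the actual Persephone Biome schema.

# Rough sketch of the extraction step; column names such as "patient_id"
# and "run_id" are illustrative assumptions.
import json
from pathlib import Path

import pandas as pd

def extract_patient_metadata(xlsx_path: str, out_dir: str) -> None:
    df = pd.read_excel(xlsx_path)
    df = df.dropna(subset=["patient_id"])        # discard rows without a patient ID
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for run_id, group in df.groupby("run_id"):   # one JSON file per experimental run
        records = group.to_dict(orient="records")
        (out / f"{run_id}.json").write_text(json.dumps(records, default=str, indent=2))

extract_patient_metadata("Persephone/metadata_sanguine_Cancer.xlsx", "json/cancer")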

Components


  • Data Production

  • The data residing in GCP is read into the local server as files; the data size is about 30 GB, and the file hierarchy is organized as shown below (a sketch of the download step follows the tree):
├── Persephone
│   ├── metadata_sanguine_Cancer.xlsx
│   ├── metadata_sanguine_Controls.xlsx
│   ├── metadata_sanguine_Cancer-human.xlsx
│   ├── metadata_sanguine_Controls-human.xlsx
│   ├── MiniSeq Submission Sheet.xlsx
│   └── wgs11N
│       ├── analysis
│       │   └── centrifuge
│       │       ├── 05_S1
│       │       │   ├── kreport.tsv.gz
│       │       │   ├── report.tsv.gz
│       │       │   └── hits.tsv.gz
│       │       ├── ...
│       │       └── 64_S45
│       │           ├── kreport.tsv.gz
│       │           ├── report.tsv.gz
│       │           └── hits.tsv.gz
│       └── trimmed
│           ├── 25_S6_R2_001_val_2.fq.gz
│           ├── ...
│           └── 17_S3_R1_001.fastq.gz_trimming_report.txt
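The download from the bucket can be sketched as below with the google-cloud-storage client; the bucket name and destination directory are placeholders, not values from the project.

# Sketch of pulling the ~30 GB of raw data from the GCP bucket to the local server.
# "persephone-biome-data" and "/data/local" are placeholder names.
from pathlib import Path

from google.cloud import storage

def download_prefix(bucket_name: str, prefix: str, dest_root: str) -> None:
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        target = Path(dest_root) / blob.name
        target.parent.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(str(target))   # stream each object to disk

download_prefix("persephone-biome-data", "Persephone/", "/data/local")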
  • Data Storage

  • Schema

Given that the schema is likely to change frequently, MongoDB was chosen to handle this. Also, given the complexity of the data format, the one-to-many relationships, and the size of individual records (some files exceed 2 GB), the model tree pattern is adopted instead of embedded documents. Please refer to the schema below.
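As a minimal sketch of this reference-based layout (field and collection names are assumptions), each child document stores the _id of its parent instead of embedding it, and large result files are referenced by path rather than stored inside documents:

# Illustrative documents for the reference-based ("model tree") layout.
# All names and values below are placeholders for illustration.
patient = {"_id": "P001", "diagnosis": "cancer", "age": 57}

sample = {"_id": "S05_S1", "patient_id": "P001", "type": "stool"}

experimental_run = {"_id": "wgs11N", "sample_id": "S05_S1", "platform": "MiniSeq"}

analysis_result = {
    "_id": "wgs11N_centrifuge_05_S1",
    "run_id": "wgs11N",
    "tool": "centrifuge",
    "report_path": "Persephone/wgs11N/analysis/centrifuge/05_S1/kreport.tsv.gz",
}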

  • Aggregation

The database holds five collections: patient info, sample info, experimental runs, experimental results, and analysis results. Joining all five collections inside MongoDB was tried and did not work for this case, since a single $lookup stage only joins one additional collection at a time; the aggregation is therefore done through this script.
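A minimal sketch of such a script, assuming hypothetical database, collection, and field names, walks the references from patient to analysis results in application code:

# Sketch of aggregating the five collections in application code with PyMongo.
# Database, collection, and field names are assumptions for illustration.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["biotrain"]

def patient_report(patient_id: str) -> dict:
    patient = db.patients.find_one({"_id": patient_id})
    samples = list(db.samples.find({"patient_id": patient_id}))
    sample_ids = [s["_id"] for s in samples]
    runs = list(db.experimental_runs.find({"sample_id": {"$in": sample_ids}}))
    run_ids = [r["_id"] for r in runs]
    exp_results = list(db.experimental_results.find({"run_id": {"$in": run_ids}}))
    analyses = list(db.analysis_results.find({"run_id": {"$in": run_ids}}))
    return {"patient": patient, "samples": samples, "runs": runs,
            "experimental_results": exp_results, "analysis_results": analyses}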

  • Front-end

The analysis results are read from MongoDB and displayed with Dash. The user can select the desired patient from a dropdown and query and display that patient's information.
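A minimal sketch of this behaviour, with assumed collection and field names (not the project's actual layout), could look like the following:

# Minimal Dash sketch: a dropdown of patient IDs and a callback that shows the
# selected patient's document from MongoDB. Names are illustrative assumptions.
import json

from dash import Dash, dcc, html, Input, Output
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["biotrain"]
app = Dash(__name__)

app.layout = html.Div([
    dcc.Dropdown(
        id="patient-dropdown",
        options=[{"label": p["_id"], "value": p["_id"]} for p in db.patients.find()],
        placeholder="Select a patient",
    ),
    html.Pre(id="patient-info"),
])

@app.callback(Output("patient-info", "children"), Input("patient-dropdown", "value"))
def show_patient(patient_id):
    if patient_id is None:
        return "No patient selected."
    doc = db.patients.find_one({"_id": patient_id})
    return json.dumps(doc, default=str, indent=2)

if __name__ == "__main__":
    app.run_server(debug=True)   # app.run(debug=True) on recent Dash releases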

  • Data Monitoring/Scheduling

In order to automate the whole process described above, Apache Airflow is used to schedule the data transfer from GCP to the local server, manage the interface to the database, and trigger the front-end. An email notification is sent out when a scheduled run fails.
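A minimal sketch of such a DAG, with placeholder task callables, schedule, and alert address:

# Sketch of the scheduling/monitoring DAG; callables, schedule, and the alert
# email address are placeholders, not the project's actual configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_from_gcs(): ...        # placeholder: copy the bucket contents locally
def load_into_mongo(): ...      # placeholder: clean, extract, and load into MongoDB
def refresh_dashboard(): ...    # placeholder: trigger/refresh the Dash front-end

default_args = {
    "email": ["alerts@example.com"],
    "email_on_failure": True,    # send an email when a task fails
    "retries": 1,
}

with DAG(
    dag_id="biotrain_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # "schedule=" on Airflow >= 2.4
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="pull_from_gcs", python_callable=pull_from_gcs)
    load = PythonOperator(task_id="load_into_mongo", python_callable=load_into_mongo)
    serve = PythonOperator(task_id="refresh_dashboard", python_callable=refresh_dashboard)
    extract >> load >> serve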

Data Pipeline


Performance


Reading the data from GCP takes about 20 minutes; extracting, transforming, and displaying it on the dashboard takes about 17 seconds.


Packages

  1. Dash
  2. Pymongo
