GithubHelp home page GithubHelp logo

mfleming99 / dnaassayclassifier Goto Github PK

View Code? Open in Web Editor NEW
2.0 3.0 0.0 46.01 MB

Classifier that can identify the assay of an unsequenced DNA read based of the result of Bowtie2 and other DNA analysis tools.

Python 100.00%

dnaassayclassifier's Introduction

DNA Assay Classifier

This is Max Fleming and Udit Sharma's independent study project for Fall 2019, where we will be building a classier that can identify the assay of an unsequenced Human DNA read based of the result of Bowtie2 and other DNA analysis tools. This project is split up in to two parts, data generation and model training.

Data Generation

All the functionality needed to generate data is under our data folder. In order to generate your own data using our functionality you will need the following dependencies.

  1. Bowtie2 We used this branch which enables random sampling of SRA files.
  2. NCLS This is the interval tree implementation we used to implement our gene annotation percentage.

After installing dependencies, you will be able to generate data using the following command.

python3 driver.py --index "location of your Bowtie index" --gtf "Location of your gene transfer format file" --mrna "List of your mRNA accessions" --srna "List of your sRNA accessions" --wgs "List of your WGS accessions"

You can get prebuilt Bowtie indexes from the Bowtie2 Manual . We used H. sapiens, NCBI GRCh38 (Download link below) as our reference genome. We used this (Download link below) gene transfer format file, however as long as your gtf matches your reference genome, and gtf should lead to a correct gene annotation percentage.

(ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz)

(ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz)

Getting SRA accessions is a little bit harder than the index and gtf. We used SRAdb to find accessions that we would want to use. We also included the accessions we used under data/sample_accessions folder in this repo.

The output will be three files: data_output_mrna.csv, data_output_srna.csv and data_output_wgs.csv. These files contain data about the SRA accessions that it was passed.

Model

There is an interactive version of our model available on the bt2-ui. Here you can input an accession and our model will predict the assay of the model. The model available on the UI is a random forest, and it achieved a test accuracy of 91%

Both a KNN and the random forest model available on the bt2-ui can be created by running the command:

python3 main.py --wgs classifier_data/wgs2.csv --mrna classifier_data/mrna2.csv --srna classifier_data/srna2.csv

The output will be the scores of the two models, and a pickle file named forest containing the random forest, as well as a pickle file named knn containing the knn model.

dnaassayclassifier's People

Contributors

mfleming99 avatar usharma6 avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.