GithubHelp home page GithubHelp logo

xtordoir / popstrat Goto Github PK

View Code? Open in Web Editor NEW

This project forked from nfergu/popstrat

0.0 1.0 0.0 147 KB

Population Stratification Analysis on Genomics Data Using Deep Learning

License: Apache License 2.0

Scala 100.00%

popstrat's Introduction

Introduction

PopStrat is a simple example of population stratification analysis on genomics data using "deep learning" (neural networks). That is, it aims to predict which population group an individual belongs to based on their genome.

For a more detailed explanation see this blog post on bdgenomics.org.

The following technologies are used:

  • ADAM: a genomics analysis platform and associated file formats
  • Apache Spark: a fast engine for large-scale data processing
  • H2O: an open source predictive analytics platform
  • Sparking Water: integration of H2O with Apache Spark

The example consists of a single Scala class: PopStrat.

Prerequisites

Before building and running PopStrat ensure you have version 7 or later of the Java JDK installed.

Building

To build from source first download and install Maven. Then at the command line type:

mvn clean package

This will build a JAR (target/uber-popstrat-0.1-SNAPSHOT.jar) containing the PopStrat class, as well as all of its dependencies.

Running

First download Spark version 1.2.0 and unpack it on your machine.

Next you'll need to get some genomics data. Go to your nearest mirror of the 1000 genomes FTP site. From the release/20130502/ directory download the ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz file and the integrated_call_samples_v3.20130502.ALL.panel file. The first file file is the genotype data for chromosome 22, and the second file is the panel file, which describes the population group for each sample in the genotype data.

Unzip the genotype data before continuing. This will require around 10GB of disk space.

To speed up execution and save disk space you can convert the genotype VCF file to ADAM format (using the ADAM transform command) if you wish. However this will take some time up-front. Both ADAM and VCF formats are supported.

Next run the following command:

YOUR_SPARK_HOME/bin/spark-submit --class "com.neilferguson.PopStrat" --master local[6] --driver-memory 6G target/uber-popstrat-0.1-SNAPSHOT.jar <genotypesfile> <panelfile>

Replacing <genotypesfile> with the path to your genotype data file (ADAM or VCF), and <panelfile> with the panel file from 1000 genomes.

This runs PopStrat using a local (in-process) Spark master with 6 cores and 6GB of RAM. You can run against a different Spark cluster by modifying the options in the above command line. See the Spark documentation for further details.

Using the above data PopStrat may take up to 2-3 hours to run, depending on hardware. When it is finished you should see output that looks something like the following:

Confusion Matrix (vertical: actual; across: predicted):
       ASW CHB GBR  Error      Rate
   ASW  60   1   0 0.0164 =  1 / 61
   CHB   0 103   0 0.0000 = 0 / 103
   GBR   0   1  90 0.0110 =  1 / 91
Totals  60 105  90 0.0078 = 2 / 255

This is a confusion matrix which shows the predicted versus the actual populations. All being well, you should see an overall accuracy of more than 99% (only one or two predictions should be incorrect).

Code

A single Scala class at src/main/scala/com/neilferguson/PopStrat.scala contains all of the code for PopStrat.

See this blog post on bdgenomics.org for a deep dive into the code.

The code is fairly straightforward and follows the following high level flow:

  1. Load the genotype and panel data from the specified files
  2. Filter out those samples that aren't in the populations we are trying to predict
  3. Filter out variants that are missing from some samples
  4. Reduce the number of dimensions in the data by filtering to a (fairly arbitrary) subset of variants
  5. Create a Spark SchemaRDD with each column representing a variant and each row representing a sample
  6. Convert the SchemaRDD to an H2O data frame.
  7. Convert the data frame into 50% training data and 50% test data
  8. Set the parameters for the deep learning model (we use two hidden layers each with 100 neurons) and train the model
  9. Score the entire data set (training and test data) against the model

Credits

Thanks to the folks at Big Data Genomics for the original blog post that inspired this.

popstrat's People

Contributors

nfergu avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.