genomic_interval_pipeline

A Pipeline for Building Genomic Annotation Datasets for Deep Learning


This is a Rust pipeline for creating HDF5 input to Keras from genomic regions and annotations. It is a (somewhat) drop-in replacement for Basset's preprocessing pipeline, intended to transform a list of BED files into annotated, one-hot encoded sequences for use in a deep learning model. The input and output of both pipelines should be similar, with this one being substantially faster on larger datasets.
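For reference, "one-hot encoded" means each base becomes a length-4 indicator vector, so a region of length L becomes an (L, 4) matrix. A minimal NumPy sketch of the idea (not the pipeline's actual Rust implementation; the A/C/G/T column order here is an assumption):

import numpy as np

# Assumed column order (A, C, G, T); check the generated dataset if the
# ordering matters for your downstream model.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix.
    Ambiguous bases such as N are left as all-zero rows."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            encoded[i, j] = 1.0
    return encoded

print(one_hot("ACGTN"))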

To install, use cargo, or alternatively build from source.

cargo install genomic_interval_pipeline

Building from source

Ensure that you have installed cargo and Rust on your system, then clone this repository.

git clone [email protected]:Chris1221/genomic_interval_pipeline.rs.git
cd genomic_interval_pipeline.rs

Use cargo to build the executable. It should figure out all the dependencies for you.

cargo build --release

The binary will be in target/release/genomic_interval_pipeline.

Usage

You must provide:

  1. A newline-separated list of gzipped BED files.
  2. The path to the reference genome. This must be compressed with bgzip and indexed with samtools faidx.

An example of the first can be found in data/metadata.txt:

data/file1.bed.gz
data/file2.bed.gz

Creating the required reference genome files is straightforward.

  1. Download your reference genome of choice, for example hg19. Here, I just used the UCSC Genome Browser.
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
  2. Compress your reference genome with bgzip (if it is already bgzip-compressed, skip this step). Note that the standard gzip compression of the UCSC download is not the same as bgzip, so the file must be recompressed.
gunzip hg19.fa.gz
bgzip hg19.fa
  3. Index with samtools faidx.
samtools faidx hg19.fa.gz
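To sanity-check the result before running the pipeline, you can try opening the file with pysam (an optional check, not part of the pipeline; this assumes pysam is installed, and it will fail if the file is gzip- rather than bgzip-compressed):

import pysam

# Opens hg19.fa.gz using the indexes produced by samtools faidx.
fasta = pysam.FastaFile("hg19.fa.gz")
print(fasta.references[:5])               # first few chromosome names
print(fasta.fetch("chr1", 10000, 10010))  # a short slice of chr1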

Running the pipeline

Invoke the binary with the paths to the BED file metadata and to the reference genome (you don't have to specify where the index is).

genomic_interval_pipeline -i data/metadata.txt -f hg19.fa.gz -o small_dataset

This will create your dataset at small_dataset.h5.
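You can quickly inspect the result with h5py (an optional check; the dataset names are the six tables described under Dataset Format below):

import h5py

# Print each dataset in the file along with its shape.
with h5py.File("small_dataset.h5", "r") as f:
    for name in f:
        print(name, f[name].shape)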

Arguments

| Short | Long | Value | Description |
|-------|------|-------|-------------|
| -i | --input | String | Path to a newline-separated list of BED files to process. |
| -f | --fastq | String | Path to a faidx-indexed, bgzip-compressed reference FASTA file. |
| -o | --output | String | Path to the output .h5 file. |
| -m | --min_overlap | Number | Minimum overlap required to merge segments (default: 200). |
| -e | --exclusive | Bool | Perform multiclass rather than multilabel learning, i.e. exclude regions annotated in multiple cell types, writing only unique labels (default: false). |
|   | --length | Number | Standardised length of regions (default: 600). |
|   | --test_chr | String | Comma-separated list of chromosomes to use in the test set (default: chr19,chr20). |
|   | --valid_chr | String | Comma-separated list of chromosomes to use in the validation set (default: chr21,chr22). |
|   | --loglevel | String | Level of logging (default: info). |

Dataset Format

HDF5 files are essentially directories of data. There are six tables within the dataset corresponding to the training, test, and validation sequences and their labels.

Sequences are 3D arrays with dimensions (batch, length, 4), where length is optionally specified when building the dataset and refers to the standardized length of the segments.

Labels are 2D arrays with dimensions (batch, number_of_labels), where number_of_labels is the number of entries in the metadata file. You can easily recode this dataset inside the HDF5 file for more bespoke training outputs, as sketched below.
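As a sketch of such recoding (the column selection here is hypothetical; adapt it to your own labels):

import h5py

# Hypothetical recoding: keep only the first two label columns and store
# them as a new dataset alongside the original one.
with h5py.File("small_dataset.h5", "r+") as f:
    labels = f["training_labels"][:]  # shape: (batch, number_of_labels)
    f.create_dataset("training_labels_recoded", data=labels[:, :2])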

Nonstandard Labels

Labels are assigned to BED files sequentially (as a SERIAL ID would be assigned in a SQL table), but this behaviour can be overridden. Instead of a one-column metadata file, you may provide a two-column metadata file separated by spaces, where the second column is the numeric label for that file.

path/to/file1.bed.gz numeric_label_1
path/to/file2.bed.gz numeric_label_2

See the example in data/metadata_custom.txt.

Using the dataset in Keras

You can use this data in your own neural network with the TensorFlow I/O API. Here is an example in Python.

import tensorflow as tf
import tensorflow_io as tfio

dataset = "small_dataset.h5"

# Stream the training sequences and labels directly from the HDF5 file.
train_x = tfio.IODataset.from_hdf5(dataset, "/training_sequences")
train_y = tfio.IODataset.from_hdf5(dataset, "/training_labels")
train_dataset = tf.data.Dataset.zip((train_x, train_y))

Pass this train_dataset (and similarly the test and validation sets) to model.fit.
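A minimal sketch of that final step (the architecture is purely illustrative; the input length of 600 matches the default --length, and the two output units match the two BED files in the example metadata):

# Batch the zipped dataset before training.
train_dataset = train_dataset.batch(64)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(600, 4)),
    tf.keras.layers.Conv1D(32, 8, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(2, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(train_dataset, epochs=5)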
