koadman / graphdeconvolution

deconvolution of species & strain genome mixtures in assembly graphs

Makefile 0.54% C++ 32.10% C 1.90% Cuda 62.25% Python 2.52% Stan 0.22% R 0.03% Nextflow 0.43% Shell 0.03%

graphdeconvolution's Introduction

assembly graph deconvolution

A repository for work on deconvoluting assembly graphs containing strain mixtures using time-series abundance information.

Quick start

Assuming you are starting in a directory containing a collection of paired-end Illumina read files, with names ending in the usual R?.fastq.gz, the software can be run as follows:

export BBMAP=/path/to/bbmap
export GDECONHOME=/path/to/this/repo
find `pwd` -maxdepth 1 -name "*R1.fastq.gz" | sort > read1_files.txt
find `pwd` -maxdepth 1 -name "*R2.fastq.gz" | sort > read2_files.txt
$GDECONHOME/btools/gdecon.nf --readlist1=read1_files.txt  --readlist2=read2_files.txt --r1suffix=R1.fastq.gz

If the workflow has completed successfully then the assembled strain genomes will appear in the directory out/.
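
The two read lists are built independently and sorted, so they only line up if every R1 file has a matching R2 file. As a minimal sketch (not part of the repository), the following Python helper builds both lists and fails early when a mate is missing. The function name `write_read_lists` and the assumption that mates differ only in the `R1`/`R2` suffix are illustrative.

```python
from pathlib import Path


def write_read_lists(directory, out_dir=".",
                     r1_suffix="R1.fastq.gz", r2_suffix="R2.fastq.gz"):
    """Write read1_files.txt and read2_files.txt with matching line order.

    Assumes (hypothetically) that each R2 file name is the R1 file name
    with r1_suffix swapped for r2_suffix.
    """
    directory = Path(directory).resolve()
    read1 = sorted(p for p in directory.iterdir()
                   if p.is_file() and p.name.endswith(r1_suffix))
    read2 = []
    for r1 in read1:
        r2 = r1.with_name(r1.name[:-len(r1_suffix)] + r2_suffix)
        if not r2.is_file():
            raise FileNotFoundError("no mate found for %s" % r1)
        read2.append(r2)

    out_dir = Path(out_dir)
    # One absolute path per line, as the find commands above produce.
    (out_dir / "read1_files.txt").write_text(
        "".join("%s\n" % p for p in read1))
    (out_dir / "read2_files.txt").write_text(
        "".join("%s\n" % p for p in read2))
    return read1, read2
```

Calling `write_read_lists(".")` from the read directory would produce the same two files as the `find` commands, with the added check that line n of one list is the mate of line n of the other.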

Dependencies and prerequisites

  • Linux, kernel 2.6.32 or later
  • nextflow, version 0.24 or later
  • python 3
  • gfapy
  • several others

Major components of the workflow

The workflow first constructs a compacted de Bruijn graph using bcalm, then trims the dBg using btrim. The abundance of each path (unitig) in each sample is then estimated via tigops. Next, the abundance data and graph structure are given to a Bayesian graph deconvolution model. The posterior output from the model is summarised and the number of strain genomes as well as their sequences is then recorded.

Bayesian assembly graph deconvolution implemented in Stan

A model has been implemented in Stan code to carry out assembly graph deconvolution. The input file for the model must contain the following information about the unitig graph:

  • V: the number of unitigs
  • S: the number of samples
  • depths: a list of length VxS containing the depth of k-mer coverage for each unitig in each sample. Sample depths are given in succession for each unitig, e.g. if we denote the depth for unitig i in sample j as i.j, the depths are given as 1.1, 1.2, 1.3, ... , 2.1, 2.2, 2.3 and so on.
  • adj1count: the number of unitigs with a single outgoing edge on an end
  • adj1source: the source unitig ID for each unitig with a single outgoing edge on an end. Negative values indicate an outgoing edge coming from the end of the reverse-complement unitig sequence.
  • adj1dest: the destination unitig ID corresponding to each of the source nodes listed in adj1source
  • adj2count: the number of unitigs with two outgoing edges on an end
  • adj2source: unitig IDs with two outgoing edges on an end
  • adj2dest1: the first destination edge for the corresponding unitig given in adj2source
  • adj2dest2: the second destination edge for the corresponding unitig given in adj2source
  • lengths: the lengths of the V unitigs
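
To make the layout of these fields concrete, here is a minimal Python sketch that assembles them for a toy graph. It is an illustration, not the repository's own tooling. The helper names (`flatten_depths`, `depth_at`, `build_model_input`), the toy values, the use of 1-based unitig IDs, and the JSON printout are all assumptions. The on-disk format the model actually expects is not specified in this section.

```python
import json


def flatten_depths(depth_table):
    """Flatten a V x S table (one row per unitig) in unitig-major order."""
    n_samples = len(depth_table[0])
    if any(len(row) != n_samples for row in depth_table):
        raise ValueError("each unitig needs exactly one depth per sample")
    return [depth for row in depth_table for depth in row]


def depth_at(depths, n_samples, unitig, sample):
    """Depth of 1-based unitig i in 1-based sample j within the flat list."""
    return depths[(unitig - 1) * n_samples + (sample - 1)]


def build_model_input(depth_table, lengths, adj1, adj2):
    """Assemble the fields listed above into one dictionary.

    adj1 holds (source, dest) pairs; adj2 holds (source, dest1, dest2)
    triples. A negative source ID marks an edge leaving the end of the
    reverse-complement unitig sequence.
    """
    if len(lengths) != len(depth_table):
        raise ValueError("lengths must have one entry per unitig")
    return {
        "V": len(depth_table),
        "S": len(depth_table[0]),
        "depths": flatten_depths(depth_table),
        "adj1count": len(adj1),
        "adj1source": [s for s, _ in adj1],
        "adj1dest": [d for _, d in adj1],
        "adj2count": len(adj2),
        "adj2source": [s for s, _, _ in adj2],
        "adj2dest1": [d1 for _, d1, _ in adj2],
        "adj2dest2": [d2 for _, _, d2 in adj2],
        "lengths": list(lengths),
    }


# Hypothetical toy graph: 3 unitigs observed in 2 samples.
toy = build_model_input(
    depth_table=[[10.0, 4.0], [6.0, 1.0], [4.0, 3.0]],
    lengths=[120, 85, 60],
    adj1=[(2, 1), (-3, 1)],
    adj2=[(1, 2, 3)],
)
print(json.dumps(toy, indent=2))
```

With this layout, the depth of unitig i in sample j sits at position (i - 1) * S + (j - 1) of the flat list, which is what `depth_at` computes.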

The MEGAHIT hack

We also created a hacked-up version of megahit, designed to take multiple samples. It uses the same command-line interface as standard megahit. Two output files are generated in addition to the usual megahit outputs. The first is intermediate_contigs/k*.unitigs.fa, which contains the unitig sequences. This can optionally be processed with the megahit toolkit contigs2fastg program to generate a graph viewable in Bandage. The second output is intermediate_contigs/k*.unitig_depths.Rdata, which holds the unitig graph information in the form described above for the Stan model input.

graphdeconvolution's People

Contributors

koadman, snurk

graphdeconvolution's Issues

reads for a tiny test dataset

Now that something resembling an end-to-end pipeline is implemented it would be helpful to have a small test set for a usage example. @chrisquince would you be willing to push the read set behind the 3 gene x 5 strain data we were using during the hackathon?

Invalid tag name for GFAv1 specification

g.write('H\tVN:Z:1.0\t k:i:%d\n' %k) # includes the k-mer size

This line introduces an optional field using a tag that does not comply with the GFAv1 specification. Tags are required to be two characters long, matching the pattern: [a-zA-Z][a-zA-Z0-9]

https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md#optional-fields

I would suggest km but it is already being used for Segment. However, there is no reason you cannot use the same tag on two different records, with different meanings. Perhaps this is confusing. Tags are also case-sensitive, so you could disambiguate it slightly with KM.
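
As a rough sketch of what a compliant header line could look like, the Python snippet below checks each optional field against the GFA1 TAG:TYPE:VALUE pattern before writing it. The function name `header_line` is hypothetical, and `KM` is only the tag floated above, not a tag defined by the specification.

```python
import re

# GFA1 optional field: two-character tag, one-letter type, non-empty value.
OPTIONAL_FIELD = re.compile(r"[A-Za-z][A-Za-z0-9]:[AifZJHB]:.+")


def header_line(k, tag="KM"):
    """Build a GFA1 header line that records the k-mer size.

    The tag name is an assumption; any two-character tag matching the
    pattern would pass the check.
    """
    fields = ["H", "VN:Z:1.0", "%s:i:%d" % (tag, k)]
    for field in fields[1:]:
        if not OPTIONAL_FIELD.fullmatch(field):
            raise ValueError("non-compliant GFA1 optional field: %r" % field)
    # Fields are separated by single tabs, with no padding spaces.
    return "\t".join(fields) + "\n"


print(repr(header_line(31)))
```

Passing `tag="k"`, as in the original line, raises a `ValueError` because the tag is only one character long.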
