
accuMUlate's Introduction


Calling mutations from MA lines

accuMUlate is a mutation caller designed with Mutation Accumulation (MA) experiments in mind. The probabilistic approach to mutation calling implemented by accuMUlate allows us to identify putative mutations, accommodate the noise produced by NGS sequencing, and handle diploid, haploid, or diploid-to-haploid experimental designs.

The preprint is available on bioRxiv: accuMUlate: A mutation caller designed for mutation accumulation experiments.

Getting started

The wiki gives a detailed account of how to prepare your data, compile accuMUlate, run the program and understand the results. If you want to get started even more quickly here's what you need to know.

Prerequisites

In order to install accuMUlate you will need the following libraries:

  • BamTools
  • Eigen
  • Boost

You will also need CMake to manage the build. Package managers for Linux distributions and OS X should let you install pre-compiled versions of Eigen, Boost and CMake. The wiki describes how to compile BamTools. If you want to run the program's unit tests you will also need GTest, but this is not a requirement for getting the software running.

Compilation

With prerequisites installed, building the software is easy:

cd build
cmake ..
make

Test run

The above commands should build two programs in the build directory: accuMUlate (the mutation caller) and denominate (a tool for calculating the number of callable sites in a BAM file). To check that everything has gone well you can run these on some test data. First accuMUlate; run from the repository's root directory, this should produce a warning message and detailed information about one possible mutation:

build/accuMUlate -c test/data/example_params.ini \
             -b test/data/test.bam \
             -r test/data/test.fasta \
             -i test/data/test.bed 
Warning: excluding data from 'D6' which is included in the BAM file but not the list of included samples
good_mutation	600	601	C	D1	CC->G	0.999999	0.999999	1	1.86137e-10	137	13	11	0	011.8531	-0.756908	1	1	

And then denominate. The string of integers it prints gives, for each sample, the number of ancestrally "A", "C", "G" and "T" bases that could have been called for a mutation if one were present:

build/denominate -c test/data/example_params.ini \
            -b test/data/test.bam \
            -r test/data/test.fasta \
            -i test/data/test.bed \
            --max-depth 150
Warning: excluding data from 'D6' which is included in the BAM file but not the list of included samples
2	1	0	0	2	1	0	0	2	1	0	0	2	1	0	0	2	1	0   0	
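As a quick sanity check on denominate's output, the per-sample totals of callable sites can be recovered by summing each group of four counts. A minimal sketch using the toy output above; the assumption that each sample contributes one group of four A/C/G/T counts follows the description of the columns:

```shell
# Sum each sample's four ancestral-base counts ("A", "C", "G", "T")
# from a denominate output line to get its total number of callable sites.
# The numbers are the toy example output shown above.
echo "2 1 0 0 2 1 0 0 2 1 0 0 2 1 0 0 2 1 0 0" |
awk '{
    for (i = 1; i <= NF; i += 4) {
        sample = (i + 3) / 4
        total = $i + $(i + 1) + $(i + 2) + $(i + 3)
        print "sample " sample ": " total " callable sites"
    }
}'
# each of the five samples has 2 + 1 + 0 + 0 = 3 callable sites
```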

How it works

The accuMUlate model is described in Long et al., "Low base-substitution mutation rate in the germline genome of the ciliate Tetrahymena thermophila", Genome Biology and Evolution 8: 3629-39 (2016). doi: https://doi.org/10.1093/gbe/evw223

Help, bugs, and suggestions

Your first stop should be the wiki, which contains information about the input and output files and more detail about using accuMUlate. If that doesn't help, please file an issue at this repository or email David Winter.

accuMUlate's People

Contributors

aahowel3, dwinter, stevenhwu


accuMUlate's Issues

Clean up

There are now a lot of files in the root directory. For the sake of maintainability, we should set up a better structure, with most of these files in src/ or include/ directories.

New post processor

The current post-processor was hacked together to get some results quickly. As of the latest commits it does not compile, and rather than patching it up it should probably be rebuilt from the ground up. Here's how I see it ultimately working.

A C++ executable that

  • Calls the mutant line from site data using our model (at present accuMUlate calculates P(one_mutant|data), but doesn't collapse the likelihoods down to identify the mutant line or the direction of mutation)
  • Gathers summary info similar to GATK genotyping applications, but with a focus on mutation detection (coverage, mutant allele frequency in "non-mutant" samples, etc.)
  • Outputs all of that to a big table

An R script (possibly called from the executable) that summarizes that table.

  • In particular, converting per-sample coverage from absolute numbers to quantiles
  • performing rank-sum tests for over-representation of possible artifacts in mutant calls, etc.

The final output would be returned either as a table summarising mutant calls or a VCF.

Segmentation fault (core dumped)

Hi

I ran accuMUlate and got an error:

        Segmentation fault (core dumped)

The program ran well at first, but after several lines of output it failed with this error. Has anyone had the same problem, and found a solution?

Thanks a lot

-i option for accuMUlate broken

When running accuMUlate in parallel using BED-file intervals (made using bedtools window) for the -i option, accuMUlate only processes the last interval.
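A workaround, given this report that only the last interval is processed, is to split the BED file so each accuMUlate process sees exactly one interval. A sketch, where windows.bed, params.ini, data.bam and ref.fasta are illustrative names and the accuMUlate call itself is left commented out:

```shell
# Split a BED file into one file per interval, so that each accuMUlate
# run is given a single interval via -i (working around the reported bug
# where only the last interval of a multi-interval BED file is used).
printf 'chr1\t0\t1000\nchr1\t1000\t2000\nchr1\t2000\t3000\n' > windows.bed
split -l 1 -a 3 windows.bed interval_

for bed in interval_*; do
    echo "would run: accuMUlate -c params.ini -b data.bam -r ref.fasta -i $bed"
    # accuMUlate -c params.ini -b data.bam -r ref.fasta -i "$bed" > "calls_${bed}.tsv" &
done
# wait   # then concatenate the calls_* files
```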

sorting of BAM files

Hi!
I'm currently using the accuMUlate tool on my samples, but I have run into one weird error when the job finishes.

I receive the message "Pileup::Run() : Data not sorted correctly!" several times.
Nonetheless I have the accuMUlate output as .txt and it is reasonable.

According to samtools, it was sorted and everything appears to be right with the merged bam file (20 bams in total).

Has this problem been reported before, and can it interfere with the output? Should the merged BAM file be sorted? (since it was reported that for large datasets it sometimes doesn't work)

My pipeline was:
picard MergeSamFiles to merge, followed by samtools sort and samtools index.

I would appreciate your input on this unusual error.
Thank you
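One quick thing worth checking in this situation is whether the merged BAM's header actually declares coordinate sorting. On a real file the header text would come from `samtools view -H merged.bam`, so the one-line header below is only a stand-in:

```shell
# Inspect the SO: tag on the @HD line, which records the declared sort order.
# With a real BAM: samtools view -H merged.bam | head -n 1
printf '@HD\tVN:1.6\tSO:coordinate\n' |
awk -F'\t' '/^@HD/ {
    for (i = 2; i <= NF; i++)
        if ($i ~ /^SO:/) {
            msg = ($i == "SO:coordinate") ? "header claims coordinate sorting" : "header does NOT claim coordinate sorting"
            print msg
        }
}'
```

A declared SO:coordinate does not guarantee the records really are in order, but a missing or different tag after a merge is a quick explanation for "Data not sorted correctly!".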

Clean up argument parsing/ setting up analysis in "main" loop

As we add more user-specified values, the argument parsing and "setting up" of values required to run the analysis gets more and more cumbersome. It would be a good idea to break some of this down into separate (testable) functions so the logic of the whole thing can be made clearer.

Abort

Hi,

I ran accuMUlate and got an error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 60)
Aborted

It seems something has gone wrong in the C++ code. Any ideas how to fix it?

Thanks very much.

Deletions being reported as SNPs

In the following situation accuMUlate reports deletions as SNPs (tview-like output):

TAC
.*.
.*.
,*,
,*,

This is reported as an A>C SNP instead of a deletion. I suspect that CIGAR parsing is failing (these lines have hard clipping as well).

Call the most likely mutation

At the moment, we write out the position of each "mutation-ish" site from the main accuMUlate binary. The post-processor uses a hacky way of getting the most likely mutation, but we should do better.

We should go back and work out which "mutation path(s)" are most likely. As part of #3 we should make this call and provide it for the user in the 'main' output file, with columns for

chrom start end p-any p-one mutant-line from-base to-base ref-base

Using accuMUlate with n=3

Hi! I wonder if the tool can work with polyploid data. When I use n=3 in my params.ini I get the error: terminate called after throwing an instance of 'boost::program_options::invalid_option_value' what(): the argument ('accuMUlate can't only deal with haploid or diploid ancestral samples') for option is invalid. I do not get the same issue when I run with n=2. Can this be solved somehow?

Documentation

There should be much more extensive documentation including

  • a "quick start" example usage
  • detailed explanation of input formats (i.e. single bam with sample headers)
  • documentation of parameters and their meaning
  • detailed documentation of usage, including parallelization
  • documentation of output formats
  • example of processing outputs

More stuff that working with @aahowel3 has thrown up:

  • Carefully document meaning of sample-name (i.e. SM tag in read group) and demonstrate a way to extract all samples from a BAM header

We are nearly there on this one.

  • Document statistical tests in output
  • Include link to accuMUlate-demonstration repo as an example pipeline
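The SM-tag extraction mentioned above can be sketched with standard tools. On a real file the header would come from `samtools view -H in.bam`; the toy header below (its read-group IDs and library tags are invented, the sample names echo the test data) stands in for that output:

```shell
# Extract the unique SM (sample) tags from the @RG lines of a BAM header.
# With a real BAM: samtools view -H in.bam | awk ... | sort -u
# The header below is a toy stand-in: two samples across three read groups.
hdr=$(printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:lane1\tSM:D1\tLB:lib1\n@RG\tID:lane2\tSM:D1\tLB:lib2\n@RG\tID:lane3\tSM:D6\tLB:lib3\n')

printf '%s\n' "$hdr" |
awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' |
sort -u
# prints the two distinct sample names: D1 and D6
```

Note that sample D1 appears in two read groups but only once in the output, which is exactly the distinction accuMUlate's sample list needs.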

Speed up pileup

At the moment we use BamTools for pileup, which is fairly slow. We could probably get some speed-up either by using @stevenhwu's 'bamtools utils' code to speed up parsing of reads, or by using DNG's approach, which uses htslib.

When using intervals, only call sites within interval

At the moment, when the user specifies intervals to call mutations on we end up with some calls from either side of the intervals. That's because the bamtools parser returns all reads that overlap with a region:

...[GAATGCTAGC]...
 CACGAATGCTA
 CACGAATGCTA
  ACGAATGCTA

It's not only wasteful to do this; the calls from outside the interval will also be misleading, because they don't include all the reads that overlap the called site.

When using the interval mode, the GatherReadData() function should immediately return false when the pileupPosition is outside of the interval.
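Until GatherReadData() returns false outside the interval, the spill-over calls can be removed after the fact. A sketch that keeps only rows whose coordinates fall inside one interval; the two input rows, and the assumption that columns 1-3 of the bed-like output are chrom/start/end, are illustrative:

```shell
# Post-filter accuMUlate's bed-like output, keeping only calls whose
# start/end coordinates fall inside a single interval. The two rows and
# the chrom/start/end column layout are assumed for illustration.
printf 'good_mutation\t600\t601\tC\nother_contig\t800\t801\tA\n' |
awk -F'\t' -v chrom=good_mutation -v lo=600 -v hi=700 \
    '$1 == chrom && $2 + 0 >= lo + 0 && $3 + 0 <= hi + 0'
# only the good_mutation 600-601 call survives the filter
```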

Take genotype likelihoods or raw counts as input

This might be a "next release" sort of goal, as it's beyond what we plan for the paper but would nevertheless be helpful.

At the moment we are only calling SNV mutations, because that's what our DM model is good for.
However, it would be useful to call putative indels, CNVs and other mutations in an MA framework. The easiest way to do that would be to either

  • take counts (i.e. number of reads) for each allele and use the DM model for likelihoods
  • take genotype likelihoods calculated from Some Other Software (vcf?) as input

The latter would require a VCF parser (probably htslib?).

Use all sequencing data to infer ancestral genotype when summarising mutation

At present we use only the ancestor's sequencing data when inferring the ancestral genotype in DescribeMutant. However, due to the nature of MA experiments, all the sequencing data tells us something about the probability of the ancestral genotype (since most descendants will have the ancestral allele at most sites). Thus we should use the overall genotype probabilities (i.e. anc_genotypes in TetMAProbability) when determining the ancestral genotype.

NOTE: This only affects the from->to column in our output, not the mutation probabilities, which already incorporate this information.

Allow users to specify samples to include (and ancestral sample)

At present we use all samples in a BAM file, but there may well be cases in which it makes sense to exclude some samples (actually the Tt data is a case in point!)

We also assume that the ancestor is the first sample in the ReadDataVector, which is not always going to be the case.

  • Allow the user to specify the names of samples to include via the config file (and possibly warn if some samples in the read-group dictionary are being skipped)
  • Make the ancestral sample a special case
  • Possibly modify the site data struct to include ancestral reads as a separate element

(this is related to #6 since the data-collecting function would have to accommodate this change)

Initialization of m_samples() and ReadDataVector

At present, the ReadDataVector is initialised with the same length as m_samples, which is the number of read groups encountered when parsing the BAM header -- meaning samples with more than one read group are counted twice.

The length should be equal to the number of samples.

Only use primary lines by default

From: https://samtools.github.io/hts-specs/SAMv1.pdf

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies FLAG & 0x900 == 0. This line is called the primary line of the read.

In the latest Tt run, accuMUlate calls SNPs if one of the strains has a chimeric alignment. The SAM lines that are part of a chimeric alignment have their 'supplementary alignment' flag set. It is more natural to restrict calls by default to variants supported by primary alignments.
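The spec's test is a plain bitmask: 0x100 marks secondary alignments and 0x800 supplementary (chimeric) ones, so FLAG & 0x900 == 0 picks out the primary line. A sketch of the check, plus the samtools filter that drops both classes up front:

```shell
# FLAG & 0x900 == 0 identifies the primary line of a read:
# 0x100 = secondary alignment, 0x800 = supplementary (chimeric) alignment.
is_primary() { [ $(( $1 & 0x900 )) -eq 0 ] && echo "primary" || echo "not primary"; }

is_primary 0      # plain forward-mapped read   -> primary
is_primary 99     # properly paired read        -> primary
is_primary 256    # secondary alignment (0x100) -> not primary
is_primary 2048   # supplementary (0x800)       -> not primary

# To exclude both classes when inspecting a BAM:
#   samtools view -F 0x900 in.bam
```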

Optionally write to std out

At the moment we write to an ostream which has to be a file specified via the command line / config arguments. We should allow this to be set to stdout (or have this be the default).

(all other messages are printed to std::cerr, so this shouldn't change the overall behavior of what is returned)

function for ReadDataVector collection

At the moment, both the "main" function and each of the utility functions have their own ReadDataVector-collecting visitors, which leads to a whole heap of replicated code.

It would make more sense to have a single 'ReadData collector' function, with different visitors to deal with the resulting vectors. This design would also make it easier to swap out BamTools for an htslib pileup in the future.

Speed of parsing

Speed is going to be a real issue with the in-memory site calling. At present it is going to take ~6 days to process all 100 million sites for Tetrahymena, which isn't really feasible.

I can try to speed up some of what I'm doing, but I think we will need to replace BamTools with another BAM processing library to get a real speed-up.

Cache re-used objects

At the moment there are a few objects that get created every time we process a site's data. In particular, three MutationMatrix objects get built up with every mutation call, but the exact same one is used every time.

It would make more sense to have these matrices as global variables or (better) members of the VariantVisitor classes, set once at the start of the run.

Core dumped

Hi! I am trying to run accuMUlate on my 64 samples, but it is core-dumping with the following error:

$ accuMUlate -c params.ini -b CAN.bam -r ../reference_clean.fasta > putative_mutations.tsv
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::required_option> >'
  what():  the option '--mu' is required but missing
Aborted (core dumped)

The CAN.bam was prepared following your instructions, as was the attached params.ini.
Am I doing something wrong?
params.txt

PS: The ancestor has not been sequenced, so it is present only as a label

Output format

We should decide on one...

At the moment we print a bed-like file that has only the probability of mutation, not the mutant sample or its genotype. It might make sense to invoke something like what we have in the post-processor (#3) when a site meets the probability cutoff, so we can write a lot more information about each site.

Core dump

Hi!

I am trying to use the tool on a set of 7 samples (including the ancestor). When running it in parallel, I get an error saying: terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr. This is what my parameters file looks like:
parameters.txt

I am not sure of what can be going wrong.
