
accuMUlate's Introduction


Calling mutations from MA lines

accuMUlate is a mutation caller designed with Mutation Accumulation (MA) experiments in mind. The probabilistic approach to mutation calling implemented by accuMUlate allows us to identify putative mutations, accommodate the noise produced by NGS sequencing, and handle diploid, haploid, or diploid-to-haploid experimental designs.

The preprint is available on bioRxiv: accuMUlate: A mutation caller designed for mutation accumulation experiments.

Getting started

The wiki gives a detailed account of how to prepare your data, compile accuMUlate, run the program and understand the results. If you want to get started even more quickly here's what you need to know.

Prerequisites

In order to install accuMUlate you will need the following libraries:

  • BamTools
  • Eigen
  • Boost

You will also need CMake to manage the build. Package managers for Linux distributions and OS X should let you install pre-compiled versions of Eigen, Boost and CMake. The wiki describes how to compile BamTools. If you want to run the program's unit tests you will also need GTest, but this is not a requirement for getting the software running.

Compilation

With prerequisites installed, building the software is easy:

cd build
cmake ..
make

Test run

The above commands should build two programs in the build directory: accuMUlate (the mutation caller) and denominate (a tool for calculating the number of callable sites in a BAM file). To check that everything has gone well you can run these on some test data. First accuMUlate; run from the repository's root directory, this should produce a warning message and detailed information about one possible mutation:

build/accuMUlate -c test/data/example_params.ini \
             -b test/data/test.bam \
             -r test/data/test.fasta \
             -i test/data/test.bed 
Warning: excluding data from 'D6' which is included in the BAM file but not the list of included samples
good_mutation	600	601	C	D1	CC->G	0.999999	0.999999	1	1.86137e-10	137	13	11	0	011.8531	-0.756908	1	1	

And then denominate. The string of integers it prints gives, for each sample, the number of ancestrally "A", "C", "G" and "T" bases that could have been called for a mutation if one were present:

build/denominate -c test/data/example_params.ini \
            -b test/data/test.bam \
            -r test/data/test.fasta \
            -i test/data/test.bed \
            --max-depth 150
Warning: excluding data from 'D6' which is included in the BAM file but not the list of included samples
2	1	0	0	2	1	0	0	2	1	0	0	2	1	0	0	2	1	0   0	
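As a quick sanity check on denominate's output, the per-sample totals of callable sites can be recovered by summing each group of four counts. A minimal sketch using the toy output above; the assumption that each sample contributes one group of four A/C/G/T counts follows the description of the columns:

```shell
# Sum each sample's four ancestral-base counts ("A", "C", "G", "T")
# from a denominate output line to get its total number of callable sites.
# The numbers are the toy example output shown above.
echo "2 1 0 0 2 1 0 0 2 1 0 0 2 1 0 0 2 1 0 0" |
awk '{
    for (i = 1; i <= NF; i += 4) {
        sample = (i + 3) / 4
        total = $i + $(i + 1) + $(i + 2) + $(i + 3)
        print "sample " sample ": " total " callable sites"
    }
}'
# each of the five samples has 2 + 1 + 0 + 0 = 3 callable sites
```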

How it works

The accuMUlate model is described in Long et al., "Low base-substitution mutation rate in the germline genome of the ciliate Tetrahymena thermophila", Genome Biology and Evolution 8: 3629-39 (2016). doi: https://doi.org/10.1093/gbe/evw223

Help, bugs, and suggestions

Your first stop should be the wiki, which contains information about the input and output files and more detail about using accuMUlate. If that doesn't help, please file an issue at this repository or email David Winter.

accuMUlate's People

Contributors

aahowel3, dwinter, stevenhwu


accuMUlate's Issues

Clean up

There are now a lot of files in the root directory. For the sake of maintainability, we should set up a better structure, with most of these files in src/ or include/ directories.

New post processor

The current post-processor was hacked together to get some results quickly. As of the latest commits it does not compile, and rather than patching it up it should probably be rebuilt from the ground up. Here's how I see it ultimately working.

A C++ executable that

  • Calls the mutant line from site data using our model (at present accuMUlate calculates P(one_mutant|data), but doesn't collapse the likelihoods down to identify the mutant line or the direction of mutation)
  • Gathers summary info similar to GATK genotyping applications, but with a focus on mutation detection (coverage, mutant allele frequency in "non-mutant" samples, etc.)
  • Outputs all of that to a big table

An R script (possibly called from the executable) that summarizes that table.

  • In particular, converting per-sample coverage from absolute numbers to quantiles
  • performing rank-sum tests for over-representation of possible artifacts in mutant calls, etc.

The final output would be returned either as a table summarising mutant calls or a VCF.

Segmentation fault (core dumped)

Hi

I ran accuMUlate and got an error:

        Segmentation fault (core dumped)

The program ran well at first, but after several lines of output it failed with this error. Has anyone had the same problem, and found a solution?

Thanks a lot

-i option for accuMUlate broken

When running accuMUlate in parallel using BED-file intervals (made using bedtools window) for the -i option, accuMUlate only processes the last interval.
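A workaround, given this report that only the last interval is processed, is to split the BED file so each accuMUlate process sees exactly one interval. A sketch, where windows.bed, params.ini, data.bam and ref.fasta are illustrative names and the accuMUlate call itself is left commented out:

```shell
# Split a BED file into one file per interval, so that each accuMUlate
# run is given a single interval via -i (working around the reported bug
# where only the last interval of a multi-interval BED file is used).
printf 'chr1\t0\t1000\nchr1\t1000\t2000\nchr1\t2000\t3000\n' > windows.bed
split -l 1 -a 3 windows.bed interval_

for bed in interval_*; do
    echo "would run: accuMUlate -c params.ini -b data.bam -r ref.fasta -i $bed"
    # accuMUlate -c params.ini -b data.bam -r ref.fasta -i "$bed" > "calls_${bed}.tsv" &
done
# wait   # then concatenate the calls_* files
```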

sorting of BAM files

Hi!
I'm currently using the accuMUlate tool on my samples, but I have run into one weird error when the job finishes.

I receive the message "Pileup::Run() : Data not sorted correctly!" several times.
Nonetheless I have the accuMUlate output as .txt and it is reasonable.

According to samtools, it was sorted and everything appears to be right with the merged bam file (20 bams in total).

Has this problem been reported before, and can it interfere with the output? Should the merged BAM file be sorted? (since it was reported that for large datasets it sometimes doesn't work)

My pipeline was:
picard MergeSamFiles to merge, followed by samtools sort and samtools index.

I would appreciate your input on this unusual error.
Thank you
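One quick thing worth checking in this situation is whether the merged BAM's header actually declares coordinate sorting. On a real file the header text would come from `samtools view -H merged.bam`, so the one-line header below is only a stand-in:

```shell
# Inspect the SO: tag on the @HD line, which records the declared sort order.
# With a real BAM: samtools view -H merged.bam | head -n 1
printf '@HD\tVN:1.6\tSO:coordinate\n' |
awk -F'\t' '/^@HD/ {
    for (i = 2; i <= NF; i++)
        if ($i ~ /^SO:/) {
            msg = ($i == "SO:coordinate") ? "header claims coordinate sorting" : "header does NOT claim coordinate sorting"
            print msg
        }
}'
```

A declared SO:coordinate does not guarantee the records really are in order, but a missing or different tag after a merge is a quick explanation for "Data not sorted correctly!".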

Clean up argument parsing/ setting up analysis in "main" loop

As we add more user-specified values, the argument parsing and "setting up" of values required to run the analysis gets more and more cumbersome. It would be a good idea to break some of this down into separate (testable) functions so the logic of the whole thing can be made clearer.

Abort

Hi,

I ran accuMUlate and got an error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551615) > this->size() (which is 60)
Aborted

It seems something has gone wrong in the C++ code. Any ideas how to fix it?

Thanks very much.

Deletions being reported as SNPs

In the following situation accuMUlate reports deletions as SNPs (tview-like output):

TAC
.*.
.*.
,*,
,*,

This is reported as an A>C SNP instead of a deletion. I suspect that CIGAR parsing is failing (these lines have hard clipping as well).

Call the most likely mutation

At the moment, we write out the position of each "mutation-ish" site from the main accuMUlate binary. The post-processor uses a hacky way of getting the most likely mutation, but we should do better.

We should go back and work out which "mutation path(s)" are most likely. As part of #3 we should make this call and provide it for the user in the 'main' output file, with columns for

chrom start end p-any p-one mutant-line from-base to-base ref-base

Using accuMUlate with n=3

Hi! I wonder if the tool can work with polyploid data. When I use n=3 in my params.ini I get the error: terminate called after throwing an instance of 'boost::program_options::invalid_option_value' what(): the argument ('accuMUlate can't only deal with haploid or diploid ancestral samples') for option is invalid. I do not get the same issue when I run with n=2. Can this be solved somehow?

Documentation

There should be much more extensive documentation including

  • a "quick start" example usage
  • detailed explanation of input formats (i.e. single bam with sample headers)
  • documentation of parameters and their meaning
  • detailed documentation of usage, including parallelization
  • documentation of output formats
  • example of processing outputs

More stuff that working with @aahowel3 has thrown up:

  • Carefully document meaning of sample-name (i.e. SM tag in read group) and demonstrate a way to extract all samples from a BAM header

We are nearly there on this one.

  • Document statistical tests in output
  • Include link to accuMUlate-demonstration repo as an example pipeline
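The SM-tag extraction mentioned above can be sketched with standard tools. On a real file the header would come from `samtools view -H in.bam`; the toy header below (its read-group IDs and library tags are invented, the sample names echo the test data) stands in for that output:

```shell
# Extract the unique SM (sample) tags from the @RG lines of a BAM header.
# With a real BAM: samtools view -H in.bam | awk ... | sort -u
# The header below is a toy stand-in: two samples across three read groups.
hdr=$(printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:lane1\tSM:D1\tLB:lib1\n@RG\tID:lane2\tSM:D1\tLB:lib2\n@RG\tID:lane3\tSM:D6\tLB:lib3\n')

printf '%s\n' "$hdr" |
awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' |
sort -u
# prints the two distinct sample names: D1 and D6
```

Note that sample D1 appears in two read groups but only once in the output, which is exactly the distinction accuMUlate's sample list needs.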

Speed up pileup

At the moment we use BamTools for pileup, which is fairly slow. We could probably get some speed-up either by using @stevenhwu's 'bamtools utils' code to speed up parsing of reads, or by using DNG's approach, which uses htslib.

When using intervals, only call sites within interval

At the moment, when the user specifies intervals to call mutations on we end up with some calls from either side of the intervals. That's because the bamtools parser returns all reads that overlap with a region:

...[GAATGCTAGC]...
 CACGAATGCTA
 CACGAATGCTA
  ACGAATGCTA

It's not only wasteful to do this; the calls from outside the interval will also be misleading, because they don't include all the reads that overlap the called site.

When using the interval mode, the GatherReadData() function should immediately return false when the pileupPosition is outside of the interval.
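Until GatherReadData() returns false outside the interval, the spill-over calls can be removed after the fact. A sketch that keeps only rows whose coordinates fall inside one interval; the two input rows, and the assumption that columns 1-3 of the bed-like output are chrom/start/end, are illustrative:

```shell
# Post-filter accuMUlate's bed-like output, keeping only calls whose
# start/end coordinates fall inside a single interval. The two rows and
# the chrom/start/end column layout are assumed for illustration.
printf 'good_mutation\t600\t601\tC\nother_contig\t800\t801\tA\n' |
awk -F'\t' -v chrom=good_mutation -v lo=600 -v hi=700 \
    '$1 == chrom && $2 + 0 >= lo + 0 && $3 + 0 <= hi + 0'
# only the good_mutation 600-601 call survives the filter
```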

Take genotype likelihoods or raw counts as input

This might be a "next release" sort of goal, as it's beyond what we plan for the paper but would nevertheless be helpful.

At the moment we are only calling SNV mutations, because that's what our DM model is good for.
However, it would be useful to call putative indels, CNVs and other mutations in an MA framework. The easiest way to do that would be to either

  • take counts (i.e. number of reads) for each allele and use the DM model for likelihoods
  • take genotype likelihoods calculated from Some Other Software (vcf?) as input

The latter would require a VCF parser (probably htslib?).

Use all sequencing data to infer ancestral genotype when summarising mutation

At present we use only the ancestor's sequencing data when inferring the ancestral genotype in DescribeMutant. However, due to the nature of MA experiments, all the sequencing data tells us something about the probability of the ancestral genotype (since most descendants will have the ancestral allele at most sites). Thus we should use the overall genotype probabilities (i.e. anc_genotypes in TetMAProbability) when determining the ancestral genotype.

NOTE: This only affects the from->to column in our output, not the mutation probabilities, which already incorporate this information.

Allow users to specify samples to include (and ancestral sample)

At present we use all samples in a BAM file, but there may well be cases in which it makes sense to exclude some samples (actually the Tt data is a case in point!)

We also assume that the ancestor is the first sample in the ReadDataVector, which is not always going to be the case.

  • Allow the user to specify the names of samples to include via the config file (and possibly warn if some samples in the read-group dictionary are being skipped)
  • Make the ancestral sample a special case
  • Possibly modify the site data struct to include ancestral reads as a separate element

(this is related to #6 since the data-collecting function would have to accommodate this change)

Initialization of m_samples() and ReadDataVector

At present, the ReadDataVector is initialised with the same length as m_samples, which is the number of read groups encountered when parsing the BAM header -- meaning samples with more than one read group are counted twice.

The length should be equal to the number of samples.

Only use primary lines by default

From: https://samtools.github.io/hts-specs/SAMv1.pdf

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies FLAG & 0x900 == 0. This line is called the primary line of the read.

In the latest Tt run, accuMUlate calls SNPs if one of the strains has a chimeric alignment. The SAM lines that are part of a chimeric alignment have their 'supplementary alignment' flag set. It is more natural to restrict calls by default to variants supported by primary alignments.
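The spec's test is a plain bitmask: 0x100 marks secondary alignments and 0x800 supplementary (chimeric) ones, so FLAG & 0x900 == 0 picks out the primary line. A sketch of the check, plus the samtools filter that drops both classes up front:

```shell
# FLAG & 0x900 == 0 identifies the primary line of a read:
# 0x100 = secondary alignment, 0x800 = supplementary (chimeric) alignment.
is_primary() { [ $(( $1 & 0x900 )) -eq 0 ] && echo "primary" || echo "not primary"; }

is_primary 0      # plain forward-mapped read   -> primary
is_primary 99     # properly paired read        -> primary
is_primary 256    # secondary alignment (0x100) -> not primary
is_primary 2048   # supplementary (0x800)       -> not primary

# To exclude both classes when inspecting a BAM:
#   samtools view -F 0x900 in.bam
```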

Optionally write to std out

At the moment we write to an ostream which has to be a file specified via the command line / config arguments. We should allow this to be set to stdout (or have this be the default).

(all other messages are printed to std::cerr, so this shouldn't change the overall behavior of what is returned)

function for ReadDataVector collection

At the moment, both the "main" function and each of the utility functions have their own ReadDataVector-collecting visitors, which leads to a whole heap of replicated code.

It would make more sense to have a single 'ReadData collector' function, with different visitors to deal with the resulting vectors. This design would also make it easier to swap out BamTools for an htslib pileup in the future.

Speed of parsing

Speed is going to be a real issue with the in-memory site calling. At present it is going to take ~6 days to process all 100 million sites for Tetrahymena, which isn't really feasible.

I can try to speed up some of what I'm doing, but I think we will need to replace BamTools with another BAM processing library to get a real speed-up.

Cache re-used objects

At the moment there are a few objects that get created every time we process a site's data. In particular, three MutationMatrix objects get built up with every mutation call, but the exact same one is used every time.

It would make more sense to have these matrices as global variables or (better) members of the VariantVisitor classes, set once at the start of the run.

Core dumped

Hi! I am trying to run accuMUlate on my 64 samples, but it is core-dumping with the following error:

$ accuMUlate -c params.ini -b CAN.bam -r ../reference_clean.fasta > putative_mutations.tsv
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::required_option> >'
  what():  the option '--mu' is required but missing
Aborted (core dumped)

The CAN.bam was prepared following your instructions, as was the attached params.ini.
Am I doing something wrong?
params.txt

PS: The ancestor has not been sequenced, so it is present only as a label

Output format

We should decide on one...

At the moment we print a bed-like file that has only the probability of mutation, not the mutant sample or its genotype. It might make sense to invoke something like what we have in the post-processor (#3) when a site meets the probability cutoff, so we can write a lot more information about each site.

Core dump

Hi!

I am trying to use the tool on a set of 7 samples (including the ancestor). When running it in parallel, I get an error saying: terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr. This is what my parameters file looks like:
parameters.txt

I am not sure of what can be going wrong.
