
Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.

License: MIT License

Topics: dna-sequences haplotypes haplotype-assembly haplotype-blocks array numpy bayesian-inference polyploidy polyploid-genotyping genotyping

mchap's Introduction

MCHap

MCHap is a suite of command line tools for micro-haplotype assembly and genotype calling in autopolyploids. The primary components of MCHap are mchap assemble and mchap call.

Installation

MCHap and its dependencies can be installed from source using pip. From the root directory of this repository run:

pip install -r requirements.txt
python setup.py sdist
pip install dist/mchap-*.tar.gz

You should then be able to use the command line tool mchap, which is a wrapper around mchap assemble and mchap call.

MCHap includes a suite of unit tests which can be run from the root directory of this repository with:

pytest -v ./

MCHap assemble

mchap assemble is used for de novo assembly of micro-haplotypes in one or more individuals. Haplotypes are assembled from aligned reads in BAM files using known SNVs (single nucleotide variants) from a VCF file. A BED file is also required to specify the assembled loci. The output of mchap assemble is a VCF file with assembled micro-haplotype variants and genotype calls (some genotype calls may be incomplete). See the MCHap assemble documentation for further information.

MCHap call

mchap call is used for (re-) calling genotypes using a set of known micro-haplotypes. Genotypes are called using aligned reads in BAM files and known micro-haplotype alleles from a VCF file. The output of mchap call is a VCF file with micro-haplotype variants and genotype calls (all genotype calls will be complete). It is often beneficial to re-call genotypes with mchap call using the micro-haplotypes reported by mchap assemble, particularly in populations of related samples. See the MCHap call documentation for further information.

MCHap find-snvs

mchap find-snvs is a simple tool for identifying putative SNVs to use as the basis for haplotype assembly. Putative SNVs are identified based on minimum thresholds for allele depths and/or frequencies estimated from depths. The output is reported as a simple VCF file which includes allele depths and population allele frequencies (estimated from the mean of individual frequencies), but no genotype calls.

Example notebook

An example notebook demonstrating genotype calling with MCHap in a bi-parental population.

Funding

The development of MCHap was partially funded by the "Tools for Polyploids" Specialty Crop Research Initiative (NIFA USDA SCRI Award # 2020-51181-32156).

(Tools for Polyploids logo: docs/img/tools-for-polyploids.png)

mchap's People

Contributors: kiwiroy, timothymillar


mchap's Issues

Use freebayes-like files for IO

This would simplify a few things even if it results in requiring more files; the files are easy to create.

  • bam-file list
    • Shouldn't need to create a sample-bam map anymore?
  • sample-ploidy map (i.e. the new simpler cnv-map format in freebayes; a parsing sketch follows this list)
  • region-sample-ploidy map bedfile (not urgent but good to have in the long term)
  • sample-pedigree item map for adding replicates to pedigree graph
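
As a sketch of reading such a map (see the list above), assuming a simple two-column whitespace-separated sample-to-ploidy format; the exact format is an assumption for illustration:

def read_sample_ploidy_map(path):
    """Return a dict mapping sample name to ploidy."""
    ploidy_of = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            sample, ploidy = line.split()[:2]
            ploidy_of[sample] = int(ploidy)
    return ploidy_of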

VCF output

Need VCF output of assembled haplotypes

In the long term this should include output of the full posterior distribution in VCF (if possible)

In the short term this is just the "best" genotype (i.e. a call) that meets the following filters:

  • Posterior probability threshold e.g. p >=0.95
  • Read depth threshold e.g. depth >= 10
  • Read-representation via Kmers e.g. unmatched kmers <= 0.5% at any position (#17)
    - [ ] MCMC convergence metric (#16)

Called genotypes may be output in two VCF types:

  • Wide format with full haplotypes as alleles
  • Long format with base alleles and phase indicated by PS tag
    • Long format could include partial haplotypes?

Check probability encoding of SNPs

Currently the error probability is treated as the "probability that the true allele is not the called allele".
The alternate interpretation is the "probability that the call is meaningless", in which case some portion of that probability
should be assigned to the called allele, i.e. the calling went wrong but happened to result in a correct call.

This difference has a negligible effect on haplotype calls but has technical implications when incorporating MNPs #51.
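
To illustrate the difference, a minimal numpy sketch of the two interpretations (function names are illustrative, not part of MCHap):

import numpy as np

def allele_probs_strict(call, error_prob, n_alleles):
    # "true allele is not the called allele": all of the error
    # probability is spread over the other alleles
    probs = np.full(n_alleles, error_prob / (n_alleles - 1))
    probs[call] = 1 - error_prob
    return probs

def allele_probs_meaningless(call, error_prob, n_alleles):
    # "the call is meaningless": the error probability is spread over
    # all alleles, so a portion returns to the called allele
    probs = np.full(n_alleles, error_prob / n_alleles)
    probs[call] = (1 - error_prob) + error_prob / n_alleles
    return probs

# with error_prob=0.01 and 4 alleles the called allele has probability
# 0.99 under the first interpretation and 0.9925 under the second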

Batch processing for large datasets

Currently the CLI builds a template (and corresponding dask graph) for the full VCF file.
This maximizes the potential for dask to distribute the computation but still requires holding the entire VCF in memory in the primary python process.

To limit memory use we need to divide the VCF into chunks (based on loci) and stream the results to a file.
The simplest way to achieve this is to modify the .template function to take a subset of loci as an argument resulting in a smaller VCF chunk.
A separate dask graph can be built for each chunk, which is then computed and written to output before the next chunk is processed.
This still enables distributed computation in the order of (loci * samples) within each chunk.
The user can set the chunk size in terms of number of loci per chunk to tune the compute/memory trade off.

If this doesn't provide enough flexibility the user still has the option of running completely independent jobs producing separate VCFs (e.g. per chromosome) and then merging the VCF files.
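
A sketch of the chunked loop; build_template, compute_chunk and write_chunk are hypothetical stand-ins for the existing template/dask machinery:

def assemble_in_chunks(loci, chunk_size, outfile):
    """Assemble loci in chunks to bound memory use."""
    for start in range(0, len(loci), chunk_size):
        chunk = loci[start:start + chunk_size]
        graph = build_template(chunk)   # smaller dask graph per chunk
        records = compute_chunk(graph)  # distributed over loci * samples
        write_chunk(outfile, records)   # stream results before the next chunk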

Kmer checking of read representation

Add kmer checking of read representation in assembled genotypes (a sketch follows the list below).

  • Break reads into aligned kmers of variable alleles (3mers)
  • Check if each kmer matches any of the assembled haplotypes
  • If the proportion of non-matched kmers at a location is greater than a threshold (e.g. 5%) then the test fails
  • Use as a Filter for VCF output
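
A minimal numpy sketch of the check (simplified to a locus-wide proportion rather than a per-position test; encodings are assumed for illustration, with negative values marking unknown calls):

import numpy as np

def unmatched_kmer_proportion(reads, haplotypes, k=3):
    """Proportion of positioned read kmers not matching any haplotype."""
    hap_kmers = set()
    for hap in haplotypes:
        for i in range(len(hap) - k + 1):
            hap_kmers.add((i, tuple(hap[i:i + k])))
    total = matched = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if (kmer < 0).any():
                continue  # skip kmers containing unknown calls
            total += 1
            matched += (i, tuple(kmer)) in hap_kmers
    return 1 - matched / total if total else 0.0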

Metropolized-Gibbs sampler

Currently we are "brute forcing" Gibbs sampler transition probabilities by calculating likelihoods for each possible transition and dividing by the sum.
This results in a transition matrix (for a given parameter) in which all rows are equivalent.
For example the likelihoods [0.3, 0.2, 0.1] result in the transition matrix:

[[0.5       , 0.33333333, 0.16666667],
 [0.5       , 0.33333333, 0.16666667],
 [0.5       , 0.33333333, 0.16666667]]

If the transition matrix is instead calculated using the Metropolis-Hastings ratio we get the matrix:

[[0.5       , 0.33333333, 0.16666667],
 [0.5       , 0.25      , 0.25      ],
 [0.5       , 0.5       , 0.        ]]

This has equivalent long run behavior to the Gibbs matrix.
However, the second matrix is more "statistically efficient" because the chain is less likely to stay in the same state at each step (lower diagonal values).
This is even more pronounced as chain temperature is increased (see #65 ).
With an inverse temperature of 0 the Gibbs transition matrix becomes

[[0.33333333, 0.33333333, 0.33333333],
 [0.33333333, 0.33333333, 0.33333333],
 [0.33333333, 0.33333333, 0.33333333]]

where the MH matrix becomes

[[0. , 0.5, 0.5],
 [0.5, 0. , 0.5],
 [0.5, 0.5, 0. ]]

In the literature, a Metropolized-Gibbs sampler uses the transition values from a Gibbs sampler to develop a more efficient sampler.
This is not the same as calculating the MH transition probabilities directly (as above) but is similar.
Either method would improve sampling efficiency, especially in combination with #65.
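
For illustration, a minimal numpy sketch that reproduces the matrices above, assuming the MH matrix is built from a uniform proposal over the other states with a Metropolis-Hastings acceptance ratio (helper names are illustrative):

import numpy as np

def gibbs_matrix(likelihoods):
    # every row is the normalised likelihoods, regardless of current state
    probs = np.asarray(likelihoods) / np.sum(likelihoods)
    return np.tile(probs, (len(probs), 1))

def mh_matrix(likelihoods):
    # propose one of the other states uniformly, accept with min(1, p_j / p_i)
    probs = np.asarray(likelihoods) / np.sum(likelihoods)
    n = len(probs)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                mat[i, j] = min(1.0, probs[j] / probs[i]) / (n - 1)
        mat[i, i] = 1.0 - mat[i].sum()
    return mat

print(gibbs_matrix([0.3, 0.2, 0.1]))  # every row is [0.5, 0.333..., 0.166...]
print(mh_matrix([0.3, 0.2, 0.1]))     # reproduces the second matrix above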

Large scale denovo assembly

Need to support assembly of thousands of loci across hundreds of samples.

Basic workflow overview:

  • extract read-variants at specified loci and store them in a dense array structure for fast access
  • assemble read-variants and store trace in dense array structure for fast access
  • call haplotypes from trace (filter based on confidence) and output VCF file

Ideally the dense array filetype would support concurrent read/write by multiple processes

Investigate suitable tools for this:

  • Dask or the regular standard library for multiprocessing?
  • HDF5 or Zarr for block storage of read-variants and traces?

Filter variants

It would be good to filter variant records in the VCF produced by de novo assembly.

This is important for downstream analysis such as re-calling using the VCF as a reference set of haplotypes.
If haplotypes are re-called for a locus with few or no successful assemblies then the resulting re-called haplotypes will likely be incorrect.

This could be based on the proportion of samples that are individually filtered.

Calculate MEC

It would be useful to have the Minimum Error Correction (MEC) score in the VCF output for each sample at each locus.
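
For reference, a minimal numpy sketch of an MEC calculation, assigning each read to its closest haplotype and summing mismatches (the array encodings are assumptions for illustration):

import numpy as np

def mec_score(reads, haplotypes):
    """MEC over integer-encoded reads (-1 = position not covered)."""
    total = 0
    for read in reads:
        observed = read >= 0
        # mismatches against each haplotype, ignoring uncovered positions
        errors = ((haplotypes != read) & observed).sum(axis=-1)
        total += errors.min()  # read is assigned to its closest haplotype
    return total

reads = np.array([[0, 1, -1], [0, 0, 1]])
haps = np.array([[0, 1, 0], [0, 0, 1]])
print(mec_score(reads, haps))  # 0: each read matches a haplotype exactly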

Atomize VCF script

See #18 for description of wide vs long format.
Currently the assemble program outputs wide-format VCF files, i.e. each line contains a full haplotype block.
This is the most suitable output for the tool, giving posterior probabilities etc. for full haplotypes.

Long format VCF files (phased SNPs) would be useful and these can be generated by "atomizing" the haplotypes in the wide format VCF. This process will likely result in removal of some information relating to the full haplotype.

Documentation

Need to document at least the denovo program for publication.
Initially this can simply be done in an RST file.

Add chain incongruence filter

This would add a per-sample filter code in the event that any two replicate MCMC traces have high support for differing phenotypes.
A threshold parameter for the filter should have a sensible default, e.g. 60% posterior probability support for incongruent phenotypes.
This filter is unlikely to filter anything that is not already filtered by the posterior probability filter (unless the number of chains is >= 10), but it is useful for downstream analysis of why a given locus/sample is not assembling.

This would require calculating a per chain posterior distribution and checking that the mode phenotypes from each posterior that exceed the specified threshold support are identical.
The code for this filter could be ci<threshold>, defaulting to ci60: "Chain incongruence with 60% posterior support".

Plot haplotypes within pedigree

Plot pedigree of samples with haplotype graphics as/in nodes.

Best done with graphviz (python-graphviz).

Considerations:

  • a CLI would need to match up pedigree with assembled genotypes
  • need to match sample aliases (i.e. used in vcf) to pedigree items
  • biological replicates are common so multiple vcf samples could match single pedigree item
    • could use multiple nodes in a graphviz subgraph to collect replicates together
  • pedigree data file options:
    • .ped format could work but has limited options for relationships (mother/father) and no ploidy or sample aliases
    • .vcf can include pedigree but only for samples with genotype data and no aliases
    • a simple custom tabular format?
  • experience with pedigrees in graphviz shows that the output tends to get very wide, so it might be best to have haplotypes run top to bottom to reduce width

Basic data pipeline:

  • pedigree data would be read into a networkx DiGraph
    • pedigree item is the node id
    • if used, VCF sample labels would be stored as a set of strings in an attribute of the node
  • vcf data read in with pysam
    • probably use haplotype vcf rather than phased sets so there is a single VariantRecord to work with
  • function takes the networkx DiGraph and the VariantRecord as arguments and returns a graphviz DiGraph (a simplified sketch follows this list)
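
A simplified sketch of that final step, rendering plain text labels in place of haplotype graphics (the "samples" node attribute is the one suggested above):

import networkx as nx
import graphviz

def pedigree_to_graphviz(pedigree):
    """Render a pedigree DiGraph as a graphviz Digraph."""
    dot = graphviz.Digraph()
    dot.attr(rankdir="TB")  # run top to bottom to limit width
    for node, data in pedigree.nodes(data=True):
        samples = sorted(data.get("samples", ()))  # VCF sample labels, if any
        label = "{} [{}]".format(node, ",".join(samples)) if samples else str(node)
        dot.node(str(node), label=label)
    for parent, child in pedigree.edges():
        dot.edge(str(parent), str(child))
    return dot

pedigree = nx.DiGraph()
pedigree.add_edge("mum", "kid")
pedigree.add_edge("dad", "kid")
print(pedigree_to_graphviz(pedigree).source)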

Rename Repo

HaploHelper was originally started to provide data structures for analysis of output of other haplotype-assemblers.
It's now focused on its own assembly methods and should have a better name, possibly "HaploKit".

Merge in biovector functionality

Remove the dependency on the biovector library by merging in a subset of its functionality.
Biovector is only serving this library and has many unnecessary features.

Main changes:

  • support only integer and probabilistic encodings
  • encode/decode only from integer encoding
  • remove alphabet classes and instead use integer.encode(strings, alphabet='biallelic')

Add into source tree:

- encoding
    - integer
        - sequence.py
        - stats.py
        - transcode.py
    - probabilistic
        - sequence.py
        - stats.py
        - transcode.py
- kmer.py
- mset.py

Consider giving sub-modules more distinct names like int_array.py and prob_array.py

Also consider moving kmer.py into integer (i.e. explicitly only support integer encoding)

Rename to MCHap

HaploKit implies a more general library with multiple assembly methods (which was the original intention).
The current library is almost entirely focused on supporting the de novo assembly CLI which uses MCMC.
While other related methods may be added, the core tool will always be MCMC based so I think a more specific name should be used to reflect this.

Refactor with numba code

Refactor code as part of numba branch merge.

- assembly
    - complexity.py
    - likelihood.py
    - inheritance.py    # inheritance calculations for parental models
    - bayesian
        - denovo_assembler.py
        - step
            - mutation.py
            - structural.py
    - bruteforce 
        - denovo_bruteforce.py
        - genotype_bruteforce.py
- io

Handle cases of duplicate or same-location variants

At the moment these result in a confusing error (ValueError: Reference allele does not match sequence at position ...) because the first instance alters the character that the second instance checks.
These cases should instead result in an unambiguous error.

Cache conditional probs for current state in de novo assembly

In de novo assemblies with good read depth the genotype state often changes very little or not at all during the MCMC.

Rather than recalculating the conditional probabilities between steps that are identical in state, the conditional probabilities of the current state could be cached.

This would be simple for mutation steps but more complex for structural steps due to the random intervals.

Mutation step
Genotype

genotype = [
    [0,1,0,0],
    [0,0,1,1]
]

Conditional probabilities are stored for every possible allele

cache = [
    [[0.9,0.1],[0.1,0.9],[0.9,0.1],[0.9,0.1]],
    [[0.9,0.1],[0.9,0.1],[0.1,0.9],[0.1,0.9]],
]

Structural step
Store a dictionary of interval bounds to conditional probs

{
    (0, 3): [0.2,0.1,0.7],
    (2, 4): [0.1,0.9],
}

Cache management

This is the difficult part because you need to clear the cache each time the genotype state changes, even if it changes in a sub-step.

Caching may incur some cost on variable chains, i.e. those with low read depth.
If this is an issue then caching could be disabled for samples with low read depth.

N most recent states
This idea could be extended to the n most recent states, but that would get very complicated, and with high read depth most of the benefit comes from the current state alone.
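
A minimal sketch of the cache structure and the invalidation rule described above (names are illustrative):

class ConditionalCache:
    """Cache of conditional probabilities for the current genotype state."""

    def __init__(self):
        self.mutation = {}    # (haplotype, position) -> allele probabilities
        self.structural = {}  # (start, stop) interval -> probabilities

    def invalidate(self):
        # any accepted step changes the genotype state, even within a
        # sub-step, so all cached conditionals must be discarded
        self.mutation.clear()
        self.structural.clear()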

Generalise allelic encoding

Related to #11

Currently BioVector uses separate fixed width encodings for different levels of allelism.
This isn't ideal because if a single triallelic SNP is used then the entire assembly locus has to be treated as triallelic.
The cause of this is that the probabilistic encoding requires fixed row-vector lengths for the whole locus.

A better solution would be to allow nan padding of probabilistic encoded reads, e.g.:

[0.1, 0.9, nan],
[0.9, 0.1, nan],
[0.1, 0.1, 0.8]

This would require flexibility in the specification of allele number when converting from integer to probabilistic encoding: in as_probabilistic(array, vector_size), the vector_size argument should accept an array of integers rather than a single integer.

This would improve flexibility and remove the need for multiple encoders.
It would also allow for future adaptation to indels.
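
A minimal numpy sketch of this, with a simple per-call error model (the error model and exact padding behaviour are illustrative):

import numpy as np

def as_probabilistic(array, vector_size, error_prob=0.1):
    """Probabilistically encode calls; vector_size gives per-position
    allele counts and narrower positions are nan padded."""
    array = np.asarray(array)
    vector_size = np.asarray(vector_size)
    encoded = np.full((len(array), vector_size.max()), np.nan)
    for i, (call, n) in enumerate(zip(array, vector_size)):
        encoded[i, :n] = error_prob / (n - 1)  # spread error over other alleles
        encoded[i, call] = 1 - error_prob
    return encoded

# two biallelic SNPs followed by a triallelic SNP at one locus
print(as_probabilistic([1, 0, 2], vector_size=[2, 2, 3]))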

Valid Trio Check

This can be done with the TrioChildInheritance class

from haplohelper.inheritence import TrioChildInheritance

def valid_trio(mum_haps, dad_haps, kid_haps):
    """Check whether a child's haplotypes are consistent with its parents."""
    trio = TrioChildInheritance(mum_haps, dad_haps)
    for hap in kid_haps:
        try:
            trio.take(hap)
        except Exception:
            # hap cannot be taken from the remaining parental haplotypes
            return False
    return True

valid_trio(mum_haps, dad_haps, kid_haps)

Consider replacing RASSIGN field with AD field

Currently we define a custom RASSIGN sample field which is the (float value) "Approximate number of reads assigned to each haplotype by MEC score".
This is somewhat similar to the AD sample field in the VCF spec which is defined as the (integer values) "Read depth for each allele".
NOTE: RASSIGN estimates counts for alleles in the called genotype (including replicate alleles) whereas AD is calculated for all known alleles at that locus.

AD could also be estimated by MEC based assignment and the results either rounded or floored to produce integer results.
This would involve storing sample read distribution arrays until all samples have been assembled and called.

Dosage change steps bias posterior probabilities

This appears to be because options for dosage change steps reflect a different prior distribution.
This is likely to be because the default prior is uninformative across haplotypes and hence there is a lower probability of homozygous genotypes compared to heterozygous genotypes.

The dosage change steps alter dosage with a flat prior across all dosages (within the interval) which appears to be the source of the bias.

This does not appear to be an issue with recombination steps as the dosage within the interval is not altered.

prior for inbreeding

Currently the assembly process has a flat prior across all haplotypes.
This is effectively a prior expectation of out-crossed individuals.
This prior has the most significant impact on samples with low read depth (i.e. less evidence to update the posterior with).

It would be worth considering how to specify and use a prior belief of inbreeding.
Ideally this would be an intuitive parameter such as the expected inbreeding coefficient of the sample.

For each proposed step in the MCMC the likelihood function would need to incorporate the current dosage and the dosage of each proposed step.

Identify haplo-tagging SNPs

Depends on #72

When reducing haplotype-blocks to their constituent SNPs (#72), many SNPs will be redundant for the purpose of differentiating among the present haplotypes. It would be useful to identify the minimum set of SNPs required to differentiate among haplotypes and then tag or remove the redundant SNPs from the output VCF.
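
A greedy sketch of selecting such a minimal tagging set, with haplotypes as equal-length allele strings (the greedy strategy is illustrative and not guaranteed to be optimal):

from itertools import combinations

def tagging_snps(haplotypes):
    """Select SNP columns sufficient to distinguish all haplotype pairs."""
    n_pos = len(haplotypes[0])
    pairs = {
        (a, b)
        for a, b in combinations(range(len(haplotypes)), 2)
        if haplotypes[a] != haplotypes[b]
    }
    chosen = []
    while pairs:
        # pick the column separating the most unresolved pairs
        best = max(
            range(n_pos),
            key=lambda i: sum(haplotypes[a][i] != haplotypes[b][i] for a, b in pairs),
        )
        chosen.append(best)
        pairs = {(a, b) for a, b in pairs if haplotypes[a][best] == haplotypes[b][best]}
    return sorted(chosen)

print(tagging_snps(["001", "011", "010"]))  # [1, 2]; column 0 is redundant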

Remove loci file

Currently we need to produce a "loci" file from a combination of VCF, BAM and FASTA files.
This creates an extra step and new file standard.

It would be better to simply pass the vcf, bam and fasta files directly to the assembler which will also improve flexibility.

Parallel tempering

Parallel tempering is commonly used in phylogenetic MCMC analysis to improve mixing between extreme posterior peaks.
This is likely to be an effective way of solving the same problem in haplotype assembly, particularly in the presence of cryptic polyploidy.

In terms of implementation this would require multiple linked MCMC simulations to be run simultaneously in place of a single simulation. Given the coarse grain parallelism used in MCHap this could simply be done in a for-loop.

Pseudo-code:

temperatures = [..., 1]  # inverse temperatures, lowest to highest
genotypes = array[n_temperatures, ploidy, n_pos]  # state of each chain
likelihoods = array[n_temperatures]  # log-likelihood of each chain
genotype_trace = array[n_steps, ploidy, n_pos]  # record state of cold chain only
likelihood_trace = array[n_steps]   # record likelihood of cold chain only

for step in range(n_steps):
    for i, temp in enumerate(temperatures):
        genotype = genotypes[i]
        llk = likelihoods[i]
        llk = mutation_step(genotype, llk, ..., temp)
        llk = structural_step(genotype, llk, ..., temp)
        likelihoods[i] = llk
    chain_swap_step(genotypes, likelihoods, temperatures)  # propose swaps between chains
    genotype_trace[step] = genotypes[-1]  # record the cold chain (inverse temperature 1)
    likelihood_trace[step] = likelihoods[-1]
  • add temperature parameter to mutation step functions
  • add temperature parameter to structural step functions
  • implement chain-swapping step function
  • implement MC3 method for an array of inverse-temperatures ([... < 1, 1])
  • add parameters to CLI (default to no additional temps)

"Regular" parallel tempering requires (inverse-) temperatures to be specified by a user which can be difficult to tune but automatic tuning of temperatures may be possible following the implementation in BEAST2.

Record SNV positions in haplotypes using 1-based indices

The variable position tag VP currently uses 0-based indices from Python.
These should be converted to 1-based to be consistent with the VCF spec: "POS - position: The reference position, with the 1st base having position 1."

Investigate supporting fixed-length MNP variants

Freebayes likes to output MNPs, e.g. REF TGC with ALTs TAC,TGG.
Currently the best approach with these is to 'atomize' them into SNPs before assembly (removing redundant bases).

There are two options to support these without any user pre-processing:

  • automatically atomize them
  • generalize the encoding of read distributions to include MNPs

The second option would technically be the more computationally efficient because it requires fewer variants.
The first option assesses all possible combinations of the atomized SNPs, not just the ALTs of the MNP.

The main complexity of encoding an MNP is how to handle the case where a read does not extend the entire length of the MNP.
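
A sketch of the first option, atomizing a fixed-length MNP into SNVs; note that phase between the resulting SNVs is discarded, which is why all combinations of the atomized SNPs are then assessed:

def atomize_mnp(pos, ref, alts):
    """Split an MNP into SNVs where any allele differs from the reference."""
    assert all(len(alt) == len(ref) for alt in alts)
    snvs = []
    for offset, ref_base in enumerate(ref):
        alt_bases = sorted({alt[offset] for alt in alts} - {ref_base})
        if alt_bases:  # redundant (invariant) bases are dropped
            snvs.append((pos + offset, ref_base, alt_bases))
    return snvs

print(atomize_mnp(100, "TGC", ["TAC", "TGG"]))
# [(101, 'G', ['A']), (102, 'C', ['G'])]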

Add n_haps > ploidy across all chains filter

Related to #68, this would tag assemblies in which the number of unique, well-supported haplotypes found across all replicate chains is greater than the specified ploidy.
This will only add a filter tag to anything already filtered by #68, but the tag will be useful for diagnosing copy-number variation.

Multi-chain models and convergence metric

Need to implement multi-chain MCMC models and metric(s) of convergence between chains.

  • Multi-chain is straightforward

    • default to 2 or 4 chains? more chains is safer
    • run chains in parallel? dask or numba?
    • combine chains for posterior
    • store chains in single array
  • Convergence metric

    • Needs to be a single or multiple yes/no calls for use at scale
    • Call included in VCF output as a Filter

Fix non-variable alleles in de novo assembly

When using a set of known SNPs across all samples there will be many samples that are completely homozygous at a given SNP.

An (optional) optimisation would be to remove these non-variable SNPs from the specific assembly of that individual.

This should be done in a way that the trace is still comparable to other samples, i.e. still include the removed SNP in the trace.

This optimisation would be used in combination with a minimum read depth.

DeNovoBruteAssembler genotypes have haplotypes in reverse order

The preference is to have them sorted by alleles in ascending order.

e.g. current result:
['1011' '1010' '1001' '1001']
preferred result:
['1001' '1001' '1010' '1011']

Could simply be done by applying the standard sort function (after generating them) but it would be more efficient to generate haplotypes in preferred order.
The relevant code needs some work anyway.

Allow for reference haplotypes with uncertainty in models

This is only an issue for models that take reference haplotypes as an argument.
These should allow for uncertain ref haplotypes i.e. with floats.

The .trace_haplotypes() methods for these models should not use bv.mset.sort_onehot().
Ideally the haplotypes should be sorted only by the categorical calls.
If necessary they can use bv.mset.sort_binary() for floating point inputs.

Record and filter on read count

Currently read depth DP is recorded for each sample and used for filtering.
The reported DP is the average of read depths at each variable position within the target region.
This leads to an edge case where DP is null when there are no variable positions within the region, which also results in a homozygous reference call (0/0) with a probability of 1 and a QUAL of 60.

These calls should be filtered at the sample level if there are too few reads for a sample but currently this is not happening because DP is not reported. Calculating mean depth across all positions in the region would be expensive.

The total number of reads within the region is still pulled out of the bam so this can be used as an alternative metric to filter on.
Consider calling this variable NR and using it as a default filter with a minimum threshold of ~5.

Trio descent possibilities

Add a method to the TrioInheritance class to return descent possibilities.

example

mum = [A, A, B, C]
dad = [A, C, D, D]
kid = [A, A, B, D]

Then IBD options are:

[
    [0, 1, 4],
    [0, 1, 4],
    [2],
    [6, 7],
]

But one of the A's must have come from dad (unless there was double reduction).
So produce a sequence of IBD options?
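
A sketch of enumerating the per-allele options above (before applying constraints such as one of the A's having to come from dad):

def ibd_options(mum, dad, kid):
    """For each child allele, list indices of matching parental haplotypes;
    mum occupies indices 0..ploidy-1 and dad the remainder."""
    parents = list(mum) + list(dad)
    return [[i for i, hap in enumerate(parents) if hap == allele] for allele in kid]

mum = ["A", "A", "B", "C"]
dad = ["A", "C", "D", "D"]
kid = ["A", "A", "B", "D"]
print(ibd_options(mum, dad, kid))  # [[0, 1, 4], [0, 1, 4], [2], [6, 7]]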

Pedigree informed sampler

The current assembly tool is not aware of population structure and uses a prior that all individuals are unrelated and out-bred.
These priors will be less accurate with more advanced material, where they can result in the false identification of new haplotypes based on reads containing errors.
Incorporating population structure into haplotyping will result in better priors improving haplotype recognition and dosage calling.

It will be simpler to (initially) incorporate pedigree structure into re-calling from a known set of haplotypes.
In each substep a haplotype is chosen from an individual genotype and the likelihood is calculated for each possible genotype found by replacing that haplotype with one of the known haplotypes. If all haplotypes are treated as being equally likely (flat prior) then this is equivalent to the current method.
The prior probability of a genotype can be calculated given the current genotypes of immediate relatives. In each case this can be calculated as the probability of observed genotypes within a trio given an expected pedigree error rate, which itself can be broken down into non-paternity, non-maternity and sampling-error. The prior for a given genotype is then the joint probability of all trios that include the genotype in question.
A population level prior on haplotype frequency within the population could also be applied. In the case of a missing parent or pedigree error then haplotypes have priors proportional to the population frequency.

Implementation

  • Encoding read probabilities for MNPs (treat entire haplotype as a single variant)
  • Function to enumerate the combinations of gametes that can produce a given genotype (a sketch follows this list)
  • Function to calculate probability of a gamete given a parental genotype, pedigree error rate, double reduction rate, and number/frequency of haplotypes expected in the population.
  • Function to calculate probability of observing given genotypes in a trio given a pedigree error rate, double reduction rate, and number/frequency of haplotypes expected in the population.
  • Format/Parser for mixed ploidy pedigree.
  • Application with CLI
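
As a sketch of the gamete-enumeration function above, assuming even segregation into two equal gametes and ignoring double reduction:

from itertools import combinations

def gamete_combinations(genotype):
    """Enumerate distinct pairs of gametes that can produce a genotype."""
    ploidy = len(genotype)
    pairs = set()
    for idx in combinations(range(ploidy), ploidy // 2):
        gamete = tuple(sorted(genotype[i] for i in idx))
        other = tuple(sorted(genotype[i] for i in range(ploidy) if i not in idx))
        pairs.add(tuple(sorted([gamete, other])))
    return sorted(pairs)

# a tetraploid genotype with allele dosages 2, 1, 1
print(gamete_combinations(("A", "A", "B", "C")))
# [(('A', 'A'), ('B', 'C')), (('A', 'B'), ('A', 'C'))]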
