
onlinephylo / sts


Sequential Tree Sampler for online phylogenetics

License: GNU General Public License v3.0

Python 3.81% Shell 0.05% C++ 88.37% C 0.06% Processing 0.19% CMake 1.81% Makefile 1.25% Mathematica 4.46%
phylogenetics sequential-monte-carlo online-algorithms

sts's Introduction

Sequential Tree Sampler (STS)

The sequential tree sampler implements a prototype of online phylogenetics, updating a posterior distribution generated by MrBayes with new sequences. The algorithm is described and its performance evaluated in a manuscript, also available as a preprint. The scripts used to generate the figures can be found here.

Dependencies

  • smctc - included as git submodule (git submodule update --init)
  • lcfit - included as git submodule (git submodule update --init)
  • Bio++ version 2.2.0 core, seq, and phyl modules. Note that Debian and Ubuntu up to 16.04 include v2.1.0, which is too old; on these systems Bio++ should be installed from source using the bpp-setup.sh script. Alternatively, the source code of version 2.3.0 for each module can be downloaded from GitHub in the releases section
  • cmake
  • gsl version 1 or 2
  • nlopt
  • boost
  • beagle version 2.1 (Optional but recommended)
  • google test - libgtest on Debian/Ubuntu (Optional)

Compiling

  1. Install dependencies
  2. Run make

Binaries will be built in _build/release

Adding taxa to an existing posterior

The tool sts-online adds taxa to an existing posterior tree sample. sts-online operates on a FASTA file and a tree file in NEXUS format. The FASTA file must contain an alignment whose taxa are a superset of the taxa in the tree file.

Supported DNA substitution models: Jukes-Cantor (JC69), generalised time-reversible (GTR), Hasegawa-Kishino-Yano (HKY), Kimura (K80). Supported protein substitution models: JTT, WAG, LG.

Example invocation with JC69

sts-online -b 250 -p 2 --proposal-method lcfit 10taxon-01.fasta 10tax_trim_t1.t 10tax_trim_t1.sts.json

In this example, we use an alignment containing 10 sequences and a posterior sample of trees generated by MrBayes from an alignment that does not contain the sequence labeled t1. sts-online ignores the first 250 trees from 10tax_trim_t1.t and uses a particle factor of 2. The 10tax_trim_t1.sts.json file will contain the updated trees.

Example invocation with GTR

sts-online -b 250 -p 2 --proposal-method lcfit 10taxon-01.fasta 10tax_trim_t1.t 10tax_trim_t1.sts.json -P 10tax_trim_t1.p -M GTR -o 10tax

In this example, we again use an alignment containing 10 sequences and a posterior sample of trees generated by MrBayes under the GTR model from an alignment that does not contain the sequence labeled t1. sts-online ignores the first 250 trees from 10tax_trim_t1.t and the corresponding parameters from 10tax_trim_t1.p, and uses a particle factor of 2. The 10tax_trim_t1.sts.json file will contain the updated trees and parameters in JSON format. Using the option -o, two additional files, 10tax.log (parameters, CSV) and 10tax.trees (trees, NEXUS), will also be created.

sts's People

Contributors

4ment, cmccoy, dfornika, habnabit, koadman, matsen


sts's Issues

post-processing tool

Now that we have serialization down, it will be great to have a post-processing tool.

Two things that will be fun to see:

  • average per-generation survival (equivalent to acceptance prob for MH)
  • expected MRCA depth (somewhat equivalent to the MCMC autocorrelation function)

For some reason I also think that the average lifetime of a particle would be neat. Note that it's susceptible to edge effects when the generations run out.

We also need a name. @koadman, any clever suggestions?

particles with incomplete trees

When running on data/empty.fasta, or on a reduction of that file to only its first 3 sequences, sts reports final-generation trees that contain only 1 or 2 sequences. This is on current master, 1573c48.

Introduce an edge class

Which would just be a phylo_node, dist pair.

This would make branch length proposals/moves a little cleaner.
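
For concreteness, a minimal sketch of what such a class could look like, assuming the existing phylo_node type and the shared_ptr ownership used elsewhere (the constructor is illustrative):

#include <memory>

class phylo_node;   // the existing sts node type (forward declaration)

// Sketch only: an edge is just a (node, branch length) pair.
struct edge
{
    std::shared_ptr<phylo_node> node;
    double dist;

    edge(std::shared_ptr<phylo_node> n, double d) : node(n), dist(d) {}
};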

simulations to help understand how good the "natural extension" is

The "natural extension", again, is taking the product of likelihoods in a forest.

It still seems open to me how to properly deal with these things. For example, do we do OK likelihood-wise by just joining leaves that both deviate a lot from the stationary distribution (LBA?!)?

It's not clear to me that we have the best tools for evaluating this, but appropriate tools wouldn't be hard to make. It seems to me we'd need:

  1. Some way of spitting out likelihoods of partial states
  2. A way of parsing a "newick forest" and calculating likelihood on it.

The second option doesn't seem like it would be too hard, and might be handy. Thoughts?
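
As a sketch of the second option's core computation: once a "newick forest" is parsed, the natural extension's value is just the sum of per-tree log likelihoods. The tree type and the per-tree calculator are left as parameters here, since the parsing is the real work:

#include <vector>

// Sketch of the "natural extension": the likelihood of a forest is the product
// of its trees' likelihoods, i.e. the sum of their log likelihoods. The tree
// type and per-tree calculator (e.g. Bio++ or BEAGLE backed) are passed in.
template <typename Tree, typename LogLikelihoodFn>
double forest_log_likelihood(const std::vector<Tree>& forest, LogLikelihoodFn tree_ll)
{
    double ll = 0.0;
    for (const Tree& t : forest)
        ll += tree_ll(t);   // log likelihood of one tree in the forest
    return ll;
}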

Support sequence quality annotated alignments?

Current sequencing instruments generate per-base quality scores intended to estimate the probability that each nucleotide is correctly represented in the sequence. This information could be carried forward into the phylogenetic likelihood with relative ease by setting the tip partials appropriately.

Similar information could be associated with protein alignments when the sequences derive from translations of quality-annotated nucleotide reads.
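
One possible scheme, sketched below, maps a Phred-scaled quality Q to an error probability p = 10^(-Q/10) and spreads the error mass over the other three bases. This is an assumption about how the partials might be set, not current sts behaviour:

#include <array>
#include <cmath>

// Sketch: convert a called base and its Phred quality into tip partials over
// {A, C, G, T}. Phred Q gives error probability p = 10^(-Q/10); here the error
// mass is spread evenly over the three other bases (one possible convention).
std::array<double, 4> tip_partials_from_quality(int called_base /* 0=A,1=C,2=G,3=T */,
                                                int phred_quality)
{
    const double p_error = std::pow(10.0, -phred_quality / 10.0);
    std::array<double, 4> partials;
    partials.fill(p_error / 3.0);
    partials[called_base] = 1.0 - p_error;
    return partials;
}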

a small but automated test suite

It would be hugely useful to have some automated way to evaluate the effect of code changes on the quality of inference on real datasets. Tree topology, branch-length-aware tree distance, compute time, memory, scaling in number of sequences, etc.

I had something simple and quick in mind: something that could be run in a couple of minutes before committing code to validate the proposed edits, spot-checking for severe accuracy/speed regressions, etc.

worry about likelihoods underflowing in SMCTC

Our likelihoods are remarkably low, making code such as

for(int i = 0; i < N; i++)
    dWeightSum += exp(pParticles[i].GetLogWeight());

(from sampling.hh) potentially hazardous. If folks agree I'll raise an
issue on the smctc repo and see what we can do.
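
A standard guard, wherever the fix ends up living, is the log-sum-exp trick: subtract the maximum log weight before exponentiating. A sketch (not the smctc code itself):

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of a numerically safer weight sum: compute
// log(sum_i exp(w_i)) = max_w + log(sum_i exp(w_i - max_w)),
// so the largest term is exp(0) = 1 and nothing underflows to zero.
// Assumes log_weights is non-empty.
double log_sum_exp(const std::vector<double>& log_weights)
{
    const double max_w = *std::max_element(log_weights.begin(), log_weights.end());
    double sum = 0.0;
    for (double w : log_weights)
        sum += std::exp(w - max_w);
    return max_w + std::log(sum);
}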

Illustrate SMC tree inference with animation

When sampling a rooted ultrametric tree, it might be possible to do a DensiTree-style animation of the inference process, because calculating node positions for sensible tree-overlay visualizations would be easy. This could be implemented in Processing and sent over to the DTRA folks.

propose nodes to merge non-uniformly

At early generations there is a very large number, O(n^2), of possible node pairs to merge and only a very small number, O(1), of good merge pairs. This causes most of the particles produced in early generations to have low likelihood.

Rather than proposing joins uniformly we could bias the proposal toward joins that produce a high LL under the natural forest extension. In general these will be nodes with short branches in the true tree. One way to obtain a proposal distribution is to calculate all internode distances from a FastTree guide tree and use the inverse of distances to generate a sampling distribution for leaf pairs. At each merge, a pair could be sampled. If they are part of different trees in the forest, their trees could be joined at the root. If they belong to the same tree, zero or more additional proposals could be made until a valid merge is found. If a valid merge is not found, this proposal could fall back to the uniform merge.
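
A sketch of the distance-weighted sampling step, assuming a precomputed matrix of guide-tree internode distances; the same-tree check and the fall-back to the uniform proposal are left to the caller:

#include <random>
#include <utility>
#include <vector>

// Sketch: sample a leaf pair (i, j) with probability proportional to the
// inverse of its guide-tree distance d(i, j). Pairs with short distances
// (likely cherries in the true tree) are proposed more often.
std::pair<int, int> sample_merge_pair(const std::vector<std::vector<double>>& dist,
                                      std::mt19937& rng)
{
    std::vector<double> weights;
    std::vector<std::pair<int, int>> pairs;
    const int n = static_cast<int>(dist.size());
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            pairs.emplace_back(i, j);
            weights.push_back(1.0 / dist[i][j]);
        }
    }
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return pairs[pick(rng)];   // caller checks validity / falls back to uniform
}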

Use lcfit for attachment branch length proposals?

Currently we optimize the attachment location using Brent's method, propose the pendant branch length from an exponential around the ML value, and propose the attachment location from a truncated normal around the optimum.
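
For reference, a sketch of the exponential pendant-branch proposal just described (the ML value comes from the Brent optimization; the log density would feed into the particle weight):

#include <cmath>
#include <random>
#include <utility>

// Sketch: draw a pendant branch length from an exponential distribution whose
// mean is the ML estimate, and return both the draw and its log density (the
// density is needed when weighting the particle).
std::pair<double, double> propose_pendant_length(double ml_length, std::mt19937& rng)
{
    const double rate = 1.0 / ml_length;
    std::exponential_distribution<double> expo(rate);
    const double proposal = expo(rng);
    const double log_density = std::log(rate) - rate * proposal;
    return {proposal, log_density};
}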

Evaluate semantics of state modifications

When the model state is modified by an MCMC move or SMC extension, the affected components can either be stored in a newly allocated object or remain in-place but have values change. The former approach follows "copy-on-write" semantics, the latter does not.

In copy-on-write, summaries and metadata on an object can be keyed on the object's memory address and will remain valid for the lifetime of the object. This is how Online_calculator's log likelihood cache currently works (* see below).

One major disadvantage of the copy-on-write approach becomes apparent when considering continuous rate parameters in phylogenetic models. Large parts of a phylogenetic model have a conditional dependency on single continuous parameters such as the mutation rate. When the mutation rate changes, the LL at a forest node becomes invalid. Because the LLs are keyed on object memory address, preserving the functionality of this cache approach would require us to copy not just on write, but on a write to any model parameter for which a conditional dependency exists. For large trees it might become expensive to do such copying (though possibly still cheap relative to calculating log likelihoods), not to mention tedious: why should one have to write and invoke code for a whole forest traversal and copy just to change one value? I will call this approach copy-on-dependency-write.

Before we start stacking in model parameters, it seems prudent to evaluate whether we want to continue with the copy-on-dependency-write approach.

There are some alternatives. If we want to allow unfettered write access to model parameters, we might actively notify objects that have registered an interest in that model parameter. This is the java-esque EventListener approach. The likelihood cache might register as a listener for nodes, which might also listen to the rate. When the rate changes the nodes would hear about it and have to propagate a message to the likelihood cache which could then delete any cached likelihood.

Another possibility is some kind of generic versioning system (e.g. a monotonically increasing integer) for objects & their dependencies. In this approach, each model parameter would contain a "revision" object which consists of an integer version number and also pointers to version objects for any model parameters on which a conditional dependency exists, along with the most recent version of those dependencies observed by the present object. Every time an object changes, the version number is incremented. Cached values such as the LL could still be keyed on object address, but the value stored would be both log likelihood and the associated object version. When deciding whether the cached likelihood is valid, the calculator can request the current version number of an object and compare it to the version of the cached likelihood. Upon a request for its object version, the model parameter will in turn request the version of any dependency variables and compare that version to what it has recorded as the most recently observed version of that dependency variable. If the dependency is newer, the version is incremented. I will call this approach lazy parameter versioning.

As compared to the EventListener approach, lazy parameter versioning has the advantage that many parts of the model can be updated without forcing a traversal of the conditional dependency graph for each and every change.

There are surely other ways we could handle this. Does anybody have other suggestions? In the absence of other opinions I would be inclined to go for lazy parameter versioning. In either the EventListener or lazy parameter versioning approaches, we could implement this as a base class for all model components that contains a versioning system and a means to register the dependencies among objects.
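
To make the lazy parameter versioning idea concrete, here is a minimal sketch of such a revision object (the names and the raw-pointer dependency links are illustrative):

#include <cstdint>
#include <vector>

// Sketch of lazy parameter versioning: each model component owns a revision.
// A revision is current only if none of its dependencies has advanced since it
// was last observed; checking pulls dependency versions lazily rather than
// pushing notifications on every write.
class revision
{
public:
    // Bump our own version when the owning parameter's value changes.
    void touch() { ++version_; }

    // Declare a conditional dependency on another parameter's revision.
    void add_dependency(revision* dep)
    {
        deps_.push_back({dep, dep->current_version()});
    }

    // If any dependency has advanced, absorb that change by bumping our own
    // version and recording the newly observed dependency version.
    std::uint64_t current_version()
    {
        for (auto& d : deps_) {
            const std::uint64_t v = d.dep->current_version();
            if (v != d.observed) {
                d.observed = v;
                ++version_;
            }
        }
        return version_;
    }

private:
    struct dep_record { revision* dep; std::uint64_t observed; };
    std::uint64_t version_ = 0;
    std::vector<dep_record> deps_;
};

A cached log likelihood would then be stored as a (value, version) pair and treated as valid only while the stored version matches current_version() of the node's revision.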

(*) In the current code we follow copy-on-write for branch length changes to nodes, mostly, except for Eb_bl_proposer which can use Online_calculator::invalidate() to clear a cached likelihood entry.

A general system to store particle metadata

Metadata about particles is useful and currently includes things such as a cached log likelihood, but could also include a cached ML distance estimate or other information. Currently metadata is stored outside the particle, e.g. in OnlineCalculator, and requires the class containing the metadata to be notified of particle deletions so that stale cache data can be cleared. This is currently done by storing, inside the particle, a reference to the class that needs to be notified, and this approach does not scale to an arbitrary number of metadata values and container classes.

Approach 1:

Store metadata inside the particle itself. An interface to fetch particular bits of metadata needs to be devised. This could be as simple as an unordered_map going from some key type (string? a compiler-mangled class name?) to a base class shared_ptr. The class creating metadata would maintain a set of weak_ptr's to the metadata objects and before accessing a particular piece of metadata would check for metadata validity using weak_ptr::expired(). This allows the metadata to be deleted when particles are deleted without needing to actively notify the class creating metadata.
One disadvantage of this approach is that creating per-particle caches will incur considerable memory overhead.

Approach 2:

Create a global metadata store via a static instance. This would be an unordered_multimap from particle pointer to metadata. e.g. unordered_multimap< particle*, pair< string, metadata_base* > > This solves the stale cache problem by allowing a single global metadata store to be notified at time of particle/node/etc deletion. This approach may have more problems with multithreading and concurrency than the first approach.

Approach 3:

Maintain metadata inside the generating class as is currently done, associating the metadata with a weak_ptr to the object on which the metadata is stored. Before accessing the metadata, check whether the weak_ptr has expired. This approach is currently slightly frustrating because a hash function is not defined for std::weak_ptr, apparently for no reason other than lack of time on the part of the C++ standards committee (http://stackoverflow.com/questions/4750504/why-was-stdhash-not-defined-for-stdweak-ptr-in-c0x), so the hash key could instead be the memory address.

This approach has the advantage of providing a lot of flexibility in designing metadata storage, but the disadvantage that caches might grow quite large because they are not actively trimmed back to surviving particles. Size could be managed with one of the classic MFU-approximation cache strategies, and a generic implementation could be made for arbitrary data/metadata.
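
A sketch of Approach 3, keyed on the raw address with the stored weak_ptr used to detect expiration (the types are illustrative, and the MFU-style trimming is omitted):

#include <memory>
#include <unordered_map>

// Sketch of Approach 3: metadata lives in the generating class, keyed on the
// node's raw address; the stored weak_ptr lets us detect stale entries without
// being notified of particle/node deletion.
template <typename Node, typename Value>
class metadata_cache
{
public:
    void put(const std::shared_ptr<Node>& n, const Value& v)
    {
        cache_[n.get()] = {std::weak_ptr<Node>(n), v};
    }

    // Returns nullptr if no entry exists or the node has been deleted.
    const Value* get(const Node* key)
    {
        auto it = cache_.find(key);
        if (it == cache_.end())
            return nullptr;
        if (it->second.owner.expired()) {   // stale: the node is gone
            cache_.erase(it);
            return nullptr;
        }
        return &it->second.value;
    }

private:
    struct entry { std::weak_ptr<Node> owner; Value value; };
    std::unordered_map<const Node*, entry> cache_;
};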


The immediately motivating use case is that when node merges are proposed non-uniformly, e.g. using a distance-guided approach, the same pair gets proposed many times at each generation. Calculating the ML distance between the pair is a rather expensive operation (e.g. 10 iterations) but results in much higher particle LL. A cache would allow the ML distance to be calculated once for the node pair and saved, rather than recomputed dozens or hundreds of times.

To what extent should we optimize branch lengths for previously-merged particles?

    |
x   |
    |
   / \
a /   \ b
 /     \

Starting from most extreme conservation:

  1. set them once and leave them
  2. set the total of a and b once, set the attachment point later (this could be implemented by only storing the total as part of the join)
  3. upon attachment, set a and b and x with either a fixed probability or as an MH step
  4. upon attachment, always set a and b and x (we could weight the joined particles using the ML join branch lengths or something)

where "set" means either optimize, or draw with an optimized mean, or just draw.

We can duplicate particles on change, so we don't have to worry about having them duplicated ahead of time.

eliminate beagle IDs in phylo_node

The beagle IDs are not part of the model state and are used only to help speed up likelihood calculation. It would be cleaner for model state classes to only store model state info. An alternative design could have shared_ptr< phylo_node >'s register and unregister with the likelihood calculator in order to get buffer space, and the likelihood calc could keep a mapping:

unordered_map< shared_ptr< phylo_node >, int > node_buffer_map;

to determine whether a node currently has buffer space.

the new functions in the likelihood calculator would be:

void register_node( shared_ptr< phylo_node > n ) {
    if(node_buffer_map.find(n)!=node_buffer_map.end()) return;  // or this could be an assert
    node_buffer_map[n] = get_id();
}
void unregister_node( shared_ptr< phylo_node > n ) {
    if(node_buffer_map.find(n)==node_buffer_map.end()) return;  // this could be an assert.
    free_id(node_buffer_map[n]);
}

Track induced split frequencies

This is just to say that we are implicitly generating interesting data with respect to the question of whether an n-taxon tree posterior should sit inside an (n+k)-taxon posterior.

That is, we could be tracking induced split frequencies for the first n taxa in the generations after n. If we are indeed getting a good posterior, that would be interesting to watch!

How are we going to optimize model parameters?

Connor's option-- do MH moves that impact all particles at once.

Erick's option-- for each particle, with some probability, propose an MH move for the rate parameters. This duplicates the particle when the move is accepted.

License

I've packaged sts for distribution via conda.

https://anaconda.org/dfornika/sts

I'd like to push it into the bioconda channel, but they require that a license is specified. Is sts distributed under the GPL?

decide on resampling parameters

The SMCTC documentation says:

To control resampling behaviour, use the
SetResampleParams(ResampleType, double) member function. The first
argument should be set to one of the values allowed by ResampleType
enumeration indicating the resampling scheme to use (see Table 3) and
the second controls when resampling is performed. If the second argument
is negative, then resampling is never performed; if it lies in [0, 1]
then resampling is performed when the ESS falls below that proportion of
the number of particles and when it is greater than 1, resampling is
carried out when the ESS falls below that value. Note that if the second
parameter is larger than the total number of particles, then resampling
will always be performed.

Right now we have:

Sampler.SetResampleParams(SMC_RESAMPLE_STRATIFIED, 0.99);

meaning that we only resample when the ESS drops below 0.99 N. I don't
see any reason why we shouldn't resample every time. It's the way
described in BSJ, and it's quite cheap compared to likelihood
computation.
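
Concretely, per the documentation quoted above, passing a threshold larger than the particle count would force resampling at every generation; something like the following, where n_particles stands for the particle count:

// Per the quoted documentation, a threshold above the total number of
// particles means resampling is always performed.
Sampler.SetResampleParams(SMC_RESAMPLE_STRATIFIED, n_particles + 1);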

I'm going to read up on stratified sampling now.

typedef for std::shared_ptr< sts::particle::phylo_node >

I've been typing that an awful lot and it seems like it might be handy to have just
typedef std::shared_ptr< sts::particle::phylo_node > sts::particle::node;

and to actually use it everywhere. Same for sts::particle::phylo_particle
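
A sketch of how the aliases could be declared inside the namespace (the alias name for phylo_particle is only a suggestion):

#include <memory>

namespace sts { namespace particle {

class phylo_node;       // already declared in the real headers
class phylo_particle;   // already declared in the real headers

// The alias suggested above, plus an illustratively named one for particles.
typedef std::shared_ptr<phylo_node> node;
typedef std::shared_ptr<phylo_particle> particle;

} } // namespace sts::particle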

add branch length priors (via command line flag)

As described in the BSJ paper, an arbitrary prior can be specified by incorporating it into the $\gamma_*$ function as a multiplier on the phylogenetic likelihood. We should have a mechanism for doing this.
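
As a sketch of the mechanism: on the log scale the prior is an additive term, so for, say, an exponential branch length prior (one choice a command line flag could select) the target would look like:

#include <cmath>
#include <vector>

// Sketch: a prior enters gamma_* as a multiplier on the likelihood, i.e. as an
// additive term on the log scale. An exponential(rate) branch length prior is
// used here purely as an example.
double log_exp_branch_prior(double branch_length, double rate)
{
    return std::log(rate) - rate * branch_length;
}

// log gamma_* = log likelihood + sum over branches of the log prior
double log_target(double log_likelihood,
                  const std::vector<double>& branch_lengths,
                  double prior_rate)
{
    double lp = log_likelihood;
    for (double bl : branch_lengths)
        lp += log_exp_branch_prior(bl, prior_rate);
    return lp;
}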

More efficient beagle memory usage

Beagle currently allocates a large block of memory and persistently stores one buffer per phylogenetic tree node. This could be reduced to persistent storage of one buffer per tree root in the current particle set, along with a collection of O(N) scratch buffers for doing full tree peeling when necessary, e.g. when rate parameters change.

serialization/deserialization of state

Will be necessary if we write a tracer-type analyzer, but also will be needed for online inference or storage of runs.

It seems like doing this within a more general framework would be nice, where particles know how to serialize what is special to themselves, but a more general function coordinates the overall outputting of things-with-ancestors.

Online: cache likelihoods

Currently we're starting the likelihood calculation fresh for each tree. This probably doesn't hurt much when starting from the MrBayes posterior, but will be inefficient after resampling.

MCMC move diagnostics/logging

@cmccoy just found that adding extra MH steps at each generation can considerably improve the solutions. This is sweet, and it would be nice to have some means of figuring out how many steps to add, e.g. where do we reach the point of diminishing returns? One way to do this would be to log the logL after each MH step to our log file and eyeball it for a handful of datasets. Another would be some kind of automated convergence diagnostic that can detect when MH steps are no longer improving the logL by much.

inference ll != test ll

Ghosts in the machine? The bpp ll is now confirmed using sts' ll
calculator in the likelihood unit tests. 1043 != 11257.

stoke bppsim/mixture_sim ‹59-infer-ll*› » cat run_sts.sh
GSL_RNG_SEED=1 ../../../_build/sts combined.fasta -o sts.out
sort -n -r sts.out | uniq > sts.sort.uniq.out
head -n 1 sts.sort.uniq.out | cut -f 2 > sts_on_combo.tre
stoke bppsim/mixture_sim ‹59-infer-ll*› » head -n 1 sts.sort.uniq.out
-1043.74        (((s5:1.27449,s2:1.61169):0.00636868,s1:0.00629971):1.2238,(s3:0.0769676,s4:1.09666):0.185176);
stoke bppsim/mixture_sim ‹59-infer-ll*› » bppml param=check_sts_tree_on_combo.bpp
******************************************************************
*       Bio++ Maximum Likelihood Computation, version 1.5.0      *
*                                                                *
* Authors: J. Dutheil                       Last Modif. 07/02/11 *
*          B. Boussau                                            *
*          L. Gueguen                                            *
******************************************************************

Parsing options:
Parsing file check_sts_tree_on_combo.bpp for options.
Alphabet type .........................: DNA
Sequence format .......................: Fasta
Sequence file .........................: combined.fasta
Sites to use...........................: all
WARNING!!! Parameter input.sequence.max_gap_allowed not specified. Default used instead: 100%
Number of sequences....................: 5
Number of sites........................: 2000
Input tree.............................: user
Tree file..............................: sts_on_combo.tre
Number of leaves.......................: 5
Wrote tree to file ....................: sts_tree_on_combo.ML.dnd
Branch lengths.........................: Input
Heterogeneous model....................: no
Substitution model.....................: JC69
Frequencies initialization for model...: None
Rate distribution......................: Uniform
Number of classes......................: 1
- Category 0 (Pr = 1) rate.............: 1
WARNING!!! Tree has been unrooted.
Initializing data structure............: Done.
Number of distinct sites...............: 157
Initial log likelihood.................: -11257.5402801006
Wrote tree to file ....................: sts_tree_on_combo.ML.dnd
Log likelihood.........................: -11257.5402801006
Output estimates to file...............: sts_tree_on_combo.params.txt

Posterior rate distribution for dataset:

Pr(1.000000) = 1.000000


Alignment information logfile..........: sts_tree_on_combo.infos
BppML's done. Bye.
Total execution time: 0.000000d, 0.000000h, 0.000000m, 0.000000s.
stoke bppsim/mixture_sim ‹59-infer-ll*› »

master branch inevitable core dump

Am I doing something wrong?

ermine re/sts ‹master› » src/sts data/thirty.ma
Reduced from 1579 to 1282 sites
[1]    4503 segmentation fault (core dumped)  src/sts data/thirty.ma
ermine re/sts ‹master› » src/sts --no-compress  data/thirty.ma
[1]    4514 segmentation fault (core dumped)  src/sts --no-compress data/thirty.ma
ermine re/sts ‹master› » src/sts --no-compress -p 1000 data/thirty.ma
[1]    4522 segmentation fault (core dumped)  src/sts --no-compress -p 1000 data/thirty.ma
ermine re/sts ‹master› » src/sts --no-compress -p 1000 -m JCnuc data/thirty.ma
[1]    4530 segmentation fault (core dumped)  src/sts --no-compress -p 1000 -m JCnuc data/thirty.ma

multi-run convergence diagnostics?

Programs like MrBayes assess convergence by comparing the marginal distributions of summary statistics across multiple runs. For example, the average standard deviation of split frequencies (ASDSF) is a topology diagnostic that can assess whether multiple runs agree about the posterior probability of particular splits. It seems like we could calculate similar diagnostics across SMC runs.
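
A sketch of how ASDSF could be computed across SMC runs, given one split-frequency table per run (splits are encoded as strings here for simplicity, and the usual minimum-frequency filter is omitted):

#include <cmath>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: average standard deviation of split frequencies (ASDSF) across runs.
// Each run supplies a map from split (encoded as a string) to its sampled
// frequency in that run; splits absent from a run count as frequency 0.
double asdsf(const std::vector<std::unordered_map<std::string, double>>& runs)
{
    std::set<std::string> splits;
    for (const auto& run : runs)
        for (const auto& kv : run)
            splits.insert(kv.first);
    if (splits.empty() || runs.empty())
        return 0.0;

    const double n_runs = static_cast<double>(runs.size());
    double sum_sd = 0.0;
    for (const auto& split : splits) {
        std::vector<double> freqs;
        double mean = 0.0;
        for (const auto& run : runs) {
            auto it = run.find(split);
            freqs.push_back(it == run.end() ? 0.0 : it->second);
            mean += freqs.back();
        }
        mean /= n_runs;
        double var = 0.0;
        for (double f : freqs)
            var += (f - mean) * (f - mean);
        sum_sd += std::sqrt(var / n_runs);   // SD of this split's frequency across runs
    }
    return sum_sd / static_cast<double>(splits.size());
}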
