conifer

Installing from source

Requires: gradle, git, eclipse

Clone the repository
Type gradle eclipse from the root of the repository
From eclipse:
- Import in File menu
- Import existing projects into workspace
- Select the root of the newly created repo
- Deselect Copy projects into workspace to avoid having duplicates

Bayesian inference using Blang

You will first learn to use Blang, a new package where you can write Bayesian models in a way reminiscent to JAGS, BUGS, etc. The main advantage of Blang is that you can easily add other types of random variables. This will be important in this exercise as we will need tree-valued random variables.

Small example: predicting the number of future members of the human species

Before creating new types of random variables, let us start with an example using familiar, real-valued random variable to familiarize ourself with the main Blang concepts. To do so, we will follow the lines of the most basic version of the Doomsday argument.

Overview

Suppose you want to predict the number N of future members of the human species.

From archaeological records, we know there are about 0.06 trillion humans have been born so far. Call this our observation Y. Let us assume that Y is uniformly distributed between 0 and N.

So if we put a prior on N, we can compute a posterior on the quantity of interest, N - Y. Let us be optimistic, and let us put an exponential prior on N with a mean of 10.0 trillion humans.

Basic concepts in Blang

Let us look now at how we can write this model in Blang.

_{From:conifer.BlangSmallExample}

Blang provides a way of building a factor graph declaratively.

Recall that factor graphs are bipartite undirected graphs with two types of nodes: factors and variables. Each factor computes one factor in a target unnormalized density broken into a product of factors. Each variable holds one coordinate of a state.

For now, think of a factor as a prior or conditional probability density (although Blang can be used in other contexts such as for Markov random fields, we will focus on factor graphs induced from directed models in this tutorial).

Factors are declared using the @DefineFactor annotation (an annotation is a mechanism designed to extend the functionality of the Java language). Blang will look for fields with a DefineFactor annotations and build model composed of these factors.

We will look at the inner working of factors later on, for now we focus on how to use them. For example, in our example, there are two factors, one for the prior (an exponential density), and one for the likelihood (a uniform density).

Variables are fields within factors classes that have a @FactorArgument and @FactorComponent annotations (the differences between the two will be explained shortly). These fields can be of any type (more on that later), and they typically correspond to either the parameters or the realization of a probability density.

In order to use existing factors, the only thing to do with variables is to connect them across factors. In our example, this is achieved via Exponential.on(likelihood.parameters.max). Here Exponential.on(x) is just a a static function short-hand for creating a unit mean exponential density with realization x, new Exponential(x, ...). The consequence of doing this is that the realization field of the exponential prior is == (object equality, i.e. two references to the same location in the heap) to the realization field of the maximum parameter of the uniform density factor. Blang automatically discovers these relationships and use them to build edges in the factor graph.

Building an MCMC sampler

Obtaining an MCMC sampler from a model as specified above is very simple:

Make the class in which you declared the factor extend MCMCRunner.
In the main function, create an instance of that class, and call .run().

If you run the sampler, you will see an histogram created in posteriorOnMax.pdf, showing the approximate posterior over the number of future members of the human species under our simple model.

Differentiating parameters from realisations

As mentioned earlier, there are two types of factor arguments in our setup: parameters and realisations. As we will see shortly, it is useful to differentiate the two. This is done via the syntax @FactorArgument(makeStochastic = true), which indicates that the argument is a realisation (therefore making the variable stochastic/random).

See bayonet.distributions.Exponential.java and bayonet.distributions.Uniform.java for examples of this construction.

Specifying what should be sampled

All the latent (non-observed) random variables are sampled by default. Blang uses the following strategy to determine the random variables: the variable should appear as annotated by
@FactorArgument(makeStochastic = true) in one of the factors (see above).

A random variable can be either latent or observed. To mark a variable as observed, simply use @DefineFactor(onObservations = true) on the factor in which the observed variable is declared as stochastic.

Processing the samples

The simplest way of process the samples is to override the function process() as follows:

protected void process(blang.processing.ProcessorContext)
{
    samples.add(likelihood.parameters.max.getValue() - observation.getValue());
}

_{From:conifer.BlangSmallExample}

Extending Blang

Let us see now Blang can be extended beyond its (currently very minimal) set of built-in factor and variables. At the same time, some more advanced programming concepts will be introduced (generics, test units, annotations).

Creating a new factor

In this part, you will create a new type of factor to model Gamma densities.

Here are the main steps:

Create a class Gamma. Make this class implement Factor.
Create three fields of type RealVariable, and annotated with @FactorArgument: one for the realization, one for the shape, one for the rate. Note that these should also be public and final. The keyword public simply means they are accessible from functions outside of the current one. The keyword final means that the value of the field cannot be changed.
Implement logDensity(). You can the existing implementation provided by the math-commons project (see how I did it in the Exponential class).
Create a constructor.

That's it! We now have a working factor.

Before we move on, just a quick not on why we made the fields final. Recall that almost all variables in java are in fact disguised pointers to memory addresses in the heap (the only exceptions are so-called primitives, int, double, char, etc). This means that the arguments of factor graphs will always point to the same memory location in the heap. This ensures that if two different factors share a variable at the beginning, they will always share it throughout sampling. Note that even if the reference is final, we can still modify what is the contents of this memory location. Under the hood, this is what the MCMC sampler does (for example for real variables, via variable.setValue().

Improving the factor via object and generic programming

We will now do a few things to improve our implementation.

Static methods: It is often convenient to access the logic inside a method (function associated with an object) without always having to instantiate an object (new ClassName().function()). As a consequence, when possible it is good to pull the contents of method into a static function. See logDensity(double point, double rate) in Exponential.java. This way, it is possible to get the density using Exponential.logDensity(x,r), or even shorter, by adding import static bayonet.distributions.Exponential.* at the top of the file and then accessing the function with just logDensity(x,r).

Flexible parameterization: Distribution families are often parameterized using more than one type of parameterization. For example, we would like the same Gamma class to support both shape-scale and shape-rate parameterizations. Here are the detailed steps of how this can be done:

Create an interface called Parameters. In cases like here where an interface or class A is used mostly in the context of another one B, it is a good idea to write A inside the file B.java as well. To do this, simply use public static interface B {} (static here means that B does not have access to the fields in A; this is the best choice most of the time).
Create two classes ShapeScaleParameterization and ShapeRateParameterization using the same strategy, one where you copy the shape and rate from earlier (keeping their annotations), and the second where you create two new fields with again the appropriate annotations. Make all the fields final and create constructors. Also, make the two class implement Parameters.
Go back to the interface you have created, declare the signature of two methods, one that returns a rate, public double getRate(); one that return a shape.
Eclipse will now underline in red the two classes you have created. This is because they do not respect the contract set by their interfaces. Clicking on the corresponding red icon in the margin, and selecting Add unimplemented methods will create stubs for the missing methods. Fill these methods: in one class, this will just return the identity, in the other, it does some simple transformations.
Finally, in the Gamma class, replace the shape and rate fields by an instance of Parameters, called parameters. Replace occurrences of rate.getValue() by parameters.getRate(), and similarly for the shape. Update the constructors.
Annotate the new field parameters with @FactorComponent. This means that while parameters is not by itself a variable connected to this factor, it does contain variables connected to this factor (those in the implementation annotated by @FactorArgument. Note that this can be recursive (a factor component can contain further factor components, etc).

Generics: The class as-is is full functional, but using it in method definitions is a bit cumbersome: for example if we wanted to put an exponential prior on the rate of a gamma, we would need to write something like:

@DefineFactor 
Gamma gamma = new Gamma(x);

@DefineFactor
Exponential exponential = Exponential.on(((RateParameterization) gamma.parameters).rate);

This is because in its current form, the compiler only knows that parameters is of type Parameter, it does not known the precise type (ShapeScaleParameterization or ShapeRateParameterization).

To address this, we will use generics, which give us a way to declare which flavor of Gamma we use right when we declare it. The end result will be being able to write:

@DefineFactor 
Gamma<ShapeRateParameterization> gamma = new Gamma<ShapeRateParameterization>(x);

@DefineFactor
Exponential exponential = Exponential.on(gamma.parameters.rate);

To do this, proceed as follows:

Declare a generic type in the class declaration: public class Gamma<P extends Parameters>. Here P is what is called a generic. It is like a variable, but instead of holding a value, it holds a type (class or interface name). The part extends Parameters is called a type bound: it means that P is has to implement Parameters.
Change public final Parameters parameters; to public final P parameters;.

Method chaining: This is a technique that again makes it just cleaner to define models. This allows writing Gamma.on(x).withShapeAndRate(y, z). See examples at the bottom of Exponential.java.

Testing probabilistic programs

Writing probabilistic programs is tricky: subtle and not-so-subtle bugs can easily go unnoticed if one is not careful.

Blang offers some tools to help discover bugs in probabilistic programs. The main one, CheckStationarity, is based on the fact that forward sampling is often easier to implement than posterior sampling, and more
importantly, is typically based on different ideas than posterior sampling. Both types of sampling therefore share minimal code with the posterior sampling code. So the hope is that they do not share bugs either, so that discrepancy between the two methods can be detected.

Setting up this test is fairly straightforward:

// Creates a model similar to the small example, but with no observations
// (forward simulation requires that there be no observations).
TestBlangSmallExample runner = new TestBlangSmallExample();

// Build an MCMC algorithm on this model (this was done implicitly in runner.run() earlier,
// but this time we will need a bit more control)
MCMCAlgorithm algo = runner.buildMCMCAlgorithm();

// Print a summary of the model to make sure the model is what we intend to test
// The syntax ${someString} in this output denotes a random variable named someString.
System.out.println("Summary of model");
System.out.println("----------------");
System.out.println(algo.model);

// The number of MCMC sweeps to perform at each iteration of the test (more info below)
algo.options.nMCMCSweeps = 10;

// Actual code for setting up the test itself
CheckStationarity check = new CheckStationarity();
check.setShowSampleSummaryStats(true);
System.out.println("Summary statistics of the samples");
System.out.println("---------------------------------");

// Here: 1000 is the number of test iterations (different than MCMC sweeps, see below)
//       0.05 is a p-value threshold
check.check(algo, 10000, 0.05);

_{From:conifer.TestBlangSmallExample}

Optional question: look at the information in CheckStationary.java, and explain how it works and the theory behind it.

Note: this test can be quite expensive for large problems. Fortunately, most bugs can be detected on fairly small artificial datasets. So you should make your model/data you test against small enough so that you can frequently and quickly run test, but large enough so that you have code coverage, i.e. all parts of your code are needed to cover the examples being tested.

An important related principle is that you should strive to test against all small problem instances if possible. When it is not possible, do as many as possible, focusing on corner and degenerate cases.

Also, make sure you test your test: commit your code, introduce one small bug, and see if your test can detect it.

To conclude this section, a second test, which makes sure that the randomness is fixed, meaning that if you run two times the code with the same seed (as encoded in the Random object), then the exact same result should be obtained.

Very important for reproducibility and debugging.

TestBlangSmallExample runner = new TestBlangSmallExample();
CheckFixedRandomness checkRand = new CheckFixedRandomness(runner.factory);
checkRand.check(runner);

_{From:conifer.TestBlangSmallExample}

Bayesian Phylogenetic inference

We will now apply this framework to a more complex problem, Bayesian phylogenetic inference. We start with the core of the likelihood calculation, the matrix exponential.

Implementing the matrix exponential

Fill in marginalTransitionProbability().

Use the diagonalization method covered in class, using the eigen-decomposition functionalities provided by EJML.

See EJMLUtils.java in the bayonet project (v. 2.0.3).

_{From:conifer.ctmc.CTMC}

Fill in stationaryDistribution().

Show how you can reduce the problem of finding the stationary distribution of a CTMC to a eigenvalue problem. Again, solve this problem using EJML.

_{From:conifer.ctmc.CTMC}

Computing the likelihood on a tree

Next, we will use the matrix exponential and a discrete factor graph to compute the likelihood of the data under a certain tree.

First, derive the form of the factor graph that is needed to do this calculation. See lecture 10 for a refresher on the model.

Look at the documentation in DiscreteFactorGraph for help on how to create the factor graph.

Look also at EdgeSorter to orient the edges of the tree (recall that you can root the tree arbitrarily).

_{From:conifer.factors.TreeLikelihood}

Sampling the tree

We now have a working likelihood factor, but note the tree random variable it is connected to is not of type supported by default in Blang. This brings us to the next topic, how to create samplers for new datatypes. Notice that at the top of the class declaration of UnrootedTree, there is an annotation @Samplers({SingleNNI.class, SingleBranchScaling.class}), which specifies which samplers to use for this type of object.

Let us look at one of them, SingleNNI.

This is the core of the tree sampling procedure, and also illustrate the simplest way of creating new samplers in Blang:

create a class that implements HMProposalDistribution
add a field annotated with @SampledVariable which will be automatically populated with the variable to resample.
add a list of factor fields annotated with @ConnectedFactor
implement the propose() method to modify the variable in place.

Your exercise here is to prove that the combination of SingleBranchNNI's and SingleBranchScaling's make the sampler irreducible.

List<UnorderedPair<TreeNode, TreeNode>> nonTerminalEdges = TopologyUtils.nonTerminalEdges(tree.getTopology());
UnorderedPair<TreeNode, TreeNode> referenceEdge = DiscreteUniform.sample(nonTerminalEdges, rand);
TreeNode moved1 = sampleMovedEndpoint(referenceEdge, referenceEdge.getFirst(), rand);
TreeNode moved2 = sampleMovedEndpoint(referenceEdge, referenceEdge.getSecond(), rand);
tree.interchange(moved1, referenceEdge.getFirst(), moved2, referenceEdge.getSecond());
return new ProposalRealization(moved2, referenceEdge.getFirst(), moved1, referenceEdge.getSecond());

_{From:conifer.moves.SingleNNI}

Running the phylogenetic sampler

Finally, you can now run the phylogenetic sampler.

Modify the process() function below to save the output trees to a file. Use UnrootedTreeUtils.toNewick() to save the files in the newick format, a standard way of saving tree.

protected void process(blang.processing.ProcessorContext)
{
    samples.add(treePrior.branchDistribution.parameters.rate.getValue());
}

_{From:conifer.SimplePhyloModel}

You can look at these trees using for example FigTree

Optional: summarizing the trees

Warning: For this part, you may need to update your git repository to the latest version of the scaffold code. (code updated on March 10)

Finally, summarize all these trees using a consensus tree. Recall that we discuss in class that the consensus tree corresponds to a Bayes estimator, see http://sysbio.oxfordjournals.org/content/57/5/814.full for the details.

You can use existing implementations of consensus tree software, for example hashcs:

Download the software at https://code.google.com/p/hashcs/downloads/list
Unzip, configure using ./configure from the unzipped directory
make
You can now compute consensus using ./hashcs path/to/treefile 0, where path/to/treefile should be replaced by the path to the file containing the newick files, one per line.

Use a strict consensus. See the documentation for more info.

Include a visualization of the consensus tree in your report.

alexandrebouchard / conifer-scaffold Goto Github PK

conifer-scaffold's Introduction