compbiomed / celda

Bayesian Hierarchical Modeling for Clustering Single Cell Genomic Data

License: MIT License

R 95.84% C++ 1.84% C 2.32%

celda's Introduction

Note: This repository is no longer updated. Please go to https://github.com/campbio/celda for latest updates.

celda: CEllular Latent Dirichlet Allocation

"celda" stands for "CEllular Latent Dirichlet Allocation", which is a suite of Bayesian hierarchical models and supporting functions to perform gene and cell clustering for count data generated by single cell RNA-seq platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. Celda has advantages over other clustering frameworks:

Celda can simultaneously cluster genes into transcriptional states and cells into subpopulations
Celda uses count-based Dirichlet-multinomial distributions so no additional normalization is required for 3' DGE single cell RNA-seq
These types of models have shown good performance with sparse data.

Installation Instructions

To install the most recent beta release of celda via devtools:

library(devtools)
install_github("compbiomed/[email protected]")

The most up-to-date (but potentially less stable) version of celda can similarly be installed with:

install_github("compbiomed/celda")

NOTE: On OSX, devtools::install_github() requires libgit2 to be installed. It can be installed via Homebrew:

brew install libgit2

Examples and vignettes

Vignettes are available in the package.

An example analysis of RNA-seq data with celda is available via vignette('celda-analysis').

New Features and announcements

The v0.4 release of celda represents a useable implementation of the various celda clustering models. Please submit any usability issues or bugs to the issue tracker at https://github.com/compbiomed/celda

You can discuss celda, or ask the developers usage questions, in our Google Group.

celda's People

albluca dfjenkins3 ykoga07 xingyishi definitelysean chlee-tabin pinardemetci bretonics hicsail adlewismbb andrewgr12 chitrita zha0rong

celda's Issues

Normalize counts error

This line of code:

counts.norm = sweep(counts, 2, colSums(counts) * scale.factor, "/")

should be:

counts.norm = sweep(counts, 2, colSums(counts) / scale.factor, "/")

Also, can you export this function so we can use it?
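A minimal sketch of the corrected normalization, assuming counts is a genes-by-cells numeric matrix. The function name normalizeCountsSketch and the default scale.factor are hypothetical and not part of celda's API. Dividing each column by colSums(counts) / scale.factor makes every column sum to scale.factor, whereas multiplying by scale.factor would shrink the columns instead.

```r
# Hypothetical helper for illustration; the name and default are assumptions.
normalizeCountsSketch <- function(counts, scale.factor = 1e6) {
  # Divide each column (cell) by its total, rescaled so it sums to scale.factor
  sweep(counts, 2, colSums(counts) / scale.factor, "/")
}

# Toy genes-by-cells matrix with column totals 4, 4, and 5
counts <- matrix(c(1, 3, 2, 2, 0, 5), nrow = 2)
counts.norm <- normalizeCountsSketch(counts, scale.factor = 100)
colSums(counts.norm)
```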

Parameter to toggle saving Gibbs sampling history

We may want to run celda without caring about output data that could be used for diagnostics, such as the history of cluster assignments over each iteration of Gibbs sampling. We should add a parameter to the main function to toggle whether to return (or perhaps even store in memory) all iterations of Gibbs sampling results.

Consider only storing the previous iteration in memory and discarding over time, to keep the memory footprint really small.

Remove require statements from celda_heatmap()

Hi Iris,

Could you take a look at the beginning of the celda_heatmap() function? It looks like there are a few require() statements that should be removed. Instead, wherever you use a function from one of these packages, refer to it explicitly. This makes the code cleaner and also shows readers which functions are in our package and which are not.

For example, if I was going to use lmFit() from the limma package, I'd write explicitly: limma::lmFit().
You can see another example in util.R, where I reference the Rmpfr package explicitly.

Let me know if you have any questions!
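As a small illustration of the explicit-namespace style described above, here is a sketch that uses stats::median from base R's stats package in place of a require() call. The helper name rowMediansSketch is hypothetical and is not a celda function.

```r
# Hypothetical helper for illustration; not part of celda.
# No require(stats) is needed: the call names its package explicitly.
rowMediansSketch <- function(mat) {
  apply(mat, 1, stats::median)
}

m <- matrix(1:6, nrow = 2)
rowMediansSketch(m)
```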

Uniform function to calculate log-likelihood (with multiplexing)

The gene clustering, cell clustering, and gene-cell clustering code all have differently parameterized, user facing functions to calculate log likelihoods (e.g. cC.calcLLFromVariables).

We should consider making a uniform "calculate_log_likelihood" function, the parameters to which would indicate which of these specific types of calculations should be performed. Each submethod should be implemented in a separate function. Ideally the different log-likelihood submethods would exist in their own R script, away from the Gibbs sampling code.

The only sticking point I see is that these functions don't have a celda result object as a parameter, so an S3 method dispatch won't work...
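One possible way around the S3 limitation is to dispatch on a model name with switch(). The sketch below shows that idea only: every function name and signature is hypothetical, and the three submethods are placeholders that return a label rather than computing a real log-likelihood.

```r
# Placeholder submethods; each would hold the model-specific calculation.
# Names and signatures are assumptions, not celda's actual functions.
calcLLCellSketch <- function(...) "celda_C log-likelihood (placeholder)"
calcLLGeneSketch <- function(...) "celda_G log-likelihood (placeholder)"
calcLLCellGeneSketch <- function(...) "celda_CG log-likelihood (placeholder)"

# Uniform entry point: 'model' selects which submethod receives the arguments
calculateLogLikelihoodSketch <- function(model, ...) {
  submethod <- switch(model,
    celda_C = calcLLCellSketch,
    celda_G = calcLLGeneSketch,
    celda_CG = calcLLCellGeneSketch,
    stop("Unknown model: ", model)
  )
  submethod(...)
}

calculateLogLikelihoodSketch("celda_CG")
```

Because the dispatch key is a plain string, no celda result object is required, and each submethod can live in its own script.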

Count simulation code should be condensed into distinct file

All of the various count matrix simulation code should appear together in a single file or subdirectory.

celda_list subclasses based off of contents

Consider making subclasses like "celda_list.celda_C" to indicate celda_lists that contain celda_C models. Then, could implement different getModel() methods using S3.

Visualization of Model Performance by Model Parameters

Verify getModel returns model with params provided by user

getModel() currently will use the celda_list object's run.params table to determine where to look in res.list for the user's desired chain. If the user rearranges res.list, or modifies run.params, the incorrect model will be returned.

getModel() should double check that the model has the provided specifications, and if not, it should sequentially search through res.list until it finds the right model, returning NA if it still cannot be found.
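The lookup-then-verify behavior could be sketched as follows. This assumes, hypothetically, that run.params is a data frame with K, L, and chain columns and that every model in res.list stores its own K, L, and chain; the real object layout may differ, and getModelSketch is not the package's getModel().

```r
# Does a stored model carry the requested specifications?
modelMatchesSketch <- function(model, K, L, chain) {
  isTRUE(model$K == K) && isTRUE(model$L == L) && isTRUE(model$chain == chain)
}

getModelSketch <- function(celda.list, K, L, chain) {
  params <- celda.list$run.params
  idx <- which(params$K == K & params$L == L & params$chain == chain)
  # Fast path: trust run.params only if the model at that index really matches
  if (length(idx) == 1 && idx <= length(celda.list$res.list) &&
      modelMatchesSketch(celda.list$res.list[[idx]], K, L, chain)) {
    return(celda.list$res.list[[idx]])
  }
  # Fallback: search res.list sequentially
  for (model in celda.list$res.list) {
    if (modelMatchesSketch(model, K, L, chain)) return(model)
  }
  NA
}

# res.list is deliberately reversed relative to run.params
models <- list(list(K = 2, L = 3, chain = 1, id = "a"),
               list(K = 4, L = 3, chain = 1, id = "b"))
celda.list <- list(run.params = data.frame(K = c(2, 4), L = c(3, 3), chain = c(1, 1)),
                   res.list = rev(models))
getModelSketch(celda.list, K = 4, L = 3, chain = 1)$id
```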

Getters for celda_list

Need to implement S3 getters (e.g. completeLogLikelihood) that can return aggregate results over all models in a celda_list.
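A sketch of what such an S3 getter might look like. It assumes each model in res.list stores a single numeric value in a field called finalLogLik; that field name and the generic's name are hypothetical.

```r
# Generic getter; methods are chosen by the class of 'x'
completeLogLikelihoodSketch <- function(x, ...) {
  UseMethod("completeLogLikelihoodSketch")
}

# Method for celda_list: one value per model in res.list
completeLogLikelihoodSketch.celda_list <- function(x, ...) {
  vapply(x$res.list, function(model) model$finalLogLik, numeric(1))
}

celda.list <- structure(
  list(res.list = list(list(finalLogLik = -10), list(finalLogLik = -5))),
  class = "celda_list"
)
completeLogLikelihoodSketch(celda.list)
```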

Option for log level setting

Some of the celda functions, when called directly, use cat() to output different messages (e.g. the iteration of Gibbs sampling, etc). We should provide an option to set the log level, and quell these messages for "quieter" (presumably default) log levels.

As suggested by @dfjenkins3
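One way this could work is a global option checked by a small logging helper. In the sketch below, the option name celda.log.level, the level names, and the helper logSketch are all assumptions. It uses message() rather than cat(), so output goes to stderr and can also be silenced with suppressMessages().

```r
# Hypothetical logging helper; option and level names are assumptions.
logSketch <- function(..., level = "info") {
  levels <- c(quiet = 0, info = 1, debug = 2)
  current <- getOption("celda.log.level", "quiet")  # quiet by default
  # Emit only when the message's level is within the configured verbosity
  if (levels[[level]] <= levels[[current]]) message(...)
  invisible(NULL)
}

options(celda.log.level = "info")
logSketch("Gibbs sampling iteration ", 1)            # shown
logSketch("Detailed state dump", level = "debug")    # suppressed at "info"
```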

Error when given small matrix

After creating a small matrix (m <- matrix(2,nrow=2,ncol=2)) and running celda(m,"celda_C"), I get this error:

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "celda_C"
4.
stop(sprintf("unable to find variable \"%s\"", sym))
3.
e$fun(obj, substitute(ex), parent.frame(), e$data)
2.
foreach(i = 1:nrow(runs), .export = model, .combine = c, .multicombine = TRUE) %dopar%
{
chain.seed = all.seeds[ifelse(i%%nchains == 0, nchains,
i%%nchains)] ...
1.
celda(m, "celda_C")

Remove "All Zero" rows in simulate functions

Sometimes, some genes will be all zero after simulation, depending on the parameters supplied. If someone then tries to run the model on this matrix, it will fail. We should check for empty rows and remove them before returning the simulated matrix. Also, make sure to remove the corresponding entries in the "y" vector.
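A minimal sketch of that clean-up step, assuming counts is a genes-by-cells matrix and y holds one cluster label per gene (row). The helper name dropEmptyGenesSketch is hypothetical.

```r
# Hypothetical helper; drops all-zero genes and their labels together.
dropEmptyGenesSketch <- function(counts, y) {
  stopifnot(length(y) == nrow(counts))
  keep <- rowSums(counts) > 0
  # drop = FALSE keeps the result a matrix even if only one row survives
  list(counts = counts[keep, , drop = FALSE], y = y[keep])
}

counts <- rbind(c(1, 0, 2), c(0, 0, 0), c(3, 1, 0))
y <- c(1, 2, 2)
sim <- dropEmptyGenesSketch(counts, y)
dim(sim$counts)
sim$y
```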

Derive the z.probs/y.probs from getters

z.probs / y.probs are not returned by default on the celda_list object returned from celda(). Their corresponding getter functions should calculate and return them post-facto.

Fix Model Dispatch From celda()

The celda() function provides the sample.label parameter to all model functions (e.g. celda_CG()). However, celda_G() doesn't accept this parameter and throws an error.

We should ideally use an S3 scheme to resolve the correct model function to run, and only provide the parameters necessary.

Validate matrix function

Required:

no all zero rows
no all zero columns

Optional:

If rownames and colnames are supplied as input parameters, check that the matrix has the same rownames/colnames
Matrix dimensions
total counts of matrix (this would need to be added to output of celda object so it can be checked in downstream functions)
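The required checks, plus the optional name checks, might be sketched like this. The function name validateCountsSketch and its parameter names are hypothetical, and the total-counts check is left out because it depends on a field that the celda output does not yet have.

```r
# Hypothetical validator; stops with a descriptive error on the first problem.
validateCountsSketch <- function(counts, rownames.expected = NULL,
                                 colnames.expected = NULL) {
  if (!is.matrix(counts) || !is.numeric(counts)) {
    stop("'counts' must be a numeric matrix")
  }
  # Required checks
  if (any(rowSums(counts) == 0)) stop("'counts' contains all-zero rows")
  if (any(colSums(counts) == 0)) stop("'counts' contains all-zero columns")
  # Optional checks, run only when expected names are supplied
  if (!is.null(rownames.expected) &&
      !identical(rownames(counts), rownames.expected)) {
    stop("rownames of 'counts' do not match the expected rownames")
  }
  if (!is.null(colnames.expected) &&
      !identical(colnames(counts), colnames.expected)) {
    stop("colnames of 'counts' do not match the expected colnames")
  }
  invisible(TRUE)
}

good <- matrix(c(1, 2, 3, 4), nrow = 2, dimnames = list(c("g1", "g2"), c("c1", "c2")))
validateCountsSketch(good, rownames.expected = c("g1", "g2"))
```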

Fail gracefully when unnecessary parameters are provided to celda_C / celda_G

@joshua-d-campbell might be interested in this bug:

I tried a celda_C run, providing the same parameters that I'd provide to celda_CG. These parameters get passed to expand.grid, which sets up the parameters for the individual chains to be run. If you pass an unused parameter (in this case, I passed L=10), foreach throws an error:

Error in { : task 1 failed - "unused argument (L = 10)"

The tool shouldn't fail when this happens. It should do one of the following:

Produce a warning about the unused parameter, or
Fail gracefully with an appropriate error message
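The first option, warning about the unused parameter, could be sketched by comparing the supplied arguments against formals() of the model function before calling it. The helper filterArgsSketch and the stand-in toyModel are hypothetical; a real model function would take many more arguments.

```r
# Hypothetical helper: keep only the arguments 'fn' accepts, warn about the rest
filterArgsSketch <- function(fn, args) {
  allowed <- names(formals(fn))
  if ("..." %in% allowed) return(args)  # fn accepts arbitrary arguments
  unused <- setdiff(names(args), allowed)
  if (length(unused) > 0) {
    warning("Ignoring unused parameter(s): ", paste(unused, collapse = ", "))
  }
  args[names(args) %in% allowed]
}

# Stand-in for a model function that does not take L
toyModel <- function(counts, K) K

args <- list(counts = matrix(1), K = 5, L = 10)
kept <- suppressWarnings(filterArgsSketch(toyModel, args))
do.call(toyModel, kept)
```

The same filtering would also keep a parameter such as sample.label from reaching a model function that does not accept it.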

Multiplex simulateCells, with a model parameter

Empty function call returns error. Parameter check needed.

celda()
chain
1 1
Error in foreach(i = 1:nrow(runs), .export = model, .combine = c, .multicombine = TRUE) :
argument "model" is missing, with no default

Extract best model

Extract the best chain from the celda wrapper based on a given set of parameters. E.g.

extractBestModel(celda.wrapper.obj, K=10, L=15)

where celda.wrapper.obj is the output from "celda"

Remove visualizeModelPerformance() and Related

Continuing from #30, refactor visualize_performance() so it has different implementations by performance metric. The function currently contains a lot of odd behavior to work around the differences between the methods.

Rename celda_C, celda_G, celda_CG internal function names

Named vectors for finalClusterAssigments() results, rather than matrix

Matrix factorization for population states

The "probability" and "posterior" matrices are incorrect for celda_C and celda_CG. They need to be normalized over rows and not columns.

S3 functions for heatmap generation

Class of celda objects

Have the output of each individual celda function return a unique class (e.g. celda_C should return an object with class 'celda_C'). The celda wrapper function should return a list of these celda objects in the 'res.list' variable.

Once a user selects best choice of K/L parameters after looking at the perplexity plots, there should be a getter function to pull out the correct celda object from the larger list. E.g.

selected.model = getBestModel(celda.wrapper.obj, K=10, L=20, chain=1)

would return the chain of the models that had K=10 and L=20. And/or one could do:

selected.model = getBestModel(celda.wrapper.obj, K=10, L=20, best="loglik")

would return the chain with the best log likelihood from the set of chains that matched K=10 and L=20. 'best' could be based on loglik, perplexity, or whatever metric we have available.
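The best="loglik" selection could look roughly like the sketch below. It assumes each model in res.list stores K, L, chain, and a single numeric finalLogLik; those field names and the function name getBestModelSketch are hypothetical.

```r
# Hypothetical selector: best log-likelihood among chains matching K and L
getBestModelSketch <- function(res.list, K, L) {
  matches <- Filter(function(m) isTRUE(m$K == K) && isTRUE(m$L == L), res.list)
  if (length(matches) == 0) return(NA)
  logliks <- vapply(matches, function(m) m$finalLogLik, numeric(1))
  matches[[which.max(logliks)]]
}

res.list <- list(
  list(K = 10, L = 20, chain = 1, finalLogLik = -120),
  list(K = 10, L = 20, chain = 2, finalLogLik = -95),
  list(K = 5,  L = 20, chain = 1, finalLogLik = -10)
)
getBestModelSketch(res.list, K = 10, L = 20)$chain
```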

Implement function to calculate geometric mean for marginal likelihood

z/y history is different than returned z/y

The z/y variables get reordered at the end of each model to something more systematic. However, the z/y history variables, if saved, still have the original z/y designations. E.g. the cells with z equal to 1 in the z output variable will be different from the cells with z equal to 1 in the saved history. We need to go through and rename the entries of the z/y history variables.
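A sketch of the renaming step, assuming the history is stored as a matrix of integer cluster labels (one column per iteration) and that the final reordering is available as a vector new.labels in which new.labels[old] gives the new label for cluster old. Both the storage layout and the helper name are assumptions.

```r
# Hypothetical helper: apply the final relabeling to every saved iteration
relabelHistorySketch <- function(history, new.labels) {
  relabeled <- new.labels[history]   # look up the new label of each entry
  dim(relabeled) <- dim(history)     # restore the matrix shape
  relabeled
}

# Two cells (rows) over two iterations (columns)
history <- matrix(c(1, 2, 2, 3), nrow = 2)
# Old cluster 1 becomes 3, old 2 becomes 1, old 3 becomes 2
relabelHistorySketch(history, new.labels = c(3, 1, 2))
```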

Make celda_C model work with celda()

Currently, only the celda_CG model is run properly when using the parallelized master front-end to celda, the celda() function (celda.R). We obviously need to extend this to the additional models as well.

toy dataset

Below is code to generate a small toy dataset that can be used to illustrate how the different models work. Can this be added to the package as attached data?

set.seed(123)
p1.1 = c(0.5, 0.15, 0.1, 0, 0, 0, 0, 0, 0.2, 0.05)
p1.2 = c(0.5, 0.15, 0.1, 0, 0, 0, 0, 0, 0.2, 0.05)
p2.1 = c(0, 0, 0, 0.45, 0.15, 0.15, 0, 0, 0.2, 0.05)
p2.2 = c(0, 0, 0, 0.5, 0.25, 0.25, 0, 0, 0.2, 0.05)
p3 = c(0, 0, 0, 0, 0, 0, 0.1, 0.9, 0, 0)
p4 = (p1.1 + p2.2) /2
r.cells = cbind(rmultinom(1, size=100, prob=p1.1),
                rmultinom(2, size=125, prob=p1.2),
                rmultinom(1, size=125, prob=p2.1),
                rmultinom(2, size=100, prob=p2.2),
                rmultinom(3, size=100, prob=p3),
                rmultinom(3, size=100, prob=p4))

Optimize and integrate celda_G with main celda() function

Let me know if you want me to tackle this one, @joshua-d-campbell. I can do so based on your changes to celda_C.

Choose chain for render_celda_heatmap()

celda_heatmap() currently won't work correctly if there are multiple chains.

celda() Validation Code Unit Tests

Sample.label instead of samples

Make $sample from simulateCells.celda into $sample.label (so it matches $sample.label in celda output)

Reorder labels based on user-defined classification

Add a function to relabel z/y variable and reorder all other objects within a celda object accordingly

Bug in y.split.each

in split_clusters.R on line 207

pairs = c()

needs to be changed to:

pairs = c(NA, NA, NA)

Otherwise, the wrong gene solution is chosen (off by one index).

Help file for sample.cells example data

Squelch doParallel thread closing messages

Input Validation

Need to have a check at beginning to ensure the rows or columns of the matrix aren't all zeros. We can either throw an error or automatically filter out the offending rows/columns? My thought would be to throw an error. Any other check that should be performed?

Logging

Have output be sent to a log file so it can be viewed later.

fix gamma arg

On line 274 in celda_CG.R in the cCG.calcLL function call, it should be

gamma=gamma

not

gamma=1

ll = cCG.calcLL(K=K, L=L, m.CP.by.S=m.CP.by.S, n.CP.by.TS=n.CP.by.TS, n.by.G=n.by.G, n.by.TS=n.by.TS, nG.by.TS=nG.by.TS, nS=nS, nG=nG, alpha=alpha, beta=beta, delta=delta, gamma=a)

S3-ify visualize performance

Make the newly-introduced visualize_performance() function generic, and have concrete implementations for the different celda model classes.

Make sure to rewrite it to use celda_list getters where appropriate, especially when pulling out the model performance metrics (#56).

Colnames of factorized sample matrix

In both "factorizeMatrix.celda_CG" and "factorizeMatrix.celda_C", the line:

colnames(m.CP.by.S) = colnames(counts)

is incorrect. The colnames should be set to the sample ids, not the colnames of the counts matrix (i.e. cell ids). We could do:

colnames(m.CP.by.S) = unique(sample.label)

ASSUMING that they supplied the same sample label as before and it is in the same order. But that might be risky. We could also have the unique sample ids be returned as a vector in the celda output objects. Then the line would be:

colnames(m.CP.by.S) = celda.obj$sample.ids

or something like that. If we don't do anything, then these lines need to be commented out; otherwise, they throw an error.

Seeds in parallel chains

Different seeds should be used for different chains with the same set of parameters in the "celda" function. For example, if someone requests 3 chains and the seed is 12345, then the seed for each chain should be 12345, 12346, 12347 (or something like that). You might also want to add the option for someone to specify the seed for each chain. E.g. if 3 chains are requested, then the user can input 1 number which will be added to as in the above example or the user can input 3 numbers directly.
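The seed scheme described above might be sketched as follows. The helper name chainSeedsSketch is hypothetical. A single seed is expanded into consecutive seeds, and a vector with one seed per chain is passed through unchanged.

```r
# Hypothetical helper: derive one seed per chain
chainSeedsSketch <- function(seed, nchains) {
  if (length(seed) == 1) {
    return(seed + seq_len(nchains) - 1)  # e.g. 12345, 12346, 12347
  }
  if (length(seed) == nchains) {
    return(seed)                         # user supplied one seed per chain
  }
  stop("'seed' must have length 1 or length 'nchains'")
}

chainSeedsSketch(12345, nchains = 3)
chainSeedsSketch(c(7, 42, 99), nchains = 3)
```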

celda_heatmap.celda_C cannot plot clusters with labels

celda_heatmap.celda_C is not able to plot the cluster assignments. It needs the chain and K to be passed as parameters so that it can find the matching model.

color centering of render_celda_heatmap

When the positive and negative expression values are imbalanced, the center of the color scale shifts, so 0 is no longer aligned with white.

toy_celda_c = celda(sample.cells, model="celda_C",K = 4)
render_celda_heatmap(counts = sample.cells, z=finalClusterAssignment(toy_celda_c))

This can be fixed using breaks as detailed here:
http://stackoverflow.com/questions/31677923/set-0-point-for-pheatmap-in-r
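A sketch of the breaks-based fix in base R: build breaks that are symmetric around 0, so the middle of a diverging palette stays at 0 even when the data are skewed. The helper name symmetricBreaksSketch is hypothetical. The result is one element longer than the number of colors, which is the form pheatmap's breaks argument expects.

```r
# Hypothetical helper: breaks symmetric around 0 for a diverging palette
symmetricBreaksSketch <- function(values, n.colors = 101) {
  limit <- max(abs(values), na.rm = TRUE)
  # n.colors bins need n.colors + 1 break points
  seq(-limit, limit, length.out = n.colors + 1)
}

# Skewed values: far more positive range than negative
values <- c(-1, 0, 2, 6)
breaks <- symmetricBreaksSketch(values)
range(breaks)
```

With an odd number of colors, the middle bin straddles 0, so the middle (white) color of the palette covers values near 0.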

another celda_CG bug

On line 309, the args to split.z at the end should be:

delta=delta, gamma=gamma)

So change "delta=1" to "delta=delta" and add "gamma=gamma"

We should probably have "seed" be an argument in the "celda" function rather than having it be passed via the "...".