celda's Introduction

Note: This repository is no longer updated. Please go to https://github.com/campbio/celda for the latest updates.

celda: CEllular Latent Dirichlet Allocation

"celda" stands for "CEllular Latent Dirichlet Allocation", a suite of Bayesian hierarchical models and supporting functions for clustering genes and cells in count data generated by single-cell RNA-seq platforms. The algorithm extends the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications. Celda has several advantages over other clustering frameworks:

  1. Celda can simultaneously cluster genes into transcriptional states and cells into subpopulations.
  2. Celda uses count-based Dirichlet-multinomial distributions, so no additional normalization is required for 3' DGE single-cell RNA-seq data.
  3. These types of models have shown good performance with sparse data.

Installation Instructions

To install the most recent beta release of celda via devtools:

library(devtools)
install_github("compbiomed/[email protected]")

The most up-to-date (but potentially less stable) version of celda can similarly be installed with:

install_github("compbiomed/celda")

NOTE: On OSX, devtools::install_github() requires libgit2 to be installed. It can be installed via homebrew:

brew install libgit2

Examples and vignettes

Vignettes are available in the package.

An example analysis of RNA-seq data with celda is available via vignette('celda-analysis').

New Features and announcements

The v0.4 release of celda represents a usable implementation of the various celda clustering models. Please submit any usability issues or bugs to the issue tracker at https://github.com/compbiomed/celda

You can discuss celda, or ask the developers usage questions, in our Google Group.

celda's People

Contributors

andrewgr12, definitelysean, dfjenkins3, hhtiffany, irisapo, jiangyuan-liu, joshua-d-campbell, lloydliu717, masanao-yajima, seine, ykoga07, zhewa

celda's Issues

Normalize counts error

This line of code:

counts.norm = sweep(counts, 2, colSums(counts) * scale.factor, "/")

should be:

counts.norm = sweep(counts, 2, colSums(counts) / scale.factor, "/")

Also, can you export this function so we can use it?
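
A quick way to check the corrected line is to confirm that every column of the normalized matrix sums to scale.factor. The wrapper name normalize_counts below is hypothetical (the issue does not name the function), and the toy matrix is made up for illustration:

```r
# Hypothetical wrapper around the corrected line from this issue.
normalize_counts <- function(counts, scale.factor = 1e6) {
  # Divide each column by (column total / scale.factor), so that every
  # column of the result sums to scale.factor.
  sweep(counts, 2, colSums(counts) / scale.factor, "/")
}

# Toy 2-gene x 3-cell matrix; the column totals are 4, 4 and 10.
counts <- matrix(c(1, 3, 2, 2, 5, 5), nrow = 2)
counts.norm <- normalize_counts(counts, scale.factor = 100)
colSums(counts.norm)
```

With the original "*" version, each column would instead sum to 1 / scale.factor.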

Parameter to toggle saving Gibbs sampling history

We may want to run celda without caring about output data that could be used for diagnostics, such as the history of cluster assignments over each iteration of Gibbs sampling. We should add a parameter to the main function to toggle whether to return (or perhaps even store in memory) all iterations of Gibbs sampling results.

Consider only storing the previous iteration in memory and discarding over time, to keep the memory footprint really small.
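
A minimal sketch of what such a toggle could look like. The function run_chain, the parameter save.history and the update argument are all hypothetical; update stands in for one sweep of Gibbs sampling and is not the package's sampler:

```r
# Hypothetical driver loop. Only the current assignment vector 'z' is kept
# between iterations; the full history is allocated and filled only when
# save.history = TRUE.
run_chain <- function(z.init, n.iter, update, save.history = FALSE) {
  z <- z.init
  history <- NULL
  if (save.history) {
    history <- matrix(NA_integer_, nrow = length(z), ncol = n.iter)
  }
  for (i in seq_len(n.iter)) {
    z <- update(z)                        # stand-in for one Gibbs sweep
    if (save.history) history[, i] <- z   # one column per iteration
  }
  list(z = z, history = history)
}

# Toy "update" that just reverses the vector, to show the two modes.
res.small <- run_chain(c(1L, 2L, 3L), n.iter = 2, update = rev)
res.full  <- run_chain(c(1L, 2L, 3L), n.iter = 2, update = rev,
                       save.history = TRUE)
```

With save.history = FALSE the memory footprint stays at one assignment vector, regardless of the number of iterations.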

Remove require statements from celda_heatmap()

Hi Iris,

Could you take a look at the beginning of the celda_heatmap() function? It looks like there are a few require() statements that should be removed. Instead, wherever you use a function from one of these packages, you should refer to it explicitly. This makes the code cleaner, and also shows readers which functions are in our package and which are not.

For example, if I was going to use lmFit() from the limma package, I'd write explicitly: limma::lmFit().
You can see another example in util.R, where I reference the Rmpfr package explicitly.

Let me know if you have any questions!
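
To illustrate the pattern with a package that ships with every R installation, here is a small sketch. The function gene_medians is made up for this example and is not part of celda:

```r
# No require(stats) at the top; the package is named at the call site.
gene_medians <- function(counts) {
  # stats::median makes it clear that median() comes from the stats package.
  apply(counts, 1, stats::median)
}

# Toy 2-gene x 3-cell matrix: row 1 is (1, 3, 5), row 2 is (2, 4, 6).
counts <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
gene_medians(counts)
```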

Uniform function to calculate log-likelihood (with multiplexing)

The gene clustering, cell clustering, and gene-cell clustering code all have differently parameterized, user-facing functions to calculate log-likelihoods (e.g. cC.calcLLFromVariables).

We should consider making a uniform "calculate_log_likelihood" function, the parameters to which would indicate which of these specific types of calculations should be performed. Each submethod should be implemented in a separate function. Ideally the different log-likelihood submethods would exist in their own R script, away from the Gibbs sampling code.

The only sticking point I see is that these functions don't have a celda result object as a parameter, so an S3 method dispatch won't work...
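
Since there is no result object to dispatch on, one option is to dispatch on a model name instead. The sketch below assumes that approach. The ll_* functions are placeholders that return NA (the real submethods would wrap the existing model-specific calculations), and all names other than the model names are hypothetical:

```r
# Placeholder submethods. Each would live in its own script and wrap the
# model-specific calculation; here they only return a labelled NA.
ll_celda_C  <- function(...) c(celda_C = NA_real_)
ll_celda_G  <- function(...) c(celda_G = NA_real_)
ll_celda_CG <- function(...) c(celda_CG = NA_real_)

# Uniform entry point: 'model' selects the submethod, everything else is
# forwarded untouched.
calculate_log_likelihood <- function(model = c("celda_C", "celda_G", "celda_CG"),
                                     ...) {
  model <- match.arg(model)   # errors on an unknown model name
  switch(model,
         celda_C  = ll_celda_C(...),
         celda_G  = ll_celda_G(...),
         celda_CG = ll_celda_CG(...))
}

ll <- calculate_log_likelihood("celda_G", counts = matrix(1, 2, 2), L = 2)
```

match.arg() gives a readable error for an unsupported model name, which plain S3 dispatch on a string would not.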

Verify getModel returns model with params provided by user

getModel() currently will use the celda_list object's run.params table to determine where to look in res.list for the user's desired chain. If the user rearranges res.list, or modifies run.params, the incorrect model will be returned.

getModel() should double check that the model has the provided specifications, and if not, it should sequentially search through res.list until it finds the right model, returning NA if it still cannot be found.
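
A rough sketch of the check-then-search behavior described above. It assumes that run.params has columns index, K, L and chain, and that each model in res.list stores its own K, L and chain; those field names are guesses, not the package's confirmed layout:

```r
# Hypothetical verified lookup: trust run.params first, then fall back to a
# sequential search of res.list, and return NA if nothing matches.
get_model_checked <- function(celda.list, K, L, chain = 1) {
  matches <- function(model) {
    isTRUE(model$K == K) && isTRUE(model$L == L) && isTRUE(model$chain == chain)
  }
  rp  <- celda.list$run.params
  idx <- rp$index[rp$K == K & rp$L == L & rp$chain == chain]
  # Fast path: the position recorded in run.params still holds the right model.
  if (length(idx) == 1 && idx <= length(celda.list$res.list) &&
      matches(celda.list$res.list[[idx]])) {
    return(celda.list$res.list[[idx]])
  }
  # Fallback: res.list or run.params was modified, so search sequentially.
  for (model in celda.list$res.list) {
    if (matches(model)) return(model)
  }
  NA
}

# Toy celda_list whose res.list has been rearranged by the user.
m1 <- list(K = 5,  L = 10, chain = 1)
m2 <- list(K = 10, L = 15, chain = 1)
cl <- list(run.params = data.frame(index = 1:2, K = c(5, 10),
                                   L = c(10, 15), chain = c(1, 1)),
           res.list = list(m2, m1))
found <- get_model_checked(cl, K = 10, L = 15)
</gr_insert_placeholder>
```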

Getters for celda_list

Need to implement S3 getters (e.g. completeLogLikelihood) that can return aggregate results over all models in a celda_list.

Option for log level setting

Some of the celda functions, when called directly, use cat() to output different messages (e.g. the iteration of Gibbs sampling, etc). We should provide an option to set the log level, and quell these messages for "quieter" (presumably default) log levels.

As suggested by @dfjenkins3
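
One lightweight way to do this is a small wrapper around cat() that consults a global option. The option name celda.verbosity and the helper log_message are hypothetical:

```r
# Hypothetical logging helper: a message is printed only when its level is
# at or below the verbosity set via options(). The default of 0 is quiet.
log_message <- function(..., level = 1) {
  verbosity <- getOption("celda.verbosity", default = 0)
  if (level <= verbosity) cat(..., "\n", sep = "")
  invisible(NULL)
}

log_message("Gibbs iteration ", 10)   # silent unless celda.verbosity >= 1
options(celda.verbosity = 1)
log_message("Gibbs iteration ", 10)   # now written to the console
```

Existing cat() calls could then be swapped for log_message() one at a time without changing any function signatures.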

Error when given small matrix

After creating a small matrix (m <- matrix(2,nrow=2,ncol=2)), running celda(m,"celda_C") gives this error:

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
  unable to find variable "celda_C"
4. stop(sprintf("unable to find variable \"%s\"", sym))
3. e$fun(obj, substitute(ex), parent.frame(), e$data)
2. foreach(i = 1:nrow(runs), .export = model, .combine = c, .multicombine = TRUE) %dopar%
   {
       chain.seed = all.seeds[ifelse(i%%nchains == 0, nchains,
           i%%nchains)] ...
1. celda(m, "celda_C")

Remove "All Zero" rows in simulate functions

Sometimes some genes will be all zero after simulation, depending on the parameters supplied. If someone then tries to run the model on this matrix, it will fail. We should check for empty rows and remove them before returning the simulated matrix. Also, make sure to remove the corresponding entries in the "y" vector.
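
A minimal sketch of that clean-up step, assuming genes are rows and "y" holds one cluster label per gene. The helper name drop_empty_genes is hypothetical:

```r
# Drop genes (rows) with no counts and keep 'y' aligned with what remains.
drop_empty_genes <- function(counts, y) {
  keep <- rowSums(counts) > 0
  list(counts = counts[keep, , drop = FALSE],   # drop = FALSE keeps a matrix
       y = y[keep])
}

# Toy simulated matrix in which the second gene is all zero.
sim.counts <- rbind(c(1, 0), c(0, 0), c(2, 3))
sim.y <- c(1, 2, 2)
cleaned <- drop_empty_genes(sim.counts, sim.y)
```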

Derive the z.probs/y.probs from getters

z.probs / y.probs are not returned by default on the celda_list object returned from celda(). Their corresponding getter functions should calculate and return them post-facto.

Fix Model Dispatch From celda()

The celda() function provides the sample.label parameter to all model functions (e.g. celda_CG()). However, celda_G() doesn't accept this parameter and throws an error.

We should ideally use an S3 scheme to resolve the correct model function to run, and only provide the parameters necessary.
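
As a rough sketch of the S3 idea: wrap the model name in an object of that class and let each method forward only what its model accepts. The *_stub functions stand in for the real model functions (only their signatures matter), and run_model / dispatch_model are hypothetical names:

```r
# Stand-ins for the real model functions.
celda_CG_stub <- function(counts, sample.label, K, L) "ran celda_CG"
celda_G_stub  <- function(counts, L) "ran celda_G"   # no sample.label

run_model <- function(spec, counts, ...) UseMethod("run_model")

run_model.celda_CG <- function(spec, counts, sample.label, K, L, ...) {
  celda_CG_stub(counts, sample.label = sample.label, K = K, L = L)
}

run_model.celda_G <- function(spec, counts, L, ...) {
  # sample.label and K, if supplied, are absorbed by '...' and never forwarded.
  celda_G_stub(counts, L = L)
}

# The wrapper turns the model name into a class so S3 can pick the method.
dispatch_model <- function(model, counts, ...) {
  run_model(structure(list(), class = model), counts, ...)
}

m <- matrix(1, nrow = 2, ncol = 2)
out.G  <- dispatch_model("celda_G",  m, sample.label = c(1, 1), K = 2, L = 2)
out.CG <- dispatch_model("celda_CG", m, sample.label = c(1, 1), K = 2, L = 2)
```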

Validate matrix function

Required:

  1. no all zero rows
  2. no all zero columns

Optional:

  1. Accept expected rownames and colnames as input parameters and check that the matrix has the same rownames/colnames
  2. Matrix dimensions
  3. total counts of matrix (this would need to be added to output of celda object so it can be checked in downstream functions)
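
A possible shape for this function, covering the required and optional checks listed above. The function name and all parameter names are hypothetical:

```r
# Hypothetical validator: the required checks always run, the optional
# checks run only when an expected value is supplied.
validate_counts <- function(counts, row.names = NULL, col.names = NULL,
                            dims = NULL, total = NULL) {
  if (!is.matrix(counts) || !is.numeric(counts)) {
    stop("'counts' must be a numeric matrix")
  }
  if (any(rowSums(counts) == 0)) stop("'counts' contains all-zero rows")
  if (any(colSums(counts) == 0)) stop("'counts' contains all-zero columns")
  if (!is.null(row.names) && !identical(rownames(counts), row.names)) {
    stop("rownames of 'counts' do not match the expected rownames")
  }
  if (!is.null(col.names) && !identical(colnames(counts), col.names)) {
    stop("colnames of 'counts' do not match the expected colnames")
  }
  if (!is.null(dims) && !identical(dim(counts), as.integer(dims))) {
    stop("dimensions of 'counts' do not match the expected dimensions")
  }
  if (!is.null(total) && sum(counts) != total) {
    stop("total counts do not match the expected total")
  }
  invisible(TRUE)
}

m <- matrix(c(1, 0, 2, 3), nrow = 2,
            dimnames = list(c("g1", "g2"), c("c1", "c2")))
ok <- validate_counts(m, row.names = c("g1", "g2"), dims = c(2, 2), total = 6)
```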

Fail gracefully when unnecessary parameters are provided to celda_C / celda_G

@joshua-d-campbell might be interested in this bug:

I tried to run a celda_C run, providing the same parameters that I'd provide to celda_CG. These parameters get passed to expand.grid, which sets up the parameters for the individual chains to be run. If you pass an unused parameter (in this case, I passed L=10), foreach throws an error:

Error in { : task 1 failed - "unused argument (L = 10)"

The tool shouldn't fail when this happens. It should do one of the following:

  1. Produce a warning about the unused parameter, or
  2. Fail gracefully with an appropriate error message
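
Both options could be handled by comparing the supplied parameter names against the model function's formals before the chains are set up. This is only a sketch: check_model_args and the strict flag are hypothetical, and celda_C_stub is a stand-in whose signature is made up for the example:

```r
# Compare user-supplied parameters with what the model function accepts.
# strict = FALSE warns and drops the extras; strict = TRUE stops with a
# readable message instead of failing inside a worker.
check_model_args <- function(model.fun, args, strict = FALSE) {
  accepted <- names(formals(model.fun))
  if ("..." %in% accepted) return(args)   # the model accepts anything
  unused <- setdiff(names(args), accepted)
  if (length(unused) > 0) {
    msg <- sprintf("Parameter(s) not used by this model: %s",
                   paste(unused, collapse = ", "))
    if (strict) stop(msg, call. = FALSE)
    warning(msg, call. = FALSE)
  }
  args[setdiff(names(args), unused)]
}

# Stand-in for a model function that has no L parameter.
celda_C_stub <- function(counts, K, alpha = 1, beta = 1) NULL

kept <- check_model_args(celda_C_stub, list(K = 5, L = 10))   # warns about L
```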

Extract best model

Extract the best chain from the celda wrapper based on a given set of parameters. E.g.

extractBestModel(celda.wrapper.obj, K=10, L=15)

where celda.wrapper.obj is the output from "celda"

Remove visualizeModelPerformance() and Related

Continuing from #30 , refactor visualize_performance() so it has different implementations by performance metric. There's lots of weird behavior in the function currently to dance around the different methods' means of working.

Class of celda objects

Have the output of each individual celda function return a unique class (e.g. celda_C should return an object with class 'celda_C'). The celda wrapper function should return a list of these celda objects in the 'res.list' variable.

Once a user selects best choice of K/L parameters after looking at the perplexity plots, there should be a getter function to pull out the correct celda object from the larger list. E.g.

selected.model = getBestModel(celda.wrapper.obj, K=10, L=20, chain=1)

would return the chain of the models that had K=10 and L=20. And/or one could do:

selected.model = getBestModel(celda.wrapper.obj, K=10, L=20, best="loglik")

would return the chain with the best log likelihood from the set of chains that matched K=10 and L=20. 'best' could be based on loglik, perplexity, or whatever metric we have available.
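
A rough sketch of such a getter, written as get_best_model to make clear it is not the package's function. It assumes each model in res.list stores K, L, chain and a numeric field for each metric (here loglik); those field names and the maximize flag are assumptions:

```r
# Hypothetical selector: filter by K and L, then pick either a specific
# chain or the chain with the best value of the chosen metric.
get_best_model <- function(celda.obj, K, L, chain = NULL,
                           best = "loglik", maximize = TRUE) {
  is.match <- vapply(celda.obj$res.list,
                     function(m) isTRUE(m$K == K) && isTRUE(m$L == L),
                     logical(1))
  candidates <- celda.obj$res.list[is.match]
  if (length(candidates) == 0) return(NA)
  if (!is.null(chain)) {
    by.chain <- Filter(function(m) isTRUE(m$chain == chain), candidates)
    return(if (length(by.chain) > 0) by.chain[[1]] else NA)
  }
  scores <- vapply(candidates, function(m) m[[best]], numeric(1))
  # Use maximize = FALSE for metrics where lower is better (e.g. perplexity).
  pick <- if (maximize) which.max(scores) else which.min(scores)
  candidates[[pick]]
}

# Toy wrapper object with two chains for K=10, L=20 and one other model.
toy.obj <- list(res.list = list(
  list(K = 10, L = 20, chain = 1, loglik = -120),
  list(K = 10, L = 20, chain = 2, loglik = -100),
  list(K = 5,  L = 20, chain = 1, loglik = -90)))
selected.model <- get_best_model(toy.obj, K = 10, L = 20, best = "loglik")
```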

z/y history is different than returned z/y

The z/y variables get reordered at the end of each model run to something more systematic. However, the z/y history variables, if saved, still have the original z/y designations. E.g. the cells with z equal to 1 in the z output variable will be different from the cells with z equal to 1 in the saved history. We need to go through and rename the entries of the z/y history variables.
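
One way to do the renaming, assuming the final reordering is a one-to-one relabeling and that the assignments both before (z.old) and after (z.new) the reordering are available. The helper relabel_history is hypothetical:

```r
# Build a lookup from old label to new label using the final assignments,
# then apply it to every entry of the history matrix.
relabel_history <- function(history, z.old, z.new) {
  # Labels that appear in the history but not in z.old stay NA, because the
  # final reordering gives no new name for them.
  map <- rep(NA_integer_, max(history, z.old))
  map[z.old] <- z.new
  history[] <- map[history]   # 'history[]' keeps the matrix dimensions
  history
}

# Toy example: the reordering renamed 1 -> 3, 2 -> 1, 3 -> 2.
z.old <- c(1L, 2L, 2L, 3L)
z.new <- c(3L, 1L, 1L, 2L)
history <- cbind(c(1L, 1L, 2L, 3L), z.old)   # last column is the final state
history.new <- relabel_history(history, z.old, z.new)
```

After relabeling, the last column of the history agrees with the returned z.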

Make celda_C model work with celda()

Currently, only the celda_CG model is run properly when using the parallelized master front-end to celda, the celda() function (celda.R). We obviously need to extend this to the additional models as well.

toy dataset

Below is code to generate a small toy dataset that can be used to illustrate how the different models work. Can this be added to the package as attached data?

set.seed(123)
p1.1 = c(0.5, 0.15, 0.1, 0, 0, 0, 0, 0, 0.2, 0.05)
p1.2 = c(0.5, 0.15, 0.1, 0, 0, 0, 0, 0, 0.2, 0.05)
p2.1 = c(0, 0, 0, 0.45, 0.15, 0.15, 0, 0, 0.2, 0.05)
p2.2 = c(0, 0, 0, 0.5, 0.25, 0.25, 0, 0, 0.2, 0.05)
p3 = c(0, 0, 0, 0, 0, 0, 0.1, 0.9, 0, 0)
p4 = (p1.1 + p2.2) /2
r.cells = cbind(rmultinom(1, size=100, prob=p1.1), rmultinom(2, size=125, prob=p1.2), rmultinom(1, size=125, prob=p2.1), rmultinom(2, size=100, prob=p2.2), rmultinom(3, size=100, prob=p3), rmultinom(3, size=100, prob=p4))

Bug in y.split.each

In split_clusters.R on line 207:

pairs = c()

needs to be changed to:

pairs = c(NA, NA, NA)

Otherwise the wrong gene solution gets chosen (off by one index).

Input Validation

Need to have a check at beginning to ensure the rows or columns of the matrix aren't all zeros. We can either throw an error or automatically filter out the offending rows/columns? My thought would be to throw an error. Any other check that should be performed?

Logging

Have output be sent to a log file so it can be viewed later.

fix gamma arg

On line 274 in celda_CG.R in the cCG.calcLL function call, it should be

gamma=gamma

not

gamma=1

ll = cCG.calcLL(K=K, L=L, m.CP.by.S=m.CP.by.S, n.CP.by.TS=n.CP.by.TS, n.by.G=n.by.G, n.by.TS=n.by.TS, nG.by.TS=nG.by.TS, nS=nS, nG=nG, alpha=alpha, beta=beta, delta=delta, gamma=a)

S3-ify visualize performance

Make the newly-introduced visualize_performance() function generic, and have concrete implementations for the different celda model classes.

Make sure to re-write to use celda_list getters where appropriate, especially when pulling out the model performance metrics ( #56 )

Colnames of factorized sample matrix

In both "factorizeMatrix.celda_CG" and "factorizeMatrix.celda_C", the line:

colnames(m.CP.by.S) = colnames(counts)

is incorrect. The colnames should be set to the sample ids, not the colnames of the counts matrix (i.e. cell ids). We could do:

colnames(m.CP.by.S) = unique(sample.label)

ASSUMING that they supplied the same sample label as before and it is in the same order. But that might be risky. We could also have the unique sample ids be returned as a vector in the celda output objects. Then the line would be:

colnames(m.CP.by.S) = celda.obj$sample.ids

or something like that. If we don't do anything, then these lines need to be commented out otherwise it throws an error.

Seeds in parallel chains

Different seeds should be used for different chains with the same set of parameters in the "celda" function. For example, if someone requests 3 chains and the seed is 12345, then the seeds for the chains should be 12345, 12346, 12347 (or something like that). You might also want to add the option for someone to specify the seed for each chain. E.g. if 3 chains are requested, the user can either input 1 number, which is incremented for each chain as in the example above, or input 3 numbers directly.
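
Both input styles could be handled by a small helper like the one below. The name chain_seeds is hypothetical:

```r
# Expand a single base seed into one seed per chain, or accept one
# user-supplied seed per chain as-is.
chain_seeds <- function(seed, nchains) {
  if (length(seed) == 1) return(seed + seq_len(nchains) - 1)
  if (length(seed) == nchains) return(seed)
  stop("'seed' must have length 1 or length 'nchains'")
}

chain_seeds(12345, 3)         # 12345 12346 12347
chain_seeds(c(7, 42, 99), 3)  # used as given
```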

another celda_CG bug

On line 309, the args to split.z at the end should be:

delta=delta, gamma=gamma)

So change "delta=1" to "delta=delta" and add "gamma=gamma"

S3 accessors for celda results

Make getter functions for each celda model that make getting various fields (e.g. cluster assignments and history, complete log likelihood, etc.) uniform and easy-to-use.

Make return values from S3 functions more ergonomic

For example, completeClusterHistory() returns the complete cluster history for each chain concatenated together into a really long table (with one column for each chain). It should instead return a list of matrices.

set seed in celda wrapper

When I tried to set the "base" seed in the celda wrapper, I got the following error:

"formal argument "seed" matched by multiple actual arguments"

We should probably have "seed" be an argument in the "celda" function rather than having it be passed via the "...".
