lucylgao / clusterpval
License: Other
Hi,
I am interested in using this interesting approach for post-clustering inference, but I need to test the difference in means in each dimension separately (for differential expression analysis).
Is this possible/appropriate using this method?
Thanks in advance!
Dear authors,
Thank you for writing this package. I understand the motivation behind it, but owing to my lack of in-depth knowledge in this area, I find it hard to see how to include this package in my existing scRNA-seq analysis workflow.
The vignette shows how to test for the difference between two clusters. But what should I do if I have 10 clusters? Does this mean I should perform pairwise comparisons of all 10 clusters (10 choose 2 = 45 tests)?
And if a p-value is above a certain threshold, can I use it to justify merging that pair of clusters?
Thanks for your insights.
Best,
Mikhael
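(For context, a full pairwise sweep over K = 10 clusters could be sketched as below; `X`, `cluster`, and `function_cluster` are placeholders for the data matrix, cluster labels, and clustering function, and the commented test_clusters_approx call mirrors the signature used later in this thread.)

```r
# Sketch: enumerate all pairwise cluster comparisons for K = 10 clusters.
K <- 10
pairs <- combn(K, 2)   # 2 x 45 matrix; choose(10, 2) = 45 pairs
ncol(pairs)            # 45
# Each column gives one (k1, k2) pair to pass to a test such as
# test_clusters_approx(X, k1 = pairs[1, i], k2 = pairs[2, i],
#                      cl = cluster, cl_fun = function_cluster)
```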
Like others who have opened issues, I'm interested in running this neat method to look at the difference of means in my clusters from single-cell data. (Really I would like to adjust p-values for differential expression analysis, similar to #3 and #4, but as I understand it that extension is not possible right now.) However, running test_clusters_approx is painfully slow, as in #2. I took a look at the source code, and it seems the function could easily be sped up by running the piece of the for loop that calls cl_fun in parallel (using future or any other parallelization package):
From
for(j in 1:ndraws) {
  if(phi[j] < 0) next
  # Compute perturbed data set
  Xphi <- X
  Xphi[cl == k1, ] <- t(orig_k1 + (phi[j] - stat)*k1_constant)
  Xphi[cl == k2, ] <- t(orig_k2 + (phi[j] - stat)*k2_constant)
  # Recluster the perturbed data set
  cl_Xphi <- cl_fun(Xphi)
  if(preserve_cl(cl, cl_Xphi, k1, k2)) {
    log_survives[j] <- -(phi[j]/scale_factor)^2/2 + (q-1)*log(phi[j]/scale_factor) - (q/2 - 1)*log(2) -
      log(gamma(q/2)) - log(scale_factor) -
      stats::dnorm(phi[j], mean=stat, sd=scale_factor, log=TRUE)
  }
}
to
log_survives <- unlist(future.apply::future_lapply(1:ndraws, function(j) {
  # `next` is not valid inside a function body, so return values instead
  if(phi[j] < 0) return(NA)
  # Compute perturbed data set
  Xphi <- X
  Xphi[cl == k1, ] <- t(orig_k1 + (phi[j] - stat)*k1_constant)
  Xphi[cl == k2, ] <- t(orig_k2 + (phi[j] - stat)*k2_constant)
  # Recluster the perturbed data set
  cl_Xphi <- cl_fun(Xphi)
  if(preserve_cl(cl, cl_Xphi, k1, k2)) {
    -(phi[j]/scale_factor)^2/2 + (q-1)*log(phi[j]/scale_factor) - (q/2 - 1)*log(2) -
      log(gamma(q/2)) - log(scale_factor) -
      stats::dnorm(phi[j], mean=stat, sd=scale_factor, log=TRUE)
  } else {
    NA  # draw does not preserve the clustering
  }
}, future.seed = TRUE))
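As a fallback for setups without the future framework, base R's parallel package offers the same lapply-style pattern. A minimal sketch (with a stand-in worker function, not the actual clusterpval internals):

```r
library(parallel)
# Minimal sketch of the lapply-to-parallel pattern, using base R's
# parallel package instead of future.
slow_step <- function(j) j^2            # stand-in for the per-draw work
results <- mclapply(1:4, slow_step, mc.cores = 1)  # mc.cores = 1 also runs on Windows
unlist(results)                         # same result as lapply(1:4, slow_step)
```

On Unix-alikes, raising mc.cores forks workers across cores; on Windows, parLapply with a PSOCK cluster plays the same role.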
Would you be willing to implement this? I am also happy to make the modifications and submit a pull request for it.
Thank you for your work and for making this terrific R package!
I noticed when trying to run test_clusters_approx on the Zheng dataset from your clusterpval-experiments repo that test_complete_hier_clusters_approx and test_clusters_approx return NaN for pval and stderr on datasets with many columns. I think the gamma(q/2) on line 278 (in test_complete_hier_clusters_approx) and line 426 (in test_clusters_approx) will cause a float overflow if q is greater than 343 (i.e. if the input data has 344 or more columns). This overflow makes the log_survives value always infinite, which I believe is what makes pval and stderr come out as NaN.
Because the gamma(q/2) value is immediately logged, and q/2 is always a positive integer or half-integer (so the gamma function's value at each possible q/2 is known), we can compute the log(gamma(q/2)) value ahead of time, taking advantage of the facts that log(a*b) = log(a) + log(b) and log(a/b) = log(a) - log(b).
I wrote a few helper functions that calculate the log(gamma(q/2)) values directly without overflowing a float (I will submit a pull request and link this issue). With these helper functions, both functions return finite values for pval and stderr. Please let me know if I have misunderstood anything, and thank you very much again for this R package!
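For what it's worth, R's built-in lgamma() evaluates log(gamma(x)) directly on the log scale and does not overflow, so it may cover the same ground as the helper functions described above (a sketch of the overflow, not the actual patch):

```r
q <- 400                 # e.g. input data with 400 columns
log(gamma(q / 2))        # Inf: gamma(200) overflows a double
lgamma(q / 2)            # finite: evaluated directly on the log scale
```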
Hello,
First off: thanks for the great paper, talks and package. The concept of computing p-values on clusters has been very useful to me.
When using large datasets, however, I sometimes notice that the computation of p-values is very time-consuming. I therefore profiled where I lose the most time, and found stats::dist to be the main offender when using
test_hier_clusters_exact(X=X, link="average")
So I played around with alternative ways to yield the same result as stats::dist but faster. I stumbled across a post by Florian Privé, who deserves full credit for the following implementation, which I adapted to return exactly the same data as stats::dist for consistency.
prive_dist = function(X){
  # Credit: Florian Privé (https://stackoverflow.com/a/59687204)
  suppressWarnings(as.dist(sqrt(outer(rowSums(X^2), rowSums(X^2), '+') - tcrossprod(X, 2 * X))))
}
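A quick consistency check on random data (prive_dist is repeated here so the snippet is self-contained; the tolerance allows for floating-point differences between the two formulas):

```r
prive_dist <- function(X) {
  # Credit: Florian Privé (https://stackoverflow.com/a/59687204)
  suppressWarnings(as.dist(sqrt(outer(rowSums(X^2), rowSums(X^2), '+') - tcrossprod(X, 2 * X))))
}
set.seed(1)
X <- matrix(rnorm(50), nrow = 10)
# Both implementations should agree up to floating-point error.
all.equal(as.matrix(prive_dist(X)), as.matrix(stats::dist(X)), tolerance = 1e-6)
```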
For large datasets, Privé's implementation is much faster, but it performs worse on smaller datasets. See the following:
library(microbenchmark) # used for benchmarking
# which dataset sizes to test for distance computation
N = 10^(2:6)
# wrapper to benchmark a dataset of a given size
times = function(N){
  print(N)
  X = matrix(runif(N), nrow=sqrt(N))
  timings = microbenchmark(stats::dist(X),
                           prive_dist(X),
                           times=10)
  print(timings, unit="relative")
}
sapply(N, times)
To which the output looks like this
[1] 100
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 1.000000 1.000000 1.000000 1.000000 1.00000 1.00000 10 a
prive_dist(X) 6.718533 6.702482 6.758787 5.977713 5.77667 8.49686 10 b
[1] 1000
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 10 a
prive_dist(X) 1.366609 1.36663 1.411547 1.397399 1.460971 1.445506 10 b
[1] 10000
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 3.707272 7.008658 4.558213 6.926236 6.633612 1.34467 10 b
prive_dist(X) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10 a
[1] 1e+05
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 16.49259 15.07916 16.4001 13.62932 19.25519 14.8057 10 b
prive_dist(X) 1.00000 1.00000 1.0000 1.00000 1.00000 1.0000 10 a
[1] 1e+06
Unit: relative
expr min lq mean median uq max neval cld
stats::dist(X) 73.23724 55.5484 54.38444 54.05382 50.83549 46.74125 10 b
prive_dist(X) 1.00000 1.0000 1.00000 1.00000 1.00000 1.00000 10 a
Arguably, the worse performance on small datasets is negligible, since those still run fast in absolute terms, whereas the gain on larger datasets is massive.
One might therefore implement it so that the best-performing variant is always used, for example:
custom_dist = function(X){
  if(length(X) < 1000){
    return(stats::dist(X))
  } else {
    return(prive_dist(X))
  }
}
If you think this is a good idea I could submit a PR, probably next week.
Thanks again for the package and all your work! Best wishes,
Jesko
Hi Lucy,
Thank you so much for developing this tool.
I currently identify clusters through Seurat and I would like to use your tools to test the difference in means of two clusters. I think I might use the test_clusters_approx function for this:
X1 <- as.matrix(combined@assays[["RNA"]]@counts)
cluster <- combined@meta.data[["seurat_clusters"]]
function_cluster <- function(X) {
  combined <- CreateSeuratObject(X)
  combined <- NormalizeData(combined, normalization.method = "LogNormalize", scale.factor = 10000)
  combined <- FindVariableFeatures(combined, selection.method = "vst", nfeatures = 2000)
  all.genes <- rownames(combined)
  variable_gene <- combined@assays[["RNA"]]@var.features
  combined <- ScaleData(combined, features = all.genes)
  combined <- RunPCA(combined, features = all.genes)
  combined <- RunUMAP(combined, features = all.genes)
  combined <- RunTSNE(combined, features = all.genes)
  combined <- FindNeighbors(combined, features = all.genes)
  combined <- FindClusters(combined, resolution = 0.5)
  return(combined@meta.data[["seurat_clusters"]])
}
cluster <- function_cluster(X1)
test_clusters_approx(X1, k1 = 1, k2 = 2, cl = cluster, cl_fun = function_cluster, ndraws = 10000)
combined is the Seurat object. When I run the final step, test_clusters_approx keeps running the clustering function over and over again...
Do you have any solutions for that?
Thank you!
Hi Lucy,
I am interested in using your statistical model to test gene expression differences between clusters.
My input is a matrix where columns correspond to cells and rows to genes. Usually I would apply a row-wise Wilcoxon test to check whether the mean expression of one group of cells (belonging to a cluster) is significantly higher or lower than in the other group. This way I obtain a list of genes significantly up- or down-regulated between the two groups.
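The row-wise procedure described here can be sketched as follows (toy data; genes in rows, cells in columns, as in the question):

```r
set.seed(42)
mat <- matrix(rnorm(20 * 8), nrow = 20)   # 20 genes x 8 cells
group <- rep(c(1, 2), each = 4)           # cluster label for each cell
# One Wilcoxon rank-sum test per gene, comparing the two groups of cells.
pvals <- apply(mat, 1, function(g)
  stats::wilcox.test(g[group == 1], g[group == 2])$p.value)
length(pvals)   # one p-value per gene
```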
I have been testing your clusterpval tool but I am not sure how to apply it in cases like this. Can you help me?
Best,
Coral