danko-lab / bayesprism

A Fully Bayesian Inference of Tumor Microenvironment composition and gene expression

R 3.15% HTML 96.85%

bayesprism's Issues

cell states between map and phi_cellState do not match

Dear,

I am using BayesPrism to perform deconvolution by using two different levels of annotation (cell types and cell subtypes).
However, I get the following error, whereas using either annotation level on its own works.

Error in validObject(.Object) :
invalid class “prism” object: cell states between map and phi_cellState do not match

The table of cell type labels and cell state labels looks like this.

Thanks in advance.

Kind regards,
Seoyeon

Estimating the proportion of 'unknown' cells.

Hi,

I was wondering if there is a way to estimate the proportion of cells that aren't present in the reference cells? I've deconvoluted bulk RNA-seq data for a tumour with a reference atlas from the same healthy tissue. There is no single cell data on this type of cancer, so I was wondering if I could predict gene expression of cancer cells from BayesPrism output? For example, EPIC gives a proportion of 'otherCells' in its output, so I was wondering if I could get something like this from BayesPrism.

Many thanks,
Alina

The problem of missing mouse genome files

Hi,

When I set the parameter 'species' to 'mm', the function select.gene.type() raises an error in the step that filters out outlier genes. This might be due to the lack of a gene annotation file for mouse, like the "gencode.v22.broad.category.txt" file in the extdata folder.

By the way, cleanup.genes() only supports "MALAT1". If the species is mouse and you change MALAT1 to Malat1, you will also get an error.

Thanks in advance !

Can BayesPrism be used for microarray matrices?

Many of the datasets I have studied are microarray datasets. Can BayesPrism be used for microarray datasets? Thank you very much.

Typo in get.exp.stat arguments

Hi devs,

I noticed there is a pretty glaring typo in the arguments of the get.exp.stat function:

psuedo.count
A numeric value used for log2 transformation. =0.1 for 10x data, =10 for smart-seq. Default=0.1.

It seems like it should be named "pseudo.count" instead, and running the function with the correct spelling raises an unused argument error.
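
A minimal sketch of why the correctly spelled name is rejected. It uses a stand-in function (`get_stat_demo`, hypothetical, not the BayesPrism function) whose formal argument carries the same misspelling. R matches named arguments against the formal argument names, and a supplied name that is not a prefix of any formal is reported as unused:

```r
# Hypothetical stand-in; only the misspelled argument name mirrors the report.
get_stat_demo <- function(x, psuedo.count = 0.1) {
  log2(x + psuedo.count)
}

# The misspelled name matches the formal argument, so this call works.
ok <- get_stat_demo(1, psuedo.count = 1)

# The correctly spelled name matches no formal argument, so R stops with an
# "unused argument" error, which tryCatch() captures as a string here.
err <- tryCatch(
  get_stat_demo(1, pseudo.count = 1),
  error = function(e) conditionMessage(e)
)

print(ok)
print(err)
```

Until the argument is renamed in the package, calling the function with the spelling that appears in its documentation is the workaround this sketch suggests.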

"Killed singularity $*" error running new.prism()

Hello!
I am trying to run BayesPrism on my bulk RNA-seq data (sample size ~200) with single-cell data (~70,000 cells). I am working with non-malignant samples. Here is the issue I am facing. Please let me know how to solve it.

For my single cell data only cell types information is available. I am not sure how to subtype these cells to come up with the cell states. For now I am using the cell types as both cell.type.labels and cell.state.labels as it is mandatory to provide both. Let me know if it is ok.
The initial data preparation went very well for my data. I just converted the single cell data sparse matrix to dense matrix and I selected only protein coding genes for Prism construction.

My code is here:

myPrism <- new.prism(
  reference=sc.dat.filtered.pc,
  mixture=bk.dat,
  input.type="count.matrix",
  cell.type.labels = cell.type.labels,
  cell.state.labels = cell.type.labels,
  key="NULL",
  outlier.cut=0.01,
  outlier.fraction=0.1,
)

#I am running this job in a cluster with 120GB memory. This job runs for few minutes, provides the cell state information and terminates with this error.

I am getting following error
#> number of cells in each cell state
#> cell.state.labels
#> PJ017-tumor-6 PJ032-tumor-5 myeloid_8 PJ032-tumor-4
#> 22 41 49

/sw/Containers/singularity/bin/run_singularity: line 28: 42360 Killed singularity $*
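
A bare "Killed" from the shell often means the process was terminated from outside, for example by the out-of-memory killer, although the log above does not confirm that. As a rough sanity check, the size of the dense matrix mentioned in the report can be estimated in base R. The gene count below is an assumption, since the report does not state it:

```r
# Approximate size of a dense numeric matrix: 8 bytes per double value.
dense_gib <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}

n_cells <- 70000   # from the report
n_genes <- 20000   # assumption: the gene count is not given in the report

size_gib <- dense_gib(n_cells, n_genes)
print(round(size_gib, 1))
```

A single copy of the matrix is only part of the footprint: intermediate copies made during filtering, normalization, and parallel execution can multiply it several times, so keeping the reference sparse where possible or reducing the number of cells may help.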

Using cross-species reference and combining references

Hi!

I have two questions about the sc reference:

Is it ok to use data from different species, i.e. mouse sc data to deconvolute human bulk?
Is it ok to mix sc data from different labs/publications etc in the reference or does this introduce batch effects that can not be accounted for?

Thank you!

What makes a cell type better predicted ?

Hello,
This is an open thread to share experience. I would like to know which cell types are inferred best, and what that depends on. Let's say you have reference single-cell data and test bulk data. Do you think that the most abundant cell types in the reference or in the test data are better inferred? In my experience, not necessarily.
Thanks !

Key in new.prism()

I had a seurat object for B cells which I was using as a reference for my mixture (bulk-seq).

In the step:

myPrism <- new.prism(
  reference=sc.dat.filtered,
  mixture=x,
  input.type="count.matrix",
  cell.type.labels = cell.type.labels,
  cell.state.labels = cell.type.labels_broad,
  key="B cell memory",
  outlier.cut=0.01,
  outlier.fraction=0.1,
)

I wanted to know how to select the key in this function, as there is no tumour. I tried using NA, but it returned: Error in validObject(.Object) : invalid class “prism” object: invalid key

I then tried using the other cell types present and found that only B cell memory was valid as a key. In my reference, B cell memory was not the most abundant cell type, so I am not sure how it would be chosen as the key.

Could you please provide me with more information on how keys are selected?

Many thanks!
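
The values tried for `key` in this and other reports (NA, the string "NULL", a cell type name) are different objects in R. The sketch below only illustrates that difference in base R. The helper `key_looks_valid` is hypothetical, and its rule (the key is either NULL or a single label present in cell.type.labels) is an assumption about what new.prism() expects, so please check it against the package documentation:

```r
# Three values that look similar in code but are distinct R objects.
candidates <- list(null_value = NULL, na_value = NA, null_string = "NULL")
print(sapply(candidates, is.null))

# Hypothetical helper: assumes a key is acceptable when it is NULL or a
# single character label that occurs in cell.type.labels.
key_looks_valid <- function(key, cell.type.labels) {
  is.null(key) ||
    (is.character(key) && length(key) == 1 && key %in% cell.type.labels)
}

cell.type.labels <- c("B cell memory", "B cell naive", "B cell memory")

print(key_looks_valid(NULL, cell.type.labels))
print(key_looks_valid(NA, cell.type.labels))
print(key_looks_valid("NULL", cell.type.labels))
print(key_looks_valid("B cell memory", cell.type.labels))
```

Under this assumed rule, a key that is rejected although it names a real population may indicate that it appears in the cell state labels rather than in the cell type labels.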

TPM as input for both scRNA and bulk data

Hi, thank you for developing this excellent tool.

Although raw counts are recommended data type as input, I noticed that TPM data was used in the reference article as well:

"For instances where only TPM normalized data were available the scRNA-seq reference for HNSCC (scHNSCC), we summed TPM normalized reads."

Is it OK if I use TPM data as input for both scRNA and bulk data? If feasible, does the TPM data need to be log transformed?

Many thanks,
Todd

embedding learning

Dear Danko,
I sincerely admire your research. I have a question: how to improve the speed of embedding learning? Is it through scRNA data reduction?
Thank you very much for your attention and consideration.

one or more cell states belong to multiple cell types

When I use

new.prism(
  reference=sc.dat.filtered.pc,
  mixture=bk.dat,
  input.type="count.matrix",
  cell.type.labels = cell.type.labels,
  cell.state.labels = cell.state.labels,
  key="tumor",
  outlier.cut=0.01,
  outlier.fraction=0.1,
)

I got an error:

one or more cell states belong to multiple cell types

The cell.state.labels here are the patient names. Did I make a mistake?

Can BayesPrism be used for non-tumor tissue RNA-seq?

I want to deconvolute non-tumor tissue. My cell.state.labels is the same as my cell.type.labels, and the "key" of new.prism is set to NULL. Is that correct?

Using cell subtypes (deeper annotation) and cell types

Hi,

I am interested in deconvolution of PBMC and Whole blood samples using BayesPrism.
I have the data with two different levels of annotation (8 cell types and 32 cell subtypes).
However, I got an error with the following:

"one or more cell states belong to multiple cell types".

Could you give me an example of names for "cell type labels" and "cell state labels"?
Or is there a way to circumvent this error?

Thank you in advance.

Kind regards,
Seoyeon
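
The error text quoted in these reports suggests that every cell state name must map to exactly one cell type. A minimal base R sketch of how to check that before calling new.prism(), using toy labels in which patient IDs serve as cell states (all values below are made up for illustration):

```r
# Toy labels: patient IDs used as cell states span several cell types.
cell.type.labels  <- c("tumor", "tumor", "myeloid", "myeloid", "tcell")
cell.state.labels <- c("P1",    "P2",    "P1",      "P2",      "P1")

# Count the distinct cell types observed for each cell state.
types_per_state <- tapply(cell.type.labels, cell.state.labels,
                          function(x) length(unique(x)))
ambiguous <- names(types_per_state)[types_per_state > 1]
print(ambiguous)

# One possible fix: prefix each state with its cell type, so that every
# state name belongs to exactly one type.
fixed.state.labels <- paste(cell.type.labels, cell.state.labels, sep = "-")
types_per_fixed <- tapply(cell.type.labels, fixed.state.labels,
                          function(x) length(unique(x)))
print(all(types_per_fixed == 1))
```

Whether patient IDs are meaningful cell states for non-malignant populations is a separate modelling question. The prefixing step only makes the names unambiguous.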

Missing genes in the tumor-specific gene expression count matrix Z

I have tried using get.exp to get a tumor-specific gene expression count matrix Z, but I noticed that the Z only contained the expression profile of 4806 genes, which was much fewer than both the bulk matrix and the scRNA-Seq reference. I wonder if this is normal or if there is something I need to check for troubleshooting.

Potential for scATAC-seq?

Hello,

I was wondering if this could potentially be used for scATAC-seq reference peaks to deconvolute bulk ATAC-seq expression?
Specifically, I was wondering what constraints the input matrix must satisfy for this to work. I'm assuming I need to convert the peaks into an integer matrix.

Best,
Chang

About the acceleration of the computation

Will there be a Python version in the future? It may speed things up. Right now it takes me at least half an hour to run with 300 patients.

proper use of get.exp() function

Dear Sir or Madam,

I am having trouble getting the matrix of cell-type expression from the BayesPrism output. I am trying to use:

get.exp(bp.res, state.or.type="type")

But this unfortunately only gives me back a flat vector, and despite my efforts to build a 3D array from it, I cannot find the right dimension order. How can I get the 3D array we are supposed to get from this (e.g. high-resolution cell-type expression)?

Thanks in advance ! Regards,

Alexandre Coudray
PhD student in Trono/La Manno group, EPFL Switzerland

Minimum sample size requirement

Hello,
I would like to know what the minimum requirement in sample size is for bulk RNAseq in order to trust the inferred cell fraction. Do we need at least 10, 20 samples ?
Thanks !

Installation problem with error message ‘Matrix (>= 1.5.0)’ is not available

Hello !

I cannot install BayesPrism on R4.2.2 with the following error:

Downloading GitHub repo Danko-Lab/BayesPrism@HEAD

formatR (NA → 1.12 ) [CRAN]
futile.op… (NA → 1.0.1 ) [CRAN]
lambda.r (NA → 1.2.4 ) [CRAN]
snow (NA → 0.4-4 ) [CRAN]
futile.lo… (NA → 1.4.3 ) [CRAN]
RcppHNSW (NA → 0.3.0 ) [CRAN]
BiocNeigh… (NA → 1.16.0 ) [CRAN]
BiocParallel (NA → 1.32.6 ) [CRAN]
beachmat (NA → 2.14.2 ) [CRAN]
rsvd (NA → 1.0.5 ) [CRAN]
ScaledMatrix (NA → 1.6.0 ) [CRAN]
sparseMat… (NA → 1.10.0 ) [CRAN]
locfit (NA → 1.5-9.5 ) [CRAN]
DelayedMa… (NA → 1.20.0 ) [CRAN]
registry (NA → 0.5-1 ) [CRAN]
metapod (NA → 1.6.0 ) [CRAN]
bluster (NA → 1.8.0 ) [CRAN]
BiocSingular (NA → 1.14.0 ) [CRAN]
statmod (NA → 1.4.36 ) [CRAN]
edgeR (NA → 3.40.2 ) [CRAN]
scuttle (NA → 1.8.4 ) [CRAN]
gridBase (NA → 0.4-7 ) [CRAN]
rngtools (NA → 1.5.2 ) [CRAN]
pkgmaker (NA → 0.32.2 ) [CRAN]
scran (NA → 1.26.2 ) [CRAN]
NMF (NA → 0.24.0 ) [CRAN]
snowfall (NA → 1.84-6.1) [CRAN]
Installing 27 packages: formatR, futile.options, lambda.r, snow, futile.logger, RcppHNSW, BiocNeighbors, BiocParallel, beachmat, rsvd, ScaledMatrix, sparseMatrixStats, locfit, DelayedMatrixStats, registry, metapod, bluster, BiocSingular, statmod, edgeR, scuttle, gridBase, rngtools, pkgmaker, scran, NMF, snowfall

Warning message:
“dependency ‘Matrix (>= 1.5.0)’ is not available”
Warning message in i.p(…):
“installation of package ‘DelayedMatrixStats’ had non-zero exit status”
Warning message in i.p(…):
“installation of package ‘scuttle’ had non-zero exit status”
Warning message in i.p(…):
“installation of package ‘scran’ had non-zero exit status”

checking for file ‘/tmp/Rtmp6tQ3CV/remotes16b72b1d6904/Danko-Lab-BayesPrism-1ad3e82/BayesPrism/DESCRIPTION’ … OK
preparing ‘BayesPrism’:
checking DESCRIPTION meta-information … OK
checking for LF line-endings in source and make files and shell scripts
checking for empty or unneeded directories
building ‘BayesPrism_2.0.tar.gz’
Warning message in i.p(…):
“installation of package ‘/tmp/Rtmp6tQ3CV/file16b7302681db/BayesPrism_2.0.tar.gz’ had non-zero exit status”

Any clue what it could be?

Thanks a lot !

A.Coudray
PhD student
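
The log shows the dependency chain failing at 'Matrix (>= 1.5.0)'. A small base R sketch for comparing an installed version against that requirement. The installed value below is only an example string; on a real system it could be read with as.character(packageVersion("Matrix")):

```r
required  <- "1.5.0"   # minimum version named in the warning
installed <- "1.4-1"   # example value only; substitute the real installed version

# compareVersion() returns -1, 0 or 1 when the first version is older than,
# equal to, or newer than the second.
cmp <- utils::compareVersion(installed, required)

if (cmp < 0) {
  message("Matrix ", installed, " is older than the required ", required)
} else {
  message("Matrix ", installed, " satisfies the requirement")
}
```

If the installed version is too old, updating Matrix before installing BayesPrism is the obvious thing to try. Why the configured repository did not offer a new enough version cannot be determined from the log.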

Code availability for downstream analyses

Hello,

I was wondering where I might be able to find the code used to generate figures from the BayesPrism publication related to downstream analyses such as GSEA-based interpretation of latent embeddings (e.g. Figure 4).

Thank you,
Daniel

Question about input data in BayesPrism

Hello. Thank you for the amazing tool BayesPrism. Suppose I have bulk RNA-seq data under conditions A, B, C, and D, but single-cell sequencing data only under condition A. Can I still feed BayesPrism all the data to infer the cell type composition of the bulk data under conditions B, C, and D?

The proportion of cell population in input data

Hi, very useful tool!
I wonder if the ratio of cell types in the input data is needed to maintain their true biological fraction, or just a certain number of cells is enough?

Where is the bk.data in tutorial.dat from?

I tried TCGA-GBM and didn't find any overlapping samples. Is it maybe from another dataset?

BayesPrism tutorial

Hi,

I would like to learn to run BayesPrism on some bulk RNA-seq data as part of a research project. However, whenever I try to open the tutorial_deconvolution.html and tutorial_embedding_learning.html files (by clicking on View Raw), I get a lot of text, which I do not think is how the tutorial is intended to look.

Error in get.exp.stat

Dear Sir or Madam,

Thanks for developing this.
When I run the code:

diff.exp.stat <- get.exp.stat(sc.dat=sc.dat[,colSums(sc.dat>0)>3],
                              cell.type.labels=cell.type.labels,
                              cell.state.labels=cell.state.labels,
                              pseudo.count=0.1,
                              cell.count.cutoff=50,
                              n.cores=10 #number of threads
                              )

I got the following error:

Error: logical subscript contains NAs

How can I solve it? Thank you!
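
The message indicates that some logical index contained NA, but the report does not show which one. The following is only a diagnostic sketch on toy data (all values are made up) that counts missing values in the inputs used for indexing and drops cells with missing labels:

```r
# Toy inputs standing in for the real data (cells in rows, genes in columns).
sc.dat <- matrix(c(1, 0, 2, 0, 0, 3, 4, 1), nrow = 4,
                 dimnames = list(paste0("cell", 1:4), c("geneA", "geneB")))
cell.type.labels  <- c("T", "B", NA, "T")
cell.state.labels <- c("T-1", "B-1", "B-1", NA)

# Count missing values in each input that is later used for subsetting.
na_counts <- c(
  counts = sum(is.na(sc.dat)),
  types  = sum(is.na(cell.type.labels)),
  states = sum(is.na(cell.state.labels))
)
print(na_counts)

# Keep only cells whose type and state labels are both present.
keep <- !is.na(cell.type.labels) & !is.na(cell.state.labels)
sc.dat.clean <- sc.dat[keep, , drop = FALSE]
type.clean   <- cell.type.labels[keep]
state.clean  <- cell.state.labels[keep]
print(dim(sc.dat.clean))
```

If all three counts are zero on the real data, the NA probably arises inside the function and a traceback() right after the error would be the next thing to share.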

Error in checkForRemoteErrors(val) : one node produced an error: subscript out of bounds

I'm using bulk RNA-seq data and scRNA-seq data with Ensembl IDs, and a single cell type, Fibroblast.

> dim(bk.dat)
[1]   546 19988

> dim(sc.dat)
[1]  1159 19828

> sort(table(cell.type.labels))
Fibroblast 
      1159 
> sort(table(cell.state.labels))
cell.state.labels
fb-4 fb-1 fb-3 fb-2 fb-0 
  83  148  177  237  514

Using the above data, I constructed the prism object as below:

> myPrism <- new.prism(
+   reference=sc.dat.filtered.pc,
+   mixture=bk.dat,
+   input.type="count.matrix",
+   cell.type.labels = cell.type.labels,
+   cell.state.labels = cell.state.labels,
+   key="Fibroblast",
+   outlier.cut=0.01,
+     outlier.fraction=0.1,
+ )
number of cells in each cell state
cell.state.labels
fb-4 fb-1 fb-3 fb-2 fb-0
  83  148  177  237  514
Number of outlier genes filtered from mixture = 9
Aligning reference and mixture...
Nornalizing reference...
Warning message:
In validityMethod(object) : Warning: pseudo.min does not match min(phi)

Then I ran BayesPrism as below:

> bp.res <- run.prism(prism = myPrism, n.cores=50)
Run Gibbs sampling...
Current time:  2022-12-02 21:03:14
Estimated time to complete:  1hrs 2mins
Estimated finishing time:  2022-12-02 22:04:15
Start run...
Explicit sfStop() is missing: stop now.

Stopping cluster

snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 50 CPUs.

Stopping cluster

Update the reference matrix ...
snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 50 CPUs.

Error in checkForRemoteErrors(val) :
  one node produced an error: subscript out of bounds

So, first I saw Explicit sfStop() is missing: stop now. then at the end of the run I saw the following error:

Error in checkForRemoteErrors(val) :
  one node produced an error: subscript out of bounds

Could you please help, how to resolve this error? thank you.

Deconvolution of cancer subtypes

Hi guys,

I am using BayesPrism to deconvolve a breast cancer dataset, in which there are 5 sub-types of cancer epithelial cells.

I'm more interested in the cancer sub-type populations (cell.states) than the cancer population itself (cell.types).

Just wondering if there's any way I can get the updated cell.states fractions? Or perhaps an option to specify multiple keys for cancer populations in new.prism()?

Cheers,
Khoa.

Whether I can use single nuclei RNA-seq data as input

I am wondering whether I can use single nuclei RNA-seq data as input. Compared with scRNA-seq, snRNA-seq tends to capture the mRNA inside the nuclei. I think this might introduce some potential biases.

Question on gene expression patterns correlated with TME cell types.

I have read the tutorial of the package, and it mentioned that "Correlating Z (after normalization using vst or from bp.res@reference.update@psi_mal) with theta to understand how gene expression of each gene (in malignant cells) correlates with the cell type fraction of non-malignant cells in tumor microenvironment". So should I use the two filters mentioned in the original papers, or use the genes from Z directly?

Errors when running BayesPrism with gene symbols.

Hi,

I was running BayesPrism using gene symbols and met this error when running functions plot.scRNA.outlier and plot.bulk.outlier:

" Error in input.genes.short %in% gene.df[gene.df[, 1] == gene.group.i, :
object 'input.genes.short' not found "

I found it's due to the function "assign.category" in "process_input.R" and made a very slight change to make it work:

#detect if EMSEMBLE ID (starts with ENS) or gene symbol is used
if( sum(substr(input.genes,1,3)=="ENS")> length(input.genes)*0.8 ){
  cat("EMSEMBLE IDs detected.\n")
  #strip the version suffix after the dot; the dot must be escaped as "\\." in R
  input.genes.short <- unlist(lapply(input.genes, function(gene.id) strsplit(gene.id,split="\\.")[[1]][1]))
  gene.df <- gene.list[,c(1,2)]
  gene.group.matrix <- do.call(cbind.data.frame, lapply(unique(gene.df[,1]),
    function(gene.group.i) input.genes.short %in% gene.df[gene.df[,1]== gene.group.i,2]))
}
else{
  cat("Gene symbols detected. Recommend to use EMSEMBLE IDs for more unique mapping.\n")
  gene.df <- gene.list[,c(1,3)]
  #use input.genes directly, since input.genes.short is not defined in this branch
  gene.group.matrix <- do.call(cbind.data.frame, lapply(unique(gene.df[,1]),
    function(gene.group.i) input.genes %in% gene.df[gene.df[,1]== gene.group.i,2]))
}

Hope it helps : )

Thanks!
Shuai
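
The two steps in the snippet above (detecting Ensembl-style IDs by their "ENS" prefix and stripping the version suffix) can be tried in isolation. This is a standalone base R sketch with example identifiers, not the package's code. The 0.8 threshold is taken from the snippet:

```r
# Example identifiers: a versioned Ensembl ID, an unversioned one, and a symbol.
ids <- c("ENSG00000141510.17", "ENSG00000012048", "TP53")

# Fraction of names that start with "ENS"; the snippet above treats the input
# as Ensembl IDs when this exceeds 0.8.
ens_fraction <- mean(substr(ids, 1, 3) == "ENS")
print(ens_fraction > 0.8)

# Remove everything from the first dot onwards. The dot is escaped as "\\."
# because an unescaped "." matches any character in a regular expression.
ids_short <- sub("\\..*$", "", ids)
print(ids_short)
```

Names without a dot pass through unchanged, so the stripping step is safe to apply to mixed input.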

Would the package update cell state estimates?

It is important to use the updated estimates of cell type compositions in the output for better results. However, would the package also provide updated estimates of cell state composition ?

Thanks!

License

Hello,

I was wondering what the software license would be for BayesPrism?

Best,
Chang

Can I use metacell data as a reference?

Thank you for your excellent algorithm,
I have a question: if I use metacell data as a reference, where each metacell is the sum of dozens of single cells, will the deconvolution result be accurate?
Best wishes!
J Hovelly.

BayesPrism on a HPC server error

Hello!
After I launched the experiment, it reported an error message (I have no idea what "Error: Stop [err5]" means):
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 240 100 120 100 120 1333 1333 --:--:-- --:--:-- --:--:-- 2637
Loading required package: BayesPrism
Loading required package: snowfall
Loading required package: snow
Loading required package: NMF
Loading required package: pkgmaker
Loading required package: registry
Loading required package: rngtools
Loading required package: cluster
NMF - BioConductor layer [OK] | Shared memory capabilities [OK] | Cores 127/128
Warning messages:
1: replacing previous import 'gplots::lowess' by 'stats::lowess' when loading 'BayesPrism'
2: replacing previous import 'BiocParallel::register' by 'NMF::register' when loading 'BayesPrism'
Error: Stop [err5]
Execution halted
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 242 100 120 100 122 2033 2067 --:--:-- --:--:-- --:--:-- 4172

Minimum number of cells per cell state

Hi Tinyi,

While running new.prism() function, there is an error : "recommend to have sufficient number of cells in each cell state
Error in new.prism(reference = sc.dat.filtered.pc, mixture = bk.dat, input.type = "count.matrix", :
Error: one or more cell states belong to multiple cell types!"

I have 2 questions:

What is the minimum number of cells required per cell state?
Wouldn't cell states be part of multiple cell types?

error: could not find function "Rcgminu"

Hello,

I am running a deconvolution on R/4.2.0 using BayesPrism. I have followed your tutorial without error until the run.prism() step:

> bp.res <- run.prism(prism = myPrism, n.cores=1)
Run Gibbs sampling... 
Current time:  2022-09-26 08:22:58 
Estimated time to complete:  8hrs 4mins 
Estimated finishing time:  2022-09-26 16:26:16 
Start run... 
1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100  101  102  103  104  105  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128  129  130  131  132  133  134  
Update the reference matrix ... 
Explicit sfStop() is missing: stop now.

Stopping cluster

snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 1 CPUs.

Error in checkForRemoteErrors(val) : 
  one node produced an error: could not find function "Rcgminu"

I can see that Rcgminu() is an exported function from the BayesPrism package and is available when the package is loaded. I have also tried running this with the Rcgmin package loaded and received the same error.

Here is my session info:

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
[3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BayesPrism_2.0              NMF_0.24.0                  cluster_2.1.3              
 [4] rngtools_1.5.2              pkgmaker_0.32.2             registry_0.5-1             
 [7] snowfall_1.84-6.2           snow_0.4-4                  DESeq2_1.36.0              
[10] SummarizedExperiment_1.26.1 Biobase_2.56.0              MatrixGenerics_1.8.0       
[13] matrixStats_0.62.0          GenomicRanges_1.48.0        GenomeInfoDb_1.32.2        
[16] IRanges_2.30.0              S4Vectors_0.34.0            BiocGenerics_0.42.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                bit64_4.0.5                 doParallel_1.0.17          
 [4] RColorBrewer_1.1-3          httr_1.4.4                  tools_4.2.0                
 [7] irlba_2.3.5                 utf8_1.2.2                  R6_2.5.1                   
[10] KernSmooth_2.23-20          DBI_1.1.3                   colorspace_2.0-3           
[13] withr_2.5.0                 tidyselect_1.1.2            bit_4.0.4                  
[16] compiler_4.2.0              cli_3.3.0                   BiocNeighbors_1.14.0       
[19] DelayedArray_0.22.0         caTools_1.18.2              scales_1.2.0               
[22] genefilter_1.78.0           stringr_1.4.0               digest_0.6.29              
[25] XVector_0.36.0              pkgconfig_2.0.3             sparseMatrixStats_1.8.0    
[28] limma_3.52.1                fastmap_1.1.0               rlang_1.0.4                
[31] rstudioapi_0.13             RSQLite_2.2.14              DelayedMatrixStats_1.18.0  
[34] generics_0.1.3              BiocParallel_1.30.2         gtools_3.9.2.1             
[37] dplyr_1.0.9                 RCurl_1.98-1.6              magrittr_2.0.3             
[40] BiocSingular_1.12.0         scuttle_1.6.3               GenomeInfoDbData_1.2.8     
[43] Matrix_1.4-1                Rcpp_1.0.8.3                munsell_0.5.0              
[46] fansi_1.0.3                 lifecycle_1.0.1             edgeR_3.38.4               
[49] stringi_1.7.6               zlibbioc_1.42.0             gplots_3.1.3               
[52] plyr_1.8.7                  grid_4.2.0                  blob_1.2.3                 
[55] dqrng_0.3.0                 parallel_4.2.0              crayon_1.5.1               
[58] lattice_0.20-45             Biostrings_2.64.0           beachmat_2.12.0            
[61] splines_4.2.0               annotate_1.74.0             KEGGREST_1.36.3            
[64] locfit_1.5-9.5              metapod_1.4.0               pillar_1.8.1               
[67] igraph_1.3.1                geneplotter_1.74.0          reshape2_1.4.4             
[70] codetools_0.2-18            ScaledMatrix_1.4.0          XML_3.99-0.9               
[73] glue_1.6.2                  scran_1.24.1                png_0.1-7                  
[76] vctrs_0.4.1                 foreach_1.5.2               gtable_0.3.0               
[79] purrr_0.3.4                 assertthat_0.2.1            cachem_1.0.6               
[82] ggplot2_3.3.6               BinfTools_0.0.0.9000        gridBase_0.4-7             
[85] rsvd_1.0.5                  xtable_1.8-4                survival_3.3-1             
[88] SingleCellExperiment_1.18.0 tibble_3.1.7                iterators_1.0.14           
[91] AnnotationDbi_1.58.0        memoise_2.0.1               statmod_1.4.37             
[94] bluster_1.6.0               ellipsis_0.3.2

Please let me know if you need any other information. Any help would be greatly appreciated.

Similar error while running `plot.cor.phi` and `plot.scRNA.outlier` functions ('x' must be an array of at least two dimensions)

Hello @tinyi and Team,
I'm getting similar errors while running plot.cor.phi and plot.scRNA.outlier functions. I've followed the vignette (Tutorial: bulk RNA-seq deconvolution using BayesPrism by Tinyi Chu) step-by-step to generate the input files correctly. Please find the commands, respective errors, and some cells giving information about the input files that I've prepared for BayesPrism.

Can you please look into this and help me resolve it?

plot.cor.phi(input = sc.dat,
            input.labels = cell.state.labels,
            title = "cell state correlation",
            # specify pdf.prefix if need to output to pdf
            # pdf.prefix = "BayesPrism.crc.cor.cs", 
            cexRow = 0.2, cexCol = 0.2, min.exp = 3,
            margins = c(2,2))

Error in h(simpleError(msg, call)): error in evaluating the argument 'j' in selecting a method for function '[': 'x' must be an array of at least two dimensions
Traceback:

plot.cor.phi(input = sc.dat, input.labels = cell.state.labels,
. title = "cell state correlation", cexRow = 0.2, cexCol = 0.2,
. min.exp = 3, margins = c(2, 2))

input[, colSums(input) >= min.exp]

colSums(input)

stop("'x' must be an array of at least two dimensions")

.handleSimpleError(function (cond)
. .Internal(C_tryCatchHelper(addr, 1L, cond)), "'x' must be an array of at least two dimensions",
. base::quote(colSums(input)))

h(simpleError(msg, call))

sc.stat <- plot.scRNA.outlier(
  input = sc.dat, #make sure the colnames are gene symbol or ENSEMBL ID 
  cell.type.labels = cell.type.labels,
  species = "hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw = TRUE #return the data used for plotting. 
  # pdf.prefix = "BayesPrism.crc.sc.stat" # specify pdf.prefix if need to output to pdf
)

Error in colSums(ref[labels == label.i, , drop = F]): 'x' must be an array of at least two dimensions
Traceback:

plot.scRNA.outlier(input = sc.dat, cell.type.labels = cell.type.labels,
. species = "hs", return.raw = TRUE)

collapse(ref = input, labels = cell.type.labels)

do.call(rbind, lapply(labels.uniq, function(label.i) colSums(ref[labels ==
. label.i, , drop = F])))

lapply(labels.uniq, function(label.i) colSums(ref[labels == label.i,
. , drop = F]))

FUN(X[[i]], ...)

colSums(ref[labels == label.i, , drop = F])

stop("'x' must be an array of at least two dimensions")

For reference:

class(bk.dat)
class(sc.dat)
class(cell.type.labels)
class(cell.state.labels)

'matrix' 'array'
'dgCMatrix'
'character'
'character'

dim(bk.dat)
dim(sc.dat)
length(cell.type.labels)
length(cell.state.labels)

592 18184
610150 18184
610150
610150

head(bk.dat)
head(sc.dat)
head(cell.type.labels)
head(cell.state.labels)

RNU12-2P EFCAB8 TRIM75P GTPBP6 EFCAB12 A1BG A1CF A2M A2ML1 A4GALT ... ZWILCH ZWINT ZXDA ZXDB ZXDC ZYG11A ZYG11B ZYX ZZEF1 ZZZ3

TCGA.3L.AA1B.01 1.9342 2.4178 0.4836 1033.850 0.0000 22.1470 220.987 15911.50 0.4836 118.9560 ... 403.641 629.594 71.0832 461.315 1105.42 3.3849 543.037 6259.19 1358.32 798.356

TCGA.4N.A93T.01 0.4838 2.4190 0.0000 1817.610 1.4514 171.2680 100.629 1494.33 0.4838 22.2545 ... 186.686 442.187 39.6710 366.715 1149.49 0.4838 290.760 4653.12 1220.13 333.817

TCGA.4T.AA8H.01 2.9245 2.9245 0.0000 719.430 0.7311 20.9980 174.008 1333.57 36.5564 16.0848 ... 520.782 1033.080 31.4385 349.479 1083.53 0.0000 669.713 4460.61 3002.01 530.068

TCGA.5M.AAT4.01 2.1515 2.1515 0.8606 879.948 1.7212 6.4587 151.463 2424.26 6.8847 75.7315 ... 468.408 1629.090 54.6472 542.169 1374.35 0.4303 445.353 4190.19 1093.37 574.441

TCGA.5M.AAT5.01 0.9892 8.9030 0.0000 934.819 1.4838 14.8384 255.715 2398.34 0.9892 41.5475 ... 663.533 838.864 29.1822 428.335 1240.98 3.4623 550.504 3878.26 1016.43 413.002

TCGA.5M.AAT6.01 1.3125 4.5937 0.0000 605.049 3.9374 49.8017 0.000 7231.65 2.6249 161.4340 ... 600.771 1338.720 45.9365 335.337 1056.54 13.7810 492.833 6165.99 1390.56 717.266
  [[ suppressing 34 column names 'OR4F5', 'OR4F29', 'FAM41C' ... ]]
6 x 18184 sparse Matrix of class "dgCMatrix"
cell1 . . . . . . . . 2 . . . .  1 3 1 . . . . . . . . . 4 1

cell2 . . . . . . . . 2 1 . . .  . . . . . . . . 1 . . . 1 .

cell3 . . . . . . . . 1 1 . . .  . . . . 1 . . . . . . . . .

cell4 . . . . 1 1 . 2 9 2 . . . 13 5 . . . . . . . . . . 3 1

cell5 . . . . . . . . . . . . .  . . . . . . . . . . . . . 1

cell6 . . . . . . . . 2 . . . .  1 2 1 . 2 . . . . . 1 1 5 1
cell7 2 . . . . . . ......

cell8 . . . 4 . . . ......

cell9 . . . 1 . . . ......

cell10 2 . . 8 1 . . ......

cell11 2 . . . . . . ......

cell12 . . . 4 . . . ......
.....suppressing 18150 columns in show(); maybe adjust 'options(max.print= *, width = *)'

..............................
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'
'Endothelial'

Error in if (any(which.row)) { : missing value where TRUE/FALSE needed

Hi, BayesPrism team

I've used this package for a while and it has performed very stably. But today I encountered an error like this:

recommend to have sufficient number of cells in each cell state
Number of outlier genes filtered from mixture = 44
Aligning reference and mixture...
Nornalizing reference...
Warning message:
In new.prism(reference = sc.dat.filtered.pc, mixture = bk.dat, input.type = "count.matrix",  :
  Warning: very few gene from reference and mixture match! Please double check your gene names.
Run Gibbs sampling...
Current time:  2022-07-19 02:02:40
Estimated time to complete:  2mins
Estimated finishing time:  2022-07-19 02:04:01
Start run...
R Version:  R version 4.1.2 (2021-11-01)

snowfall 1.84-6.1 initialized (using snow 0.4-4): parallel execution on 50 CPUs.


Stopping cluster

Update the reference matrix ...
Error in if (any(which.row)) { : missing value where TRUE/FALSE needed
Calls: run.prism -> updateReference -> get.MLE.psi_mal -> norm.to.one
In addition: Warning message:
In searchCommandline(parallel, cpus = cpus, type = type, socketHosts = socketHosts,  :
  Unknown option on commandline: --file
Execution halted

It's kind of weird because the query data (83 samples) is the same as in the past few successful runs. The only difference is that the number of genes in it is reduced from 2000 to 901, like this:

> dim(bk.dat)
[1]  83 901

Besides, if I subset some of the 83 samples and add some or all of them up to make an artificial dataset like this:

data$testA1 = with(data, CR190516 + CR190517 + CR190518 + CR190522 + CR190523) ;
data$testA2 = with(data, CR190516 + CR190517 + CR190518 ) ;
data$testA3 = with(data, CR190518 + CR190522 + CR190523) ;
data$testA4 = with(data, CR190516 );
data$testA5 = with(data, CR190517 );
data$testA6 = with(data, CR190518 );
data$testA7 = with(data, CR190522 );
data$testA8 = with(data, CR190523 );
testSamples=colnames(data)[grepl("test", colnames(data))]
test = data[,testSamples]
rownames(test) = rownames(data) ;
testData=t(test) ;
bk.dat <- testData

no errors show up.

Is the error coming from some of my samples?
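
One way to check whether individual samples are involved is to look for samples that contain NA values or have no counts left after the gene set was reduced, and to count how many genes the mixture still shares with the reference. The sketch below assumes bk.dat has samples in rows and genes in columns, as the dim() output above suggests. The toy matrix, the ref.genes vector and the helper find.suspect.samples are hypothetical, and whether empty samples actually cause this particular error is an assumption, not something confirmed in this thread.

```r
# Toy stand-in for bk.dat (samples in rows, genes in columns); values are made up.
bk.dat <- matrix(c(5, 0, 3,
                   0, 0, 0,
                   2, NA, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("S1", "S2", "S3"),
                                 c("geneA", "geneB", "geneC")))

# Hypothetical helper: flag samples that contain NA or have a total count of zero.
find.suspect.samples <- function(mixture) {
  has.na <- apply(mixture, 1, anyNA)
  zero.total <- rowSums(mixture, na.rm = TRUE) == 0
  rownames(mixture)[has.na | zero.total]
}

suspect <- find.suspect.samples(bk.dat)
print(suspect)

# Hypothetical reference gene names; count the genes shared with the mixture.
ref.genes <- c("geneA", "geneC", "geneD")
n.shared <- length(intersect(colnames(bk.dat), ref.genes))
print(n.shared)
```

If any samples are flagged, rerunning without them would show whether they are related to the failure.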

Using sorted bulk data as reference

Dear developers,

First off, I'd like to express my appreciation for the development of this method.

I'm interested in using TPM normalized bulk references of sorted cell types. This includes both RNA-seq and micro-array based references. According to your tutorial, it seems this can be accomplished by setting input.type = "GEP".

However, I am not sure what would be the correct way to generate markers for each type.
The get.exp.stat and select.marker functions are designed for scRNA-seq data.

A potential workaround I'm considering is to manually perform DE analysis in a pairwise manner for each cell type, and then select the most discriminating genes as markers. However, I would like to verify whether this approach is okay or whether there's a better method available.

Any advice or guidance you can offer would be greatly appreciated.

Almog
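
The pairwise idea described above could be sketched roughly as follows. This is only an illustration on a made-up GEP matrix with one profile per cell type, so it ranks genes by fold change rather than running a real DE test. The helper select.markers.pairwise and its parameters n.top and pseudo are hypothetical and not part of BayesPrism.

```r
# Toy GEP: cell types in rows, genes in columns (TPM-like values, made up).
gep <- rbind(
  Tcell = c(g1 = 100, g2 = 5,  g3 = 10, g4 = 50),
  Bcell = c(g1 = 4,   g2 = 90, g3 = 12, g4 = 55),
  Mono  = c(g1 = 6,   g2 = 3,  g3 = 80, g4 = 45)
)

# Hypothetical helper: for each cell type, rank genes by their smallest
# log2 fold change against every other cell type and keep the top n.top.
select.markers.pairwise <- function(gep, n.top = 1, pseudo = 1) {
  log.gep <- log2(gep + pseudo)
  markers <- lapply(rownames(log.gep), function(ct) {
    others <- log.gep[rownames(log.gep) != ct, , drop = FALSE]
    # Subtracting the per-gene maximum over the other types gives the
    # minimum pairwise log2 fold change for this cell type.
    min.lfc <- log.gep[ct, ] - apply(others, 2, max)
    names(sort(min.lfc, decreasing = TRUE))[seq_len(n.top)]
  })
  names(markers) <- rownames(log.gep)
  markers
}

markers <- select.markers.pairwise(gep, n.top = 1)
print(markers)
```

With replicate samples per sorted cell type, a proper DE test would be a better basis for the ranking than a plain fold change.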

library(BayesPrism) return Warning messages

Hello,
When I import the BayesPrism R package, the following warning occurs:

My operating system is Linux and my R version is 4.2.0.

Thanks for your help!

Issue with labelling of reference scRNA-seq data

Hello,

there are some guidelines in the tutorial to annotate the reference scRNA-seq data, for example:

"Define cell types as the cluster of cells having a sufficient number of significantly differentially expressed genes than other cell types, e.g., greater than 50 or even 100"
"Define multiple cell states for cell types of significant heterogeneity, such as malignant cells, and of interest to deconvolve their transcription."

But I could not find any suggested workflow to annotate the reference data.

What would be the ideal way to generate the cell label and cell state annotations? Are there any suggested tools I could use?

Thank you and best regards

Error in validObject(.Object): invalid class “prism” object: invalid key

Hello !

I just updated R to v.4.2, and now I downloaded the new BayesPrism (I had used the one in TED with Rv3.6.3 with no issues).

When I try to launch BayesPrism I am getting this error message :

number of cells in each cell state
cell.state.labels
SUDHL4 THP1 HL60 K562
551 562 570 659
Number of outlier genes filtered from mixture = 3
Aligning reference and mixture...
Nornalizing reference...
Error in validObject(.Object): invalid class “prism” object: invalid key
Traceback:

new.prism(reference = t(exprs(sc.es)), mixture = t(exprs(bulk.es)),
. input.type = "count.matrix", cell.type.labels = as.character(sc.es$cellType),
. cell.state.labels = as.character(sc.es$cellType), key = "tumor",
. outlier.cut = 0.01, outlier.fraction = 0.1, )
new("prism", phi_cellState = new("refPhi", phi = ref.cs, pseudo.min = pseudo.min),
. phi_cellType = new("refPhi", phi = ref.ct, pseudo.min = pseudo.min),
. map = map, key = key, mixture = mixture)
initialize(value, ...)
initialize(value, ...)
validObject(.Object)
stop(msg, ": ", errors, domain = NA)

Could you please help me to resolve it ? Thanks in advance !

Regards,

Alexandre Coudray
PhD Student at EPFL
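
One thing that stands out in the call above is that key = "tumor" does not appear among the printed labels (SUDHL4, THP1, HL60, K562). A minimal sketch of a pre-check follows, assuming that key has to be either NULL or one of the values in cell.type.labels. That assumption is based on a reading of the tutorial and is not confirmed in this thread, and the helper key.is.valid is hypothetical.

```r
# Labels rebuilt from the counts printed in the log above.
cell.type.labels <- rep(c("SUDHL4", "THP1", "HL60", "K562"),
                        times = c(551, 562, 570, 659))

# Hypothetical helper: key is acceptable if it is NULL or names a cell type.
key.is.valid <- function(key, labels) {
  is.null(key) || key %in% unique(labels)
}

res.tumor <- key.is.valid("tumor", cell.type.labels)  # not among the labels
res.null  <- key.is.valid(NULL, cell.type.labels)     # no malignant type given
res.k562  <- key.is.valid("K562", cell.type.labels)   # an existing label
print(c(tumor = res.tumor, null = res.null, K562 = res.k562))
```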

could not find function "sample.Z.theta_n"

Hi all,

I've tested the program on my local machine and it works after significantly shrinking my datasets. So I've moved to a compute cluster to try to get things working on a less scaled down version of my data. I got the error below.

I guess my question is: will BayesPrism work if the bigmemory package is not installed? It isn't currently available on our cluster, and I'm guessing this is where the error originates.

Run Gibbs sampling... 
Current time:  2022-08-31 09:22:58 
Estimated time to complete:  29mins 
Estimated finishing time:  2022-08-31 09:51:32 
Start run... 
R Version:  R version 4.2.0 (2022-04-22) 

snowfall 1.84-6.2 initialized (using snow 0.4-4): parallel execution on 32 CPUs.

Error in checkForRemoteErrors(val) : 
  8 nodes produced errors; first error: could not find function "sample.Z.theta_n"
> traceback()
12: stop(count, " nodes produced errors; first error: ", firstmsg)
11: checkForRemoteErrors(val)
10: staticClusterApply(cl, fun, length(x), argfun)
9: clusterApply(cl, splitList(x, length(cl)), lapply, fun, ...)
8: lapply(args, enquote)
7: do.call("fun", lapply(args, enquote))
6: docall(c, clusterApply(cl, splitList(x, length(cl)), lapply, 
       fun, ...))
5: parLapply(sfGetCluster(), x, fun, ...)
4: sfLapply(1:nrow(X), cpu.fun)
3: run.gibbs.refPhi(gibbsSampler.obj = gibbsSampler.obj, final = final, 
       compute.elbo = compute.elbo)
2: run.gibbs(gibbsSampler.ini.cs, final = FALSE)
1: run.prism(prism = myPrism, n.cores = 32)

select.marker can't return marker genes of stromal cells

When I check the result of the function select.marker, sc.dat.filtered.pc.sig, I find that it returns rows of all NA for stromal cells, while other cell types are fine. How can I solve this? Thank you!

Cell states vs cell types

Hi, I'm having trouble understanding the cell states vs cell types. In my data I have identified cell types from multiple samples that come from 2 major groups, let's say disease and control. There is not substantial variability in disease vs control in different cell types, so I'm having trouble understanding how to specify the correct cell state in this case: when examining umap, they cluster by cell type and not by group.

If I specify cell states as group+sample, I understandably end up with cell states spread across cell types:

cell_state    endothelial    macrophages    pericyte    ....
disease_1     300            70             0
disease_2     199            50             80
healthy_1     500            130            550

This means that I do not get substantial variability per sample (or source) to elicit distinct cell types. It also means I get the error "Error: one or more cell states belong to multiple cell types!".

Likewise, in the example that you show in the tutorial you could easily have them spread across cell types. How would you recommend addressing this?
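
The error quoted above suggests that each cell state is expected to sit under exactly one cell type. A minimal sketch of how to detect states that span several cell types, and one way to nest the state inside the cell type, is shown below. The metadata and the helper states.in.multiple.types are made up for illustration, and whether group-level states are worth modelling at all in this dataset is a separate question that this sketch does not answer.

```r
# Toy metadata: one row per cell (values are made up).
meta <- data.frame(
  cell.type = c("endothelial", "endothelial", "macrophages",
                "pericyte", "pericyte", "macrophages"),
  group     = c("disease", "healthy", "disease",
                "healthy", "disease", "healthy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: list cell states that occur under more than one cell type.
states.in.multiple.types <- function(cell.state, cell.type) {
  n.types <- tapply(cell.type, cell.state, function(x) length(unique(x)))
  names(n.types)[n.types > 1]
}

# Using the group alone as the state: every state spans several cell types.
bad.states <- states.in.multiple.types(meta$group, meta$cell.type)
print(bad.states)

# Prefixing the state with its cell type makes each state unique to one type.
meta$cell.state <- paste(meta$cell.type, meta$group, sep = "_")
fixed.states <- states.in.multiple.types(meta$cell.state, meta$cell.type)
print(fixed.states)
```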

Doubts regarding cell state and cell label; log transformed input

Thanks for developing this.
1. I have a basic question about what cell state and cell label mean.
As I understand it, the cell label says what kind of cell it is, such as astrocyte, glia, etc.
What does cell state mean here? Should I cluster the single-cell data so that each cluster represents a state, i.e. cells with a similar transcriptional profile share a state, so that astrocytes and astrocyte-like cells belong to state 1, and so on?

To get around this I used the cell labels as the cell states and ran the program. Is this okay?

2. When I run the program I get:
Warning: input seems to be log-transformed. Please double check your input. Log transformation should be avoided

However, I am using the counts slot of the Seurat object, which seems to contain raw counts.
Is there any way I can check whether this is an issue with the dataset, or why exactly this warning comes up?

Thanks again!
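
A quick heuristic for checking whether a matrix looks like raw counts is to test whether all values are non-negative integers and to look at the largest value. This is only a sketch: the helper looks.like.raw.counts is hypothetical, and it is not necessarily the check that BayesPrism performs before printing the warning.

```r
# Hypothetical helper: summarise properties that raw count data usually has.
looks.like.raw.counts <- function(mat) {
  # Assumes a dense matrix; for a sparse dgCMatrix the nonzero values
  # are stored in the x slot and could be inspected instead.
  vals <- as.numeric(mat)
  list(all.integer  = all(vals == round(vals)),
       non.negative = all(vals >= 0),
       max.value    = max(vals))
}

# Toy examples: a raw count matrix and its log-transformed version.
raw    <- matrix(c(0, 3, 120, 7, 0, 2500), nrow = 2)
logged <- log1p(raw)

res.raw    <- looks.like.raw.counts(raw)
res.logged <- looks.like.raw.counts(logged)
str(res.raw)
str(res.logged)
```

A small maximum value together with non-integer entries points towards transformed or normalized data rather than raw counts.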

deconvolution of spatial data

Hey guys,

Do you think BayesPrism can be used for deconvolution of spatial transcriptomics datasets, to get cell-type specific expression profiles?

Error in get.exp.stat function

Hello, thank you for creating this great package.

I encountered an error while running the get.exp.stat function:
Error: logical subscript contains NAs

I checked the original code, and I think the error might be due to NA output while filtering out comparisons for cell states from the same cell type:
#filter out comparisons for cell states from the same cell type
pairs.celltype.first <- ct.to.cst[match(fit.up$pairs$first, ct.to.cst[,"cell.state"]),"cell.type"]
pairs.celltype.second <- ct.to.cst[match(fit.up$pairs$second, ct.to.cst[,"cell.state"]),"cell.type"]

I made a small change. I changed
ct.to.cst <- unique(cbind(cell.type=cell.type.labels, cell.state=cell.state.labels))
to
ct.to.cst <- unique(cbind.data.frame(cell.type=cell.type.labels, cell.state=cell.state.labels))

Then it worked well.

I'm not sure if this is a problem with my cell naming or a general problem.

Thank you very much.
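
One possible explanation, assuming the labels were stored as factors (this is not stated in the report), is that cbind() on factors produces a matrix of integer codes, so matching state names against it returns NA, while cbind.data.frame() keeps the labels. The toy labels below are made up to illustrate the difference.

```r
# Toy labels stored as factors (an assumption about the reporter's data).
cell.type.labels  <- factor(c("T", "T", "B"))
cell.state.labels <- factor(c("T_naive", "T_eff", "B_all"))

# cbind() on factors keeps only the integer codes, not the label strings.
m <- unique(cbind(cell.type = cell.type.labels, cell.state = cell.state.labels))
idx.matrix <- match("T_naive", m[, "cell.state"])
print(idx.matrix)  # NA: the name is compared against codes such as 1, 2, 3

# cbind.data.frame() keeps the factors, so the lookup by name succeeds.
d <- unique(cbind.data.frame(cell.type = cell.type.labels,
                             cell.state = cell.state.labels))
idx.df <- match("T_naive", d[, "cell.state"])
print(idx.df)
```

If that is the cause, converting the labels with as.character() before the call would be another way to avoid the NA.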

Prebuilt docker image?

Hi, do you have an official docker image for this?

If not, are you open to a pull request for one? I am developing one, and I think this might be useful for some folks.

	RNU12-2P	EFCAB8	TRIM75P	GTPBP6	EFCAB12	A1BG	A1CF	A2M	A2ML1	A4GALT	...	ZWILCH	ZWINT	ZXDA	ZXDB	ZXDC	ZYG11A	ZYG11B	ZYX	ZZEF1	ZZZ3
TCGA.3L.AA1B.01	1.9342	2.4178	0.4836	1033.850	0.0000	22.1470	220.987	15911.50	0.4836	118.9560	...	403.641	629.594	71.0832	461.315	1105.42	3.3849	543.037	6259.19	1358.32	798.356
TCGA.4N.A93T.01	0.4838	2.4190	0.0000	1817.610	1.4514	171.2680	100.629	1494.33	0.4838	22.2545	...	186.686	442.187	39.6710	366.715	1149.49	0.4838	290.760	4653.12	1220.13	333.817
TCGA.4T.AA8H.01	2.9245	2.9245	0.0000	719.430	0.7311	20.9980	174.008	1333.57	36.5564	16.0848	...	520.782	1033.080	31.4385	349.479	1083.53	0.0000	669.713	4460.61	3002.01	530.068
TCGA.5M.AAT4.01	2.1515	2.1515	0.8606	879.948	1.7212	6.4587	151.463	2424.26	6.8847	75.7315	...	468.408	1629.090	54.6472	542.169	1374.35	0.4303	445.353	4190.19	1093.37	574.441
TCGA.5M.AAT5.01	0.9892	8.9030	0.0000	934.819	1.4838	14.8384	255.715	2398.34	0.9892	41.5475	...	663.533	838.864	29.1822	428.335	1240.98	3.4623	550.504	3878.26	1016.43	413.002
TCGA.5M.AAT6.01	1.3125	4.5937	0.0000	605.049	3.9374	49.8017	0.000	7231.65	2.6249	161.4340	...	600.771	1338.720	45.9365	335.337	1056.54	13.7810	492.833	6165.99	1390.56	717.266
