SoupX

An R package for the estimation and removal of cell free mRNA contamination in droplet based single cell RNA-seq data.

The problem this package attempts to solve is that all droplet based single cell RNA-seq experiments also capture ambient mRNAs present in the input solution along with cell specific mRNAs of interest. This contamination is ubiquitous and can vary hugely between experiments (2% - 50%), although around 10% seems reasonably common.

There's no way to know in advance what the contamination is in an experiment, although solid tumours and low-viability cells tend to produce higher contamination fractions. As the source of the contaminating mRNAs is lysed cells in the input solution, the profile of the contamination is experiment specific and produces a batch effect.
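To make the idea of a contamination fraction concrete, here is a minimal sketch in base R. It assumes a simple mixing picture in which a droplet's counts are a blend of the cell's own profile and the ambient profile. The names cell_profile, soup_profile, rho and n_umi are illustrative and are not part of SoupX.

```r
# Toy mixing model: a droplet's expected counts are a blend of the cell's own
# expression profile and the ambient ("soup") profile.

# Hypothetical profiles (each sums to 1). This cell does not express HBB.
cell_profile <- c(HBB = 0.00, CD3E = 0.30, MS4A1 = 0.00, ACTB = 0.70)
soup_profile <- c(HBB = 0.40, CD3E = 0.05, MS4A1 = 0.05, ACTB = 0.50)

rho   <- 0.10   # assumed contamination fraction (10%)
n_umi <- 5000   # assumed total UMIs in the droplet

# Expected counts contributed by each source.
expected_soup <- rho * n_umi * soup_profile
expected_cell <- (1 - rho) * n_umi * cell_profile
expected_obs  <- expected_cell + expected_soup

# Because the cell expresses no HBB, every HBB count must be ambient, so the
# contamination fraction can be recovered from that gene alone.
rho_hat <- expected_obs[["HBB"]] / (n_umi * soup_profile[["HBB"]])
print(rho_hat)
```

In this toy model the same ratio overshoots if the chosen gene is in fact expressed by the cell, which is why the choice of genes and cells used for estimation matters.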

Even if you decide you don't want to use the SoupX correction methods for whatever reason, you should at least want to know how contaminated your data are.

NOTE: From v1.3.0 onward, SoupX includes an option to automatically estimate the contamination fraction. It is anticipated that this will be the preferred way of using the method for the vast majority of users. This function (autoEstCont) depends on clustering information being provided. If you are using 10X data mapped with cellranger, this will be loaded automatically; otherwise it must be provided explicitly using setClusters.

Installation

The latest stable release can be installed from CRAN in the usual way by running,

install.packages('SoupX')

If you want to use the latest development version, install it by running,

devtools::install_github("constantAmateur/SoupX",ref='devel')

Finally, if you want to use the per-cell contamination estimation (which you almost certainly won't need to), install the STAN branch:

devtools::install_github("constantAmateur/SoupX",ref='STAN')

If you encounter errors saying multtest is unavailable, please install this manually from bioconductor with:

BiocManager::install('multtest')

Quickstart

Decontaminate one channel of 10X data mapped with cellranger by running:

sc = load10X('path/to/your/cellranger/outs/folder')
sc = autoEstCont(sc)
out = adjustCounts(sc)

or, to manually load and decontaminate any other data:

sc = SoupChannel(table_of_droplets,table_of_counts)
sc = setClusters(sc,cluster_labels)
sc = autoEstCont(sc)
out = adjustCounts(sc)

out will then contain a corrected matrix to be used in place of the original table of counts in downstream analyses.
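When building the inputs by hand, it can help to check that the two tables look like a consistent pair before constructing the channel. The sketch below uses only base R and the Matrix package. check_soup_inputs is a hypothetical helper, not a SoupX function, and the conditions it tests (same genes in the same order, every cell barcode present among the droplets, extra droplets beyond the cells) are assumptions about what a sensible pair looks like.

```r
library(Matrix)

# Hypothetical helper (not part of SoupX): list obvious inconsistencies
# between a table of droplets (tod) and a table of counts (toc).
check_soup_inputs <- function(tod, toc) {
  problems <- character(0)
  if (!identical(rownames(tod), rownames(toc)))
    problems <- c(problems, "gene names (rows) differ between tod and toc")
  if (!all(colnames(toc) %in% colnames(tod)))
    problems <- c(problems, "some cell barcodes in toc are missing from tod")
  if (ncol(tod) <= ncol(toc))
    problems <- c(problems, "tod has no droplets beyond the cells in toc")
  problems
}

# Toy data: 3 genes, 6 droplets, of which 2 are called as cells.
tod <- Matrix(c(9, 0, 4,  7, 1, 5,  1, 0, 0,  0, 1, 0,  1, 0, 1,  0, 0, 1),
              nrow = 3, ncol = 6, sparse = TRUE,
              dimnames = list(c("g1", "g2", "g3"), paste0("bc", 1:6)))
toc <- tod[, c("bc1", "bc2")]

print(check_soup_inputs(tod, toc))   # consistent pair: nothing reported
print(check_soup_inputs(toc, toc))   # same matrix twice: no empty droplets
```

Passing the same matrix as both tables is flagged because there would then be no empty droplets left to describe the ambient profile.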

Documentation

The methodology implemented in this package is explained in detail in this paper.

A detailed vignette is provided with the package and can be viewed here.

Citing SoupX

If you use SoupX in your work, please cite: "Young, M.D., Behjati, S. (2020). SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience, Volume 9, Issue 12, December 2020, giaa151, https://doi.org/10.1093/gigascience/giaa151"

Frequently Asked Questions

I'm getting errors from autoEstCont or unrealistic estimates

The automatic estimation of the contamination implemented in autoEstCont makes the assumption that there is sufficient diversity in the raw data to identify marker genes (as such genes are commonly useful for estimating the contamination). If your data is either extremely homogeneous (i.e., all one cell type, for example a cell line) or your number of cells is very low (a few hundred or less), then this assumption is unlikely to hold. In such situations you should think hard about whether you really want to include data with such severe limitations. But if you're sure you do, the best approach is probably to manually specify a contamination fraction in line with what you would expect from similar experiments.

My data still looks contaminated. Why didn't SoupX work?

The first thing to do is check that you are providing clustering information, either by doing clustering yourself and running setClusters before adjustCounts or by loading it automatically from load10X. Cluster information allows far more contamination to be identified and safely removed.

The second thing to consider is if the contamination rate estimate looks plausible. As estimating the contamination rate is the part of the method that requires the most user input, it can be prone to errors. Generally a contamination rate of 2% or less is low, 5% is usual, 10% moderate and 20% or above very high. Of course your experience may vary and these expectations are based on fresh tissue experiments on the 10X 3' platform.
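As a quick sanity check, the rough expectations above can be turned into a small labelling function. This is only a sketch: describe_contamination is a hypothetical helper, and the cut points between the named levels (for example 7.5% between "usual" and "moderate") are assumptions chosen for illustration.

```r
# Hypothetical helper: label a contamination fraction using the rough
# expectations quoted above (fresh tissue, 10X 3' data). The boundaries
# between levels are an assumption made for this sketch.
describe_contamination <- function(rho) {
  stopifnot(is.numeric(rho), all(rho >= 0 & rho <= 1))
  ifelse(rho <= 0.02,  "low",
  ifelse(rho <  0.075, "usual",
  ifelse(rho <  0.20,  "moderate", "very high")))
}

print(describe_contamination(c(0.02, 0.05, 0.10, 0.20)))
```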

Finally, note that SoupX has been designed to try and err on the side of not throwing out real counts. In some cases it is more important to remove contamination than be sure you've retained all the true counts. This is particularly true as "over-removal" will not remove all the expression from a truly expressed gene unless you set the over-removal to something extreme. If this describes your situation you may want to try manually increasing the contamination rate with setContaminationFraction and seeing if this improves your results.
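The point about over-removal can be illustrated with some simple arithmetic. The sketch below is a simplified stand-in for the correction step, not SoupX's adjustCounts: it subtracts the expected ambient counts for a single gene and floors the result at zero. The function name and all numbers are hypothetical.

```r
# Simplified stand-in for the correction (NOT SoupX's adjustCounts):
# subtract the expected ambient counts for one gene and floor at zero.
remove_expected_soup <- function(obs, n_umi, soup_frac, rho) {
  pmax(0, obs - rho * n_umi * soup_frac)
}

# A gene that is genuinely expressed: 300 observed counts in a 5000 UMI cell,
# while the gene makes up only 1% of the ambient profile.
rhos <- c(0.05, 0.10, 0.20, 0.40)
corrected <- remove_expected_soup(obs = 300, n_umi = 5000,
                                  soup_frac = 0.01, rho = rhos)
print(data.frame(rho = rhos, corrected = corrected))
```

Even at a deliberately high contamination fraction, a gene like this keeps most of its counts under this toy model, because the ambient profile can only account for a small share of them.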

I can't find a good set of genes to estimate the contamination fraction.

Generally the gene sets that work best are sets of genes highly specific to a cell type that is present in your data at low frequency. Think HB genes and erythrocytes, IG genes and B-cells, TPSB2/TPSAB1 and Mast cells, etc. Before trying anything more esoteric, it is usually a good idea to at least try out the most commonly successful gene sets, particularly HB genes. If this fails, the plotMarkerDistribution function can be used to get further inspiration as described in the vignette. If all of this yields nothing, we suggest trying a range of corrections to see what effect this has on your downstream analysis. In our experience most experiments have somewhere between 2-10% contamination.
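The gene sets mentioned above can be assembled from the gene names in the data. The following base R sketch is one way to do it. The regular expressions are assumptions written for human gene symbols and are deliberately strict, so that look-alike symbols such as HBEGF or IGHMBP2 are not swept in. The toy genes vector stands in for the row names of a real count matrix.

```r
# Toy gene list; in practice this would be rownames() of the count matrix.
genes <- c("HBB", "HBA1", "HBA2", "HBEGF", "HBP1",
           "IGKC", "IGHG1", "IGHM", "IGHMBP2", "IGLC2",
           "TPSB2", "TPSAB1", "ACTB")

# Strict patterns (an assumption for human symbols, not taken from SoupX):
# haemoglobin genes and immunoglobulin constant-region genes only.
hb_genes   <- grep("^HB[ABDEGMQZ][0-9]?$", genes, value = TRUE)
ig_genes   <- grep("^(IGH(A[12]|D|E|G[1-4]|M)|IGKC|IGLC[0-9])$", genes, value = TRUE)
mast_genes <- intersect(c("TPSB2", "TPSAB1"), genes)

# A named list with one entry per gene set.
non_expressed_gene_list <- list(HB = hb_genes, IG = ig_genes, MAST = mast_genes)
str(non_expressed_gene_list)
```

A named list of this shape matches the form of the nonExpressedGeneList argument used in the examples on this page, such as list(IG = igGenes).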

estimateNonExpressingCells can't find any cells to use to estimate contamination.

At this point we assume that you have chosen a set (or sets) of genes to use to estimate the contamination. The default behaviour (with 10X data) is to look across all cells for strong evidence of endogenous expression of these gene sets, then exclude any cluster containing a cell with such evidence. This conservative behaviour is designed to stop over-estimation of the contamination fraction, but can sometimes make estimation difficult. If all clusters have at least one cell that "looks bad", you have three options.

  1. Recluster the data to produce more clusters with fewer cells per cluster. This is the preferred option, but requires more work on the user's part.
  2. Make the criteria for declaring a cell to be genuinely expressing a gene set less strict. This seldom works, as usually when a cell is over the threshold, it's over by a lot. But in some cases tweaking the values maximumContamination and/or pCut can yield usable results.
  3. Set clusters=FALSE to force estimateNonExpressingCells to consider each cell independently. If you are going to do this, it is worth making the criteria for excluding a cell more permissive by decreasing maximumContamination as much as is reasonable.
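The effect of options 1 and 3 can be seen with a toy example of the exclusion rule described above. This is a sketch of the logic only, in base R. usable_cells is a hypothetical helper rather than the SoupX implementation, and the flags and cluster labels are made up.

```r
# Toy example: 8 cells; TRUE marks a cell with strong evidence of genuinely
# expressing the gene set (cells 2 and 6 here).
expressing <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Hypothetical helper: a cell is usable for estimation only if no cell in
# its cluster is flagged as expressing.
usable_cells <- function(expressing, clusters) {
  bad_clusters <- unique(clusters[expressing])
  !(clusters %in% bad_clusters)
}

coarse   <- c("A", "A", "A", "A", "B", "B", "B", "B")          # two big clusters
fine     <- c("A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2")  # after reclustering
per_cell <- as.character(seq_along(expressing))                # every cell alone

print(sum(usable_cells(expressing, coarse)))    # 0 usable cells
print(sum(usable_cells(expressing, fine)))      # 4 usable cells
print(sum(usable_cells(expressing, per_cell)))  # 6 usable cells
```

With two coarse clusters every cell is excluded, finer clusters recover half of them, and treating each cell independently excludes only the flagged cells themselves.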

Changelog

v1.6.0

  • Added some checks and security around setting of clusters with setClusters.
  • Add warnings and extra documentation about library complexity to autoEstCont.
  • Fix pointSize bug in plotting function.
  • Merge pull adding support for multi.

v1.5.0

load10X now requires the version of Seurat::Read10X that does not strip out the numeric suffix.

v1.4.5

First CRAN version of the code. The one significant change other than tweaks to reach CRAN compatibility is that the correction algorithm has been made about 20 times faster. As such, the parallel option was no longer needed and has been removed. Also includes some other minor tweaks.

v1.3.6

Addition of autoEstCont function to automatically estimate the contamination fraction without the need to specify a set of genes to use for estimation. A number of other tweaks and bug fixes.

v1.2.1

Some bug fixes from v1.0.0. Added some helper functions for integrating metadata into SoupChannel object. Further integration of cluster information in estimation of contamination and calculation of adjusted counts. Make the adjustCounts routine parallel.

v1.0.0

Review of method, with focus on simplification of code. Functions that were being used to "automate" selection of genes for contamination estimation have been removed as they were being misused. Clustering is now used to guide selection of cells where a set of genes is not expressed. Default now set to use global estimation of rho. A hierarchical bayes routine has been added to share information between cells when the user does use cell specific estimation. See NOTE for further details.

v0.3.0

Now passes R CMD check without warnings or errors. Added extra vignette on estimating contamination correctly. Changed the arguments for the interpolateCellContamination function and made monotonically decreasing lowess the default interpolation method. A number of other plotting improvements.

v0.2.3

Added lowess smoothing to interpolation and made it the default. Modified various functions to allow single channel processing in a more natural way. Some minor bug fixes.

v0.2.2

Integrated estimateSoup into class construction to save memory when loading many channels. Added function to use tf-idf to quickly estimate markers. Some minor bug fixes and documentation updates.

v0.2.1

Update documentation and modify plot functions to return source data.frame.

v0.2.0

A fairly major overhaul of the data structures used by the package. Not compatible with previous versions.

v0.1.1

Some bug fixes to plotting routines.

License

Copyright (c) 2018 Genome Research Ltd. 
Author: Matthew Young <[email protected]> 
 
This program is free software: you can redistribute it and/or 
modify it under the terms of the GNU General Public License version 3 
as published by the Free Software Foundation. 

This program is distributed in the hope that it will be useful, 
but WITHOUT ANY WARRANTY; without even the implied warranty of 
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 
General Public License for more details <http://www.gnu.org/licenses/>. 

soupx's People

Contributors

constantamateur, gtca, mschilli87, yihui


soupx's Issues

load count matrix

Hi,

Thanks for developing such a neat program.
For the latest version, you removed the function "SoupChannelList", so how should I load the count expression matrix instead of 10X data from cellranger? Thanks.

Error when running calculateContaminationFraction with cellSpecificEstimates = True

When running calculateContaminationFraction as follows:

scveh <- calculateContaminationFraction(scveh, nonExpressedGeneList = genelist, cellSpecificEstimates = TRUE, useToEst = usetoestveh)

I get the following error message:

Error in compileCode(f, code, language = language, verbose = verbose) :
Compilation ERROR, function(s)/method(s) not created! g++: error: unrecognized command line option ‘-std=gnu++14’
make: *** [file1467411a2bdf9.o] Error 1
In addition: Warning message:
In system(cmd, intern = !verbose) :
running command '/scg/apps/software/r/3.6/lib64/R/bin/R CMD SHLIB file1467411a2bdf9.cpp 2> file1467411a2bdf9.cpp.err.txt' had status 1
Error in sink(type = "output") : invalid connection

This is being run in RStudio 3.6 with everything updated fully. Has anyone else had this issue and found a workaround? When I run the command without cellSpecificEstimates it works fine.

Thanks,
Ryan

Why do I get contamination fraction >100%

I used the endothelial marker VWF to estimate the contamination fraction for a group of immune cells.

After running
sc <- calculateContaminationFraction(sc=sc, nonExpressedGeneList=nonExpGlist, useToEst = useToEst_clustering)

I got
Estimated global contamination fraction of 114.78%

How could that be possible? Why does the algorithm return a value more than 100% ?

undesirable behavior reading in aggregated 10X data

Neat program you've developed here!

I would like to work from my downsampled data generated with cellranger aggr across several libraries. However, when I use the load10X() function, everything is imported as one "Channel". The info is preserved in the barcode endings "-1" ... "-10", but the cells are not being imported according to their respective channel/library.
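For illustration, the barcode suffix can be used to separate the columns of an aggregated matrix by library. The sketch below uses base R and the Matrix package on toy data. The barcodes and counts are made up, and it assumes the library index is whatever follows the last "-" in each barcode.

```r
library(Matrix)

# Toy aggregated matrix: barcodes carry a "-<library index>" suffix.
barcodes <- c("AAAC-1", "AAAG-1", "CCCA-2", "GGGT-10", "TTTA-10")
counts <- Matrix(as.numeric(1:15), nrow = 3, ncol = 5, sparse = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), barcodes))

# Take everything after the last "-" as the library index, then split the
# column positions by that index and subset the matrix for each library.
library_id <- sub(".*-", "", colnames(counts))
by_library <- lapply(split(seq_len(ncol(counts)), library_id),
                     function(idx) counts[, idx, drop = FALSE])

print(sapply(by_library, ncol))
```

This only separates the columns. Whether each piece can then be treated as its own channel depends on also having the matching raw droplets for that library.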

Can we compute soup components and adjustCounts separately?

Thank you for the great tool! I have several questions:

  1. If I decide to setContaminationFraction manually, do I still need to pull the raw_feature_bc_matrix?

  2. If I decide to only run plotMarkerDistribution for each sample without correcting the data, does it matter whether I use the filtered_feature_bc_matrix directly from 10X, or will loading a processed matrix change the result? I have filtered, merged several channels, removed doublets, and attached sample names to the barcodes. Does SoupX need the two matrices to have matching barcodes in order to recognize each other?

Thanks!

Loading data and downstream steps

Hi @constantAmateur: I am having trouble loading my dataset with load10X for the top level cellranger output.

(1) I can do SoupChannel with tod and toc. But the next step is estimateSoup or setSoupProfile, right? After which I have to add the metadata, such as the clustering, correct?

(2) When loading tod and toc using Seurat::Read10X, what is the purpose of saying "GRCh38"? (on the vignette).

(3) For setClusters to work when I can't load the cellranger outs folder, I have to provide my own clustering info. Is that clustering done on the raw_feature_bc_matrix, for example, using Seurat? Or does it have to be done on both the raw and filtered bc matrices?

Thanks a lot for your response, and any additional information that you think will help me. For my dataset, I want to get to a point where I can use autoEstCont.

Loading count matrix

Hi,

I have been using SoupX for my 10X data and it seems to be working great.

But I am having issues with SoupX when I try to load a count matrix. I have one count matrix. How do I create a subset of it to remove empty droplets?

Here are commands:

tod <- read.delim("Maligant50.csv", row.names = 1, sep=",")    
toc <- read.delim("Maligant50.csv", row.names = 1, sep=",")    

channel1 = SoupChannel(tod,toc,channelName="channel1")
spObj = SoupChannelList(list(channel1))
spObj$channels$channel1 = estimateSoup(spObj$channels$channel1)

I am getting following error:

spObj$channels$channel1 = estimateSoup(spObj$channels$channel1)
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

Am I doing anything wrong?

Many thanks,

[Question] The way to choose soup specific genes

In my project, genes that are highly specific to just one population of cells (Immune, luminal, fibroblast, endothelial) are not known at all.

Thus, I made a plot of the top 60 candidate soup specific genes using "plotMarkerDistribution", as described in the detailed vignette.
However, it is hard to see a bimodal graph except for TGM4 (located in the middle of the top 20 candidates).

  1. Which gene could be the soup specific gene when there is no bimodal graph, as in this case?

  2. In the detailed vignette, all immunoglobulin genes are used since IGKC and IGLC2 were chosen from the plot. However, that is not the case in my data. How can I make the list with TGM4 when I have no biological information? Should I use TGM4 only as the soup specific gene?

I attached my plot for better understanding.

  1. top 1-20 [image: 111]

  2. top 21-40 [image: 222]

  3. top 41-60 [image: 333]

Thanks,

Error in loading data

Hello!

I'm trying to run SoupX on my data, but I keep getting this error and cannot seem to solve it:

Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : 'NA' indices are not (yet?) supported for sparse Matrices

Can you please advise where I might have gone wrong?

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
Matrix products: default
BLAS: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRblas.so
LAPACK: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SoupX_0.3.0 devtools_1.13.5 Seurat_2.3.0 Matrix_1.2-14
[5] cowplot_0.9.2 ggplot2_3.0.0

Thanks!

Counts 0 and soupest NaN during autoEstCont

Hi,

Thanks a lot for making this lovely package!

I am trying to run SoupX on a scRNAseq dataset and am getting the following error: Error in quantile.default(soupProf$est, soupQuantile) : missing values and NaN's not allowed if 'na.rm' is FALSE. Here is the approach I am using:

  1. Running the Seurat workflow with cell cycle regression and SCTransform to get a processed Seurat object. Then performing dim reduction to annotate the clusters using UMAP/PCA. I am then saving this meta-data with the list of cells and their cluster annotations.
  2. Load in the filtered and raw matrices (CellRanger output) and keep the cells I annotated from (1) for the filtered and raw matrices
  3. Then running:
    sc = SoupChannel(tod, toc)
    sc = setClusters(sc, setNames(HDCM_meta$seurat_clusters, HDCM_meta_cells))
    sc = autoEstCont(sc, soupQuantile=0.2)

When I run autoEstCont I get the error "Error in quantile.default(soupProf$est, soupQuantile) : missing values and NaN's not allowed if 'na.rm' is FALSE". After checking the sc object, the counts are all 0 and the est values are all NaN. I would appreciate any help! It seems that in the vignette you do not subset the cells from the raw and filtered matrices?
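One plausible explanation for these symptoms is that no low-UMI droplets remain once the raw matrix has been subset to annotated cells, leaving nothing from which to build the ambient profile. The sketch below shows a quick way to check this with base R and the Matrix package. n_background_droplets is a hypothetical helper, and the UMI range it uses by default is an assumption made for this illustration.

```r
library(Matrix)

# Hypothetical helper: count droplets whose total UMIs fall strictly inside
# soup_range. The default range is an assumption for this sketch.
n_background_droplets <- function(tod, soup_range = c(0, 100)) {
  n_umis <- Matrix::colSums(tod)
  sum(n_umis > soup_range[1] & n_umis < soup_range[2])
}

# Toy table of droplets: 2 cells with many UMIs and 3 near-empty droplets.
tod <- Matrix(c(400, 300,  500, 200,  3, 1,  0, 2,  5, 0),
              nrow = 2, ncol = 5, sparse = TRUE,
              dimnames = list(c("g1", "g2"), paste0("bc", 1:5)))
cells <- c("bc1", "bc2")

print(n_background_droplets(tod))                          # full raw matrix
print(n_background_droplets(tod[, cells, drop = FALSE]))   # subset to cells
```

If the second number is zero for real data, the table of droplets passed in probably needs to be the unfiltered raw matrix rather than a subset of it.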

Thanks!

Setting optimum threshold for contamination

Firstly wanted to say thank you for creating such an excellent package, and a great paper - very intuitive principles using the empty droplets to build a profile of ambient RNA expression!

My question is around what to do when the estimation genes are present in every cell in certain samples. In my case, I am working with gut tissue, and using IG genes to assess contamination. In some gut samples, I keep getting the message from SoupX that there are no non-expressing cells. This is biologically not possible because I have all cell types in my samples (even T-cells haha) unless of course contamination is involved. The below command produces this notification:

estimateNonExpressingCells(sc, nonExpressedGeneList = list(IG = igGenes))

This is upon providing the standard 10X cellranger clustering, and the suggestion from the SoupX package is to set cluster = FALSE. However in other gut samples, the provided 10X cellranger clusters works well. My choice to move forward seems to be either:

i. set an artificially high contamination rate for all samples (say 0.1 as per your vignette)

ii. set cluster = FALSE in all samples, treating each cell as its own cluster (In samples where providing 10X clustering works, i find that setting cluster = FALSE doubles the estimated contamination fraction - but this is actually still well below the 0.1 in option i.)

iii. set artificially high contamination rate for samples (0.1) where I bump in to the 'no non-expressing cell ' problem but provide standard 10x cell ranger clustering info for samples where this is not an issue

iv. set cluster = FALSE instead of setting an artificially high contamination rate for samples where I bump into the 'no non-expressing cell' problem, but provide standard 10x cell ranger clustering info for samples where this is not an issue

I am hesitant to pursue options iii. and iv. as these introduce artificial effects on a particular subset of samples by treating these subsets differently (please correct me if I am wrong here). I think part of it is because I believe these 'highly contaminated samples' are inflamed gut tissue, so I think there is a biological reason behind this separation! So, I would rather treat all samples homogeneously. Do you think this is correct?

The second and more important question then becomes: which is the safer option to go with, option i. or ii., especially given that I need to do differential expression downstream?

Your comment in "#32", suggests you might prefer an artificially high threshold, but I suppose my question to you is, in this setting outlined above would you still pick an artificially high threshold over, setting cluster = FALSE and treating each cell like its own cluster?

Also, in both i. and ii. is it ok to pass clustering information at the adjustCounts step despite not having given clusters for the preceding estimateNonExpressingCells step?

Thank you ever so much again!

Running autoEstCont() gets error message

Hi, I'm trying to use SoupX and follow the quickstart.
First, I have the 10X data, both raw and filtered matrices.
The load10X() function worked, but when I ran autoEstCont() it told me I should run clustering first.
So I ran

sc = setClusters(sc,[email protected])

Then do

sc = autoEstCont(sc)

But this time I got the error message "Error in dimnames(x) <- dn: The length of 'dimnames' [1] must be equal to the array range".
How can I fix this problem?

[Question] Is it possible to feed data into a SoupChannel list besides by using Read10X Function?

Hello!

With the new update of SoupX, I noticed that since the SoupChannel objects get data directly from the cell_ranger_output, if you create a Seurat object to get cell embeddings and it includes a filtering step, as in the standard Seurat pipeline, you get an error when trying to run plotMarkerMap() with useToEst, because the sc object and Seurat_DR will have different cell counts. This can be fixed by removing the filtering steps in the Seurat pipeline; however, that leaves in low quality cells, and I don't know whether they will impact the global contamination estimate.

Thanks!
Casey

How SoupX remove ambient RNA expression when two or more gene sets are used?

Hi,

Thank you for this nice package!

The tutorial gives an example using IG genes as nonExpressedGeneList. I wonder what SoupX would do if two or more gene sets are fed into nonExpressedGeneList?

Say cell 1 is a red blood cell, and cell 2 is a B cell. If we use nonExpressedGeneList = list(HB = c("HBB", "HBA2"), IG = c("IGKC")), would SoupX use IG genes to estimate contamination in cell 1 and HB genes in cell 2? If so, could we go further and say that the estimate would be more accurate if more (accurate) gene sets are given?

I am new to single-cell RNA-seq. Sorry if my questions are too naive.

Many thanks!
Yiwei Niu

Estimating and adjusting counts in soupx

Hi,
I need to clean my scRNA-seq data.
I tried following the SoupX vignette,
but I am not able to estimate and adjust counts after loading the 10X files (both raw and filtered).
It gives the following error:

"Error in autoEstCont(sc) :
Clustering information must be supplied, run setClusters first."

I will appreciate your help.
Thanks in advance.

Error while running strainCells

Hi,

I ran into the following error while running strainCells.

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed In addition: Warning message: In int2i(as.integer(i), n) : NAs introduced by coercion to integer range

I don't see any NAs introduced in the dataset and it runs fine with smaller datasets.

Can you help in troubleshooting this?

Thanks
shristi

Estimated global contamination fraction and cluster info

Hi,
I think I don't fully understand what the estimated global contamination fraction represents. Is it the average contamination fraction per droplet?
How does cluster information help to remove contamination more accurately?
Thanks,
Jaime.

multtest not available for R version 3.6.0

Hello,

Just wanted to let you know that one of the dependencies of the updated package is outdated. I haven't tried running on an old form of R yet, but I'll let you know if it works. In case you see this before I do try and run this again, what version of R are you running for SoupX?

Thanks,

Casey

Problem running on jupyter notebook

Hi!

I am running a pipeline in a Jupyter notebook, and I would like to use SoupX inside an R cell. However, each time I run a cell with library(SoupX), the kernel dies.

My installation script is as follows:

conda create -n myenv python==3.7.*
conda activate myenv 

conda install pip
conda install -c conda-forge r-base==3.6.2

conda install nb_conda # install jupyter by default
pip install jupyter_contrib_nbextensions

pip install scanpy
pip install leidenalg
pip install OpenTSNE
pip install rpy2
pip install anndata2ri

conda install pandas==0.25.*

conda install -c conda-forge r-curl==4.3
conda install -c conda-forge r-matrix.utils
conda install -c conda-forge r-sf  # installs hdf5 also

Once in R:

install.packages('usethis')
install.packages('covr')
install.packages('httr')
install.packages('rversions')
install.packages('devtools')

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("multtest")

devtools::install_github('constantAmateur/SoupX',ref='devel')

I use anaconda R to run it in jupyter and I am almost sure that there must be some library or package that breaks the kernel when loading the library, because when running R in the console the package loads correctly.

I know the issue is quite odd for the types of issues that should be received here, but I post it so that you know about it. I will post it in https://github.com/ipython/ipython/issues, so that they can try to solve it.

Thanks!

Performance on 10X v3 libraries

I have used SoupX successfully in the past on 10X v2 libraries, which have far fewer possible barcodes than the v3 chemistry. Calculating cell specific estimates takes a few minutes on my laptop on v2 libraries, but many hours on v3 chemistry and ultimately exhausts available vector memory. Is this a known issue with the current version? Adjusting counts also seems to hang.

I also experienced performance problems on a fresh install of R/Rstudio/SoupX on a new iMac.

Perhaps there is some dependency that is not configured properly on my machines for parallelization?

Opinion on analysis: (1) Auto/Manual; and (2) Before/After doublet removal

Hi,
Thank you for the wonderful program. I would like some opinion on the analysis I performed on my dataset.

My dataset is from PBMCs sequenced using the 10X 5'-GEX kit. I can see that there is very obvious background contamination from RBCs, so I tried to remove it.

If you see the attached image, you can see I tried 2 different parameters. For the auto, I used the default, automated SoupX setting, while for manual, I manually specified the list of RBC genes only as the possible source of background.

After correction, background RBC signal was reduced, but to different extents. With manual correction, all background signals in all cell clusters except RBC were removed, but with auto, I still see some signal in 1 specific island.

So my questions are:

  1. In this case, which is the 'better' correction method? Would you suggest the automated or manual method?
  2. As I mentioned, there's an island which still expresses some RBC genes after the automated method. That island strangely does not express any known immune cell marker but expresses CD45, which is a pan-immune cell marker. The same island / cluster of cells can be seen in the manual correction as well (circled in blue). Could these potentially be doublets / multiplets? In this case, do you recommend doublet removal first before removing ambient RNA?
  3. In the case of PBMCs, what other gene sets do you recommend to remove ambient RNA? As this is sequenced on a 5'-GEX kit, the capture of TCR / Ig genes is a bit poor, as I could not really get good expression of these genes, so they're not good candidates. I tried using canonical markers like the CD3 group for T-cells, but the results are not so impressive.

Hope to hear back from you soon. Thank you very much.

[Attached image: SoupX_test]

Using SoupX for dropseq data

Hi,
I would like to use SoupX for my dropseq data. Is it possible to just plug the dgTMatrix to estimate Soup?

Best,
Yolanda

decontaminating pancreatic dataset

Hi,
I am using SoupX to decontaminate a dataset of pancreatic cells in which acinar enzymes are contaminating non-acinar cells. I have used soup-specific genes to determine the fraction of contamination and correct the expression profile as follows:

WT_36Dir<- c("/local/ljmartinezv/sc_pancreas_M_Serrano/Final_analysis/WT/AL4936/")
WT_36_CellID <- read.table('WT_36_CELLS', header = FALSE, sep= '\t')
WT_36 <- load10X(dataDir = WT_36Dir, cellIDs = WT_36_CellID$V1, keepDroplets = TRUE)
WT_36 <- estimateSoup(WT_36)

Soup specific genes

Soup_genes_36 <- head(WT_36$soupProfile[order(WT_36$soupProfile$est, decreasing = TRUE), ], n = 50)
Soup_genes_36 <- rownames(Soup_genes_36)

Estimating non-expressing cells

useToEst_36 = estimateNonExpressingCells(WT_36, nonExpressedGeneList = list(Soup_genes_36))

Calculating the contamination fraction

WT_36 <- calculateContaminationFraction(WT_36, list(Soup_genes_36), useToEst = useToEst_36)

estimated global contamination fraction of 37.60%

Correcting expression profile

WT_36_decont <- adjustCounts(WT_36)

DropletUtils:::write10xCounts("./WT_36Counts", WT_36_decont)

Does that look fine to you?
Thanks in advance,
Jaime.

NaNs in soupMatrix with custom count matrix

I am using gene x cell count matrices as input in the following way:

tod <- read.delim(
  "../0.Data/SoupX_input/V_1/raw_gene_bc_matrices/GRCh38/matrix.csv",
  row.names = 1, sep="\t")  

toc <- read.delim(
  "../0.Data/SoupX_input/V_1/filtered_gene_bc_matrices/GRCh38/matrix.csv",
  row.names = 1, sep="\t", header=TRUE)

tod <- as.matrix(tod)
toc <- as.matrix(toc)

channel1 = SoupChannel(tod, toc, channelName="Channel1", keepDroplets = TRUE)

scl = SoupChannelList(list(channel1))

There are no errors or warnings; however, in the scl object, the scl$soupMatrix matrix contains only NaN, whereas the scl$soupMatrix matrix of the demo has numbers.

As an alternative, I created a barcodes.tsv, genes.tsv, and matrix.mtx files, and loaded them using:

scl = load10X(dataDirs)

The files get loaded, however the soupMatrix object in scl is also full of NaN.

@constantAmateur do you have any insights on how to solve this?

Thank you
A

toc data in strainChannel.R

First, thank you for making SoupX -- the method is elegant, and it seems to work very nicely!

I couldn't get strainChannel to work until I made this little fix to make sure that toc was actually defined inside the function:

diff --git a/R/strainChannel.R b/R/strainChannel.R
index 82a4a09..d4f36d3 100644
--- a/R/strainChannel.R
+++ b/R/strainChannel.R
@@ -20,6 +20,7 @@ strainChannel = function(tod,cellIdxs,nonExpressedGeneList,soupRange=c(0,10),...
   trueCellExpression = strainCells(tod[,cellIdxs],cellRhos,soupProfile)
   #And also calculate the ratio of the observed counts to the soup.  Kept un-logged so it remains sparse
   #We should really convert a bunch of 0s to NaNs, but don't do this to save space.
+  toc <- tod[, cellIdxs]
   expressionRatio = t(t(toc)/colSums(toc))
   expressionRatio@x = expressionRatio@x/soupProfile[expressionRatio@i+1,'est']
   #Now return everything that we've calculated
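
The line that follows the patch relies on the idiom t(t(toc)/colSums(toc)) to turn counts into per-cell fractions. A minimal base R sketch on a made-up dense matrix (the package code operates on a sparse matrix instead) shows what that expression computes:

```r
# Made-up genes x cells count matrix.
toc <- matrix(c(2, 0,
                6, 5,
                2, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("c1", "c2")))

# t(toc) has one row per cell; dividing by colSums(toc) recycles one total per
# row, so each cell is divided by its own total. Transposing back restores
# the genes x cells orientation.
expressionRatio <- t(t(toc) / colSums(toc))

print(expressionRatio)
print(colSums(expressionRatio))  # each column sums to 1
```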

Error when loading data using SoupX

Dear Dr. Matthew Young,

It is great to have a way to reduce ambient mRNA in scRNA-seq analysis. I tried to read my 10x Genomics scRNA-seq data as shown below, but encountered an error. Could you please take a look and give me some tips? Is it because I am loading a folder of aggregated samples produced with cellranger aggr? Thank you so much.

#===#

dataDirs=c("/scRNAseq/BeforeForceCell/Mixtured/outs")
scl = load10X(dataDirs)
Loading data for 10X channel Channel1 from /scRNAseq/BeforeForceCell/Mixtured/outs
Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) :
'NA' indices are not (yet?) supported for sparse Matrices

#===#

Best,
Qi

Error with createCleanedSeurat function

I get the following error when I try to create a cleaned Seurat object:

Error in createCleanedSeurat(scl) :
no slot of name "calc.params" for this object of class "Seurat"

plate-based scRNAseq data

Hi SoupX team!

  1. I would like to use SoupX on plate-based scRNAseq where we included empty wells (so that we can use them similar to empty droplets). I guess there is no reason to assume that the model would not work on this type of data?

  2. Attempting this, I called the SoupChannelList constructor directly with my data (as the 10x input handling function is not appropriate here) and I got an error in inferNonExpressedGenes().

rm(list=ls())
library(Matrix)
library(Seurat)
library(SoupX)
dir_in <- my path
dir_our <- my path
cnts <- t(readMM(paste0(dir_in, "counts_adata_proc.mtx")))
obs <- read.csv(paste0(dir_in, "obs_adata_proc.csv"), as.is = TRUE)
chips <- unique(obs$chip_id) # The plates.
iscell <- Matrix::colSums(cnts) >= 5000 # Cut off for empty wells (this is higher than in 10x data).
scl <- SoupChannelList(
       channels=lapply(chips[1:2], function(chip){
       SoupChannel(tod = as.matrix(cnts[,obs$chip_id==chip]),
                 toc = as.matrix(cnts[,obs$chip_id==chip & iscell]),
                 channelName = chip,
                 soupRange = c(0,5000), # Cut off for empty wells (this is higher than in 10x data).
                 keepDroplets = TRUE)
   }))
scl <- inferNonExpressedGenes(scl)

Inferring non-expressed genes for channel chip1
Error in split.default(rat@x, rownames(rat)[rat@i + 1]) :
group length is 0 but data length > 0

Do you have any intuition as to why this could happen? I cannot share the data unfortunately.
Session info for this example:

sessionInfo()

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin17.5.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SoupX_0.2.3 Seurat_2.3.2 cowplot_0.9.2 ggplot2_2.2.1 Matrix_1.2-14
loaded via a namespace (and not attached):
[1] diffusionMap_1.1-0 Rtsne_0.13 VGAM_1.0-5 colorspace_1.3-2 ggridges_0.5.0
[6] class_7.3-14 modeltools_0.2-21 mclust_5.4 htmlTable_1.11.2 base64enc_0.1-3
[11] proxy_0.4-22 rstudioapi_0.7 DRR_0.0.3 bit64_0.9-7 flexmix_2.3-14
[16] prodlim_2018.04.18 mvtnorm_1.0-8 lubridate_1.7.4 ranger_0.10.1 codetools_0.2-15
[21] splines_3.5.0 R.methodsS3_1.7.1 mnormt_1.5-5 robustbase_0.93-0 knitr_1.20
[26] tclust_1.4-1 RcppRoll_0.3.0 jsonlite_1.5 Formula_1.2-3 caret_6.0-80
[31] ica_1.0-2 broom_0.4.4 ddalpha_1.3.3 cluster_2.0.7-1 kernlab_0.9-26
[36] png_0.1-7 R.oo_1.22.0 sfsmisc_1.1-2 compiler_3.5.0 backports_1.1.2
[41] assertthat_0.2.0 lazyeval_0.2.1 lars_1.2 acepack_1.4.1 htmltools_0.3.6
[46] tools_3.5.0 bindrcpp_0.2.2 igraph_1.2.1 gtable_0.2.0 glue_1.2.0
[51] RANN_2.5.1 reshape2_1.4.3 dplyr_0.7.4 Rcpp_0.12.16 trimcluster_0.1-2
[56] gdata_2.18.0 ape_5.1 nlme_3.1-137 iterators_1.0.9 fpc_2.1-11
[61] lmtest_0.9-36 psych_1.8.4 timeDate_3043.102 gower_0.1.2 stringr_1.3.1
[66] irlba_2.3.2 gtools_3.5.0 DEoptimR_1.0-8 zoo_1.8-1 MASS_7.3-50
[71] scales_0.5.0 ipred_0.9-6 doSNOW_1.0.16 parallel_3.5.0 RColorBrewer_1.1-2
[76] yaml_2.1.19 reticulate_1.8 pbapply_1.3-4 gridExtra_2.3 segmented_0.5-3.0
[81] rpart_4.1-13 latticeExtra_0.6-28 stringi_1.2.2 foreach_1.4.4 checkmate_1.8.5
[86] caTools_1.17.1 lava_1.6.1 geometry_0.3-6 dtw_1.20-1 SDMTools_1.1-221
[91] rlang_0.2.0 pkgconfig_2.0.1 prabclus_2.2-6 bitops_1.0-6 lattice_0.20-35
[96] ROCR_1.0-7 purrr_0.2.4 bindr_0.1.1 recipes_0.1.2 htmlwidgets_1.2
[101] bit_1.1-13 tidyselect_0.2.4 CVST_0.2-2 plyr_1.8.4 magrittr_1.5
[106] R6_2.2.2 snow_0.4-2 gplots_3.0.1 Hmisc_4.1-1 dimRed_0.1.0
[111] withr_2.1.2 pillar_1.2.2 foreign_0.8-70 mixtools_1.1.0 fitdistrplus_1.0-9
[116] survival_2.42-3 scatterplot3d_0.3-41 abind_1.4-5 nnet_7.3-12 tsne_0.1-3
[121] tibble_1.4.2 hdf5r_1.0.0 KernSmooth_2.23-15 grid_3.5.0 data.table_1.11.2
[126] FNN_1.1 ModelMetrics_1.1.0 metap_0.9 digest_0.6.15 diptest_0.75-7
[131] tidyr_0.8.0 R.utils_2.6.0 stats4_3.5.0 munsell_0.4.3 magic_1.5-8

Thanks for your help!
David
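
The split.default error quoted above can be reproduced in base R whenever the grouping vector is empty, which is what indexing NULL row names produces. Since readMM returns a matrix without dimnames, missing gene names are one possible explanation; that this is what happened here is an assumption, and the objects below are made-up stand-ins rather than SoupX internals.

```r
# A matrix without dimnames, as a stand-in for the object returned by readMM.
rat <- matrix(c(1, 0, 3, 4), nrow = 2)
vals <- c(1, 3, 4)  # stand-in for the non-zero values rat@x
rows <- c(1, 1, 2)  # stand-in for the row indices rat@i + 1

# rownames(rat) is NULL, so indexing it yields NULL and split() has no groups.
msg <- tryCatch(split(vals, rownames(rat)[rows]),
                error = function(e) conditionMessage(e))
print(msg)

# With gene names attached (hypothetical names), the same call succeeds.
rownames(rat) <- c("geneA", "geneB")
grouped <- split(vals, rownames(rat)[rows])
print(grouped)
```

In the script above, that would mean assigning gene names to rownames(cnts) and well identifiers to colnames(cnts) before building the SoupChannel objects.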

PlotMarkerDistribution Error

Hello,
I am trying to use the plotMarkerDistribution function to infer the non-expressed gene list. However, I keep getting an error regarding indexing. I believe the error arises because I am replacing the toc matrix in the SoupChannel with the filtered Seurat count matrix, which has fewer genes than the raw count matrix.
I then tried filtering out only the cell IDs in the Cell Ranger filtered matrix that are not used in the Seurat object, so that the gene count matches the ambient profile (the soup). But then the graphs produced by plotMarkerMap change drastically.
Can I reduce the genes in the raw matrix to those in the Seurat object, or would that bias the analysis?

Thanks,
Devika

\rho_c MLE derivation

Can you please tell me how to derive the MLE of \rho_c in formula (5), section 6.2.2, of the manuscript?

I really appreciate your help.
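
Formula (5) of the manuscript is not reproduced on this page, so the following is only a sketch under an assumed model, not the authors' derivation. It assumes that, for a set G of genes not expressed in cell c, the observed counts n_{g,c} come only from the soup and are Poisson distributed with mean \rho_c N_c b_g, where N_c is the total UMI count of the cell and b_g is the fraction of the soup made up by gene g. All of these symbols are assumptions introduced for illustration.

```latex
% Assumed model: counts of non-expressed genes in cell c come only from the soup.
n_{g,c} \sim \mathrm{Poisson}(\rho_c N_c b_g), \qquad g \in G

% Log-likelihood as a function of \rho_c.
\log L(\rho_c) = \sum_{g \in G} \left[ n_{g,c} \log(\rho_c N_c b_g) - \rho_c N_c b_g - \log(n_{g,c}!) \right]

% Set the derivative to zero and solve for \rho_c.
\frac{\partial \log L}{\partial \rho_c}
  = \frac{1}{\rho_c} \sum_{g \in G} n_{g,c} - N_c \sum_{g \in G} b_g = 0
\quad \Longrightarrow \quad
\hat{\rho}_c = \frac{\sum_{g \in G} n_{g,c}}{N_c \sum_{g \in G} b_g}
```

Under these assumptions the estimate is the observed count of the non-expressed genes divided by the count expected if the cell contained nothing but soup.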

Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins

Hey Matthew,

thanks for the great package. I'm looking for a simple and safe way to implement SoupX package as part of a standard single cell analysis pipeline. Steps would be:

  1. Cell Ranger [Counting and clustering]
  2. SoupX [Soup correction]
  3. Seurat [Analysis]

This means all steps have to be automated, and it has to be reasonably fast. Therefore, I would like a rough (under)estimate of the contamination fraction without biological assumptions and manual tweaking. I largely read the vignette and the manuscript, and I understand your points on manual inspection and gene selection. That being said, I need automation. As you are much more experienced in this topic, I would appreciate your opinion:

  1. Cell Ranger does clustering and differential gene expression analysis (if selected). Would it be possible to use these genes [Cell Ranger output folder]/analysis/diffexp/graphclust/differential_expression.csv and clusters [Cell Ranger output folder]/analysis/clustering/graphclust/clusters.csv to reliably estimate the contamination fraction? If so, what would be your approach?

  2. Additionally, could the total read counts in empty droplets VS. the total read counts in cell containing droplets be used to estimate contamination? (This is probably discussed, and I overlooked it)

  3. Alternatively, would simply adding ERCC spike-ins to the single cell suspension (right before loading it to the 10X controller) be a quick experimental fix? This would generate a set of genes that should be expressed in none of the cells.


On another note, I found that sc$isV3 checks the 10x kit version. Now that all experiments run on v3+, what is the difference in how these are processed? v3 seems to use some information from the soup to call cells.

Apologies for questions stemming from my ignorance & thanks for reading.
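
For reference, the package introduction above notes that from v1.3.0 autoEstCont estimates the contamination fraction automatically and that clustering is loaded automatically for cellranger-mapped 10X data. A minimal sketch of an unattended run under those assumptions follows. run_soupx and its argument name are hypothetical, and the function is only defined here, not executed, since running it needs SoupX and a real Cell Ranger output folder.

```r
# Sketch of an unattended run. Nothing is executed until run_soupx() is called,
# and calling it requires the SoupX package plus a Cell Ranger output folder.
run_soupx <- function(cellrangerDir) {
  sc <- SoupX::load10X(cellrangerDir)  # loads counts and Cell Ranger's clusters
  sc <- SoupX::autoEstCont(sc)         # automatic contamination estimate
  SoupX::adjustCounts(sc)              # corrected genes x cells count matrix
}
```

The returned matrix could then be passed on to the analysis step, for example to Seurat's CreateSeuratObject.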

Spurious DEGs introduced by SoupX?

I introduced SoupX into my pipeline after finding that certain genes were ubiquitously differentially expressed between conditions across all cell types in my dataset and suspecting that this was due to ambient RNA. SoupX greatly improved the consistency of clustering across different resolutions after integrating the samples. However, I have noticed that many of the new DEGs appear to be spurious - for example, Plp1, a gene that is expressed specifically in oligodendrocytes, is now DE in microglia between conditions - expressed in 64% and 75.5% of microglia in the two conditions respectively!

Plp1, and the other spurious DEGs, are highly expressed in the soup and among those most adjusted by SoupX. I suspect that these DEGs are introduced due to inconsistencies in processing between samples - a larger fraction of the soup was (presumably) removed in some samples than others. Comparing my results with the literature, I believe that previously the DEGs were genuinely DE between conditions, but not necessarily DE in those specific cells, whereas now some of the DEGs are neither genuinely DE between conditions nor genuinely expressed in those cells.

I used the automated pipeline and the estimated global rho for each of my samples was 4-6%. Clearly this is an underestimate as there is a substantial amount of ambient RNA remaining and so I am looking into the manual estimation method. However, I'm not sure how to guarantee that all of the samples are adjusted to an equivalent degree. I could set the contamination fraction to be the same for every sample, but I'm not sure whether this would be better or worse as some samples may contain a higher proportion of ambient RNA than others.

I suspect that part of the issue may be that some empty drops contained a similar amount of RNA as some real cells and so were not correctly filtered out by CellRanger. I usually remove these after SoupX. Will removing these cells and then using DropletUtils:::write10xCounts to replace the files in the outs/filtered_feature_bc_matrix folder be sufficient to provide the correct inputs for SoupX?

I'm currently trying different processing methods to see how the output is affected, but in the meantime I would really like to know whether you have encountered the spurious differential expression problem before and if you have any advice on how to resolve it.

Presence of ambient RNA after correction?

Hello,

Thank you for the package and the detailed vignette!

I followed the steps to estimate and correct for ambient RNA using a set of genes (manual estimation). I work with tumor samples (tumor cells + normal cells from the microenv).

I chose to use 3 sets of genes to manually estimate the contamination. These genes should be expressed exclusively in a subset of the cells:

HBgenes=c("HBB", "HBA2") #-> expressed in erythrocytes
NORgenes=c("PHOX2B") #-> a transcription factor involved in the development of noradrenergic neuron populations
ENDOgenes=c("ENG", "PTPRB") #->expressed in endothelial cells

Here's a look at the data before running SoupX, I just plot a tsne + the expression of my favourite genes (I used Seurat).

image1 tsne geneExp TD2

We can clearly see the contamination, especially with HBB which seems to be expressed within the PHOX2B+ population.

First of all, here's the output of plotMarkerMap to check the expression of contaminating genes:

image2 plotMarkerMap TD2

I have two questions here:

  1. Expression of PHOX2B seems to be slightly different from the background (light reddish colour) and only a few cells are outlined in green indicating a significant difference. Does this mean we are probably dealing with a strong PHOX2B contamination from the "soup"?

  2. For the ENG gene, log2Ratio is very high everywhere but only a few cells show a statistically significant difference (outlined in green). I wonder how we should interpret the cells that don't cluster together (no green outline + distributed in the middle). Could they be doublets? This is perhaps beyond the scope of SoupX usage, but if you have any thoughts on this, I'd be glad to hear them!

Finally, here's the output of plotMarkerMap to show the cells on which the estimations are done:

image3 useToEst TD2

and the reported contamination fraction:
Estimated global contamination fraction of 9.53%

After running SoupX, I carry on with the analysis using the corrected counts (dimensionality reduction, normalization, clustering, ...). The first thing I noticed is background expression of the HBB gene away from what seems to be the true cluster of erythrocytes (cluster 4). So I still suspect the presence of contamination.

image6 umap TD2
image4 HBB umap TD2
image5 HBB exp TD2

  1. Am I correct to suspect the persistence of contamination in some clusters? Or is this level of expression too low to be worried about?

Many thanks in advance for the help!
Best,
Amira

Loading Data using load10X function

Hello!

I've been trying to load my data into R, however because I received the cellranger output through a dropbox, I don't think the function can establish a connection to read the files. Could someone who has successfully loaded data through this function share with me their file names and what their output looked like for cellranger V3?
Here's my session info and errors:

dataDirs = c('E:/Casey/data/twentyfour_Month_Raw_Data/twenty_four_month_data/raw_data_matrix_included')
scl = load10X(dataDirs, keepDroplets = TRUE)
Loading data for 10X channel Channel1 from E:/Casey/data/twentyfour_Month_Raw_Data/twenty_four_month_data/raw_data_matrix_included
Error in open.connection(file) : cannot open the connection
In addition: Warning message:
In open.connection(file) :
cannot open file 'E:/Casey/data/twentyfour_Month_Raw_Data/twenty_four_month_data/raw_data_matrix_included/raw_feature_bc_matrix/matrix.mtx.gz': No such file or directory

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Thanks!
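
The warning above names the file load10X looked for: raw_feature_bc_matrix/matrix.mtx.gz inside the data directory. The base R sketch below checks a folder for the usual Cell Ranger v3 layout before loading. check_10x_layout is a hypothetical helper, not a SoupX function, and the list of expected files is an assumption based on standard Cell Ranger v3 output rather than on the SoupX source.

```r
# Hypothetical helper: report which of the expected Cell Ranger v3 files exist.
check_10x_layout <- function(dataDir) {
  subdirs <- c("raw_feature_bc_matrix", "filtered_feature_bc_matrix")
  files <- c("matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz")
  # Every combination of sub-directory and file name, three files per folder.
  paths <- file.path(dataDir, rep(subdirs, each = length(files)), files)
  data.frame(path = paths, exists = file.exists(paths))
}

# Demonstration on a temporary folder that contains only one expected file.
demoDir <- tempfile("tenx_demo_")
dir.create(file.path(demoDir, "raw_feature_bc_matrix"), recursive = TRUE)
invisible(file.create(file.path(demoDir, "raw_feature_bc_matrix", "matrix.mtx.gz")))

layout <- check_10x_layout(demoDir)
print(layout)
```

Rows with exists equal to FALSE point to files that are missing, renamed, or unpacked (for example matrix.mtx without the .gz suffix) relative to what the loader expects.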

Error correcting expression profile using adjustCounts

Hi, thanks for the nice tool to correct ambient RNA expression!

I've been following the vignette, and everything went well up to the correction of the expression profile using adjustCounts; rho was estimated fine.

However upon running adjustCounts I get the following error message:

Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Warning message in tgt * ws <= bucketLims:
"longer object length is not a multiple of shorter object length"
Warning message in bucketLims/ws:
"longer object length is not a multiple of shorter object length"
Error: Matrices must have same dimensions in e1 - Matrix(e2)
Traceback:

  1. adjustCounts(sc)
  2. adjustCounts(tmp, clusters = FALSE, method = method, roundToInt = FALSE,
    . verbose = verbose, tol = tol, pCut = pCut)
  3. out - do.call(cbind, lapply(seq(ncol(out)), function(e) alloc(expSoupCnts[e],
    . out[, e], soupFrac)))
  4. out - do.call(cbind, lapply(seq(ncol(out)), function(e) alloc(expSoupCnts[e],
    . out[, e], soupFrac)))
  5. callGeneric(e1, Matrix(e2))
  6. eval(call, parent.frame())
  7. eval(call, parent.frame())
  8. e1 - Matrix(e2)
  9. e1 - Matrix(e2)
  10. dimCheck(e1, e2)
  11. stop(gettextf("Matrices must have same dimensions in %s", deparse(sys.call(sys.parent()))),
    . call. = FALSE, domain = NA)

I don't know where the problem lies unfortunately... Any help troubleshooting would be appreciated.

I tried checking the structure of my CellSoup object (using str(cs)) and got the following output:

List of 6
$ toc : num [1:22390, 1:2951] 0 0 0 0 0 0 1 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:22390] "Xkr4" "Gm1992" "Rp1" "Sox17" ...
.. ..$ : chr [1:2951] "EXT145_AAACCTGCACGGCGTT" "EXT145_AAACCTGCAGACGCAA" "EXT145_AAACCTGCATCGGAAG" "EXT145_AAACCTGTCATGGTCA" ...
$ metaData :'data.frame': 2951 obs. of 5 variables:
..$ nUMIs : num [1:2951] 4837 4906 13139 5049 5500 ...
..$ clusters: chr [1:2951] "Cortico" "Lacto" "EC2" "Lacto" ...
..$ UMAP_1 : num [1:2951] 8.93 1.68 10.17 3.26 0.7 ...
..$ UMAP_2 : num [1:2951] -8.53 1.15 3.89 -3.95 -2.23 ...
..$ rho : num [1:2951] 0.018 0.018 0.018 0.018 0.018 0.018 0.018 0.018 0.018 0.018 ...
$ nDropUMIs : Named num [1:737280] 7 0 1 1 1 1 15 1 1 43 ...
..- attr(*, "names")= chr [1:737280] "AAACCTGAGAAACCAT-1" "AAACCTGAGAAACCGC-1" "AAACCTGAGAAACCTA-1" "AAACCTGAGAAACGAG-1" ...
$ soupProfile:'data.frame': 28692 obs. of 2 variables:
..$ est : num [1:28692] 1.47e-06 1.47e-06 0.00 0.00 0.00 ...
..$ counts: num [1:28692] 4 4 0 0 0 13 0 82 72 0 ...
$ DR : chr [1:2] "UMAP_1" "UMAP_2"
$ fit :List of 6
..$ dd :'data.frame': 1300 obs. of 14 variables:
.. ..$ gene : Factor w/ 100 levels "1810011O10Rik",..: 37 98 30 25 66 13 89 51 8 11 ...
.. ..$ passNonExp : logi [1:1300] TRUE TRUE TRUE TRUE TRUE TRUE ...
.. ..$ rhoEst : num [1:1300] 0 0 0.0435 0 0 ...
.. ..$ rhoIdx : int [1:1300] 1 1 8 1 1 1 7 1 8 1 ...
.. ..$ obsCnt : num [1:1300] 0 0 1 0 0 0 2 0 1 0 ...
.. ..$ expCnt : num [1:1300] 32.4 37.6 23 34.1 33.4 ...
.. ..$ isExpressedFDR: num [1:1300] 1.54e-11 2.57e-13 4.39e-07 3.97e-12 7.00e-12 ...
.. ..$ geneIdx : int [1:1300] 6 7 8 12 13 14 16 20 25 29 ...
.. ..$ tfidf : num [1:1300] 3.45 3.45 3.42 3.36 3.36 ...
.. ..$ soupIdx : int [1:1300] 1414 1211 2027 1345 1371 1982 557 1115 2075 1478 ...
.. ..$ soupExp : num [1:1300] 8.22e-05 9.54e-05 5.83e-05 8.66e-05 8.47e-05 ...
.. ..$ useEst : logi [1:1300] TRUE TRUE TRUE TRUE TRUE TRUE ...
.. ..$ rhoHigh : num [1:1300] 0.114 0.0982 0.2426 0.1082 0.1106 ...
.. ..$ rhoLow : num [1:1300] 0 0 0.0011 0 0 ...
..$ priorRho : num 0.05
..$ priorRhoStdDev: num 0.1
..$ posterior : num [1:1001] 0 5.96 7.25 8.05 8.64 ...
..$ rhoEst : num 0.018
..$ rhoFWHM : num [1:2] 0.01 0.043

- attr(*, "class")= chr [1:2] "list" "SoupChannel"

At first glance everything looks fine to me? Googling the error message didn't help me further..

Thanks!
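
The str() output above lists 22390 genes in toc but 28692 rows in soupProfile. A mismatch of that kind would be consistent with the dimension error, although that link is an assumption. The hypothetical helper below (not a SoupX function) compares the two gene sets on a toy object with the same two fields. It assumes that soupProfile carries gene names as row names.

```r
# Hypothetical helper (not part of SoupX): compare the genes in the table of
# counts with the genes in the soup profile.
compare_genes <- function(sc) {
  tocGenes <- rownames(sc$toc)
  soupGenes <- rownames(sc$soupProfile)
  list(nToc = length(tocGenes),
       nSoup = length(soupGenes),
       identical = identical(tocGenes, soupGenes),
       onlyInSoup = setdiff(soupGenes, tocGenes))
}

# Toy stand-in for a SoupChannel: two genes in toc, three in soupProfile.
sc <- list(
  toc = matrix(1:4, nrow = 2,
               dimnames = list(c("Xkr4", "Rp1"), c("cell1", "cell2"))),
  soupProfile = data.frame(est = c(0.5, 0.3, 0.2), counts = c(5, 3, 2),
                           row.names = c("Xkr4", "Gm1992", "Rp1"))
)

res <- compare_genes(sc)
print(res)
```

If the two gene sets differ on the real object, the table of counts was probably filtered (or taken from another source) after the soup profile was estimated.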

autoEstCont always reduces estimation

I have been using the estimateNonExpressingCells and calculateContaminationFraction to calculate the contamination fraction from nuclei by using mitochondrial genes. However, this occasionally gave very high contamination estimates. When the autoEstCont function was added, I began using that with the prior rho equal to the estimate from using mitochondrial genes. This seemed to solve the high estimate issue.

This was working well until I switched to using v1.4.5. Now, the autoEstCont function always lowers the estimated contamination, sometimes drastically. I would expect some samples to go up and some to go down after using the autoEstCont function. The reduction in the estimate also seems to scale linearly when adjusting the prior SD. Has anyone else had this issue?

plotMarkerDistribution(sc) error

Can you please tell me what are the possible bugs with my code?

plotMarkerDistribution(sc)


Error in estimateNonExpressingCells(sc, as.list(gns), ...): Invalid cluster specification.  clusters must be a named vector with all column names in the table of counts appearing.
Traceback:

1. plotMarkerDistribution(sc)
2. inferNonExpressedGenes(sc, ...)
3. estimateNonExpressingCells(sc, as.list(gns), ...)
4. stop("Invalid cluster specification.  clusters must be a named vector with all column names in the table of counts appearing.")
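
The error text states the requirement directly: clusters must be a named vector, and every column name of the table of counts must appear among its names. A base R sketch of that check on made-up data follows. clusters_cover_cells is a hypothetical helper, not part of SoupX.

```r
# Hypothetical check (not part of SoupX) mirroring the wording of the error:
# clusters must be named, and every column name of toc must appear in the names.
clusters_cover_cells <- function(clusters, toc) {
  !is.null(names(clusters)) && all(colnames(toc) %in% names(clusters))
}

# Made-up table of counts with three cells.
toc <- matrix(0, nrow = 2, ncol = 3,
              dimnames = list(c("g1", "g2"), c("c1", "c2", "c3")))

unnamed <- c("A", "A", "B")
named <- setNames(unnamed, colnames(toc))

print(clusters_cover_cells(unnamed, toc))  # FALSE: the vector has no names
print(clusters_cover_cells(named, toc))    # TRUE: all cells are covered
```

The package introduction above mentions setClusters as the way to supply clustering explicitly when it is not loaded automatically.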

Custom count matrices

Hi Matthew,
I just started using your method and was wondering whether it is possible to load custom count matrices. I am using 10X samples, but for some benchmarks I have my own matrices instead of cellranger output. Is that possible? Thanks.

Nice work

E

Removal of general contaminants

Hello,

Thanks for creating the SoupX, it seems like a nice package for accounting for the background noise in 10X data.
I have a small question though. I have two single cell samples of a tissue, one from a healthy and one from a diseased patient, and I know a bunch of cell-type specific genes that contaminate the "soup". I could successfully remove them with your algorithm.
However, I also see in each sample a number of general contaminants (like lncRNAs and splicing regulators, most of which are also highly expressed) that are sample-specific. I assume that they are contaminants, as I see them in the "empty droplets". When I look for DE genes between the samples, I of course find those contaminants at the top of the lists for all the clusters, but as I said, I assume they are just artifacts.
I haven't fully understood how to deal with those general contaminant genes that are present in all cell types in the SoupX. Do I just run calculateContaminationFraction() on the list of these genes on all the cells I have? I tried that, but it didn't seem to do much. Or is there a different way to handle this?

Thanks in advance!

Installation failure with STAN

Hi,
Thank you for writing this software. My analysis pipeline requires removal of cells with high contamination, so I use the option cellSpecificEstimates = T.
I updated to R version 4.0.1 and reinstalled SoupX using install.packages('SoupX'), which does not include the option. So then I tried devtools::install_github("constantAmateur/SoupX",ref='STAN'), but this results in an error:

Downloading GitHub repo constantAmateur/SoupX@STAN
Skipping 2 packages not available: ggplot2, Seurat
   checking DESCRIPTION meta-information ...m/5fkddq9119v31_40dqj5pf2m0000gn/T/RtmpS3cUlW/remotes2b0e89d3ad0/constantAmateur-SoupX-a01ddb0/DESCRIPTION’ ...
* installing *source* package ‘SoupX’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
sh: line 1: 11153 Killed: 9               R_TESTS= '/Library/Frameworks/R.framework/Resources/bin/R' --no-save --no-restore --no-echo 2>&1 < '/var/folders/lm/5fkddq9119v31_40dqj5pf2m0000gn/T//Rtmp6TouEu/file2b877e5ca7e5'
ERROR: lazy loading failed for package ‘SoupX’
* removing ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/SoupX’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/SoupX’
Error: Failed to install 'SoupX' from GitHub:
  (converted from warning) installation of package ‘/var/folders/lm/5fkddq9119v31_40dqj5pf2m0000gn/T//RtmpS3cUlW/file2b0e66598fff/SoupX_1.4.5.tar.gz’ had non-zero exit status

This is on macOS Catalina with R version 4.0.1, I got the error in RStudio as well as R.

Integrating result with Scanpy

Hello!

In your paper you mention that there's a way to integrate SoupX's results with Seurat and Scanpy.

I noticed that there is a function (createCleanedSeurat.R) for Seurat. Is there such a function for Scanpy?

Thanks!

Error loading 10x data

Hello, I am pretty new to this R world, but I am trying to clean up some data and found SoupX to be the solution. I am trying to follow the vignette provided with the package using my own data and as soon as I tried to run "sc = load10X(dataDirs)", I get the following error:

sc = load10X(dataDirs)
Loading raw count data
Error in full.data[[1]] : subscript out of bounds

Can someone help me figure out the problem, please?
Thanks

Request for users to specify more ggplot arguments

Hi there,

Thanks for developing SoupX, and it has been a pleasure to try it out so far! Could you allow the users to specify ggplot arguments in the self-defined functions, such as point size, alpha, etc? This would be super helpful!
