Hey Matthew, thanks for the great package. I'm looking for a simple

Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins (soupx issue, closed)

vertesy commented on August 22, 2024

Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins

Comments (2)

constantAmateur commented on August 22, 2024

I suggest you have a look at the devel branch of the code which includes a routine to automate contamination estimation. This is still in the devel branch as it has not been widely tested, so I would appreciate your feedback on its performance if you do use it.

The automated workflow would be:

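sc = load10X(dat)                            # dat: path to the Cell Ranger output directory
contEst = autoEstCont(sc)                    # automated estimate of the contamination fraction
sc = setContamination(sc,contEst$rhoEst[1])  # store the estimated fraction on the channel
correctedCounts = adjustCounts(sc)           # background-corrected count matrix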

Extra steps are needed if you load your data in a different way.
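For readers who load their data some other way, a rough sketch of those extra steps follows. It assumes the released SoupX API, in which SoupChannel() is built from the raw (all droplets) and filtered (cells only) count matrices, setClusters() attaches cluster labels, and autoEstCont() returns the channel with the contamination fraction already set. That may differ from the devel-branch behaviour shown above. The function name correctFromMatrices and its argument names are hypothetical.

```r
# Sketch only: SoupX must be installed when the function is called.
# correctFromMatrices, tod, toc and clusters are illustrative names.
correctFromMatrices = function(tod, toc, clusters) {
  # tod: genes x droplets count matrix of all droplets (including empty ones)
  # toc: genes x cells count matrix restricted to cell-containing droplets
  # clusters: named vector of cluster labels, names matching colnames(toc)
  stopifnot(identical(rownames(tod), rownames(toc)))
  stopifnot(all(colnames(toc) %in% names(clusters)))
  sc = SoupX::SoupChannel(tod, toc)
  sc = SoupX::setClusters(sc, clusters[colnames(toc)])
  sc = SoupX::autoEstCont(sc)  # assumed to store the estimate on sc (released SoupX)
  SoupX::adjustCounts(sc)      # returns the corrected count matrix
}
```

The clustering can come from any tool, as long as every cell in toc has a label.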

The alternative, if you must have automation, would be to set some fixed contamination fraction for all channels using setContamination. I would actually recommend setting this too high rather than too low (e.g. 15-20%), as the biologically important things (the marker genes) are highly expressed relative to the background and basically never removed. This is a risky approach though, so if you're doing this you should check how your results depend on the constant value you use.
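One way to check that dependence is to rerun the correction over a small grid of fixed fractions and compare the downstream results. The sketch below assumes sc is a SoupChannel that already carries clustering information (as it does after load10X on Cell Ranger output); scanContamination and the default grid of values are hypothetical choices, not part of SoupX.

```r
# Sketch only: SoupX must be installed when the function is called.
# scanContamination and the default rhos are illustrative, not part of SoupX.
scanContamination = function(sc, rhos = c(0.05, 0.10, 0.15, 0.20)) {
  corrected = lapply(rhos, function(rho) {
    # Fix the contamination fraction for every cell, then correct the counts
    SoupX::adjustCounts(SoupX::setContamination(sc, rho))
  })
  names(corrected) = paste0("rho_", rhos)
  corrected  # list of corrected count matrices, one per fixed fraction
}
```

Running the same clustering or marker-gene analysis on each element of the returned list shows whether the conclusions change with the chosen constant.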

ERCC spike-ins are a potential solution to this, but in practice it is too hard to ensure that you load the right spike-in concentration. Too little spike-in and you get no benefit; too much and you spend all your money sequencing spike-ins.

The isV3 tag doesn't do anything after the data is loaded. It is only used within load10X to select the file format appropriate to the Cell Ranger version.

mesnger commented on August 22, 2024

Hello, all.
I have a question quite similar to vertesy's.

I tried multiple gene lists, as a "single list" or a "combination of lists", for calculateContaminationFraction and got quite a broad range of outputs (from a 1.1% contamination rate to 10.44%).
I selected the gene lists from the differential_expression.csv provided by Cell Ranger and/or Seurat FindAllMarkers. But as the output varies greatly between gene lists, I could not settle on one result.

Do you have any criteria for the best input genes? Following your tutorial, I assume that a "Known Gene Marker" is your go-to choice, but when none is known, which genes should I prioritize?

A. Genes "exclusively expressed" in a subset of clusters (other clusters have values near 0)
B. Genes "highly expressed" in a subset of clusters
C. Genes with the "highest difference" in expression value compared to other clusters
D. Genes with a "highly significant p-value" calculated by some DEG method

Or should I consider all of the above? (That might not lead to a sound result with real-life data.)

Thanks for making a great tool.
Hope to get a reply soon.
