Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins (soupx issue, closed)

vertesy commented on August 22, 2024
Estimating the contamination fraction without biological assumptions using Cell Ranger's default clustering results or ERCC spike-ins

Comments (2)

constantAmateur commented on August 22, 2024

I suggest you have a look at the devel branch of the code which includes a routine to automate contamination estimation. This is still in the devel branch as it has not been widely tested, so I would appreciate your feedback on its performance if you do use it.

The automated workflow would be:

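library(SoupX)
# dat is the path to the Cell Ranger output folder
sc = load10X(dat)
# Estimate the contamination fraction automatically (devel branch routine)
contEst = autoEstCont(sc)
sc = setContamination(sc,contEst$rhoEst[1])
correctedCounts = adjustCounts(sc)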

Extra steps are needed if you load your data in a different way.

The alternative, if you must have automation, would be to set a fixed contamination fraction for all channels using setContamination. I would actually recommend setting this too high rather than too low (e.g. 15-20%), as the biologically important things (the marker genes) are highly expressed relative to the background and basically never removed. This is a risky approach though, so if you're doing this you should check how your results depend on the constant value you use.
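The point about marker genes surviving an over-estimate can be illustrated with rough arithmetic. The sketch below is not SoupX's adjustCounts algorithm. It assumes a simplified model in which the expected background for a gene in a cell is rho * nUMIs * soupFrac, and every number in it (cell size, observed counts, soup fractions, gene labels) is hypothetical.

```r
# Simplified model (an assumption, not SoupX's actual correction):
# expected soup counts for a gene = rho * nUMIs * soupFrac
nUMIs    = 5000                               # hypothetical total UMIs in one cell
observed = c(marker = 200, ambient = 6)       # hypothetical observed counts per gene
soupFrac = c(marker = 0.001, ambient = 0.01)  # hypothetical share of each gene in the soup
rhos     = c(0.05, 0.10, 0.15, 0.20)          # fixed contamination fractions to compare

# Rows are contamination fractions, columns are genes
expectedSoup = outer(rhos, soupFrac * nUMIs)
dimnames(expectedSoup) = list(paste0("rho=", rhos), names(observed))

# Counts left after subtracting the expected background, floored at zero
remaining = pmax(sweep(-expectedSoup, 2, observed, "+"), 0)
print(remaining)
```

In this toy setting, raising rho from 5% to 20% changes the highly expressed marker gene by less than one count out of 200, while the lowly expressed ambient gene drops to zero at 15% and above. Repeating downstream analyses over a grid of values like this is one way to see how much the results depend on the chosen constant.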

ERCC spike-ins are a potential solution to this, but in practice it is too hard to ensure that you load the right spike-in concentration. Too little spike-in and you get no benefit; too much and you spend all your money sequencing spike-ins.

The isV3 tag doesn't do anything after the data is loaded. It is only used within load10X to select the file format appropriate to the Cell Ranger version.

mesnger commented on August 22, 2024

Hello, all.
I have a question quite similar to vertesy's.

I tried multiple gene lists, either as a single list or as a combination of lists, with calculateContaminationFraction and got quite a broad range of outputs (contamination rates from 1.1% to 10.44%).
I selected the gene lists from the differential_expression.csv provided by Cell Ranger and/or from Seurat FindAllMarkers. But as the output varies greatly between gene lists, I could not settle on one result.

Do you have any criteria for choosing the best input genes? Following your tutorial, I assume that known marker genes are your go-to choice, but when they are not known, which genes should I prioritize?

A. Genes "exclusively expressed" in a subset of clusters (other clusters have values near 0)
B. Genes "highly expressed" in a subset of clusters
C. Genes with the "highest difference" in expression value compared to other clusters
D. Genes with "high p-value" calculated by some DEG method

Or should I consider all of the above? (This might not lead to a sound result with real-life data.)

Thanks for making a great tool.
Hope to get a reply soon.
