single-cell's Introduction

Single-cell

Sequencing (doi: 10.1038/s41587-020-0465-8)

10x Chromium

  1. Strongest consistent performance
  2. Less time

Smart-seq2

  1. Costs more

Setup data

barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz

  1. barcodes.tsv.gz: cell barcodes (column names of the matrix).
  2. features.tsv.gz: gene IDs and names (row names of the matrix).
  3. matrix.mtx.gz: expression data (UMI counts) stored as a sparse matrix.

These three files are generated by cellranger, which aligns reads and generates feature-barcode matrices.
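To make the roles of the three files concrete, here is a minimal sketch that writes a toy dataset in the same layout and reads it back by hand. The gene names, barcodes, and the `toy10x` directory are made up for illustration, and the files are left uncompressed for simplicity; in practice Read10X() does this work for you.

```r
library(Matrix)

# Write a toy 3-gene x 2-cell dataset in the 10x three-file layout
# (hypothetical names, uncompressed files).
tmp <- file.path(tempdir(), "toy10x")
dir.create(tmp, showWarnings = FALSE)
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 1, 2), dims = c(3, 2))
writeMM(m, file.path(tmp, "matrix.mtx"))
writeLines(c("AAAC-1", "AAAG-1"), file.path(tmp, "barcodes.tsv"))
writeLines(
  paste(c("ENSG1", "ENSG2", "ENSG3"), c("GeneA", "GeneB", "GeneC"),
        "Gene Expression", sep = "\t"),
  file.path(tmp, "features.tsv")
)

# Read the pieces back and assemble a labelled sparse count matrix.
counts <- as(readMM(file.path(tmp, "matrix.mtx")), "CsparseMatrix")
barcodes <- readLines(file.path(tmp, "barcodes.tsv"))
features <- read.delim(file.path(tmp, "features.tsv"), header = FALSE)
rownames(counts) <- features$V2  # genes are the rows
colnames(counts) <- barcodes     # cells are the columns
print(counts)
```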

library(dplyr)
library(Seurat)
library(patchwork)
# Load the PBMC dataset. Seurat stores the counts as a sparse matrix to save memory and improve speed (compare sizes with object.size()).
count <- Read10X(data.dir = "./exampledata/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <-
  CreateSeuratObject(
    counts = count,
    project = "pbmc3k",
    min.cells = 0,
    min.features = 0
  )
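The memory saving from the sparse representation can be checked with object.size(). The sketch below uses simulated data (the matrix size and the 5% non-zero fraction are assumptions chosen for illustration, not properties of the PBMC dataset).

```r
library(Matrix)

set.seed(1)
# Simulated count matrix: 2000 genes x 500 cells, about 5% non-zero entries.
dense <- matrix(0, nrow = 2000, ncol = 500)
idx <- sample(length(dense), size = 0.05 * length(dense))
dense[idx] <- rpois(length(idx), lambda = 3) + 1

# Same values stored in compressed sparse column format.
sparse <- Matrix(dense, sparse = TRUE)

dense_size <- object.size(dense)
sparse_size <- object.size(sparse)
print(dense_size, units = "MB")
print(sparse_size, units = "MB")
```

The sparse object only stores the non-zero values and their positions, so the gap widens as the matrix gets sparser.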

Count file

# counts: a gene-by-cell count matrix already loaded in R (genes as rows, cells as columns)
scRNA <- CreateSeuratObject(counts = counts)

H5 file

H5 is a data file saved in the Hierarchical Data Format (HDF). It contains multidimensional arrays of data.

sce <- Read10X_h5(filename = "xxx_matrices_h5.h5")
sce <- CreateSeuratObject(counts = sce)

H5ad file

  1. Convert the h5ad file to h5seurat.
  2. Load the Seurat object with the LoadH5Seurat() function.

library(SeuratDisk)
Convert("xxx_raw_counts.h5ad", dest = "h5seurat",
        overwrite = TRUE, assay = "RNA")
scRNA <- LoadH5Seurat("GSE153643_RAW/GSM4648565_liver_raw_counts.h5seurat")

GOAL: To ensure that only single, live cells are included in downstream analysis (doi: 10.1186/s13059-016-0888-1).

Columns of the pbmc metadata (one row per cell barcode):

orig.ident: the sample identity of each cell (automatically set as the active ident). Each sample has one orig.ident value, so the number of cells sharing an orig.ident is the number of cells in that sample.

nCount_RNA: total number of counts (UMIs) in each cell

nFeature_RNA: number of genes detected in each cell

QC and selecting cells for further analysis

Detecting the number of genes and counts in each cell

Low-quality cells or empty droplets often have very few genes.

Cell doublets or multiplets may exhibit an aberrantly high gene count. (Doublet: a droplet in droplet-based sequencing that has captured at least 2 cells. Multiplet mixing: cells from this sample mixed with cells from other labeled samples.)

The number of unique genes and the total number of molecules (UMIs/counts) are calculated automatically during CreateSeuratObject() and stored in the metadata.
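As a rough illustration of what these two metadata columns contain, the same quantities can be computed from a count matrix with base R. The toy matrix below is made up for the example; this sketches the idea rather than Seurat's internal code.

```r
# Toy count matrix: genes as rows, cells as columns (hypothetical values).
counts <- matrix(
  c(4, 1, 2,   # cell1
    7, 0, 0,   # cell2
    0, 0, 3),  # cell3
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), c("cell1", "cell2", "cell3"))
)

# Total molecules per cell (what nCount_RNA holds).
nCount_RNA <- colSums(counts)
# Number of genes with at least one count per cell (what nFeature_RNA holds).
nFeature_RNA <- colSums(counts > 0)

print(data.frame(nCount_RNA, nFeature_RNA))
```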

The cutoffs for nFeature_RNA and nCount_RNA are chosen by inspecting their distributions with VlnPlot() (https://hbctraining.github.io/scRNA-seq/lessons/04_SC_quality_control.html).

Detecting the percentage of counts from mitochondrial genes

Low-quality or dying cells often show extensive mitochondrial contamination, i.e. a higher percentage of counts from mitochondrial genes. When a cell dies, its membrane becomes permeable and cytoplasmic RNA leaks out, while RNA inside the mitochondria is retained, so the proportion of mitochondrial RNA increases. The percent.mt cutoff should therefore be set according to the tissue and the condition of the cells. For example, the cutoff for tumors is 5, while the cutoffs for liver and heart are 80 and 60, as such tissues contain more mitochondria.

Use the PercentageFeatureSet() function to calculate the percentage of mitochondrial gene counts in each cell, which helps identify dying or low-quality cells.

(Column sum of the counts for features belonging to the set / column sum for all features) * 100
(https://www.rdocumentation.org/packages/Seurat/versions/3.1.3/topics/PercentageFeatureSet)

# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
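The formula above can be reproduced by hand on a small matrix. This is a base R sketch with made-up counts and gene names, meant only to show what the percentage measures; it is not Seurat's implementation.

```r
# Toy count matrix with two mitochondrial genes (hypothetical values).
counts <- matrix(
  c(80, 15,  3,  2,   # cell1
    40, 40, 10, 10,   # cell2
     5,  5, 50, 40),  # cell3
  nrow = 4,
  dimnames = list(c("GeneA", "GeneB", "MT-CO1", "MT-ND1"),
                  c("cell1", "cell2", "cell3"))
)

# Features belonging to the set: gene names starting with "MT-".
is_mt <- grepl("^MT-", rownames(counts))

# (column sum over the set / column sum over all features) * 100
percent_mt <- colSums(counts[is_mt, , drop = FALSE]) / colSums(counts) * 100
print(percent_mt)
```

With a percent.mt cutoff of 5, only cell1 would sit at the boundary, while cell2 and cell3 would be flagged as likely low-quality.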

Visualizing QC metrics and filtering cells

# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
# The plot title reports the Pearson correlation (https://satijalab.org/seurat/reference/featurescatter)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
# Keep cells with 200 < nFeature_RNA < 2500 and percent.mt < 5
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
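To see which cells such a filter keeps, the same logical condition can be applied to a plain data frame of QC metrics. The values below are invented for illustration; the thresholds are the ones used in the subset() call above.

```r
# Hypothetical per-cell QC metrics.
meta <- data.frame(
  nFeature_RNA = c(150, 800, 3000, 1200),
  nCount_RNA   = c(300, 2500, 12000, 4000),
  percent.mt   = c(2, 3, 4, 12),
  row.names    = c("cell1", "cell2", "cell3", "cell4")
)

# Same condition as in subset(): each cell must pass all three tests.
keep <- meta$nFeature_RNA > 200 &
  meta$nFeature_RNA < 2500 &
  meta$percent.mt < 5

filtered <- meta[keep, ]
print(filtered)
```

Here cell1 fails for too few genes, cell3 for too many (a possible doublet), and cell4 for a high mitochondrial percentage.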

Normalizing the data

GOAL: To remove as much non-biological variation as possible, e.g. the various forms of bias or noise introduced by the sequencing process (https://doi.org/10.3389/fgene.2020.00041).

LogNormalize:

Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor(10,000 by default). This is then natural-log transformed using log1p.

Formula: log1p(value / colSums[cell] * scale.factor), where log1p(x) means log(1 + x).
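The formula can be worked through on a tiny matrix in base R. The counts below are made up, and the sketch only mirrors the formula as stated; it is not a substitute for NormalizeData().

```r
# Toy count matrix: genes as rows, cells as columns (hypothetical values).
counts <- matrix(
  c( 1, 3,  0,   # cell1, total = 4
    10, 0, 10),  # cell2, total = 20
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), c("cell1", "cell2"))
)
scale_factor <- 10000

# Divide each column by its total, multiply by the scale factor, then log1p.
proportions <- sweep(counts, 2, colSums(counts), "/")
normalized <- log1p(proportions * scale_factor)
print(round(normalized, 3))
```

GeneA has 1 count in cell1 and 10 in cell2, but after dividing by each cell's total the values are 2500 and 5000 per 10,000, so the tenfold raw difference shrinks to twofold before the log transform.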

SCTransform (doi: 10.1186/s13059-019-1874-1)

  1. A single scaling factor does not effectively normalize both lowly and highly expressed genes. Genes with different overall abundances exhibit distinct patterns after log-normalization, and only low/medium-abundance genes in the bottom three tiers are effectively normalized.
  2. Moreover, gene variance is also confounded with sequencing depth after log-normalization.
  3. A standard (two-parameter) negative binomial (NB) model can overfit single-cell count data.

Identification of highly variable features (feature selection)

Highly variable features are genes that show high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells and lowly expressed in others).
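One simple way to picture this is to rank genes by how much their expression varies relative to their mean. The sketch below uses a variance-to-mean ratio on made-up counts; this is a deliberately simplified stand-in and not the method Seurat's FindVariableFeatures() uses by default.

```r
# Toy count matrix: 3 genes x 4 cells (hypothetical values).
counts <- matrix(
  c(5,  0, 4,   # cell1
    5, 20, 6,   # cell2
    5,  0, 5,   # cell3
    5, 20, 5),  # cell4
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), paste0("cell", 1:4))
)

gene_mean <- rowMeans(counts)
gene_var <- apply(counts, 1, var)

# Variance-to-mean ratio: high when a gene is on in some cells and off in others.
dispersion <- gene_var / gene_mean
ranked <- sort(dispersion, decreasing = TRUE)
print(ranked)
```

GeneB, which switches between 0 and 20, ranks first, while the constant GeneA ranks last.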
