epigen / scrnaseq_processing_seurat

A Snakemake workflow for processing and visualizing (multimodal) sc/snRNA-seq data generated with 10X Genomics Kits or in the MTX matrix file format powered by the R package Seurat.

License: MIT License

Python 37.22% R 62.78%
bioinformatics workflow snakemake scrna-seq biomedical-data-science cite-seq sccrispr-seq 10xgenomics snrna-seq single-cell

scrnaseq_processing_seurat's Introduction

Single-cell RNA sequencing (scRNA-seq) Data Processing & Visualization Snakemake Workflow

sc/snRNA-seq Data Processing & Visualization Snakemake Workflow powered by Seurat

A Snakemake workflow for processing and visualizing (multimodal) sc/snRNA-seq data generated with 10X Genomics Kits or in the MTX file format powered by the R package Seurat.

This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details, instructions and modules, check out the project's repository. Please consider starring and sharing modules that are useful to you; this helps me prioritize my efforts!

If you use this workflow in a publication, please give credit to the authors by citing it using this DOI 10.5281/zenodo.10679327.

Workflow Rulegraph

Authors

Software

This project wouldn't be possible without the following software

Software Reference (DOI)
inspectdf https://github.com/alastairrushworth/inspectdf/
data.table https://r-datatable.com
SCTransform https://doi.org/10.1186/s13059-019-1874-1
Seurat https://doi.org/10.1016/j.cell.2021.04.048
Snakemake https://doi.org/10.12688/f1000research.29032.2

Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) can be read from the respective conda environment specifications (workflow/envs/*.yaml file) or, post execution, from the result directory (/envs/scrnaseq_processing_seurat/*.yaml). Parameters that have to be adapted depending on the data or workflow configuration are denoted in square brackets, e.g., [X].

The outlined analyses were performed using the R package Seurat (ver) [ref] unless stated otherwise.

Merge. The preprocessed/quantified samples were merged using the function Seurat::merge that concatenates individual samples and their metadata into one Seurat object.

Metadata. Metadata was extended with Seurat::PercentageFeatureSet using [X] and by recombination of existing metadata rules [X].

Guide RNA assignment. The guide RNA (gRNA) assignment was performed using protospacer call information provided by the CRISPR functionality of 10x Genomics Cell Ranger (ver) [ref], with additional filtering by UMI thresholds [X] to select high-confidence signals. Each cell was assigned all detected gRNAs and inferred KO targets. (optional) To ensure specificity of the phenotype and avoid cell multiplets, only cells with a single gRNA assignment were kept.

Split. The merged data set was split into subsets by the metadata column(s) [X].

Filtering. The cells were filtered by [X], which resulted in [X] high-quality cells with confident condition and gRNA assignment.

Pseudobulking. We performed pseudobulking of single-cell data to aggregate cells based on [metadata_columns], using the [aggregation_method] method and with a cell count threshold set at [cell_count_threshold] to ensure robust representations. Data from different modalities, including Antibody Capture, CRISPR Guide Capture, and Custom assays, were pseudobulked in the same way. Additionally, the distribution of cell counts across pseudobulked samples was visualized using a histogram and density plot. The resulting pseudobulked data and aggregated metadata were saved for downstream bulk analyses.

Normalization. Filtered count data was normalized using Seurat::SCTransform v2 [ref] with the method parameter glmGamPoi to increase computational efficiency. Other modalities [X] were normalized with Seurat::NormalizeData using method CLR (Centered Log-Ratio) and margin 2.

Cell Cycle Scoring. Cell-cycle scores were determined with the function Seurat::CellCycleScoring using gene lists for S and G2M phase provided by Seurat::cc.genes (Tirosh et al., 2015) or [gene lists].

Cell Scoring. Cell-module scores were determined using the function Seurat::AddModuleScore using [gene lists].

Correction. Filtered count data was normalized and corrected using Seurat::SCTransform [ref] with the method parameter glmGamPoi to increase computational efficiency. Identified confounders [X] were used as covariates to be regressed out.

Visualization. To visualize the metadata after each processing step inspectdf (ver) [ref] was used. For the visualization of expression, multimodal [X] data and calculated metadata like module scores, the Seurat functions RidgePlot for ridge plots, VlnPlot for violin plots, DotPlot for dot plots and DoHeatmap for heatmaps were used.

The processing, analysis and visualizations described here were performed using a publicly available Snakemake (ver) [ref] workflow [10.5281/zenodo.10679327].

Features

The workflow performs the following steps. Outputs are always in the respective folder {split}/{step}.

  • Preparation (PREP)
    • Load (multimodal) data specified in the annotation file and create a Seurat object.
      • A Cell Ranger output directory with feature-barcode matrices (MEX format) is loaded with Seurat::Read10X.
      • Alternatively, a data directory with files in MTX format (matrix.mtx, barcodes.tsv, features.tsv) is loaded with Seurat::ReadMtx.
    • Supported modalities beyond gene expression (RNA) are: Antibody Capture, CRISPR Guide Capture and Custom.
    • (optional) Provided metadata in the annotation file is added to the Seurat object.
    • Metadata is extended with Seurat::PercentageFeatureSet and configured recombination of existing metadata.
    • (CRISPR Guide Capture) Guide RNA and KO target genes are assigned.
      • CRISPR Guide Capture data is filtered first based on cellranger's and then configured UMI thresholds to retain only significant signals, discarding low-confidence gRNA detections.
      • Remaining counts are used to assign detected gRNAs and infer KO targets for each cell, generating a list of gRNAs (gRNAcall), the count of unique gRNAs (gRNA_n), a list of KO targets (KOcall), and the count of unique KOs (KO_n) for each cell.
      • Both gRNAcall and KOcall are alphabetically ordered values concatenated in snake case, e.g., geneA_geneB.
      • Note: Cells with different gRNAs targeting the same gene are gRNA multiplets (i.e., gRNA_n>1), but KO singlets (i.e., KO_n=1).
      • Note: gRNA multiplets could be indicative of cell-doublets/multiplets. e.g., gRNAcall: geneX-1_geneX-2 -> KOcall: geneX
      • Tip: To filter for cells with only one gRNA assigned use the logical expression gRNA_n == 1 in the filter configuration.
    • One Seurat object per sample is saved.
  • Merge & Split into subsets (RAW)
    • Merge all samples into one large object, including metadata, called "merged".
    • (optional) Externally provided metadata (e.g., labels from downstream cluster analysis) is added.
    • Split the merged data into subsets using metadata ({metadata_column}__{metadata_value}).
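
The gRNA and KO call logic described in the preparation step can be sketched as follows. This is a minimal illustration in Python, not the workflow's R implementation: the function name assign_calls, the umi_threshold argument and the rule that a KO target is the guide name without its trailing "-<index>" are assumptions made for this example.

```python
def assign_calls(guide_umis, umi_threshold=1):
    """Assign gRNA and KO calls for one cell.

    guide_umis maps a guide name (e.g., 'geneX-1') to its UMI count.
    """
    # Keep only guides whose UMI count reaches the (hypothetical) threshold.
    guides = sorted(g for g, n in guide_umis.items() if n >= umi_threshold)
    # Assumption: the KO target is the guide name without the trailing '-<index>'.
    targets = sorted({g.rsplit("-", 1)[0] for g in guides})
    return {
        "gRNAcall": "_".join(guides) if guides else None,
        "gRNA_n": len(guides),
        "KOcall": "_".join(targets) if targets else None,
        "KO_n": len(targets),
    }


# Two guides against the same gene pass the threshold, a third one does not.
cell = assign_calls({"geneX-2": 12, "geneX-1": 8, "geneY-1": 0}, umi_threshold=5)
```

In this example the cell is a gRNA multiplet (gRNAcall geneX-1_geneX-2, gRNA_n 2) but a KO singlet (KOcall geneX, KO_n 1), matching the note above.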

The following steps are performed on each data split separately (including the "merged" split).

  • Filtering (FILTERED)
    • Filter cells by a combination of logical-expressions using the metadata.
  • Pseudobulking (PSEUDOBULK)
    • Pseudobulking is performed based on metadata columns specified by the user, with options for aggregation methods including sum, mean, or median.
    • A cell count threshold is applied to remove pseudobulked samples with fewer cells than specified.
    • The aggregated metadata and pseudobulked data are saved as CSV files to be used for downstream bulk analyses.
    • A histogram and density plot visualizing the distribution of cell counts in pseudobulked samples are provided.
    • Other modalities are also pseudobulked, if applicable.
  • Normalization (NORMALIZED)
    • Normalization of gene expression data using SCTransform v2 (vst.flavor = "v2"), returning normalized values for all expressed genes, filtered by minimum cells per gene parameter.
    • Normalization of all multimodal data using Centered Log-Ratio (CLR) with margin 2.
    • Dynamic highly variable gene (HVG) determination using a residual variance threshold of 1.3 (default), returning both a table with statistics and a sorted list of all HVGs.
  • Cell Scoring
    • Cell cycle scoring using Seurat::CellCycleScoring and provided S- and G2M-phase genes.
    • Gene module scoring using Seurat::AddModuleScore and provided gene lists.
    • All scores and HVGs are determined based on normalized data, not corrected. To avoid redundancy, HVGs are only provided in the NORMALIZED step.
  • Correction (CORRECTED)
    • Normalization and correction for the list of provided confounders using SCTransform v2 (vst.flavor = "v2"), returning scaled values for all expressed genes.
    • The correction is only present in the scaled values (slot="scale.data"), hence only scaled values are saved as CSV, if configured.
    • Corrected data is useful for downstream analyses like dimensionality reduction and clustering, not differential gene expression analysis.
  • Visualization using Ridge-, Violin-, Dot-plots and Heatmaps {split}/{step}/plots/{plot_type}/{category}/{feature_list}/*.png
    • Configuration
      • Gene lists for plotting gene expression data can be provided as plain text files with one gene per line. The top 100 HVGs are always plotted.
      • Feature lists for other modalities can be provided directly in the configuration file.
      • The visualization category/group (vis_categories) indicates the metadata column by which cells are grouped within all plots.
    • Metadata
      • Visualization after each processing step using inspectdf for all numerical and categorical metadata.
      • Ridge-, Violin-, Dot-plots and Heatmaps (e.g., useful to compare gene module scores).
    • Gene expression (RNA) is always plotted after normalization (slot="data") and correction (slot="scale.data"), respectively.
      • Ridge-, Violin-, Dot-plots and Heatmaps.
      • In step NORMALIZED normalized data (slot="data") is visualized, in step CORRECTED corrected data (slot="scale.data") is visualized.
    • (optional) Other modalities are only visualized after normalization (using slot="data") using Ridge-, Violin-, Dot-plots and Heatmaps.
    • Plot types
      • Metadata plots using inspectdf.
      • Ridge- & Violin-plots are plotted per gene/feature.
      • Dot plots are only generated for NORMALIZED data (slot="data"). Before plotting, averaging and scaling (min-max based) is performed.
      • Heatmaps always show scaled data (slot="scale.data") and features (rows) are hierarchically clustered. In case of >30,000 cells, the cells are downsampled to the same size per category, which is the size of the smallest group but at least 100 cells.
  • Report: all visualizations, metadata and configuration/annotation files are provided in the Snakemake report.
  • Save counts
    • Functionality to save the counts of all modalities as CSV after each processing step. Useful for downstream analyses that are independent of Seurat.
  • Results: all results will be saved in the configured result path, where for each data (sub)set (split) a directory with the following structure is created within scrnaseq_processing_seurat/:
    • {split}/{step} for all the .rds object files and .CSV files.
      • plots/{plot_type}/{category}/{feature_list}/*.png for all visualizations.
      • stats/ for all metadata derived statistics.
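
The heatmap downsampling rule from the plot types above can be read as follows. This is a hedged Python sketch of one possible interpretation, not the workflow's R code: the function name, its arguments and the choice to keep all cells of groups smaller than the target size are assumptions.

```python
import random


def downsample_per_group(cells_by_group, max_cells=30000, min_per_group=100, seed=42):
    """Downsample cells per group once the total exceeds max_cells.

    cells_by_group maps a category value to a sequence of cell barcodes.
    """
    total = sum(len(cells) for cells in cells_by_group.values())
    if total <= max_cells:
        # Small data sets are plotted as they are.
        return {group: list(cells) for group, cells in cells_by_group.items()}
    # Target size: the smallest group, but never fewer than min_per_group cells.
    smallest = min(len(cells) for cells in cells_by_group.values())
    target = max(smallest, min_per_group)
    rng = random.Random(seed)
    # Assumption: groups with fewer cells than the target keep all their cells.
    return {
        group: rng.sample(list(cells), min(len(cells), target))
        for group, cells in cells_by_group.items()
    }
```

With groups of 20,000, 15,000 and 50 cells, the target size under this interpretation is 100, so the two large groups are reduced to 100 cells each and the small group keeps its 50 cells.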

Usage

Here are some tips for the usage of this workflow:

  • Use short sample names in sample_annotation sheet, because they will be the prefix for each barcode/cell in the merged & split datasets.
  • Run the workflow for each step of processing (with the stop_after parameter) and investigate the results (e.g., using the Snakemake report function).
  • Start with a low complexity in the configuration.
  • Try to finish the analysis of the "merged" data set first, and later split the data by using the split_by parameter.
  • In case you want to repeat your analysis based on metadata that emerged from downstream analyses (e.g., clustering, cell-type annotation, perturbation classification), you can provide an incomplete metadata file in the configuration (i.e., not all cells/barcodes have to be present in the metadata). Changes in this file will trigger a rerun of the workflow starting with the merging step to ensure the added metadata is considered in all downstream steps (e.g., splitting, visualization, etc.).
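
How an incomplete metadata file can be combined with the merged cells is sketched below in Python. This is an illustration under assumptions, not the workflow's R code: the function name add_metadata, the "s1_" barcode prefix format and the use of None for missing values are all hypothetical.

```python
def add_metadata(cells, extra, columns):
    """Left-join externally provided metadata onto the merged cells.

    cells: barcode -> dict of existing metadata
    extra: barcode -> dict of additional metadata (may cover only some cells)
    columns: names of the additional metadata columns to add
    """
    merged = {}
    for barcode, meta in cells.items():
        row = dict(meta)
        new_values = extra.get(barcode, {})
        for column in columns:
            # Cells absent from the extra metadata get a missing value.
            row[column] = new_values.get(column)
        merged[barcode] = row
    return merged


# Barcodes carry the sample name as prefix (the exact format is an assumption).
cells = {"s1_AAAC": {"sample": "s1"}, "s1_AAAG": {"sample": "s1"}}
extra = {"s1_AAAC": {"cluster": "c3"}}
annotated = add_metadata(cells, extra, ["cluster"])
```

Barcodes that only occur in the additional metadata are ignored, and cells without an entry keep a missing value in the new column.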

Configuration

Detailed specifications can be found here ./config/README.md

Example

We selected a scRNA-seq data set consisting of 15 CRC samples from Lee et al (2020) Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nature Genetics. Downloaded from the Weizmann Institute - Curated Cancer Cell Atlas (3CA) - Colorectal Cancer section.

  • samples/patients: 15
  • cells: 21657
  • features (genes): 22276
  • total runtime on HPC w/ SLURM (32GB RAM): <24 minutes for 551 jobs in total

A comparison of the cell type marker expression split by cell types visualized as a dot plot.

[Figure: cell type marker dot plots side by side, one from the data source/authors and one from the workflow output]

We provide metadata, annotation and configuration files for this data set in ./test. The UMI count matrix has to be downloaded by following the instructions below.

### Download example CRC scRNA-seq data from Lee 2020 Nature Genetics

# Start from repo root
cd scrnaseq_processing_seurat

# Download the .zip file
wget -O data.zip "https://www.dropbox.com/sh/pvauenviguopkue/AADVbccY9ueRVAFTeJEEPxRwa?dl=1" || curl -L "https://www.dropbox.com/sh/pvauenviguopkue/AADVbccY9ueRVAFTeJEEPxRwa?dl=1" -o data.zip

# Unzip and delete the .zip archive
unzip data.zip -d Data_Lee2020_Colorectal
rm data.zip

# Move and rename the UMI count matrix
mv Data_Lee2020_Colorectal/Exp_data_UMIcounts.mtx test/data/Lee2020NatGenet/matrix.mtx

# Remove the unzipped folder
rm -r Data_Lee2020_Colorectal

Beyond this the workflow was tested on multimodal scCRISPR-seq data sets with >350,000 cells and 340 different KO groups (<4h; 99 jobs; 256GB RAM).

Links

Resources

Publications

The following publications successfully used this module for their analyses.

  • ...

scrnaseq_processing_seurat's People

Contributors

sreichl

scrnaseq_processing_seurat's Issues

restructure results and speed up plotting

  • Check how the report behaves if output is a directory. How to make it still readable and nice?
  • make list of scripts (and rules) to adapt
  • restructure directory architecture in every script to {split}/{step}/plots/{type}/...
  • split panels into individual plots within their own subfolders {split}/{step}/plots/{type}/{vis_cat}/{feature_list}/{gene}.png
    • e.g., merged/NORMALIZED/plots/VlnPlot/condition/MHCgenes/gene1.png
      • Use output directory as result of the rule and use nested directories to encode metadata in addition to the file name? -> Actually possible as Snakemake is aware of metadata through the config file
    • ridge_plot.R
    • metadata_plot.R
    • violin_plot.R (STOPPED here with CORRECTED violins, the error pertains to the falsely inferred wildcard of gene_list -> enforce camelCase)
    • heatmap (collapse/simplify into one rule and adapt directory structure)
    • dotplot (collapse/simplify into one rule and adapt directory structure)
  • improve report by using labels
  • check "rule correct": Why is input NORMALIZED and not FILTERED? (RUNNING -> check if results differ)
    • read up on the correction, maybe its only on the scale.data? maybe module scores and HVG are only based on data...
  • highly variable genes pre and post CORRECTION are the same ie redundant output. make it only for NORMALIZED
  • perform test run on AKsmall subset

ideas

  • consider parallelization: https://stackoverflow.com/questions/8364288/what-hardware-limits-plotting-speed-in-r

  • consider split into jobs i.e., via snakemake -> no loops through feature lists

  • consider change of output device e.g., from PNG to PDF? -> PDF vectors... faster or slower? often far bigger...

  • problem: too many categories (e.g., 300 KOs) would generate too large plots

    • Split up the plot panels into multiple panels with suffix _1, _2, …
    • maybe splitting internally or using Snakemake (how)? will increase the speed?
    • generate sub folder and split into individual plots -> gigantic parallelization possible but also a lot of file outputs (many many plots) and always lots of time lost to load the object
  • for now: increased size to max 100in in ggsave_new() in utils.R - works for everything BUT heatmaps -> see below

plots -> whats the error, problem? -> slurmstepd: error: Detected 1 oom_kill event in StepId=4572064.batch. Some of the step tasks have been OOM Killed.
heatmaps large but only white
everything else squished or too large to look at
-> check if there is a solution to split a ggplot object into multiple in a useful manner -> could go into ggsave_new
-> if too large then put text that says so (previously done in dea_limma?)

make normalization configurable

provide options from Seurat e.g., log2transform (actually it's log1p i.e., ln(x+1) I think), SCTransform,...
think carefully. Is SCTransform enough/best? Are there any downsides (work)/upsides (???) in adding this functionality? Do I/we need it?

in that case, more functionality is required
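
For orientation, the two simpler normalization options touched on in this issue and in the README (log-normalization and CLR) can be sketched for a single cell's count vector. This is plain Python for illustration only, not the workflow's code. Assumptions: the log-normalization follows Seurat's LogNormalize as commonly described (scaling to 10,000 counts per cell followed by a natural log1p), and the CLR variant uses a log1p-based geometric mean, which is believed to mirror Seurat's CLR; verify both against the Seurat documentation before relying on them.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Scale one cell to scale_factor total counts, then apply ln(x + 1)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(x / total * scale_factor) for x in counts]


def clr(counts):
    """Centered log-ratio variant for one cell (assumed log1p-based form)."""
    # Sum of log1p over non-zero counts, divided by the number of all features.
    log_sum = sum(math.log1p(x) for x in counts if x > 0)
    geometric_mean = math.exp(log_sum / len(counts))
    return [math.log1p(x / geometric_mean) for x in counts]


rna_cell = log_normalize([1, 1, 2])
antibody_cell = clr([10, 0, 40])
```

Applying the function to each cell's vector corresponds to normalizing across cells (margin 2 in the README's description of CLR).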

speed up rule save_counts (bottleneck)

fwrite

  • test
    • input has to be data.frame
    • changed for now in save_counts.R
  • change in utils.R
  • change in metadata_plot.R
  • sctransform_cellScore.R

fread

  • change in metadata_plot.R
  • merge.R
  • prepare.R

general

  • make mr.pareto issue: look for write.csv across all MR.PARETO modules

https://rdrr.io/cran/data.table/man/fwrite.html

library(data.table)
fwrite(as.data.frame(GetAssayData(object = seurat_object, slot = "scale.data", assay = "SCT")), file = file.path(result_dir, paste0(step, 'scaled_', 'RNA', '.csv')), row.names=TRUE)

# more general
#fast writing
fwrite(as.data.frame(df), file=file.path("path/to/file.csv"), row.names=TRUE)

#fast reading
df <- data.frame(fread(file.path("path/to/file.csv"), header=TRUE), row.names=1)

add example data

Lee 2020 Nat Genet CRC scRNA-seq data set from Weizmann 3CA

  • use marker genes for gene lists of all expected cell types -> get from GPT4

save result object in h5ad format for better interoperability with other packages

provide info in the docs e.g., resources but do not implement as the use cases are highly custom (which assays to put where etc.)

example code:

## converting to h5ad format (SaveH5Seurat and Convert are provided by SeuratDisk)
library(Seurat)
library(SeuratDisk)

DefaultAssay(obj_m) <- "RNA"

h5seurat_path <- file.path(analysis_dir, "merged", "rna_merged.h5Seurat")
h5ad_path <- file.path(analysis_dir, "merged", "rna_merged.h5ad")

# remove outputs of previous runs, then save and convert
if (file.exists(h5seurat_path)) file.remove(h5seurat_path)
SaveH5Seurat(obj_m, filename = h5seurat_path)

if (file.exists(h5ad_path)) file.remove(h5ad_path)
Convert(h5seurat_path, dest = "h5ad")

add support for additional metadata file emerged from downstream analyses

add additional metadata file support for merged data, before the split to enable analyses of subsets that emerged from downstream analyses (eg clustering, cell-type annotation, perturbation classification)

config: MERGE section (would also fit with the prefixed barcodes)
rule merge (would trigger reprocessing of all other subsets as well)

OR

config: SPLIT section (would also fit with the prefixed barcodes)
rule split (would not trigger reprocessing of all other subsets as well, but the merged object does not have the metadata)

extend pseudobulking functionality

  • check Teams discussion on pseudobulking for additional features
  • configurable filtering for cell_count e.g., if <20 cells do not include in output
  • metadata aggregation: keep columns which have the same values within each group -> if easy
  • visualize pseudobulked cell counts? Histograms? Metadata plots?
  • make report.rst for plot

update documentation

  • update current docs (work through current README)
  • incorporate changes to latest features
    • restructuring of results and plots
  • add latest features
    • heatmaps: rows are clustered and cells downsampled
    • pseudobulking
    • extended gRNA & KO call assignment
    • sctransform flavor v2
    • tested on 3 datasets of different sizes (10k-350k cells) and all modalities (RNA, Antibody, CRISPR, Custom)
  • generalize config.yaml

extend KOcall strategy beyond singlets

support multiplets in a meaningful way
alphabetically ordered gene names as KO type in snake_case e.g., KOA_KOB_KOC

  • add column nKOcall (similar to Seurat nomenclature -> check again) describing number of KO genes assigned:
    • Negative (no KO): 0
    • Singlet: 1
    • Multiplet: X
  • KOcall snakecase of all gene names:
    • Negative: NA
    • Singlet: KOA
    • Multiplet (alphabetically): KOA_KOB_KOC
  • add column ngRNA describing number of guides assigned:
    • Negative: 0
    • Singlet: 1
    • Multiplet: X
  • gRNAcall snakecase of all guide calls:
    • Negative: NA
    • Singlet: guide-1
    • Multiplet (alphabetically): guide-1_guide-300

note: gRNA multiplet can be a KO singlet!
e.g., gRNAcall: geneX-1_geneX-2 -> KOcall: geneX

pseudobulking of counts by metadata feature

  • new config field(s) to generate a simple pseudo count matrix for downstream analysis using bulk methods (breaking change)
    • probably a list of categorical metadata
    • pseudobulk: by: ['patient', 'cellType', 'treatment'] method: "sum"
  • look how others do it
  • support multiple methods
    • sum
    • mean
    • median
  • support modalities
    • RNA
    • AB
    • grna
    • custom
  • generate metadata sheet
    • include statistics of pseudobulked cell numbers per sample
  • document r-dplyr=1.1.2
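
The pseudobulking idea outlined in this issue (group cells by metadata columns, aggregate counts by sum, mean or median, and drop groups with too few cells) can be sketched as follows. This is a minimal Python illustration, not the workflow's R implementation: the function name pseudobulk, the data layout and the "_" separator for group names are assumptions.

```python
from collections import defaultdict
from statistics import mean, median

AGGREGATORS = {"sum": sum, "mean": mean, "median": median}


def pseudobulk(counts, metadata, by, method="sum", cell_count_threshold=1):
    """Aggregate single-cell counts into pseudobulk samples.

    counts: barcode -> {gene: count}
    metadata: barcode -> {column: value}
    by: metadata columns that define a pseudobulk sample
    """
    # Group barcodes by the combination of the selected metadata values.
    groups = defaultdict(list)
    for barcode, meta in metadata.items():
        key = "_".join(str(meta[column]) for column in by)
        groups[key].append(barcode)

    aggregate = AGGREGATORS[method]
    genes = sorted({gene for cell in counts.values() for gene in cell})
    bulk, cell_counts = {}, {}
    for key, barcodes in groups.items():
        # Remove pseudobulk samples with fewer cells than the threshold.
        if len(barcodes) < cell_count_threshold:
            continue
        cell_counts[key] = len(barcodes)
        bulk[key] = {
            gene: aggregate([counts[b].get(gene, 0) for b in barcodes])
            for gene in genes
        }
    return bulk, cell_counts
```

The second return value holds the number of cells per pseudobulk sample, which is the kind of statistic the metadata sheet and the cell count plots would be based on.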
