This project forked from icbi-lab/luca

Single-cell Lung Cancer Atlas with 1.2M cells

Home Page: https://luca.icbi.at

License: BSD 3-Clause "New" or "Revised" License

Shell 0.46% Python 81.62% R 2.99% Nextflow 14.94%

LuCA - The single-cell Lung Cancer Atlas

The single-cell Lung Cancer Atlas (LuCA) is a resource integrating more than 1.2 million cells from 309 patients across 29 datasets.

The atlas is publicly available for interactive exploration through a cell-x-gene instance. We also provide h5ad objects and a scArches model that can be used to project custom datasets into the atlas. For more information, check out the

Salcher, Sturm, Horvath et al., Manuscript in preparation

This repository contains the source code to reproduce the single-cell data analysis for the paper. The analyses are wrapped into nextflow pipelines, all dependencies are provided as singularity containers, and input data are available from zenodo.

For clarity, the project is split up into two separate workflows:

  • build_atlas: Takes one AnnData object with UMI counts per dataset and integrates them into an atlas.
  • downstream_analyses: Runs analysis tools on the annotated, integrated atlas and produces plots for the publication.

The build_atlas step requires specific hardware (CPU + GPU) for exact reproducibility (see notes on reproducibility) and is relatively computationally expensive. Therefore, the downstream_analyses step can also operate on pre-computed results of the build_atlas step, which are available from zenodo.

Launching the workflows

1. Prerequisites

  • Nextflow, version 21.10.6 or higher
  • Singularity/Apptainer, version 3.7 or higher (tested with 3.7.0-1.el7)
  • A high-performance computing (HPC) cluster or cloud setup. The whole analysis will consume several thousand CPU hours.
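
The version requirements above can be checked with a short script. This is a minimal sketch, not part of the repository: the helper names `version_ge`, `first_version` and `report` are made up for illustration, and the comparison assumes a `sort` that supports `-V` (GNU coreutils). If `singularity` is provided by Apptainer, the reported number follows Apptainer's own 1.x versioning and cannot be compared against 3.7 this way.

```shell
#!/usr/bin/env bash
# Sketch: compare installed tool versions against the documented minimums.

# version_ge A B: succeeds if version A >= version B (uses `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# first_version: print the first dotted version number found on stdin.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# report NAME MINIMUM FOUND: print a one-line verdict for one tool.
report() {
    if version_ge "$3" "$2"; then
        echo "$1: $3 (OK, >= $2)"
    else
        echo "$1: '$3' (need >= $2)"
    fi
}

if command -v nextflow >/dev/null 2>&1; then
    report nextflow 21.10.6 "$(nextflow -version 2>&1 | first_version)"
else
    echo "nextflow: not found"
fi

if command -v singularity >/dev/null 2>&1; then
    report singularity 3.7 "$(singularity --version 2>&1 | first_version)"
else
    echo "singularity: not found"
fi
```

The script only prints a verdict per tool and never aborts, so it can be run safely on a login node before submitting any jobs.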

2. Obtain data

Before launching the workflow, you need to obtain input data and singularity containers from zenodo. First of all, clone this repository:

git clone https://github.com/icbi-lab/luca.git
cd luca

Then, within the repository, download the data archives and extract them to the corresponding directories:

# singularity containers
curl "https://zenodo.org/record/6997383/files/containers.tar.xz?download=1" | tar xvJ

# input data
curl "https://zenodo.org/record/6997383/files/input_data.tar.xz?download=1" | tar xvJ

# OPTIONAL: obtain intermediate results if you just want to run the `downstream_analyses` workflow
curl "https://zenodo.org/record/6997383/files/build_atlas_results.tar.xz?download=1" | tar xvJ
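
The commands above stream each archive directly into `tar`, so an interrupted transfer has to start over. As an alternative, the following sketch downloads each archive to disk first and extracts it afterwards. The function names `archive_url` and `fetch_and_extract` are hypothetical, and resuming with `--continue-at -` only helps if the server honours range requests, which we have not verified for zenodo. This variant also needs enough disk space for both the archive and the extracted files.

```shell
#!/usr/bin/env bash
# Sketch: download each archive to disk first, then extract it.

# Base URL of the zenodo record used in the commands above.
ZENODO_FILES="https://zenodo.org/record/6997383/files"

# archive_url NAME: print the download URL of NAME.tar.xz.
archive_url() {
    printf '%s/%s.tar.xz?download=1\n' "$ZENODO_FILES" "$1"
}

# fetch_and_extract NAME: download NAME.tar.xz into the current directory
# (continuing a partial file if one exists), then unpack it.
fetch_and_extract() {
    local name="$1"
    local archive="${name}.tar.xz"
    curl --fail --location --continue-at - \
        --output "$archive" "$(archive_url "$name")" &&
        tar -xJf "$archive"
}

# Usage (not run automatically, since the archives are large):
#   fetch_and_extract containers
#   fetch_and_extract input_data
#   fetch_and_extract build_atlas_results   # optional
```

If an archive is already complete on disk, the resume request may be rejected by the server; in that case run the `tar -xJf` step directly.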

Note that some steps of the downstream analysis depend on an additional cohort of checkpoint-inhibitor-treated patients, which is only available under a protected access agreement. These data are therefore not included in our data archive. You'll need to obtain the dataset yourself and place it in the data/14_ici_treatment/Genentech folder. The corresponding analysis steps are skipped by default. You can enable them by adding the --with_genentech flag to the nextflow run command.

3. Configure nextflow

Depending on your HPC/cloud setup, you will need to adjust the nextflow profile in nextflow.config to tell nextflow how to submit the jobs. Using a withName:... directive, special resources may be assigned to GPU jobs. You can get an idea by checking out the icbi_lung profile, which we used to run the workflow on our on-premise cluster. Only the build_atlas workflow makes use of GPU processes.
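
As a rough illustration of what such a profile might look like, the sketch below writes a small config file for a SLURM cluster. Everything specific in it is an assumption: the profile name `my_cluster`, the queue names, the process name `MY_GPU_PROCESS` and the file name `my_cluster.config` are placeholders and do not correspond to anything in this repository. Look up the actual GPU process names in the workflow sources and the icbi_lung profile before adapting it.

```shell
#!/usr/bin/env bash
# Sketch: write a hypothetical nextflow profile for a SLURM cluster.
# All names below (profile, queues, process name) are placeholders.
cat > my_cluster.config <<'EOF'
profiles {
    my_cluster {
        process {
            // default: submit every job to a CPU queue via SLURM
            executor = 'slurm'
            queue = 'my_cpu_queue'

            // GPU jobs: request one GPU and expose it inside the container
            withName: 'MY_GPU_PROCESS' {
                queue = 'my_gpu_queue'
                clusterOptions = '--gres=gpu:1'
                containerOptions = '--nv'
            }
        }
    }
}
EOF
echo "wrote my_cluster.config"
```

The profile block can be copied into nextflow.config, or the file can be passed to nextflow with the `-c` option and selected with `-profile my_cluster`.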

4. Launch the workflows

# Run `build_atlas` workflow
nextflow run main.nf --workflow build_atlas -resume -profile <YOUR_PROFILE> \
    --outdir "./data/20_build_atlas"

# Run `downstream_analyses` workflow
nextflow run main.nf --workflow downstream_analyses -resume -profile <YOUR_PROFILE> \
    --build_atlas_dir "./data/20_build_atlas" \
    --outdir "./data/30_downstream_analyses"

As you can see, the downstream_analyses workflow requires the output of the build_atlas workflow as input. The intermediate results from zenodo contain the output of the build_atlas workflow.
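
The dependency between the two workflows, and the optional --with_genentech flag, can be captured in a small helper that assembles the launch command. This is a sketch under assumptions: the function name `downstream_command` is made up, and it only prints the command instead of running it, so that it can be inspected first. It assumes paths and profile names without spaces.

```shell
#!/usr/bin/env bash
# Sketch: assemble the downstream_analyses launch command.

# downstream_command PROFILE BUILD_ATLAS_DIR GENENTECH_DIR
# Prints the nextflow command; fails if the build_atlas results are missing.
downstream_command() {
    local profile="$1" build_dir="$2" genentech_dir="$3" cmd
    if [ ! -d "$build_dir" ]; then
        echo "missing build_atlas results: $build_dir" >&2
        return 1
    fi
    cmd="nextflow run main.nf --workflow downstream_analyses -resume"
    cmd="$cmd -profile $profile --build_atlas_dir $build_dir"
    cmd="$cmd --outdir ./data/30_downstream_analyses"
    # Enable the protected-cohort steps only if the data folder has content.
    if [ -d "$genentech_dir" ] && [ -n "$(ls -A "$genentech_dir")" ]; then
        cmd="$cmd --with_genentech"
    fi
    echo "$cmd"
}

# Usage:
#   downstream_command <YOUR_PROFILE> ./data/20_build_atlas \
#       ./data/14_ici_treatment/Genentech
```

Printing rather than executing keeps the helper free of side effects; pipe the output to `sh` or paste it into a job script once it looks right.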

Structure of this repository

  • analyses: Place for e.g. jupyter/rmarkdown notebooks, grouped by their respective (sub-)workflows.
  • bin: executable scripts called by the workflow
  • conf: nextflow configuration files for all processes
  • containers: place for singularity image files. Not part of the git repo; it is created by the download command.
  • data: place for input data and results in different subfolders. Gets populated by the download commands and by running the workflows.
  • lib: custom libraries and helper functions
  • modules: nextflow DSL2.0 modules
  • preprocessing: scripts used to preprocess data upstream of the nextflow workflows. The processed data are part of the archives on zenodo.
  • subworkflows: nextflow subworkflows
  • tables: contains static content that should be under version control (e.g. manually created tables)
  • workflows: the main nextflow workflows

Build atlas workflow

The build_atlas workflow comprises the following steps:

  • QC of the individual datasets based on detected genes, read counts and mitochondrial fractions
  • Merging of all datasets into a single AnnData object. Harmonization of gene symbols.
  • Annotation of two "seed" datasets as input for scANVI.
  • Integration of datasets with scANVI
  • Doublet removal with SOLO
  • Annotation of cell-types based on marker genes and unsupervised leiden clustering.
  • Integration of additional datasets with transfer learning using scArches.

Downstream analysis workflow

  • Patient stratification into immune phenotypes
  • Subclustering and analysis of the neutrophil cluster
  • Differential gene expression analysis using pseudobulk + DESeq2
  • Differential analysis of transcription factors, cancer pathways and cytokine signalling using Dorothea, progeny, and CytoSig.
  • Copy number variation analysis using SCEVAN
  • Cell-type composition analysis using scCODA
  • Association of single cells with phenotypes from bulk RNA-seq datasets with Scissor
  • Cell-cell communication based on differential gene expression and the CellphoneDB database.

Contact

For reproducibility issues or any other requests regarding single-cell data analysis, please use the issue tracker. For anything else, you can reach out to the corresponding author(s) as indicated in the manuscript.

Notes on reproducibility

We aimed to make this workflow reproducible by providing all input data, containerizing all software dependencies and integrating all analysis steps into a nextflow workflow. In theory, this allows the workflow to be executed on any system that can run nextflow and singularity. Unfortunately, some single-cell analysis algorithms (in particular scVI/scANVI and UMAP) yield slightly different results on different hardware, trading off computational reproducibility for a significantly faster runtime. In particular, results will differ when changing the number of cores, or when running on a CPU/GPU of a different architecture. See also scverse/scanpy#2014 for a discussion.

Since the cell-type annotation depends on clustering, and the clustering depends on the neighborhood graph, which again depends on the scANVI embedding, running the build_atlas workflow on a different machine will likely break the cell-type labels.

Below is the hardware we used to execute the build_atlas workflow. Theoretically, any CPU/GPU of the same generation should produce identical results, but we have not yet had the chance to test this.

  • Compute node CPU: Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz (2x)
  • GPU node CPU: EPYC 7352 24-Core (2x)
  • GPU node GPU: Nvidia Quadro RTX 8000 GPU
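
To compare your own nodes against the hardware listed above, it can help to record the CPU and GPU model next to the results of each run. The sketch below is one way to do that; the function name `hardware_report` and the output format are made up, and it assumes a Linux system where `lscpu`, `nproc` and `nvidia-smi` may or may not be installed.

```shell
#!/usr/bin/env bash
# Sketch: print the CPU model, core count and GPU model of the current node.

hardware_report() {
    if command -v lscpu >/dev/null 2>&1; then
        echo "cpu_model: $(lscpu | sed -n 's/^Model name:[[:space:]]*//p' | head -n 1)"
    else
        echo "cpu_model: unknown (lscpu not available)"
    fi
    if command -v nproc >/dev/null 2>&1; then
        echo "cpu_cores: $(nproc)"
    else
        echo "cpu_cores: unknown (nproc not available)"
    fi
    if command -v nvidia-smi >/dev/null 2>&1; then
        # One line per GPU, joined with commas.
        echo "gpu: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>&1 | paste -sd ',' -)"
    else
        echo "gpu: none detected (nvidia-smi not available)"
    fi
}

hardware_report
```

Run it on both the compute node and the GPU node, since the two used different CPUs in our setup.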
