nf-core / scnanoseq

Single-cell/nuclei pipeline for data derived from Oxford Nanopore and 10X Genomics

Home Page: https://nf-co.re/scnanoseq/

License: MIT License

HTML 1.13% Python 35.34% Nextflow 60.19% R 2.84% Shell 0.51%

10xgenomics long-read-sequencing nanopore nextflow nf-core scrna-seq single-cell

Introduction

nf-core/scnanoseq is a bioinformatics best-practice analysis pipeline for 10X Genomics single-cell/nuclei RNA-seq data derived from Oxford Nanopore Q20+ chemistry (R10.4 flow cells, >Q20). Because the reads are expected to exceed Q20, the pipeline does not depend on paired Illumina data. Please note that scnanoseq can also process Oxford Nanopore data from older chemistries, but we encourage use of the Q20+ chemistry.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline summary

1. Raw read QC (FastQC, NanoPlot, NanoComp and ToulligQC)
2. Unzip and split FastQ (gunzip)
   1. Optional: Split fastq for faster processing (split)
3. Trim and filter reads (NanoFilt)
4. Post trim QC (FastQC, NanoPlot and ToulligQC)
5. Barcode detection using a custom whitelist or 10X whitelist (BLAZE)
6. Extract barcodes. Consists of the following steps:
   1. Parse FASTQ files into R1 reads containing barcode and UMI and R2 reads containing sequencing without barcode and UMI (custom script ./bin/pre_extract_barcodes.py)
   2. Re-zip FASTQs (pigz)
7. Post-extraction QC (FastQC, NanoPlot and ToulligQC)
8. Alignment (minimap2)
9. SAMtools processing including (SAMtools):
   1. SAM to BAM
   2. Filtering of mapped only reads
   3. Sorting, indexing and obtain mapping metrics
10. Post-mapping QC in unfiltered BAM files (NanoComp, RSeQC)
11. Barcode tagging with read quality, BC, BC quality, UMI, and UMI quality (custom script ./bin/tag_barcodes.py)
12. Barcode correction (custom script ./bin/correct_barcodes.py)
13. Post correction QC for corrected bams (SAMtools)
14. UMI-based deduplication (UMI-tools)
15. Gene and transcript level matrices generation (IsoQuant)
16. Preliminary matrix QC (Seurat)
17. Compile QC for raw reads, trimmed reads, pre and post-extracted reads, mapping metrics and preliminary single-cell/nuclei QC (MultiQC)
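
To make the barcode extraction step in the summary above more concrete, here is a minimal Python sketch of splitting one read into an R1 part (barcode plus UMI) and an R2 part (the remaining sequence). It is not the code in ./bin/pre_extract_barcodes.py. The function name split_read is hypothetical, the adapter sequence and the 16 bp barcode / 12 bp UMI lengths are taken from the example parameters shown in the issues further down this page, and exact string matching is a simplifying assumption (real nanopore reads carry errors, and the pipeline relies on BLAZE for barcode detection).

```python
# Assumed layout: <adapter><cell barcode><UMI><cDNA...>
ADAPTER = "CTACACGACGCTCTTCCGATCT"  # fixed sequence from the example params
BARCODE_LEN = 16                    # cell_barcode_lengths in the example params
UMI_LEN = 12                        # umi_lengths in the example params


def split_read(seq, qual):
    """Split a read into (R1, R2), each a (sequence, quality) pair.

    R1 holds barcode + UMI, R2 holds everything after the UMI.
    Returns None when the adapter is missing or the read is too short.
    """
    start = seq.find(ADAPTER)  # exact match only (simplifying assumption)
    if start == -1:
        return None
    bc_start = start + len(ADAPTER)
    cut = bc_start + BARCODE_LEN + UMI_LEN
    if cut > len(seq):
        return None
    r1 = (seq[bc_start:cut], qual[bc_start:cut])
    r2 = (seq[cut:], qual[cut:])
    return r1, r2


if __name__ == "__main__":
    read = "AA" + ADAPTER + "C" * BARCODE_LEN + "G" * UMI_LEN + "TTTTACGT"
    print(split_read(read, "I" * len(read)))
```

Keeping the quality string sliced in step with the sequence is what lets both parts be written back out as valid FASTQ records.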

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq,cell_count
CONTROL_REP1,AEG588A1_S1.fastq.gz,5000
CONTROL_REP1,AEG588A1_S2.fastq.gz,5000
CONTROL_REP2,AEG588A2_S1.fastq.gz,5000
CONTROL_REP3,AEG588A3_S1.fastq.gz,5000
CONTROL_REP4,AEG588A4_S1.fastq.gz,5000
CONTROL_REP4,AEG588A4_S2.fastq.gz,5000
CONTROL_REP4,AEG588A4_S3.fastq.gz,5000

Each row represents a single-end fastq file. Rows with the same sample identifier are considered technical replicates and will be automatically merged. cell_count is the number of cells you expect for that sample.

Then run the pipeline with:

nextflow run nf-core/scnanoseq \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>
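
Before launching, it can help to check how the samplesheet rows will be grouped. The Python sketch below is only an illustration of the rule described above (rows sharing a sample identifier are treated as technical replicates); it is not the pipeline's own samplesheet check. The function name group_samplesheet is hypothetical, and the requirement that replicate rows agree on cell_count is an assumption made for this example.

```python
import csv
import io
from collections import defaultdict


def group_samplesheet(csv_text):
    """Return ({sample: [fastq, ...]}, {sample: cell_count}) from CSV text."""
    groups = defaultdict(list)
    cell_counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = row["sample"]
        count = int(row["cell_count"])
        # Assumption: rows of the same sample should carry the same cell_count.
        if cell_counts.setdefault(sample, count) != count:
            raise ValueError(f"conflicting cell_count for sample {sample}")
        groups[sample].append(row["fastq"])
    return dict(groups), cell_counts


EXAMPLE = """sample,fastq,cell_count
CONTROL_REP1,AEG588A1_S1.fastq.gz,5000
CONTROL_REP1,AEG588A1_S2.fastq.gz,5000
CONTROL_REP2,AEG588A2_S1.fastq.gz,5000
"""

if __name__ == "__main__":
    groups, counts = group_samplesheet(EXAMPLE)
    for sample, files in groups.items():
        print(sample, counts[sample], files)
```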

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.
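
Since parameters should come from the CLI or a -params-file, one option is to generate that file programmatically. The sketch below writes a JSON params file using only the Python standard library. The helper name write_params_file and the file name params.json are hypothetical; input and outdir are the parameters used in the command above.

```python
import json


def write_params_file(path, input_csv, outdir, **extra):
    """Write pipeline parameters to a JSON file and return them as a dict."""
    params = {"input": input_csv, "outdir": outdir, **extra}
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)
    return params


if __name__ == "__main__":
    write_params_file("params.json", "samplesheet.csv", "results")
```

The resulting file could then be passed with -params-file params.json in place of the --input and --outdir flags.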

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

This pipeline produces feature barcode matrices at both the gene and transcript level and can retain introns within the counts themselves. These files can be ingested directly by most packages used for downstream analyses, such as Seurat. In addition, the pipeline produces a number of quality control metrics to ensure that the samples processed meet expected metrics for single-cell/nuclei data.

Troubleshooting

If you experience any issues, please submit an issue on GitHub. Resolutions for some common issues are noted below:

One issue that has been observed is a recurrent node failure on Slurm clusters that appears to be related to the submission of Nextflow jobs. This issue is caused by Nextflow rather than by this pipeline. Our research computing team is currently working on a resolution. In the meantime, two approaches appear to help should this issue arise:
1. Create a custom config that increases the memory request for the job that failed. It may take a couple of attempts to find the right request, but we have noted that a memory issue occasionally accompanies these errors.
2. Request an interactive session with a generous amount of time, memory and CPUs, and run the pipeline on that single node. This will take longer because there is minimal parallelization, but it does seem to resolve the issue.

We acknowledge that analyzing PromethION data is a common use case for this pipeline. Currently, the pipeline defaults are tuned for GridION and average-sized PromethION data. Where jobs have failed on larger PromethION datasets, we have overridden the defaults with a custom configuration file (provided by the -c Nextflow option) in which resources were increased (substantially in some cases). Below are some of the overrides we have used. These amounts may not work on every dataset, but they should at least indicate which processes need more resources:


process
{
    withName: '.*:BLAZE'
    {
        cpus = 24
        ext.args = '--threads 30'
    }
}

process
{
    withName: '.*:SAMTOOLS_SORT'
    {
        cpus = 20
    }
}

process
{
    withName: '.*:MINIMAP2_ALIGN'
    {
        cpus = 20
    }
}

process
{
    withName: '.*:ISOQUANT'
    {
        ext.args = {
            [
                "--threads 30",
                "--complete_genedb",
                params.stranded == "forward" ? "--stranded forward" : params.stranded == "reverse" ? "--stranded reverse" : "--stranded none",
            ].join(' ').trim()
        }
        time = '135.h'
    }
}

Credits

nf-core/scnanoseq was originally written by Austyn Trull and Dr. Lara Ianov.

We would also like to thank the following people and groups for their support, including financial support:

Dr. Elizabeth Worthey
University of Alabama at Birmingham Biological Data Science Core (U-BDS), RRID:SCR_021766, https://github.com/U-BDS
Support from: 3P30CA013148-48S8

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #scnanoseq channel (you can join with this invite).

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.


scnanoseq's Issues

Remove filtered BAMs stats from MultiQC and add details in the documentation

Description of feature

1. For readability of the MultiQC Samtools Percent Mapped stats, we can remove filtered BAMs (e.g. corrected BAM after minimap2), to retain only minimap2 stats of mapped vs unmapped.
2. However, users should be aware the BAMs post minimap have been filtered to remove unmapped reads. We should add this in the documentation.

Re-enable MultiQC for NanoComp once MultiQC issue is solved

Description of feature

Related to MultiQC/MultiQC#1630.

This MultiQC issue is currently in the milestone list for the next release. This scnanoseq issue is a reminder to re-enable MultiQC on NanoComp outputs once it is available.

Investigate Nanocomp report overwrites

Description of the bug

Given multiple FASTQ files, NanoPlot and NanoComp overwrite their own reports. We need to fix the output structure to prevent this behavior when multiple FASTQ files for the same sample are provided.

Slack discussion for reference is here

Add 5' support

Description of feature

The pipeline currently supports 3' 10X chemistry but we can add 5' support at minimal effort. We are noting this here to make it clear that this feature is in the roadmap for scnanoseq, however (depending on time) it may not be present in the first release.

Make threads in Blaze a param

Description of feature

During a test run, I noted that Blaze can take a while to run. We have already included the Blaze thread param in the module, but need to make it modular through a param. Quick reference to section to be changed is the following:

https://github.com/U-BDS/scnanoseq/blob/3edc7d216b78b8c57fbe6940d25b57eace74d4fc/modules/local/blaze.nf#LL32C11-L32C18

Speed up preextract step

Description of feature

Pre-extraction of barcodes is a bottleneck for the pipeline with larger datasets. We should look into ways to speed this step up so that run time is not a concern.

tmp dir issue in `SORT_GTF`

Description of the bug

A temporary file creation issue was noted when using the Singularity container for SORT_GTF. It seems to be a container-specific issue, so we may need to find an alternative container. Note that this was only encountered with the full file test.

Command used and terminal output

params:

# names/email
multiqc_title: "scnanoseq_complete_GridION_test"

# input samplesheet
input: "./samplesheet.csv"

# Genome references
fasta: "/data/project/U_BDS/References/GENCODE/mm10/release_M20/Genome/GRCm38.primary_assembly.genome.fa"
gtf: "/data/project/U_BDS/References/GENCODE/mm10/release_M20/GTF/gencode.vM20.annotation.gtf"

# other params
skip_trimming: false
trimming_software: "nanofilt"
skip_nanoplot: false 
#skip_nanoplot: true

# fastq options
split_amount: 0 

# barcode options
cell_barcode_pattern: ""
cell_amount: 1000
identifier_pattern: "fixed_seq_1,cell_barcode_1,umi_1,fixed_seq_2"
cell_barcode_lengths: "16"
umi_lengths: "12"
fixed_seqs: "CTACACGACGCTCTTCCGATCT, TTTTTTTTTT"
#intron_retention_method: "1"
#counts_level: "gene" # null by default

execution of pipeline:

nextflow run main.nf \
    -params-file ./params.yml \
    --outdir ./results \
    -profile cheaha \
    -resume

Command error:
  sort: cannot create temporary file in '/scratch/lianov': No such file or directory



Relevant files

No response

System information

UAB HPC - Cheaha (with cheaha config)

Find alternative optimization solution to nf-core modules to avoid linting issues

Description of feature

Given that PromethION datasets are expected, some processes that are nf-core modules (e.g. NanoPlot) need changes to their default resource allocations, which are very low by default.

This was accomplished in the past by setting the resource allocation in modules.conf, e.g.:

  withName: '.*:NANOPLOT' {
      memory = '45.GB'
  }

However, for the test profile run, the above resource allocation overrides the test profile (where max_memory = '6.GB') and leads to a failure in the run from GitHub Actions, where 45 GB is too high and certainly not needed for the test run.

We need to find a better way to optimize resources without changing the module directly (which causes linting issues, e.g. "Local copy of module does not match remote") and without using modules.conf, given the issue above.

[Note: using custom.conf works as expected, but is there an alternative? Or way to enable custom.conf for GitHub Actions?]
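
One possible direction, shown here only as an illustrative sketch in Python rather than as Nextflow configuration, is to clamp each request to the active profile's ceiling, so that a 45 GB request becomes 6 GB under the test profile. The helper names parse_memory and cap_memory are hypothetical, and the parser is an assumption that only handles whole-number values written like '45.GB'.

```python
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def parse_memory(text):
    """Convert a string such as '45.GB' into a number of bytes."""
    value, unit = text.split(".")
    return int(value) * UNITS[unit.upper()]


def cap_memory(requested, max_memory):
    """Return the request unchanged unless it exceeds max_memory."""
    if parse_memory(requested) > parse_memory(max_memory):
        return max_memory
    return requested


if __name__ == "__main__":
    print(cap_memory("45.GB", "6.GB"))    # capped under a test-like ceiling
    print(cap_memory("45.GB", "128.GB"))  # left as requested
```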

Enhance isoquant outputs for downstream analysis

Description of feature

Current IsoQuant outputs contain a # in the first row, next to feature_id. We want to remove this # so that functions such as read.table do not treat the header row as a comment and skip it. That row holds the cell names, which should be preserved as column names for downstream analysis.
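
A minimal Python sketch of the clean-up described above follows. It assumes a text matrix whose first line starts with "#feature_id"; the function names clean_header and clean_matrix are hypothetical, and this is not the fix that was implemented in the pipeline.

```python
def clean_header(line):
    """Drop a leading '#' (and any spaces right after it) from a header line."""
    if line.startswith("#"):
        return line[1:].lstrip(" ")
    return line


def clean_matrix(src_path, dst_path):
    """Copy a matrix file, stripping the '#' from its first line only."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.write(clean_header(src.readline()))
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    print(clean_header("#feature_id\tAAACCC\tAAACGG\n"), end="")
```

Only the first line is touched, so any later value that happens to start with # is left as it is.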

Add support for technical replicates

In certain cases, users may have multiple runs derived from non-PromethION sequencers (e.g. multiple GridION runs) which are linked to the same sample of origin, in order to acquire sufficient read depth per sample (thus several technical replicates).

Therefore, optional merges will need to be added to address cases of technical replicates. A potential test dataset may be the PromethION data from Blaze (create "technical replicates" from it as downsampled files)
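
The idea of creating "technical replicates" by downsampling could look roughly like the Python sketch below. It is an assumption-laden illustration: the function names are hypothetical, records are kept independently at random with a fixed seed per replicate, and the whole file is held in memory, which would not suit a full PromethION dataset.

```python
import io
import random


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a FASTQ handle."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # empty header line means end of file
            return
        yield tuple(record)


def downsample(records, fraction, seed):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]


def make_replicates(records, n_replicates, fraction, seed=0):
    """Build several downsampled copies, each with its own seed."""
    records = list(records)
    return [downsample(records, fraction, seed + i) for i in range(n_replicates)]


if __name__ == "__main__":
    fastq = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10)))
    for i, rep in enumerate(make_replicates(read_fastq(fastq), 2, 0.5), start=1):
        print(f"replicate {i}: {len(rep)} reads")
```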

Investigate uncached TAG_BARCODES

Description of the bug

For sample ERR9958135 (PromethION), the run completed but TAG_BARCODES was not cached. We need to investigate this behavior further to prevent long re-run times when the pipeline is restarted for a minor change (e.g. a MultiQC-only run).

Unexpected warning to process dependent on params

Description of the bug

WARN: There's no process matching config selector: .*:SPLIT_FILE -- Did you mean: SPLIT_FILE?

When split_amount: 0 and process SPLIT_FILE is not expected to run, the warning above is shown during the run. It is likely also present for other processes in the same situation (those that depend on user params or sit under conditionals). This will likely confuse users.

This seems similar to what was previously reported in nf-core/tools#1288 (comment). The fix is to add an if to modules.config for this process, and to double check that all other processes have one [at this time most QC processes at least do].

Command used and terminal output

# default param to split_amount is 0

nextflow run main.nf \
    --outdir ./results \
    -profile test,cheaha \
    -resume

Relevant files

No response

System information

Cheaha
N E X T F L O W ~ version 23.04.1

Add ToulligQC to raw read QC and post-extraction QC

Description of feature

Add ToulligQC to raw read QC and post-extraction QC

An example report can be found here

nf-core module: https://nf-co.re/modules/toulligqc

Slack discussion present here.

Fix parsing of UMI tools dedup output to MultiQC

Description

Currently, given our strategy of splitting files by chromosome to speed up deduplication, there are multiple QC files reported in the MultiQC report (one dedup QC for each chromosome). With multiple samples, this will overcrowd the report (both in the General Stats section and in the UMI-tools section). A few strategies around this may include:

Only include main chromosomes (no haplotypes etc., given the lower read counts in these regions). This may still be too much for the report, but is worth a thought (the second option is more straightforward).
Create a new UMI-tools stats output based on overall metrics from split files.
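
The second strategy could be sketched in Python as below: sum the per-chromosome counts into one record per sample. The metric names input_reads and output_reads and the function name combine_dedup_stats are hypothetical placeholders, not the actual fields of the UMI-tools output, so the real parsing would need to be adapted.

```python
from collections import Counter


def combine_dedup_stats(per_chromosome):
    """Sum per-chromosome count metrics into one overall record."""
    totals = Counter()
    for stats in per_chromosome.values():
        totals.update(stats)  # adds counts key by key
    combined = dict(totals)
    reads_in = combined.get("input_reads", 0)
    # Percentages are recomputed from the totals rather than averaged.
    combined["percent_retained"] = (
        100.0 * combined.get("output_reads", 0) / reads_in if reads_in else 0.0
    )
    return combined


if __name__ == "__main__":
    example = {
        "chr1": {"input_reads": 100, "output_reads": 60},
        "chr2": {"input_reads": 300, "output_reads": 140},
    }
    print(combine_dedup_stats(example))
```

Recomputing the percentage from summed counts avoids giving small chromosomes the same weight as large ones.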
