
Single-cell/nuclei pipeline for data derived from Oxford Nanopore and 10X Genomics

Home Page: https://nf-co.re/scnanoseq/

License: MIT License

HTML 1.13% Python 35.34% Nextflow 60.19% R 2.84% Shell 0.51%
10xgenomics long-read-sequencing nanopore nextflow nf-core scrna-seq single-cell

scnanoseq's Introduction

nf-core/scnanoseq

Introduction

nf-core/scnanoseq is a bioinformatics best-practice analysis pipeline for 10X Genomics single-cell/nuclei RNA-seq data derived from Oxford Nanopore Q20+ chemistry (R10.4 flow cells, >Q20). Because reads are expected to exceed Q20, the pipeline does not depend on paired Illumina data. scnanoseq can also process Oxford Nanopore data generated with older chemistry, but we encourage use of the Q20+ chemistry.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline summary

scnanoseq diagram

  1. Raw read QC (FastQC, NanoPlot, NanoComp and ToulligQC)
  2. Unzip and split FastQ (gunzip)
    1. Optional: Split fastq for faster processing (split)
  3. Trim and filter reads (NanoFilt)
  4. Post trim QC (FastQC, NanoPlot and ToulligQC)
  5. Barcode detection using a custom whitelist or 10X whitelist (BLAZE)
  6. Extract barcodes. Consists of the following steps:
    1. Parse FASTQ files into R1 reads containing barcode and UMI and R2 reads containing sequencing without barcode and UMI (custom script ./bin/pre_extract_barcodes.py)
    2. Re-zip FASTQs (pigz)
  7. Post-extraction QC (FastQC, NanoPlot and ToulligQC)
  8. Alignment (minimap2)
  9. BAM processing (SAMtools):
    1. SAM to BAM
    2. Filtering to retain mapped reads only
    3. Sorting, indexing and collecting mapping metrics
  10. Post-mapping QC in unfiltered BAM files (NanoComp, RSeQC)
  11. Barcode tagging with read quality, BC, BC quality, UMI, and UMI quality (custom script ./bin/tag_barcodes.py)
  12. Barcode correction (custom script ./bin/correct_barcodes.py)
  13. Post correction QC for corrected bams (SAMtools)
  14. UMI-based deduplication (UMI-tools)
  15. Gene- and transcript-level matrix generation (IsoQuant)
  16. Preliminary matrix QC (Seurat)
  17. Compile QC for raw reads, trimmed reads, pre and post-extracted reads, mapping metrics and preliminary single-cell/nuclei QC (MultiQC)
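
Step 6.1 above can be pictured with a minimal sketch. This is not the code in ./bin/pre_extract_barcodes.py: the function name split_read, the SplitRead type and the idea that the barcode offset is already known are all assumptions made for illustration. The lengths assume a 16 bp barcode followed by a 12 bp UMI, and the toy read omits details of real reads such as the poly(T) stretch.

```python
from typing import NamedTuple


class SplitRead(NamedTuple):
    r1_seq: str   # barcode + UMI
    r1_qual: str
    r2_seq: str   # sequence downstream of the UMI
    r2_qual: str


def split_read(seq: str, qual: str, bc_start: int,
               bc_len: int = 16, umi_len: int = 12) -> SplitRead:
    """Split one read into an R1 part (barcode + UMI) and an R2 part (the rest).

    bc_start is the 0-based offset of the barcode; it is assumed to be known.
    Everything upstream of bc_start (the adapter) is discarded.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    r1_end = bc_start + bc_len + umi_len
    if bc_start < 0 or r1_end > len(seq):
        raise ValueError("barcode/UMI window falls outside the read")
    return SplitRead(seq[bc_start:r1_end], qual[bc_start:r1_end],
                     seq[r1_end:], qual[r1_end:])


# Toy read: adapter, then barcode, then UMI, then a short cDNA fragment.
adapter = "CTACACGACGCTCTTCCGATCT"
barcode = "ACGTACGTACGTACGT"
umi = "TTGCATTGCATT"
cdna = "GGGAAACCC"
read = adapter + barcode + umi + cdna

parts = split_read(read, "I" * len(read), bc_start=len(adapter))
print(parts.r1_seq)  # barcode followed by UMI
print(parts.r2_seq)  # cDNA portion
```

Keeping the quality strings sliced in step with the sequence is what allows the later tagging step to record barcode and UMI qualities alongside the bases.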

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq,cell_count
CONTROL_REP1,AEG588A1_S1.fastq.gz,5000
CONTROL_REP1,AEG588A1_S2.fastq.gz,5000
CONTROL_REP2,AEG588A2_S1.fastq.gz,5000
CONTROL_REP3,AEG588A3_S1.fastq.gz,5000
CONTROL_REP4,AEG588A4_S1.fastq.gz,5000
CONTROL_REP4,AEG588A4_S2.fastq.gz,5000
CONTROL_REP4,AEG588A4_S3.fastq.gz,5000

Each row represents a single-end FASTQ file. Rows with the same sample identifier are considered technical replicates and are merged automatically. cell_count is the number of cells you expect for that sample.
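
To check how a samplesheet will be grouped before launching a run, a short sketch such as the following can help. It is not part of the pipeline: the helper name group_by_sample is hypothetical, and the check that replicate rows agree on cell_count is an assumption about what a sensible samplesheet looks like, not a documented pipeline rule.

```python
import csv
import io
from collections import defaultdict

# A shortened copy of the example samplesheet.
SAMPLESHEET = """\
sample,fastq,cell_count
CONTROL_REP1,AEG588A1_S1.fastq.gz,5000
CONTROL_REP1,AEG588A1_S2.fastq.gz,5000
CONTROL_REP2,AEG588A2_S1.fastq.gz,5000
"""


def group_by_sample(handle):
    """Return {sample: {"fastqs": [...], "cell_count": int}} from a samplesheet."""
    groups = defaultdict(lambda: {"fastqs": [], "cell_count": None})
    for row in csv.DictReader(handle):
        entry = groups[row["sample"]]
        count = int(row["cell_count"])
        # Assumption: technical replicates should share one expected cell count.
        if entry["cell_count"] not in (None, count):
            raise ValueError(f"conflicting cell_count for {row['sample']}")
        entry["cell_count"] = count
        entry["fastqs"].append(row["fastq"])
    return dict(groups)


groups = group_by_sample(io.StringIO(SAMPLESHEET))
for sample, info in groups.items():
    print(sample, info["fastqs"], info["cell_count"])
```

To read a file on disk instead of the embedded string, pass an open file handle (opened with newline="") to group_by_sample.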

nextflow run nf-core/scnanoseq \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

This pipeline produces feature-barcode matrices at both the gene and transcript level and can retain introns within the counts themselves. These files can be ingested directly by most packages used for downstream analysis, such as Seurat. In addition, the pipeline produces a number of quality control metrics to ensure that the processed samples meet the metrics expected for single-cell/nuclei data.

Troubleshooting

If you experience any issues, please submit an issue on the GitHub repository. Resolutions for some common issues are noted below:

  • One issue that has been observed is a recurrent node failure on Slurm clusters that appears to be related to the submission of Nextflow jobs. This issue is not caused by the pipeline itself but by Nextflow. Our research computing team is currently working on a resolution. In the meantime, two methods appear to help should this issue arise:
    1. Create a custom config that increases the memory request for the job that failed. It may take a couple of attempts to find the correct request, but we have noted that this error is occasionally accompanied by a memory issue.
    2. Request an interactive session with a generous amount of time, memory and CPUs, and run the pipeline on that single node. This will take longer because there is minimal parallelization, but it does seem to resolve the issue.
  • We acknowledge that analyzing PromethION data is a common use case for this pipeline. Currently, the pipeline has been developed with defaults suited to GridION and average-sized PromethION data. In cases where jobs failed on larger PromethION datasets, we overrode the defaults with a custom configuration file (provided by the -c Nextflow option) in which resources were increased (substantially in some cases). Below are some of the overrides we have used. These amounts may not work on every dataset, but they should at least indicate which processes need more resources:

process {
    withName: '.*:BLAZE' {
        cpus = 24
        ext.args = '--threads 30'
    }
}

process {
    withName: '.*:SAMTOOLS_SORT' {
        cpus = 20
    }
}

process {
    withName: '.*:MINIMAP2_ALIGN' {
        cpus = 20
    }
}

process {
    withName: '.*:ISOQUANT' {
        ext.args = {
            [
                "--threads 30",
                "--complete_genedb",
                params.stranded == "forward" ? "--stranded forward" : params.stranded == "reverse" ? "--stranded reverse" : "--stranded none",
            ].join(' ').trim()
        }
        time = '135.h'
    }
}

Credits

nf-core/scnanoseq was originally written by Austyn Trull and Dr. Lara Ianov.

We would also like to thank the following people and groups for their support, including financial support:

  • Dr. Elizabeth Worthey
  • University of Alabama at Birmingham Biological Data Science Core (U-BDS), RRID:SCR_021766, https://github.com/U-BDS
  • Support from: 3P30CA013148-48S8

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #scnanoseq channel (you can join with this invite).

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

scnanoseq's People

Contributors

atrull314, lianov, salome-brunon


scnanoseq's Issues

Remove filtered BAMs stats from MultiQC and add details in the documentation

Description of feature

  1. For readability of the MultiQC Samtools Percent Mapped stats, we can remove filtered BAMs (e.g. corrected BAM after minimap2) and retain only the minimap2 stats of mapped vs. unmapped reads.
  2. However, users should be aware that the BAMs produced after minimap2 have been filtered to remove unmapped reads. We should add this to the documentation.

Investigate Nanocomp report overwrites

Description of the bug

Given multiple FASTQ files, NanoPlot and NanoComp overwrite their own reports. We need to fix the output structure to prevent this behavior when multiple FASTQ files are provided for the same sample.

Slack discussion for reference is here

Add 5' support

Description of feature

The pipeline currently supports 3' 10X chemistry but we can add 5' support at minimal effort. We are noting this here to make it clear that this feature is in the roadmap for scnanoseq, however (depending on time) it may not be present in the first release.

Speed up preextract step

Description of feature

Pre-extraction of barcodes is a bottleneck for the pipeline with larger datasets. Look into ways to speed this step up so that run time is not a concern.

tmp dir issue in `SORT_GTF`

Description of the bug

A temporary file creation issue was noted when using the Singularity container for SORT_GTF; it seems to be container-specific, so we may need to find an alternative container. Note that this was only encountered with the full file test.

Command used and terminal output

params:

# names/email
multiqc_title: "scnanoseq_complete_GridION_test"

# input samplesheet
input: "./samplesheet.csv"

# Genome references
fasta: "/data/project/U_BDS/References/GENCODE/mm10/release_M20/Genome/GRCm38.primary_assembly.genome.fa"
gtf: "/data/project/U_BDS/References/GENCODE/mm10/release_M20/GTF/gencode.vM20.annotation.gtf"

# other params
skip_trimming: false
trimming_software: "nanofilt"
skip_nanoplot: false 
#skip_nanoplot: true

# fastq options
split_amount: 0 

# barcode options
cell_barcode_pattern: ""
cell_amount: 1000
identifier_pattern: "fixed_seq_1,cell_barcode_1,umi_1,fixed_seq_2"
cell_barcode_lengths: "16"
umi_lengths: "12"
fixed_seqs: "CTACACGACGCTCTTCCGATCT, TTTTTTTTTT"
#intron_retention_method: "1"
#counts_level: "gene" # null by default

execution of pipeline:

nextflow run main.nf \
    -params-file ./params.yml \
    --outdir ./results \
    -profile cheaha \
    -resume

Command error:
  sort: cannot create temporary file in '/scratch/lianov': No such file or directory

Relevant files

No response

System information

UAB HPC - Cheaha (with cheaha config)

Find alternative optimization solution to nf-core modules to avoid linting issues

Description of feature

Given the expectation of PromethION datasets, some processes that are nf-core modules (e.g. nanoplot) require changes to their default resource allocations, which are very low by default.

This was accomplished in the past by setting the resource allocation in modules.conf, e.g.:

  withName: '.*:NANOPLOT' {
      memory = '45.GB'
  }

However, for the test profile run, the above resource allocation overrides the test profile (where max_memory = '6.GB') and causes the GitHub Actions run to fail, because 45 GB is too high there and certainly not needed for the test run.

We need to find a better way to optimize resources without changing the module directly (which causes linting issues, e.g. "Local copy of module does not match remote") and without using modules.conf, given the issue above.

[Note: using custom.conf works as expected, but is there an alternative? Or way to enable custom.conf for GitHub Actions?]

Enhance isoquant outputs for downstream analysis

Description of feature

Current IsoQuant outputs contain a # in the first row before feature_id. We want to remove this # so that functions such as read.table do not treat the header as a comment and drop the cell names, which should be preserved as column names for downstream analysis.
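
One possible way to do this is a small post-processing step, sketched here in Python. The function name strip_header_hash and the toy matrix are illustrative assumptions; the real IsoQuant files are larger but are assumed to be tab-separated with the same kind of header line.

```python
import io


def strip_header_hash(lines):
    """Yield lines unchanged, except that a leading '#' is removed from the first line."""
    for index, line in enumerate(lines):
        if index == 0 and line.startswith("#"):
            line = line[1:].lstrip(" ")
        yield line


# Toy count matrix: a header with cell barcodes, then one row per feature.
raw = io.StringIO("#feature_id\tAAACCC\tAAAGGG\nGENE1\t3\t0\nGENE2\t1\t7\n")
cleaned = list(strip_header_hash(raw))
print("".join(cleaned), end="")
```

Because the function is a generator over lines, it can be applied to a large matrix file without loading the whole file into memory.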

Add support for technical replicates

In certain cases, users may have multiple runs from non-PromethION sequencers (e.g. multiple GridION runs) that are linked to the same sample of origin in order to acquire sufficient read depth per sample (thus several technical replicates).

Therefore, optional merges will need to be added to handle technical replicates. A potential test dataset is the PromethION data from BLAZE (create "technical replicates" from it as downsampled files).
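
As a rough illustration of what such a merge could look like, the sketch below concatenates gzip-compressed FASTQ files byte for byte. It relies on the fact that a file made of several gzip members back to back is itself a valid gzip file. The function name merge_fastq_gz and the toy reads are assumptions for illustration, not the pipeline's implementation.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def merge_fastq_gz(inputs, output):
    """Concatenate gzip-compressed FASTQ files byte for byte into one file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    runs = []
    # Write two tiny "technical replicate" files, one read each.
    for i in range(2):
        path = tmp / f"run{i}.fastq.gz"
        with gzip.open(path, "wt") as fh:
            fh.write(f"@read{i}\nACGT\n+\nIIII\n")
        runs.append(path)

    merged = tmp / "merged.fastq.gz"
    merge_fastq_gz(runs, merged)

    # Reading the merged file back yields the records of both inputs in order.
    with gzip.open(merged, "rt") as fh:
        merged_text = fh.read()

print(merged_text)
```

Concatenating the compressed bytes avoids decompressing and recompressing the data, which matters for files of the size produced by long-read sequencers.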

Investigate uncached TAG_BARCODES

Description of the bug

For sample ERR9958135 (PromethION), the run completed but TAG_BARCODES was not cached. We need to investigate this behavior further to prevent long re-run times when the pipeline is restarted for a minor change (e.g. a MultiQC-only run).

Unexpected warning to process dependent on params

Description of the bug

WARN: There's no process matching config selector: .*:SPLIT_FILE -- Did you mean: SPLIT_FILE?

When split_amount: 0, the SPLIT_FILE process is not expected to run, yet the warning above is shown during the run. This is likely also the case for other processes that depend on user params or sit under conditionals, and it will likely confuse users.

This seems similar to what was previously reported in nf-core/tools#1288 (comment). A possible fix is to add an if to modules.config for this process, and to double-check that all others have one (at this time most QC processes at least do).

Command used and terminal output

# default param to split_amount is 0

nextflow run main.nf \
    --outdir ./results \
    -profile test,cheaha \
    -resume

Relevant files

No response

System information

Cheaha
N E X T F L O W ~ version 23.04.1

Fix parsing of UMI tools dedup output to MultiQC

Description

Currently, given our strategy of splitting files by chromosome to speed up deduplication, multiple QC files are reported in the MultiQC report (one dedup QC file per chromosome). With multiple samples this over-crowds the report, both in the General Stats section and in the UMI-tools section. A few strategies around this may include:

  1. Only include main chromosomes (no haplotypes etc., given the lower read counts in these regions). This may still be too much for the report, but is worth a thought (the second option is more straightforward).
  2. Create a new UMI-tools stats output based on overall metrics from split files.
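
The second option could be sketched roughly as follows. The per-chromosome dictionaries and the field names input_reads and output_reads are placeholders, not actual UMI-tools output columns, and the summarise helper is hypothetical. Only additive counts can be combined this way; per-file means or medians would have to be recomputed from the underlying data.

```python
from collections import Counter

# Hypothetical per-chromosome counts for one sample; the field names are
# placeholders, not actual UMI-tools output columns.
per_chromosome = {
    "chr1": {"input_reads": 1000, "output_reads": 800},
    "chr2": {"input_reads": 500, "output_reads": 450},
    "chrUn_JH584304": {"input_reads": 10, "output_reads": 10},
}


def summarise(stats_by_chrom):
    """Sum additive counts across chromosomes and derive a duplication rate."""
    totals = Counter()
    for counts in stats_by_chrom.values():
        totals.update(counts)  # adds the counts key by key
    summary = dict(totals)
    if summary.get("input_reads"):
        summary["percent_duplicates"] = round(
            100 * (1 - summary["output_reads"] / summary["input_reads"]), 2)
    return summary


sample_summary = summarise(per_chromosome)
print(sample_summary)
```

A single summary per sample keeps the report at one row per sample regardless of how many chromosomes the deduplication was split over.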
