This project forked from cfia-ncfad/nf-flu

Influenza genome analysis Nextflow workflow

License: MIT License

CFIA-NCFAD/nf-flu - Influenza A and B Virus Genome Assembly Nextflow Workflow

Introduction

nf-flu is a Nextflow bioinformatics analysis pipeline for assembly and H/N subtyping of Influenza A and B viruses from Illumina or Nanopore sequencing data. Since Influenza has a segmented genome consisting of 8 gene segments, the pipeline will automatically select the top matching reference sequence from NCBI for each gene segment based on IRMA assembly and nucleotide BLAST against all Influenza sequences from NCBI. Users can also provide their own reference sequences to include in the top reference sequence selection process. After reference sequence selection, the pipeline performs read mapping to each reference sequence, variant calling and depth-masked consensus sequence generation.
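
As a rough illustration of what "depth-masked" consensus generation means, the following Python sketch replaces consensus bases at low-coverage positions with N. The function name, the min_depth threshold of 10 and the example data are assumptions made for illustration; they are not taken from the pipeline, which performs this step with its own tools and settings.

```python
from typing import Sequence


def mask_low_depth(
    consensus: str,
    depths: Sequence[int],
    min_depth: int = 10,  # hypothetical threshold, not the pipeline default
    mask_char: str = "N",
) -> str:
    """Return the consensus with every position below min_depth masked."""
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_depth else mask_char
        for base, depth in zip(consensus, depths)
    )


if __name__ == "__main__":
    # Positions 3 and 4 have depth 3 and 0, so they are masked.
    print(mask_low_depth("ATGGCA", [25, 30, 3, 0, 12, 40]))  # ATNNCA
```

The sketch assumes one depth value per consensus position and ignores insertions and deletions, which a real implementation has to account for.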

Pipeline summary

  1. Download latest NCBI Orthomyxoviridae sequences and metadata (parsed from NCBI Viruses FTP data).
  2. Merge reads of re-sequenced samples with cat, if needed.
  3. Assembly of Influenza gene segments with IRMA using the built-in FLU module.
  4. Nucleotide BLAST search against NCBI Influenza DB sequences.
  5. H/N subtype prediction and Excel XLSX report generation based on BLAST results.
  6. Automatically select top match reference sequences for segments.
  7. Read mapping, variant calling and consensus sequence generation for each segment against top reference sequence based on BLAST results.
  8. Annotation of consensus sequences with VADR.
  9. MultiQC report generation.
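
The reference selection in step 6 can be pictured as keeping the best BLAST hit for each gene segment. The Python sketch below shows that idea under simplifying assumptions: the BlastHit record, the select_top_references helper, ranking by bitscore alone and the placeholder accessions are all hypothetical and do not describe the pipeline's actual selection logic.

```python
from typing import Dict, Iterable, NamedTuple


class BlastHit(NamedTuple):
    """One BLAST hit, reduced to the fields this sketch needs (hypothetical)."""

    segment: int      # gene segment number (1-8), assumed already assigned
    subject_id: str   # identifier of the matched reference sequence
    bitscore: float   # BLAST bitscore, used here as the only ranking criterion


def select_top_references(hits: Iterable[BlastHit]) -> Dict[int, BlastHit]:
    """Keep the highest-scoring hit per segment; the first hit wins on ties."""
    best: Dict[int, BlastHit] = {}
    for hit in hits:
        current = best.get(hit.segment)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.segment] = hit
    return best


if __name__ == "__main__":
    example_hits = [
        BlastHit(4, "REF_A", 900.0),   # placeholder identifiers, not real accessions
        BlastHit(4, "REF_B", 1200.0),
        BlastHit(6, "REF_C", 800.0),
    ]
    for segment, hit in sorted(select_top_references(example_hits).items()):
        print(segment, hit.subject_id)
```

A real selection would likely also weigh alignment length, identity and any user-supplied reference sequences, none of which this sketch models.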

Quick Start

  1. Install Nextflow (>=21.04.0).

  2. Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (use Conda only as a last resort).

  3. Download the pipeline and test it on a minimal dataset with a single command:

    For Illumina workflow test:

    nextflow run CFIA-NCFAD/nf-flu -profile test_illumina,<docker/singularity/podman/shifter/charliecloud/conda> \
      --max_cpus $(nproc) # use all available CPUs; default is 2

    For Nanopore workflow test:

    nextflow run CFIA-NCFAD/nf-flu -profile test_nanopore,<docker/singularity/podman/shifter/charliecloud/conda> \
      --max_cpus $(nproc) # use all available CPUs; default is 2

    • If you are using Singularity, the pipeline will auto-detect this and attempt to download the Singularity images directly instead of converting them from Docker images. If you persistently observe timeout or network issues when downloading Singularity images directly, use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to pre-download all required containers with the nf-core download command before running the pipeline, and to set the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options so that images are stored in and re-used from a central location for future pipeline runs.
    • If you are using Conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.

  4. Run your own analysis

    • [Optional] Generate an input samplesheet from a directory containing Illumina FASTQ files (e.g. /path/to/illumina_run/Data/Intensities/Basecalls/) with the included Python script fastq_dir_to_samplesheet.py before you run the pipeline (requires Python 3 installed locally) e.g.

      python ~/.nextflow/assets/CFIA-NCFAD/nf-flu/bin/fastq_dir_to_samplesheet.py \
        -i /path/to/illumina_run/Data/Intensities/Basecalls/ \
        -o samplesheet.csv

    • Typical command for the Illumina platform

      nextflow run CFIA-NCFAD/nf-flu \
        --input samplesheet.csv \
        --platform illumina \
        -profile <docker/singularity/podman/shifter/charliecloud/conda>

    • Typical command for the Nanopore platform

      nextflow run CFIA-NCFAD/nf-flu \
        --input samplesheet.csv \
        --platform nanopore \
        -profile <docker/singularity/conda>
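
The optional samplesheet step above amounts to grouping FASTQ files by sample and writing one row per sample. The Python sketch below illustrates that idea only; it is not the bundled fastq_dir_to_samplesheet.py script. The bcl2fastq-style file name pattern, the helper names and the column names sample, fastq_1 and fastq_2 are assumptions for illustration, so consult the pipeline's usage documentation for the samplesheet format it actually expects.

```python
import csv
import re
from pathlib import Path
from typing import Dict

# Assumed bcl2fastq-style names, e.g. SampleA_S1_L001_R1_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>.+?)_S\d+(?:_L\d{3})?_R(?P<read>[12])_001\.fastq\.gz$"
)


def pair_fastqs(fastq_dir: Path) -> Dict[str, Dict[str, str]]:
    """Map each sample name to its R1/R2 file paths (keys "1" and "2")."""
    pairs: Dict[str, Dict[str, str]] = {}
    for path in sorted(fastq_dir.glob("*.fastq.gz")):
        match = FASTQ_RE.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        # If a sample spans several lanes, only the last file per read is kept.
        pairs.setdefault(match["sample"], {})[match["read"]] = str(path.resolve())
    return pairs


def write_samplesheet(pairs: Dict[str, Dict[str, str]], out_csv: Path) -> None:
    """Write one CSV row per sample; the column names are hypothetical."""
    with out_csv.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq_1", "fastq_2"])
        for sample, reads in sorted(pairs.items()):
            writer.writerow([sample, reads.get("1", ""), reads.get("2", "")])
```

For example, write_samplesheet(pair_fastqs(Path("/path/to/illumina_run/Data/Intensities/Basecalls/")), Path("samplesheet.csv")) would write a sheet for the directory used in the command above.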

Documentation

The nf-flu pipeline comes with:

Resources and References

BCFtools and SAMtools

Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H., 2021. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008. https://doi.org/10.1093/gigascience/giab008

BLAST Basic Local Alignment Search Tool

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L., 2009. BLAST+: architecture and applications. BMC Bioinformatics 10, 421. https://doi.org/10.1186/1471-2105-10-421

Clair3

Zheng, Z., Li, S., Su, J., Leung, A.W.-S., Lam, T.-W., Luo, R., 2022. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat Comput Sci 2, 797–803. https://doi.org/10.1038/s43588-022-00387-x

IRMA Iterative Refinement Meta-Assembler

Shepard, S.S., Meno, S., Bahl, J., Wilson, M.M., Barnes, J., Neuhaus, E., 2016. Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler. BMC Genomics 17, 708. https://doi.org/10.1186/s12864-016-3030-6

Medaka is deprecated in favour of Clair3 for variant calling of Nanopore data.

Minimap2 is used for rapid and accurate read alignment to reference sequences.

Li, H., 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100. https://doi.org/10.1093/bioinformatics/bty191

Mosdepth is used for rapid sequencing coverage calculation and summary statistics.

Pedersen, B.S., Quinlan, A.R., 2017. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34, 867–868. https://doi.org/10.1093/bioinformatics/btx699

MultiQC is used for generation of a single report for multiple tools.

Ewels, P., Magnusson, M., Lundin, S., Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048. https://doi.org/10.1093/bioinformatics/btw354

nf-flu relies on publicly available Influenza sequence data from the NCBI Influenza Virus Resource, which the pipeline downloads from the NCBI FTP site.

NCBI Influenza Virus Resource:

Bao, Y., Bolotov, P., Dernovoy, D., Kiryutin, B., Zaslavsky, L., Tatusova, T., Ostell, J., Lipman, D., 2008. The influenza virus resource at the National Center for Biotechnology Information. J Virol 82, 596–601. https://doi.org/10.1128/JVI.02005-07

NCBI Influenza Virus Sequence Annotation Tool:

Bao, Y., Bolotov, P., Dernovoy, D., Kiryutin, B., Tatusova, T., 2007. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Res 35, W280–W284. https://doi.org/10.1093/nar/gkm354

nf-flu is implemented in Nextflow.

Di Tommaso, P., Chatzou, M., Floden, E.W., Barja, P.P., Palumbo, E., Notredame, C., 2017. Nextflow enables reproducible computational workflows. Nat Biotechnol 35, 316–319. https://doi.org/10.1038/nbt.3820

nf-core is a great resource for building robust and reproducible bioinformatics pipelines.

Ewels, P.A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M.U., Di Tommaso, P., Nahnsen, S., 2020. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38, 276–278. https://doi.org/10.1038/s41587-020-0439-x

seqtk is used for rapid manipulation of FASTA/Q files. Available from GitHub at lh3/seqtk

VADR is used for annotation of Influenza virus sequences.

Schäffer, A.A., Hatcher, E.L., Yankie, L., Shonkwiler, L., Brister, J.R., Karsch-Mizrachi, I., Nawrocki, E.P., 2020. VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211. https://doi.org/10.1186/s12859-020-3537-3

table2asn is used to convert the VADR Feature Table output to GenBank format, which in turn makes conversion to other formats such as FASTA and GFF easier.

Credits

The nf-flu pipeline was originally developed by Peter Kruczkiewicz from CFIA-NCFAD. Hai Nguyen extended the pipeline for Nanopore data analysis.
