An awesome BCL demultiplexing and FastQ quality-control pipeline

Home Page: https://openomics.github.io/weave/

License: MIT License



weave 🔬

An awesome demultiplexing and quality control pipeline


This is the home of the demultiplexing and quality control pipeline, weave. Its long-term goal is to serve as a push-button pipeline for demultiplexing and for creating quality-control artifacts to assess the quality of incoming Illumina sequencing datasets.

weave DAG

Overview

Welcome to weave's documentation! This guide is the main source of documentation for users getting started with weave.

The ./weave pipeline is composed of two subcommands to set up and run the pipeline across different systems. Each of the available subcommands performs a different function:

weave run
Run the weave pipeline with your input files.

weave cache
Downloads the reference files for the pipeline to a selected directory.
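A two-subcommand layout like this is commonly dispatched with argparse subparsers. The sketch below is purely illustrative: the option names (`--input`, `--output`, `--dir`) are hypothetical placeholders, not weave's actual CLI flags.

```python
import argparse

def build_parser():
    # Hypothetical sketch of a two-subcommand CLI like weave's;
    # option names are illustrative, not weave's real interface.
    parser = argparse.ArgumentParser(prog="weave")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Run the pipeline on your input files")
    run.add_argument("--input", required=True, help="Run directory or FASTQ files")
    run.add_argument("--output", required=True, help="Output directory")

    cache = sub.add_parser("cache", help="Download reference files")
    cache.add_argument("--dir", required=True, help="Directory for reference files")
    return parser

# Example dispatch: `weave cache --dir /data/refs`
args = build_parser().parse_args(["cache", "--dir", "/data/refs"])
print(args.command, args.dir)
```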

weave is a two-pronged pipeline: the first prong detects and uses the appropriate Illumina software to demultiplex the ensemble collection of reads into their individual samples and converts the sequencing information into the FASTQ file format. From there, the second prong is a distributed, parallelized step that uses a variety of commonly accepted next-generation sequencing tools to report, visualize, and calculate the quality of the reads after sequencing. weave makes use of the ubiquitous containerization software Singularity2 for modularity and the robust pipelining DSL Snakemake3.
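The first prong's tool detection can be imagined as a small dispatch on the run directory's contents. This is a hypothetical sketch, not weave's actual detection logic; keying on the run-parameters filename casing is purely illustrative.

```python
def pick_demultiplexer(run_files):
    """Hypothetical sketch: choose an Illumina demultiplexer from the file
    names present in a run directory. weave's real detection may differ."""
    # BCL Convert is Illumina's newer converter; bcl2fastq is the legacy one.
    # The filename-casing heuristic below is for illustration only.
    if "RunParameters.xml" in run_files:
        return "bcl-convert"
    if "runParameters.xml" in run_files:
        return "bcl2fastq"
    raise ValueError("no run parameters file found")

print(pick_demultiplexer(["RunInfo.xml", "RunParameters.xml"]))
```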

weave's common use is to gauge the quality of reads for potential downstream analysis. Since bioinformatic analysis requires robust and accurate data to draw scientific conclusions, this helps save time and resources when analyzing the voluminous amount of sequencing data that is routinely collected.

The applications that weave uses to visualize and report quality metrics include MultiQC, fastp, FastQ Screen, Kaiju, and Kraken 2 (see References below).

Dependencies

System Requirements: singularity>=3.5
Python Requirements: snakemake>=5.14.0, pyyaml, progressbar, requests, terminaltables, tabulate

Please refer to the complete installation documents for detailed information.

Installation

# clone repo
git clone https://github.com/OpenOmics/weave.git
cd weave
# create virtual environment
python -m venv ~/.my_venv
# activate environment
source ~/.my_venv/bin/activate
# install dependencies
pip install -r requirements.txt


Contribute

This site is a living document, created for and by members like you. weave is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull requests to our GitHub repository.

References

1. Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for multiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048.
2. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
3. Köster, J. and Rahmann, S. (2018). "Snakemake—a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.
4. Menzel P., Ng K.L., Krogh A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257
5. Wingett SW and Andrews S. FastQ Screen: A tool for multi-genome mapping and quality control [version 2; referees: 4 approved]. F1000Research 2018, 7:1338
6. Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.
7. Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019).


weave's Issues

Documentation

  • Extend documentation
    • How to install, how to update config for references, fastq_screen + kraken + kaiju
  • GitHub Action to deploy

Add option to run disambiguate or separate reads from two organisms (i.e. host vs. virus or host vs. parasite)

Given two reference genomes, it would be awesome if we could examine the percent composition of each organism and split the reads by their respective organism. This would allow a user to take those split reads and run them in any of our other pipelines, depending on the project goal.

Options:

  • Add disambiguate to pipeline
    • Notes: takes aligned reads as input, would need to add different aligners for different inputs (DNA vs. RNA)
  • Run fastq_screen on both genomes, create bowtie2 indices and fastq_screen config file on the fly, split reads on tag
    • Notes: running bowtie2 on RNA is not ideal
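Creating a fastq_screen config file on the fly for the two genomes could look like the sketch below. The genome labels and bowtie2 index prefixes are placeholders, and the rendered file is a minimal example of fastq_screen's `THREADS`/`DATABASE` config format, not weave's actual output.

```python
def write_fastq_screen_conf(genomes, threads=8):
    """Hypothetical sketch: render a minimal fastq_screen config for a set
    of genomes, given prebuilt bowtie2 index prefixes (placeholders here)."""
    lines = ["THREADS %d" % threads]
    for label, index_prefix in genomes.items():
        # fastq_screen DATABASE lines: label, index prefix, aligner
        lines.append("DATABASE %s %s BOWTIE2" % (label, index_prefix))
    return "\n".join(lines) + "\n"

conf = write_fastq_screen_conf({
    "host": "/refs/host/bt2/host",     # placeholder index prefix
    "virus": "/refs/virus/bt2/virus",  # placeholder index prefix
})
print(conf)
```

Reads could then be split on fastq_screen's per-genome tags downstream.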

Alternative KRAKEN/KAIJU databases from command line

  • Command line argument to accept alternative KRAKEN2 database (other than the default in config/remote.json)
  • Command line argument to accept alternative Kaiju database (other than the default in config/remote.json)
  • Command line argument to accept alternative FastQ_Screen database (other than the default in config/remote.json)

Add these flags to the weave frontend CLI and flow them down to the rules.
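Wiring those overrides into the CLI could look like the sketch below. The flag names and the defaults dict are hypothetical stand-ins for whatever config/remote.json actually defines.

```python
import argparse

# Placeholder defaults standing in for config/remote.json's entries.
DEFAULT_DBS = {
    "kraken2": "/refs/kraken2/default",
    "kaiju": "/refs/kaiju/default",
    "fastq_screen": "/refs/fastq_screen/default.conf",
}

def add_db_flags(parser):
    """Hypothetical sketch: flags to override the default databases.
    Flag names are illustrative, not weave's actual interface."""
    parser.add_argument("--kraken2-db", default=DEFAULT_DBS["kraken2"],
                        help="Alternative KRAKEN2 database")
    parser.add_argument("--kaiju-db", default=DEFAULT_DBS["kaiju"],
                        help="Alternative Kaiju database")
    parser.add_argument("--fastq-screen-conf", default=DEFAULT_DBS["fastq_screen"],
                        help="Alternative FastQ Screen config")
    return parser

parser = add_db_flags(argparse.ArgumentParser(prog="weave run"))
args = parser.parse_args(["--kraken2-db", "/data/kraken2/custom"])
print(args.kraken2_db, args.kaiju_db)
```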

Downloading disambiguate reference files and alternative solutions

About
At the moment, the cache subcommand of the pipeline does not download disambiguate's reference files, i.e., the bwa indices for each of the supported reference genomes. As such, these reference files must exist on the host's filesystem prior to execution. These files have already been downloaded and exist on BigSky and Biowulf; however, if the pipeline were set up on another cluster, they would need to be downloaded outside the cache subcommand.

Here is an example command to download disambiguate's reference files from helix/biowulf:

rsync -rav -e ssh helix.nih.gov:/data/OpenOmics/references/genomes .

Road map
Here are some proposed long-term solutions:

  1. Move the reference files into our data-share directory for easy downloads, and update the cache subcommand to pull from this location.
  2. Build the alignment indices on the fly in the output directory and remove them in a post-processing hook. This should not be a rate-limiting step of the pipeline: it can start running during the bcl2fastq conversion and should complete well before trimming finishes. The only downside is a slight increase in disk usage while the pipeline is running; if the pipeline cleans up these files after the run completes, it's not a big deal.
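Option 2 can be sketched as assembling the index-building command up front and removing the index directory in a post-processing hook. The `bwa index -p` invocation is standard, but all paths are placeholders and nothing here is weave's actual implementation.

```python
import shutil
from pathlib import Path

def bwa_index_cmd(fasta, out_prefix):
    """Build the standard bwa index command; -p sets the index prefix.
    Paths are placeholders for illustration."""
    return ["bwa", "index", "-p", str(out_prefix), str(fasta)]

def cleanup_indices(index_dir):
    """Post-processing hook sketch: remove on-the-fly indices after the
    run completes. index_dir would live inside the output directory."""
    index_dir = Path(index_dir)
    if index_dir.is_dir():
        shutil.rmtree(index_dir)

cmd = bwa_index_cmd("/refs/host.fa", "/output/indices/host")
print(" ".join(cmd))
```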
