An awesome BCL demultiplexing and FastQ quality-control pipeline

Home Page: https://openomics.github.io/weave/

License: MIT License



weave 🔬

An awesome demultiplexing and quality control pipeline


This is the home of the demultiplexing and quality control pipeline, weave. Its long-term goal is to serve as a push-button pipeline for demultiplexing and for creating quality-control artifacts to assess the quality of incoming Illumina sequencing datasets.

weave DAG

Overview

Welcome to weave's documentation! This guide is the main source of documentation for users getting started with weave.

The ./weave pipeline is composed of two subcommands to set up and run the pipeline across different systems. Each of the available subcommands performs a different function:

weave run
Run the weave pipeline with your input files.

weave cache
Downloads the reference files for the pipeline to a selected directory.
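A two-subcommand layout like this is commonly dispatched with argparse subparsers. The sketch below is purely illustrative: the option names (`--input`, `--output`, `--dir`) are hypothetical placeholders, not weave's actual CLI flags.

```python
import argparse

def build_parser():
    # Hypothetical sketch of a two-subcommand CLI like weave's;
    # option names are illustrative, not weave's real interface.
    parser = argparse.ArgumentParser(prog="weave")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Run the pipeline on your input files")
    run.add_argument("--input", required=True, help="Run directory or FASTQ files")
    run.add_argument("--output", required=True, help="Output directory")

    cache = sub.add_parser("cache", help="Download reference files")
    cache.add_argument("--dir", required=True, help="Directory for reference files")
    return parser

# Example dispatch: `weave cache --dir /data/refs`
args = build_parser().parse_args(["cache", "--dir", "/data/refs"])
print(args.command, args.dir)
```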

weave is a two-pronged pipeline: the first prong detects and uses the appropriate Illumina software to demultiplex the ensemble collection of reads into their individual samples and converts the sequencing information into the FASTQ file format. From there, the second prong is a distributed, parallelized step that uses a variety of commonly accepted next-generation sequencing tools to report, visualize, and calculate the quality of the reads after sequencing. weave makes use of the ubiquitous containerization software Singularity2 for modularity and the robust pipelining DSL Snakemake3.
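The first prong's tool detection can be imagined as a small dispatch on the run directory's contents. This is a hypothetical sketch, not weave's actual detection logic; keying on the run-parameters filename casing is purely illustrative.

```python
def pick_demultiplexer(run_files):
    """Hypothetical sketch: choose an Illumina demultiplexer from the file
    names present in a run directory. weave's real detection may differ."""
    # BCL Convert is Illumina's newer converter; bcl2fastq is the legacy one.
    # The filename-casing heuristic below is for illustration only.
    if "RunParameters.xml" in run_files:
        return "bcl-convert"
    if "runParameters.xml" in run_files:
        return "bcl2fastq"
    raise ValueError("no run parameters file found")

print(pick_demultiplexer(["RunInfo.xml", "RunParameters.xml"]))
```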

weave's common use is to gauge the quality of reads for potential downstream analysis. Since bioinformatic analysis requires robust and accurate data to draw scientific conclusions, this helps save time and resources when analyzing the voluminous amount of sequencing data that is routinely collected.

The applications that weave uses to visualize and report quality metrics include MultiQC, fastp, FastQ Screen, Kaiju, and Kraken 2 (see References below).

Dependencies

System Requirements: singularity>=3.5
Python Requirements: snakemake>=5.14.0, pyyaml, progressbar, requests, terminaltables, tabulate

Please refer to the complete installation documents for detailed information.

Installation

# clone repo
git clone https://github.com/OpenOmics/weave.git
cd weave
# create virtual environment
python -m venv ~/.my_venv
# activate environment
source ~/.my_venv/bin/activate
# install dependencies
pip install -r requirements.txt


Contribute

This site is a living document, created for and by members like you. weave is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via pull requests to our GitHub repository.

References

1. Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for multiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048.
2. Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.
3. Köster, J. and Rahmann, S. (2018). "Snakemake—a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.
4. Menzel P., Ng K.L., Krogh A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257
5. Wingett SW and Andrews S. FastQ Screen: A tool for multi-genome mapping and quality control [version 2; referees: 4 approved]. F1000Research 2018, 7:1338
6. Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.
7. Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019).


weave's Issues

Documentation

  • Extend documentation
    • How to install, how to update config for references, fastq_screen + kraken + kaiju
  • GitHub Action to deploy

Add option to run disambiguate or separate reads from two organisms (i.e. host vs. virus or host vs. parasite)

Given two reference genomes, it would be awesome if we could examine the percent composition of each organism and split the reads by their respective organism. This would allow a user to take those split reads and run them in any of our other pipelines, depending on the project goal.

Options:

  • Add disambiguate to pipeline
    • Notes: takes aligned reads as input, would need to add different aligners for different inputs (DNA vs. RNA)
  • Run fastq_screen on both genomes, create bowtie2 indices and fastq_screen config file on the fly, split reads on tag
    • Notes: running bowtie2 on RNA is not ideal
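Creating a fastq_screen config file on the fly for the two genomes could look like the sketch below. The genome labels and bowtie2 index prefixes are placeholders, and the rendered file is a minimal example of fastq_screen's `THREADS`/`DATABASE` config format, not weave's actual output.

```python
def write_fastq_screen_conf(genomes, threads=8):
    """Hypothetical sketch: render a minimal fastq_screen config for a set
    of genomes, given prebuilt bowtie2 index prefixes (placeholders here)."""
    lines = ["THREADS %d" % threads]
    for label, index_prefix in genomes.items():
        # fastq_screen DATABASE lines: label, index prefix, aligner
        lines.append("DATABASE %s %s BOWTIE2" % (label, index_prefix))
    return "\n".join(lines) + "\n"

conf = write_fastq_screen_conf({
    "host": "/refs/host/bt2/host",     # placeholder index prefix
    "virus": "/refs/virus/bt2/virus",  # placeholder index prefix
})
print(conf)
```

Reads could then be split on fastq_screen's per-genome tags downstream.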

Alternative KRAKEN/KAIJU databases from command line

  • Command line argument to accept alternative KRAKEN2 database (other than the default in config/remote.json)
  • Command line argument to accept alternative Kaiju database (other than the default in config/remote.json)
  • Command line argument to accept alternative FastQ_Screen database (other than the default in config/remote.json)

Add these flags to the weave frontend CLI and flow them down to the rules.
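Wiring those overrides into the CLI could look like the sketch below. The flag names and the defaults dict are hypothetical stand-ins for whatever config/remote.json actually defines.

```python
import argparse

# Placeholder defaults standing in for config/remote.json's entries.
DEFAULT_DBS = {
    "kraken2": "/refs/kraken2/default",
    "kaiju": "/refs/kaiju/default",
    "fastq_screen": "/refs/fastq_screen/default.conf",
}

def add_db_flags(parser):
    """Hypothetical sketch: flags to override the default databases.
    Flag names are illustrative, not weave's actual interface."""
    parser.add_argument("--kraken2-db", default=DEFAULT_DBS["kraken2"],
                        help="Alternative KRAKEN2 database")
    parser.add_argument("--kaiju-db", default=DEFAULT_DBS["kaiju"],
                        help="Alternative Kaiju database")
    parser.add_argument("--fastq-screen-conf", default=DEFAULT_DBS["fastq_screen"],
                        help="Alternative FastQ Screen config")
    return parser

parser = add_db_flags(argparse.ArgumentParser(prog="weave run"))
args = parser.parse_args(["--kraken2-db", "/data/kraken2/custom"])
print(args.kraken2_db, args.kaiju_db)
```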

Downloading disambiguate reference files and alternative solutions

About
At the moment, the cache subcommand of the pipeline does not download disambiguate's reference files, i.e., the bwa indices for each of the supported reference genomes. As such, these reference files must exist on the host's filesystem prior to execution. These files have already been downloaded and exist on BigSky and Biowulf; however, if the pipeline were set up on another cluster, they would need to be downloaded outside the cache subcommand.

Here is an example command to download disambiguate's reference files from helix/biowulf:

rsync -rav -e ssh helix.nih.gov:/data/OpenOmics/references/genomes .

Road map
Here are some proposed long-term solutions:

  1. Move the reference files into our data-share directory for easy downloads, and update the cache subcommand to pull from this location.
  2. Build the alignment indices on the fly in the output directory and remove them in a post-processing hook. This should not be a rate-limiting step of the pipeline: it can start running during the bcl2fastq conversion and should complete well before trimming finishes. The only downside is a slight increase in disk usage while the pipeline is running; if the pipeline cleans up these files after the run completes, it's not a big deal.
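Option 2 can be sketched as assembling the index-building command up front and removing the index directory in a post-processing hook. The `bwa index -p` invocation is standard, but all paths are placeholders and nothing here is weave's actual implementation.

```python
import shutil
from pathlib import Path

def bwa_index_cmd(fasta, out_prefix):
    """Build the standard bwa index command; -p sets the index prefix.
    Paths are placeholders for illustration."""
    return ["bwa", "index", "-p", str(out_prefix), str(fasta)]

def cleanup_indices(index_dir):
    """Post-processing hook sketch: remove on-the-fly indices after the
    run completes. index_dir would live inside the output directory."""
    index_dir = Path(index_dir)
    if index_dir.is_dir():
        shutil.rmtree(index_dir)

cmd = bwa_index_cmd("/refs/host.fa", "/output/indices/host")
print(" ".join(cmd))
```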
