GAAP-BI

GAAP-BI v. 1.0 - Genome Assembly and Annotation Pipeline for Bacteria Illumina

MIT LICENSE - Copyright © 2022 Gian M.N. Benucci, Ph.D.
email: benucci[at]msu[dot]edu
GAAP-BI v.1, October 2022

This pipeline is based upon work supported by the Great Lakes Bioenergy Research Center, U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award DE-SC0018409

WARNING

This pipeline was built for running on a cluster computer that runs SLURM (Slurm Workload Manager) job scheduler. Any other use will require substantial modification of the code.

Installation

To use GAAP-BI just clone the directory

git clone https://github.com/Gian77/GAAP-BI.git

You will then need to

Copy your raw reads file in the rawdata directory together with a md5sum file.
Install all the necessary tools through conda (please see the complete list of tools reported below).
Download the databases (please see below and the config.yaml file), and include the full paths into the config file.
Select all the User options (please see below).

Then you should be good to go and run GAAP-BI by just sh GAAP-BI-v1.0.sh

Let's explore the settings in the `config.yaml` file

Directories

The main project's directory is: project_dir="/mnt/home/benucci/GAAP-BI/. Of course, you will need to adjust the path to your HPCC user name.

NOTE

This pipeline run using SLURM (please see bove). Resourches of each individual scripts present in the /mnt/home/benucci/GAAP-BI/code/ directory must be adjusted to the amount of data you want to run for each pipeline run. In particular the parameters below.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
#SBATCH --mem=256G
The individual scripts in the code direcotry include the buy-in node priority #SBATCH -A shade-cole-bonito. If you do not have access to those priority nodes please remove that line in the individual scripts.

You can change the name of the project_dir, but by default is going to be GAAP-BI. Subdirectories such as outputs and slurms are part of the workflow, and should be left as they are.

Please check the config file for options. A few script are additional and are can be avoided to save time.

Default to the `rawdata` directory

When the config is sourced it defaults to the rawdata project directory

cd $project_dir/rawdata/

Running options

If not numeric (i.e. float) then yes or no.

INTERLACED=yes
COVERAGE=2.0
LENGTH=500
MASH=yes
CHROMOSOMES=2
EGGNOG=yes
BAKTA=no

The INTERLACED variable is for when read are interlaced (i.e. R1 and R2 are in the same file), if interlaced then specify yes. The COVERAGE and LENGTH are selexplanaroty and refer to the assembled contigs. The STRING variable differentiate hifi form raw pacbio reads. Usually ccs is present in the raw file name, when found it activates the hifi mode in the Flye assembler. If you want to differentiate using a different string you can change it accordingly. The MASH variable is for running plasmid detection using Mash. The CHROMOSOMES variable is for circularize the genome (default 2, one contig is the bacterial genome and he other contig is the plasmid). However, It is known that some bacteria have multiple circular cromosomes (e.g., Rhodobacter sphaeroides), for reference see here. Feel free to increase to how many chromosomes you think the tazon you are working on may have. If the number of contigs is less than the number of specified chromosomes then Circlator is run on the contings to circularize them. The EGGNOG variable is to run eggnog-mapper to re-classify the proteins detected by Prokka. The BAKTA variable is for running the bakta annotation pipeline.

!For detailes on the tools mentioned above see below.

Databases

Find a place in your lab space (or in your home directory) where to put all the needed databases. Correct the directory hierarchy according to your HPCC account name. For example, in my case all the databases are in /mnt/research/ShadeLab/Benucci/databases/.

NCBI_nt="/mnt/research/ShadeLab/Benucci/databases/ncbi_nt1121"
kraken2_db="/mnt/research/ShadeLab/Benucci/databases/kraken2_db/"
minikraken2_db="/mnt/research/ShadeLab/Benucci/databases/minikraken2_db/"
platon_db="/mnt/research/ShadeLab/Benucci/databases/platon_db/db/"
mash_plsdb="/mnt/research/ShadeLab/Benucci/databases/plasmid_db/plsdb.msh"
mash_plsdb_meta="/mnt/research/ShadeLab/Benucci/databases/plasmid_db/plsdb.tsv"
busco_db="/mnt/research/ShadeLab/Benucci/databases/busco_db1121/bacteria_odb10"
phix_db="/mnt/research/ShadeLab/Benucci/databases/phix_index/my_phix"
gunc_db="/mnt/research/ShadeLab/Benucci/databases/gunc_db"
bakta_db="/mnt/research/ShadeLab/Benucci/databases/bakta_db/db"

PATHs

export GTDBTK_DATA_PATH=/mnt/research/ShadeLab/Benucci/databases/gtdb_tk/release207_v2
export EGGNOG_DATA_DIR=/mnt/research/ShadeLab/Benucci/databases/emapperdb/
export GUNC_DB=/mnt/research/ShadeLab/Benucci/databases/gunc_db/gunc_db_progenomes2.1.dmnd

Software needed to fully run Gisa-PB on HPCC

Please install via conda (or use the binaries in the HPPC) of all these software below.

FastQC, NanoStat, Flye, Minimap2, Racon, Pilon, Circlator, quast, qualimap, BLAST, BlobTools, Platon, Mash, Kraken 2, Bracken, CheckM, Gunc, GTDB-Tk, Barrnap, Metaxa2, eggNOG, Prokka, BUSCO, ABRicate, MultiQC, Bandage, R.

NOTE
Make sure you install tydiverse and ggplot2 along with R.

config options

Please check the config file for options. A few script are additional and can be avoided if needed.

Pipeline Results

All the useful results, generated reports, and slurms outputs for each script and for each assembled genome, will be available as tar.gz file at this PATH $project_dir/outputs/25_exportreports-bash. If you are interested in all the other files generated by the tools run by the peipeline you can explore all the directories in the $project_dir/outputs/ directory before erasing it.

Acknowledgements

Many thanks to the Institute for Cyber-Enabled Research (ICER) for helping troubshoting SLURM, the DOE Joint Genome Institute for providing the sequence data, methods, and metadata support, the whole crew at the Shade lab for brainstorming over methods and tools, and my good friend and colleague, Dr. Livio Antonielli at the AIT Austrian Institute of Technology GmbH for giving me a baseline and for the insiteful discussions and bioiformatic suggestions.

gian77 / gaap-bi Goto Github PK

gaap-bi's Introduction

GAAP-BI

This pipeline is based upon work supported by the Great Lakes Bioenergy Research Center, U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award DE-SC0018409

This pipeline was built for running on a cluster computer that runs SLURM (Slurm Workload Manager) job scheduler. Any other use will require substantial modification of the code.

Installation

Let's explore the settings in the config.yaml file

Directories

Default to the rawdata directory

Running options

Databases

PATHs

Software needed to fully run Gisa-PB on HPCC

FastQC, NanoStat, Flye, Minimap2, Racon, Pilon, Circlator, quast, qualimap, BLAST, BlobTools, Platon, Mash, Kraken 2, Bracken, CheckM, Gunc, GTDB-Tk, Barrnap, Metaxa2, eggNOG, Prokka, BUSCO, ABRicate, MultiQC, Bandage, R.

config options

Pipeline Results

Acknowledgements

gaap-bi's People

Contributors

Watchers

Forkers

Recommend Projects

Recommend Topics

Recommend Org

Jobs

Let's explore the settings in the `config.yaml` file

Default to the `rawdata` directory