The Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP)

This is the GitHub repository for the Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP). A growing number of tools are available for analysing bacterial metagenomic and metatranscriptomic samples, but these tools cannot profile specific metabolic gene clusters (MGCs), which have been shown to be major phenotype drivers. BiG-MAP therefore focusses on determining how MGCs and their homologs are represented in metagenomic and metatranscriptomic samples. These gene clusters can be readily obtained from (draft) bacterial genomes using antiSMASH or gutSMASH. To process the outputs of these tools into abundance and expression values, BiG-MAP consists of the following four scripts:

BiG-MAP.download.py
BiG-MAP.family.py
BiG-MAP.map.py
BiG-MAP.analyse.py

For information on how to run BiG-MAP, scroll down to Overview and example run.

Installation

Install the BiG-MAP dependencies using conda. Conda can be installed from miniconda_link. First clone the BiG-MAP repository from GitHub:

~$ git clone https://github.com/HAugustijn/BiG-MAP.git

Then install all the dependencies from the BiG-MAP_process.yml and BiG-MAP_analyse.yml files with:

# For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py
~$ conda env create -f BiG-MAP_process.yml -n BiG-MAP_process
~$ conda activate BiG-MAP_process

# For BiG-MAP.analyse.py
~$ conda env create -f BiG-MAP_analyse.yml -n BiG-MAP_analyse
~$ conda activate BiG-MAP_analyse

To make use of the second redundancy filtering step, download BiG-SCAPE using:

~$ git clone https://git.wageningenur.nl/medema-group/BiG-SCAPE

All dependencies are now installed and BiG-MAP is ready to use.

Overview and example run

A typical workflow for BiG-MAP consists of the following 4 consecutive steps:

Downloading WGS data using BiG-MAP.download.py
Generating gene cluster families (GCFs) and housekeeping gene families (HGFs) using BiG-MAP.family.py
Computing abundance and expression profiles of selected representatives from each GCF and HGF using BiG-MAP.map.py
Analysing the resulting BIOM file for profiles using BiG-MAP.analyse.py

The four steps are described below, and for each an example is provided.

1) BiG-MAP.download.py

This script downloads metagenomic and/or metatranscriptomic samples from the NCBI Sequence Read Archive (SRA). The samples are first downloaded in .SRA format and then converted into .fastq pairs using fastq-dump.

conda activate BiG-MAP_process
python3 BiG-MAP.download.py -h
python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]

To download the samples, go to the SRA Run Selector and enter the study code. For the IBD cohort of Schirmer et al. (2018), that is PRJNA389280. Next, select the runs and click Accession List to download the accession file. Use this file in the following command:

python3 BiG-MAP.download.py -A Acc_list.txt -O /mnt/scratch/usr001/fastq/schirmer/

Acc_list.txt:
SRR5983273
SRR5983265
SRR5983266
SRR5983268
SRR5983270
SRR5983271
SRR5983275
...
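
Before starting a long download it can help to check the accession file first. The sketch below is not part of BiG-MAP; the function name read_accessions is hypothetical, and it assumes the file holds one run accession per line (SRR, ERR or DRR followed by digits), as in the listing above.

```python
import re
from pathlib import Path

# Run accessions start with SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_accessions(path):
    """Return the unique run accessions from a one-per-line file, in file order.

    Raises ValueError on the first non-blank line that is not a run accession.
    """
    accessions = []
    seen = set()
    lines = Path(path).read_text().splitlines()
    for line_no, raw in enumerate(lines, start=1):
        acc = raw.strip()
        if not acc:
            continue  # skip blank lines
        if not RUN_ACCESSION.match(acc):
            raise ValueError(f"line {line_no}: not a run accession: {acc!r}")
        if acc not in seen:  # drop duplicates, keep first occurrence
            seen.add(acc)
            accessions.append(acc)
    return accessions
```

Calling read_accessions("Acc_list.txt") returns the list of accessions, or raises an error that names the offending line.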

2) BiG-MAP.family.py

The main purpose of this script is to compute GCFs and HGFs using sequence similarity as the sole metric. Protein sequences are used for the GCF computation, while DNA sequences are used for the HGF computation. Mash is used to compute both the GCFs and the HGFs. The input consists of the output directories of antiSMASH or gutSMASH. The available options are listed by running the script with the -h flag. General usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.family.py -h
python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]

For example, for a gutSMASH run on 1520 (draft) reference genomes that are present in the gut, with a Mash threshold of 0.1 for GCFs and 0.1 for HGFs, no flanking genes of the core, no genome fasta file outputs, 6 process cores and the additional BiG-SCAPE redundancy filtering step, the command is:

python3 BiG-MAP.family.py -tg 0.1 -th 0.1 -f 0 -g False -p 6 -D /mnt/scratch/usr001/gutSMASH-output/ -b /mnt/scratch/usr001/BiG-SCAPE_location/ -pf /mnt/scratch/usr001/pfam_files_location/  -O /mnt/scratch/usr001/results/

This yields:
BiG-MAP.GCF_HGF.bed = Bedfile to extract core regions in BiG-MAP.map.py
BiG-MAP.GCF_HGF.fna = Reference file to map the WGS reads to
BiG-MAP.GCF_HGF.json = Dictionary that contains the GCFs and HGFs
BiG-MAP.GCF.json = Dictionary that contains the BiG-SCAPE GCFs
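
A quick check that the family step produced its files can save a failed mapping run later. The sketch below is not part of BiG-MAP; the helper names are hypothetical. It assumes that BiG-MAP.GCF.json is only written when the BiG-SCAPE filtering step is used, and that each JSON file holds a single top-level dictionary whose internal layout is not assumed.

```python
import json
from pathlib import Path

# File names as listed above.
EXPECTED = ["BiG-MAP.GCF_HGF.bed", "BiG-MAP.GCF_HGF.fna", "BiG-MAP.GCF_HGF.json"]


def missing_outputs(outdir, with_bigscape=False):
    """Return the expected output file names that are absent from outdir."""
    names = list(EXPECTED)
    if with_bigscape:
        # Assumption: only present when BiG-SCAPE filtering was requested.
        names.append("BiG-MAP.GCF.json")
    return [name for name in names if not (Path(outdir) / name).is_file()]


def count_entries(json_path):
    """Return the number of top-level entries in a family JSON file."""
    with open(json_path) as fh:
        families = json.load(fh)
    return len(families)
```

An empty list from missing_outputs means every listed file is in place.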

3) BiG-MAP.map.py

This module aligns the WGS reads (paired or unpaired) to the reference representative of each GCF and HGF using Bowtie2. It computes RPKM, coverage and core coverage. The coverage is calculated using Bedtools and the read counts using Samtools. The general usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.map.py -h
python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} {-R [reference] -F [family] | -P [pickled file]} -O [outdir]  [Options*]
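
The exact formulas used by BiG-MAP.map.py are not spelled out here. As a rough guide, the conventional definition of RPKM and a simple breadth-of-coverage measure can be sketched as follows. The function names are hypothetical, and whether the total counts all reads or only mapped reads is an assumption left to the caller.

```python
def rpkm(read_count, region_length_bp, total_reads):
    """Reads Per Kilobase of region per Million reads in the sample.

    RPKM = read_count * 1e9 / (region_length_bp * total_reads)
    """
    if region_length_bp <= 0 or total_reads <= 0:
        raise ValueError("region length and total reads must be positive")
    return read_count * 1e9 / (region_length_bp * total_reads)


def breadth_of_coverage(per_base_depth):
    """Fraction of positions covered by at least one read.

    per_base_depth is a sequence of read depths, one value per base.
    Restricting it to the core positions of a cluster gives a core coverage.
    """
    if not per_base_depth:
        return 0.0
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return covered / len(per_base_depth)
```

For instance, 500 reads on a 2000 bp cluster in a sample of one million reads give an RPKM of 250.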

To map 10 reads from Schirmer et al. (2018) to the reference representatives of the GCFs and HGFs, and correct for the BiG-SCAPE GCFs, run:

NOTE: For downstream analysis it is important to also pass the metadata file with the -b flag.

python3 BiG-MAP.map.py -f False -s fast -th 10 -b /mnt/scratch/usr001/results/schirmer_metadata.txt -cc /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.bed -R /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.fna -I1 /mnt/scratch/usr001/fastq/schirmer/*pass_1* -I2 /mnt/scratch/usr001/fastq/schirmer/*pass_2* -O /mnt/scratch/usr001/results/ -F /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.json -bf /mnt/scratch/usr001/results/BiG-MAP.GCF.json
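
Because -I1 and -I2 are filled by two separate glob patterns, a missing mate file silently shifts the pairing. The sketch below is not part of BiG-MAP; pair_mates is a hypothetical helper that assumes the file names contain "pass_1" or "pass_2", as in the patterns above, and that the text before that tag identifies the sample.

```python
from pathlib import Path


def pair_mates(fastq_dir):
    """Map each sample prefix to its (mate 1, mate 2) paths.

    Raises ValueError if any sample has only one of the two mates.
    """
    mates = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        for tag, index in (("pass_1", 0), ("pass_2", 1)):
            if tag in path.name:
                # Text before the tag is taken as the sample identifier.
                prefix = path.name.split(tag)[0].rstrip("_")
                mates.setdefault(prefix, [None, None])[index] = path
    unpaired = sorted(p for p, pair in mates.items() if None in pair)
    if unpaired:
        raise ValueError(f"samples missing a mate: {', '.join(unpaired)}")
    return {prefix: tuple(pair) for prefix, pair in mates.items()}
```

Running it on the fastq directory before mapping confirms that both patterns will expand to the same samples in the same order.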

The schirmer_metadata.txt file is set up as follows (tab-delimited):
#run.ID	host.ID	SampleType	DiseaseStatus
SRR5947852	C3001C10_MGX	METAGENOMIC	CD
SRR5947945	C3001C10_MTX	METATRANSCRIPTOMIC	CD
SRR5947826	C3001C5_MGX	METAGENOMIC	CD
SRR5947900	C3001C5_MTX	METATRANSCRIPTOMIC	CD
SRR5947876	C3001C9_MGX	METAGENOMIC	CD
SRR5947934	C3001C9_MTX	METATRANSCRIPTOMIC	CD

Note that the header row must start with '#'.
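
Stray spaces or a missing '#' in the metadata file are easy to overlook. The sketch below is not part of BiG-MAP; read_metadata is a hypothetical helper that assumes the layout shown above: tab-delimited, with a first row that starts with '#'.

```python
import csv


def read_metadata(path):
    """Return (columns, rows) from a tab-delimited metadata file.

    The header row must start with '#'. Each data row becomes a dict
    keyed by column name. Raises ValueError on a malformed file.
    """
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader, None)
        if not header or not header[0].startswith("#"):
            raise ValueError("header row must start with '#'")
        # Strip the leading '#' and any padding around the column names.
        columns = [header[0].lstrip("#").strip()]
        columns += [name.strip() for name in header[1:]]
        rows = []
        for fields in reader:
            if not any(field.strip() for field in fields):
                continue  # skip empty lines
            if len(fields) != len(columns):
                raise ValueError(
                    f"line {reader.line_num}: expected {len(columns)} "
                    f"fields, found {len(fields)}"
                )
            rows.append(dict(zip(columns, (f.strip() for f in fields))))
    return columns, rows
```

A row with the wrong number of fields usually means that spaces were used instead of tabs.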

4) BiG-MAP.analyse.py

This module is a wrapper script for BiG-MAP.norm.R. The R script can also be run locally in RStudio, which is recommended for creating visualizations. The main drawback is that all dependencies must then be installed locally: the BiG-MAP_analyse environment takes care of this for the command line, but not for local RStudio analyses. The comments in the script explain how to run it locally. For example:

Scroll down to the main in BiG-MAP.norm.R
Edit and uncomment:
biom_file <- path/to/biom-file
MT <- condition
sampletype <- "METATRANSCRIPTOMIC" | "METAGENOMIC"
group_1 <- condition_1
group_2 <- condition_2
explore <- TRUE/FALSE

Run all the functions and analyse locally

If you want to run the analysis from the command line (e.g. in an automated analysis), first install all dependencies using the BiG-MAP_analyse.yml file, if not done already, and activate the BiG-MAP_analyse environment. Then, it works as follows:

python3 BiG-MAP.analyse.py inspect -h
python3 BiG-MAP.analyse.py inspect -B [biom_file] [options*]

Example:
python3 BiG-MAP.analyse.py inspect -B /mnt/scratch/usr001/BiG-MAP.map.biom -e /mnt/scratch/usr001/ -s metagenomic -m DiseaseStatus

Output: 
which conditions can be analysed
heatmap

To perform statistical testing on the biom file, use:

python3 BiG-MAP.analyse.py test -h
python3 BiG-MAP.analyse.py test -B [biom_file] -T [SampleType] -M [meta_group] -G [[groups]] -O [outdir]

Example:
python3 BiG-MAP.analyse.py test -B /mnt/scratch/usr001/BiG-MAP.map.biom -T metagenomic -M DiseaseStatus -G UC non-IBD -O /mnt/scratch/usr001/
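
The groups passed to -G have to occur in the chosen metadata column for the chosen sample type. The sketch below is not part of BiG-MAP; the function names are hypothetical. It assumes metadata rows held as dictionaries with a SampleType column, as in the metadata example for BiG-MAP.map.py, and it compares the sample type case-insensitively, which is also an assumption.

```python
from collections import Counter


def group_sizes(rows, sample_type, meta_column):
    """Count samples per value of meta_column for one sample type."""
    wanted = sample_type.upper()
    return Counter(
        row[meta_column]
        for row in rows
        if row["SampleType"].upper() == wanted
    )


def check_groups(rows, sample_type, meta_column, groups):
    """Return the sample count of each requested group.

    Raises ValueError if a group has no samples of the given type.
    """
    sizes = group_sizes(rows, sample_type, meta_column)
    missing = [group for group in groups if sizes.get(group, 0) == 0]
    if missing:
        raise ValueError(f"no {sample_type} samples for: {', '.join(missing)}")
    return {group: sizes[group] for group in groups}
```

For the example command, check_groups(rows, "metagenomic", "DiseaseStatus", ["UC", "non-IBD"]) reports how many samples each group would contribute to the test.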

Requirements

Input data:

antiSMASH v5.0
gutSMASH

Software:

Python 3+
R statistics
fastq-dump
Mash
HMMER
Bowtie2
Samtools
Bedtools
biom
BiG-SCAPE=20191011

Packages:

Python

Biopython
pandas

R

metagenomeSeq
biomformat
ComplexHeatmap=2.0.0
viridisLite
RColorBrewer
tidyverse
