The Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP)

https://github.com/HAugustijn/BiG-MAP2/blob/master/Pipeline_overview.png

This is the GitHub repository for the Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP). More and more tools are becoming available for the analysis of bacterial metagenomic and metatranscriptomic samples, but these tools cannot profile specific metabolic gene clusters (MGCs), which have been shown to be major phenotype drivers. BiG-MAP therefore focuses on finding the representation of MGCs and their related homologs in metagenomic and metatranscriptomic samples. These pathways are readily obtained from (draft) bacterial genomes using antiSMASH or gutSMASH. To process the outputs of these tools into abundance and expression values, BiG-MAP consists of the following four programs:

  • BiG-MAP.download.py
  • BiG-MAP.family.py
  • BiG-MAP.map.py
  • BiG-MAP.analyse.py

For information on how to use these programs, see Overview and example run below.

Installation

Install the BiG-MAP dependencies using conda, which can be installed via Miniconda. First clone the BiG-MAP repository from GitHub:

~$ git clone https://github.com/HAugustijn/BiG-MAP.git

Then install the dependencies from the two environment files, BiG-MAP_process.yml and BiG-MAP_analyse.yml:

# For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py
~$ conda env create -f BiG-MAP_process.yml -n BiG-MAP_process
~$ conda activate BiG-MAP_process

# For BiG-MAP.analyse.py
~$ conda env create -f BiG-MAP_analyse.yml -n BiG-MAP_analyse
~$ conda activate BiG-MAP_analyse

To make use of the second redundancy filtering step, download BiG-SCAPE using:

~$ git clone https://git.wageningenur.nl/medema-group/BiG-SCAPE

All dependencies are now installed and BiG-MAP is ready to use.

Overview and example run

A typical workflow for BiG-MAP consists of the following 4 consecutive steps:

  1. Downloading WGS data using BiG-MAP.download.py
  2. Generating gene cluster families (GCFs) and housekeeping gene families (HGFs) using BiG-MAP.family.py
  3. Computing abundance and expression profiles of selected representatives from each GCF and HGF using BiG-MAP.map.py
  4. Analysing the resulting BIOM file for profiles using BiG-MAP.analyse.py

The four steps are described below, and for each an example is provided.

1) BiG-MAP.download.py

This script downloads metagenomic and/or metatranscriptomic samples from the online NCBI repository. The samples are first downloaded in .SRA format and then converted into .fastq pairs using fastq-dump.

conda activate BiG-MAP_process
python3 BiG-MAP.download.py -h
python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]

To download the samples, go to the SRA Run Selector and enter the study code. For the IBD cohort of Schirmer et al. (2018), that is PRJNA389280. Next, select the accessions and click Accession List to download them. Use this accession file in the following command:

python3 BiG-MAP.download.py -A Acc_list.txt -O /mnt/scratch/usr001/fastq/schirmer/

Acc_list.txt:
SRR5983273
SRR5983265
SRR5983266
SRR5983268
SRR5983270
SRR5983271
SRR5983275
...
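
Because every line of the accession file becomes one download, it can help to check the file before starting. The following is a minimal sketch and not part of BiG-MAP: the function name read_accessions and the accession pattern (SRR, ERR or DRR followed by digits) are assumptions made for illustration.

```python
import re

# Assumed pattern for SRA run accessions: SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_accessions(text):
    """Split the contents of an accession list into (valid, invalid) entries."""
    valid, invalid = [], []
    for line in text.splitlines():
        acc = line.strip()
        if not acc:
            continue  # ignore blank lines
        if RUN_ACCESSION.match(acc):
            valid.append(acc)
        else:
            invalid.append(acc)
    return valid, invalid


if __name__ == "__main__":
    example = "SRR5983273\nSRR5983265\n\nnot_an_accession\n"
    valid, invalid = read_accessions(example)
    print("valid:", valid)
    print("invalid:", invalid)
```

In practice the text would come from reading Acc_list.txt, and any invalid entries would be fixed before running BiG-MAP.download.py.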

2) BiG-MAP.family.py

The main purpose of this script is to compute GCFs and HGFs using sequence similarity as the sole metric. Protein sequences are used for the GCF computation, while DNA sequences are used for the HGF computation. Mash is used to compute both. The input consists of the output directories of antiSMASH or gutSMASH. All options are listed with the -h flag. General usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.family.py -h
python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]

The following example processes a gutSMASH run on 1520 (draft) reference genomes that are present in the gut, with a Mash threshold of 0.1 for GCFs and 0.1 for HGFs, no flanking genes around the core, no genome fasta file outputs, 6 processor cores, and the additional BiG-SCAPE redundancy filtering step:

python3 BiG-MAP.family.py -tg 0.1 -th 0.1 -f 0 -g False -p 6 -D /mnt/scratch/usr001/gutSMASH-output/ -b /mnt/scratch/usr001/BiG-SCAPE_location/ -pf /mnt/scratch/usr001/pfam_files_location/  -O /mnt/scratch/usr001/results/

This yields:
BiG-MAP.GCF_HGF.bed = Bedfile to extract core regions in BiG-MAP.map.py
BiG-MAP.GCF_HGF.fna = Reference file to map the WGS reads to
BiG-MAP.GCF_HGF.json = Dictionary that contains the GCFs and HGFs
BiG-MAP.GCF.json = Dictionary that contains the BiG-SCAPE GCFs
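
Since BiG-MAP.map.py depends on these files, a quick check that they exist can catch problems before mapping starts. This is a minimal sketch, not part of BiG-MAP: the helper missing_outputs is hypothetical, and the file names are taken from the list above, so they may differ for other versions of the tool.

```python
from pathlib import Path
import tempfile

# File names as listed in this README (assumption: your version uses the same names).
EXPECTED_OUTPUTS = [
    "BiG-MAP.GCF_HGF.bed",
    "BiG-MAP.GCF_HGF.fna",
    "BiG-MAP.GCF_HGF.json",
    "BiG-MAP.GCF.json",
]


def missing_outputs(outdir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in outdir."""
    outdir = Path(outdir)
    return [name for name in expected if not (outdir / name).is_file()]


if __name__ == "__main__":
    # Demonstration on a temporary directory containing only the bed file.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "BiG-MAP.GCF_HGF.bed").touch()
        print("missing:", missing_outputs(tmp))
```

An empty list means all four files are in place. A non-empty list suggests either a failed run or a version that names its outputs differently.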

3) BiG-MAP.map.py

This module aligns the WGS reads (paired or unpaired) to the reference representatives of each GCF and HGF using Bowtie2. The following values are computed: RPKM, coverage, and core coverage. The coverage is calculated using Bedtools, and the read counts using Samtools. The general usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.map.py -h
python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} {-R [reference] -F [family] | -P [pickled file]} -O [outdir]  [Options*]
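
To make the RPKM value mentioned above concrete, here is a minimal sketch of the standard RPKM definition (reads per kilobase of reference per million mapped reads). It illustrates the general formula only; how BiG-MAP.map.py computes it internally is not shown in this README, and the function name rpkm is an assumption.

```python
def rpkm(reads_mapped, length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of reference per million mapped reads.

    reads_mapped       -- reads aligned to one gene cluster
    length_bp          -- length of that gene cluster in base pairs
    total_mapped_reads -- all mapped reads in the sample
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length_bp and total_mapped_reads must be positive")
    return reads_mapped * 1e9 / (length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2,000 bp cluster in a sample with 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))  # prints 25.0
```

Normalising by both cluster length and sequencing depth makes values comparable between clusters of different size and between samples of different depth.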

To map 10 reads from Schirmer et al. (2018) to the reference representatives of the GCFs and HGFs, and correct for the BiG-SCAPE GCFs, run:

NOTE: It is important for downstream analysis to also use the -b flag.

python3 BiG-MAP.map.py -f False -s fast -th 10 -b /mnt/scratch/usr001/results/schirmer_metadata.txt -cc /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.bed -R /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.fna -I1 /mnt/scratch/usr001/fastq/schirmer/*pass_1* -I2 /mnt/scratch/usr001/fastq/schirmer/*pass_2* -O /mnt/scratch/usr001/results/ -F /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.json -bf /mnt/scratch/usr001/results/BiG-MAP.GCF.json
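
The -I1 and -I2 globs in the command above only line up if every run has both mate files. The following sketch pairs mates by run name so that incomplete runs stand out. It is not part of BiG-MAP: the helper pair_mates is hypothetical, and the naming scheme (run name, then _pass_1 or _pass_2, then an extension) is an assumption based on the globs used above.

```python
import re
from collections import defaultdict

# Assumed naming: <run>_pass_1.<ext> and <run>_pass_2.<ext>
MATE = re.compile(r"^(?P<run>.+?)_pass_(?P<mate>[12])\.")


def pair_mates(filenames):
    """Group file names by run and return (paired, unpaired_runs)."""
    mates = defaultdict(dict)
    for name in filenames:
        match = MATE.match(name)
        if match:
            mates[match["run"]][match["mate"]] = name
    paired = {run: (f["1"], f["2"]) for run, f in mates.items() if len(f) == 2}
    unpaired = sorted(run for run, f in mates.items() if len(f) != 2)
    return paired, unpaired


if __name__ == "__main__":
    names = [
        "SRR5947852_pass_1.fastq.gz",
        "SRR5947852_pass_2.fastq.gz",
        "SRR5947945_pass_1.fastq.gz",
    ]
    paired, unpaired = pair_mates(names)
    print("paired:", paired)
    print("runs missing a mate:", unpaired)
```

The function expects bare file names, for example those returned by os.listdir on the fastq directory.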

The schirmer_metadata.txt file is set up as follows (tab-delimited):
#run.ID	host.ID	SampleType	DiseaseStatus
SRR5947852	C3001C10_MGX	METAGENOMIC	CD
SRR5947945	C3001C10_MTX	METATRANSCRIPTOMIC	CD
SRR5947826	C3001C5_MGX	METAGENOMIC	CD
SRR5947900	C3001C5_MTX	METATRANSCRIPTOMIC	CD
SRR5947876	C3001C9_MGX	METAGENOMIC	CD
SRR5947934	C3001C9_MTX	METATRANSCRIPTOMIC	CD

Note that the header row must start with '#'.
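
The metadata layout above can be checked against the fastq file names before mapping. The sketch below is not part of BiG-MAP: the helpers read_metadata and files_without_metadata are hypothetical, and the matching rule (a file belongs to a run if its name starts with the run.ID) is an assumption, not necessarily the check BiG-MAP.map.py performs.

```python
def read_metadata(text):
    """Parse tab-delimited metadata whose header row starts with '#'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("the header row must start with '#'")
    header = [h.strip() for h in lines[0].lstrip("#").split("\t")]
    rows = []
    for ln in lines[1:]:
        fields = [f.strip() for f in ln.split("\t")]
        rows.append(dict(zip(header, fields)))
    return rows


def files_without_metadata(fastq_names, rows, id_column="run.ID"):
    """Return fastq names that do not start with any run ID in the metadata.

    Simplification: prefix matching, so one ID that is a prefix of another
    would match both.
    """
    ids = {row[id_column] for row in rows}
    return sorted(n for n in fastq_names if not any(n.startswith(i) for i in ids))


if __name__ == "__main__":
    metadata = (
        "#run.ID\thost.ID\tSampleType\tDiseaseStatus\n"
        "SRR5947852\tC3001C10_MGX\tMETAGENOMIC\tCD\n"
        "SRR5947945\tC3001C10_MTX\tMETATRANSCRIPTOMIC\tCD\n"
    )
    rows = read_metadata(metadata)
    files = ["SRR5947852_pass_1.fastq.gz", "SRR0000001_pass_1.fastq.gz"]
    print(files_without_metadata(files, rows))
```

Any file reported here has no matching row in the metadata file and should either be added to it or left out of the mapping step.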

4) BiG-MAP.analyse.py

This module is a wrapper script for BiG-MAP.norm.R. The R script can also be run locally in RStudio, which is recommended for creating nice visualizations. The main drawback is that this requires a local installation of all the dependencies. The BiG-MAP_analyse environment takes care of them for the command line, but not for local RStudio analyses. The comments in the script explain how to set this up. For example:

Scroll down to the main in BiG-MAP.norm.R
Edit and uncomment:
biom_file <- path/to/biom-file
MT <- condition
sampletype <- "METATRANSCRIPTOMIC" | "METAGENOMIC"
group_1 <- condition_1
group_2 <- condition_2
explore <- TRUE/FALSE

Run all the functions and analyse locally

To run the analysis from the command line (e.g. in an automated analysis), first install all dependencies using the BiG-MAP_analyse.yml file, if not done already. Then it works as follows:

python3 BiG-MAP.analyse.py inspect -h
python3 BiG-MAP.analyse.py inspect -B [biom_file] [options*]

Example:
python3 BiG-MAP.analyse.py inspect -B /mnt/scratch/usr001/BiG-MAP.map.biom -e /mnt/scratch/usr001/ -s metagenomic -m DiseaseStatus

Output: 
which conditions can be analysed
heatmap

To perform statistical testing on the biom file, use:

python3 BiG-MAP.analyse.py test -h
python3 BiG-MAP.analyse.py test -B [biom_file] -T [SampleType] -M [meta_group] -G [[groups]] -O [outdir]

Example:
python3 BiG-MAP.analyse.py test -B /mnt/scratch/usr001/BiG-MAP.map.biom -T metagenomic -M DiseaseStatus -G UC non-IBD -O /mnt/scratch/usr001/

Requirements

Input data:

  • antiSMASH v5.0
  • gutSMASH

Software:

  • Python 3+
  • R statistics
  • fastq-dump
  • Mash
  • HMMer
  • Bowtie2
  • Samtools
  • Bedtools
  • biom
  • BiG-SCAPE=20191011

Packages:

Python

  • BioPython
  • pandas

R

  • metagenomeSeq
  • biomformat
  • ComplexHeatmap=2.0.0
  • viridisLite
  • RColorBrewer
  • tidyverse

big-map's Issues

-R command not recognised

I've been trying to run the .map step in BiG-MAP, but with no luck.
In your code there is the option -R, which is not recognised by the program. I tried the -P pickle file option instead, but that returned the error: The file names are not overlapping with the names in the metadata file. Please provide matching file names.

Looking into the output from BiG-MAP.family, the file names are different from those shown in the BiG-MAP readme:

BiG-MAP.dist_GC.json
BiG-MAP.GCF.bed
Big-MAP.GCF.fna
Big-MAP.GCs.json
BiG-MAP.pickle
mash_output_GC.tab
mash_sketch.msh
