srisvs33 / camitax

This project forked from cami-challenge/camitax

CAMITAX: Taxon labels for microbial genomes

Home Page: https://doi.org/10.1101/532473

License: Apache License 2.0

Nextflow 57.86% Python 42.14%

The CAMITAX taxonomic assignment workflow. CAMITAX assigns one NCBI Taxonomy ID (taxID) to an input genome G by combining genome distance-, 16S rRNA gene-, and gene homology-based taxonomic assignments with phylogenetic placement.

(A) Genome distance-based assignment. CAMITAX uses Mash to estimate the average nucleotide identity (ANI) between G and more than a hundred thousand microbial genomes in RefSeq, and assigns the lowest common ancestor (LCA) of genomes showing >95% ANI, which was found to be a clear species boundary.

(B) 16S rRNA gene-based assignment. CAMITAX uses Dada2 to label G's 16S rRNA gene sequences against the SILVA or RDP database. It uses the naïve Bayesian classifier method to assign taxonomy across multiple ranks (down to genus level) and exact sequence matching for species-level assignments.

(C) Gene homology-based assignments. CAMITAX uses Centrifuge and Kaiju to perform gene homology searches against nucleotide and amino acid sequences in NCBI's nt and nr (or proGenomes' genes and proteins datasets), respectively. CAMITAX determines the interval-union LCA (iuLCA) of gene-level assignments and places G on the lowest taxonomic node with at least 50% coverage.

(D) Phylogenetic placement. CAMITAX uses Pplacer to place G onto a fixed reference tree, as implemented in CheckM, and estimates genome completeness and contamination using lineage-specific marker genes.

(E) Classification algorithm. CAMITAX considers the lowest consistent assignment as the longest unambiguous root-to-node path in the taxonomic tree spanned by the five taxIDs derived in (A)–(D), i.e. it retains the most specific, yet consistent taxonomic label among all tools.
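To make the difference between a plain LCA, as in (A), and the lowest consistent assignment, as in (E), more concrete, here is a minimal Python sketch. It is not CAMITAX's actual implementation: it assumes each tool's taxID has already been expanded into a root-to-node lineage, and the function names (`lca`, `lowest_consistent`) and the toy lineages, written as taxon names rather than taxIDs, are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set

Lineage = Sequence[str]  # root-to-node path, e.g. ["root", "Bacteria", ...]


def lca(lineages: Sequence[Lineage]) -> Optional[str]:
    """Lowest common ancestor: the deepest node shared by *all* lineages."""
    shared = None
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        shared = nodes[0]
    return shared


def lowest_consistent(lineages: Sequence[Lineage]) -> Optional[str]:
    """Deepest node reachable from the root before the spanned tree forks."""
    lineages = [l for l in lineages if l]  # skip tools without an assignment
    if not lineages:
        return None
    if len({l[0] for l in lineages}) != 1:
        raise ValueError("all lineages must start at the same root")
    # Build the tree spanned by the lineages as a parent -> children map.
    children: Dict[str, Set[str]] = {}
    for lineage in lineages:
        for parent, child in zip(lineage, lineage[1:]):
            children.setdefault(parent, set()).add(child)
    # Follow the path while it is unambiguous (exactly one child).
    node = lineages[0][0]
    while len(children.get(node, ())) == 1:
        (node,) = children[node]
    return node


# Toy lineages (hypothetical tool outputs, for illustration only).
ECOLI: List[str] = [
    "root", "Bacteria", "Proteobacteria", "Gammaproteobacteria",
    "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli",
]
agreeing = [ECOLI, ECOLI[:7], ECOLI[:6], ECOLI[:7], ECOLI[:4]]
conflicting = agreeing[:3] + [ECOLI[:6] + ["Shigella"], ECOLI[:4]]

print(lowest_consistent(agreeing))     # Escherichia coli
print(lca(agreeing))                   # Gammaproteobacteria
print(lowest_consistent(conflicting))  # Enterobacteriaceae
```

When all five assignments lie on one path, the most specific one wins even though the plain LCA would stop at the least specific. As soon as two tools disagree below some node, the sketch stops at that node.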

Requirements

All you need is Nextflow and Docker (or Singularity). This is the recommended way to run CAMITAX. By default, CAMITAX requires 8 CPU cores and 24 GB of memory.
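As a rough pre-flight check, the following Python sketch tests whether the required tools are on your PATH and whether the machine has enough CPU cores. The function name `preflight` is hypothetical and not part of CAMITAX. The memory requirement is not checked, because the Python standard library offers no portable way to query it.

```python
import os
import shutil
from typing import Callable, List, Optional

MIN_CPUS = 8  # default CPU requirement stated above


def preflight(which: Callable[[str], Optional[str]] = shutil.which,
              cpus: Optional[int] = None) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if which("nextflow") is None:
        problems.append("nextflow not found on PATH")
    if which("docker") is None and which("singularity") is None:
        problems.append("neither docker nor singularity found on PATH")
    cpus = os.cpu_count() if cpus is None else cpus
    if cpus is not None and cpus < MIN_CPUS:
        problems.append(f"only {cpus} CPU cores available, {MIN_CPUS} required by default")
    return problems


for problem in preflight():
    print("WARNING:", problem)
```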

Plan B: You may run CAMITAX without software containers. However, this is not recommended, and you have to install all software dependencies yourself.

If you need any help or further guidance: Please get in touch!

User Guide

Installation

CAMITAX relies on multiple reference databases (which we do not bundle by default, due to their sheer size). You can either build them from scratch or simply use the latest of our "official" releases. To do so, please run:

nextflow pull CAMI-challenge/CAMITAX
nextflow run CAMI-challenge/CAMITAX/init.nf --db /path/to/db/folder

Warning: This will download ~30 GB of data, so expect it to run for a while! /path/to/db/folder should have >100 GB of available disk space. Note that you only have to do this once; specify the same location in all future CAMITAX runs.
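If you script the setup, you may want to verify the free disk space before starting the download. The sketch below wraps the init command shown above. The helper names (`free_gb`, `init_command`, `init_databases`) are hypothetical, and the 100 GB default simply mirrors the recommendation in the warning.

```python
import shutil
import subprocess
from typing import List


def free_gb(path: str) -> float:
    """Free space (in GB, 10**9 bytes) on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def init_command(db_dir: str) -> List[str]:
    """The database initialisation command from this guide, as an argument list."""
    return ["nextflow", "run", "CAMI-challenge/CAMITAX/init.nf", "--db", db_dir]


def init_databases(db_dir: str, min_free_gb: float = 100.0) -> None:
    """Refuse to start the download if the target has too little free space."""
    available = free_gb(db_dir)  # db_dir must already exist
    if available < min_free_gb:
        raise RuntimeError(
            f"{db_dir} has {available:.0f} GB free, need at least {min_free_gb:.0f} GB"
        )
    subprocess.run(init_command(db_dir), check=True)
```

Calling `init_databases("/path/to/db/folder")` would then either raise an error early or run the download.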

Warning: To foster reproducibility, we strongly recommend that you use our "official" releases and we will continue to provide stable and versioned updates in the future.

Input

CAMITAX expects all input genomes in (genomic/nucleotide multi-)FASTA format. If your input genomes are in the folder input/ with file extension .fasta, please run:

nextflow run CAMI-challenge/CAMITAX -profile docker --db /path/to/db/folder --i input --x fasta

If you want to use Singularity instead of Docker (without sudo), please replace docker with singularity as profile.
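For batch use, the same invocation can be assembled in Python. This is only a sketch: `run_command` and `input_genomes` are hypothetical helper names, and the flags are exactly those of the command above.

```python
from pathlib import Path
from typing import List

PROFILES = ("docker", "singularity")  # the two profiles mentioned in this guide


def input_genomes(input_dir: str, extension: str = "fasta") -> List[Path]:
    """List the files CAMITAX would be pointed at (useful as a sanity check)."""
    return sorted(Path(input_dir).glob(f"*.{extension}"))


def run_command(db_dir: str, input_dir: str,
                extension: str = "fasta", profile: str = "docker") -> List[str]:
    """Build the CAMITAX command line as an argument list."""
    if profile not in PROFILES:
        raise ValueError(f"profile must be one of {PROFILES}")
    return [
        "nextflow", "run", "CAMI-challenge/CAMITAX",
        "-profile", profile,
        "--db", db_dir,
        "--i", input_dir,
        "--x", extension,
    ]


print(" ".join(run_command("/path/to/db/folder", "input", profile="singularity")))
```

The resulting list can be passed to `subprocess.run(..., check=True)` once `input_genomes` confirms that the input folder is not empty.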

Output

CAMITAX writes a tab-separated file, camitax.tsv, to the data folder. It contains the individual taxon assignments.

Warning: While CAMITAX is built around computational reproducibility, results might sometimes differ slightly from run to run (but be of comparable quality) because some of the software used within (e.g. CheckM) is non-deterministic.
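To see how much two runs actually differ, you can load both camitax.tsv files and compare them row by row. The sketch below makes two assumptions that are not documented here: that the file has a header line, and that its first column identifies the genome. The function names `load_assignments` and `changed_rows` are hypothetical.

```python
import csv
from typing import Dict, List


def load_assignments(path: str) -> Dict[str, Dict[str, str]]:
    """Read a tab-separated file into {first-column value: row}."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if not reader.fieldnames:
            return {}
        key = reader.fieldnames[0]  # assumption: first column names the genome
        return {row[key]: row for row in reader}


def changed_rows(run_a: Dict[str, Dict[str, str]],
                 run_b: Dict[str, Dict[str, str]]) -> List[str]:
    """Genomes whose rows differ between two runs (or appear in only one)."""
    return sorted(
        genome for genome in run_a.keys() | run_b.keys()
        if run_a.get(genome) != run_b.get(genome)
    )
```

For example, `changed_rows(load_assignments("run1/camitax.tsv"), load_assignments("run2/camitax.tsv"))` lists the genomes worth a closer look; the two paths are placeholders.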

Again, if you need any help or further guidance: Please get in touch!

Citation

  • Bremges, Fritz & McHardy (2020). CAMITAX: Taxon labels for microbial genomes. GigaScience, 9, 1:1–7. doi:10.1093/gigascience/giz154
  • Sczyrba, Hofmann, Belmann, et al. (2017). Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software. Nature Methods, 14, 11:1063–1071. doi:10.1038/nmeth.4458
