
dib-lab / q2-sourmash


Qiime2 Sourmash Plugin

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.25% Python 97.55% TeX 1.74% Shell 0.46%

q2-sourmash's Introduction

QIIME 2 Sourmash Plugin

This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org. For details on sourmash, see http://sourmash.readthedocs.io/.

Installing the QIIME 2 sourmash plugin

q2-sourmash is a QIIME 2 plugin for sourmash, a tool for quickly and efficiently computing and comparing MinHash signatures of nucleotide sequences. You can find out more about sourmash by reading the paper (Brown and Irber, JOSS 2018) or checking out the sourmash documentation.

You need QIIME 2 version 2018.4 or later. Whichever way you install, you must be inside a QIIME 2 environment for this to work. Install QIIME 2, activate the QIIME 2 virtual environment (e.g. source activate qiime2-2018.8), and then install sourmash by running:

conda install -c bioconda sourmash

You will also need to install q2-types-genomics (unless your environment already has it):

conda install -c conda-forge -c bioconda -c https://packages.qiime2.org/qiime2/2023.5/tested -c defaults \
    q2-types-genomics

To install the plugin, run the following command:

pip install https://github.com/dib-lab/q2-sourmash/archive/master.zip

To check that the installation worked, type qiime on the command line. The sourmash plugin should show up in the list of available plugins.

Using the QIIME 2 sourmash plugin

The QIIME 2 sourmash plugin currently provides two main methods: compute, which calculates MinHash signatures from nucleotide sequences, and compare, which calculates the Jaccard distance between samples.

Computing signatures

The compute method calculates the MinHash signatures for a given set of nucleotide sequences. To run it, simply supply a .qza archive (directory) containing sequence files ending in 'fastq.gz'.

First download a test set of fastq.gz files already in the form of a qza archive and the associated metadata. Here we are using data from the Moving Pictures tutorial:

wget -c -nc https://docs.qiime2.org/2018.4/data/tutorials/moving-pictures/demux.qza
wget -c -nc https://data.qiime2.org/2018.8/tutorials/moving-pictures/sample_metadata.tsv

To calculate sourmash signatures for all sequence files within the archive use the following:

qiime sourmash compute --i-sequence-file demux.qza --p-ksizes 21 --p-scaled 10000 --o-min-hash-signature sigs.qza

The following flags are required:

  • --i-sequence-file : the path to the qza directory
  • --p-ksizes : the k-size of the hash (integer)
  • --p-scaled : the scaled value (integer)
  • --o-min-hash-signature : the output qza file name
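
As a rough illustration of what --p-ksizes and --p-scaled control, the Python sketch below splits a sequence into k-mers and keeps only the hashes that fall in the lowest 1/scaled fraction of a 64-bit hash space. It is not the plugin's or sourmash's code: the helper names are hypothetical, SHA-1 truncated to 64 bits is an assumed stand-in for the hash function sourmash actually uses, and reverse-complement handling is left out, so the values will not match real signatures.

```python
import hashlib


def kmers(sequence, ksize):
    """Yield every overlapping k-mer of length ksize from a sequence."""
    for start in range(len(sequence) - ksize + 1):
        yield sequence[start:start + ksize]


def scaled_sketch(sequence, ksize, scaled):
    """Return the set of k-mer hashes below 2**64 // scaled.

    Assumption: SHA-1 truncated to 64 bits replaces sourmash's real hash
    function, and k-mers are not canonicalised against their reverse
    complement. This only illustrates the idea of a scaled sketch.
    """
    max_hash = 2**64 // scaled
    kept = set()
    for kmer in kmers(sequence.upper(), ksize):
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        if value < max_hash:
            kept.add(value)
    return kept
```

A larger scaled value lowers the cut-off, so on average fewer hashes are kept per sample and the signatures are smaller; a larger k-size makes each k-mer more specific.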

The output archive, in this case sigs.qza, contains one signature file for each input fastq.gz file. The signatures can be inspected with the QIIME 2 online viewer, by unzipping the qza file, or by exporting its contents:

qiime tools export --input-path sigs.qza --output-path sigs
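
Because a qza file is a zip archive, its contents can also be listed without extracting anything. The following is a minimal Python sketch using only the standard library; the function name is hypothetical and the exact file layout inside the archive is not assumed here.

```python
import zipfile


def list_archive_members(archive_path):
    """Return the sorted names of all files stored in a .qza or .qzv archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())


# Example (assumes sigs.qza exists in the working directory):
# for name in list_archive_members("sigs.qza"):
#     print(name)
```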

Comparing signatures

Signatures calculated as above can then be compared using the compare method. This calculates a pairwise Jaccard distance between all samples in the provided qza archive:

qiime sourmash compare --i-min-hash-signature sigs.qza --p-ksize 21 --o-compare-output compare.mat.qza
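
To make the distance concrete, here is a small Python sketch of the Jaccard distance between two k-mer sets, taken as one minus the fraction of shared k-mers. It works on exact k-mer sets rather than MinHash signatures and uses hypothetical helper names, so treat it as an illustration of the measure and not as the plugin's implementation.

```python
def kmer_set(sequence, ksize):
    """Return the set of distinct k-mers of length ksize in a sequence."""
    return {sequence[i:i + ksize] for i in range(len(sequence) - ksize + 1)}


def jaccard_distance(a, b):
    """Return 1 - |intersection| / |union|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


first = kmer_set("ACGTACGT", 3)   # {ACG, CGT, GTA, TAC}
second = kmer_set("ACGTTTTT", 3)  # {ACG, CGT, GTT, TTT}
print(jaccard_distance(first, second))  # 2 shared of 6 distinct k-mers
```

A distance of 0.0 means the two sets are identical and 1.0 means they share nothing, which matches the 0.0 diagonal and the many 1.0 entries seen in distance matrices from this kind of comparison.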

The output, compare.mat.qza, can then be inspected as above by unzipping the qza archive, or passed on to subsequent analyses (e.g. to generate a PCoA plot):

qiime diversity pcoa --i-distance-matrix compare.mat.qza --o-pcoa pcoa.compare.mat.qza
qiime emperor plot --i-pcoa pcoa.compare.mat.qza --o-visualization emperor.qzv --m-metadata-file sample_metadata.tsv

q2-sourmash's People

Contributors

ctb, gregcaporaso, halexand, jiarong, misialq


q2-sourmash's Issues

Flag for sourmash compare to change sample names

Integrating the output DistanceMatrix with downstream QIIME 2 tools (e.g. Emperor) would be easier if the sample names in the DistanceMatrix produced by sourmash compare could be made to match those in the sample metadata table.

For example, in the Moving Pictures tutorial the sample metadata table (sample_metadata.tsv) looks like this:

#SampleID BarcodeSequence LinkerPrimerSequence BodySite Year Month Day Subject ReportedAntibioticUsage DaysSinceExperimentStart Description
#q2:types categorical categorical categorical numeric numeric numeric categorical categorical numeric categorical
L1S8 AGCTGACTAGTC GTGCCAGCMGCCGCGGTAA gut 2008 10 28 subject-1 Yes 0 subject-1.gut.2008-10-28
L1S57 ACACACTATGGC GTGCCAGCMGCCGCGGTAA gut 2009 1 20 subject-1 No 84 subject-1.gut.2009-1-20
L1S76 ACTACGTGTGGT GTGCCAGCMGCCGCGGTAA gut 2009 2 17 subject-1 No 112 subject-1.gut.2009-2-17

But the rows of the distance matrix output by sourmash are labelled with file paths:

/var/folders/r9/86xmxscs2fx26664gnv0wk3h0000gn/T/q2-MinHashSigJsonDirFormat-gkquzqjf/L1S76_12_L001_R1_001.fastq.gz 0.9375 0.8333333333333334 0.8235294117647058 0.8571428571428572 0.8666666666666667 0.6363636363636364 0.0 0.8666666666666667 1.0 1.0 0.9583333333333334 0.9565217391304348 1.0 1.0 0.9333333333333333 1.0 0.9090909090909091 1.0 1.0 0.9333333333333333 1.0 0.8 0.9722222222222222 1.0 0.9722222222222222 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
/var/folders/r9/86xmxscs2fx26664gnv0wk3h0000gn/T/q2-MinHashSigJsonDirFormat-gkquzqjf/L1S8_8_L001_R1_001.fastq.gz 0.9411764705882353 0.9285714285714286 0.8333333333333334 0.8666666666666667 0.9411764705882353 0.8571428571428572 0.8666666666666667 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.9166666666666666 1.0 1.0 1.0 1.0 0.9166666666666666 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
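
One possible workaround, sketched in Python: derive the sample ID from each file path. This assumes the file names follow the pattern seen in the paths above (sample ID, barcode ID, lane, read direction and set number, separated by underscores, then .fastq.gz). The function name is hypothetical and nothing like it is claimed to exist in the plugin.

```python
import posixpath

FASTQ_SUFFIX = ".fastq.gz"


def sample_id_from_path(path):
    """Extract the sample ID from a path such as .../L1S76_12_L001_R1_001.fastq.gz.

    Assumption: the last four underscore-separated fields are barcode ID,
    lane, read direction and set number, so everything before them is the
    sample ID (which may itself contain underscores).
    """
    name = posixpath.basename(path)
    if not name.endswith(FASTQ_SUFFIX):
        raise ValueError(f"not a {FASTQ_SUFFIX} file: {name}")
    stem = name[:-len(FASTQ_SUFFIX)]
    parts = stem.rsplit("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {name}")
    return parts[0]
```

Applied to the two rows above, this would give L1S76 and L1S8, which are the IDs used in sample_metadata.tsv.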
