arthurvm / treemer

A simple tool to generate hierarchical clustering trees from nucleotide sequences. Supports a number of distance metrics and clustering algorithms. Includes a large testset of SARSCOV2 genomes.

License: GNU General Public License v3.0

Python 78.23% C++ 21.77%
hierarchical-clustering-trees distance sarscov2 bioinfo phylogenetic-trees genome-analysis kmer-composition

TreeMer

A simple tool to generate hierarchical clustering trees from nucleotide sequences using kmer spectrum distances. A small test set of SARSCOV2 genomes, downloaded from https://www.nlm.nih.gov/news/coronavirus_genbank.html, is included.

Overview

This tool calculates the distances between a set of nucleotide sequences in FASTA format by digesting each sequence into a kmer count vector (effectively a kmer spectrum). The pairwise distances between all vectors are calculated and clustered to build a hierarchical clustering tree. A number of distance metrics and clustering methods are supported (see Distance and Clustering).
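The steps described in this overview can be sketched in a few lines of Python. This is a minimal illustration, assuming SciPy's `pdist` and `linkage` do the distance and clustering work; the helper `kmer_count_vector` and the toy sequences are hypothetical and are not taken from the TreeMer source.

```python
from itertools import product

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform


def kmer_count_vector(seq, k):
    """Count every k-mer over the ACGT alphabet, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = np.zeros(len(kmers), dtype=float)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with N or other symbols are skipped
        if j is not None:
            vec[j] += 1
    return vec


# Toy sequences (made up for illustration, not from the SARSCOV2 test set).
seqs = {
    "seqA": "ACGTACGTACGTACGT",
    "seqB": "ACGTACGTACGTACGA",
    "seqC": "TTTTGGGGCCCCAAAA",
}

k = 2
names = list(seqs)
vectors = np.array([kmer_count_vector(seqs[n], k) for n in names])

# Condensed pairwise distances, then a linkage matrix describing the tree.
condensed = pdist(vectors, metric="euclidean")
tree = linkage(condensed, method="ward")

print(squareform(condensed).round(2))
print(tree)
```

In this toy case seqA and seqB differ by a single base, so they are the first pair to be merged in the linkage matrix.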

Installation

Installation is straightforward. Simply run:

git clone git@github.com:ArthurVM/TreeMer.git
cd TreeMer
python3 -m pip install -r dependencies.txt

and you are good to go!

Input

TreeMer takes a set of nucleotide sequences in FASTA format and generates kmer count files, structured as:

kmer0 count
kmer1 count
...
kmern count

in tab-separated format (denoting the kmer spectrum of the sequence). These kmer spectra are used to calculate pairwise distances, from which a hierarchical clustering tree is generated.
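As a rough illustration of the file layout above, the following sketch counts kmers in a sequence and round-trips them through the tab-separated format. The function names, the sort order of the lines, and the use of an in-memory buffer are assumptions made for the example; the real files are produced by genKmerCount, whose implementation is not shown here.

```python
import csv
import io
from collections import Counter


def count_kmers(seq, k):
    """Return a Counter of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def write_spectrum(counts, handle):
    """Write one 'kmer<TAB>count' line per k-mer, sorted by k-mer."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for kmer in sorted(counts):
        writer.writerow([kmer, counts[kmer]])


def read_spectrum(handle):
    """Parse 'kmer<TAB>count' lines back into a dict of int counts."""
    return {kmer: int(count) for kmer, count in csv.reader(handle, delimiter="\t")}


counts = count_kmers("ACGTACGT", 3)
buffer = io.StringIO()  # stands in for a kmer count file on disk
write_spectrum(counts, buffer)
print(buffer.getvalue())

buffer.seek(0)
spectrum = read_spectrum(buffer)
```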

Output

TreeMer outputs the following files:

HC_dendro.png     - The hierarchical clustering dendrogram in .png format.
HC_tree.nwk       - A text file containing the hierarchical clustering tree in Newick format. 
heatmap.png       - The heatmap of sequence distances in .png format.
heatmap.{D}.tsv   - A heatmap file in .tsv format. {D} is the distance metric used. 
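For readers unfamiliar with Newick output, here is one way a SciPy linkage matrix could be serialised to that format. It is a sketch only: the `to_newick` helper is hypothetical, and the branch-length convention (difference in merge height between a node and its parent) is an assumption that may not match what TreeMer writes to HC_tree.nwk.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def to_newick(node, labels, parent_dist):
    """Recursively convert a SciPy ClusterNode into a Newick substring."""
    branch = parent_dist - node.dist  # assumed convention: height difference
    if node.is_leaf():
        return f"{labels[node.id]}:{branch:.2f}"
    left = to_newick(node.get_left(), labels, node.dist)
    right = to_newick(node.get_right(), labels, node.dist)
    return f"({left},{right}):{branch:.2f}"


# Three toy observations on a line (made-up values).
points = np.array([[0.0], [1.0], [5.0]])
labels = ["A", "B", "C"]

root = to_tree(linkage(points, method="single", metric="euclidean"))
newick = to_newick(root, labels, root.dist) + ";"
print(newick)
```

The resulting string can be opened in any tree viewer that accepts Newick input.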

Usage

usage: TreeMer.py [-h] [-i I I] [-k K] [-m M] [-s]
                  [-d {distance metric}]
                  [-c {clustering method}]
                  [-g G]
                  [fa_files [fa_files ...]]

positional arguments:
  fa_files              An arbitrary number of sequence files in FASTA format.

optional arguments:
  -h, --help            show this help message and exit
  -i I I                Lower and upper bound percentiles to construct the
                        tree. E.g. 25 75 will generate a tree from kmers from
                        the 25th to the 75th percentiles in the total set of
                        kmers ordered by count.
  -k K                  Kmer size to use in constructing genome comparison.
                        Default=7.
  -m M                  The maximum count to return a kmer, e.g. return only
                        kmers with count <=10 if m=10. Default=return ALL.
  -s                    Suppress the generation of kmer-spectra from sequence
                        files. This assumes that all positional arguments
                        provided to this tool are already kmer-spectra files
                        generated by genKmerCount. Default=False.
  -d {euclidean,minkowski,cityblock,sqeuclidean,hamming,jaccard,chebyshev,canberra,braycurtis,yule}
                        Metric used in calculating distance between kmer
                        spectra. Default=euclidean.
  -c {ward,single,complete,average,weighted,centroid,median}
                        Clustering method utilised to build the tree.
                        Default=ward.
  -g G                  A tab separated text file containing geographic
                        locations for each sequence, with the sequence ID in
                        col0 and geolocation in col1. Default=False.
  -v                    Verbose output mode. Default=False.
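The -i and -m options both restrict which kmers contribute to the tree. The sketch below shows one plausible reading of them, assuming the percentile band is computed over the counts in a single spectrum and that both bounds are inclusive. The function `filter_spectrum` is hypothetical, and TreeMer's own filtering may differ in detail.

```python
import numpy as np


def filter_spectrum(spectrum, lower=0, upper=100, max_count=None):
    """Keep k-mers whose count lies within the given percentile band.

    Assumed interpretation: the band is computed over the counts of the
    k-mers in this spectrum, and both bounds are inclusive.
    """
    counts = np.array(list(spectrum.values()))
    lo, hi = np.percentile(counts, [lower, upper])
    kept = {k: c for k, c in spectrum.items() if lo <= c <= hi}
    if max_count is not None:
        # Mirrors the described -m behaviour: drop k-mers with count > max_count.
        kept = {k: c for k, c in kept.items() if c <= max_count}
    return kept


# Toy spectrum (made-up counts).
spectrum = {"AAA": 1, "AAC": 2, "ACG": 3, "CGT": 4, "GTA": 100}

middle = filter_spectrum(spectrum, 25, 75)
capped = filter_spectrum(spectrum, max_count=10)
print(middle)
print(capped)
```

Trimming both tails in this way removes very rare kmers, which may be noise, and very abundant ones, which can dominate count-based distances.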

Example Using SARSCOV2 Dataset

A dataset of complete SARSCOV2 genomes is provided with this tool in the /TreeMer/SARSCOV2/SARSCOV2_WGS directory. The geolocation of each isolate is given in /TreeMer/SARSCOV2/geolocs.tsv.

The entire pipeline can be run using a single command from the TreeMer root directory:

python3 TreeMer.py SARSCOV2/SARSCOV2_WGS/* -k 7 -i 10 90 -d euclidean -c ward -g SARSCOV2/geolocs.tsv

In this instance, we strip out the 10% least and most frequent kmers, calculate the Euclidean distance between the 7mer frequency vectors, and cluster using Ward's method. The resulting tree is:

[Figure: Euclidean HC Ward clustering dendrogram SARSCOV2]

Distance and Clustering

A number of distance metrics and clustering methods are supported by this tool.

Distance Metrics

  • Euclidean
  • Minkowski
  • Cityblock
  • Sqeuclidean
  • Hamming
  • Jaccard
  • Chebyshev
  • Canberra
  • Braycurtis
  • Yule
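To give a feel for how the metrics differ, the sketch below evaluates several of them on two toy count vectors. It assumes the metric names map directly onto `scipy.spatial.distance.pdist`, which the README does not state explicitly; the vectors are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two toy k-mer count vectors (values made up for illustration).
vectors = np.array([
    [4.0, 0.0, 2.0],
    [1.0, 3.0, 2.0],
])

metrics = ("euclidean", "sqeuclidean", "cityblock", "chebyshev", "canberra", "braycurtis")

# pdist returns one value per pair of rows; with two rows there is one pair.
distances = {m: float(pdist(vectors, metric=m)[0]) for m in metrics}

for name, value in distances.items():
    print(f"{name:12s} {value:.4f}")
```

Hamming, Jaccard and Yule are designed for categorical or boolean vectors, so on raw counts they respond to whether entries differ or are nonzero rather than to the size of the difference. It may be worth checking that this is the intended behaviour before choosing one of them.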

Clustering Methods

  • Ward
  • Single
  • Complete
  • Average
  • Weighted
  • Centroid
  • Median
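One way to compare clustering methods on the same data is the cophenetic correlation, which measures how well a tree preserves the original pairwise distances. The sketch below assumes the method names correspond to `scipy.cluster.hierarchy.linkage`; the data are toy values, and this comparison is not part of TreeMer itself.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Toy observations: two tight pairs that sit far apart (made-up values).
vectors = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 0.0],
    [10.0, 1.0],
])

condensed = pdist(vectors, metric="euclidean")

scores = {}
for method in ("ward", "single", "complete", "average", "weighted", "centroid", "median"):
    tree = linkage(condensed, method=method)
    # Cophenetic correlation: how faithfully the tree preserves the input distances.
    scores[method], _ = cophenet(tree, condensed)
    print(f"{method:9s} {scores[method]:.3f}")
```

According to the SciPy documentation, the ward, centroid and median methods are only correctly defined for Euclidean distances, so combining them with another metric should be done with care.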

Dependencies

python3
argparse
scipy
numpy
matplotlib
seaborn
