
tdencker / multi-spam


Phylogeny reconstruction based on spaced word matches

License: GNU General Public License v3.0


multi-spam's Introduction

About

Multi-SpaM is a program to infer phylogenies for a set of genomes. It is based on 4-sets of spaced words that are highly likely to represent true homologies. These blocks are evaluated with the Maximum-Likelihood method RAxML and the resulting trees of 4 sequences, or quartet trees, are amalgamated into a supertree (using the Quartet MaxCut tool). Since the number of blocks used is limited, it is suitable even for large datasets.

Additional information can be found in our paper, which was published at RECOMB-CG 2018:

Dencker T., Leimeister CA., Gerth M., Bleidorn C., Snir S., Morgenstern B. (2018) Multi-SpaM: A Maximum-Likelihood Approach to Phylogeny Reconstruction Using Multiple Spaced-Word Matches and Quartet Trees. In: Blanchette M., Ouangraoua A. (eds) Comparative Genomics. RECOMB-CG 2018. Lecture Notes in Computer Science, vol 11183. Springer, Cham

Installation and Usage

Currently, Multi-SpaM is only available for 64-bit Linux distributions (due to a limitation of the Quartet MaxCut tool).

To build Multi-SpaM, run the following in the base directory:

$ make

The program itself is then run via the Python script multispam.py. Example usage:

$ python multispam.py -t <number of threads> -i <input file> -o <output file>

where the input file is a FASTA file containing multiple genomes. The output file will be a tree file in Newick format.
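If the run is part of a larger pipeline, the command above can be assembled and launched from Python. The following is a minimal sketch, not part of Multi-SpaM itself: the function names and the file names `genomes.fasta` and `tree.nwk` are hypothetical, and it assumes that `python` on the PATH is the interpreter the script expects and that the working directory is the base directory of the repository. Only the documented flags `-t`, `-i` and `-o` are used.

```python
import subprocess


def build_multispam_command(input_fasta, output_newick, threads=1,
                            python="python", script="multispam.py"):
    """Assemble the argument list for one Multi-SpaM run.

    Mirrors the documented call:
        python multispam.py -t <threads> -i <input file> -o <output file>
    """
    return [
        python, script,
        "-t", str(threads),
        "-i", str(input_fasta),
        "-o", str(output_newick),
    ]


def run_multispam(input_fasta, output_newick, threads=1):
    """Run Multi-SpaM; raises CalledProcessError on a non-zero exit code."""
    cmd = build_multispam_command(input_fasta, output_newick, threads)
    subprocess.run(cmd, check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(build_multispam_command("genomes.fasta", "tree.nwk", threads=4)))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.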

Options

Option      Description
-i          Input file in FASTA format
-o          Output file in Newick format
-w / -k     Weight of the pattern, i.e. the number of match positions (cannot be larger than 16)
-d          Number of don't care positions, i.e. the number of positions that don't have to match
-t          Number of threads used
-n          Number of sampled blocks
--mem-save  Memory save mode (higher runtime, but much less RAM usage for larger files)

Tips:

  • In general, the default parameters do not need to be changed. Only the number of threads, the input file and the output file need to be specified.
  • If the resulting trees seem unreasonable, you can try lowering the number of don't care positions to 50.
  • For large input files, it is recommended to increase the weight to 12 or even higher.
  • If RAM is limited, you can use the memory save mode. For input files larger than roughly 200 MB, the required RAM will exceed 8 GB. With the memory save mode, the RAM requirement could be reduced to 10.5 GB for a 4.8 GB dataset, at the cost of doubling the runtime.
  • The number of sampled blocks usually does not need to be increased, except possibly for very large datasets.
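The memory tip above can be turned into a simple check before a run. This is only a rule-of-thumb sketch: the function name is hypothetical, the two thresholds are taken from the rough figures in the tip ("larger than 200 MB or so" and 8 GB), and the actual memory use will depend on the data.

```python
# Rough figures from the tips above; both are approximations.
MEM_SAVE_INPUT_BYTES = 200 * 10**6   # about 200 MB of input
MEM_SAVE_RAM_BYTES = 8 * 10**9       # about 8 GB of RAM


def extra_options(input_bytes, available_ram_bytes):
    """Suggest --mem-save when the input is above ~200 MB and the machine
    has no more than ~8 GB of RAM available."""
    options = []
    if (input_bytes > MEM_SAVE_INPUT_BYTES
            and available_ram_bytes <= MEM_SAVE_RAM_BYTES):
        options.append("--mem-save")
    return options


print(extra_options(300 * 10**6, 8 * 10**9))   # ['--mem-save']
print(extra_options(100 * 10**6, 8 * 10**9))   # []
```

The input size can be obtained with `os.path.getsize` on the FASTA file. How much RAM is available has to be supplied by the caller.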

License

Copyright © 2018 - Thomas Dencker
License GPLv3+: GNU GPL version 3 or later.

This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. The full license text is available at http://gnu.org/licenses/gpl.html.

Some files may be licensed differently.

Contact

In case of bugs or unexpected errors, don't hesitate to send me an email: [email protected]


multi-spam's Issues

max-cut-tree

There are a few things odd with max-cut-tree:

  • You checked in a binary, which is odd.
  • The binary is 32 bit and doesn't run on my machine, which is unfortunate.
  • The source is missing, which is sad.

Hope you can help me,
Fabian
