biocore / horizomer

Workflow for detecting genome-wide horizontal gene transfers

License: BSD 3-Clause "New" or "Revised" License

Python 82.41% Shell 12.29% Perl 5.09% Makefile 0.20%

horizomer's People

Contributors

antgonza, ekopylova, iiiime, josenavas, mortonjt, qiyunzhu, rnaer, sjanssen2, wasade

horizomer's Issues

distance method code optimizations

  1. normalize_distances(): Create the full placeholder (filled with nan) before reading the PHYLIP MSAs, and update only the distance_vector values that are present in the file.
  2. cluster_distances(): The loop "for bitvector in sorted_species_set" runs three times. Can any of these loops be merged to speed up the code?
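
The first suggestion can be sketched as follows. This is only an illustration of the placeholder idea, not the project's normalize_distances(): the function name fill_distance_vector, the pair-keyed dictionary and the species list are all hypothetical, and plain lists are used instead of arrays to keep the sketch dependency-free.

```python
import math
from itertools import combinations


def fill_distance_vector(species, observed):
    """Return one distance per species pair, NaN where no value was read.

    species  -- full list of species IDs (hypothetical input)
    observed -- dict mapping frozenset({a, b}) to a distance parsed from a file
    """
    pairs = [frozenset(p) for p in combinations(sorted(species), 2)]
    # Build the complete NaN placeholder once, before looking at any file.
    index = {pair: i for i, pair in enumerate(pairs)}
    vector = [math.nan] * len(pairs)
    # Overwrite only the entries that are actually present in the file.
    for pair, dist in observed.items():
        if pair in index:
            vector[index[pair]] = dist
    return pairs, vector


pairs, vector = fill_distance_vector(
    ["A", "B", "C"], {frozenset({"A", "C"}): 0.4})
```

Pairs that never appear in the input keep their nan value, so no second pass is needed to pad missing entries.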

parse HGTs from DarkHorse output

Files to modify:

parse_output.py
run_darkhorse.sh

At the moment launch_software.sh calls run_darkhorse.sh to execute a DIAMOND search (if alignments don't exist) and to run DarkHorse to detect HGTs. The resulting summary file from DarkHorse needs to be parsed to output:

  1. putative HGTs based on user-defined LPI score bounds
  2. candidate reference genome IDs (based on LPI score bounds) for shearing the complete species tree
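
A rough sketch of such a parser is given below. The column names query_id, lpi and genome_id are hypothetical and do not reflect the real DarkHorse summary header, which has to be checked before adapting this. The sketch also uses a single pair of bounds for both outputs, which is a simplification.

```python
import csv
import io


def parse_lpi_summary(handle, lpi_lower, lpi_upper):
    """Split rows of a tab-delimited summary by LPI score bounds.

    Assumes hypothetical columns 'query_id', 'lpi' and 'genome_id';
    adapt the names to the real DarkHorse summary header.
    """
    hgts, genomes = [], set()
    for row in csv.DictReader(handle, delimiter="\t"):
        lpi = float(row["lpi"])
        if lpi_lower <= lpi <= lpi_upper:
            hgts.append(row["query_id"])      # putative HGT
            genomes.add(row["genome_id"])     # candidate reference genome
    return hgts, sorted(genomes)


# Made-up rows, only to show the call.
example = io.StringIO(
    "query_id\tlpi\tgenome_id\n"
    "gene1\t0.95\tG1\n"
    "gene2\t0.45\tG2\n"
    "gene3\t0.30\tG2\n"
)
hgts, genomes = parse_lpi_summary(example, 0.2, 0.6)
```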

Mac's readlink compatibility

Mac users should do:

brew install coreutils

Then do:

export PATH=/usr/local/opt/coreutils/libexec/gnubin:$PATH

to get readlink to work properly.

script for preprocessing species & gene trees + BLAST alignments

Write a script, launch_benchmark.sh, that preprocesses the input query genome:

  1. run BLASTP search vs. nr
  2. compute typical/atypical HMM model
  3. compute species tree for candidate related species, generate orthologous gene families, compute gene trees, distance measure and MSAs (using Phylomizer)

(WIP) Installation guide for Debian / Ubuntu users

I am testing Horizomer under Ubuntu 16.04. This document records how a user can take advantage of Ubuntu's repository (using sudo apt install ...) to install most of the programs.

Python 3 modules:

sudo apt install python3-skbio python3-biopython python3-click

This includes multiple essential Python modules, such as matplotlib, numpy, scipy, pandas, and nose.
Unfortunately, scikit-bio 0.5.1 or above, which Horizomer requires, is only available from Ubuntu 17.04 onward.

Python 2 modules:
Note: Python 2 is required for OrthoFinder and PhyloPhlAn

sudo apt install python-matplotlib python-numpy python-scipy python-dendropy

Python modules for test environment:

sudo apt install flake8

Perl modules
Note: Perl is required for DarkHorse.

sudo apt install bioperl
sudo apt install cpanminus
cpanm DBI DBD::mysql

Basic toolkits for building programs:

sudo apt install build-essential cmake mysql-server

Applications

sudo apt install fasttree kalign mafft mcl muscle ncbi-blast+ phyml prodigal raxml t-coffee
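
After installation, a quick check that the programs are reachable on PATH might look like the sketch below. The helper missing_programs is hypothetical, and the executable names are assumptions: they can differ from the package names above.

```python
import shutil


def missing_programs(names, which=shutil.which):
    """Return the names for which no executable is found on PATH."""
    return [name for name in names if which(name) is None]


# Executable names are assumptions; they may differ from the package names.
wanted = ["mafft", "muscle", "prodigal"]
fake_path = {"mafft": "/usr/bin/mafft"}   # stand-in for a real PATH lookup
print(missing_programs(wanted, which=fake_path.get))
```

Calling missing_programs(wanted) without the which argument performs the real lookup through shutil.which.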

script for swapping genes

Write a script for combining / spiking genes into artificial and genuine genomes to simulate HGTs.
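
One possible starting point, as a sketch only: genes are represented here as (gene_id, sequence) tuples and inserted at random positions. The function name spike_genes and this data format are hypothetical, not an agreed design.

```python
import random


def spike_genes(recipient, donor, n, seed=None):
    """Insert n randomly chosen donor genes at random positions in recipient.

    recipient, donor -- lists of (gene_id, sequence) tuples (hypothetical format)
    Returns the new gene list and the IDs of the inserted genes.
    """
    rng = random.Random(seed)
    picked = rng.sample(donor, n)
    genome = list(recipient)              # leave the input untouched
    for gene in picked:
        genome.insert(rng.randint(0, len(genome)), gene)
    return genome, [gene_id for gene_id, _ in picked]


recipient = [("r1", "ATG"), ("r2", "ATGC"), ("r3", "ATGA")]
donor = [("d1", "GTG"), ("d2", "GTGC")]
genome, spiked = spike_genes(recipient, donor, n=1, seed=42)
```

Returning the IDs of the spiked genes gives a ground truth against which predicted HGTs can later be compared.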

Merge putative HGTs into summary

At the moment each launch_TOOL.sh script runs the specified software on a complete genome to generate putative HGTs. We need to write parse_output.py, which will merge the results of all these tools into one summary table.
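
The merge step could look roughly like the sketch below, assuming each tool's output has already been reduced to a set of gene IDs. The function merge_predictions, the table layout and the tool name "OtherTool" are hypothetical.

```python
def merge_predictions(results):
    """Build a presence/absence table of putative HGTs across tools.

    results -- dict mapping tool name to a set of gene IDs it flagged
    Returns a header row followed by one row per gene.
    """
    tools = sorted(results)
    genes = sorted(set().union(*results.values()))
    table = [["gene"] + tools + ["n_tools"]]
    for gene in genes:
        hits = [int(gene in results[tool]) for tool in tools]
        table.append([gene] + hits + [sum(hits)])
    return table


results = {"DarkHorse": {"g1", "g2"}, "OtherTool": {"g2", "g3"}}
for row in merge_predictions(results):
    print("\t".join(map(str, row)))
```

The n_tools column counts how many tools flagged each gene, which makes it easy to rank candidates by agreement.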

Need to update GenBank parser once scikit-bio is updated

The GenBank format generated by the current version of scikit-bio is not compatible with some programs or libraries (e.g., BioPerl). This issue is expected to be fixed by the scikit-bio team. Once that's done, we need to update this program.

(WIP) Re-organize the codebase

Plan for Re-organizing the WGS-HGT codebase

Rationale

The original WGS-HGT was a workflow to benchmark multiple currently available HGT detection tools. Since then, the goal of the project has evolved and expanded considerably. Meanwhile, some coding techniques and standards have been updated too. Thus I am planning to re-organize the codebase to bring it up to date.

Here we define WGS-HGT as a loose repository that hosts all relevant codes under the larger framework of the "Web of Life" project. Codes live here until they are migrated to more suitable repositories.

Naming

The phrase "WGS-HGT" isn't easy to pronounce, and doesn't precisely describe the current plan of the whole project. People have suggested two candidates:

  1. "weboflife", the same as the project name and the least confusing, but it looks a bit awkward when lowercased and merged.
  2. "horizomer", which indicates that the complete set of horizontally acquired genes in a genome should be called a "horizome", and the goal of the software package is to identify them.

What do you think? Any new ideas are welcome!

Structure

The codebase shall be divided into the following second-level directories under wgshgt:

  • wrapper: Codes for running third-party programs, reformatting inputs and parsing outputs.
    • One program occupies one subdirectory, for modularity. The directory should contain one Python script that provides a programming interface for interacting with the program, and Bash scripts if necessary.
    • Codes that automate the installation of the programs should be included too, but their actual content can be migrated to conda recipes, leaving only the interface.
  • data: Codes for constructing or retrieving data (e.g., random gene shuffler, genome evolution simulator, genome downloader), actual datasets (if small), and descriptions of large, external test datasets (if not automatically retrievable).
  • reference: Codes for building reference databases, including genome pool, gene family pool, species tree, gene tree, etc. Or just descriptions of reference databases.
    • If the scripts have to call external programs to fulfill their function, they should call wrappers rather than launching programs by themselves, unless the programs are very generic (e.g., a GNU tool).
  • predict: Codes for inferring HGT and other evolutionary events on individual input genomes. These are for end users to analyze their own datasets.
  • render: Codes for visualizing trees, networks and other forms of display items.
  • benchmark: Codes for performing benchmark of HGT-prediction methods and other tools.
  • misc: Codes that cannot fit into existing categories, or codes that have not been sufficiently engineered to live in other directories.

Each directory may contain a tests directory to host unit test scripts. Each tests directory may contain a data directory to store small data files for unit tests. The unit test codes may also access datasets in the first-level data directory.

Because individual steps for predicting, rendering and benchmarking may have to be executed in different work environments, most scripts should have a command-line interface (via click).

Please share with people your valuable thoughts. Thank you!

@ekopylova @wasade @RNAer @mortonjt @sjanssen2 @antgonza @tkosciol
