biocore / horizomer

Workflow for detecting genome-wide horizontal gene transfers

License: BSD 3-Clause "New" or "Revised" License

Python 82.41% Shell 12.29% Perl 5.09% Makefile 0.20%

horizomer's People

ekopylova antgonza carlyboyd qiyunzhu sjanssen2 josenavas alenzhao iiiime giacomomutti

horizomer's Issues

launch_software.sh readlink & mkdir

Edit the lines to the following:

working_dir=$(readlink -m "$1")
scripts_dir=$(readlink -m "$2")

and

mkdir -p "${working_dir}"

distance method code optimizations

normalize_distances(): Create the full placeholder (filled with NaN) before reading the PHYLIP MSAs, and update only the distance_vector values that are present in the file.
cluster_distances(): The loop "for bitvector in sorted_species_set" runs three times. Can any of these loops be merged to speed up the code?
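
The placeholder idea for normalize_distances() could be sketched roughly as follows. This is only an illustration, not the project's code: the function names make_placeholder and fill_distances, the dict-based layout, and the example values are all assumptions.

```python
import math
from itertools import combinations


def make_placeholder(species):
    """Map every unordered species pair to NaN (hypothetical helper)."""
    return {pair: math.nan for pair in combinations(sorted(species), 2)}


def fill_distances(placeholder, observed):
    """Copy the placeholder and overwrite only the pairs found in `observed`."""
    vector = dict(placeholder)
    for (a, b), dist in observed.items():
        # Normalize the pair order so ("B", "A") and ("A", "B") share a key.
        key = (a, b) if a < b else (b, a)
        if key in vector:
            vector[key] = dist
    return vector


species = ["A", "B", "C"]
placeholder = make_placeholder(species)  # built once, reused for every file
observed = {("B", "A"): 0.12}            # made-up distances from one file
vector = fill_distances(placeholder, observed)
```

Pairs missing from a file simply keep their NaN value, so no per-file bookkeeping of absent species is needed.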

Conda recipe for Ranger-DTL

Add to https://github.com/biocore/conda-recipes

Add GenBank output for simulated HGTs

Required as input to EGID

parse HGTs from DarkHorse output

Files to modify:

parse_output.py
run_darkhorse.sh

At the moment launch_software.sh will call run_darkhorse.sh to execute a DIAMOND search (if alignments don't exist) and run DarkHorse to detect HGTs. The resulting summary file from DarkHorse needs to be parsed to output:

putative HGTs based on user-defined LPI score bounds
candidate reference genome IDs (based on LPI score bounds) for shearing the complete species tree
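
A minimal sketch of the LPI filtering step is shown below. It assumes a tab-delimited table with a header row; the column names query_id, lpi and genome_id and the function select_by_lpi are hypothetical and do not reflect the actual DarkHorse summary layout, which would have to be checked against real output.

```python
import csv
import io


def select_by_lpi(handle, lower, upper,
                  id_col="query_id", lpi_col="lpi", genome_col="genome_id"):
    """Return (query IDs, sorted genome IDs) for rows with lower <= LPI <= upper.

    The column names are hypothetical defaults, not DarkHorse's real headers.
    """
    queries, genomes = [], set()
    for row in csv.DictReader(handle, delimiter="\t"):
        lpi = float(row[lpi_col])
        if lower <= lpi <= upper:
            queries.append(row[id_col])
            genomes.add(row[genome_col])
    return queries, sorted(genomes)


# Made-up example table; real DarkHorse output will look different.
example = ("query_id\tlpi\tgenome_id\n"
           "g1\t0.95\tG_a\n"
           "g2\t0.40\tG_b\n"
           "g3\t0.55\tG_c\n")
hgts, genomes = select_by_lpi(io.StringIO(example), lower=0.2, upper=0.6)
```

If the bounds for putative HGTs and for candidate reference genomes differ, the function can be called twice with different bounds.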

Conda recipe for Jane4

Add to https://github.com/biocore/conda-recipes

add compositional tools

SIGI-HMM, GeneMark, AlienHunter, NearHGT

Mac's readlink compatibility

Mac users should do:

brew install coreutils

Then do:

export PATH=/usr/local/opt/coreutils/libexec/gnubin:$PATH

so that readlink -m works properly.

Conda recipe for MUSCLE

source: http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz

add to: https://github.com/biocore/conda-recipes

script for preprocessing species & gene trees + BLAST alignments

Write a script, launch_benchmark.sh, that preprocesses the input query genome:

run BLASTP search vs. nr
compute typical/atypical HMM model
compute species tree for candidate related species, generate orthologous gene families, compute gene trees, distance measure and MSAs (using Phylomizer)

Conda recipe for KALIGN

source: http://msa.sbc.su.se/downloads/kalign/

add to https://github.com/biocore/conda-recipes

(WIP) Installation guide for Debian / Ubuntu users

I am testing Horizomer under Ubuntu 16.04. This document records how a user can take advantage of Ubuntu's repository (using sudo apt install ...) to install most of the programs.

Python 3 modules:

sudo apt install python3-skbio python3-biopython python3-click

This includes multiple essential Python modules, such as matplotlib, numpy, scipy, pandas, and nose.
Unfortunately, scikit-bio 0.5.1 or above, which Horizomer requires, is available only from Ubuntu 17.04 onward.

Python 2 modules:
Note: Python 2 is required for OrthoFinder and PhyloPhlAn

sudo apt install python-matplotlib python-numpy python-scipy python-dendropy

Python modules for test environment:

sudo apt install flake8

Perl modules
Note: Perl is required for DarkHorse.

sudo apt install bioperl
sudo apt install cpanminus
cpanm DBI DBD::mysql

Basic toolkits for building programs:

sudo apt install build-essential cmake mysql-server

Applications

sudo apt install fasttree kalign mafft mcl muscle ncbi-blast+ phyml prodigal raxml t-coffee

script for swapping genes

Write a script for combining / spiking genes into artificial and genuine genomes to simulate HGTs.
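
One possible shape for the spiking step, treating a genome as an ordered list of gene IDs, is sketched below. The function spike_genes, the gene IDs, and the assumption that donor and recipient IDs do not overlap are all illustrative choices, not an existing script in this repository.

```python
import random


def spike_genes(recipient, donor, n, seed=None):
    """Insert n randomly chosen donor genes at random positions in a copy
    of the recipient genome. Returns (new genome, inserted gene IDs)."""
    rng = random.Random(seed)  # seeded so that a simulation can be repeated
    picked = rng.sample(donor, n)
    genome = list(recipient)
    for gene in picked:
        # randint is inclusive, so a gene may also be appended at the end.
        genome.insert(rng.randint(0, len(genome)), gene)
    return genome, picked


recipient = ["r1", "r2", "r3", "r4"]  # made-up gene IDs
donor = ["d1", "d2", "d3"]
simulated, transferred = spike_genes(recipient, donor, n=2, seed=42)
```

Returning the list of inserted genes alongside the genome keeps the ground truth available for later accuracy calculations.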

create env variables for PhyloNet & Jane 4 java jar files

to use in launch_software.sh and travis

Merge putative HGTs into summary

At the moment, each launch_TOOL.sh script runs the specified software on a complete genome to generate putative HGTs. We need to write a function in parse_output.py that merges the results of all these tools into one summary table.
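
As a rough illustration of the merge, the sketch below builds a presence/absence table with one row per gene and one column per tool. The function merge_predictions, the table layout, and the gene IDs are assumptions; the real summary format is still to be decided.

```python
def merge_predictions(results):
    """Merge per-tool predictions into one table.

    `results` maps a tool name to the set of gene IDs it reported as HGTs.
    Returns (header, rows), where each row is [gene, 0/1 per tool].
    """
    tools = sorted(results)
    genes = sorted(set().union(*results.values()))
    rows = [[gene] + [int(gene in results[tool]) for tool in tools]
            for gene in genes]
    return ["gene"] + tools, rows


# Made-up predictions for two of the tools named in this tracker.
results = {"DarkHorse": {"g1", "g2"}, "HGTector": {"g2"}}
header, rows = merge_predictions(results)
```

A gene's row sum then gives the number of tools that agree on it, which could serve as a simple consensus score.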

travis install of dependency software

Add code to install required software (JANE 4, TREX, CONSEL) from INSTALL.md

function to parse gene losses, duplications and donor/recipient info

Edit parse_output.py to collect gene loss, duplications and donor/recipient information for phylogenetic methods. Edit compute_accuracy.py to compute additional precision, recall and F-score for donor/recipient information.
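
The additional metrics could be computed along these lines, treating each prediction as a (gene, donor, recipient) tuple that counts as correct only when all three fields match the truth. This is a sketch under that assumption; the function name and the example tuples are hypothetical and not taken from compute_accuracy.py.

```python
def precision_recall_f(predicted, truth):
    """Return (precision, recall, F-score) for two collections of items."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # true positives: exact tuple matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Made-up (gene, donor, recipient) events.
truth = {("g1", "A", "B"), ("g2", "C", "B")}
predicted = {("g1", "A", "B"), ("g2", "D", "B"), ("g3", "A", "B")}
precision, recall, f_score = precision_recall_f(predicted, truth)
```

Here only the g1 event matches on all three fields, so the g2 prediction with the wrong donor counts as both a false positive and a missed true event.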

Need to update GenBank parser once scikit-bio is updated

The GenBank format generated by the current version of scikit-bio is not compatible with some programs or libraries (e.g., BioPerl). This issue is expected to be fixed by the scikit-bio team. Once that's done, we need to update this program.

add homology search tools

DarkHorse, HGTector, Distance Method

Conda recipe for PhyloNet

update scikit-bio channel to biocore for GenBank parser

Once it has been merged. See PFC.

(WIP) Re-organize the codebase

Plan for Re-organizing the WGS-HGT codebase

Rationale

The original WGS-HGT was a workflow to benchmark multiple currently available HGT detection tools. Since then, the goal of the project has evolved and expanded considerably. Meanwhile, some coding techniques and standards have been updated too. I am therefore planning to re-organize the codebase to keep it up to date.

Here we define WGS-HGT as a loose repository that hosts all relevant code under the larger framework of the "Web of Life" project. Code lives here until it is migrated to a more suitable repository.

Naming

The phrase "WGS-HGT" isn't easy to pronounce, and doesn't precisely describe the current plan of the whole project. People have suggested two candidates:

"weboflife", the same as the project name and the least confusing, but it looks a bit awkward when lowercased and merged.
"horizomer", which indicates that the complete set of horizontally acquired genes in a genome should be called a "horizome", and the goal of the software package is to identify them.

What do you think? Any new ideas are welcome!

Structure

The codebase shall be divided into the following second-level directories under wgshgt:

wrapper: Code for running third-party programs, reformatting inputs and parsing outputs.
- Each program occupies one subdirectory, for modularity. The directory should contain one Python script that provides a programming interface for interacting with the program, plus Bash scripts if necessary.
- Code that automates the installation of the programs should be included too, but its actual content can be migrated to conda recipes, leaving only the interface.
data: Code for constructing or retrieving data (e.g., random gene shuffler, genome evolution simulator, genome downloader), actual datasets (if small), and descriptions of large, external test datasets (if not automatically retrievable).
reference: Code for building reference databases, including the genome pool, gene family pool, species tree, gene trees, etc., or just descriptions of reference databases.
- If the scripts have to call external programs to fulfill their function, they should call wrappers rather than launching the programs themselves, unless the programs are very generic (e.g., a GNU tool).
predict: Code for inferring HGT and other evolutionary events on individual input genomes. This is for end users to analyze their own datasets.
render: Code for visualizing trees, networks and other display items.
benchmark: Code for benchmarking HGT-prediction methods and other tools.
misc: Code that does not fit into the existing categories, or that has not been sufficiently engineered to live in other directories.

Each directory may contain a tests directory to host unit test scripts. Each tests directory may contain a data directory to store small data files for unit tests, but the unit tests may also access datasets in the first-level data directory.

Because individual steps for predicting, rendering and benchmarking may have to be executed in different work environments, most scripts should have a command-line interface (via click).

Please share your thoughts. Thank you!

@ekopylova @wasade @RNAer @mortonjt @sjanssen2 @antgonza @tkosciol