biocore / horizomer
Workflow for detecting genome-wide horizontal gene transfers
License: BSD 3-Clause "New" or "Revised" License
Edit the lines to the following (quoting the positional parameters so that paths containing spaces are handled correctly):
working_dir=$(readlink -m "$1")
scripts_dir=$(readlink -m "$2")
and
mkdir -p "${working_dir}"
normalize_distances(): Create the full placeholder (with nan) before reading the PHYLIP MSAs, and only update the distance_vector values that are present in the file.
cluster_distances(): the loop for bitvector in sorted_species_set is run 3 times; is it possible to merge any of the loops to speed up the code?
Required as input to EGID.
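The pre-allocation idea above can be sketched as follows. This is a hypothetical illustration, not the project's actual code: the function name matches the issue, but the simple two-column name/distance file layout and the argument names are assumptions.

```python
import numpy as np

def normalize_distances(dist_files, species):
    # Sketch of the proposed fix: pre-allocate the full vector with NaN,
    # then overwrite only the entries actually present in each file,
    # so species missing from a file keep their NaN placeholder.
    index = {sp: i for i, sp in enumerate(species)}
    distance_vector = np.full(len(species), np.nan)
    for path in dist_files:
        with open(path) as f:
            next(f)  # skip the header line (taxa count)
            for line in f:
                name, dist = line.split()[:2]
                if name in index:
                    distance_vector[index[name]] = float(dist)
    return distance_vector
```

Filling with NaN up front makes "no data" explicit and avoids growing the vector while parsing.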
Files to modify:
parse_output.py
run_darkhorse.sh
At the moment launch_software.sh calls run_darkhorse.sh to execute a DIAMOND search (if alignments don't exist) and run DarkHorse to detect HGTs. The resulting summary file from DarkHorse needs to be parsed to output:
SIGI-HMM, GeneMark, AlienHunter, NearHGT
Mac users should do:
brew install coreutils
Then do:
export PATH=/usr/local/opt/coreutils/libexec/gnubin:$PATH
to get this command (GNU readlink) to work properly, since the BSD readlink that ships with macOS does not support -m.
Write a script launch_benchmark.sh that will preprocess the input query genome:
I am testing Horizomer under Ubuntu 16.04. This document records how a user can take advantage of Ubuntu's repository (using sudo apt install ...) to install most of the programs.
Python 3 modules:
sudo apt install python3-skbio python3-biopython python3-click
This pulls in multiple essential Python modules as dependencies, such as matplotlib, numpy, scipy, pandas, and nose.
Unfortunately, scikit-bio 0.5.1 or above, which Horizomer requires, is only available starting from Ubuntu 17.04.
Python 2 modules:
Note: Python 2 is required for OrthoFinder and PhyloPhlAn
sudo apt install python-matplotlib python-numpy python-scipy python-dendropy
Python modules for test environment:
sudo apt install flake8
Perl modules
Note: Perl is required for DarkHorse.
sudo apt install bioperl
sudo apt install cpanminus
cpanm DBI DBD::mysql
Basic toolkits for building programs:
sudo apt install build-essential cmake mysql-server
Applications
sudo apt install fasttree kalign mafft mcl muscle ncbi-blast+ phyml prodigal raxml t-coffee
Write a script for combining / spiking genes into artificial and genuine genomes to simulate HGTs, to use in launch_software.sh and Travis.
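The spiking step could look something like the sketch below. All names here are hypothetical; genes are represented abstractly as list items, whereas the real script would operate on sequence records.

```python
import random

def spike_genes(recipient_genes, donor_genes, n, seed=None):
    """Hypothetical sketch: simulate HGT by inserting n randomly chosen
    donor genes at random positions in the recipient's gene list."""
    rng = random.Random(seed)  # seed for reproducible simulations
    spiked = list(recipient_genes)
    transferred = rng.sample(list(donor_genes), n)
    for gene in transferred:
        spiked.insert(rng.randrange(len(spiked) + 1), gene)
    # Return the spiked genome and the ground-truth transfers,
    # which a benchmark needs to score predictions against.
    return spiked, transferred
```

Returning the list of transferred genes alongside the spiked genome gives the benchmark its ground truth for free.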
At the moment each launch_TOOL.sh script runs the specified software on a complete genome to generate putative HGTs. We need to write a script, parse_output.py, that will merge the results of all these tools into one summary table.
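One way to structure that merge is sketched below. This is an assumption about the intermediate representation, not the actual parse_output.py: it presumes each tool's output has first been parsed into a {gene: score} mapping.

```python
def merge_tool_results(results):
    """Hypothetical sketch: merge per-tool {gene: score} mappings into one
    summary table (header row + one row per gene), with None marking genes
    a given tool made no call on."""
    tools = sorted(results)
    genes = sorted({g for scores in results.values() for g in scores})
    header = ['gene'] + tools
    rows = [[g] + [results[t].get(g) for t in tools] for g in genes]
    return [header] + rows
```

An outer-join layout like this keeps every gene any tool flagged, so disagreements between tools stay visible in the summary.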
Add code to install required software (JANE 4, TREX, CONSEL) from INSTALL.md
Edit parse_output.py to collect gene loss, duplication, and donor/recipient information for phylogenetic methods. Edit compute_accuracy.py to compute additional precision, recall, and F-score values for the donor/recipient information.
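The additional metrics follow the standard definitions; a minimal sketch over sets of donor/recipient assignments (the function name and set representation are assumptions, not the actual compute_accuracy.py code):

```python
def precision_recall_f1(predicted, truth):
    """Sketch: precision, recall and F-score for predicted vs. true
    donor/recipient assignments, given as sets of hashable items
    (e.g. (donor, recipient) tuples)."""
    tp = len(predicted & truth)  # true positives: correct assignments
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Guarding the divisions keeps the metrics defined when a tool predicts nothing or the truth set is empty.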
The GenBank format generated by the current version of scikit-bio is not compatible with some programs or libraries (e.g., BioPerl). This issue is expected to be fixed by the scikit-bio team. Once that's done, we need to update this program.
DarkHorse, HGTector, Distance Method
Once it has been merged. See PFC.
The original WGS-HGT was a workflow to benchmark multiple currently available HGT detection tools. Since then, the goal of the project has largely evolved and expanded. Meanwhile, some coding techniques and standards have been updated too. Thus I am planning on re-organizing the codebase to keep it updated.
Here we define WGS-HGT as a loose repository that hosts all relevant code under the larger framework of the "Web of Life" project. Code lives here until it is migrated to more suitable repositories.
The phrase "WGS-HGT" isn't easy to pronounce, and doesn't precisely describe the current plan of the whole project. People have suggested two candidates:
What do you think? Any new ideas are welcome!
The codebase shall be divided into the following second-level directories under wgshgt:
wrapper: Code for running third-party programs, reformatting inputs and parsing outputs.
data: Code for constructing or retrieving data (e.g., random gene shuffler, genome evolution simulator, genome downloader), actual datasets (if small), and descriptions of large, external test datasets (if not automatically retrievable).
reference: Code for building reference databases, including genome pool, gene family pool, species tree, gene tree, etc., or just descriptions of reference databases.
predict: Code for inferring HGT and other evolutionary events on individual input genomes. These are for end users to analyze their own datasets.
render: Code for visualizing trees, networks and other forms of display items.
benchmark: Code for benchmarking HGT-prediction methods and other tools.
misc: Code that cannot fit into existing categories, or code that has not been sufficiently engineered to live in other directories.
Each directory may contain a tests directory to host unit test scripts. Each tests directory may contain a data directory to store small data files for unit tests. But the unit test code may also access datasets in the first-level data directory.
Because individual steps for predicting, rendering and benchmarking may have to be executed in different work environments, most scripts should have a command-line interface (via click).
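A minimal sketch of the click-based CLI pattern proposed above; the option names and the command body are placeholders, not any of the project's real scripts:

```python
import click

@click.command()
@click.option('--input', 'input_fp', required=True, help='Input genome file.')
@click.option('--output', 'output_fp', required=True, help='Output summary table.')
def run(input_fp, output_fp):
    """Run one workflow step on a genome."""
    # Placeholder body: a real step would read input_fp, do its work,
    # and write results to output_fp.
    click.echo('Processing %s -> %s' % (input_fp, output_fp))

if __name__ == '__main__':
    run()
```

Declaring inputs and outputs as options like this lets each step run standalone in whatever environment it needs.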
Please share with people your valuable thoughts. Thank you!
@ekopylova @wasade @RNAer @mortonjt @sjanssen2 @antgonza @tkosciol