This project forked from dmitrymyl/hlpiensemble

A tool for predicting lncRNA-protein interaction probability from sequences.

License: GNU General Public License v3.0

HLPI-Ensemble

A tool for predicting lncRNA-protein interactions from sequences. Based on the http://ccsipb.lnu.edu.cn/hlpiensemble/index.php web server.

Requirements

The package was tested with the following dependencies:

| Package | Version |
| --- | --- |
| python | 3.6.7 |
| R | 3.5.1 |
| R::dplyr | 0.7.8 |
| R::caret | 6.0.82 |
| R::randomForest | 4.6.14 |
| R::xgboost | 0.82.1 |
| R::kernlab | 0.9.26 |
| R::doParallel | 1.0.14 |
| R::foreach | 1.4.4 |
| R::iterators | 1.0.9 |

Currently, the package works only under UNIX. Windows users should consider using WSL or Cygwin.

Installation

git >= 1.8.2 and git-lfs must be installed to fetch the large binaries at models/*.Rdata. Run

git clone https://github.com/dmitrymyl/hlpiensemble.git

and the files will be fetched via git and git-lfs. After that, the package is ready to use.

Usage

The master script is hlpiensemble.py, which can be run from any directory. Invoke it as follows:

python3 path/to/repo/hlpiensemble.py -rna rna.fasta -protein protein.fasta -mode result -output here.csv -taskname some_task

The command line arguments are:

| argument | type | description | default |
| --- | --- | --- | --- |
| -rna | mandatory | fasta file containing one or many RNA sequences. Allowed symbols for sequences are A, T, G, C, U. | None |
| -protein | mandatory | fasta file containing one or many protein sequences. Allowed symbols for sequences are the 20 amino acid letters. | None |
| -mode | optional | Mode of output. If "result", will produce a .csv file. If "full", will produce a directory with all intermediate files. | result |
| -output | mandatory | Name of the output file/directory. | None |
| -taskname | optional | Name of the task. | some_task |
| -cores | optional | Number of cores to use for prediction. | 1 |
| --timing | optional | Whether to profile execution time or not. | False |
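To make the defaults above concrete, here is a minimal sketch of a parser that mirrors the argument table. It assumes argparse and a hypothetical `build_parser` helper; it is not taken from hlpiensemble.py, whose actual argument handling may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the argument table (not the project's code)."""
    parser = argparse.ArgumentParser(prog="hlpiensemble.py")
    parser.add_argument("-rna", required=True, help="fasta file with RNA sequences")
    parser.add_argument("-protein", required=True, help="fasta file with protein sequences")
    parser.add_argument("-mode", choices=["result", "full"], default="result")
    parser.add_argument("-output", required=True, help="name of the output file/directory")
    parser.add_argument("-taskname", default="some_task")
    parser.add_argument("-cores", type=int, default=1)
    parser.add_argument("--timing", action="store_true")
    return parser


# Only the three mandatory arguments are given; the rest fall back to defaults.
args = build_parser().parse_args(
    ["-rna", "rna.fasta", "-protein", "protein.fasta", "-output", "here.csv"]
)
print(args.mode, args.taskname, args.cores, args.timing)
```

With only the mandatory arguments supplied, the optional ones take the defaults listed in the table.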

How predictions are made

The models were trained on the NPInter v2.0 database of lncRNA-protein interactions. Several features (named pse, kmer and acc) are extracted from the sequences and then passed to three mainstream algorithms: random forest (RF), support vector machine (SVM) and XGBoost.

Execution scheme

The programme runs inside the hlpiensemble directory. It copies the sequence files to the path/to/hlpiensemble/task/taskname directory, converting sequences to upper case and replacing Ts with Us in RNA sequences. Intermediate files and prediction results are produced in the same directory. Once prediction is complete, the programme either copies the HLPI-Ensemble.csv file to the location given by -output, renaming it (result mode), or moves the entire directory to that location, renaming it (full mode). In both modes, the task/taskname directory is removed afterwards.
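The sequence normalisation described above (upper-casing and replacing T with U in RNA) can be sketched as follows. The function name `normalize_rna_fasta` is hypothetical, and the sketch assumes that header lines start with `>` and are left unchanged; the actual script may handle headers differently.

```python
def normalize_rna_fasta(text):
    """Uppercase sequence lines and replace T with U.

    Header lines (starting with '>') are kept as-is, which is an
    assumption of this sketch rather than documented behaviour.
    """
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(line.upper().replace("T", "U"))
    return "\n".join(out)


example = ">rna1 test\nacgtTGca\n"
print(normalize_rna_fasta(example))
# >rna1 test
# ACGUUGCA
```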

When running several instances of hlpiensemble.py in parallel, give each task a different name so that results are not overwritten.
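One way to guarantee distinct task names is to generate them. The sketch below only builds the command lines and does not run them; `build_command`, the `task_` prefix and the input file names are hypothetical, while the flags are those listed in the argument table.

```python
import uuid


def build_command(rna, protein, output, script="path/to/repo/hlpiensemble.py", cores=1):
    """Build one hlpiensemble.py command line with a unique task name."""
    taskname = f"task_{uuid.uuid4().hex}"  # random, so concurrent runs do not collide
    return [
        "python3", script,
        "-rna", rna,
        "-protein", protein,
        "-mode", "result",
        "-output", output,
        "-taskname", taskname,
        "-cores", str(cores),
    ]


# Hypothetical input files for three independent jobs.
jobs = [
    ("rna_a.fasta", "protein_a.fasta", "a.csv"),
    ("rna_b.fasta", "protein_b.fasta", "b.csv"),
    ("rna_c.fasta", "protein_c.fasta", "c.csv"),
]
commands = [build_command(*job) for job in jobs]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run` or `subprocess.Popen` to launch the jobs.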

Output

The output is a .csv file containing the probabilities of interaction predicted by each algorithm for each pair of RNA and protein.

Time complexity

Testing scheme

The tests measure how RNA and protein length, the number of sequences and the number of cores influence performance. For length, 5 RNAs and 5 proteins are generated for each length from 100 to 1000 in steps of 100, giving 10 RNA files and 10 protein files; all-to-all runs are then performed, i.e. 100 runs. For number, files with 1 to 10 RNAs and 1 to 10 proteins of length 100 are generated and all-to-all runs are performed, i.e. 100 runs. For parallel processing, 10 RNAs of length 100 and 10 proteins of length 100 are processed with 1 to 10 cores, i.e. 10 runs.
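The run counts quoted above follow directly from the all-to-all design. The short sketch below only enumerates the test grid to check the arithmetic; the variable names are illustrative, and it does not generate sequences or run any predictions.

```python
from itertools import product

# Length test: RNA and protein lengths 100, 200, ..., 1000, paired all-to-all.
lengths = range(100, 1001, 100)
length_runs = list(product(lengths, lengths))

# Number test: 1..10 RNAs against 1..10 proteins, paired all-to-all.
counts = range(1, 11)
number_runs = list(product(counts, counts))

# Parallel test: one fixed input pair, run with 1..10 cores.
core_runs = list(range(1, 11))

print(len(length_runs), len(number_runs), len(core_runs))  # 100 100 10
```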

How to reproduce testing

To evaluate execution time, generate the sequence samples, run the tests on them and process the results. To do so, run the following from the package directory:

cd time_samples
bash generate_samples.sh
cd ../time_results
bash length.sh
bash number.sh
bash parallel.sh

After the scripts finish, run the time_results/time_results.ipynb notebook to produce the plots.

Results

Because each case was run only once, the timing data are sparse and the results are inconsistent. The length of the sequences and their number do not appear to influence execution time. However, using multiple cores in R degrades performance because of the overhead of parallelism.

Contributions

The initial author, Fule Liu, developed most of the prediction backend. @dmitrymyl adapted the web server for CLI usage, including path tweaks, the master script, parallelism and time testing.

Citation

Please cite Huan Hu, Li Zhang, Haixin Ai, Hui Zhang, Yetian Fan, Qi Zhao & Hongsheng Liu (2018) HLPI-Ensemble: Prediction of human lncRNA-protein interactions based on ensemble strategy, RNA Biology, 15:6, 797-806, DOI: 10.1080/15476286.2018.1457935 in your paper if you use this software.
