crazyinter / rnn-virseeker

This is a deep learning method for identification of viral contigs with short length from metagenomic data.

Python 100.00%

rnn-virseeker's Introduction

RNN-VirSeeker

Version 1.1
Authors: Yan Miao, Fu Liu, Yun Liu
Maintainer: Yan Miao [email protected]

Description

This package provides a deep learning method for identifying viral contigs in metagenomic data supplied as a FASTA file. The method can identify short viral contigs (<500 bp).

The prediction model is a Recurrent Neural Network (RNN) that learns high-level features of each contig to distinguish viral from host sequences. The model was trained on equal numbers of known viral and host sequences from the NCBI RefSeq database. Before training, the known sequences were split into non-overlapping contigs of 500 bp and encoded by mapping (A, T, G, C) to (1, 2, 3, 4), respectively. A query sequence shorter than 500 bp is first zero-padded to 500 bp and then classified by the trained RNN model.
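The preprocessing described above can be sketched roughly as follows. This is an illustration, not the code shipped in this repository: the function names are hypothetical, and padding on the right and mapping unknown bases (such as N) to 0 are assumptions, since the description does not specify either.

```python
import numpy as np

CONTIG_LEN = 500
# Mapping stated in the description: (A, T, G, C) -> (1, 2, 3, 4).
# 0 is left free so it can serve as the padding value.
BASE_TO_INT = {"A": 1, "T": 2, "G": 3, "C": 4}


def split_contigs(seq, length=CONTIG_LEN):
    """Split a sequence into non-overlapping chunks (the last one may be shorter)."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


def encode_contig(contig, length=CONTIG_LEN):
    """Encode one contig as integers and zero-pad it up to `length`."""
    if len(contig) > length:
        raise ValueError("contig is longer than the model input length")
    encoded = np.zeros(length, dtype=np.int64)
    # Assumption: padding goes on the right; bases outside A/T/G/C become 0.
    encoded[:len(contig)] = [BASE_TO_INT.get(base, 0) for base in contig.upper()]
    return encoded
```

For example, `encode_contig("ATGC")` yields a vector of length 500 that starts with 1, 2, 3, 4 and is zero elsewhere.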

Dependencies

RNN-VirSeeker requires the Python packages "sklearn", "numpy" and "matplotlib" to be installed.

For convenience, download Anaconda from https://repo.anaconda.com/archive/, which includes most of the required packages.

To install TensorFlow, start "cmd.exe" and enter

pip install tensorflow

Our code was developed with Python 3.6.5 and TensorFlow 1.3.0.

Usage

RNN-VirSeeker is simple to use with your own database.
There are two ways to train the model using train.py.

  • Using our original training database "rnn_train.csv" (4500 viral sequences and 4500 host sequences, each 500 bp long).
    You can use the trained model directly to test query contigs, or change the hyperparameters and retrain the model.
  • Using your own database in ".csv" format.
    • First, choose a set of hyperparameters for training on your dataset.
    • Second, train and refine the model according to its performance on a related validation dataset.
    • Finally, use the saved, well-trained model to identify query contigs.
    Note: Before training, set the path to where the database is located. All labels should be one-hot encoded.
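One-hot encoding of the labels might look like the sketch below. The helper name and the class order (0 = host, 1 = virus) are assumptions made for illustration; check train.py for the order the model actually expects.

```python
import numpy as np


def one_hot(labels, num_classes=2):
    """Turn integer class labels into one-hot rows.

    Assumption (not confirmed by the README): 0 = host, 1 = virus.
    """
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]
```

With this convention, `one_hot([0, 1])` gives the rows `[1, 0]` and `[0, 1]`.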

To make a prediction, put your query contigs into a ".csv" file in which every line contains a single query contig. Running test.py gives each query contig a set of scores; the highest score indicates its predicted class.
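Reading off the classification from the scores amounts to taking the index of the largest score per contig. The sketch below assumes the scores are available as an array with one row per contig and one column per class, and that the columns are ordered (host, virus); both are assumptions, as the output format of test.py is not documented here.

```python
import numpy as np

# Hypothetical column order of the score array; verify against test.py.
CLASS_NAMES = ("host", "virus")


def classify(scores):
    """Return the class name with the highest score for each contig."""
    scores = np.asarray(scores, dtype=float)
    return [CLASS_NAMES[i] for i in scores.argmax(axis=1)]
```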

Copyright and License Information

Copyright (C) 2019 Jilin University

Authors: Yan Miao, Fu Liu, Yun Liu

This program is freely available as Python source code at https://github.com/crazyinter/RNN-VirSeeker.

Commercial users should contact Mr. Miao at [email protected], copyright at Jilin University.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

rnn-virseeker's Issues

Missing Shebang Lines

Neither of the executable files (i.e. train.py and test.py) has a shebang line. While it is not critical that a shebang line be included, it may be advisable unless there is some reason for not including it.

This could be easily resolved by making the first line of those files:
#!/usr/bin/env python3
With that, users would no longer need to run $ python3 train.py, but could just execute using $ ./train.py. Also, it makes it explicit that the script is to be run with Python 3 instead of Python 2.

test.py Not a Test Script

Typically, scripts named test.py or test_*.py are reserved for testing (e.g. via pytest). It is confusing to have the classification script named test.py, since it is not testing anything.

Please clarify if the above is incorrect. I have based this on the README.md:

To make a prediction, users' own query contigs should be edited into a ".csv" file, where every line contains a single query contig. Through test.py, RNN-VirSeeker will give a set of scores to each query contig, higher of which represents its classification result.

Which to me indicates that test.py is run to classify one's own contigs using the trained model.

Hard-coded Directories and File Names, Not Scalable

It seems that the code must be edited in several ways for it to function:

  • Firstly, the directory where the .csv files are must be set so that the command os.chdir('dir') doesn't result in the error No such file or directory: 'dir'.
  • Next, the names of the relevant files must be changed manually (e.g. the training data and target data files).

Users should not have to change source code for it to work. I would suggest utilizing arguments, so that the necessary paths can be passed to the script when it is run. argparse is great for this purpose.

As it stands, the code cannot be run upon cloning, since there is no 'dir' directory. It also is not scalable because users will likely want to run the classification on many datasets (myself included) and we need to automate the way we call this tool. It is infeasible to edit the code for each new dataset. Keep in mind, we are users of the tool and so should not have to change the tool's code.
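A minimal sketch of what this could look like for train.py. The option names (--data-dir, --train-csv, --label-csv) and the function names are hypothetical and not part of the current code; only the default file name "rnn_train.csv" comes from the README.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse command-line options (all option names here are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Train RNN-VirSeeker on a CSV dataset.")
    parser.add_argument("--data-dir", default=".",
                        help="directory containing the .csv files")
    parser.add_argument("--train-csv", default="rnn_train.csv",
                        help="file name of the training data")
    parser.add_argument("--label-csv", required=True,
                        help="file name of the one-hot labels")
    return parser.parse_args(argv)


def resolve_inputs(argv=None):
    """Build the input paths and fail early if a file is missing."""
    args = parse_args(argv)
    train_path = os.path.join(args.data_dir, args.train_csv)
    label_path = os.path.join(args.data_dir, args.label_csv)
    for path in (train_path, label_path):
        if not os.path.isfile(path):
            raise SystemExit(f"input file not found: {path}")
    return train_path, label_path
```

With something like this, the os.chdir('dir') call could go away, and the script could be driven from a shell loop over many datasets without touching the source.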
