GithubHelp home page GithubHelp logo

aminehb / dsol_rv0.2 Goto Github PK

View Code? Open in Web Editor NEW

This project forked from sameerkhurana10/dsol_rv0.2

0.0 0.0 0.0 99.16 MB

deep protein solubility prediction

License: MIT License

Python 80.04% Shell 7.15% R 12.81%

dsol_rv0.2's Introduction

DeepSol: A Deep Learning Framework for Sequence-Based Protein Solubility Prediction

DOI

Citing the paper:

@article{khurana2018deepsol,
  title={DeepSol: a deep learning framework for sequence-based protein solubility prediction},
  author={Khurana, Sameer and Rawi, Reda and Kunji, Khalid and Chuang, Gwo-Yu and Bensmail, Halima and Mall, Raghvendra},
  journal={Bioinformatics}
}

alt text

Motivation

Protein solubility can be a decisive factor in both research and production efficiency. Novel in silico, accurate, sequence-based protein solubility predictors are highly sought.

Installation

Requirements

This step will install all the dependencies required for running DeepSol in an Anaconda virtual environment locally. You do not need sudo permissions for this step.

  • Install Anaconda

    1. Download Anaconda (64 bit) installer python3.x for linux : https://www.anaconda.com/download/#linux
    2. Run the installer : bash Anaconda3-2019.03-Linux-x86_64.sh and follow the instructions to install.
    3. Need conda > 4.3.30 (If conda already present but lower than this version do: conda upgrade conda)
  • Creating the environment

    1. Run git clone https://github.com/sameerkhurana10/DSOL_rv0.2.git
    2. Run cd DSOL_rv0.2
    3. Run export PATH=<your_anaconda_folder>/bin:$PATH
    4. Run conda env create -f environment.yml
    5. Run source activate dsol (Running on machine with gpu, additionally do conda install cudnn pygpu libgpuarray)
  • R requirements

    • Run R REPL by running the following: R
    • Install R libraries
      1. Interpol (do install.packages('Interpol') )
      2. bio3d (do install.packages('bio3d') )
      3. doMC (do install.packages('doMC'))

    Quit R REPL: quit()

  • SCRATCH (version SCRATCH-1D release 1.1) (http://scratch.proteomics.ics.uci.edu, Downloads: http://download.igb.uci.edu/#sspro)

    1. Run wget http://download.igb.uci.edu/SCRATCH-1D_1.1.tar.gz
    2. Run tar -xvzf SCRATCH-1D_1.1.tar.gz
    3. Run cd SCRATCH-1D_1.1
    4. Run perl install.pl
    5. Run cd ..

All operations related to DeepSol models are to be performed from the folder DSOL_rv0.2

Run DeepSol on New Test file

To run DeepSol on your own protein sequences you need the following two things:

  1. Protein Sequence File: Protein sequence/sequences of interest in fasta format (https://en.wikipedia.org/wiki/FASTA_format). We provide data/Seq_solo.fasta as an example
  2. SCRATCH: Software used to extract biological features from a given protein sequence file. Follow instructions in the previous section to Install SCRATCH

Execute in the command line

  1. R --vanilla < scripts/PaRSnIP.R data/Seq_solo.fasta <path-to-your-scratch-installation>/bin/run_SCRATCH-1D_predictors.sh new_test 32

32 is the number of processors, new_test is the output files' prefix

Following this step, two files are created in the data folder:

  • new_test_src : contains raw protein sequences
  • new_test_src_bio : contains biological features corresponding to the raw protein sequences

Note: data/Seq_multi.fasta can be used instead of data/Seq_solo.fasta. Seq_multi.fasta has multiple protein sequences

  1. ./run.sh --model deepsol1 --stage 1 --mode preprocess --device cpu --test_file new_test data/newtest.data

Note: If you get an MKL error, do export MKL_THREADING_LAYER=GNU in run.sh

This step Preprocesses data files from step 1, and stores at data/newtest.data in a format acceptable to Deepsol models.

Note: You can also use deepsol2 or deepsol3 in place of deepsol1. See Paper for more details

  1. ./run.sh --model deepsol1 --stage 2 --mode decode --device cpu data/newtest.data

Result will be saved in results/reports/. Note: you can also use deepsol2 or deepsol3 in place of deepsol1.

Recipe for running DeepSol (For all experiments reported in the manuscript)

Recipe is contained in the script run.sh. To see the options run ./run.sh and you shall see the following:

main options (for others, see top of script file)
  --model (deepsol1/deepsol2/deepsol3)               # model architecture to use
  --mode  (preprocess/train/decode/cv)               # data preparation or decode or cross-validate using an existing model
  --stage (1/2)                                      # point to run the script from 
  --conf_file                                        # model parameter file
  --keras_backend                                    # backend for keras
  --cuda_root                                        # the path cuda installation
  --device (cuda/cpu)                                # device to use for running the recipe
  --test_file                                        # name of the new test file

There are two stages in the script.

  1. Data preparation as demonstrated with new test data. Prepared data is already provided in the data folder: protein.data and protein_with_bio.data.
  2. Model building with --mode train and decoding with best DeepSol models using --mode decode. Information about --mode cv is given in "parameter variance check" section.

We provide support for gpu usage using the option --device cuda. More details in the GPU section.

CPU

Train New Models

Train DeepSol models using pre-compiled training, validation data and optimal hyper-parameter setting as in parameters.json file:

  1. ./run.sh --model deepsol1 --stage 2 --mode train --device cpu data/protein.data
  2. ./run.sh --model deepsol2 --stage 2 --mode train --device cpu data/protein_with_bio.data

Result will be a model named deepsol1 or deepsol2 stored in results/models/.

Note that we used --model deepsol2, you can use deepsol3 for step 2. Ignore UserWarning at the output.

Test Best DeepSol Models

Test existing DeepSol models with pre-compiled test data:

  1. ./run.sh --model deepsol1 --stage 2 --mode decode --device cpu data/protein.data
  2. ./run.sh --model deepsol2 --stage 2 --mode decode --device cpu data/protein_with_bio.data

Result will be saved in results/reports/.

Note that we used --model deepsol2, you can use deepsol3 for step 2.

GPU

Cuda installation

Ensure that cuda is installed. We support Cuda 8.0 and Cudnn 5.1 . For any other version of Cudnn, you might run into some issues.

Install Cuda 8.0 and Cudnn 5.1 from https://developer.nvidia.com/

Code was tested against GeForce GTX 1080 Nvidia GPUs https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080/ . The GPU driver version is 384.59.

Code was also tested on Nvidia Tesla K20Xm : https://www.techpowerup.com/gpudb/1884/tesla-k20xm with driver version 375.66.

Train New Models

Train DeepSol models using pre-compiled training, validation data and optimal hyper-parameter setting as in parameters.json file:

  1. ./run.sh --model deepsol1 --stage 2 --mode train --cuda_root <path-to-your-cuda-installation> --device cuda data/protein.data
  2. ./run.sh --model deepsol2 --stage 2 --mode train --cuda_root <path-to-your-cuda-installation> --device cuda data/protein_with_bio.data.

Result will be a model named deepsol1 or deepsol2 stored in results/models.

Note that we used --model deepsol2, you can use deepsol3 for step 2. Ignore UserWarning at the output.

Also, --cuda_root should be the path to your cuda installation. By default it is /usr/local/cuda.

Test Best DeepSol Models

Test existing DeepSol models with pre-compiled test data:

  1. ./run.sh --model deepsol1 --stage 2 --mode decode --cuda_root <path-to-your-cuda-installation> --device cuda data/protein.data
  2. ./run.sh --model deepsol2 --stage 2 --mode decode --cuda_root <path-to-your-cuda-installation> --device cuda data/protein_with_bio.data.

Result will be saved in results/reports/.

Note that we used --model deepsol2, you can use deepsol3 for step 2.

Parameter Variance Check

In this section we calculate the variance in performance of the DeepSol models on 10 cross-validation folds for dataset used in our paper.

For CPU:

  1. run ./run.sh --model deepsol1 --stage 2 --mode cv --device cpu data/protein.data
  2. run ./run.sh --model deepsol2 --stage 2 --mode cv --device cpu data/protein_with_bio.data

For GPU:

  1. run ./run.sh --model deepsol1 --stage 2 --mode cv --cuda_root <path-to-your-cuda-installation> --device cuda data/protein.data
  2. run ./run.sh --model deepsol2 --stage 2 --mode cv --cuda_root <path-to-your-cuda-installation> --device cuda data/protein_with_bio.data

Result will be saved in results/reports/.

Note that we used --model deepsol2, you can use deepsol3 for 2.

FAQ

  1. What all system can your code run on?

A) On most Linux based systems, we tested on Ubuntu 14.04 and 14.10, RedHat 7.4 Maipo and Arch (both cpu and gpu).

  1. How to remove a conda environment?

A) conda remove --name dsol --all

  1. What if I get error error while loading shared libraries: libmpfr.so.4: while installing SCRATCH on Arch ?

A) Do ln -s /usr/lib/libmpfr.so.6.0.0 /usr/lib/libmpfr.so.4. SCRATCH looks for mpfr.so.4 but Arch has a newer version, so we symlink the old location to the new library.

dsol_rv0.2's People

Contributors

raghvendra5688 avatar sameerkhurana10 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.