protein-domain-protocnn's Introduction

About

Protein domain prediction (for more than 17,000 types of protein strings)

How to run:

Downloading

Download data

Download the data from Kaggle and place it in a folder. The scripts expect the folder structure shown below, with --data_dir = "data/random_split":

data
├── random_split
│   ├── dev
│   ├── test
│   ├── train
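Before running any script, it can help to confirm that the download landed in the layout shown above. The following is a minimal sketch, not part of the repository: the helper name `missing_partitions` is hypothetical, and it only checks that the three partition folders exist, not what they contain.

```python
from pathlib import Path

# The three partitions named in the folder tree above.
EXPECTED_PARTITIONS = ("train", "dev", "test")


def missing_partitions(data_dir):
    """Return the partition folders that are absent under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_PARTITIONS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "data/random_split" is the --data_dir value used in this README.
    missing = missing_partitions("data/random_split")
    if missing:
        print("Missing partition folders:", ", ".join(missing))
    else:
        print("Folder layout looks as expected.")
```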

Download Language Encoder

Download the language encoder from here and place it in a folder.
--lang_params = "path/to/lang_params/sample.pickle"

Download model checkpoints

Download any model checkpoint and put it in a folder. You can then pass the paths to the scripts as follows:

--model_checkpoint = "path/to/model_weights/sample.ckpt"
--lang_params = "path/to/lang_params/sample.pickle"
# and so on...

Models                                                 Download link (weights)   Test accuracy
Default ProtoCNN                                       link                      87.46%
Default ProtoCNN + hyperparameter tuning               link                      90.08%
Custom Model (more details in Model Specification)     link                      92.31%
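To put the reported accuracies in perspective, the sketch below computes each model's gain over the default ProtoCNN in percentage points, and the corresponding relative reduction in test error. The accuracy values are copied from the list above; the function name `compare_to_baseline` is made up for this illustration.

```python
# Test accuracies (in percent) as reported in this README.
ACCURACIES = {
    "Default ProtoCNN": 87.46,
    "Default ProtoCNN + hyperparameter tuning": 90.08,
    "Custom Model": 92.31,
}


def compare_to_baseline(accuracies, baseline="Default ProtoCNN"):
    """Map each model to (gain in points, relative error reduction in %)."""
    base_acc = accuracies[baseline]
    base_err = 100.0 - base_acc  # baseline test error in percent
    rows = {}
    for name, acc in accuracies.items():
        gain = acc - base_acc
        err_reduction = 100.0 * gain / base_err
        rows[name] = (round(gain, 2), round(err_reduction, 1))
    return rows


for model, (gain, err_reduction) in compare_to_baseline(ACCURACIES).items():
    print(f"{model}: +{gain} points, {err_reduction}% fewer errors")
```

Under these numbers, the custom model gains 4.85 points over the default, which corresponds to roughly 38.7% fewer test errors.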

Run using Docker

Docker setup.

docker build . -t instadeep:latest

Open docker bash

# CPU only
docker run --rm -it --entrypoint bash instadeep:latest

# GPU
docker run --rm -it --entrypoint bash --gpus=all instadeep:latest

Visualize input data

python src/visualizations/visualize.py --data_dir data/random_split --save_path reports/data_visualizations --partition "train"

Many other options are available as well; please see python src/visualizations/visualize.py --help

Train model

(Note: on CPU, batch_size needs to be much smaller (bs=1). To train on a GPU, pass the --gpu flag.)

python src/train.py --batch_size=256

Many other options are available as well; please see python src/train.py --help
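The CPU/GPU note above can be captured in a small launcher helper. This is only a sketch: `build_train_command` and its parameters are hypothetical names, and it uses just the two options this README mentions (`--batch_size` and `--gpu`), with the batch sizes quoted above as defaults.

```python
import shlex


def build_train_command(use_gpu, gpu_batch_size=256, cpu_batch_size=1):
    """Build the argument list for src/train.py on GPU or CPU."""
    cmd = ["python", "src/train.py"]
    if use_gpu:
        # Large batches are only practical on a GPU.
        cmd += [f"--batch_size={gpu_batch_size}", "--gpu"]
    else:
        # On CPU the README recommends a much smaller batch (bs=1).
        cmd += [f"--batch_size={cpu_batch_size}"]
    return cmd


# Print both variants as copy-pasteable shell commands.
print(shlex.join(build_train_command(use_gpu=True)))
print(shlex.join(build_train_command(use_gpu=False)))
```

The returned list can be passed directly to `subprocess.run` if you prefer to launch training from Python.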

Visualize training metrics like loss, accuracy, etc.

python src/visualizations/visualize_training_vals.py --metrics_file "path/to/file/sample.csv" --save_path "path/to/folder"

Many other options are available as well; please see python src/visualizations/visualize_training_vals.py --help

Get prediction for a single test sample

python src/predict.py --input_seq="Protein_seq" --model_checkpoint="lightning_logs/version_10/checkpoints/epoch=2-step=12738.ckpt"

Many other options are available as well; please see python src/predict.py --help

Evaluate trained model on the test set

python src/evaluate.py --gpu --model_checkpoint="lightning_logs/version_10/checkpoints/epoch=2-step=12738.ckpt" --test_set_dir="data/random_split/test"

Many other options are available as well; please see python src/evaluate.py --help
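The predict and evaluate commands above hard-code a checkpoint path such as lightning_logs/version_10/checkpoints/epoch=2-step=12738.ckpt. If you train several times, a helper like the one below can locate a checkpoint for you. This is a sketch under assumptions: it presumes checkpoints follow the `version_*/checkpoints/epoch=E-step=S.ckpt` naming seen in those commands, the function name `latest_checkpoint` is hypothetical, and it picks the largest (epoch, step) pair, which is not necessarily the best-performing or most recent run.

```python
import re
from pathlib import Path

# Matches file names such as "epoch=2-step=12738.ckpt".
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def latest_checkpoint(log_dir="lightning_logs"):
    """Return the checkpoint with the highest (epoch, step), or None."""
    best_key, best_path = None, None
    for path in Path(log_dir).glob("version_*/checkpoints/*.ckpt"):
        match = CKPT_PATTERN.search(path.name)
        if match is None:
            continue  # skip checkpoints with a different naming scheme
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path


if __name__ == "__main__":
    ckpt = latest_checkpoint()
    print(ckpt if ckpt is not None else "No checkpoint found.")
```

The printed path can then be passed as the `--model_checkpoint` value.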

Run without docker

(Tested on Python version 3.10.13)

# Install requirements (python 3.10)
pip install -r requirements.txt

# Export Python path
export PYTHONPATH="${PYTHONPATH}:full/path/to/the/folder/Instadeep_takehome/"

# Now run any of the commands above.

Testing

Tests are run with pytest, wrapped in coverage to measure code coverage.

# Run tests
coverage run -m pytest src/tests/

# Generate coverage report
coverage report -m

Generated coverage report:
[Figure: screenshot of the generated coverage report]

Visualize results in tensorboard

# Run TensorBoard, pointing --logdir at the folder that contains the tf_events file.
tensorboard --logdir=path/to/tensorboard/folder/sample_folder

Custom Model Specification

  1. I modified the model architecture by increasing the number of residual blocks, adding convolutional layers, increasing layer sizes, and changing the input and output channels, along with other small changes.
  2. The model architectures are as follows: default ProtoCNN (left), modified bigger model (right).
  3. Note: Running the bigger model requires further changes, including changes to the class ProtoCNN. It won't work out of the box.

[Figure: default ProtoCNN architecture (left) and modified bigger model architecture (right)]

Contributors

pratt3000
