This project forked from masashitsubaki/cpi_prediction

This is a code for compound-protein interaction (CPI) prediction based on a graph neural network (GNN) for compounds and a convolutional neural network (CNN) for proteins.

License: Apache License 2.0

Compound-protein interaction (CPI) prediction using a GNN for compounds and a CNN for proteins

This code is an implementation of our paper "Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences (Bioinformatics, 2018)" in PyTorch. In this repository, we provide two CPI datasets: human and C. elegans created by "Improving compound–protein interaction prediction by building up highly credible negative samples (Bioinformatics, 2015)." Note that the ratio of positive and negative samples is 1:1.

In our problem setting of CPI prediction, the input is a pair consisting of a compound in SMILES format and the amino acid sequence of a protein; the output is a binary label (interact or not). The SMILES string is converted with RDKit into 2D graph-structured data for the compound (i.e., atom types and their adjacency matrix). An overview of our CPI prediction by GNN-CNN is as follows:

[Figure: overview of CPI prediction by GNN-CNN]

The details of the GNN and CNN are described in our paper. Note that this implementation is simpler than the model proposed in our original paper (e.g., it omits the edge vectors and their updates described in Eqs. (5) and (6)).

In addition, the above CPI prediction uses our proposed GNN, which is based on learning representations of r-radius subgraphs (i.e., fingerprints) in molecules. We also provide an implementation of the GNN for predicting various molecular properties such as drug efficacy and photovoltaic efficiency in https://github.com/masashitsubaki/GNN_molecules.
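The idea of r-radius subgraphs can be illustrated with a minimal sketch. It assumes the compound is already available as a list of atom types and an adjacency matrix (written by hand here for ethanol, SMILES "CCO", rather than produced by RDKit), and it ignores bond types. The function name extract_subgraph_ids and the vocabulary dictionary are hypothetical names used only for illustration; the repository's preprocessing may differ in detail.

```python
def extract_subgraph_ids(atoms, adjacency, radius, vocabulary):
    """Map each atom to an integer ID of its r-radius subgraph.

    atoms:      list of atom symbols, e.g. ["C", "C", "O"]
    adjacency:  square 0/1 matrix given as a list of lists
    vocabulary: dict shared across molecules, so identical subgraphs
                receive identical IDs (hypothetical helper structure)
    """
    def lookup(key):
        # Assign the next free integer to a key that has not been seen yet.
        if key not in vocabulary:
            vocabulary[key] = len(vocabulary)
        return vocabulary[key]

    # Radius 0: every atom is described by its type alone.
    ids = [lookup(("atom", symbol)) for symbol in atoms]

    # Each further step folds the current IDs of the bonded neighbors
    # into the atom's own ID, enlarging the described subgraph by one bond.
    for _ in range(radius):
        updated = []
        for i, row in enumerate(adjacency):
            neighbors = tuple(sorted(ids[j] for j, bonded in enumerate(row) if bonded))
            updated.append(lookup((ids[i], neighbors)))
        ids = updated
    return ids


# Ethanol written by hand: C-C-O (hydrogens omitted).
atoms = ["C", "C", "O"]
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
vocabulary = {}
radius0 = extract_subgraph_ids(atoms, adjacency, 0, vocabulary)
radius1 = extract_subgraph_ids(atoms, adjacency, 1, vocabulary)
print(radius0, radius1)
```

At radius 0 the two carbons share one ID because only the atom type is considered; at radius 1 they receive different IDs because one is bonded to a carbon only and the other to a carbon and an oxygen. A GNN can then learn one embedding vector per ID.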

Characteristics

  • This code is easy to use. After setting up the environment (e.g., PyTorch), preprocessing the data and training a model take only two commands (see "Usage").
  • If you prepare a CPI dataset in the same format as the ones provided in the dataset directory, you can train our GNN-CNN on your dataset with the same two commands (see "Training of our GNN-CNN using your CPI dataset").

Requirements

  • PyTorch
  • scikit-learn
  • RDKit

Usage

We provide two major scripts:

  • code/preprocess_data.py creates the input tensor data of CPIs for processing with PyTorch from the original data (see dataset/human or celegans/original/data.txt).
  • code/run_training.py trains the model using the above preprocessed data (see dataset/human or celegans/input).

(i) Create the tensor data of CPIs with the following command:

cd code
bash preprocess_data.sh

The preprocessed data are saved in the input directory of the selected dataset (e.g., dataset/human/input).

(ii) Using the preprocessed data, train the model with the following command:

bash run_training.sh

The training and test results and the model are saved in the output directory (after training, see output/result and output/model).

(iii) You can change the hyperparameters in preprocess_data.sh and run_training.sh. Try to learn various models.

Result

Learning curves (x-axis: epoch; y-axis: AUC) on the test datasets of human and C. elegans are as follows:

[Figure: learning curves of test AUC per epoch for the human and C. elegans datasets]

These results can be reproduced with the above two commands (i) and (ii).
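For readers unfamiliar with the metric on the y-axis, the following is a minimal pure-Python sketch of how AUC can be computed from binary labels and predicted scores. It is not the repository's evaluation code; the function name roc_auc and the toy values are assumptions made for illustration. scikit-learn's roc_auc_score computes the same quantity far more efficiently.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen positive sample is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values, not results from the paper.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch but too slow for large test sets.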

Training of our GNN-CNN using your CPI dataset

In the directory of dataset/human or celegans/original, we now have the original data "data.txt" as follows:

CC[C@@]...OC)O MSPLNQ...KAS 0
C1C...O1 MSTSSL...FLL 1
CCCC(=O)...CC=C1 MAGAGP...QET 0
...
...
...
CC...C MKGNST...FVS 0
C(C...O)N MSPSPT...LCS 1

Each line has the form "SMILES sequence interaction", with the three fields separated by spaces. An interaction of 1 means that the pair of SMILES and sequence interacts, and 0 means that the pair does not interact. If you prepare a dataset in the same format as "data.txt" in a new directory (e.g., dataset/yourdata/original), you can train our GNN-CNN on your dataset with the above two commands (i) and (ii).
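Before running the two commands on your own data, it can help to check that every line follows this format. The following is a minimal sketch of such a check, assuming whitespace-separated fields; the function names parse_cpi_lines and label_counts and the toy rows are hypothetical and are not part of the repository.

```python
from collections import Counter


def parse_cpi_lines(lines):
    """Yield (smiles, sequence, label) from lines of 'SMILES sequence interaction'."""
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3 or fields[2] not in ("0", "1"):
            raise ValueError(
                f"line {number}: expected 'SMILES sequence 0|1', got {line!r}"
            )
        smiles, sequence, label = fields
        yield smiles, sequence, int(label)


def label_counts(records):
    """Count positive (1) and negative (0) samples."""
    return Counter(label for _, _, label in records)


# Toy rows for illustration only; they are not taken from the provided datasets.
toy_lines = [
    "CCO MKV 1",
    "",
    "C1CCCCC1 MSP 0",
]
records = list(parse_cpi_lines(toy_lines))
print(records)
print(label_counts(records))
```

To check a real file, the same function can be fed an open file handle, for example parse_cpi_lines(open(path)) with path pointing to your data.txt. Comparing the two counts shows whether your dataset has the 1:1 ratio of positive and negative samples that the provided datasets have.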

TODO

  • Preprocess data whose SMILES contain "." (i.e., a molecule that consists of multiple disconnected graphs).
  • Provide pre-trained models and demo scripts.
  • Implement efficient batch processing of the attention mechanism bridging the two different architectures (GNN and CNN).

How to cite

@article{tsubaki2018compound,
  title={Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences},
  author={Tsubaki, Masashi and Tomii, Kentaro and Sese, Jun},
  journal={Bioinformatics},
  year={2018}
}
