Code for ICLR 2024 (Spotlight) paper "MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding"

License: MIT License

Python 100.00%
codebook masked-modeling microenvironment ppi-networks protein-embedding protein-protein-interaction protein-representation-learning vocabulary

mape-ppi's Introduction

MAPE-PPI

MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding (Spotlight)

Lirong Wu, Yijun Tian, Yufei Huang, Siyuan Li, Haitao Lin, Nitesh V Chawla, Stan Z. Li. In ICLR, 2024.

Dependencies

conda env create -f environment.yml
conda activate MAPE-PPI

The default PyTorch version is 2.0.0 and cudatoolkit version is 11.7. They can be changed in environment.yml.

Dataset

Raw data for the three datasets (SHS27k, SHS148k, and STRING) can be downloaded from Google Drive:

  • protein.STRING.sequences.dictionary.tsv Protein sequences of STRING
  • protein.actions.STRING.txt PPI network of STRING
  • STRING_AF2DB PDB files of protein structures predicted by AlphaFold2

Pre-process raw data to generate feature and adjacency matrices (also applicable to any new dataset):

python ./raw_data/data_process.py --dataset data_name

where data_name is one of the three datasets (SHS27k, SHS148k, and STRING).
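To pre-process all three datasets in one go, the command above can be wrapped in a small driver. This is only a sketch: the helper names build_preprocess_command and preprocess_all are hypothetical and not part of the repository, and it assumes the script is run from the repository root. By default it only prints the commands.

```python
import subprocess
import sys

# Dataset names exactly as listed in this README.
DATASETS = ("SHS27k", "SHS148k", "STRING")


def build_preprocess_command(dataset, script="./raw_data/data_process.py"):
    """Return the argv list that pre-processes one dataset."""
    return [sys.executable, script, "--dataset", dataset]


def preprocess_all(datasets=DATASETS, dry_run=True):
    """Build one command per dataset; run them only when dry_run is False."""
    commands = [build_preprocess_command(name) for name in datasets]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first dataset whose processing fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    preprocess_all(dry_run=True)
```

Call preprocess_all(dry_run=False) to actually launch the processing; a new dataset name can be passed through the datasets argument.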

For ease of use, we have pre-processed these three datasets and placed the processed data in Google Drive.

To use the processed data, please put them in ./data/processed_data/.

Usage

Pre-training and Inference on SHS27k/SHS148k/STRING

python -B train.py --dataset STRING --split_mode bfs

The hyperparameters customized for each dataset and data partitions are available in ./configs/param_config.json.
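To inspect the settings for one dataset and partition outside of train.py, something like the following may help. It is a sketch under a loud assumption: the JSON is taken to be nested as config[dataset][split_mode], which this README does not confirm, so check the actual file first. Both function names are hypothetical.

```python
import json
from pathlib import Path


def select_hyperparameters(config, dataset, split_mode):
    """Pick the settings for one (dataset, split_mode) pair.

    ASSUMPTION: config is nested as config[dataset][split_mode]; the real
    layout of param_config.json may differ.
    """
    try:
        return config[dataset][split_mode]
    except KeyError as err:
        raise KeyError(
            f"no hyperparameters for dataset={dataset!r}, split_mode={split_mode!r}"
        ) from err


def load_hyperparameters(dataset, split_mode,
                         config_path="./configs/param_config.json"):
    """Read the JSON config file and return the matching settings."""
    config = json.loads(Path(config_path).read_text())
    return select_hyperparameters(config, dataset, split_mode)
```

For example, load_hyperparameters("STRING", "bfs") would return the entry used by the command above, provided the file follows the assumed layout.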

Pre-training on additional data, Inference on SHS27k/SHS148k/STRING

To pre-train with customized data (e.g., CATH or AlphaFoldDB datasets), please refer to the following steps:

(1) Download additional pre-training data (including their PDB files) from the official website.

(2) Pre-process pre-training PDB files as done in ./raw_data/data_process.py and transform into three files:

  • protein.nodes.pretrain_data.pt
  • protein.rball.edges.pretrain_data.npy
  • protein.knn.edges.pretrain_data.npy

where pretrain_data is the name of the additional pre-training dataset.
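Before launching pre-training, it can be useful to confirm that all three files exist for a given dataset name. The sketch below only builds the file names listed above; the helper names are hypothetical, and the default directory ./data/processed_data is an assumption (this README does not say where train.py looks for pre-training files).

```python
from pathlib import Path

# File-name templates as listed in step (2).
PRETRAIN_TEMPLATES = {
    "nodes": "protein.nodes.{name}.pt",
    "rball_edges": "protein.rball.edges.{name}.npy",
    "knn_edges": "protein.knn.edges.{name}.npy",
}


def pretrain_file_paths(pretrain_data, data_dir="./data/processed_data"):
    """Map each expected file kind to its path.

    ASSUMPTION: data_dir is a guess; adjust it to wherever the files live.
    """
    root = Path(data_dir)
    return {kind: root / template.format(name=pretrain_data)
            for kind, template in PRETRAIN_TEMPLATES.items()}


def missing_pretrain_files(pretrain_data, data_dir="./data/processed_data"):
    """Return the expected files that are not present on disk."""
    paths = pretrain_file_paths(pretrain_data, data_dir)
    return [path for path in paths.values() if not path.exists()]
```

An empty list from missing_pretrain_files("pretrain_data") means all three files were found in the assumed directory.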

(3) Load the pre-processed data and pre-train on it by running

python -B train.py --dataset STRING --split_mode bfs --pre_train pretrain_data

Loading the pre-trained model and Inference on SHS27k/SHS148k/STRING

We provide a pre-trained model in ./trained_model/ for PPI prediction on STRING. To use it, please run

python -B train.py --dataset STRING --split_mode bfs --ckpt_path ../trained_model/vae_model.ckpt

Citation

If you are interested in our repository and our paper, please cite the following paper:

@article{wu2024mape,
  title={MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding},
  author={Wu, Lirong and Tian, Yijun and Huang, Yufei and Li, Siyuan and Lin, Haitao and Chawla, Nitesh V and Li, Stan Z},
  journal={arXiv preprint arXiv:2402.14391},
  year={2024}
}

Feedback

If you have any issue about this work, please feel free to contact me by email:

mape-ppi's People

Contributors

eltociear, lirongwu

mape-ppi's Issues

Cannot Process the Raw Data

When I use the command "python data_process.py --dataset SHS27k" to pre-process the raw data, an error occurs.

I wonder if the processing file is the right version. Thanks for your attention; I really appreciate your work.

How to draw Fig 5?

Dear Lirong:
Thank you for your representative work. I appreciate the analysis of Figure 5 and have some questions regarding its details.

  1. For Figure 5(a), the center of the clustering is the codebook. What do the remaining 2D points represent? Are they node representations?

  2. How was the distribution of amino acids counted? Specifically, how were the bar graphs obtained?

  3. How is the distribution of amino acids calculated by the codebook in Figure 5(c)? As far as I understand, the codebook doesn't have a direct link to amino acids.

about test data split

Dear author, you mention in your paper dividing the test data into three subsets based on whether two proteins have been present in the training data, including: (1) BS: both have been present; (2) ES: one of the proteins is present; (3) NS: neither occurs. How is this implemented and where is the code for this part?
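For reference, the partition described in the question can be sketched in a few lines. This is not the authors' code and says nothing about where the repository implements it; the function name partition_test_pairs and the pair-of-identifiers input format are assumptions made for illustration.

```python
def partition_test_pairs(train_pairs, test_pairs):
    """Split test pairs into BS / ES / NS by how many proteins were seen in training.

    BS: both proteins appear in some training pair.
    ES: exactly one of the two proteins appears in training.
    NS: neither protein appears in training.
    """
    # Every protein that occurs in at least one training pair.
    seen = {protein for pair in train_pairs for protein in pair}

    subsets = {"BS": [], "ES": [], "NS": []}
    labels = ("NS", "ES", "BS")  # indexed by the number of seen proteins (0, 1, 2)
    for a, b in test_pairs:
        n_seen = int(a in seen) + int(b in seen)
        subsets[labels[n_seen]].append((a, b))
    return subsets


if __name__ == "__main__":
    train = [("A", "B"), ("B", "C")]
    test = [("A", "C"), ("A", "D"), ("D", "E")]
    print(partition_test_pairs(train, test))
```

With the toy data above, ("A", "C") falls in BS, ("A", "D") in ES, and ("D", "E") in NS.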

About CATH dataset pretraining

Dear author, you have a pre-trained model on GitHub. On which dataset was this model pre-trained? In your paper, you mentioned using the CATH dataset for pre-training. I think it is an interesting dataset, but I am new to the bioinformatics field and am not familiar with it. How can I download the CATH dataset and use it in your model? Any advice would be appreciated.

python ./raw_data/data_process.py

Dear Authors,

I sincerely thank you for sharing your great work! (Congrats on the Spotlight too!)

Could you please share the following code, as it is not included in the current repo:

python ./raw_data/data_process.py

Thank you very much!
