vsomnath / holoprot

Multi-Scale Representation Learning on Proteins (NeurIPS 2021)

Home Page: https://arxiv.org/abs/2204.02337

License: MIT License

neurips-2021 machine-learning proteins geometry representation-learning


Multi-Scale Representation Learning on Proteins

(Under Construction and Subject to Change)

This is the official PyTorch implementation of HoloProt (Somnath et al., 2021).

holoprot-cov

Changelog

[30.06.2023]: Added all the raw & processed datasets and binaries to Zenodo.
[28.06.2023]: Added the binaries configuration used with the paper (refer to the Environment Variables section)

Installation

Binaries

Our work utilizes several binaries for generating surfaces, compressing them and computing chemical features and secondary structures.

  • MSMS (2.6.1). To compute the surface of proteins.
  • DSSP. To compute the secondary structure of proteins.
  • BLENDER. To fix meshes and remove any redundancies, while reducing them to a desired number of faces.
  • PDB2PQR (2.1.1), multivalue, and APBS (1.5). These programs are necessary to compute electrostatic charges.
Environment Variables

After downloading the binaries, set the following environment variables to the corresponding paths.

echo 'export PROT=/path/to/dir/' >> ~/.bashrc
echo 'export DSSP_BIN=' >> ~/.bashrc
echo 'export MSMS_BIN=/path/to/msms/' >> ~/.bashrc
echo 'export APBS_BIN=/path/to/apbs/bin/apbs' >> ~/.bashrc
echo 'export BLENDER_BIN=/path/to/blender/blender' >> ~/.bashrc
echo 'export PDB2PQR_BIN=/path/to/pdb2pqr/pdb2pqr' >> ~/.bashrc
echo 'export MULTIVALUE_BIN=/path/to/apbs/share/apbs/tools/bin/multivalue' >> ~/.bashrc
source ~/.bashrc

As a sanity check for correct installation, try running each binary (e.g. $MSMS_BIN) on the command line and check that it produces meaningful output. If it fails with a lib.xx.xx.so not found error, try adding the appropriate directories to your LD_LIBRARY_PATH.
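The sanity check above can be scripted. A minimal sketch, assuming only the variable names from the export block above (the paths themselves are whatever you set):

```shell
# Verify that each binary variable is set and points at an executable file.
for var in DSSP_BIN MSMS_BIN APBS_BIN BLENDER_BIN PDB2PQR_BIN MULTIVALUE_BIN; do
  path="${!var}"   # bash indirect expansion: the value of the variable named in $var
  if [ -x "$path" ]; then
    echo "$var OK ($path)"
  else
    echo "$var missing or not executable: '$path'"
  fi
done
```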

The binaries configuration used in this work can be found on Zenodo (see Changelog). After untarring the file in an appropriate directory, add the following command to your ~/.bashrc file:

export LD_LIBRARY_PATH=$PATH_TO_DIR/binaries/boost/lib:${PATH_TO_MINICONDA}/lib:${PATH_TO_DIR}/binaries/apbs/lib:$HOME/lib:$LD_LIBRARY_PATH
Final installation

To install all dependencies, run

./install_dependencies.sh

If you want Jupyter notebook support (may have errors), run the following commands inside the prot environment:

conda install -c anaconda ipykernel
python -m ipykernel install --user --name=prot

Then change the kernel of an existing notebook to prot, or create a new notebook using prot as the kernel.

Datasets

Datasets are organized in the $PROT/datasets directory. The raw datasets are placed in $PROT/datasets/raw, while the processed datasets are placed in $PROT/datasets/processed.

Dataset Download

All datasets used in this work can be found on Zenodo.

  1. Download files DATASET_NAME_raw.tar.gz to $PROT/datasets/raw and extract.
  2. Download files DATASET_NAME_s2b.tar.gz, DATASET_NAME_p2b_20.tar.gz to $PROT/datasets/processed/DATASET_NAME and extract.

where DATASET_NAME is one of pdbbind or enzyme.
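The two steps above can be sketched as follows. The download itself is omitted (no Zenodo URL is assumed), /tmp/prot_demo stands in for your $PROT, and pdbbind is used for illustration:

```shell
# Hypothetical layout after extraction; adjust PROT and DATASET_NAME to your setup.
PROT="${PROT:-/tmp/prot_demo}"
DATASET_NAME=pdbbind
mkdir -p "$PROT/datasets/raw" "$PROT/datasets/processed/$DATASET_NAME"
# After downloading the archives from Zenodo:
# tar -xzf ${DATASET_NAME}_raw.tar.gz    -C "$PROT/datasets/raw"
# tar -xzf ${DATASET_NAME}_s2b.tar.gz    -C "$PROT/datasets/processed/$DATASET_NAME"
# tar -xzf ${DATASET_NAME}_p2b_20.tar.gz -C "$PROT/datasets/processed/$DATASET_NAME"
ls -d "$PROT/datasets/raw" "$PROT/datasets/processed/$DATASET_NAME"
```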

Dataset Cleanup and Running binaries

Before preparing the graph objects, we need to clean up the PDB files and run the binaries. The possible tasks are:

  • pdbfixer: Clean up PDB files and add any missing residues.
  • dssp: Secondary structure computation using the DSSP binary
  • surface: Constructs the triangular surface mesh using MSMS and compresses it to a desired size using BLENDER
  • charges: Computes electrostatics on the given surface using PDB2PQR, APBS and MULTIVALUE binaries
  • all: Runs all the tasks listed above.

python -W ignore scripts/preprocess/run_binaries.py --dataset DATASET_NAME --tasks TASK_NAME

where DATASET_NAME is one of pdbbind or enzyme, and TASK_NAME is one of pdbfixer, dssp, surface, charges, or all.
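To cover the full dataset × task matrix, the command above can be looped. This dry-run sketch only prints the commands (pipe it to bash to actually execute them):

```shell
# Print the preprocessing command for every dataset/task combination.
for DATASET_NAME in pdbbind enzyme; do
  for TASK in pdbfixer dssp surface charges; do
    echo "python -W ignore scripts/preprocess/run_binaries.py --dataset $DATASET_NAME --tasks $TASK"
  done
done
```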

Superpixel Preparation

Molecular superpixels are constructed using a modified version of ERS. Follow the steps below to first prepare the surface graphs and then generate the molecular superpixel assignments:

python -W ignore scripts/preprocess/prepare_graphs.py --dataset DATASET_NAME --prot_mode surface
python -W ignore scripts/preprocess/generate_patches.py --dataset DATASET_NAME --seg_mode ers --n_segments N_SEGMENTS

HoloProt Graph Construction

To construct the multi-scale HoloProt graphs, run:

EXP_NAME="ERS_balance=0.5_n_segments=20"
python -W ignore scripts/preprocess/prepare_graphs.py --dataset DATASET_NAME --prot_mode surface2backbone
python -W ignore scripts/preprocess/prepare_graphs.py --dataset DATASET_NAME --prot_mode patch2backbone --exp_name $EXP_NAME --n_segments 20

After preprocessing, check that the following directories exist: $PROT/datasets/processed/DATASET_NAME/surface2backbone and $PROT/datasets/processed/DATASET_NAME/patch2backbone_n_segments=20.
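A quick existence check, assuming DATASET_NAME=pdbbind and the directory layout described above:

```shell
# Report whether the two preprocessed graph directories exist.
DATASET_NAME="${DATASET_NAME:-pdbbind}"
for d in surface2backbone "patch2backbone_n_segments=20"; do
  if [ -d "$PROT/datasets/processed/$DATASET_NAME/$d" ]; then
    echo "$d: present"
  else
    echo "$d: MISSING"
  fi
done
```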

Running Experiments

We use wandb to track our experiments. Please make sure wandb is set up before running them.

Default configurations for running experiments can be found in config/train/DATASET_NAME/

For PDBBind, the files are organized as config/train/pdbbind/SPLIT.yaml where SPLIT is one of {identity30, identity60, scaffold}.

For Enzyme dataset, the file is config/train/enzyme/default_config.yaml.

To run the experiments for PDBBind,

python scripts/train/run_model.py --config_file config/train/pdbbind/SPLIT.yaml
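To queue all three PDBBind splits in one go, loop over the config files. A dry-run sketch that only prints the commands (pipe to bash to execute):

```shell
# Print the training command for each PDBBind split.
for SPLIT in identity30 identity60 scaffold; do
  echo "python scripts/train/run_model.py --config_file config/train/pdbbind/$SPLIT.yaml"
done
```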

To run experiments for Enzyme,

python scripts/train/run_model.py --config_file config/train/enzyme/default_config.yaml

Please raise an issue if the commands don't work as expected, or you need help interpreting an error message.

License

This project is licensed under the MIT License. Please see LICENSE.md for more details.

Reference

If you find our code useful for your work, please cite our paper:

@inproceedings{
somnath2021multiscale,
title={Multi-Scale Representation Learning on Proteins},
author={Vignesh Ram Somnath and Charlotte Bunne and Andreas Krause},
booktitle={Advances in Neural Information Processing Systems},
editor={A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan},
year={2021},
url={https://openreview.net/forum?id=-xEk43f_EO6}
}

Please also consider citing the MaSIF work, whose code we use for preparing and computing features on surfaces:

@article{gainza2020deciphering,
  title={Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning},
  author={Gainza, P and Sverrisson, F and Monti, F and Rodol{\`a}, E and Boscaini, D and Bronstein, MM and Correia, BE},
  journal={Nature Methods},
  volume={17},
  number={2},
  pages={184--192},
  year={2020},
  publisher={Nature Publishing Group}
}

Contact

If you have any questions about the code, or want to report a bug, or need help interpreting an error message, please raise a GitHub issue.

holoprot's People

Contributors

vsomnath

holoprot's Issues

A question about the surface global graph embedding code


Hello! I've learned a lot from your model and it has inspired some ideas for me. I'm trying to replicate your code to obtain global embeddings for surface graph summarization and then conduct testing. In the code snippet I've added (highlighted in red), the second line for global pooling seems to be causing an error. It appears that there might be a mismatch in dimensions between 'x' and 'batch'. I'm unsure whether I've correctly defined 'batch' in that line. Could you please take a look and provide some guidance? Thank you! My English skills might not be perfect, so I apologize if my explanation isn't very clear.

Integrating dMaSIF

Hello authors,

First off, really well written paper & great code base. Enjoyed reading it and managed to understand the key concepts on just the first pass.

As you have rightly pointed out in your paper, dMaSIF is a concurrent work that greatly optimises MaSIF, especially by removing the need for pre-computed features, which is not trivial to set up + slows down inference.

Would you happen to have any intention to integrate dMaSIF into your work?

Or phrased another way, what are the steps needed to accomplish it? I am willing to contribute.

Keep up the great research! (I also really enjoyed GraphRetro, having worked under Connor myself!)

Best,
Min Htoo

Missing BackboneData ?

Hello,

Thanks for the work and the clean repo. I wanted to use your processed data for the EC task with my own models. I untarred raw and processed surface2backbone. However, there seems to be a missing class in data/base.py. Indeed, when I try to load the data using the following code:

if __name__ == '__main__':
    raw_dir = os.path.abspath(f"../../data/enzyme/")
    processed_dir = os.path.abspath(f"../../data/")
    train_dataset = EnzymeDataset(mode='train', raw_dir=raw_dir,
                                  processed_dir=processed_dir,
                                  add_targets=True,
                                  prot_mode='surface2backbone')
    item = train_dataset[0]

I get the following error:

Traceback (most recent call last):
  File "/home/vmallet/anaconda3/envs/atom2d/lib/python3.8/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/vmallet/anaconda3/envs/atom2d/lib/python3.8/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/home/vmallet/anaconda3/envs/atom2d/lib/python3.8/site-packages/torch/serialization.py", line 1124, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'BackboneData' on <module 'holoprot.data.base' from '/home/vmallet/projects/atom2d/atom2d/holoprot/data/base.py'>

Do you think this class definition could be missing? Also, am I using the right loading approach? I am trying to retrieve the meshes, residue graphs and node/vertex features that you used for training.

Thanks a ton in advance for your help !

Best,
Vincent

Questions on the multi-level approach

Hello guys, I really enjoyed the paper and the approach of combining different levels of protein representation. I was wondering if I could try a similar approach for my use case. I do not work at the surface level; I am more interested in combining the residue level of my graph (let's call this residue graph Gr) with the more fine-grained atom level (Ga). I can relate each residue node of Gr to many atom nodes from Ga. My questions to you would be:

  1. Would that approach be doable in my case (from your intuition)?
  2. If yes, I am mostly interested in the part of your code that takes care of the update and messaging between the 2 levels (i.e. the computation of hidden features hr for graph Gr, the creation of the new mapping x for graph Ga, then the computation of embeddings ha, and finally the aggregation of original graph embeddings c). Can you guys point me to the methods/code of interest in the repository?

Thanks a lot! Great work!

Question on generation of patches

I tested pulling out one of the patches after running ers.computeSegmentation and then extracted a submesh, for instance:

submesh = pymesh.submesh(mesh, np.where(patch_labels==2)[0], 0)

I expected the triangles to all be collected in a patch; however, they look scattered all about when viewed with the PyMOL ply plugin from MaSIF. Is this intended? I cannot get much from the paper on what the intended behavior is.


APBS and compute charge

4d7b: Could not compute charges because of Charges cannot be computed. Missing file. 4d7b_out.csv. Trying with fixed file.
4d7b: Could not compute charges for fixed file because of Charges cannot be computed. Missing file. 4d7b_out.csv. Returning None

APBS 1.5 doesn't work for me. I have APBS 3.4.1, but charges still cannot be computed for some files (APBS does not produce the xxx_out.csv file).

blender path

I installed Blender with "pip install bpy==2.91a0 && bpy_post_install",
but I can't find the binary to set "BLENDER_BIN=/path/to/blender/blender".
I want to know how to install Blender correctly.
