This repository contains scripts and data to repeat the analyses in Blaabjerg et al.:
"A joint embedding of protein sequence and structure enables robust variant effect predictions".
Execute the pipeline using src/run_pipeline.py. This main script will call other scripts in the src directory to train, validate and test the SSEmb model as described in the paper.
The code has been developed and tested in a Unix environment using the following packages:
python==3.7.16
pytorch==1.13.1
pyg==2.2.0
pytorch-scatter==2.1.0
pytorch-cluster==1.6.0
fair-esm==2.0.0
numpy==1.21.6
pandas==1.3.5
biopython==1.79
openmm==7.6.0
pdbfixer==1.8.1
scipy==1.7.3
scikit-learn==1.0.2
tqdm==4.64.1
pytz==2022.7
matplotlib==3.2.2
mpl-scatter-density==0.7
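A quick way to see whether a local environment matches these pins is to compare them with the installed versions. The following is a minimal sketch, not part of the repository. The names `PINNED`, `PIP_NAMES`, `installed_version` and `find_mismatches` are hypothetical helpers introduced here. The mapping in `PIP_NAMES` is an assumption that the listed names are conda package names whose pip distribution names differ.

```python
import platform

# Pins copied verbatim from the package list above.
PINNED = {
    "python": "3.7.16",
    "pytorch": "1.13.1",
    "pyg": "2.2.0",
    "pytorch-scatter": "2.1.0",
    "pytorch-cluster": "1.6.0",
    "fair-esm": "2.0.0",
    "numpy": "1.21.6",
    "pandas": "1.3.5",
    "biopython": "1.79",
    "openmm": "7.6.0",
    "pdbfixer": "1.8.1",
    "scipy": "1.7.3",
    "scikit-learn": "1.0.2",
    "tqdm": "4.64.1",
    "pytz": "2022.7",
    "matplotlib": "3.2.2",
    "mpl-scatter-density": "0.7",
}

# ASSUMPTION: listed (conda-style) names whose pip distribution name differs.
PIP_NAMES = {
    "pytorch": "torch",
    "pyg": "torch-geometric",
    "pytorch-scatter": "torch-scatter",
    "pytorch-cluster": "torch-cluster",
}


def installed_version(name):
    """Return the installed version string for `name`, or None if not found."""
    if name == "python":
        return platform.python_version()
    dist = PIP_NAMES.get(name, name)
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python >= 3.8
    except ImportError:
        # Python 3.7 (the pinned interpreter) has no importlib.metadata,
        # so fall back to pkg_resources from setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(dist).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {name: (expected, found)} for missing or differing packages."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu117" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

A mismatch reported here does not necessarily mean the code will fail. It only flags where the environment departs from the one the authors tested.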
Data related to the paper can be downloaded here: https://zenodo.org/records/10362251.
The directory contains the following subdirectories:
- train
  - model_weights: Final weights for the SSEmb-MSATransformer and SSEmb-GVPGNN modules.
  - optimizer_weights: Parameters for the optimizer at time of early-stopping.
  - msa: MSAs for the proteins in the training set.
- mave_val
  - msa: MSAs for the proteins in the MAVE validation set.
- rocklin
  - msa: MSAs for the proteins in the mega-scale stability change test set.
- proteingym
  - structure: AlphaFold-2 generated structures used for the ProteinGym test.
  - msa: MSAs for the proteins in the ProteinGym test set.
- scannet
  - model_weights: Final weights for the SSEmb downstream model trained on the ScanNet data set.
  - optimizer_weights: Parameters for the optimizer at time of early-stopping.
  - msa: MSAs for the proteins in the ScanNet data set.
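After downloading the data, it can be useful to confirm that every subdirectory listed above is present before starting a long run. The sketch below assumes the listing maps directly onto nested directories on disk. `EXPECTED_LAYOUT` and `missing_subdirs` are hypothetical names introduced for illustration, and the `data` path is a placeholder for wherever the download was placed.

```python
from pathlib import Path

# Layout transcribed from the subdirectory listing above.
EXPECTED_LAYOUT = {
    "train": ["model_weights", "optimizer_weights", "msa"],
    "mave_val": ["msa"],
    "rocklin": ["msa"],
    "proteingym": ["structure", "msa"],
    "scannet": ["model_weights", "optimizer_weights", "msa"],
}


def missing_subdirs(root, layout=EXPECTED_LAYOUT):
    """Return relative paths from `layout` that are not directories under `root`."""
    root = Path(root)
    missing = []
    for top, children in layout.items():
        for child in children:
            if not (root / top / child).is_dir():
                missing.append(f"{top}/{child}")
    return missing


if __name__ == "__main__":
    # "data" is a hypothetical location for the downloaded files.
    for rel in missing_subdirs("data"):
        print(f"missing: {rel}")
```

The check only looks at directory names. It does not verify the contents of the weight files or MSAs.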
Source code and model weights are licensed under the MIT License.
Code for the original MSA Transformer was developed by the ESM team at Meta Research:
https://github.com/facebookresearch/esm.
Code for the original GVP-GNN was developed by Jing et al.:
https://github.com/drorlab/gvp-pytorch.
Please cite:
Lasse M. Blaabjerg, Nicolas Jonsson, Wouter Boomsma, Amelie Stein, Kresten Lindorff-Larsen (2023). A joint embedding of protein sequence and structure enables robust variant effect predictions. bioRxiv, 2023.12.
@article {Blaabjerg2023.12.14.571755,
author = {Lasse M. Blaabjerg and Nicolas Jonsson and Wouter Boomsma and Amelie Stein and Kresten Lindorff-Larsen},
title = {A joint embedding of protein sequence and structure enables robust variant effect predictions},
elocation-id = {2023.12.14.571755},
year = {2023},
doi = {10.1101/2023.12.14.571755},
URL = {https://www.biorxiv.org/content/early/2023/12/16/2023.12.14.571755},
eprint = {https://www.biorxiv.org/content/early/2023/12/16/2023.12.14.571755.full.pdf},
journal = {bioRxiv}
}