rostlab / bindpredict

Prediction of binding residues for metal ions, nucleic acids, and small molecules.

bindpredict's Introduction

bindEmbed21

bindEmbed21 is a method to predict whether a residue in a protein binds to metal ions, nucleic acids (DNA or RNA), or small molecules. To this end, bindEmbed21 combines homology-based inference and Machine Learning. Homology-based inference is executed using MMseqs2 [1]. For the Machine Learning method, bindEmbed21DL uses ProtT5 embeddings [2] as input to a 2-layer CNN. Since bindEmbed21 is based on single sequences, it can easily be applied to any protein sequence.

Usage

run_bindEmbed21DL.py shows an example of how to generate binding residue predictions using the Machine Learning part of bindEmbed21 (bindEmbed21DL).

run_bindEmbed21HBI.py shows an example of how to generate binding residue predictions using the homology-based inference part of bindEmbed21 (bindEmbed21HBI).

run_bindEmbed21.py combines ML and HBI into the final method bindEmbed21.

develop_bindEmbed21DL.py provides the code to reproduce the bindEmbed21DL development (hyperparameter optimization, training, performance assessment on the test set).

All needed files and paths can be set in config.py (marked as TODOs).

Data

Development Set

The data set used for training and testing was extracted from BioLip [3]. The data folder provides the UniProt identifiers for the 5 splits used during cross-validation (DevSet1014), the test set (TestSet300), and the independent set of proteins added to BioLip after November 2019 (TestSetNew46), together with the corresponding FASTA sequences and the binding annotations used.

The trained models are available in the trained_models folder.

ProtT5 embeddings can be generated using the bio_embeddings pipeline [4]. To use them with bindEmbed21, they need to be converted to use the correct keys. A script for the conversion can be found in the folder utils.

Sets for homology-based inference

For the homology-based inference (bindEmbed21HBI), query proteins will be aligned against big80 to generate profiles. Those profiles are then searched against a lookup set of proteins with known binding residues. The pre-computed MMseqs2 database files and the FASTA file for the lookup database can be downloaded here:

Pre-computed big80 DB: ftp://rostlab.org/bindEmbed21/profile_db.tar.gz
Pre-computed lookup DB: ftp://rostlab.org/bindEmbed21/lookup_db.tar.gz
FASTA for lookup DB: ftp://rostlab.org/bindEmbed21/lookup.fasta

Human proteome predictions

We applied bindEmbed21DL as well as homology-based inference to the entire human proteome. While annotations were only available for 15% of the human proteins, homology-based inference allowed transferring annotations for 48% (9,694) and bindEmbed21DL provided binding predictions for 92% (18,663) of the human proteome. Both sets of predictions are available in the folder human_proteome. For predictions made using homology-based inference, values of -1.0 mark positions for which no annotation was inferred; these positions were therefore considered non-binding.
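The -1.0 convention can be illustrated with a minimal sketch. It assumes the homology-based values for one protein have already been read into a flat list of floats; the function names and the `cutoff` parameter are hypothetical and are not taken from the repository, and the actual file layout in human_proteome may differ.

```python
NOT_INFERRED = -1.0  # value used for positions without an inferred annotation


def hbi_scores_to_labels(scores, cutoff=0.5):
    """Turn per-residue HBI values into binary binding labels.

    Positions equal to NOT_INFERRED are treated as non-binding.
    The `cutoff` applied to the remaining positions is a hypothetical
    parameter chosen for illustration only.
    """
    return [score != NOT_INFERRED and score >= cutoff for score in scores]


def inferred_fraction(scores):
    """Fraction of positions for which an annotation was inferred."""
    if not scores:
        return 0.0
    inferred = sum(1 for score in scores if score != NOT_INFERRED)
    return inferred / len(scores)
```

Keeping the sentinel in a named constant makes it harder to confuse "not inferred" with a genuine non-binding prediction when both end up as the same label.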

Availability

bindEmbed21 is also part of the bio_embeddings pipeline [4]. In addition, predictions of bindEmbed21DL can be run and visualized on a predicted 3D structure using LambdaPP [5].

Requirements

bindEmbed21 is written in Python3. In order to execute bindEmbed21, Python3 has to be installed locally. Additionally, the following Python packages have to be installed:

numpy
scikit-learn
torch
pandas
h5py

To run homology-based inference, MMseqs2 has to be installed locally. Without MMseqs2, it is still possible to run only the Machine Learning part of bindEmbed21 (bindEmbed21DL).
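A quick way to check which part can be run on a given machine is to look for the MMseqs2 executable on PATH. The sketch below assumes the executable is called `mmseqs`; the function names and the returned mode strings are hypothetical helpers for illustration, not part of the repository.

```python
import shutil


def mmseqs_available(binary="mmseqs"):
    """Return True if an executable called `binary` is found on PATH."""
    return shutil.which(binary) is not None


def choose_mode(binary="mmseqs"):
    """Pick the full method if MMseqs2 is found, otherwise the ML-only part."""
    return "bindEmbed21" if mmseqs_available(binary) else "bindEmbed21DL"


if __name__ == "__main__":
    print("Suggested mode:", choose_mode())
```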

Cite

If you use this method and find it helpful, we would appreciate it if you cited the following publication:

Littmann M, Heinzinger M, Dallago C, Weissenow K, Rost B. Protein embeddings and deep learning predict binding residues for various ligand classes. Sci Rep 11, 23916 (2021). https://doi.org/10.1038/s41598-021-03431-4

References

[1] Steinegger M, Söding J (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35.

[2] Elnaggar A, Heinzinger M, Dallago C, Rihawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Bhowmik D, Rost B (2021). ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. bioRxiv.

[3] Yang J, Roy A, Zhang Y (2013). BioLip: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Research, 41.

[4] Dallago C, Schütze K, Heinzinger M, Olenyi T, Littmann M, Lu AX, Yang KK, Min S, Yoon S, Morton JT, & Rost B (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113

[5] Olenyi T, Marquet C, Heinzinger M, Kröger B, Nikolova T, Bernhofer M, Sändig P, Schütze K, Littmann M, Mirdita M, Steinegger M, Dallago C, & Rost B (2022). LambdaPP: Fast and accessible protein-specific phenotype predictions. bioRxiv

bindPredictML17

If you are interested in the predecessor of bindEmbed21, bindPredictML17, you can find all relevant information in the subfolder bindPredictML17.

bindpredict's Issues

Automatic guessing

Hi @mariasche , this looks great. Is there a way of automating the method selection? E.g.: using the file extension or due to some formatting of the text file (e.g.: on line XY it says reprof or profphd).

This would be awesome :)

Empty predictions folder

Hi,

After I followed every step, my output folder is empty and I am having a hard time figuring out why. Maybe you have already seen this and have an idea of how I could fix it?

Bests,

Reproducing independent test set results

Hi,

I've been trying to reproduce the results of bindpredict (DL only) on the independent test set.

I'm using the provided checkpoints with the following changes (to configure paths):

Diff from master

diff --git a/config.py b/config.py
index 877f745..e39f969 100644
--- a/config.py
+++ b/config.py
@@ -11,13 +11,13 @@ class FileSetter(object):
 
     @staticmethod
     def embeddings_input():
-        return ''
+        return 'embeddings_independent/t5_embeddings/embeddings_converted.h5'
         # TODO set path to embeddings, this should be a .h5-file generated containing per-residue embeddings for all
         #  proteins with key: UniProt-ID, value: embeddings
 
     @staticmethod
     def predictions_folder():
-        return ''  # TODO set path to where predictions should be written
+        return 'predictions'  # TODO set path to where predictions should be written
 
     @staticmethod
     def profile_db():
@@ -58,11 +58,12 @@ class FileSetter(object):
 
     @staticmethod
     def test_ids_in():
-        return 'data/development_set/uniprot_test.txt'  # test ids used during development; available on GitHub
+        with open('data/independent_set/indep_set.txt') as f:
+            return [id_.strip() for id_ in f.readlines()]
 
     @staticmethod
     def fasta_file():
-        return 'data/development_set/all.fasta'  # path to development set; available on GitHub
+        return 'data/independent_set/indep_set.fasta'  # path to development set; available on GitHub
 
     @staticmethod
     def binding_residues_by_ligand(ligand):
diff --git a/data/development_set/uniprot_test.txt b/data/development_set/uniprot_test.txt
index 31dfd9d..3660c04 100644
--- a/data/development_set/uniprot_test.txt
+++ b/data/development_set/uniprot_test.txt
@@ -124,8 +124,6 @@ P50465
 P25524
 P46859
 Q9HAN9
-P84233
-P62799
 P06897
 P02281
 Q9HLQ2
diff --git a/develop_bindEmbed21DL.py b/develop_bindEmbed21DL.py
index c79cca9..a31790f 100644
--- a/develop_bindEmbed21DL.py
+++ b/develop_bindEmbed21DL.py
@@ -12,7 +12,7 @@ def main():
 
     keyword = sys.argv[1]
 
-    path = ''  # TODO set path to working directory
+    path = './trained_models'  # TODO set path to working directory
     Path(path).mkdir(parents=True, exist_ok=True)
 
     cutoff = 0.5

The embeddings were generated with bio_embeddings, though I suspect a newer version as I had to remove a couple of duplicated sequences for it to work.

These are the results from running develop_bindEmbed21DL.py in testing mode:

Prepare data
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
Load model
Calculate predictions
Number of input features: 1024
No residues annotated as binding for Q9LFM3
No residues annotated as binding for Q8A5J2
No residues annotated as binding for P43215
No residues annotated as binding for A0A0B0QJR1
No residues annotated as binding for Q9K943
No residues annotated as binding for P9WPA4
No residues annotated as binding for P0AG51
No residues annotated as binding for A0A0C2W6A5
No residues annotated as binding for Q9SJ89
No residues annotated as binding for Q9BTM1
No residues annotated as binding for Q8LGJ5
No residues annotated as binding for B8H4R9
No residues annotated as binding for P39230
No residues annotated as binding for Q03503
No residues annotated as binding for D0CAL0
No residues annotated as binding for E1C9L3
No residues annotated as binding for K5BJ73
No residues annotated as binding for F0Q4R9
No residues annotated as binding for O94408
No residues annotated as binding for Q58380
No residues annotated as binding for Q9H492
No residues annotated as binding for Q4KCZ1
No residues annotated as binding for Q9I0B9
No residues annotated as binding for Q9KDJ7
No residues annotated as binding for U2EQ00
No residues annotated as binding for P60624
No residues annotated as binding for F0NDX5
No residues annotated as binding for D0CBZ8
No residues annotated as binding for D0CD18
No residues annotated as binding for Q32904
No residues annotated as binding for A0A1Y1BWQ0
No residues annotated as binding for L7T0L4
No residues annotated as binding for Q2G285
No residues annotated as binding for Q709H6
No residues annotated as binding for X2D812
No residues annotated as binding for P17227
No residues annotated as binding for Q07654
No residues annotated as binding for D0CDQ7
overall
CovOneBind: 45 (1.000)
Bound: With predictions: 8, Without predictions: 0
Not Bound: With predictions: 37, Without predictions: 1
TP: 93, FP: 722, TN: 5104, FN: 187
Prec: 0.122 +/- 0.081, Recall: 0.066 +/- 0.051, F1: 0.072 +/- 0.051, MCC: 0.068 +/- 0.046, Acc: 0.817 +/- 0.055
metal
CovOneBind: 32 (1.000)
Bound: With predictions: 3, Without predictions: 0
Not Bound: With predictions: 29, Without predictions: 14
TP: 21, FP: 97, TN: 4425, FN: 23
Prec: 0.050 +/- 0.071, Recall: 0.042 +/- 0.058, F1: 0.045 +/- 0.063, MCC: 0.044 +/- 0.062, Acc: 0.971 +/- 0.009
nucleic
CovOneBind: 17 (1.000)
Bound: With predictions: 2, Without predictions: 0
Not Bound: With predictions: 15, Without predictions: 29
TP: 15, FP: 296, TN: 1542, FN: 16
Prec: 0.032 +/- 0.054, Recall: 0.050 +/- 0.094, F1: 0.038 +/- 0.068, MCC: 0.030 +/- 0.056, Acc: 0.768 +/- 0.143
small
CovOneBind: 34 (0.971)
Bound: With predictions: 6, Without predictions: 1
Not Bound: With predictions: 28, Without predictions: 11
TP: 54, FP: 393, TN: 4438, FN: 172
Prec: 0.112 +/- 0.089, Recall: 0.045 +/- 0.043, F1: 0.057 +/- 0.050, MCC: 0.053 +/- 0.046, Acc: 0.884 +/- 0.028

Obviously, these are much worse than the results reported in the paper. What have I done wrong?
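As a sanity check on a log like the one above, the metrics can be recomputed directly from the TP/FP/TN/FN counts. This is a minimal sketch, assuming those counts are pooled over all residues (the log does not say so); the function name `pooled_metrics` is hypothetical. The "x +/- y" values in the log look like averages with a spread rather than pooled values, so the two need not agree.

```python
import math


def pooled_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, MCC and accuracy from pooled confusion counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Counts copied from the "overall" block of the log.
overall = pooled_metrics(tp=93, fp=722, tn=5104, fn=187)

if __name__ == "__main__":
    for name, value in overall.items():
        print(f"{name}: {value:.3f}")
```

A large gap between pooled and averaged values can hint that many proteins contribute zeros to the average, for example proteins reported as having no residues annotated as binding.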