Code for the paper "Exploiting Pretrained Biochemical Language Models for Targeted Drug Design", published in Bioinformatics, Proceedings of ECCB 2022.

License: MIT License

cheminformatics denovo-design molecule-generation pretrained-language-models

Exploiting Pretrained Biochemical Language Models for Targeted Drug Design

About

This repository contains the code for "Exploiting Pretrained Biochemical Language Models for Targeted Drug Design", published in Bioinformatics, Proceedings of ECCB 2022. The accompanying materials are available on Zenodo.

Installation

Install the required packages with pip:

pip install -r requirements.txt

Training

ChemBERTaLM

To train the ChemBERTaLM model, first save the MOSES data splits to disk with the following command:

python src/data/download_moses_data.py

Then, train the model with the script below:

python src/molecular_training.py --model_name_or_path seyonec/PubChem10M_SMILES_BPE_450k \
--evaluation_strategy epoch --train_file data/moses/train.txt --validation_file data/moses/test.txt \
--do_train --do_eval --output_dir models/ChemBERTaLM --block_size 256 --save_strategy epoch --num_train_epochs 10

Once training is complete, sample 30K molecules from ChemBERTaLM and benchmark them against MOSES:

python src/molecular_generation.py --model models/ChemBERTaLM --num_return_sequences 30000 --do_sample
python src/molecular_evaluation.py --model models/ChemBERTaLM
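For intuition, two of the metrics reported by the MOSES benchmark, uniqueness and novelty, reduce to simple set operations over the generated molecules. A minimal sketch operating on raw SMILES strings (real benchmarking should use the MOSES package, which also canonicalizes with RDKit and computes validity, FCD, and the other metrics):

```python
def uniqueness_and_novelty(generated, train):
    """Uniqueness: fraction of distinct samples among those generated.
    Novelty: fraction of the distinct samples absent from the training set."""
    unique = set(generated)
    uniqueness = len(unique) / len(generated)
    novelty = len(unique - set(train)) / len(unique)
    return uniqueness, novelty

gen = ["CCO", "CCO", "CCN", "c1ccccc1"]   # toy generated set (one duplicate)
train = ["CCO", "CCC"]                     # toy training set
print(uniqueness_and_novelty(gen, train))  # → (0.75, 0.6666666666666666)
```

In practice both sets should hold canonical SMILES, so that two different string spellings of the same molecule are not counted as distinct.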

Target Specific Models

Training target-specific models requires protein-ligand interactions from BindingDB. The data must first be prepared by following the instructions in notebooks/, or downloaded from the corresponding Zenodo repository.

Targeted generative models can then be fine-tuned by initializing them from ProteinRoBERTa and a ChemBERTa variant (ChemBERTaLM or the original ChemBERTa):

python src/targeted_training.py --output_dir models/EncDecLM \
--protein_model ProteinRoBERTa --ligand_model ChemBERTaLM \
--num_train_epochs 10

The baseline model T5 can be trained from scratch with the following snippet:

python src/targeted_training.py  --output_dir models/T5 --num_train_epochs 20

After training, molecules targeting validation and test proteins can be generated by specifying the trained model path:

python src/targeted_generation.py --model models/EncDecLM

Next, benchmarking metrics can be computed as follows:

python src/targeted_evaluation.py --model models/EncDecLM

Docking

Generated molecules can be further evaluated by performing docking. This requires additional packages: Open Babel, PyMOL, and gnina. Open Babel and PyMOL can be installed with the conda and pip package managers as follows:

conda install -c openbabel openbabel
conda install -c schrodinger pymol-bundle
pip install py3Dmol

To install gnina, please refer to the official documentation.

To perform docking for a protein of interest, a 3D structure of the protein in complex with a ligand is needed. Suitable structures can be chosen from the PDB.

To dock a set of molecules against a protein whose structure is already defined, run the command below, providing the protein's UniProt ID and the path to the SDF file containing the molecules:

python src/run_docking.py --uniprot_id {protein_uniprot_id} --mol_file {path_to_sdf}

Alternatively, instead of a UniProt ID, you can provide the PDB ID of a protein structure together with the ID of the ligand in complex with it:

python src/run_docking.py --target_pdb_id {PDB_ID} --ligand_id {ligand_ID} --mol_file {path_to_sdf}
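Under the hood, docking of this kind typically boils down to a gnina invocation in which the crystal ligand defines the binding-site search box. A sketch of building such a command (the flags are standard gnina options, not necessarily the exact ones src/run_docking.py uses, and the file names are hypothetical):

```python
import shlex

def gnina_command(receptor_pdb, mols_sdf, crystal_ligand_sdf, out_sdf):
    """Build a gnina docking invocation (returned as a list, not executed).

    The crystal ligand defines the binding-site box via --autobox_ligand;
    docked poses and scores are written to the output SDF.
    """
    return [
        "gnina",
        "-r", receptor_pdb,                      # receptor with ligand removed
        "-l", mols_sdf,                          # generated molecules to dock
        "--autobox_ligand", crystal_ligand_sdf,  # defines the search box
        "-o", out_sdf,                           # docked poses
        "--seed", "0",                           # reproducible sampling
    ]

print(shlex.join(gnina_command(
    "P00533-receptor.pdb", "generated.sdf", "P00533-ligand.sdf", "docked.sdf")))
```

The command list can be passed directly to subprocess.run once gnina is on the PATH.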

Citation

@article{10.1093/bioinformatics/btac482,
    author = {Uludoğan, Gökçe and Ozkirimli, Elif and Ulgen, Kutlu O and Karalı, Nilgün and Özgür, Arzucan},
    title = "{Exploiting pretrained biochemical language models for targeted drug design}",
    journal = {Bioinformatics},
    volume = {38},
    number = {Supplement_2},
    pages = {ii155--ii161},
    year = {2022},
    doi = {10.1093/bioinformatics/btac482},
    url = {https://doi.org/10.1093/bioinformatics/btac482},
}

biochemical-lms-for-drug-design's Issues

Unconditional generation with pretrained weights yields molecules with poor validity

Hi, I really like your work and I am facing a little problem trying to reproduce the results in your paper.

I downloaded the pretrained weights from https://zenodo.org/record/6832146 as the paper suggested and followed the instructions in the README to generate molecules by running python src/molecular_generation.py --model models/ChemBERTaLM --num_return_sequences 30000 --do_sample. But the validity of the generated molecules is about 0.13. I suspect the uploaded weights may not be fully trained.

Below are some of the generated molecules:

CC(CNC(=O)C1CCC(=O)N1)OC1
CC1SC(=O)N(CC(=O)NC(C)C
CC1CC(C)C(S(=O)(=O)NCCCOC(C
CN1NCNC1-C1CCCC(NC(=O)C2CCC3
CCOC1NCCNC1N1CCN(CC(=O)NCC(F)(F
O=C(NCC12CC3CC(CC(C3)C1
COCC(C)NC(=O)NC(C)C1CCC(BR
COCCNC1OC(-C2CCCO2)NC1C#N
O=C(NC1CCC(N2CNNC2)CC1)C
C=C(C)CN(CC)C(=O)CN(C
COC1NCCCC1CNC(=O)CN1CC(C)OCC1C
CN1CNNC1SCC(=O)C1CCC2C(C1)
CC(C)C1CCSC1C(=O)NCC1C[NH
O=C(NCC1CCCCC1)N1CCC(OCCCO)CC1
CCCCNC(=O)N1CC2NN(C(C)C)C
CC1CCC(OCC(=O)N2CCCC2C2CCC[NH
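Notably, most of the samples above look truncated rather than chemically implausible: they end mid-branch or leave a ring bond open, which suggests generation is being cut off early. A crude stdlib check flags exactly these cases (a heuristic only; it ignores %nn ring closures and digits inside brackets, so RDKit parsing remains the authoritative validity test):

```python
def looks_truncated(smiles):
    # Unbalanced parentheses or an odd count of a ring-closure digit
    # means the SMILES cannot describe a closed, parseable molecule.
    if smiles.count("(") != smiles.count(")"):
        return True
    return any(smiles.count(d) % 2 for d in "123456789")

print(looks_truncated("CC(CNC(=O)C1CCC(=O)N1)OC1"))  # True: a third, unclosed ring bond 1
print(looks_truncated("c1ccccc1"))                   # False: benzene is well-formed
```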

Docking

Regarding the src/run_docking code responsible for molecular docking, I have come across a concept that I find a bit perplexing. It appears that a Uniprot_ID is associated with both a ligand and a receptor, which raises some questions for me.

From what I understand, a Uniprot_ID typically corresponds to a protein, which then serves as a receptor in molecular docking. However, I'm curious about the inclusion of the term "ligand." Could you please clarify the relationship between a Uniprot_ID, a ligand, and a receptor in this context? Additionally, could you point me to the source where I can find these correspondences, as mentioned in the link?

Furthermore, I noticed in the code here that PyMOL is used to remove the ligand from the PDB file. This processed file, named {target}-receptor.pdb, is then used as a receptor for scoring against a molecule's SDF file. Could you elaborate on the necessity of removing the ligand from the obtained PDB file before conducting docking with the small molecule for scoring?

I appreciate your assistance in clarifying these points. Thank you for your time and expertise.

Dataset

I apologize for bothering you again.

I am feeling a bit puzzled regarding the dataset. I have noticed that there are two versions available for the validation and test dataset: one with "uniq" and one without. I am uncertain about the reason behind having two versions of the dataset and their respective purposes. I would greatly appreciate it if you could kindly provide some clarification on this matter. Thank you for your patience and assistance.

The ProteinRoBERTa model

Hi, thank you very much for sharing this great work.

However, I have encountered an issue. The source code relies on the pretrained ProteinRoBERTa model, but it appears that the ProteinRoBERTa model is not available in HuggingFace's official model repository. I would like to know if there is an alternative source or repository from which the model can be obtained.

Once again, I want to express my admiration for the authors' work and thank you in advance for any assistance provided in addressing this issue.
