jrwnter / cddd

Implementation of the Paper "Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations" by Robin Winter, Floriane Montanari, Frank Noe and Djork-Arne Clevert.

License: MIT License

Python 99.42% Shell 0.30% Dockerfile 0.28%

cddd's Introduction

Continuous and Data-Driven Descriptors (CDDD)

Installing

Prerequisites

python 3
tensorflow 1.10
numpy
rdkit
scikit-learn

Conda

Create a new environment:

git clone https://github.com/jrwnter/cddd.git
cd cddd
conda env create -f environment.yml
source activate cddd

Install tensorflow without GPU support:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.10.0-cp36-cp36m-linux_x86_64.whl

Or with GPU support:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl

On Windows, with CPU only:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow-1.10.0-cp36-cp36m-win_amd64.whl

And install the cddd package:

pip install .

Downloading Pretrained Model

A pretrained model as described in ref. 1 is available on Google Drive. Download and unzip it by executing the bash script "download_default_model.sh":

./download_default_model.sh

The default_model.zip file can also be downloaded manually from https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h

Testing

Extract molecular descriptors from two QSAR datasets (refs. 2, 3) and evaluate the performance of an SVM trained on these descriptors.

cd example
python3 run_qsar_test.py --model_dir ../default_model

Or with GPU support, e.g. on device 0:

python3 run_qsar_test.py --model_dir ../default_model --use_gpu --device 0

The accuracy on the Ames dataset should be around 0.814 +/- 0.006.

The r2 on the Lipophilicity dataset should be around 0.731 +/- 0.029.

Getting Started

Extracting Molecular Descriptors

Run the script run_cddd.py to extract molecular descriptors for your SMILES:

cddd --input smiles.smi --output descriptors.csv  --smiles_header smiles

Supported input:

.csv-file with one SMILES per row
.smi-file with one SMILES per row

For .csv: Specify the header of the SMILES column with the flag --smiles_header (default: smiles)
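As a minimal sketch of preparing such an input file, the snippet below writes a .csv with one SMILES per row using Python's standard csv module. The helper name write_smiles_csv, the file location, and the example molecules are assumptions made for illustration; only the smiles header corresponds to the default of --smiles_header described above.

```python
import csv
import os
import tempfile


def write_smiles_csv(path, smiles, header="smiles"):
    """Write one SMILES per row under a single header column (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])
        for smi in smiles:
            writer.writerow([smi])


# Illustrative molecules only: ethanol, benzene, acetic acid.
example_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
csv_path = os.path.join(tempfile.mkdtemp(), "smiles.csv")
write_smiles_csv(csv_path, example_smiles)

# Read the file back to confirm its layout: header first, then one SMILES per row.
with open(csv_path, newline="") as handle:
    rows = list(csv.reader(handle))
```

A file written this way could then be passed to the command above via --input, with --smiles_header set to the chosen column name.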

Inference Module

The pretrained model can also be imported and used directly in python via the inference class:

import pandas as pd
from cddd.inference import InferenceModel
from cddd.preprocessing import preprocess_smiles

Load and preprocess data:

ames_df = pd.read_csv("example/ames.csv", index_col=0)
ames_df["smiles_preprocessed"] = ames_df.smiles.map(preprocess_smiles)
ames_df = ames_df.dropna()
smiles_list = ames_df["smiles_preprocessed"].tolist()

Create an instance of the inference class:

inference_model = InferenceModel()

Encode all SMILES into the continuous embedding (molecular descriptor):

smiles_embedding = inference_model.seq_to_emb(smiles_list)

The inference model instance can also be used to decode a molecule embedding back to an interpretable SMILES string:

decoded_smiles_list = inference_model.emb_to_seq(smiles_embedding)

References

[1] R. Winter, F. Montanari, F. Noe and D. Clevert, Chem. Sci, 2019, https://pubs.rsc.org/en/content/articlelanding/2019/sc/c8sc04175j#!divAbstract

[2] K. Hansen, S. Mika, T. Schroeter, A. Sutter, A. Ter Laak, T. Steger-Hartmann, N. Heinrich and K.-R. Müller, J. Chem. Inf. Model., 2009, 49, 2077–2081.

[3] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing and V. Pande, Chemical Science, 2018, 9, 513–530.

cddd's Issues

Git Large File Storage

Hi, just wanted to share our use case about the CDDD lib.

I had to compile my own .whl package and include the default_model in it because Google Drive is blocked by firewalls here.

I am storing the 500MB .whl in github with Git LFS.

I recommend users to use my cddd.whl unofficial package for simplicity.
Now we can just use the default_model without duplicating cddd folders and data over projects.

I am wondering whether the Google Drive URL could be replaced with a Git LFS URL.

Thanks for the tool! All the best!

How do you tokenize the SMILES representation?
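For readers unfamiliar with the question, the sketch below shows one common way to tokenize SMILES with a regular expression. It is not taken from the cddd code base: the pattern and the name tokenize_smiles are assumptions for illustration, and the vocabulary cddd actually uses may differ.

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens, two-digit ring
# closures, then any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens (illustrative, not cddd's own code)."""
    return SMILES_TOKEN.findall(smiles)


# Acetyl chloride: "Cl" stays together as one token instead of "C" and "l".
tokens = tokenize_smiles("CC(=O)Cl")
```

Listing the multi-character alternatives before the single-character fallback is what keeps tokens such as Cl and bracket atoms intact.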

Extracting Molecular Descriptors

I was trying to extract molecular descriptors for the following SMILES, but CDDD could not extract them. Do you know the reason? Thanks in advance.
smiles
COc1cc2c(cc1OC)CN(Cc1ccc(-c3ccc4c(c3)C(N3CCc5cc(OC)c(OC)cc5C3)C(=O)N4)cc1)CC2
CC(C)c1csc(C@HNC(=O)c2cnc(Oc3ccc4c(c3)CCC(c3ccccc3)O4)s2)n1
Cl.O=C(O)[C@@h]1CCCN(CCC=C(c2sccc2COc2cccc3ccccc23)c2sccc2COc2cccc3ccccc23)C1
COCCOc1cc2c(cc1OC)CCN(CCOc1ccc(/C=C/C(=O)c3ccc(C)c(NC(=O)c4ccc5ccccc5n4)c3)cc1)C2

Can't activate the cddd conda env

Following the instructions on setting up a CPU environment on OS X, I am unable to activate the cddd env that I created.

20:23 $ source activate cddd

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/anaconda3/lib/python3.6/site-packages/conda/cli/main.py", line 98, in main
        return activator_main()
      File "/anaconda3/lib/python3.6/site-packages/conda/activate.py", line 632, in main
        print(activator.execute(), end='')
    UnicodeEncodeError: 'ascii' codec can't encode character '\u2718' in position 26: ordinal not in range(128)

This is on OS X 10.12.6, Python 3.6.5 and conda 4.5.11
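The traceback shows conda trying to print a string that contains the character '\u2718' through an ASCII codec. A minimal sketch that reproduces this kind of failure in plain Python follows. The prompt string is made up, and the idea that the character comes from a customized shell prompt combined with a non-UTF-8 locale is an assumption, not something the report confirms.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# '\u2718' is the character named in the traceback; the surrounding prompt is invented.
prompt = "user@host \u2718 $ "
ascii_ok = can_encode(prompt, "ascii")   # the ASCII codec cannot represent it
utf8_ok = can_encode(prompt, "utf-8")    # UTF-8 can
```

If this is indeed the cause, switching the shell to a UTF-8 locale before running source activate cddd may avoid the error.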

Normalizing embeddings before cross validation?

In the demo script, the embeddings are normalized before being fed into the SVM models. If the normalization statistics are computed on the full dataset, the CV performance might overestimate generalizability; see here and here
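A minimal sketch of the alternative, in which the scaling statistics are fitted on the training fold only and then reused unchanged on the held-out fold. The helper names and the toy data are assumptions for illustration; this is not the code of the demo script.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Compute per-column mean and standard deviation from training rows only."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against zero variance so the division below is always defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
train = [[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]]
test = [[6.0, 22.0]]

means, stds = fit_scaler(train)              # statistics come from the training fold
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # the held-out fold reuses them unchanged
```

With scikit-learn, the same effect can be obtained by placing the scaler and the SVM in a Pipeline and passing that pipeline to the cross-validation routine, so that the scaler is refitted inside every fold.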

Can't correctly pad a character in a smiles and regenerate padded smiles using cddd

I was trying to pad N6 in aniline

import os
import sys
from IPython.display import SVG
import torch
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D

sys.path.insert(0, './cddd')
from cddd.inference import InferenceModel 
inference_model = InferenceModel()

aniline_smi = "C1=CC=C(C=C1)N"
m = Chem.MolFromSmiles(aniline_smi)
edit_m = Chem.RWMol(m)
# Replace N6 by Yttrium and then replace Yttrium by padding char
edit_m.GetAtomWithIdx(6).SetAtomicNum(39)
masked_N6_aniline_smiles = Chem.MolToSmiles(edit_m).replace('[Y]','</s>')
print(masked_N6_aniline_smiles)
'</s>c1ccccc1'
# convert smiles to embedding
cddd_descr_masked_aniline = inference_model.seq_to_emb(masked_N6_aniline_smiles)
print(cddd_descr_masked_aniline.shape)
# regenerate  smiles from embedding
regenerate_masked_aniline = inference_model.emb_to_seq(torch.tensor(cddd_descr_masked_aniline).double())
print(regenerate_masked_aniline)
'Sc1ccccc1'

Why doesn't cddd return the padded SMILES? Why does it convert the padding character to sulphur?
I tried to pad characters in the middle too, but it always generated wrong SMILES. Can I mask any character in my SMILES, generate cddd descriptors, and then convert them back to the original padded SMILES?
Your kind answer will be greatly appreciated!

Install fails on windows: fix setup.py

Installing with pip fails on Windows. The fix in setup.py is to remove the final slash after data/default_model:

package_data={'cddd': ['data/*', 'data/default_model']},

https://docs.python.org/3/distutils/setupscript.html
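The fix can be sketched as a small helper that removes trailing slashes from every package_data pattern. The function name is hypothetical, and the "broken" value is an assumption reconstructed from the corrected line above; it is not copied from the repository's setup.py.

```python
def strip_trailing_slashes(package_data):
    """Return a copy of package_data with trailing '/' removed from each pattern."""
    return {
        package: [pattern.rstrip("/") for pattern in patterns]
        for package, patterns in package_data.items()
    }


# Assumed original value: the last pattern ends with a slash.
broken = {"cddd": ["data/*", "data/default_model/"]}
fixed = strip_trailing_slashes(broken)
```

The result matches the corrected package_data line shown above, and the input dictionary is left unmodified.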

Prepare raw data for retrain

This is a great work with well documented code (rare)!

Here I'm trying to retrain this model on the same data in the paper. The paper mentioned that the training data is from PubChem and ZINC15: After applying this preprocessing procedure the resulting dataset consisted of approximately 72 million compounds.

I started from the ZINC15 and PubChem raw data and found that the number of molecules after preprocessing is far more than 72 million, up to tens of GBytes! So I'm curious how you prepared the raw data before preprocessing.

Here is what I did:

ZINC15: download from ZINC tranches, with default config, which are Rep.: 2D, React.: Standard, Purch.: Wait OK, total 1285M substances (far more than 72 million even after preprocessing).
PubChem: download all SDF files from PubChem compound: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

Thank you so much for your patience!

Tf-Keras 2

Hi @jrwnter ,
I saw that you are still working on this repo, with various commits and a published PyPI package.
Do you have plans (ongoing or for the future) to update your code to TF 2.0?
Best,
Lionel

ValueError: need at least one array to concatenate

Hi,

I encountered the following error while generating molecular embeddings (using "cddd --input smiles.smi --output descriptors.csv --smiles_header smiles") from SMILES input. The occurrence of this error is approximately 20%. Is there any way to avoid this error?

"""
Consider installing the package zmq to utilize the InferenceServer class
start preprocessing SMILES...
finished preprocessing SMILES!
start calculating descriptors...
I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Error: seq2emb
Traceback (most recent call last):
File "/home/wzc/anaconda3/envs/cddd/bin/cddd", line 8, in <module>
sys.exit(main_wrapper())
File "/home/wzc/anaconda3/envs/cddd/lib/python3.6/site-packages/cddd/run_cddd.py", line 99, in main_wrapper
tf.app.run(main=main, argv=[sys.argv[0]] + UNPARSED)
File "/home/wzc/anaconda3/envs/cddd/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/wzc/anaconda3/envs/cddd/lib/python3.6/site-packages/cddd/run_cddd.py", line 80, in main
descriptors = infer_model.seq_to_emb(sml_list)
File "/home/wzc/anaconda3/envs/cddd/lib/python3.6/site-packages/cddd/inference.py", line 127, in seq_to_emb
emb = sequence2embedding(self.encode_model, self.hparams, seq)
File "/home/wzc/anaconda3/envs/cddd/lib/python3.6/site-packages/cddd/inference.py", line 46, in sequence2embedding
embedding_array = np.concatenate(emb_list)
ValueError: need at least one array to concatenate
"""

Best regards.
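The last frame of the traceback shows np.concatenate being called on an empty list, which NumPy rejects with exactly this message. One situation that could lead there (an assumption, since the report does not include the input) is that no usable SMILES remain after preprocessing. A minimal guard, with hypothetical helper names and a stand-in preprocessing function:

```python
def keep_valid(smiles, preprocess):
    """Apply preprocess to each SMILES and keep only non-empty string results.

    `preprocess` is any callable that returns a cleaned string, or a
    non-string value (such as None or NaN) for molecules it rejects.
    This contract is an assumption made for the sketch.
    """
    cleaned = (preprocess(smi) for smi in smiles)
    return [smi for smi in cleaned if isinstance(smi, str) and smi]


def toy_preprocess(smi):
    """Stand-in for a real preprocessing function: rejects strings with spaces."""
    return None if " " in smi else smi


valid = keep_valid(["CCO", "not a smiles", "c1ccccc1"], toy_preprocess)
if not valid:
    # Stop early with a clear message instead of failing inside the encoder.
    raise SystemExit("No valid SMILES left; nothing to encode.")
```

Checking the filtered list before encoding turns the opaque concatenate error into an explicit message about the input.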

ERROR: Failed building wheel for cddd

Dear,
When installing the cddd package the ERROR: Failed building wheel for cddd is reported. Detailed output is below:

ERROR: Failed building wheel for cddd
Running setup.py clean for cddd
Failed to build cddd
Installing collected packages: cddd
Running setup.py install for cddd ... error
ERROR: Command errored out with exit status 1:
command: 'C:\Users\erami\anaconda3\envs\cddd\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\erami\AppData\Local\Temp\pip-req-build-m0nh872q\setup.py'"'"'; file='"'"'C:\Users\erami\AppData\Local\Temp\pip-req-build-m0nh872q\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\erami\AppData\Local\Temp\pip-record-ckhscqxe\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\erami\anaconda3\envs\cddd\Include\cddd'
cwd: C:\Users\erami\AppData\Local\Temp\pip-req-build-m0nh872q
Complete output (73 lines):
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\cddd
copying cddd\evaluation.py -> build\lib\cddd
copying cddd\hyperparameters.py -> build\lib\cddd
copying cddd\inference.py -> build\lib\cddd
copying cddd\input_pipeline.py -> build\lib\cddd
copying cddd\models.py -> build\lib\cddd
copying cddd\model_helper.py -> build\lib\cddd
copying cddd\preprocessing.py -> build\lib\cddd
copying cddd\run_cddd.py -> build\lib\cddd
copying cddd\train.py -> build\lib\cddd
copying cddd\__init__.py -> build\lib\cddd
creating build\lib\cddd\data
copying cddd\data\download_pretrained.py -> build\lib\cddd\data
copying cddd\data\__init__.py -> build\lib\cddd\data
running egg_info
writing cddd.egg-info\PKG-INFO
writing dependency_links to cddd.egg-info\dependency_links.txt
writing entry points to cddd.egg-info\entry_points.txt
writing requirements to cddd.egg-info\requires.txt
writing top-level names to cddd.egg-info\top_level.txt
reading manifest file 'cddd.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'cddd.egg-info\SOURCES.txt'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\erami\AppData\Local\Temp\pip-req-build-m0nh872q\setup.py", line 30, in <module>
'cddd = cddd.run_cddd:main_wrapper',
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\__init__.py", line 129, in setup
return distutils.core.setup(**attrs)
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\core.py", line 148, in setup
dist.run_commands()
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\dist.py", line 955, in run_commands
self.run_command(cmd)
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\install.py", line 61, in run
return orig.install.run(self)
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\command\install.py", line 545, in run
self.run_command('build')
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\command\build.py", line 135, in run
self.run_command(cmd_name)
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\dist.py", line 974, in run_command
cmd_obj.run()
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 53, in run
self.build_package_data()
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 118, in build_package_data
for package, src_dir, build_dir, filenames in self.data_files:
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 66, in __getattr__
self.data_files = self._get_data_files()
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 83, in _get_data_files
return list(map(self._get_pkg_data_files, self.packages or ()))
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 95, in _get_pkg_data_files
for file in self.find_data_files(package, src_dir)
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 114, in find_data_files
return self.exclude_data_files(package, src_dir, files)
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 198, in exclude_data_files
files = list(files)
File "C:\Users\erami\anaconda3\envs\cddd\lib\site-packages\setuptools\command\build_py.py", line 234, in
for pattern in raw_patterns
File "C:\Users\erami\anaconda3\envs\cddd\lib\distutils\util.py", line 127, in convert_path
raise ValueError("path '%s' cannot end with '/'" % pathname)
ValueError: path 'data/default_model/' cannot end with '/'
----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\erami\anaconda3\envs\cddd\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\erami\AppData\Local\Temp\pip-req-build-m0nh872q\setup.py'"'"'; file='"'"'C:\Users\erami\AppData\Local\Temp\pip-req-build-m0nh872q\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\erami\AppData\Local\Temp\pip-record-ckhscqxe\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\erami\anaconda3\envs\cddd\Include\cddd' Check the logs for full command output.

Any hints are appreciated.
Thanks

Restrictions for python version

Dear Cddd team,

We noticed that with the python restrictions you've added to your repo:
python_requires='>=3.6.1, <3.7'
https://github.com/jrwnter/cddd/blame/cc0fd9a3dea4373bf5b69ac9be5c1c5511bcbfcb/setup.py#L15
we cannot create conda environments with newer python versions >=3.7.9.

With the older cddd version we were able to do this.
I wonder why this restriction was added in the first place.
Could you make the repo compatible with newer Python versions?

Thank you in advance!
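To make the effect of the quoted python_requires string concrete, here is a minimal sketch that checks an interpreter version against '>=3.6.1, <3.7' using plain tuple comparison. The function name is hypothetical, and the check only mirrors this one specifier; it is not a general version-specifier parser.

```python
import sys


def satisfies_cddd_pin(version):
    """Check a (major, minor, micro, ...) tuple against '>=3.6.1, <3.7'."""
    return (3, 6, 1) <= tuple(version[:3]) < (3, 7)


# True only when the running interpreter is a 3.6 release at or above 3.6.1.
current_ok = satisfies_cddd_pin(sys.version_info)
```

Under this pin, 3.6.5 is accepted while 3.7.9 is rejected, which matches the behavior described in the issue.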

Training New Dataset

Hi, would it be possible to know the basic setup (file to run, dataset to prepare) for training the cddd model on a new dataset? Thank you.

NumPy error when loading data

I'm trying to compute descriptors for my compounds but I keep getting the following error:

ValueError: Object arrays cannot be loaded when allow_pickle=False.

I guess the behavior of np.load() has changed recently and now allow_pickle is False by default.
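A minimal sketch of the behavior described above, assuming a recent NumPy release in which allow_pickle defaults to False. The file and its contents are made up for illustration; which file cddd loads this way is not stated in the report.

```python
import os
import tempfile

import numpy as np

# Build a small object array and save it; object arrays are stored via pickle.
path = os.path.join(tempfile.mkdtemp(), "objects.npy")
np.save(path, np.array([{"a": 1}, "text"], dtype=object), allow_pickle=True)

try:
    np.load(path)  # relies on the default value of allow_pickle
    default_load_failed = False
except ValueError:
    default_load_failed = True

# Opting in explicitly; only do this for files from a trusted source.
loaded = np.load(path, allow_pickle=True)
```

Passing allow_pickle=True restores the old behavior for that call, at the cost of unpickling whatever the file contains.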

can't install tensorflow-gpu

I followed your instructions to install TensorFlow-gpu. After pip install ...., I didn't get any error messages. However, when I import TensorFlow, I get a page of error messages. I also tried to install TensorFlow-gpu with conda, but it still didn't work.
Tensorflow-cpu works.

sure_chembl_alerts.txt and chembl_fps.npy not copied during pip install

mso/data/sure_chembl_alerts.txt and mso/data/chembl_fps.npy have to be manually copied to the installation folder after pip install .

MacOS fix

Thanks for the interesting SMILES embedding !

Judging from the README.md file, installation does not seem to be supported on macOS, in particular because of the following command:

pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.10.0-cp36-cp36m-linux_x86_64.whl

A possible fix is

conda install -c conda-forge tensorflow==1.10.0

Another small note:

In the Inference Module section, when instantiating the class ( inference_model = InferenceModel() ), I had to move the default_model folder manually to cddd/data/ .

Number of Epochs

For how many steps was the translation model trained? Is it the default 250000 steps?

Tag `master` to a specific version

@jrwnter could you tag master to a version so I can package on conda-forge? Thank you!

Training / Fine-Tuning

Hello!

Could documentation for the training and fine-tuning procedure be added as well? Also, if there is an unofficial or official implementation of the codebase in PyTorch, please let us know. Thank you!

What are your objectives (properties) for the classification model used to train the encoder?

In your model there is a 3-layer FC model trained to predict "some" properties so that the latent representation carries more information, but it seems that you haven't described exactly which properties you used in this process. I wonder whether they are logP/QED/SA as described in Gómez-Bombarelli's work (since you are training on ZINC as well), or logP/partial charge/valence electrons/... as you described in the preprocessing section, where they are extracted for each molecule.