
hgraph2graph's Introduction

Hierarchical Generation of Molecular Graphs using Structural Motifs

Our paper is at https://arxiv.org/pdf/2002.03230.pdf

Installation

First install the dependencies via conda:

  • PyTorch >= 1.0.0
  • networkx
  • RDKit >= 2019.03
  • numpy
  • Python >= 3.6
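
A minimal environment setup might look like the following (version pins are assumptions; several issues below suggest rdkit=2019.03.4 for compatibility with the pretrained ChEMBL checkpoint):

conda create -n hgraph2graph python=3.6
conda activate hgraph2graph
conda install -c pytorch pytorch
conda install -c rdkit rdkit=2019.03.4
conda install -c conda-forge networkx numpy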

Then run pip install . from the repository root (note the trailing dot). Additional dependency for property-guided finetuning:

  • Chemprop >= 1.2.0

Data Format

  • For graph generation, each line of a training file is a SMILES string of a molecule
  • For graph translation, each line of a training file is a pair of molecules (molA, molB) that are similar to each other but where molB has better chemical properties. Please see data/qed/train_pairs.txt (format illustrated below). The test file is a list of molecules to be optimized. Please see data/qed/test.txt.
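
For illustration, a pair file is expected to hold two space-separated SMILES per line, the source molecule followed by its improved counterpart (the molecules below are made-up examples, not entries from data/qed/train_pairs.txt):

CC(C)Cc1ccccc1 CC(C)Cc1ccc(O)cc1
c1ccc2[nH]ccc2c1 Oc1ccc2[nH]ccc2c1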

Molecule generation pretraining procedure

We can train a molecular language model on a large corpus of unlabeled molecules. We have uploaded a model checkpoint pre-trained on the ChEMBL dataset at ckpt/chembl-pretrained/model.ckpt. If you wish to train your own language model, please follow the steps below:

  1. Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt
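
Each line of the resulting vocab.txt holds a motif SMILES followed by its atom-mapped attachment form, for example (illustrative entries only):

CC C[CH3:1]
C1=CC=CC=C1 C1=CC=[CH:1]C=C1
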
  2. Preprocess training data:
python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single
mkdir train_processed
mv tensor* train_processed/
  3. Train the graph generation model:
mkdir ckpt/chembl-pretrained
python train_generator.py --train train_processed/ --vocab data/chembl/vocab.txt --save_dir ckpt/chembl-pretrained
  4. Sample molecules from a model checkpoint:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

Property-guided molecule generation procedure (a.k.a. finetuning)

The following script loads a trained Chemprop model and finetunes a pre-trained molecule language model to generate molecules with specific chemical properties.

mkdir ckpt/finetune
python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune

Here ${ACTIVE_MOLECULES} should be a file containing a list of experimentally verified active molecules (one SMILES per line).

${YOUR_PROPERTY_PREDICTOR} should be a directory containing a saved Chemprop model checkpoint.
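
If you do not yet have a property predictor, one can be trained with the Chemprop command line interface; a sketch, assuming Chemprop >= 1.2 and a CSV file with a smiles column and a binary activity column (the file name is a placeholder):

chemprop_train --data_path my_activity_data.csv --dataset_type classification --save_dir chemprop_ckpt

The resulting chemprop_ckpt directory can then be passed as ${YOUR_PROPERTY_PREDICTOR}.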

--max_similarity 0.5 means any novel molecule should have nearest-neighbor similarity lower than 0.5 to the known active molecules in the ${ACTIVE_MOLECULES} file.
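
For intuition, nearest-neighbor similarity against the actives can be computed with RDKit Morgan fingerprints roughly as follows (a sketch; the fingerprint radius and bit size are assumptions, not necessarily what finetune_generator.py uses internally):

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def nearest_neighbor_similarity(candidate_smiles, active_smiles):
    # Tanimoto similarity on 2048-bit Morgan fingerprints of radius 2 (assumed settings)
    cand = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(candidate_smiles), 2, nBits=2048)
    actives = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
               for s in active_smiles]
    return max(DataStructs.TanimotoSimilarity(cand, fp) for fp in actives)

# a generated molecule passes the similarity filter when
# min_similarity < nearest_neighbor_similarity(...) < max_similarity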

--nsample 10000 means to sample 10000 molecules in each epoch.

--threshold 0.5 is the activity threshold. A molecule is considered active if its predicted Chemprop score is greater than 0.5.

In each epoch, generated active molecules are saved in ckpt/finetune/good_molecules.${epoch}. All the novel active molecules are saved in ckpt/finetune/new_molecules.${epoch}.

Molecule translation training procedure

Molecule translation is often useful for lead optimization (i.e., modifying a given molecule to improve its properties).

  1. Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/qed/mols.txt > vocab.txt

Please replace data/qed/mols.txt with your own molecules, and use the resulting vocabulary consistently in the preprocessing and training steps below.

  2. Preprocess training data:
python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16
mkdir train_processed
mv tensor* train_processed/
  3. Train the model:
mkdir ckpt/translation
python train_translator.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir ckpt/translation
  4. Make predictions on your lead compounds (you can use any model checkpoint; here we use model.5 for illustration):
python translate.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model ckpt/translation/model.5 --num_decode 20 > results.csv

Polymer generation

The polymer generation code is in the polymers/ folder. It is similar to train_generator.py, but its substructure vocabulary is tailored to polymers. For generating regular drug-like molecules, we recommend using train_generator.py in the root directory.


hgraph2graph's Issues

Key error during preprocessing

After generating the vocab via the example command given in the generation directory, I ran the preprocess.py example command on the same dataset, and every 50 batches or so a key error occurs. Example:

[screenshot: KeyError for a fragment missing from vocab.txt]

If I manually add this to the vocab.txt file it references, it continues on until the next key error caused by the absence of a different fragment. If I keep manually adding the missing keys, it eventually works after a dozen or so, but this is a fairly annoying process. I use the same --min_frequency as the repository, and noticed that if I reduce it, the vocabulary increases on the order of thousands of fragments; yet the preprocessing step ends up working if I manually add in just a few dozen from the key error messages. Is there something I am doing wrong here?

Question about constrained optimization

Is there a way to use constrained optimization to find novel molecules with a higher desired property?
I read the paper about your previous work (Junction Tree Variational Autoencoder) and wonder if it is feasible to jointly train HierG2G with a property predictor and use gradient ascent to find novel similar molecules.
Thanks!

no gnn_train.py in generation folder

Thank you for your interesting work.

For the generation task, the README says:
mkdir -p ckpt/tmp
python gnn_train.py --train train_processed/ --vocab ../data/polymers/inter_vocab.txt --save_dir ckpt/tmp

but there is no gnn_train.py in the generation folder.
Could you check this? thank you in advance!

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first

Hello,

I am using finetune_generator.py and I am getting the following error. Does anybody know how to solve this?

Thanks,

Loading from checkpoint ckpt/inforna-pretrained/model.ckpt.5000
After pruning 257 -> 257
Epoch 0 training...
0%| | 0/13 [00:25<?, ?it/s]
Traceback (most recent call last):
File "/apps/hgraph2graph/20210428/hgraph2graph/finetune_generator.py", line 152, in
meters = meters + np.array([kl_div, loss.item(), wacc * 100, iacc * 100, tacc * 100, sacc * 100])
File "/apps/hgraph2graph/20210428/lib/python3.7/site-packages/torch/_tensor.py", line 732, in array
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
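
A minimal workaround (an assumption based on the traceback, not an official fix) is to convert any CUDA tensors in that expression to plain Python floats before numpy sees them; torch and numpy are already imported in finetune_generator.py:

def to_float(x):
    # kl_div (and possibly the accuracy terms) can be CUDA tensors
    return x.item() if torch.is_tensor(x) else x

meters = meters + np.array([to_float(v) for v in
    [kl_div, loss.item(), wacc * 100, iacc * 100, tacc * 100, sacc * 100]])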

Getting error while running pip install .

Hi,

I installed all the dependencies via conda in an environment called jtvae. Then, after cloning the repository, I ran:

(jtvae) hgraph2graph$ pip install .

But I got the following error:

Processing /home/homedir/projects/repos/hgraph2graph
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      error: Multiple top-level packages discovered in a flat-layout: ['data', 'ckpt', 'props', 'hgraph', 'polymers', 'val_scripts'].
      
      To avoid accidental inclusion of unwanted files or directories,
      setuptools will not proceed with this build.
      
      If you are trying to create a single distribution with multiple packages
      on purpose, you should not rely on automatic discovery.
      Instead, consider the following options:
      
      1. set up custom discovery (`find` directive with `include` or `exclude`)
      2. use a `src-layout`
      3. explicitly set `py_modules` or `packages` with a list of names
      
      To find more information, look for "package discovery" on setuptools docs.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(jtvae) hgraph2graph$

Any idea what might be going on? I am running Ubuntu 22.04.
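
A workaround reported further down in this issue list is to declare the package explicitly in setup.py so setuptools skips flat-layout auto-discovery. A minimal sketch (the version string is an assumption):

from setuptools import setup

setup(
    name="hgraph2graph",
    version="0.0.0",      # assumption: any version string works here
    packages=["hgraph"],  # explicit package list disables auto-discovery
)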

How to convert the ChEMBL checkpoint ckpt file to a binary format

Hi everyone,

The pretrained ChEMBL model that has been provided is really helpful. But to generate new lead compounds as in step 4 of the molecule translation procedure, the translate.py script requires a binary model file, while the pretrained ChEMBL checkpoint is provided in ckpt format.
Step 4 of the molecule generation procedure suggests sampling molecules from the model checkpoint, which does take a ckpt file, but this does not generate new compounds the way translate.py does.
How do I generate new compounds using the ChEMBL model.ckpt file?
Help would be really appreciated.

Thank you

Getting error during vocab generation for graph translation.

First of all, thank you for your wonderful research on molecule generation.
I wanted to build a vocab for my dataset (from the L1000 dataset).
Before vocab generation, I converted my SMILES to canonical SMILES.
However, I got this error continuously when I ran get_vocab.py.
I think there were some problems regarding tree decomposition.
Is there any solution for this error?
I will await your comment.
Thanks.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "get_vocab.py", line 11, in process
    hmol = MolGraph(s)
  File "/home/sejeong/generative_model/codes/hgraph2graph/hgraph/mol_graph.py", line 22, in __init__
    self.mol_tree = self.tree_decomp()
  File "/home/sejeong/generative_model/codes/hgraph2graph/hgraph/mol_graph.py", line 83, in tree_decomp
    assert n - m <= 1  # must be connected
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_vocab.py", line 30, in <module>
    vocab_list = pool.map(process, batches)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
AssertionError
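
For anyone hitting this AssertionError: a common workaround (a sketch, not an official fix; the loop body is reconstructed from the repository's get_vocab.py and should be checked against your copy) is to skip molecules whose tree decomposition yields a disconnected cluster graph:

def process(data):
    vocab = set()
    for line in data:
        s = line.strip("\r\n ")
        try:
            hmol = MolGraph(s)
        except AssertionError:
            continue  # tree decomposition was disconnected; drop this molecule
        for node, attr in hmol.mol_tree.nodes(data=True):
            smiles = attr['smiles']
            vocab.add(attr['label'])
            for i, s in attr['inter_label']:
                vocab.add((smiles, s))
    return vocab

Note that this silently removes the offending molecules from the vocabulary; they should also be removed from the training file before preprocessing.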

All assm_labels are zero?

Hi Wengong,

In the decode process, it seems all the assm_labels are 0.
Why do we need to train the get_assm_score if "#the label is always the first of assm_cands"?

all_assm_preds.append( (cand_vecs, batch_idx, 0) ) #the label is always the first of assm_cands

Getting error while generating vocabulary

Hello Wengong !

Thanks for the great work !!

I am trying to build the vocabulary using your dataset ../data/polymers/all.txt; however, I am getting this error and I cannot figure it out. In the end I added a try/except there, but there are lots of these errors across the whole run. I would appreciate it if you could assist me.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "[...]\hgraph2graph-master\hgraph2graph-master\generation\get_vocab.py", line 12, in process
    hmol = MolGraph(s)
  File "[...]\hgraph2graph-master\hgraph2graph-master\generation\poly_hgraph\mol_graph.py", line 29, in __init__
    self.clusters, self.atom_cls = self.pool_clusters()
  File "[...]\hgraph2graph-master\hgraph2graph-master\generation\poly_hgraph\mol_graph.py", line 87, in pool_clusters
    if fsmiles not in MolGraph.FRAGMENTS: continue
TypeError: argument of type 'NoneType' is not iterable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]/hgraph2graph-master/generation/get_vocab.py", line 62, in <module>
    vocab_list = pool.map(process, batches) # getting error here TypeError: argument of type 'NoneType' is not iterable
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
TypeError: argument of type 'NoneType' is not iterable

Some motifs in generated vocabulary are not parseable for rdkit

I was trying to build our customized language models. I found that the motif "C1=CC=CCNCCcc[cH:1]CC=CCCCC=CCCC=CCCCCC=C1" generated by get_vocab.py is not parseable by RDKit.

So when I ran preprocess.py, it reported an error at hgraph2graph/hgraph/vocab.py line 65, in count_inters:
inters = [a for a in mol.GetAtoms() if a.GetAtomMapNum() > 0]
AttributeError: 'NoneType' object has no attribute 'GetAtoms'

This is because within vocab.py::count_inters, the code tries to convert the SMILES to a mol:
line 64: mol = Chem.MolFromSmiles(s)

I would appreciate it if someone could provide a solution.
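
A defensive workaround (a sketch, assuming it is acceptable to treat unparseable motifs as having a single attachment point; the final return line is reconstructed from the repository and should be verified against your copy) is to guard count_inters against a failed parse:

from rdkit import Chem

def count_inters(s):
    mol = Chem.MolFromSmiles(s)
    if mol is None:
        return 1  # assumption: fall back to one attachment for unparseable motifs
    inters = [a for a in mol.GetAtoms() if a.GetAtomMapNum() > 0]
    return max(1, len(inters))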

chemprop_model details

python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune

The details about "chemprop_model" are not given properly. Is it related to the chemprop Python package? The chemprop package can train a model for property prediction; however, I could not find a way to save a model that I could use here.

Thanks
Prosun

Regarding structural motifs (bee hives)

Hi,

I have a question for the tree decomposition function below, especially related to structural motifs (bee hives).

def tree_decomp(self):
    clusters = self.clusters
    graph = nx.empty_graph( len(clusters) )
    for atom, nei_cls in enumerate(self.atom_cls):
        if len(nei_cls) <= 1: continue
        bonds = [c for c in nei_cls if len(clusters[c]) == 2]
        rings = [c for c in nei_cls if len(clusters[c]) > 4] #need to change to 2
        if len(nei_cls) > 2 and len(bonds) >= 2:
            clusters.append([atom])
            c2 = len(clusters) - 1
            graph.add_node(c2)
            for c1 in nei_cls:
                graph.add_edge(c1, c2, weight = 100)
        elif len(rings) > 2: #Bee Hives, len(nei_cls) > 2
            clusters.append([atom]) #temporary value, need to change
            c2 = len(clusters) - 1
            graph.add_node(c2)
            for c1 in nei_cls:
                graph.add_edge(c1, c2, weight = 100)
        else:
            for i,c1 in enumerate(nei_cls):
                for c2 in nei_cls[i + 1:]:
                    inter = set(clusters[c1]) & set(clusters[c2])
                    graph.add_edge(c1, c2, weight = len(inter))
    n, m = len(graph.nodes), len(graph.edges)
    assert n - m <= 1 #must be connected
    return graph if n - m == 1 else nx.maximum_spanning_tree(graph)

It seems structural motifs are not extracted by the code above. For example, converting a polymer SMILES using mol_graph

echo "Cc1cc2c(cc1C)c1cc(-c3cc4c5nn(C)nc5c5cc(-c6cc7c8cc(C)c(C)cc8c8ccsc8c7s6)sc5c4s3)sc1c1sccc21" | python hgraph/mol_graph.py

returns only bonds and single rings, not bee hives.

[(0, 1), (6, 7), (10, 11), (16, 17), (22, 23), (28, 29), (30, 31), (1, 6, 5, 4, 3, 2), (9, 8, 46, 45, 10), (12, 11, 44, 43, 13), (15, 14, 19, 18, 16), (21, 20, 42, 41, 22), (24, 23, 40, 39, 25), (27, 28, 30, 32, 33, 26), (35, 34, 38, 37, 36), (48, 47, 51, 50, 49), (3, 51, 47, 46, 8, 4), (13, 43, 42, 20, 19, 14), (25, 39, 38, 34, 33, 26)]
{0: ('CC', 'C[CH3:1]'), 1: ('CC', 'C[CH3:1]'), 2: ('CC', 'C[CH3:1]'), 3: ('CN', 'C[NH2:1]'), 4: ('CC', 'C[CH3:1]'), 5: ('CC', 'C[CH3:1]'), 6: ('CC', 'C[CH3:1]'), 7: ('C1=CC=CC=C1', 'C1=CC=[CH:1]C=C1'), 8: ('C1=CSC=C1', 'C1=C[CH:1]=[CH:1]S1'), 9: ('C1=CSC=C1', 'C1=CS[CH:1]=C1'), 10: ('C1=N[NH]N=C1', 'N1=[CH:1][CH:1]=N[NH]1'), 11: ('C1=CSC=C1', 'C1=C[CH:1]=[CH:1]S1'), 12: ('C1=CSCC1', 'C1=[CH:1]SCC1'), 13: ('C1=CCCC=C1', 'C1=C[CH2:1][CH2:1]C=C1'), 14: ('C1=CSCC1', 'C1=C[CH2:1][CH2:1]S1'), 15: ('C1=CSC=C1', 'C1=C[CH:1]=[CH:1]S1'), 16: ('C1=CC=CC=C1', 'C1=C[CH:1]=[CH:1]C=C1'), 17: ('C1=CCCC=C1', 'C1=C[CH:1]=[CH:1]CC1'), 18: ('C1=CC=CC=C1', 'C1=CC=[CH:1][CH:1]=C1')}

Any suggestions?

Generation example not working

I downloaded the package and, from the generation folder, ran the suggested process:
python get_vocab.py --min_frequency 100 --ncpu 8 < ../data/polymers/all.txt > ../data/polymers/vocab.txt
python preprocess.py --train ../data/polymers/train.txt --vocab ../data/polymers/vocab.txt --ncpu 8

I get the following error:
"""
Traceback (most recent call last):
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "preprocess.py", line 19, in tensorize
x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/mol_graph.py", line 168, in tensorize
tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/mol_graph.py", line 209, in tensorize_graph
fnode[v] = vocab[attr]
File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/vocab.py", line 43, in getitem
return self.hmap[x[0]], self.vmap[x]
KeyError: ('C1=CSC=N1', 'N1=[CH:2]S[CH:2]=[CH:1]1')
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "preprocess.py", line 49, in
all_data = pool.map(func, batches)
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
KeyError: ('C1=CSC=N1', 'N1=[CH:2]S[CH:2]=[CH:1]1')

AssertionError

Hi gurus,
Please, I need your help. I am trying to run get_vocab.py on my small dataset of around 100 molecules, but I keep getting the error shown below. Is there a way to work around this? The error points to mol_graph.py line 82:
"assert n - m <= 1 #must be connected"
[screenshot: AssertionError traceback from get_vocab.py]

IndexError: tuple index out of range

Hello,

I am following the example for "Molecule generation pretraining procedure". The first step, "python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt", finishes with no error, but I am getting "IndexError: tuple index out of range" for the second step:
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/all.txt --ncpu 16 --mode single

Can you please let me know what could be the problem?

Best,
Amir

README error

After you generate the vocabulary in the first step of the README,

python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt 

the next line should be:

python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single

Otherwise, you get the following error:

IndexError: tuple index out of range

Vocabulary

Hi,
Thank you so much for sharing the code!
I want to use it and am wondering whether it is fair to generate the vocabulary based on all the data. In NLP, for example, it is standard to exclude the test data. (I guess you have a reason; I am just curious.) Thank you already!

RuntimeError in in the polymers/ folder

Hi Wengong,

Thank you for sharing this wonderful work.

I am trying to use the code in the polymers folder to generate molecules based on ZINC250K data. I got the data from the GitHub repo of your junction tree model.
I first used get_vocab.py to build the vocab and preprocess.py to build the training data.

When I run the code, I get the following error:
Namespace(anneal_rate=0.9, atom_vocab=<poly_hgraph.vocab.Vocab object at 0x2b5a6fb51e10>, batch_size=20, beta=0.3, clip_norm=20.0, depthG=20, depthT=20, diterG=5, diterT=1, dropout=0.0, embed_size=250, epoch=20, hidden_size=250, latent_size=24, load_epoch=-1, lr=0.001, print_iter=50, rnn_type='LSTM', save_dir='models/', save_iter=-1, train='train_processed/', vocab='zinc_vocab.txt')
/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Model #Params: 5742K
Traceback (most recent call last):
File "vae_train.py", line 80, in <module>
loss, kl_div, wacc, iacc, tacc, sacc = model(*batch, beta=beta)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs/scratchfs1/zhg19014/motifgeneration/hgraph2graph/polymers/poly_hgraph/hgnn.py", line 76, in forward
loss, wacc, iacc, tacc, sacc = self.decoder((root_vecs, tree_vecs, graph_vecs), graphs, tensors, orders)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs/scratchfs1/zhg19014/motifgeneration/hgraph2graph/polymers/poly_hgraph/decoder.py", line 254, in forward
topo_scores = self.get_topo_score(src_tree_vecs, batch_idx, topo_vecs)
File "/gpfs/scratchfs1/zhg19014/motifgeneration/hgraph2graph/polymers/poly_hgraph/decoder.py", line 137, in get_topo_score
return self.topoNN( torch.cat([topo_vecs, topo_cxt], dim=-1) ).squeeze(-1)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
return F.linear(input, self.weight, self.bias)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/functional.py", line 1406, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [872 x 500], m2: [274 x 250] at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/generic/THCTensorMathBlas.cu:268

Thanks.

Vocab size of data/chembl/vocab.txt is 5623, but get_vocab.py produces a vocab.txt with 5625 entries

When following the instructions in the README.md, neither of the commands shown seems to work out of the box.
So far I have added py_modules=['hgraph'] in setup.py and added ",clearAromaticFlags=True" in the chemutils.py file.

Sampling from the checkpoint does not work:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

So I tried to reproduce the vocab with:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt
It works, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences, not just two.

Do you have any way to sample from the checkpoint without issues?
Also, why am I getting a different vocab from the same data/chembl/all.txt file? Is there some random operation? I left all random seeds as they are in the scripts.

About the novelty of the model

Could you please provide the novelty (the fraction of unique valid generated molecules not present in the training set) of your model? This important evaluation metric seems to be missing from the paper.

Unrecognized arguments: --nsamples 1000

Hi

After running the example:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsamples 1000

I get an error when passing the argument --nsamples:

usage: generate.py [-h] --vocab VOCAB [--atom_vocab ATOM_VOCAB] --model MODEL
[--seed SEED] [--nsample NSAMPLE] [--rnn_type RNN_TYPE]
[--hidden_size HIDDEN_SIZE] [--embed_size EMBED_SIZE]
[--batch_size BATCH_SIZE] [--latent_size LATENT_SIZE]
[--depthT DEPTHT] [--depthG DEPTHG] [--diterT DITERT]
[--diterG DITERG] [--dropout DROPOUT]
generate.py: error: unrecognized arguments: --nsamples 1000

When I remove this argument, everything is OK.

PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")

Hi experts,
Kindly help out with a solution to the error I'm getting. I want to generate vocabs for my transition metal complexes dataset as shown below:

(/mnt/c/Users/User/Desktop/mol-generation/env) aorubuloye@ORUBULOYE-PC:/mnt/c/Users/User/Desktop/mol-generation/hgraph2graph$ python get_vocab.py --ncpu 16 < data/catalystchem/all.txt > vocab_2.txt


Traceback (most recent call last):
File "/mnt/c/Users/User/Desktop/mol-generation/hgraph2graph/get_vocab.py", line 32, in
vocab_list = pool.map(process, batches)
File "/mnt/c/Users/User/Desktop/mol-generation/env/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/mnt/c/Users/User/Desktop/mol-generation/env/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f38017d2d00>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")'


Question about the meaning of atom type classification

Thanks for your active reply!

While following your code and paper (I am a newbie to GNNs), I have a question.

As I understand it, node classification generally means a kind of node clustering.

Yet in your paper there is node type (atom type) classification.

I guess it is predicting the atom type (H, C, etc.), not node clustering.

Am I right? If so, could you let me know which part of your code does the atom type classification?

Thanks in advance.

Chem.Kekulize(mol) error

rdkit version: 2021.03.3
(didn't check in other versions)

Doing
python get_vocab.py --ncpu 16 < aromatic.txt > vocab.txt with aromatic SMILES,
an error occurs because of the Chem.Kekulize function.

In chemutils.py,

Before:

def get_mol(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None: Chem.Kekulize(mol)
    return mol

After: adding clearAromaticFlags=True

def get_mol(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None: Chem.Kekulize(mol, clearAromaticFlags=True)
    return mol

Solved.

Error raise when run preprocess.py in generation folder

Hi,
I ran preprocess.py in the generation folder and an error was raised:

python preprocess.py --train ../data/polymers/train.txt --vocab ../data/polymers/inter_vocab.txt --ncpu 8 

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "preprocess.py", line 19, in tensorize
    x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
  File "/data/was/capsulegraphvae/hg2g/generation/poly_hgraph/mol_graph.py", line 169, in tensorize
    tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
  File "/data/was/capsulegraphvae/hg2g/generation/poly_hgraph/mol_graph.py", line 210, in tensorize_graph
    fnode[v] = vocab[attr]
  File "/data/was/capsulegraphvae/hg2g/generation/poly_hgraph/vocab.py", line 43, in __getitem__
    return self.hmap[x[0]], self.vmap[x]
KeyError: 'O=C1NC(=O)C2=C(F)C=C3C(=O)NC(=O)C4=C3C2=C1C=C4F'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 48, in <module>
    all_data = pool.map(func, batches)
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
KeyError: 'O=C1NC(=O)C2=C(F)C=C3C(=O)NC(=O)C4=C3C2=C1C=C4F'

but I cannot find the SMILES "O=C1NC(=O)C2=C(F)C=C3C(=O)NC(=O)C4=C3C2=C1C=C4F" in the text files. Do you know what is happening and how to debug it? Thank you in advance.

Can't kekulize molecule while running get_vocab.py

While running get_vocab.py on an experimental dataset of the first 40 lines in data/chembl/all.txt, I get the following error message:
[screenshot: "Can't kekulize mol" error from get_vocab.py]

I've tried going to the respective line in chemutils.py and checking it out, but everything seems to be fine there. What could be the issue?

size mismatch when load chembl-pretrained ckpt

I cannot load the pretrained ckpt, and setting strict=False does not help.

python generate.py --vocab vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
/root/miniconda3/lib/python3.7/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Traceback (most recent call last):
File "generate.py", line 44, in <module>
model.load_state_dict(torch.load(args.model)[0], strict=False)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for HierVAE:
size mismatch for encoder.E_c.0.weight: copying a param with shape torch.Size([1578, 250]) from checkpoint, the shape in current model is torch.Size([1576, 250]).
size mismatch for encoder.E_i.0.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.hmpn.E_c.0.weight: copying a param with shape torch.Size([1578, 250]) from checkpoint, the shape in current model is torch.Size([1576, 250]).
size mismatch for decoder.hmpn.E_i.0.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.E_assm.0.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.clsNN.3.weight: copying a param with shape torch.Size([1578, 250]) from checkpoint, the shape in current model is torch.Size([1576, 250]).
size mismatch for decoder.clsNN.3.bias: copying a param with shape torch.Size([1578]) from checkpoint, the shape in current model is torch.Size([1576]).
size mismatch for decoder.iclsNN.3.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.iclsNN.3.bias: copying a param with shape torch.Size([5623]) from checkpoint, the shape in current model is torch.Size([5625]).

CUDA memory usage high for large --num_decode

Memory usage is high when generating large numbers of output SMILES from a single input. Perhaps the intent is to only generate a few output SMILES per input? But when generating many SMILES from a single input, the code has significant usability issues, as in the error below (reliance on a memory-bound data structure). In this case the model was trained on ~20K mols -> 100K pairs, and in theory it should have enough diversity to generate a large number of changes to the input SMILES:

python decode.py --test single_smiles.smi --vocab training.vocab --model ./models/model.10 --num_decode 10000 --batch_size 1

Traceback (most recent call last):
File "../hgraph2graph/decode.py", line 69, in
new_mols = model.translate(batch[1], args.num_decode, args.enum_root, args.greedy)
File "/hpc/scratch/nvme1/HeirVAE/hgraph2graph/hgraph/hgnn.py", line 96, in translate
return self.decoder.decode( (root_vecs, z_tree_vecs, z_graph_vecs), greedy=greedy)
File "/hpc/scratch/nvme1/HeirVAE/hgraph2graph/hgraph/decoder.py", line 322, in decode
hinter = HTuple( mess = self.rnn_cell.get_init_state(tree_tensors[1]) )
File "/hpc/scratch/nvme1/HeirVAE/hgraph2graph/hgraph/rnn.py", line 76, in get_init_state
c = torch.zeros(len(fmess), self.hidden_size, device=fmess.device)
RuntimeError: CUDA out of memory. Tried to allocate 2.01 GiB (GPU 0; 10.92 GiB total capacity; 8.94 GiB already allocated; 1.29 GiB free; 9.21 GiB reserved in total by PyTorch)
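
A pragmatic workaround (a sketch; the chunk size is an assumption to be tuned to your GPU) is to request the decodes in smaller chunks around the call shown in the traceback, since the whole num_decode batch is otherwise materialized on the GPU at once:

# sketch around decode.py line 69; assumes the surrounding variables of decode.py
chunk = 100
new_mols = []
for _ in range(args.num_decode // chunk):
    new_mols.extend(model.translate(batch[1], chunk, args.enum_root, args.greedy))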

If anyone has done fine tuning, please share environment.yaml

Looking at the issues here, I'm adjusting the environment. If you use chemprop 1.2 or 1.3, it will ask you to install descriptastorus. When I install it, I get an import error because scipy.bigrat is missing.

So when I install the latest chemprop, it automatically changes rdkit to a 2023 version. It is difficult to match the versions.
Please share an environment.yaml.

Getting error when run vae_train.py

First of all, thank you for your great research on molecule generation.
I am currently training my ZINC dataset with your vae_train.py (in the generation folder).
When I run the code, I get the error below.
This error occurs only occasionally; I think it depends on the batch.
Is there any solution for this problem?

  warnings.warn(warning.format(ret))
Model #Params: 160850K
[50] Beta: 0.100, KL: 19.11, loss: 57.167, Word: 10.76, 52.60, Topo: 80.77, Assm: 56.73, PNorm: 175.70, GNorm: 18.64
[100] Beta: 0.100, KL: 9.08, loss: 42.075, Word: 14.69, 59.69, Topo: 93.39, Assm: 75.03, PNorm: 236.81, GNorm: 14.60
[150] Beta: 0.100, KL: 9.66, loss: 39.316, Word: 16.71, 62.58, Topo: 96.62, Assm: 77.06, PNorm: 293.82, GNorm: 17.42
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [32,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion is repeated for threads [85,0,0] through [121,0,0]]
Traceback (most recent call last):
  File "vae_train.py", line 81, in <module>
    loss, kl_div, wacc, iacc, tacc, sacc = model(*batch, beta=beta)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/hgnn.py", line 88, in forward
    root_vecs, tree_vecs, _, graph_vecs = self.encoder(tree_tensors, graph_tensors)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/encoder.py", line 130, in forward
    hatom,_ = self.graph_encoder(*tensors)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/encoder.py", line 30, in forward
    h = self.rnn(fmess, bgraph)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/rnn.py", line 105, in forward
    h,c = self.LSTM(fmess, h_nei, c_nei)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/rnn.py", line 92, in LSTM
    c = i * u + (f * c_nei).sum(dim=1)
RuntimeError: CUDA error: device-side assert triggered

Extracting latent vector for a molecule

Hi,

I've successfully trained a generation model on my set of molecules, and I'm able to sample from it with generate.py. However, I was wondering if it's possible to easily extract the latent vectors for a set of input molecules I used as the training set?

I've previously used your JTNN model, and it was easily done with encode_latent function you had in your model class. However, it seems like HierVAE in hgnn.py does not have such a function. Could you point me to any helper functions I would need to invoke to encode a molecule after training and get its latent vector?

Thank you!
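
For readers with the same question, here is a sketch of what an encode_latent analogue might look like, mirroring the first half of HierVAE.forward (the attribute name R_mean is an assumption based on the repository at the time of writing; verify against your copy of hgraph/hgnn.py). tree_tensors and graph_tensors are a tensorized batch built the same way train_generator.py builds its training batches:

import torch

@torch.no_grad()
def encode_latent(model, tree_tensors, graph_tensors):
    # encode as in HierVAE.forward, then take the posterior mean
    # instead of sampling a latent vector from it
    root_vecs, tree_vecs, _, graph_vecs = model.encoder(tree_tensors, graph_tensors)
    return model.R_mean(root_vecs)  # assumption: R_mean is the latent-mean head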

Size of tensors do not match

Hello Wengong !

Thanks for the great work !
I am currently trying to train the VAE on ChemBL, however, after about 300 gradient descent steps, I encounter the following error :

Traceback (most recent call last):
  File "vae_train.py", line 108, in <module>
    loss, kl_div, wacc, iacc, tacc, sacc = model(*batch, beta=beta)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//PycharmProjects/hgraph2graph/generation/poly_hgraph/hgnn.py", line 79, in forward
    loss, wacc, iacc, tacc, sacc = self.decoder((root_vecs, root_vecs, root_vecs), graphs, tensors, orders)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//PycharmProjects/hgraph2graph/generation/poly_hgraph/decoder.py", line 251, in forward
    cand_vecs = self.enum_attach(hgraph, cands, icls, nth_child)
  File "/home//PycharmProjects/hgraph2graph/generation/poly_hgraph/decoder.py", line 293, in enum_attach
    cand_vecs = torch.cat([cand_vecs, icls_vecs, order_vecs], dim=-1)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 146 and 292

Would you have any pointers to a possible source ?
The error is not easy to reproduce as it seems to occur at random stages of the training.

Thanks !

TypeError: forward() missing 2 required positional arguments: 'y_orders' and 'beta'

I am using train_translator.py for lead optimization and I am getting the following error. Does anybody know how I can fix this?

Thanks,

Traceback (most recent call last):
File "/apps/hgraph2graph/20210428/hgraph2graph/train_translator.py", line 86, in
loss, kl_div, wacc, iacc, tacc, sacc = model(*batch)
File "/apps/hgraph2graph/20210428/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'y_orders' and 'beta'

error with the latest version of rdkit (2021.03.1)

Hello,

I just wanted to share that the code does not run properly with the latest version of rdkit (2021.03.1). I tried to run the example from the readme file, python get_vocab.py < data/qed/mols.txt > vocab.txt, and got the following error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "get_vocab.py", line 10, in process
    hmol = MolGraph(s)
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/mol_graph.py", line 22, in __init__
    self.order = self.label_tree()
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/mol_graph.py", line 112, in label_tree
    cmol, inter_label = get_inter_label(mol, cls, inter_atoms)
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/chemutils.py", line 142, in get_inter_label
    new_mol = get_clique_mol(mol, atoms)
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/chemutils.py", line 111, in get_clique_mol
    smiles = Chem.MolFragmentToSmiles(mol, atoms, kekuleSmiles=True)
rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 7 marked aromatic
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_vocab.py", line 29, in <module>
    vocab_list = pool.map(process, batches)
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 7 marked aromatic

I also faced a similar error when running the preprocessing script.

This Kekulize error has been mentioned in the latest RDKit release update:

https://github.com/rdkit/rdkit/releases

"Setting the kekuleSmiles argument (doKekule in C++) to MolToSmiles will now
cause the molecule to be kekulized before SMILES generation. Note that this
can lead to an exception being thrown. Previously this argument would only
write kekulized SMILES if the molecule had already been kekulized (#2788)"

So far, my workaround has been to downgrade RDKit to an older version. I no longer run into these errors after testing rdkit=2018.09.1.0, although it would be nice to know which version of rdkit was used in the original implementation.
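
For anyone reproducing this, a pinned install along these lines should give a compatible environment (the channel and exact version are assumptions; note the next issue reports that the pretrained ChEMBL checkpoint specifically needs rdkit=2019.03.4):

conda install -c rdkit rdkit=2019.03.4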

Pretrained chembl model requires old rdkit

In order to use the model checkpoint trained on chembl, you need to be on rdkit=2019.03.4, which isn't mentioned in the readme. If you're on a newer version, you'll get a KeyError when the model tries to look up SMILES in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.

MolGraph for cyclic structures

In the previous implementation of mol_graph for JTVAE (https://github.com/wengong-jin/icml18-jtnn/blob/1d298810e193ce2eef1252f653ea2a3794bdf66b/fast_jtnn/chemutils.py), the code included the following block, which merged rings. This step seems to be missing from the new implementation. Could you please clarify whether this is the case?

  #Merge Rings with intersection > 2 atoms
  for i in range(len(cliques)):
      #only focus on rings skip other cliques
      if len(cliques[i]) <= 2: continue
      for atom in cliques[i]:
          for j in nei_list[atom]:
              if i >= j or len(cliques[j]) <= 2: continue
              inter = set(cliques[i]) & set(cliques[j])
              if len(inter) > 2:
                  cliques[i].extend(cliques[j])
                  cliques[i] = list(set(cliques[i]))
                  cliques[j] = []
  cliques = [c for c in cliques if len(c) > 0]
