License: MIT License


megan's Introduction

Molecule Edit Graph Attention Network: Modeling Chemical Reactions as Sequences of Graph Edits

Code for "Molecule Edit Graph Attention Network: Modeling Chemical Reactions as Sequences of Graph Edits" (https://arxiv.org/abs/2006.15426)

The code was run and tested with:

- python 3.6
- pytorch 1.3.1
- tensorflow 2.0
- rdkit 2020.03.2

PyTorch is used for building, training, and evaluating models. CUDA support is recommended.

TensorFlow is used only for visualizing the training process (TensorBoard). CUDA support is not required.

Environment setup

We recommend running MEGAN in an isolated conda environment, which can be created with:

conda env create -f env.yml

If necessary, edit the env.sh file to suit your configuration. Before running any scripts, run:

source env.sh

This activates the conda environment and sets a few environment variables.
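A hypothetical sketch of what such an env.sh typically contains (variable names assumed; check the repo's actual file):

```shell
# Hypothetical env.sh contents -- check the repo's file for the real values.
conda activate megan                      # activate the environment from env.yml
export PYTHONPATH="$(pwd):$PYTHONPATH"    # make the project sources importable
```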

Download training/evaluation data

For USPTO-50k, the data must first be downloaded manually from https://www.dropbox.com/sh/6ideflxcakrak10/AAAESdZq7Y0aNGWQmqCEMlcza/typed_schneider50k and unpacked into the data/uspto_50k folder (thanks to the authors of https://github.com/Hanjun-Dai/GLN for providing the data).

The following scripts download the datasets and generate train/val/test split:

python bin/acquire.py uspto_50k  # assumes that raw data is in data/uspto_50k
python bin/acquire.py uspto_mit
python bin/acquire.py uspto_full

Preprocessing training data

The following scripts build the graph representations of the data needed to train MEGAN:

python bin/featurize.py uspto_50k megan_16_bfs_randat
python bin/featurize.py uspto_mit megan_for_8_dfs_cano
python bin/featurize.py uspto_full megan_32_bfs_randat

Datasets and featurizers are defined in src/config.py.

By default, featurization is multithreaded, with the number of jobs equal to the number of CPUs. This can be changed with:

N_JOBS=N python bin/featurize.py uspto_full megan_32_bfs_randat

where N is an integer >= 1.
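The N_JOBS pattern can be sketched as below: read the worker count from the environment, defaulting to all CPUs. This is illustrative only; the repo's featurizer may differ, e.g. it may use threads rather than processes.

```python
import multiprocessing as mp
import os

def n_jobs_from_env(env=os.environ):
    """Worker count from the N_JOBS env var, defaulting to all CPUs."""
    return max(1, int(env.get("N_JOBS", mp.cpu_count())))

def featurize_all(items, featurize_fn, env=os.environ):
    """Map featurize_fn over items with a pool sized by N_JOBS (sketch)."""
    with mp.Pool(n_jobs_from_env(env)) as pool:
        return pool.map(featurize_fn, items)
```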

Training

python bin/train.py uspto_50k models/uspto_50k
python bin/train.py uspto_50k_rt models/uspto_50k_rt
python bin/train.py uspto_mit models/uspto_mit_mix
python bin/train.py uspto_mit_sep models/uspto_mit_sep

This trains models with the same configurations as described in the paper.

We use gin-config (https://github.com/google/gin-config) to manage training hyperparameters. Gin configuration files live in configs. Configuration values can also be passed as script parameters, for example:

python bin/train.py uspto_50k models/uspto_50k --learning_rate 0.5 --n_encoder_conv 8

Training takes from about 10 hours for USPTO-50k to about 60 hours for USPTO-FULL on a single NVIDIA GeForce GTX 1070 GPU.

Evaluation

python bin/eval.py models/uspto_50k --beam-size 50 --show-every 100
python bin/eval.py models/uspto_50k_rt --beam-size 50 --show-every 100
python bin/eval.py models/uspto_mit_mix --beam-size 10 --show-every 1000
python bin/eval.py models/uspto_mit_sep --beam-size 10 --show-every 1000
python bin/eval.py models/uspto_full --beam-size 50 --show-every 1000

The evaluation script uses argh, so underscores in parameter names are replaced with hyphens on the command line. Evaluation can take a long time, especially for large beam sizes (up to a couple of hours for USPTO-FULL with beam size 50).
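argh builds its CLI on top of argparse, which is where the underscore-to-hyphen convention comes from. The same behavior can be shown with plain argparse, using flags that mirror eval.py's (defaults here are hypothetical):

```python
import argparse

# argparse maps --beam-size to args.beam_size automatically.
parser = argparse.ArgumentParser()
parser.add_argument("--beam-size", type=int, default=10)    # Python name: args.beam_size
parser.add_argument("--show-every", type=int, default=100)  # Python name: args.show_every
args = parser.parse_args(["--beam-size", "50"])
print(args.beam_size, args.show_every)  # -> 50 100
```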

Evaluation produces two files: eval_*.txt contains the calculated top-k accuracy values, and pred_*.txt contains the predicted SMILES and actions.

Packed data and models

We provide packed pre-processed data, as well as weights of the model trained on USPTO-50k in two variants (reaction type unknown/reaction type given), as a GitHub Release (v1.1) in this repo. To use the data and pretrained models, unpack the "megan_data.zip" archive in the root directory of the project.

megan's People

Contributors: kudkudak, mikolajsacha


megan's Issues

Nice work - are you aware of the USPTO data leak? (+ training issue on HPC)

Hi authors, this is very nice work. I enjoyed reading the paper, as it was well explained. In particular, your top-20 and top-50 results are very high, beating works like GLN, which is impressive. Also, thanks for being one of the few repos that provide proper documentation and an env.yml for easy install.

However (and as much as I hate to break it to you), I'm not sure if you are aware of the USPTO data leak. Please refer to https://github.com/uta-smile/RetroXpert, a work concurrent with yours, for details on this issue.

In short, the first mapped atom (id: 1) in the product SMILES (and reactant SMILES) is usually the reaction centre. So if you directly use the original atom mapping in the USPTO dataset to set the order of generation/edits for molecular graphs, there is a subtle but important data leak that makes it easier for the model to generate the correct reactants: the model implicitly learns that the first atom in most products (and consequently reactants) is the reaction centre and thus needs some modification, rather than having to learn which atom really is the reaction centre. The authors of RetroXpert have been alerted to this leak and implemented a corrected canonicalization pipeline to ensure that the atom-map ordering is truly canonical.
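The corrected pipeline boils down to making atom order independent of the atom maps. A minimal sketch of stripping atom-map numbers from a SMILES string (a regex shortcut for illustration only; a real pipeline would re-canonicalize with RDKit afterwards):

```python
import re

def strip_atom_maps(smiles):
    """Remove :N atom-map suffixes inside bracket atoms, e.g. [CH3:1] -> [CH3]."""
    return re.sub(r":\d+(?=\])", "", smiles)

print(strip_atom_maps("[CH3:1][CH2:2][OH:3]"))  # -> [CH3][CH2][OH]
```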

I have re-trained RetroXpert on that truly canonical USPTO-50k, and its top-1 accuracy is significantly lower at ~45% (as are the remaining top-k's). This means their work is no longer SOTA.

I took a brief look at MEGAN's data scripts but didn't see any canonicalization step. Have you already corrected for this issue, or have you re-run/are you re-running the experiments? Thanks.

Segmentation fault (core dumped)

I tried to reproduce the results on USPTO-50k, but I ran into a segmentation fault like this:

python bin/eval.py models/uspto_50k --beam-size 50 --show-every 100
2021-05-28 16:54:43,187 - src - INFO - Setting random seed to 132435
2021-05-28 16:54:45,910 - __main__ - INFO - Creating model...
2021-05-28 16:54:45,910 - __main__ - INFO - Loading data...
/home/shuanchen/anaconda3/envs/megan/lib/python3.6/site-packages/numpy/core/fromnumeric.py:61: FutureWarning: Series.nonzero() is deprecated and will be removed in a future version.Use Series.to_numpy().nonzero() instead
  return bound(*args, **kwds)
2021-05-28 16:54:47,224 - __main__ - INFO - Evaluating on 5030 samples from test
models/uspto_50k beam search on test:   0%|          | 0/5030 [00:00<?, ?it/s]Segmentation fault (core dumped)

How can I solve this problem?
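A generic debugging sketch for crashes like the one above (not a repo-specific fix): a segfault at 0% of beam search usually originates in a native extension (e.g. an rdkit/torch version mismatch), and Python's faulthandler prints a Python traceback on SIGSEGV, narrowing down which call crashed.

```python
import faulthandler

# Print a Python traceback if the process receives SIGSEGV.
faulthandler.enable()
print(faulthandler.is_enabled())  # -> True

# Equivalent without editing any code:
#   python -X faulthandler bin/eval.py models/uspto_50k --beam-size 50 --show-every 100
```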

Lower results for forward-synthesis on USPTO-MIT

I have been reproducing the results following the instructions in the readme file; however, I am unable to reach the top-k accuracy results reported in the paper. Please note that I am training on USPTO-MIT (mixed version). Results are shown below:

[screenshot of top-k results]

I did not set any external training parameters; I just followed the steps in the repository. Should I pass additional parameters to match your implementation? Any comment would be appreciated.

License

Thank you for your effort and for sharing the code. Would you add a license to the repo? I'd like to use the code but can't since it has no license.

Question about the evaluation metrics

Hi, your work is impressive. I have a little question about your evaluation in https://github.com/molecule-one/megan/blob/master/bin/eval.py.

Based on my understanding, normally, for each product, a model predicts m reactants. Then the top-k is calculated as the fraction of samples whose top-k predicted reactants contain at least one ground truth reactant.

In your evaluation, it seems that for each product A you predict several complete reactant sets, denoted A1, A2, A3, ranked by a set score. The top-k is then calculated as the fraction of samples whose top-k predicted sets contain a set that completely matches the ground-truth reactant set. I wonder if this is stricter than the usual top-k evaluation, since you need to predict all the reactants correctly. In other words, is it fair to compare this "top-k" with others' "top-k"?

Please correct me if I got anything wrong. Thanks.
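The two top-k definitions discussed above can be contrasted on toy data (hypothetical canonical SMILES, for illustration only):

```python
def topk_any_reactant(ranked_reactants, truth, k):
    """Loose top-k: at least one of the top-k predicted reactants is a true one."""
    return any(r in truth for r in ranked_reactants[:k])

def topk_exact_set(ranked_sets, truth, k):
    """Strict top-k (full-set match): some top-k predicted set equals the truth."""
    return any(set(s) == set(truth) for s in ranked_sets[:k])

truth = {"CCO", "CC(=O)O"}                   # ground-truth reactant set
ranked_sets = [{"CCO"}, {"CCO", "CC(=O)O"}]  # model's ranked predicted sets

print(topk_exact_set(ranked_sets, truth, 1))  # -> False (top-1 set is incomplete)
print(topk_exact_set(ranked_sets, truth, 2))  # -> True
```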

CUDA out of memory

Training on USPTO-50k works fine, but I get CUDA out of memory on USPTO-FULL:

[screenshot of CUDA out-of-memory error]
