
mol_opt's Introduction

mol_opt: A Benchmark for Practical Molecular Optimization



This repository hosts an open-source benchmark for Practical Molecular Optimization (PMO), built to facilitate transparent and reproducible evaluation of algorithmic advances in molecular optimization. It supports 29 molecular design algorithms (25 in the original paper) on 23 tasks, with a particular focus on sample efficiency (the number of oracle calls). The preprint is available at https://arxiv.org/pdf/2206.12411.pdf.

News

We have released a lightweight version of mol_opt at https://github.com/wenhao-gao/mol-opt. It can be installed via pip and used within three lines of code.

Installation

conda create -n molopt python=3.7
conda activate molopt 
pip install torch 
pip install PyTDC 
pip install PyYAML
conda install -c rdkit rdkit 

We recommend using PyTorch 1.10.2 and PyTDC 0.3.6.

Then activate the conda environment with the following command.

conda activate molopt 
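
As an optional sanity check of the environment, a minimal sketch like the following (using PyTDC's Oracle interface; the aspirin SMILES is only an example) should run without errors:

# verify core dependencies and score one molecule with a TDC oracle
import torch, rdkit, yaml
from tdc import Oracle

print(torch.__version__)                 # expect ~1.10.2
qed = Oracle(name='QED')                 # drug-likeness oracle from PyTDC
print(qed('CC(=O)OC1=CC=CC=C1C(=O)O'))   # QED score of aspirin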

29 Methods

Based on the underlying ML methodology, the methods are categorized as follows:

  • virtual screening
    • screening randomly searches the ZINC database.
    • molpal uses a molecular property predictor to prioritize high-scoring molecules.
  • GA (genetic algorithm)
    • graph_ga is based on the molecular graph.
    • smiles_ga is based on SMILES.
    • selfies_ga is based on SELFIES.
    • stoned is based on SELFIES.
    • synnet is based on synthesis.
  • VAE (variational auto-encoder)
    • smiles_vae is based on SMILES.
    • selfies_vae is based on SELFIES.
    • jt_vae is based on the junction tree (fragments as building blocks).
    • dog_ae is based on synthesis.
  • BO (Bayesian optimization)
    • gpbo
  • RL (reinforcement learning)
    • reinvent
    • reinvent_selfies
    • reinvent_transformer
    • graphinvent
    • moldqn
    • smiles_aug_mem
    • smiles_bar
  • HC (hill climbing)
    • smiles_lstm_hc is SMILES-level HC.
    • smiles_ahc is SMILES-level augmented HC.
    • selfies_lstm_hc is SELFIES-level HC.
    • mimosa is graph-level HC.
    • dog_gen is synthesis-based HC.
  • gradient (gradient ascent)
    • dst is based on the molecular graph.
    • pasithea is based on SELFIES.
  • SBM (score-based modeling)
    • gflownet
    • gflownet_al
    • mars

time is the rough average wall-clock time for a single run in our benchmark and does not include the time for pretraining and data preprocessing. We have already processed the data and pretrained the models; both are available in the repository. assembly indicates the building block each method operates on.

| method | assembly | additional packages | time | requires GPU |
| --- | --- | --- | --- | --- |
| screening | - | - | 2 min | no |
| molpal | - | ray, tensorflow, ConfigArgParse, pytorch-lightning | 1 hour | no |
| graph_ga | fragment | joblib | 3 min | no |
| smiles_ga | SMILES | joblib, nltk | 2 min | no |
| stoned | SELFIES | - | 3 min | no |
| selfies_ga | SELFIES | selfies | 20 min | no |
| graph_mcts | atom | - | 2 min | no |
| smiles_lstm_hc | SMILES | guacamol | 4 min | no |
| smiles_ahc | SMILES | - | 4 min | no |
| selfies_lstm_hc | SELFIES | guacamol, selfies | 4 min | yes |
| smiles_vae | SMILES | botorch | 20 min | yes |
| selfies_vae | SELFIES | botorch, selfies | 20 min | yes |
| jt_vae | fragment | botorch | 20 min | yes |
| gpbo | fragment | botorch, networkx | 15 min | no |
| reinvent | SMILES | pexpect, bokeh | 2 min | yes |
| reinvent_transformer | SMILES | pexpect, bokeh | 2 min | yes |
| reinvent_selfies | SELFIES | selfies, pexpect, bokeh | 3 min | yes |
| smiles_aug_mem | SMILES | reinvent-models==0.0.15rc1 | 2 min | yes |
| smiles_bar | SMILES | reinvent-models==0.0.15rc1 | 2 min | yes |
| moldqn | atom | networks, requests | 60 min | yes |
| mimosa | fragment | - | 10 min | yes |
| mars | fragment | chemprop, networkx, dgl | 20 min | yes |
| dog_gen | synthesis | extra conda env | 120 min | yes |
| dog_ae | synthesis | extra conda env | 50 min | yes |
| synnet | synthesis | dgl, pytorch_lightning, networkx, matplotlib | 2-5 hours | yes |
| pasithea | SELFIES | selfies, matplotlib | 50 min | yes |
| dst | fragment | - | 120 min | no |
| gflownet | fragment | torch_{geometric,sparse,cluster}, pdb | 30 min | yes |
| gflownet_al | fragment | torch_{geometric,sparse,cluster}, pdb | 30 min | yes |

Run with one-line code

There are three types of runs defined in our code base:

  • simple: A single run per oracle for testing purposes; this is the default.
  • production: Multiple independent runs with various random seeds for each oracle.
  • tune: A hyper-parameter tuning over the search space defined in main/MODEL_NAME/hparam_tune.yaml for each oracle.
## specify multiple random seeds 
python run.py MODEL_NAME --seed 0 1 2 
## run 5 runs with different random seeds with specific oracle 
python run.py MODEL_NAME --task production --n_runs 5 --oracles qed 
## run a hyper-parameter tuning starting from smiles in a smi_file, 30 runs in total
python run.py MODEL_NAME --task tune --n_runs 30 --smi_file XX --other_args XX 

Valid values of MODEL_NAME are listed in the table above.
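
For example, following the patterns above (graph_ga and the qed/zaleplon_mpo oracles are just illustrative choices taken from the tables in this README):

## default simple run on the qed oracle
python run.py graph_ga --oracles qed 
## production run with 5 independent seeds on zaleplon_mpo
python run.py graph_ga --task production --n_runs 5 --oracles zaleplon_mpo 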

Multi-Objective Optimization

Multi-objective optimization is implemented in the multiobjective branch. We use "+" to join multiple properties; see the command line below.

python run.py MODEL_NAME --oracles qed+jnk3  

Hyperparameters

We separate hyperparameters into task-level control, defined via argparse, and algorithm-level control, defined in hparam_default.yaml. There is no hard boundary between the two, but we recommend treating all hyperparameters used in the self._optimize function as task-level.

  • running hyperparameters: parser arguments.
  • default model hyperparameters: hparam_default.yaml
  • tuning model hyperparameters: hparam_tune.yaml

For algorithm-level hyperparameters, we adopt a straightforward YAML file format. One should define a default set of hyperparameters in main/MODEL_NAME/hparam_default.yaml:

population_size: 50
offspring_size: 100
mutation_rate: 0.02
patience: 5
max_generations: 1000

And define the search space for hyperparameter tuning in main/MODEL_NAME/hparam_tune.yaml:

name: graph_ga
method: random
metric:
  goal: maximize
  name: avg_top100
parameters:
  population_size:
    values: [20, 40, 50, 60, 80, 100, 150, 200]
  offspring_size:
    values: [50, 100, 200, 300]
  mutation_rate:
    distribution: uniform
    min: 0
    max: 0.1
  patience:
    value: 5
  max_generations:
    value: 1000
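
As a minimal sketch of how these YAML files can be read (illustrative only; it uses the PyYAML package installed above, and the graph_ga path is just an example):

import yaml

# load the default algorithm-level hyperparameters for a model
with open("main/graph_ga/hparam_default.yaml") as f:
    config = yaml.safe_load(f)

print(config["population_size"], config["mutation_rate"])   # e.g. 50 0.02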

Contribute

Our repository is an open-source initiative. To contribute an improved set of parameters or include your model in our benchmark, check our Contribution Guidelines!

mol_opt's People

Contributors

explcre, futianfan, guojeff, morgancthomas, wenhao-gao


mol_opt's Issues

Specific command to reproduce results?

From your README.md file it was not 100% clear exactly what command(s) I should run if I want to get results for a new baseline algorithm and extract the AUC top 10 metric as detailed in Table 2 of your paper. You provide several commands to e.g. change the number of seeds or the starting SMILES file, but it is not clear exactly which setting corresponds to the experiment done in the paper.

Can you provide some clarification on this? My best guess is something like the following script, which (as far as I can tell) will run all 23 oracles from your paper with 5 seeds, 10k max oracle calls, and the same starting SMILES which you used in your paper:

oracle_array=('jnk3' 'gsk3b' 'celecoxib_rediscovery' \
    'troglitazone_rediscovery' \
    'thiothixene_rediscovery' 'albuterol_similarity' 'mestranol_similarity' \
    'isomers_c7h8n2o2' 'isomers_c9h10n2o2pf2cl' 'median1' 'median2' 'osimertinib_mpo' \
    'fexofenadine_mpo' 'ranolazine_mpo' 'perindopril_mpo' 'amlodipine_mpo' \
    'sitagliptin_mpo' 'zaleplon_mpo' 'valsartan_smarts' 'deco_hop' 'scaffold_hop' 'qed' 'drd2')
for orac in "${oracle_array[@]}" ; do
    python run.py MODEL_NAME --task production --oracles "$orac"
done

Afterwards a series of post-processing steps would be needed to extract the results. These are also not super clear to me, and I just made a PR that I think could make this easier (#26). If you are happy to merge #26 then I could write a follow-up PR which parses log file outputs to produce a table like Table 2. I think something like this would lower the barrier towards using the benchmark, and also help ensure that people are using it appropriately 😃

Results Reproduction

Thank you for your work! Every time I try to run the code, the following error appears: 'It appears that you do not have permission to access the requested resource. Please reach out to the project owner to grant you access. If you have the correct permissions, verify that there are no issues with your networking setup.'
Do I need any kind of permission?

Repeat Result

Hi @wenhao-gao ,

I am trying to replicate the result of Figure 3 in the DST paper. I am not sure whether I am understanding this right: is the mol_opt run.py script followed by denovo.py from the dst repo? E.g., 0.5K in 2K = 1.5K + 0.5K?

question about smiles_bar and smiles_aug_mem

hi Jeff,
@GuoJeff

When I read your PR about smiles_bar and smiles_aug_mem, I did not find the statement "import reinvent_model". I am very curious how you import the reinvent_model module. Thanks so much for your time!

new commit

@GuoJeff
Hi Jeff, thanks again for your contribution. I tested your PR again and found that smiles_ahc works well, but there are bugs with smiles_aug_mem and smiles_bar.

The error is:

ModuleNotFoundError: No module named 'reinvent_models'

I assume you wrote a new module called reinvent_models, right? Do you mind sharing it with us?

It would be great if you can take a look again. I will also try to figure it out.

Question about top auc metric

Thank you for sharing your code and your work on benchmarking.

I have a question regarding the top_auc metric, especially this part:

if finish and len(buffer) < max_oracle_calls:
    sum += (max_oracle_calls - len(buffer)) * top_n_now
return sum / max_oracle_calls

If I understand correctly, considering we have a dictionary of SMILES and their corresponding scores, this condition will always evaluate to true when uniqueness is less than 1. This could potentially create issues if the model generates a large number of duplicate molecules.

One edge case would be a model that, whenever it generates a SMILES, necessarily generates two other identical SMILES, so only 1/3 of the 10,000 molecules are unique. Let's assume this model learns, so the score of each new unique molecule increases across oracle calls. In this case I think the current implementation of top_auc introduces a bias. I think it's clearer with graphs:
[two plots comparing the runs: top_auc = 0.5831 vs. top_auc = 0.4995]

The values added to sum consist of [(top_k_scores + prev) / 2] * update_frequency inside the for loop, then (len(scores) - limit_update) * [(top_k_scores + prev) / 2], and finally (len(results) - len(scores)) * [top_k_scores].

In this case, I think it is apparent that removing duplicates creates a bias in the top_auc computation. I am wondering if I am missing something and whether this is expected. Could you enlighten me on this? What is the intuition behind it?

Folders for some baselines seem to be missing

Hi, thanks for this wonderful work that brings together a bunch of optimization methods.
However, when I tried to experiment with the provided baselines here, I found that a few methods were not included in the codebase but referenced in main/run.py, e.g. hero, gegl, boss, chemo and rationale_rl.
Could you take a look at what's happening here?

Discrepancies in Reproducing Results for Hyperparameter Tuning in REINVENT

Subject: Inconsistencies in Replicating Results from Supplemental Information (Figures 14 and 15, Section D.2)

Hello,

We've been working on reproducing the results for hyperparameter tuning in REINVENT, specifically for the zaleplon_mpo and perindopril_mpo oracles as presented in Figures 14 and 15, Section D.2 of the supplemental information. Despite following the installation and execution instructions in the README, our results differ from those published.

Issue Details:

  1. Discrepancy in Mean AUC Top-10 for zaleplon_mpo:
  • Published Result: Table 4 reports a mean AUC Top-10 of 0.358±0.062 across five independent runs.
  • Our Result: We observed a mean AUC Top-10 of 0.503±0.02.
  2. Performance Difference Between Sigma Values:
  • Published Behavior: A significant performance difference is reported between sigma values of 500 and 60 (Figure 14, Section D.2).
  • Our Observation: We found minimal performance difference between these sigma values (mean AUC Top-10 of 0.503 for sigma=500 vs 0.482 for sigma=60) for zaleplon_mpo.
  3. Other Discrepancies: We also noted discrepancies in several mean AUC Top-10 values reported in Table 4.

Seeking Clarification:

We would like to thoroughly analyze the behavior of the hyperparameter sigma and ensure the accuracy of our results. Could you please help us verify that our methodology aligns with your implementation? We want to ensure that there are no overlooked mistakes on our end or potential bugs in the code.

Any insights or suggestions you could provide would be greatly appreciated.

Thank you for your assistance.

Question about AUC

I have a question about the top_auc metric, especially this part:

temp_result = list(sorted(ordered_results, key=lambda kv: kv[1][0], reverse=True))[:top_n]
top_n_now = np.mean([item[1][0] for item in temp_result])
sum += (len(buffer) - called) * (top_n_now + prev) / 2

If len(buffer) < max_oracle_calls and len(buffer) % freq_log != 0, then after the loop for idx in range(freq_log, min(len(buffer), max_oracle_calls), freq_log) finishes, there are still len(buffer) - called data points that have not been accounted for. In this case, the code directly uses the top-n of the entire buffer as top_n_now. I believe the correct calculation would change line 43 to: temp_result = list(sorted(ordered_results, key=lambda kv: kv[1][0], reverse=True))[called:].
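
For reference, the snippets quoted in the two AUC questions above fit together roughly as follows. This is a hedged reconstruction for illustration, not the repository's exact implementation; it assumes buffer maps each SMILES to [score, call_index]:

import numpy as np

def top_k_auc(buffer, top_n=10, freq_log=100, max_oracle_calls=10000, finish=True):
    # order molecules by the oracle call at which they were scored
    ordered_results = list(sorted(buffer.items(), key=lambda kv: kv[1][1]))
    auc_sum, prev, called = 0.0, 0.0, 0
    for idx in range(freq_log, min(len(buffer), max_oracle_calls), freq_log):
        # average score of the current top-n molecules after idx oracle calls
        temp_result = list(sorted(ordered_results[:idx], key=lambda kv: kv[1][0], reverse=True))[:top_n]
        top_n_now = np.mean([item[1][0] for item in temp_result])
        auc_sum += freq_log * (top_n_now + prev) / 2   # trapezoidal segment
        prev = top_n_now
        called = idx
    # remaining molecules that did not fill a full freq_log segment
    temp_result = list(sorted(ordered_results, key=lambda kv: kv[1][0], reverse=True))[:top_n]
    top_n_now = np.mean([item[1][0] for item in temp_result])
    auc_sum += (len(buffer) - called) * (top_n_now + prev) / 2
    # if the run stopped early, extend the final top-n value to the full oracle budget
    if finish and len(buffer) < max_oracle_calls:
        auc_sum += (max_oracle_calls - len(buffer)) * top_n_now
    return auc_sum / max_oracle_calls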
