rxn4chemistry / rxnmapper

RXNMapper: Unsupervised attention-guided atom-mapping. Code complementing our Science Advances publication on "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" (https://advances.sciencemag.org/content/7/15/eabe4166).

Home Page: http://rxnmapper.ai

License: MIT License

Makefile 0.96% Python 99.04%

chemistry reactions rxn atom-mapping smiles transformer

rxnmapper's Introduction

Python wrapper for the IBM RXN for Chemistry API

A python wrapper to access the API of the IBM RXN for Chemistry website.

Install

From PYPI:

pip install rxn4chemistry

Or directly from the repo:

pip install git+https://github.com/rxn4chemistry/rxn4chemistry.git

Usage

By default, the wrapper connects to the https://rxn.res.ibm.com server. This can be overridden by setting an environment variable. To use a different URL, run:

export RXN4CHEMISTRY_BASE_URL="https://some.other.rxn.server"

The base URL can also be set directly when instantiating the RXN4ChemistryWrapper (this overrides the environment variable):

api_key = 'API_KEY'
from rxn4chemistry import RXN4ChemistryWrapper

rxn4chemistry_wrapper = RXN4ChemistryWrapper(api_key=api_key, base_url='https://some.other.rxn.server')
# or set it afterwards
# rxn4chemistry_wrapper = RXN4ChemistryWrapper(api_key=api_key)
# rxn4chemistry_wrapper.set_base_url('https://some.other.rxn.server')
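The precedence described above (explicit argument first, then the environment variable, then the default server) can be sketched as follows. This is only an illustration: `resolve_base_url` is a hypothetical helper and not part of rxn4chemistry, and the library may resolve the URL differently internally.

```python
import os

DEFAULT_BASE_URL = "https://rxn.res.ibm.com"  # default server named above
ENV_VAR = "RXN4CHEMISTRY_BASE_URL"            # environment variable named above


def resolve_base_url(explicit=None, environ=None):
    """Hypothetical helper: explicit argument > environment variable > default."""
    env = os.environ if environ is None else environ
    if explicit:
        return explicit
    return env.get(ENV_VAR, DEFAULT_BASE_URL)


# With an empty environment the default server is returned.
print(resolve_base_url(environ={}))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.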

Create a project

Get your API key from here and build the wrapper:

api_key = 'API_KEY'
from rxn4chemistry import RXN4ChemistryWrapper

rxn4chemistry_wrapper = RXN4ChemistryWrapper(api_key=api_key)
# NOTE: you can create a project or set an existing one using:
# rxn4chemistry_wrapper.set_project('PROJECT_ID')
rxn4chemistry_wrapper.create_project('test_wrapper')
print(rxn4chemistry_wrapper.project_id)

Reaction outcome prediction

Running a reaction outcome prediction is as simple as:

response = rxn4chemistry_wrapper.predict_reaction(
    'BrBr.c1ccc2cc3ccccc3cc2c1'
)
results = rxn4chemistry_wrapper.get_predict_reaction_results(
    response['prediction_id']
)
print(results['response']['payload']['attempts'][0]['smiles'])

Extracting actions from a paragraph describing a recipe

Extract the actions from a recipe:

results = rxn4chemistry_wrapper.paragraph_to_actions(
    'To a stirred solution of '
    '7-(difluoromethylsulfonyl)-4-fluoro-indan-1-one (110 mg, '
    '0.42 mmol) in methanol (4 mL) was added sodium borohydride '
    '(24 mg, 0.62 mmol). The reaction mixture was stirred at '
    'ambient temperature for 1 hour.'
)
print(results['actions'])

Retrosynthesis prediction

Predict a retrosynthetic pathway given a product:

response = rxn4chemistry_wrapper.predict_automatic_retrosynthesis(
    'Brc1c2ccccc2c(Br)c2ccccc12'
)
results = rxn4chemistry_wrapper.get_predict_automatic_retrosynthesis_results(
    response['prediction_id']
)
print(results['status'])
# NOTE: upon 'SUCCESS' you can inspect the predicted retrosynthetic paths.
print(results['retrosynthetic_paths'][0])

See here for a more comprehensive example.
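Because the retrosynthetic paths are only available once the status is 'SUCCESS', it can help to poll for the results instead of reading them once. The sketch below assumes only that the result is a dict with a 'status' key, as in the example above; `wait_for_success` and its `interval`/`timeout` defaults are hypothetical and not part of the wrapper.

```python
import time


def wait_for_success(fetch, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch() until its 'status' is 'SUCCESS' or the timeout elapses.

    fetch is any zero-argument callable returning a dict with a 'status' key.
    """
    waited = 0.0
    while True:
        results = fetch()
        if results.get("status") == "SUCCESS":
            return results
        if waited >= timeout:
            raise TimeoutError(
                f"no SUCCESS after {waited:.0f} s "
                f"(last status: {results.get('status')!r})"
            )
        sleep(interval)
        waited += interval


# Usage with the wrapper from the example above (not executed here):
# results = wait_for_success(
#     lambda: rxn4chemistry_wrapper.get_predict_automatic_retrosynthesis_results(
#         response['prediction_id']
#     )
# )
# print(results['retrosynthetic_paths'][0])
```

The `sleep` parameter exists so the loop can be exercised in tests without real waiting.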

Biocatalysed retrosynthesis prediction

Predict a biocatalysed retrosynthetic pathway given a product by specifying the model trained on biocatalysed reactions:

response = rxn4chemistry_wrapper.predict_automatic_retrosynthesis(
    'OC1C(O)C=C(Br)C=C1', ai_model='enzymatic-2021-04-16'
)
results = rxn4chemistry_wrapper.get_predict_automatic_retrosynthesis_results(
    response['prediction_id']
)
print(results['status'])
# NOTE: upon 'SUCCESS' you can inspect the predicted retrosynthetic paths.
print(results['retrosynthetic_paths'][0])

Prediction of reaction properties (atom-to-atom mapping, reaction yield, ...)

Prediction of atom-to-atom mapping (see paper):

response = rxn4chemistry_wrapper.predict_reaction_properties(
    reactions=[
        "CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F",
        "C1COCCO1.CC(C)(C)OC(=O)CONC(=O)NCc1cccc2ccccc12.Cl>>O=C(O)CONC(=O)NCc1cccc2ccccc12",
        "C=CCN=C=S.CNCc1ccc(C#N)cc1.NNC(=O)c1cn2c(n1)CCCC2>>C=CCN1C(C2=CN3CCCCC3=N2)=NN=C1N(C)CC1=CC=C(C#N)C=C1",
    ],
    ai_model="atom-mapping-2020",
)
for predicted_mapping_dict in response["response"]["payload"]["content"]:
    print(predicted_mapping_dict["value"])

Prediction of reaction yields (see paper):

response = rxn4chemistry_wrapper.predict_reaction_properties(
    reactions=[
        "Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.COc1ccc(OC)c(P([C@]23C[C@H]4C[C@H](C[C@H](C4)C2)C3)[C@]23C[C@H]4C[C@H](C[C@H](C4)C2)C3)c1-c1c(C(C)C)cc(C(C)C)cc1C(C)C.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.Cc1cc(C)on1>>Cc1ccc(Nc2ccccn2)cc1",
        "Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.COc1ccc(OC)c(P([C@]23C[C@H]4C[C@H](C[C@H](C4)C2)C3)[C@]23C[C@H]4C[C@H](C[C@H](C4)C2)C3)c1-c1c(C(C)C)cc(C(C)C)cc1C(C)C.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.COC(=O)c1ccno1>>Cc1ccc(Nc2ccccn2)cc1",
    ],
    ai_model="yield-2020-08-10",
)
for predicted_yield_dict in response["response"]["payload"]["content"]:
    print(predicted_yield_dict["value"])

Create a synthesis and start it on the robot (or simulator)

Create a synthesis from a retrosynthesis sequence:

# Each retrosynthetic path predicted has a unique sequence_id that can
# be used to create a new synthesis
response = rxn4chemistry_wrapper.create_synthesis_from_sequence(
    sequence_id=results['retrosynthetic_paths'][0]['sequenceId']
)
print(response['synthesis_id'])

# get the entire list of actions for the entire synthesis, as well as a tree representation
synthesis_tree, ordered_tree_nodes, ordered_list_of_actions = rxn4chemistry_wrapper.get_synthesis_plan(
    synthesis_id=response['synthesis_id']
)
for action in ordered_list_of_actions:
    print(action)

synthesis_status_result = rxn4chemistry_wrapper.start_synthesis(
    synthesis_id=response['synthesis_id']
)
print(synthesis_status_result['status'])

synthesis_status_result = rxn4chemistry_wrapper.get_synthesis_status(
    synthesis_id=response['synthesis_id']
)
print(synthesis_status_result['status'])

Forward prediction in batch

It is possible to run a batch of forward reaction predictions without linking them to a project:

import time

response = rxn4chemistry_wrapper.predict_reaction_batch(precursors_list=['BrBr.c1ccc2cc3ccccc3cc2c1', 'Cl.c1ccc2cc3ccccc3cc2c1']*5)
# wait for the predictions to complete
time.sleep(2)
print(rxn4chemistry_wrapper.get_predict_reaction_batch_results(response["task_id"]))

NOTE: the results for batch prediction are not stored permanently in our databases and will expire, so we strongly recommend saving them.
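One simple way to keep batch results before they expire is to write them to disk as soon as they are retrieved. The sketch below assumes the returned results are JSON-serializable, which is not confirmed here; `save_results` and its file naming are hypothetical.

```python
import json
import time
from pathlib import Path


def save_results(results, directory=".", prefix="batch_results"):
    """Write results to a timestamped JSON file and return its path."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path


# Usage with the wrapper from the example above (not executed here):
# results = rxn4chemistry_wrapper.get_predict_reaction_batch_results(response["task_id"])
# print(save_results(results))
```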

Prediction of multiple reaction outcomes (in batch)

It is also possible to predict multiple forward reaction prediction outcomes in batch:

import time

response = rxn4chemistry_wrapper.predict_reaction_batch_topn(
    precursors_lists=[
        ["BrBr", "c1ccc2cc3ccccc3cc2c1"],
        ["BrBr", "c1ccc2cc3ccccc3cc2c1CCO"],
    ],
    topn=3,
)
# wait for the predictions to complete
time.sleep(2)
print(rxn4chemistry_wrapper.get_predict_reaction_batch_topn_results(response["task_id"]))

NOTE: the results for batch prediction are not stored permanently in our databases and will expire, so we strongly recommend saving them.

Enable logging

Logging by the library is disabled by default as it may interfere with programmatic uses.

At the very top of the rxn4chemistry_tour.ipynb example notebook there is a line that enables all logging in the notebook:

import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s : %(message)s')

This may also enable logging from other libraries. If you wish to selectively enable the logs from rxn4chemistry, consider something like this:

import logging
logger = logging.getLogger("rxn4chemistry")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(levelname)s : %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

Examples

To learn more see the examples.

Documentation

The documentation is hosted here using GitHub pages.

rxnmapper's Issues

pip error with python 3.11

Hello,

I ran into an error when attempting to install rxnmapper with Python 3.11. Here is how to reproduce it:

conda create -n test python=3.11
pip install rxnmapper

Is there any chance to fix it? Thank you!

Here are the logs: pip-error.log

Could you share the script to train the model?

As title. Thanks!

Can you share the code to reproduce the figure 4(A)?

Related to the previous issue #34.
Can you share the code to reproduce the figure 4(a) of Schwaller et al., Sci. Adv. 2021 ?
As you said, I tried to reproduce the result with the same rdkit version('2020.03.3'), but I couldn't reproduce it.
Or can you tell me what functions in the code you used?

Error while generating Atom-atom mapping

Hello,

I am trying to generate AAM for the reaction https://www.rhea-db.org/rhea/56485.

smiles="O=O.O=O.O=O.CC1(C)CC[C@@]2(CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)CC[C@H](O)C(C)(C)[C@@H]5CC[C@@]34C)[C@@H]2C1)C([O-])=O.Cc1cc2Nc3c([nH]c(=O)[nH]c3=O)N(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2Nc3c([nH]c(=O)[nH]c3=O)N(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2Nc3c([nH]c(=O)[nH]c3=O)N(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C>>CC1(C)CC[C@@]2(CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)CC[C@H](O)[C@](C)([C@@H]5CC[C@@]34C)C([O-])=O)[C@@H]2C1)C([O-])=O.[H+].[H+].[H+].[H+].O.O.O.O.Cc1cc2nc3c(nc(=O)[n-]c3=O)n(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2nc3c(nc(=O)[n-]c3=O)n(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2nc3c(nc(=O)[n-]c3=O)n(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C"
print(len(smiles))
from rxnmapper import RXNMapper

mapper = RXNMapper()
tokenize=mapper.tokenize_for_model(smiles)
print(len(tokenize))
mapped = mapper.get_attention_guided_atom_maps([smiles])
mapped

The tokenizer length comes out to be 504 but I still get the error stating:
"Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors"

Could someone please check why the token index is 513 and not 504?
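A defensive pattern for long reactions is to count tokens before mapping and set aside anything close to the 512-token limit quoted in the warning. The sketch below uses a widely used SMILES tokenization regex; whether it matches rxnmapper's internal tokenizer exactly is an assumption, and it does not explain the 504 versus 513 discrepancy reported above. `count_tokens`, `split_by_length`, and the safety margin are hypothetical.

```python
import re

# Widely used regex for splitting SMILES into tokens. Assumption: it is close
# to, but not necessarily identical with, the tokenizer used by rxnmapper.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

MAX_MODEL_TOKENS = 512  # limit quoted in the warning above
MARGIN = 16             # hypothetical safety margin (special tokens, canonicalization)


def count_tokens(rxn_smiles):
    """Approximate number of SMILES tokens in a reaction string."""
    return len(SMILES_TOKEN_PATTERN.findall(rxn_smiles))


def split_by_length(reactions, limit=MAX_MODEL_TOKENS - MARGIN):
    """Return (short, long): reactions at or below the limit, and those above it."""
    short = [r for r in reactions if count_tokens(r) <= limit]
    long = [r for r in reactions if count_tokens(r) > limit]
    return short, long


print(count_tokens("CCO>>CC=O"))
```

Only the `short` list would then be passed to the mapper, with the `long` list reviewed separately.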

Script to finetune model?

Hi @pschwllr,

Thanks for sharing your model! We found that the trained model's performance on our dataset is not satisfactory, and we would like to fine-tune it on our data. Could you please also share the training scripts?

use get_attention_guided_atom_maps on generic compound

Hello,

I want to run get_attention_guided_atom_maps on generic compounds that contain * (any atom) in the SMILES, but in this case I get this error: ValueError: could not broadcast input array from shape (53,) into shape (52,)

Code to reproduce this error:

from rxnmapper import RXNMapper
rxn_mapper = RXNMapper()
res = rxn_mapper.get_attention_guided_atom_maps(["O[*].CC(=O)SCCNC(=O)CCNC(=O)[C@H](O)C(C)(C)COP(OP([O-])(OC[C@@H]1([C@@H](OP([O-])([O-])=O)[C@@H](O)[C@@H](O1)N2(C3(\\N=C/N=C(C(\\N=C/2)=3)/N))))=O)([O-])=O>>CC(=O)O[*].CC(C)(COP([O-])(=O)OP(OC[C@H]3(O[C@@H](N1(C2(\\N=C/N=C(C(\\N=C/1)=2)/N)))[C@H](O)[C@H](OP([O-])(=O)[O-])3))(=O)[O-])[C@@H](O)C(=O)NCCC(=O)NCCS"])

If I remove the [*] from the reaction SMILES, the function works. How can we run this function with a *?

thanks

rxn module not present in 0.2.0 release -- ModuleNotFoundError

Installing the latest rxnmapper from pypi and running the following in python:

from rxnmapper import RXNMapper

results in a module not found error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/*/projects/rxnmapper/rxnmapper/__init__.py", line 6, in <module>
    from .core import RXNMapper  # noqa
  File "/Users/*/projects/rxnmapper/rxnmapper/core.py", line 11, in <module>
    from rxn.chemutils.reaction_equation import ReactionEquation
ModuleNotFoundError: No module named 'rxn'

False atom mapping result.

Hi,
Considering mapping this reaction:
'C=CC(C=C)C(O)C23CC1CC(CC(C1)C2)C3>>C=CC(C=C)C(Br)C23CC1CC(CC(C1)C2)C3'
It gives:
'OCH:6[C:8]12[CH2:9][CH:10]3[CH2:11]CH:12[CH2:17]2>>[CH2:1]=[CH:2]CH:3CH:6[C:8]12[CH2:9][CH:10]3[CH2:11]CH:12[CH2:17]2'

It seems the Br atom is incorrectly mapped, thank you very much for your help.

Does a low confidence score mean that the obtained mapped_rxn is invalid and cannot be used?

Hi, I got mapped_rxn results with low confidence scores, between 0.2 and 0.8. Does that mean I cannot use the results whose scores are too low?
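The thread does not say which score is acceptable, so one option is to keep every result but separate the low-confidence ones for manual review. The sketch below assumes each result is a dict with 'mapped_rxn' and 'confidence' keys, as used elsewhere on this page; `split_by_confidence` is hypothetical, and the threshold and example data are made up rather than recommended values.

```python
def split_by_confidence(results, threshold):
    """Partition mapper results into (kept, flagged) by confidence."""
    kept = [r for r in results if r["confidence"] >= threshold]
    flagged = [r for r in results if r["confidence"] < threshold]
    return kept, flagged


# Illustrative data only; the reactions and scores are made up.
example = [
    {"mapped_rxn": "[CH3:1][OH:2]>>[CH3:1][OH:2]", "confidence": 0.95},
    {"mapped_rxn": "[CH3:1][Cl:2]>>[CH3:1][Cl:2]", "confidence": 0.30},
]
kept, flagged = split_by_confidence(example, threshold=0.5)
print(len(kept), len(flagged))
```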

Error when installing from PIP

I have followed the installation instructions using PIP with some minimal changes:

# Small PIP downgrade to solve this bug: https://stackoverflow.com/a/26372051
conda create -n rxnmapper python=3.6 pip=20.2.1
conda activate rxnmapper
conda install -c rdkit rdkit=2020.03.3.0
pip install rxnmapper
# Specific version of Torch for my GPU
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

I then ran the sample code provided in the "Basic Usage" section in a file called test.py

from rxnmapper import RXNMapper
rxn_mapper = RXNMapper()
rxns = ['CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F', 'C1COCCO1.CC(C)(C)OC(=O)CONC(=O)NCc1cccc2ccccc12.Cl>>O=C(O)CONC(=O)NCc1cccc2ccccc12']
results = rxn_mapper.get_attention_guided_atom_maps(rxns)

But this yielded the following error:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    rxn_mapper = RXNMapper()
  File "/home/villalbamartin/.conda/envs/rxnmapper/lib/python3.6/site-packages/rxnmapper/core.py", line 65, in __init__
    self.model, self.tokenizer = self._load_model_and_tokenizer()
  File "/home/villalbamartin/.conda/envs/rxnmapper/lib/python3.6/site-packages/rxnmapper/core.py", line 92, in _load_model_and_tokenizer
    vocab_path, max_len=model.config.max_position_embeddings
  File "/home/villalbamartin/.conda/envs/rxnmapper/lib/python3.6/site-packages/rxnmapper/tokenization_smiles.py", line 45, in __init__
    self.max_len_single_sentence = self.max_len - 2
AttributeError: 'SmilesTokenizer' object has no attribute 'max_len'

I believe that this error is due to changes in the transformers library. pip freeze reveals that the installed version is 4.1.1,
but the property "max_len" was only present in Tokenizers until version 3.5.1 - here's a link to the deprecation notice. For backwards compatibility reasons, some traces of it remain in the current transformers library.

The reason why I emphasize "believe" is because the Github install succeeded, even though it freezes transformers to 4.0.0 which should exhibit the same issue but, in my case, doesn't. Perhaps a pre-trained model with the old property is being loaded in one situation and not in the other, but I haven't looked deep enough to be certain.
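When comparing environments like the two described above, it can be useful to print the installed transformers version from inside each one. The sketch below uses only the standard library; `version_tuple` and `installed_version` are hypothetical helpers, and the 4.x boundary simply mirrors the reporter's belief rather than a confirmed fact.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '4.1.1' (or '4.1.1.dev0') into a comparable tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
elif version_tuple(found) >= (4, 0, 0):
    print(f"transformers {found}: one of the 4.x releases discussed above")
else:
    print(f"transformers {found}: a pre-4.x release")
```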

why not use AlbertForMaskedLM to load the model path?

config.json

 "architectures": [
    "AlbertForMaskedLM"
  ],

core.py

MODEL_TYPE_DICT = {
    "bert": BertModel,
    "albert": AlbertModel,
    "roberta": RobertaModel
}
self.model_type=
        model_class = MODEL_TYPE_DICT[self.model_type]
        model = model_class.from_pretrained(
            self.model_path,
            output_attentions=True,
            output_past=False,
            output_hidden_states=False,
        )

I am confused about why AlbertForMaskedLM is not used to load the model path:

model =  AlbertForMaskedLM.from_pretrained(
            self.model_path,
            output_attentions=True,
            output_past=False,
            output_hidden_states=False,
        )

ValueError when processing reaction smiles

Hello,

First I appreciate the development and documentation that has gone in making this tool plus the user-friendly web interface developed to make interpretations easier.

Now to the issue:

When I run a list of smiles for generating an atom mapping using a default instance of Rxnmapper I get the following error:

[xx:xx:xx] Explicit valence for atom # 5 N, 4, is greater than permitted

Traceback (most recent call last):
  File "...rxnmapper/lib/python3.6/site-packages/rxnmapper/attention.py", line 58, in __init__
    ">>"
ValueError: '>>' is not in list

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "----.py", line 30, in <module>
    results = rxn_mapper.get_attention_guided_atom_maps(split_df.rxn.to_list())
  File "--/rxnmapper/lib/python3.6/site-packages/rxnmapper/core.py", line 207, in get_attention_guided_atom_maps
    detailed_output,  # Return attentions when detailed output requested
  File "--/rxnmapper/attention.py", line 64, in __init__
    "rxn smiles is not a complete reaction. Can't find the '>>' to separate the products"
ValueError: rxn smiles is not a complete reaction. Can't find the '>>' to separate the products

I do check that every rxn entry contains a >>, so that doesn't seem to be the problem. What I think might be happening is that an exception related to the valence is halting the code.

Following is a minimal code snippet of the script being run:

rxn_mapper = RXNMapper()

for index, split_df in enumerate(np.array_split(df_rxn, 100)):
    # Split the dataframe for ease of computation                                                                                                                                                                                                                                
    rxn_smiles = []
    atom_maps = []
    confidence = []
    rxn_id = []
    results = rxn_mapper.get_attention_guided_atom_maps(split_df.rxn.to_list())
    rxn_smiles.append(split_df.rxn.to_list())
    atom_maps.append([ entry['mapped_rxn'] for entry in results ])
    confidence.append([ entry['confidence'] for entry in results ])
    rxn_id.append( list(split_df.Reaction_ID) )

    save_dict = {'rxn_smiles':rxn_smiles, 'atom_maps':atom_maps, 'confidence':confidence, 'rxn_id':rxn_id}
    with open('atom_map_trxn_{}.pkl'.format(index), 'wb') as f:
        pickle.dump(save_dict, f)

    print('Done with rxnmapper split:{}'.format(index))
    del split_df, results

Would like to know steps on resolving this. Thank you.
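If a single problematic entry is what halts a whole chunk, one workaround is to retry a failed batch one reaction at a time and record which entries fail. The sketch below takes the mapping function as a parameter; `map_with_fallback` is hypothetical and does not address why the underlying error occurs.

```python
def map_with_fallback(map_batch, reactions):
    """Map a batch; if it fails, retry one reaction at a time.

    map_batch is any callable taking a list of reaction SMILES and returning
    one result per reaction. Returns (results, failures), where results[i] is
    None for reactions that could not be mapped.
    """
    try:
        return list(map_batch(reactions)), []
    except Exception:
        pass  # fall through to the slower per-reaction path

    results, failures = [], []
    for index, rxn in enumerate(reactions):
        try:
            results.append(map_batch([rxn])[0])
        except Exception as error:
            results.append(None)
            failures.append((index, rxn, str(error)))
    return results, failures


# Usage with the script above (not executed here):
# results, failures = map_with_fallback(
#     rxn_mapper.get_attention_guided_atom_maps, split_df.rxn.to_list()
# )
```

The `failures` list gives the index and SMILES of each entry that raised, which makes it easier to inspect the offending reactions directly.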

If the atom mapping is invalid, should we remove the reaction when training a model?

Hello, if the atom mapping added by rxnmapper is invalid or its confidence score is low, should I discard the reaction when training a model?

With RDKit we can validate whether an atom mapping is valid, but I am not sure whether I should filter out every reaction with an invalid atom mapping. Thanks.
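As a lightweight complement to an RDKit-based check, the map numbers themselves can be inspected with a regular expression. The sketch below is not RDKit validation and uses its own assumed definition of "consistent": the string has the form reactants>>products, no map number is repeated on either side, and every product map number also appears among the reactants. `mapping_is_consistent` is hypothetical.

```python
import re
from collections import Counter

# Matches the map number in atoms such as [CH3:1] or [C@@H:12].
MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(smiles_side):
    """Return all atom-map numbers found on one side of a reaction SMILES."""
    return [int(n) for n in MAP_NUMBER.findall(smiles_side)]


def mapping_is_consistent(mapped_rxn):
    """Check map numbers only; this says nothing about chemical correctness."""
    reactants, _, products = mapped_rxn.partition(">>")
    reactant_counts = Counter(map_numbers(reactants))
    product_counts = Counter(map_numbers(products))
    if not product_counts:
        return False  # no mapped product atoms, or no '>>' separator
    no_duplicates = all(c == 1 for c in reactant_counts.values()) and all(
        c == 1 for c in product_counts.values()
    )
    covered = set(product_counts) <= set(reactant_counts)
    return no_duplicates and covered


print(mapping_is_consistent("[CH3:1][OH:2]>>[CH3:1][OH:2]"))
```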

segmentation fault

Would you kindly help us solve the following issue:

We have a large number of reactions (~20000) that we need to map. However, when we run a script that goes through the list of reactions and maps them one by one, it always ends with "#### (some number like 4929) segmentation fault".

We tried to find correlations between the SMIRKS and the error frequency but did not succeed, so we concluded that these errors occur randomly. We also tried mapping the reactions that broke the mapper individually, outside the loop, and they were mapped successfully.

rxn_mapper = RXNMapper()
mapped_reaction = rxn_mapper.get_attention_guided_atom_maps([rea])

PyCharm in debug mode showed the following error message: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV).

MacOS 10.15.7
Python 3.7
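While the cause of the crash is unknown, writing each result to disk as soon as it is produced means a rerun can resume instead of starting over. The sketch below does not fix the segmentation fault; `map_with_checkpoint` and the JSON-lines layout are hypothetical, and it assumes each result is JSON-serializable.

```python
import json
from pathlib import Path


def map_with_checkpoint(map_one, reactions, path):
    """Map reactions one by one, appending each result to a JSON-lines file.

    Reactions already present in the file are skipped, so rerunning after a
    crash resumes where the previous run stopped. map_one is any callable
    taking one reaction SMILES and returning a JSON-serializable result.
    """
    path = Path(path)
    done = set()
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            done = {json.loads(line)["rxn"] for line in handle if line.strip()}

    with path.open("a", encoding="utf-8") as handle:
        for rxn in reactions:
            if rxn in done:
                continue
            record = {"rxn": rxn, "result": map_one(rxn)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # keep the file current if the process dies later
            done.add(rxn)
    return path


# Usage with the snippet above (not executed here):
# map_with_checkpoint(
#     lambda rea: rxn_mapper.get_attention_guided_atom_maps([rea])[0],
#     reactions,
#     "mapped_reactions.jsonl",
# )
```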

Imbalanced Stoichiometry/ Missing Atom Mapping

Hi all, I was hoping to get some help/ information with regards to mapping reactions which are imbalanced. I have a pipeline which utilises rxnmapper for AAM, downstream a template is extracted and then validated. On a private dataset there is an unacceptable amount of failed validations.

For reproducibility in this issue I have checked for the same validation failures on USPTO. For USPTO these failures appear to relate back to: mapping where the reaction is imbalanced / there is implied stoichiometry and atom map numbers are not repeated on the RHS / missing.

Some examples:

example_n,original_uspto,pre_mapping,post_mapping
0,[C:1](#[N:4])[CH:2]=[CH2:3].[NH3:5]>>[C:1]([CH2:2][CH2:3][NH:5][CH2:3][CH2:2][C:1]#[N:4])#[N:4],C=CC#N.N>>N#CCCNCCC#N,[CH2:2]=[CH:7][C:8]#[N:9].[NH3:1]>>[N:1]#[C:2][CH2:3][CH2:4][NH:5][CH2:6][CH2:7][C:8]#[N:9]
1,[CH2:1]([C:4]#[N:5])[CH2:2][OH:3]>O>[C:4]([CH2:1][CH2:2][O:3][CH2:2][CH2:1][C:4]#[N:5])#[N:5],N#CCCO.O>>N#CCCOCCC#N,[N:1]#[C:2][CH2:3][CH2:4][OH:5]>>[N:1]#[C:2][CH2:3][CH2:4][O:5][CH2:6][CH2:7][C:8]#[N:9]
2,[CH2:1]([NH2:4])[CH2:2][NH2:3].[CH:5]([CH:7]=O)=O>>[NH:3]1[CH:2]2[NH:3][CH2:5][CH2:7][NH:4][CH:1]2[NH:4][CH2:1][CH2:2]1,NCCN.O=CC=O>>C1CNC2NCCNC2N1,O=[CH:1][CH:7]=O.[CH2:2]([NH2:3])[CH2:4][NH2:5]>>[CH2:1]1[CH2:2][NH:3][CH:4]2[NH:5][CH2:6][CH2:7][NH:8][CH:9]2[NH:10]1

it appears that there should be 2x "C=CC#N" and maybe that atom maps 2, 7, 8 and 9 should be repeated on the RHS?
2x "N#CCCO" are required, 1-4 " "
2x "NCCN" are required, 2-5 " "

Is it realistic of me to expect that rxnmapper should handle imbalanced reactions in the manner I've described or should these be balanced prior to mapping? Please let me know if any further information or examples are required.

Appreciate your assistance with this.

Other information

This is using rxnmapper 0.2.0, outcome is the same irrespective of canonicalize_rxns=False or True.

Note that SMILES in 'pre_mapping' have undergone a standardisation process and those in 'post_mapping' have had molecules with no atom mapping removed.

Dubious mapping case

Hello,

I am mapping a reaction smiles as follows:

smiles = "[O-][Cl]=O>>[Cl-].O=O"; RXNMapper().get_attention_guided_atom_maps([smiles])

and the mapped smiles are

[O-:1][Cl+:3][O-:2]>>[O:1]=[O:2].[Cl-:3]

why is the =O in the substrate broken in the mapped smiles?

https://www.rhea-db.org/rhea/21406

What is the value of per_gpu_train_batch_size used when training the model?

training_args = TrainingArguments(
    output_dir="./",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_gpu_train_batch_size=8,  
    save_steps=10_000,
    save_total_limit=2,
)

What value of per_gpu_train_batch_size was used when training the model?

AttributeError: 'SmilesTokenizer' object has no attribute 'max_len'

Hi,

I got this problem when trying to run the example

Traceback (most recent call last):
  File "atom_mapping.py", line 2, in <module>
    rxn_mapper = RXNMapper()
  File "C:\Users\tangu\anaconda3\lib\site-packages\rxnmapper\core.py", line 65, in __init__
    self.model, self.tokenizer = self._load_model_and_tokenizer()
  File "C:\Users\tangu\anaconda3\lib\site-packages\rxnmapper\core.py", line 91, in _load_model_and_tokenizer
    tokenizer = SmilesTokenizer(
  File "C:\Users\tangu\anaconda3\lib\site-packages\rxnmapper\tokenization_smiles.py", line 45, in __init__
    self.max_len_single_sentence = self.max_len - 2
AttributeError: 'SmilesTokenizer' object has no attribute 'max_len'

Please help to solve it,

Thank you!

does the vocab.txt come from the training file "uspto_all_reactions_training.txt"?

from rxnmapper.tokenization_smiles import BasicSmilesTokenizer

basic_tok = BasicSmilesTokenizer()

# Collect every token that occurs in the training reactions.
vocabs = set()
with open("uspto_all_reactions_training.txt") as fh:
    for line in fh:
        text = line.strip()
        if text:
            vocabs |= set(basic_tok.tokenize(text))

# Load the vocabulary file for comparison.
with open("vocab.txt") as fh:
    vocabs_v1 = {line.strip() for line in fh}

print(len(vocabs))
print(len(vocabs_v1))
print(len(vocabs_v1 & vocabs))
# print(vocabs_v1 - vocabs)
# print(vocabs - vocabs_v1)

output:

>>> print(len(vocabs))
478
>>> print(len(vocabs_v1))
591
>>> print(len(vocabs_v1&vocabs))
430

Which reactions does vocab.txt come from?

Boolean index did not match

~/opt/anaconda3/envs/REL/lib/python3.7/site-packages/rxnmapper/attention.py in pxr_filt_atoms(self)
171 """PXR only the atoms, no special tokens"""
172 if self._pxr_filt_atoms is None:
--> 173 self._pxr_filt_atoms = self.pxr[[i != -1 for i in self.pnums]][
174 :, [i != -1 for i in self.rnums]
175 ]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 47 but corresponding boolean dimension is 18

I got this error when running the examples.ipynb, do you know what it could be?

Training dataset of rxnmapper?

I wonder whether the released rxnmapper model was trained on the 1k data of USPTO50K described in the paper.
If not, is it possible to provide the model trained on the 1k reactions?
Any help about above questions would be greatly appreciated.

Cannot run on M1 Mac

Hello,

I cannot get RXNMapper to run on my M1 Mac. Attempts to set up the environment according to README.md and run the code in terminal led to the following:

zsh: illegal hardware instruction python

Running the rxnmapper example in Jupyter led to the kernel dying, and Spyder failed to import RXNMapper altogether.

Please advise on the next steps.

UPDATE: seems like there may be an issue with torch as it may not work with arm64 architecture

Can not reproduce the result of Fig 4.

Hi all, I'm hoping to get some help reproducing the result of figure 4(a) in the paper (Extraction of organic chemistry grammar from unsupervised learning of chemical reactions). I want to use rxnmapper in a pipeline, but first I tested its performance by reproducing figure 4(a). The accuracy is lower than I expected, so I have two questions.

1. How to preprocess the USPTO data
First I downloaded the data from https://ibm.box.com/v/RXNMapperData. And I used the ‘test_natcomm.json’ file.
In that file, there are ‘rxn’ and ‘CORRECT MAPPING’.
I used ‘CORRECT MAPPING’ values as ground truths and used ‘rxn’ values as input for rxnmapper model.
And I compared outputs of rxnmapper and ‘CORRECT MAPPING’ values to get accuracy.

import json

with open('./RXNMapperData/Test/test_natcomm.json') as f:
    f_ = json.load(f)

# f_['CORRECT MAPPING']['24']
mapped_rxn = '[CH3:1][CH2:2][n:3]1[c:4](-[c:13]2[cH:14][cH:15][cH:16][c:17]3[cH:18][cH:19][cH:20][cH:21][c:22]23)[n:5][c:6]([F:7])[c:8]1[Si:9]([CH3:10])([CH3:11])[CH3:12]>>[CH3:1][CH2:2][n:3]1[c:4](-[c:13]2[cH:14][cH:15][cH:16][c:17]3[cH:18][cH:19][cH:20][cH:21][c:22]23)[n:5][c:6]([F:7])[cH:8]1.[CH4:10].[CH4:11].[CH4:12].[SiH4:9]'
# f_['rxn']['24']
rxn = 'CCn1c(-c2cccc3ccccc23)nc(F)c1[Si](C)(C)C>>C.C.C.CCn1cc(F)nc1-c1cccc2ccccc12.[Si]'

gt = process_reaction_with_product_maps_atoms(mapped_rxn,True)
result = rxn_mapper.get_attention_guided_atom_maps([rxn])
pred = process_reaction_with_product_maps_atoms(result[0]['mapped_rxn'], True)

if gt == pred :
    accuracy = True
else:
    accuracy = False

In this case accuracy was False, and there were 248 False cases out of 682.
(The process_reaction_with_product_maps_atoms function is imported from smiles_utils.py in the rxnmapper directory.)
The accuracy among the 281 USPTO cases used in figure 4(a) is as follows:
Number of bond changes : accuracy
1: 78%
2: 88%
3: 74%
4: 78%
5: 58%
6: 87%

So, is there a difference between the way I compared the mappings and the way you did, or did you apply further preprocessing?
If so, can you let me know?

2. Definition of accuracy
I'm curious how you compared the accuracy.
If the smiles and atom indices from the process_reaction_with_product_maps_atoms function are the same, I considered them to be identical and defined accuracy as the ratio of identical cases out of the total cases.
Any help would be appreciated.
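The accuracy definition described in this question (fraction of identical cases, reported per number of bond changes) can be written down as a small helper. This is a sketch of the questioner's definition only, not necessarily the evaluation used in the paper; `accuracy_by_group` is hypothetical and the records are made up.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, is_correct) pairs -> {group: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += bool(is_correct)
    return {group: correct[group] / totals[group] for group in sorted(totals)}


# Made-up records: (number of bond changes, prediction matched ground truth)
records = [(1, True), (1, False), (2, True), (2, True)]
print(accuracy_by_group(records))
```

Here each `is_correct` flag would come from the `gt == pred` comparison in the snippet above.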

Atom mapping for some reactions is wrong; how to fix this problem?

for this reaction
OB(C1=CCCCC1)O.ClC2=CC=CC=C2>>C3(C4=CC=CC=C4)=CCCCC3

the output mapping is:
Cl[c:3]1[cH:4][cH:5][cH:6][cH:7][cH:8]1.OB(O)[C:1]1=[CH:2][CH2:9][CH2:10][CH2:11][CH2:12]1>>[CH:1]1=[C:2]([c:3]2[cH:4][cH:5][cH:6][cH:7][cH:8]2)[CH2:9][CH2:10][CH2:11][CH2:12]1

Request: All source code to reproduce will be contained in this repository?

Greetings,

Some of the reactions cannot be matched correctly, and they need some special modifications (I have tried on Indigo and it did not work either). In order to understand how the process works, may I ask whether all codes are published and stored in this repository so that I can request to look at them in detail, if possible?

If so, may I also ask whether they would be in the "rxnmapper" folder only and neither in "docs" nor "docs_source" by chance?

Thank you,
Tommy
