Implementation of reaction condition prediction with Parrot
- Publication
- Quickly Start From Gitpod
- OS Requirements
- Python Dependencies
- Installation Guide
- Use Parrot
- Reproduce the results
- Cite Us
Xiaorui Wang, Chang-Yu Hsieh*, Xiaodan Yin, Jike Wang, Yuquan Li, Yafeng Deng, Dejun Jiang, Zhenxing Wu, Hongyan Du, Hongming Chen, Yun Li, Huanxiang Liu, Yuwei Wang, Pei Luo, Tingjun Hou*, Xiaojun Yao*. Generic Interpretable Reaction Condition Predictions with Open Reaction Condition Datasets and Unsupervised Learning of Reaction Center. Research 2023;6:Article 0231. DOI:10.34133/research.0231
This repository has been tested on Linux operating systems.
- Python (version >= 3.7)
- PyTorch (version >= 1.10.0)
- RDKit (version >= 2019)
- Transformers (version == 4.18.0)
- Simpletransformers (version == 0.63.6)
Create a virtual environment to run the code of Parrot.
It is recommended to use conda to manage the virtual environment.The installation method for conda can be found here.
Make sure to install pytorch with the cuda version that fits your device.
This process usually takes few munites to complete.
git clone https://github.com/wangxr0526/Parrot.git
cd Parrot
conda env create -f envs.yaml
conda activate parrot_env
pip install gdown wtforms flask flask_bootstrap
You can use Parrot to predict suitable catalysts, solvents and reagents, and temperatures for reactions.
First download the model and datasest files by this command:
python preprocess_script/download_data.py
The links correspond to the paths of the zip files as follows:
https://drive.google.com/uc?id=1aX70qzZrJ9TZ9KpqnvUVR8WBxiTwXOsI ---> dataset/source_dataset/USPTO_condition_final.zip
https://drive.google.com/uc?id=1uEqpkF4tPTlLIPdTyWJdXows7hKQbAAc ---> dataset/pretrain_data.zip
https://drive.google.com/uc?id=1gFV2KdVKaLCTeb3nrzopyYHXbM0G_cr_ ---> outputs/Parrot_train_in_USPTO_Condition_enhance.zip
https://drive.google.com/uc?id=1bVB89ByGkYjiUtbvEcp1mgwmoKy5Ka2b ---> outputs/best_rcm_model_pretrain.zip
https://drive.google.com/uc?id=1DmHILXSOhUuAzqF0JmRTx1EcOOQ7Bm5O ---> outputs/best_mlm_model_pretrain.zip
We provide two usage methods, one is to use the command line, and the other is through the web interface.
Then prepare the txt file containing the SMILES of the reactions you want to predict, and enter the following command:
cd Parrot
python inference.py --config_path path/to/config_file.yaml \
--input_path path/to/input_file.txt \
--output_path path/to/output.csv \
--num_workers NUM_WORKERS \
--inference_batch_size BATCH_SIZE \
--gpu CUDA_ID # use cpu: CUDA_ID=-1
For example, using Parrot predictions trained on the USPTO-Condition dataset, use the following command:
python inference.py --config_path configs/config_inference_use_uspto.yaml \
--input_path test_files/input_demo.txt \
--output_path test_files/predicted_conditions.csv \
--num_workers 6 \
--inference_batch_size 8 \
--gpu 0
Or using Parrot predictions trained on the Reaxys-TotalSyn-Condition dataset, use the following command:
# Could be used to predict temperatures.
python inference.py --config_path configs/config_inference_use_reaxys.yaml \
--input_path test_files/input_demo.txt \
--output_path test_files/predicted_conditions.csv \
--num_workers 6 \
--inference_batch_size 8 \
--gpu 0
Use this command to run web interface.
cd web_app
python app.py
Open the browser, enter: http://127.0.0.1:8000 and you will see the following interface:
Support three input methods
The complete processed USPTO-Condition, USPTO-Suzuki-Condition and pretrain dataset after USE Parrot is already in dataset/source_dataset/USPTO_condition_final
and dataset/pretrain_data
, if you want to recreate the USPTO-Condition dataset, you can read here. If you want to use Reaxys-TotalSyn-Condition, you can only process it from scratch. We provide the ReaxysID of the data and the script for processing. For details, you can read here.The final dataset
directory structure should be as follows:
dataset/
├── pretrain_data
│ ├── mlm_rxn_train.txt # MLM pretrain dataset (train)
│ ├── mlm_rxn_val.txt # MLM pretrain dataset (validation)
│ ├── rxn_center_modeling.pkl # RCM pretrain dataset (train + validation)
│ └── vocab.txt # Parrot reaction SMILES vocabulary
└── source_dataset
├── Reaxys_total_syn_condition_final
│ ├── Reaxys_total_syn_condition.csv
│ ├── Reaxys_total_syn_condition_alldata_idx.pkl
│ └── Reaxys_total_syn_condition_condition_labels.pkl
└── USPTO_condition_final
├── canonical_pistachio_label.json
├── condition_replace_dict_final.json
├── USPTO_condition_alldata_idx.pkl
├── USPTO_condition_aug_n5_alldata_idx.pkl
├── USPTO_condition_aug_n5_condition_labels.pkl
├── USPTO_condition_aug_n5.csv
├── USPTO_condition_condition_labels.pkl
├── USPTO_condition.csv
├── USPTO_condition_pred_category.csv
└── USPTO_condition_pred_category_org.csv
- Masked Language Modeling pretrain:
python pretrain_mlm.py --gpu CUDA_ID --config_path configs/pretrain_mlm_config.yaml
- masked Reaction Center Modeling pretrain:
python pretrain_rcm.py --gpu CUDA_ID --config_path configs/pretrain_rcm_config.yaml
After pretraining, you will get best_mlm_uspto_pretrain
and best_rcm_uspto_pretrain
containing model state in outputs
.
Training in the USPTO-Condition dataset:
- Parrot-ML
python train_parrot_model.py --gpu CUDA_ID --config_path configs/config_uspto_condition.yaml
- Parrot-ML-E
python train_parrot_model.py --gpu CUDA_ID \ --config_path configs/config_uspto_condition_aug_n5_lr_low.yaml
Training in the Reaxy-TotalSyn-Condition dataset:
- Parrot-RCM
python train_parrot_model.py --gpu CUDA_ID \ --config_path configs/config_reaxys_totalsyn_condition.yaml
Test in the USPTO-Condition dataset:
- Parrot-ML-E
python test_parrot_model.py --gpu CUDA_ID \ --config_path configs/config_uspto_condition_aug_n5_lr_low.yaml
Test in the Reaxy-TotalSyn-Condition dataset:
- Parrot-RCM
python test_parrot_model.py --gpu CUDA_ID \ --config_path configs/config_reaxys_totalsyn_condition.yaml
@article{
doi:10.34133/research.0231,
author = {Xiaorui Wang and Chang-Yu Hsieh and Xiaodan Yin and Jike Wang and Yuquan Li and Yafeng Deng and Dejun Jiang and Zhenxing Wu and Hongyan Du and Hongming Chen and Yun Li and Huanxiang Liu and Yuwei Wang and Pei Luo and Tingjun Hou and Xiaojun Yao },
title = {Generic Interpretable Reaction Condition Predictions with Open Reaction Condition Datasets and Unsupervised Learning of Reaction Center},
journal = {Research},
volume = {6},
pages = {0231},
year = {2023},
doi = {10.34133/research.0231},
URL = {https://spj.science.org/doi/abs/10.34133/research.0231},
eprint = {https://spj.science.org/doi/pdf/10.34133/research.0231},
}