
hgraph2graph's Introduction

Hierarchical Generation of Molecular Graphs using Structural Motifs

Our paper is at https://arxiv.org/pdf/2002.03230.pdf

Installation

First install the dependencies via conda:

  • PyTorch >= 1.0.0
  • networkx
  • RDKit >= 2019.03
  • numpy
  • Python >= 3.6
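
A minimal environment setup might look like the following (version pins are assumptions; several issues below suggest rdkit=2019.03.4 for compatibility with the pretrained ChEMBL checkpoint):

conda create -n hgraph2graph python=3.6
conda activate hgraph2graph
conda install -c pytorch pytorch
conda install -c rdkit rdkit=2019.03.4
conda install -c conda-forge networkx numpy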

Then run pip install . from the repository root (note the trailing dot). Additional dependency for property-guided finetuning:

  • Chemprop >= 1.2.0

Data Format

  • For graph generation, each line of a training file is a SMILES string of a molecule
  • For graph translation, each line of a training file is a pair of molecules (molA, molB) that are similar to each other but where molB has better chemical properties. Please see data/qed/train_pairs.txt (format illustrated below). The test file is a list of molecules to be optimized. Please see data/qed/test.txt.
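
For illustration, a pair file is expected to hold two space-separated SMILES per line, the source molecule followed by its improved counterpart (the molecules below are made-up examples, not entries from data/qed/train_pairs.txt):

CC(C)Cc1ccccc1 CC(C)Cc1ccc(O)cc1
c1ccc2[nH]ccc2c1 Oc1ccc2[nH]ccc2c1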

Molecule generation pretraining procedure

We can train a molecular language model on a large corpus of unlabeled molecules. We have uploaded a model checkpoint pre-trained on the ChEMBL dataset at ckpt/chembl-pretrained/model.ckpt. If you wish to train your own language model, please follow the steps below:

  1. Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt
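
Each line of the resulting vocab.txt holds a motif SMILES followed by its atom-mapped attachment form, for example (illustrative entries only):

CC C[CH3:1]
C1=CC=CC=C1 C1=CC=[CH:1]C=C1
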
  2. Preprocess training data:
python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single
mkdir train_processed
mv tensor* train_processed/
  3. Train the graph generation model:
mkdir ckpt/chembl-pretrained
python train_generator.py --train train_processed/ --vocab data/chembl/vocab.txt --save_dir ckpt/chembl-pretrained
  4. Sample molecules from a model checkpoint:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

Property-guided molecule generation procedure (a.k.a. finetuning)

The following script loads a trained Chemprop model and finetunes a pre-trained molecule language model to generate molecules with specific chemical properties.

mkdir ckpt/finetune
python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune

Here ${ACTIVE_MOLECULES} should be a file containing a list of experimentally verified active molecules (one SMILES per line).

${YOUR_PROPERTY_PREDICTOR} should be a directory containing a saved Chemprop model checkpoint.
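
If you do not yet have a property predictor, one can be trained with the Chemprop command line interface; a sketch, assuming Chemprop >= 1.2 and a CSV file with a smiles column and a binary activity column (the file name is a placeholder):

chemprop_train --data_path my_activity_data.csv --dataset_type classification --save_dir chemprop_ckpt

The resulting chemprop_ckpt directory can then be passed as ${YOUR_PROPERTY_PREDICTOR}.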

--max_similarity 0.5 means any novel molecule should have nearest-neighbor similarity lower than 0.5 to the known active molecules in the ${ACTIVE_MOLECULES} file.
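
For intuition, nearest-neighbor similarity against the actives can be computed with RDKit Morgan fingerprints roughly as follows (a sketch; the fingerprint radius and bit size are assumptions, not necessarily what finetune_generator.py uses internally):

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def nearest_neighbor_similarity(candidate_smiles, active_smiles):
    # Tanimoto similarity on 2048-bit Morgan fingerprints of radius 2 (assumed settings)
    cand = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(candidate_smiles), 2, nBits=2048)
    actives = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
               for s in active_smiles]
    return max(DataStructs.TanimotoSimilarity(cand, fp) for fp in actives)

# a generated molecule passes the similarity filter when
# min_similarity < nearest_neighbor_similarity(...) < max_similarity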

--nsample 10000 means to sample 10000 molecules in each epoch.

--threshold 0.5 is the activity threshold. A molecule is considered active if its predicted Chemprop score is greater than 0.5.

In each epoch, generated active molecules are saved in ckpt/finetune/good_molecules.${epoch}. All the novel active molecules are saved in ckpt/finetune/new_molecules.${epoch}.

Molecule translation training procedure

Molecule translation is often useful for lead optimization (i.e., modifying a given molecule to improve its properties).

  1. Extract substructure vocabulary from a given set of molecules:
python get_vocab.py --ncpu 16 < data/qed/mols.txt > vocab.txt

Please replace data/qed/mols.txt with your own molecules, and use the resulting vocabulary consistently in the preprocessing and training steps below.

  2. Preprocess training data:
python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16
mkdir train_processed
mv tensor* train_processed/
  3. Train the model:
mkdir ckpt/translation
python train_translator.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir ckpt/translation
  4. Make predictions on your lead compounds (you can use any model checkpoint; here we use model.5 for illustration):
python translate.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model ckpt/translation/model.5 --num_decode 20 > results.csv

Polymer generation

The polymer generation code is in the polymers/ folder. It is similar to train_generator.py, but its substructure vocabulary is tailored to polymers. For generating regular drug-like molecules, we recommend using train_generator.py in the root directory.


hgraph2graph's Issues

Key error during preprocessing

After generating the vocab via the example command given in the generation directory, I ran the preprocess.py example command on the same dataset, and every 50 batches or so a key error occurs. Example:

[screenshot: KeyError for a fragment missing from vocab.txt]

If I manually add this to the vocab.txt file it references, it continues on until the next key error caused by the absence of a different fragment. If I keep manually adding the missing keys, it eventually works after a dozen or so, but this is a fairly annoying process. I use the same --min_frequency as the repository, and noticed that if I reduce it, the vocabulary increases on the order of thousands of fragments; yet the preprocessing step ends up working if I manually add in just a few dozen from the key error messages. Is there something I am doing wrong here?

Question about constrained optimization

Is there a way to use constrained optimization to find novel molecules with a higher desired property?
I read the paper about your previous work (Junction Tree Variational Autoencoder) and wonder if it is feasible to jointly train HierG2G with a property predictor and use gradient ascent to find novel similar molecules.
Thanks!

no gnn_train.py in generation folder

Thank you for your interesting work.

For the generation task, the README says:
mkdir -p ckpt/tmp
python gnn_train.py --train train_processed/ --vocab ../data/polymers/inter_vocab.txt --save_dir ckpt/tmp

but there is no gnn_train.py in the generation folder.
Could you check this? thank you in advance!

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first

Hello,

I am using finetune_generator.py and I am getting the following error. Does anybody know how to solve this?

Thanks,

Loading from checkpoint ckpt/inforna-pretrained/model.ckpt.5000
After pruning 257 -> 257
Epoch 0 training...
0%| | 0/13 [00:25<?, ?it/s]
Traceback (most recent call last):
File "/apps/hgraph2graph/20210428/hgraph2graph/finetune_generator.py", line 152, in
meters = meters + np.array([kl_div, loss.item(), wacc * 100, iacc * 100, tacc * 100, sacc * 100])
File "/apps/hgraph2graph/20210428/lib/python3.7/site-packages/torch/_tensor.py", line 732, in array
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
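
A minimal workaround (an assumption based on the traceback, not an official fix) is to convert any CUDA tensors in that expression to plain Python floats before numpy sees them; torch and numpy are already imported in finetune_generator.py:

def to_float(x):
    # kl_div (and possibly the accuracy terms) can be CUDA tensors
    return x.item() if torch.is_tensor(x) else x

meters = meters + np.array([to_float(v) for v in
    [kl_div, loss.item(), wacc * 100, iacc * 100, tacc * 100, sacc * 100]])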

Getting error while running pip install .

Hi,

I installed all the dependencies via conda in an environment called jtvae. Then, after cloning the repository, I ran:

(jtvae) hgraph2graph$ pip install .

But I got the following error:

Processing /home/homedir/projects/repos/hgraph2graph
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      error: Multiple top-level packages discovered in a flat-layout: ['data', 'ckpt', 'props', 'hgraph', 'polymers', 'val_scripts'].
      
      To avoid accidental inclusion of unwanted files or directories,
      setuptools will not proceed with this build.
      
      If you are trying to create a single distribution with multiple packages
      on purpose, you should not rely on automatic discovery.
      Instead, consider the following options:
      
      1. set up custom discovery (`find` directive with `include` or `exclude`)
      2. use a `src-layout`
      3. explicitly set `py_modules` or `packages` with a list of names
      
      To find more information, look for "package discovery" on setuptools docs.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(jtvae) hgraph2graph$

Any idea what might be going on? I am running Ubuntu 22.04.
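
A workaround reported further down in this issue list is to declare the package explicitly in setup.py so setuptools skips flat-layout auto-discovery. A minimal sketch (the version string is an assumption):

from setuptools import setup

setup(
    name="hgraph2graph",
    version="0.0.0",      # assumption: any version string works here
    packages=["hgraph"],  # explicit package list disables auto-discovery
)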

How to convert the ChEMBL checkpoint ckpt file to a binary format

Hi everyone,

The pretrained ChEMBL model that has been provided is really helpful. But to generate new lead compounds as in step 4 of the molecule translation procedure, the translate.py script requires a binary model file, while the pretrained ChEMBL checkpoint is provided in ckpt format.
Step 4 of the molecule generation procedure suggests sampling molecules from the model checkpoint, which does take a ckpt file, but this does not generate new compounds the way translate.py does.
How do I generate new compounds using the ChEMBL model.ckpt file?
Help would be really appreciated.

Thank you

Getting error during vocab generation for graph translation.

First of all, thank you for your wonderful research on molecule generation.
I wanted to build a vocab for my dataset (from the L1000 dataset).
Before vocab generation, I converted my SMILES to canonical SMILES.
However, I got this error continuously when I ran get_vocab.py.
I think there were some problems regarding tree decomposition.
Is there any solution for this error?
I will await your comment.
Thanks.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "get_vocab.py", line 11, in process
    hmol = MolGraph(s)
  File "/home/sejeong/generative_model/codes/hgraph2graph/hgraph/mol_graph.py", line 22, in __init__
    self.mol_tree = self.tree_decomp()
  File "/home/sejeong/generative_model/codes/hgraph2graph/hgraph/mol_graph.py", line 83, in tree_decomp
    assert n - m <= 1  # must be connected
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_vocab.py", line 30, in <module>
    vocab_list = pool.map(process, batches)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
AssertionError
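
For anyone hitting this AssertionError: a common workaround (a sketch, not an official fix; the loop body is reconstructed from the repository's get_vocab.py and should be checked against your copy) is to skip molecules whose tree decomposition yields a disconnected cluster graph:

def process(data):
    vocab = set()
    for line in data:
        s = line.strip("\r\n ")
        try:
            hmol = MolGraph(s)
        except AssertionError:
            continue  # tree decomposition was disconnected; drop this molecule
        for node, attr in hmol.mol_tree.nodes(data=True):
            smiles = attr['smiles']
            vocab.add(attr['label'])
            for i, s in attr['inter_label']:
                vocab.add((smiles, s))
    return vocab

Note that this silently removes the offending molecules from the vocabulary; they should also be removed from the training file before preprocessing.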

All assm_labels are zero?

Hi Wengong,

In the decode process, it seems all the assm_labels are 0.
Why do we need to train the get_assm_score if "#the label is always the first of assm_cands"?

all_assm_preds.append( (cand_vecs, batch_idx, 0) ) #the label is always the first of assm_cands

Getting error while generating vocabulary

Hello Wengong !

Thanks for the great work !!

I am trying to build the vocabulary using your dataset ../data/polymers/all.txt; however, I am getting this error and I cannot figure it out. In the end I added a try/except there, but there are lots of these errors across the whole run. I would appreciate it if you could assist me.

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "[...]\hgraph2graph-master\hgraph2graph-master\generation\get_vocab.py", line 12, in process
    hmol = MolGraph(s)
  File "[...]\hgraph2graph-master\hgraph2graph-master\generation\poly_hgraph\mol_graph.py", line 29, in __init__
    self.clusters, self.atom_cls = self.pool_clusters()
  File "[...]\hgraph2graph-master\hgraph2graph-master\generation\poly_hgraph\mol_graph.py", line 87, in pool_clusters
    if fsmiles not in MolGraph.FRAGMENTS: continue
TypeError: argument of type 'NoneType' is not iterable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]/hgraph2graph-master/generation/get_vocab.py", line 62, in <module>
    vocab_list = pool.map(process, batches) # getting error here TypeError: argument of type 'NoneType' is not iterable
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "[...]\Anaconda3\envs\myenv\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
TypeError: argument of type 'NoneType' is not iterable

Some motifs in generated vocabulary are not parseable for rdkit

I was trying to build our customized language models. I found that the motif "C1=CC=CCNCCcc[cH:1]CC=CCCCC=CCCC=CCCCCC=C1" generated by get_vocab.py is not parseable by RDKit.

So when I ran preprocess.py, it reported an error at hgraph2graph/hgraph/vocab.py line 65, in count_inters:
inters = [a for a in mol.GetAtoms() if a.GetAtomMapNum() > 0]
AttributeError: 'NoneType' object has no attribute 'GetAtoms'

This is because within vocab.py::count_inters, the code tries to convert the SMILES to a mol:
line 64: mol = Chem.MolFromSmiles(s)

I would appreciate it if someone could provide a solution.
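
A defensive workaround (a sketch, assuming it is acceptable to treat unparseable motifs as having a single attachment point; the final return line is reconstructed from the repository and should be verified against your copy) is to guard count_inters against a failed parse:

from rdkit import Chem

def count_inters(s):
    mol = Chem.MolFromSmiles(s)
    if mol is None:
        return 1  # assumption: fall back to one attachment for unparseable motifs
    inters = [a for a in mol.GetAtoms() if a.GetAtomMapNum() > 0]
    return max(1, len(inters))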

chemprop_model details

python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune

The details about "chemprop_model" are not given properly. Is it related to the chemprop Python package? The chemprop package can train a model for property prediction; however, I could not find a way to save a model that I could use here.

Thanks
Prosun

Regarding structural motifs (bee hives)

Hi,

I have a question for the tree decomposition function below, especially related to structural motifs (bee hives).

def tree_decomp(self):
    clusters = self.clusters
    graph = nx.empty_graph( len(clusters) )
    for atom, nei_cls in enumerate(self.atom_cls):
        if len(nei_cls) <= 1: continue
        bonds = [c for c in nei_cls if len(clusters[c]) == 2]
        rings = [c for c in nei_cls if len(clusters[c]) > 4] #need to change to 2
        if len(nei_cls) > 2 and len(bonds) >= 2:
            clusters.append([atom])
            c2 = len(clusters) - 1
            graph.add_node(c2)
            for c1 in nei_cls:
                graph.add_edge(c1, c2, weight = 100)
        elif len(rings) > 2: #Bee Hives, len(nei_cls) > 2
            clusters.append([atom]) #temporary value, need to change
            c2 = len(clusters) - 1
            graph.add_node(c2)
            for c1 in nei_cls:
                graph.add_edge(c1, c2, weight = 100)
        else:
            for i,c1 in enumerate(nei_cls):
                for c2 in nei_cls[i + 1:]:
                    inter = set(clusters[c1]) & set(clusters[c2])
                    graph.add_edge(c1, c2, weight = len(inter))
    n, m = len(graph.nodes), len(graph.edges)
    assert n - m <= 1 #must be connected
    return graph if n - m == 1 else nx.maximum_spanning_tree(graph)

It seems structural motifs are not extracted by the code above. For example, converting a polymer SMILES using mol_graph

echo "Cc1cc2c(cc1C)c1cc(-c3cc4c5nn(C)nc5c5cc(-c6cc7c8cc(C)c(C)cc8c8ccsc8c7s6)sc5c4s3)sc1c1sccc21" | python hgraph/mol_graph.py

returns only bonds and single rings, not bee hives.

[(0, 1), (6, 7), (10, 11), (16, 17), (22, 23), (28, 29), (30, 31), (1, 6, 5, 4, 3, 2), (9, 8, 46, 45, 10), (12, 11, 44, 43, 13), (15, 14, 19, 18, 16), (21, 20, 42, 41, 22), (24, 23, 40, 39, 25), (27, 28, 30, 32, 33, 26), (35, 34, 38, 37, 36), (48, 47, 51, 50, 49), (3, 51, 47, 46, 8, 4), (13, 43, 42, 20, 19, 14), (25, 39, 38, 34, 33, 26)]
{0: ('CC', 'C[CH3:1]'), 1: ('CC', 'C[CH3:1]'), 2: ('CC', 'C[CH3:1]'), 3: ('CN', 'C[NH2:1]'), 4: ('CC', 'C[CH3:1]'), 5: ('CC', 'C[CH3:1]'), 6: ('CC', 'C[CH3:1]'), 7: ('C1=CC=CC=C1', 'C1=CC=[CH:1]C=C1'), 8: ('C1=CSC=C1', 'C1=C[CH:1]=[CH:1]S1'), 9: ('C1=CSC=C1', 'C1=CS[CH:1]=C1'), 10: ('C1=N[NH]N=C1', 'N1=[CH:1][CH:1]=N[NH]1'), 11: ('C1=CSC=C1', 'C1=C[CH:1]=[CH:1]S1'), 12: ('C1=CSCC1', 'C1=[CH:1]SCC1'), 13: ('C1=CCCC=C1', 'C1=C[CH2:1][CH2:1]C=C1'), 14: ('C1=CSCC1', 'C1=C[CH2:1][CH2:1]S1'), 15: ('C1=CSC=C1', 'C1=C[CH:1]=[CH:1]S1'), 16: ('C1=CC=CC=C1', 'C1=C[CH:1]=[CH:1]C=C1'), 17: ('C1=CCCC=C1', 'C1=C[CH:1]=[CH:1]CC1'), 18: ('C1=CC=CC=C1', 'C1=CC=[CH:1][CH:1]=C1')}

Any suggestions?

Generation example not working

I downloaded the package and, from the generation folder, ran the suggested process:
python get_vocab.py --min_frequency 100 --ncpu 8 < ../data/polymers/all.txt > ../data/polymers/vocab.txt
python preprocess.py --train ../data/polymers/train.txt --vocab ../data/polymers/vocab.txt --ncpu 8

I get the following error:
"""
Traceback (most recent call last):
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "preprocess.py", line 19, in tensorize
x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/mol_graph.py", line 168, in tensorize
tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/mol_graph.py", line 209, in tensorize_graph
fnode[v] = vocab[attr]
File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/vocab.py", line 43, in getitem
return self.hmap[x[0]], self.vmap[x]
KeyError: ('C1=CSC=N1', 'N1=[CH:2]S[CH:2]=[CH:1]1')
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "preprocess.py", line 49, in
all_data = pool.map(func, batches)
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
KeyError: ('C1=CSC=N1', 'N1=[CH:2]S[CH:2]=[CH:1]1')

AssertionError

Hi gurus,
Please, I need your help. I am trying to run get_vocab.py on my small dataset of around 100 molecules, but I keep getting the error shown below. Is there a way to work around this? The error points to mol_graph.py line 82:
"assert n - m <= 1 #must be connected"
[screenshot: AssertionError traceback from get_vocab.py]

IndexError: tuple index out of range

Hello,

I am following the example for "Molecule generation pretraining procedure". The first step, "python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt", finishes with no error, but I am getting "IndexError: tuple index out of range" for the second step:
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/all.txt --ncpu 16 --mode single

Can you please let me know what could be the problem?

Best,
Amir

README error

After you generate the vocabulary in the first step of the README,

python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt 

the next line should be:

python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single

Otherwise, you get the following error:

IndexError: tuple index out of range

Vocabulary

Hi,
Thank you so much for sharing the code!
I want to use it and am wondering whether it is fair to generate the vocabulary based on all the data. In NLP, for example, it is standard to exclude the test data. (I guess you have a reason; I am just curious.) Thank you already!

RuntimeError in in the polymers/ folder

Hi Wengong,

Thank you for sharing this wonderful work.

I am trying to use the code in the polymers folder to generate molecules based on ZINC250K data. I got the data from the GitHub repo of your junction tree model.
I first used get_vocab.py to build the vocab and preprocess.py to build the training data.

When I run the code, I get the following error:
Namespace(anneal_rate=0.9, atom_vocab=<poly_hgraph.vocab.Vocab object at 0x2b5a6fb51e10>, batch_size=20, beta=0.3, clip_norm=20.0, depthG=20, depthT=20, diterG=5, diterT=1, dropout=0.0, embed_size=250, epoch=20, hidden_size=250, latent_size=24, load_epoch=-1, lr=0.001, print_iter=50, rnn_type='LSTM', save_dir='models/', save_iter=-1, train='train_processed/', vocab='zinc_vocab.txt')
/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/_reduction.py:46: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Model #Params: 5742K
Traceback (most recent call last):
File "vae_train.py", line 80, in <module>
loss, kl_div, wacc, iacc, tacc, sacc = model(*batch, beta=beta)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs/scratchfs1/zhg19014/motifgeneration/hgraph2graph/polymers/poly_hgraph/hgnn.py", line 76, in forward
loss, wacc, iacc, tacc, sacc = self.decoder((root_vecs, tree_vecs, graph_vecs), graphs, tensors, orders)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs/scratchfs1/zhg19014/motifgeneration/hgraph2graph/polymers/poly_hgraph/decoder.py", line 254, in forward
topo_scores = self.get_topo_score(src_tree_vecs, batch_idx, topo_vecs)
File "/gpfs/scratchfs1/zhg19014/motifgeneration/hgraph2graph/polymers/poly_hgraph/decoder.py", line 137, in get_topo_score
return self.topoNN( torch.cat([topo_vecs, topo_cxt], dim=-1) ).squeeze(-1)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
return F.linear(input, self.weight, self.bias)
File "/home/zhg19014/.conda/envs/hgraph2graph/lib/python3.6/site-packages/torch/nn/functional.py", line 1406, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [872 x 500], m2: [274 x 250] at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/generic/THCTensorMathBlas.cu:268

Thanks.

Vocab size of data/chembl/vocab.txt is 5623, but get_vocab.py produces a vocab.txt with 5625 entries

When following the instructions in the README.md, neither of the commands shown seems to work out of the box.
So far I have added py_modules=['hgraph'] in setup.py and added ",clearAromaticFlags=True" in the chemutils.py file.

Sampling from the checkpoint does not work:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

So I tried to reproduce the vocab with:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt
It works, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences, not just two.

Do you have any way to sample from the checkpoint without issues?
Also, why am I getting a different vocab from the same data/chembl/all.txt file? Is there some random operation? I left all random seeds as they are in the scripts.

About the novelty of the model

Could you please provide the novelty (the fraction of unique valid generated molecules not present in the training set) of your model? This important evaluation metric seems to be missing from the paper.

Unrecognized arguments: --nsamples 1000

Hi

After running the example:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsamples 1000

I get an error when passing the argument --nsamples:

usage: generate.py [-h] --vocab VOCAB [--atom_vocab ATOM_VOCAB] --model MODEL
[--seed SEED] [--nsample NSAMPLE] [--rnn_type RNN_TYPE]
[--hidden_size HIDDEN_SIZE] [--embed_size EMBED_SIZE]
[--batch_size BATCH_SIZE] [--latent_size LATENT_SIZE]
[--depthT DEPTHT] [--depthG DEPTHG] [--diterT DITERT]
[--diterG DITERG] [--dropout DROPOUT]
generate.py: error: unrecognized arguments: --nsamples 1000

When I remove this argument, everything is OK.

PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")

Hi experts,
Kindly help out with a solution to the error I'm getting. I want to generate vocabs for my transition metal complexes dataset as shown below:

(/mnt/c/Users/User/Desktop/mol-generation/env) aorubuloye@ORUBULOYE-PC:/mnt/c/Users/User/Desktop/mol-generation/hgraph2graph$ python get_vocab.py --ncpu 16 < data/catalystchem/all.txt > vocab_2.txt


Traceback (most recent call last):
File "/mnt/c/Users/User/Desktop/mol-generation/hgraph2graph/get_vocab.py", line 32, in
vocab_list = pool.map(process, batches)
File "/mnt/c/Users/User/Desktop/mol-generation/env/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/mnt/c/Users/User/Desktop/mol-generation/env/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f38017d2d00>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed")'


Question about the meaning of atom type classification

Thanks for your active reply!

While following your code and paper (I am a newbie to GNNs), I have a question.

As I understand it, node classification generally means a kind of node clustering.

Yet in your paper there is node type (atom type) classification.

I guess it is predicting the atom type (H, C, etc.), not node clustering.

Am I right? If so, could you let me know which part of your code does the atom type classification?

Thanks in advance.

Chem.Kekulize(mol) error

rdkit version: 2021.03.3
(didn't check in other versions)

Doing
python get_vocab.py --ncpu 16 < aromatic.txt > vocab.txt with aromatic SMILES,
an error occurs because of the Chem.Kekulize function.

In chemutils.py,

Before:

def get_mol(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None: Chem.Kekulize(mol)
    return mol

After: adding clearAromaticFlags=True

def get_mol(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None: Chem.Kekulize(mol, clearAromaticFlags=True)
    return mol

Solved.

Error raise when run preprocess.py in generation folder

Hi,
I ran preprocess.py in the generation folder and an error was raised:

python preprocess.py --train ../data/polymers/train.txt --vocab ../data/polymers/inter_vocab.txt --ncpu 8 

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "preprocess.py", line 19, in tensorize
    x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
  File "/data/was/capsulegraphvae/hg2g/generation/poly_hgraph/mol_graph.py", line 169, in tensorize
    tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
  File "/data/was/capsulegraphvae/hg2g/generation/poly_hgraph/mol_graph.py", line 210, in tensorize_graph
    fnode[v] = vocab[attr]
  File "/data/was/capsulegraphvae/hg2g/generation/poly_hgraph/vocab.py", line 43, in __getitem__
    return self.hmap[x[0]], self.vmap[x]
KeyError: 'O=C1NC(=O)C2=C(F)C=C3C(=O)NC(=O)C4=C3C2=C1C=C4F'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 48, in <module>
    all_data = pool.map(func, batches)
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/was/.conda/envs/torch14/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
KeyError: 'O=C1NC(=O)C2=C(F)C=C3C(=O)NC(=O)C4=C3C2=C1C=C4F'

but I cannot find the SMILES "O=C1NC(=O)C2=C(F)C=C3C(=O)NC(=O)C4=C3C2=C1C=C4F" in the text files. Do you know what is happening and how to debug it? Thank you in advance.

Can't kekulize molecule while running get_vocab.py

While running get_vocab.py on an experimental dataset of the first 40 lines in data/chembl/all.txt, I get the following error message:
[screenshot: "Can't kekulize mol" error from get_vocab.py]

I've tried going to the respective line in chemutils.py and checking it out, but everything seems to be fine there. What could be the issue?

size mismatch when load chembl-pretrained ckpt

I cannot load the pretrained ckpt, and setting strict=False does not help.

python generate.py --vocab vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
/root/miniconda3/lib/python3.7/site-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Traceback (most recent call last):
File "generate.py", line 44, in <module>
model.load_state_dict(torch.load(args.model)[0], strict=False)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for HierVAE:
size mismatch for encoder.E_c.0.weight: copying a param with shape torch.Size([1578, 250]) from checkpoint, the shape in current model is torch.Size([1576, 250]).
size mismatch for encoder.E_i.0.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.hmpn.E_c.0.weight: copying a param with shape torch.Size([1578, 250]) from checkpoint, the shape in current model is torch.Size([1576, 250]).
size mismatch for decoder.hmpn.E_i.0.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.E_assm.0.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.clsNN.3.weight: copying a param with shape torch.Size([1578, 250]) from checkpoint, the shape in current model is torch.Size([1576, 250]).
size mismatch for decoder.clsNN.3.bias: copying a param with shape torch.Size([1578]) from checkpoint, the shape in current model is torch.Size([1576]).
size mismatch for decoder.iclsNN.3.weight: copying a param with shape torch.Size([5623, 250]) from checkpoint, the shape in current model is torch.Size([5625, 250]).
size mismatch for decoder.iclsNN.3.bias: copying a param with shape torch.Size([5623]) from checkpoint, the shape in current model is torch.Size([5625]).

CUDA memory usage high for large --num_decode

Memory usage is high when generating large numbers of output SMILES from a single input. Perhaps the intent is to only generate a few output SMILES per input? But when generating many SMILES from a single input, the code has significant usability issues, as in the error below (reliance on a memory-bound data structure). In this case the model was trained on ~20K mols -> 100K pairs, and in theory it should have enough diversity to generate a large number of changes to the input SMILES:

python decode.py --test single_smiles.smi --vocab training.vocab --model ./models/model.10 --num_decode 10000 --batch_size 1

Traceback (most recent call last):
File "../hgraph2graph/decode.py", line 69, in
new_mols = model.translate(batch[1], args.num_decode, args.enum_root, args.greedy)
File "/hpc/scratch/nvme1/HeirVAE/hgraph2graph/hgraph/hgnn.py", line 96, in translate
return self.decoder.decode( (root_vecs, z_tree_vecs, z_graph_vecs), greedy=greedy)
File "/hpc/scratch/nvme1/HeirVAE/hgraph2graph/hgraph/decoder.py", line 322, in decode
hinter = HTuple( mess = self.rnn_cell.get_init_state(tree_tensors[1]) )
File "/hpc/scratch/nvme1/HeirVAE/hgraph2graph/hgraph/rnn.py", line 76, in get_init_state
c = torch.zeros(len(fmess), self.hidden_size, device=fmess.device)
RuntimeError: CUDA out of memory. Tried to allocate 2.01 GiB (GPU 0; 10.92 GiB total capacity; 8.94 GiB already allocated; 1.29 GiB free; 9.21 GiB reserved in total by PyTorch)
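
A pragmatic workaround (a sketch; the chunk size is an assumption to be tuned to your GPU) is to request the decodes in smaller chunks around the call shown in the traceback, since the whole num_decode batch is otherwise materialized on the GPU at once:

# sketch around decode.py line 69; assumes the surrounding variables of decode.py
chunk = 100
new_mols = []
for _ in range(args.num_decode // chunk):
    new_mols.extend(model.translate(batch[1], chunk, args.enum_root, args.greedy))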

If anyone has done fine tuning, please share environment.yaml

Looking at the issues here, I'm adjusting the environment. If you use chemprop 1.2 or 1.3, it will ask you to install descriptastorus. When I install it, I get an import error because scipy.bigrat is missing.

So when I install the latest chemprop, it automatically changes rdkit to a 2023 version. It is difficult to match the versions.
Please share an environment.yaml.

Getting error when run vae_train.py

First of all, thank you for your great research on molecule generation.
I am currently training my ZINC dataset with your vae_train.py (in the generation folder).
When I run the code, I get the error below.
This error occurs only occasionally; I think it depends on the batch.
Is there any solution for this problem?

  warnings.warn(warning.format(ret))
Model #Params: 160850K
[50] Beta: 0.100, KL: 19.11, loss: 57.167, Word: 10.76, 52.60, Topo: 80.77, Assm: 56.73, PNorm: 175.70, GNorm: 18.64
[100] Beta: 0.100, KL: 9.08, loss: 42.075, Word: 14.69, 59.69, Topo: 93.39, Assm: 75.03, PNorm: 236.81, GNorm: 14.60
[150] Beta: 0.100, KL: 9.66, loss: 39.316, Word: 16.71, 62.58, Topo: 96.62, Assm: 77.06, PNorm: 293.82, GNorm: 17.42
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [32,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion is repeated for threads [85,0,0] through [121,0,0]]
Traceback (most recent call last):
  File "vae_train.py", line 81, in <module>
    loss, kl_div, wacc, iacc, tacc, sacc = model(*batch, beta=beta)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/hgnn.py", line 88, in forward
    root_vecs, tree_vecs, _, graph_vecs = self.encoder(tree_tensors, graph_tensors)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/encoder.py", line 130, in forward
    hatom,_ = self.graph_encoder(*tensors)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/encoder.py", line 30, in forward
    h = self.rnn(fmess, bgraph)
  File "/home/sejeong/anaconda3/envs/PSJ/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/rnn.py", line 105, in forward
    h,c = self.LSTM(fmess, h_nei, c_nei)
  File "/home/sejeong/hgraph2graph/generation/poly_hgraph/rnn.py", line 92, in LSTM
    c = i * u + (f * c_nei).sum(dim=1)
RuntimeError: CUDA error: device-side assert triggered

Extracting latent vector for a molecule

Hi,

I've successfully trained a generation model on my set of molecules, and I'm able to sample from it with generate.py. However, I was wondering if it's possible to easily extract the latent vectors for a set of input molecules I used as the training set?

I've previously used your JTNN model, and it was easily done with encode_latent function you had in your model class. However, it seems like HierVAE in hgnn.py does not have such a function. Could you point me to any helper functions I would need to invoke to encode a molecule after training and get its latent vector?

Thank you!
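
For readers with the same question, here is a sketch of what an encode_latent analogue might look like, mirroring the first half of HierVAE.forward (the attribute name R_mean is an assumption based on the repository at the time of writing; verify against your copy of hgraph/hgnn.py). tree_tensors and graph_tensors are a tensorized batch built the same way train_generator.py builds its training batches:

import torch

@torch.no_grad()
def encode_latent(model, tree_tensors, graph_tensors):
    # encode as in HierVAE.forward, then take the posterior mean
    # instead of sampling a latent vector from it
    root_vecs, tree_vecs, _, graph_vecs = model.encoder(tree_tensors, graph_tensors)
    return model.R_mean(root_vecs)  # assumption: R_mean is the latent-mean head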

Size of tensors do not match

Hello Wengong !

Thanks for the great work !
I am currently trying to train the VAE on ChemBL, however, after about 300 gradient descent steps, I encounter the following error :

Traceback (most recent call last):
  File "vae_train.py", line 108, in <module>
    loss, kl_div, wacc, iacc, tacc, sacc = model(*batch, beta=beta)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//PycharmProjects/hgraph2graph/generation/poly_hgraph/hgnn.py", line 79, in forward
    loss, wacc, iacc, tacc, sacc = self.decoder((root_vecs, root_vecs, root_vecs), graphs, tensors, orders)
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home//PycharmProjects/hgraph2graph/generation/poly_hgraph/decoder.py", line 251, in forward
    cand_vecs = self.enum_attach(hgraph, cands, icls, nth_child)
  File "/home//PycharmProjects/hgraph2graph/generation/poly_hgraph/decoder.py", line 293, in enum_attach
    cand_vecs = torch.cat([cand_vecs, icls_vecs, order_vecs], dim=-1)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 146 and 292

Would you have any pointers to a possible source ?
The error is not easy to reproduce as it seems to occur at random stages of the training.

Thanks !

TypeError: forward() missing 2 required positional arguments: 'y_orders' and 'beta'

I am using train_translator.py for lead optimization and I am getting the following error. Does anybody know how I can fix this?

Thanks,

Traceback (most recent call last):
File "/apps/hgraph2graph/20210428/hgraph2graph/train_translator.py", line 86, in
loss, kl_div, wacc, iacc, tacc, sacc = model(*batch)
File "/apps/hgraph2graph/20210428/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'y_orders' and 'beta'

error with the latest version of rdkit (2021.03.1)

Hello,

I just wanted to share that the code does not run properly with the latest version of rdkit (2021.03.1). I tried to run the example from the readme file, python get_vocab.py < data/qed/mols.txt > vocab.txt, and got the following error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "get_vocab.py", line 10, in process
    hmol = MolGraph(s)
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/mol_graph.py", line 22, in __init__
    self.order = self.label_tree()
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/mol_graph.py", line 112, in label_tree
    cmol, inter_label = get_inter_label(mol, cls, inter_atoms)
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/chemutils.py", line 142, in get_inter_label
    new_mol = get_clique_mol(mol, atoms)
  File "/Users/brian/Desktop/git/hgraph2graph/hgraph/chemutils.py", line 111, in get_clique_mol
    smiles = Chem.MolFragmentToSmiles(mol, atoms, kekuleSmiles=True)
rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 7 marked aromatic
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "get_vocab.py", line 29, in <module>
    vocab_list = pool.map(process, batches)
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/brian/opt/anaconda3/envs/hiervae/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
rdkit.Chem.rdchem.AtomKekulizeException: non-ring atom 7 marked aromatic

I also faced a similar error when running the preprocessing script.

This Kekulize error has been mentioned in the latest RDKit release update:

https://github.com/rdkit/rdkit/releases

"Setting the kekuleSmiles argument (doKekule in C++) to MolToSmiles will now
cause the molecule to be kekulized before SMILES generation. Note that this
can lead to an exception being thrown. Previously this argument would only
write kekulized SMILES if the molecule had already been kekulized (#2788)"

So far, my workaround has been to downgrade RDKit to an older version. I no longer run into these errors after testing rdkit=2018.09.1.0, although it would be nice to know which version of rdkit was used in the original implementation.
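
For anyone reproducing this, a pinned install along these lines should give a compatible environment (the channel and exact version are assumptions; note the next issue reports that the pretrained ChEMBL checkpoint specifically needs rdkit=2019.03.4):

conda install -c rdkit rdkit=2019.03.4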

Pretrained chembl model requires old rdkit

In order to use the model checkpoint trained on chembl, you need to be on rdkit=2019.03.4, which isn't mentioned in the readme. If you're on a newer version, you'll get a KeyError when the model tries to look up SMILES in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.

MolGraph for cyclic structures

In the previous implementation of mol_graph for JTVAE (https://github.com/wengong-jin/icml18-jtnn/blob/1d298810e193ce2eef1252f653ea2a3794bdf66b/fast_jtnn/chemutils.py), the code included the following block, which merged rings. This step seems to be missing from the new implementation. Could you please clarify whether this is the case?

  #Merge Rings with intersection > 2 atoms
  for i in range(len(cliques)):
      #only focus on rings skip other cliques
      if len(cliques[i]) <= 2: continue
      for atom in cliques[i]:
          for j in nei_list[atom]:
              if i >= j or len(cliques[j]) <= 2: continue
              inter = set(cliques[i]) & set(cliques[j])
              if len(inter) > 2:
                  cliques[i].extend(cliques[j])
                  cliques[i] = list(set(cliques[i]))
                  cliques[j] = []
  cliques = [c for c in cliques if len(c) > 0]
