aspuru-guzik-group / selfies

Robust representation of semantically constrained graphs, in particular for molecules in chemistry

License: Apache License 2.0

Python 100.00%

selfies's Introduction

SELFIES


Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation
Mario Krenn, Florian Haese, AkshatKumar Nigam, Pascal Friederich, Alan Aspuru-Guzik
Machine Learning: Science and Technology 1, 045024 (2020), extensive blog post January 2021.
Talk on YouTube about SELFIES.
A community paper with 31 authors on SELFIES and the future of molecular string representations.
Blog post explaining SELFIES in Japanese.
Code paper in February 2023
SELFIES in Wolfram Mathematica (since Dec 2023)
Major contributors of v1.0.n: Alston Lo and Seyone Chithrananda
Main developer of v2.0.0: Alston Lo
Chemistry Advisor: Robert Pollice


A main objective is to use SELFIES as direct input into machine learning models, in particular generative models, for the generation of molecular graphs that are syntactically and semantically valid.

SELFIES validity in a VAE latent space

Installation

Use pip to install selfies.

pip install selfies

To check if the correct version of selfies is installed, use the following pip command.

pip show selfies

To upgrade to the latest release of selfies if you are using an older version, use the following pip command. Please see the CHANGELOG to review the changes between versions of selfies before upgrading:

pip install selfies --upgrade

Usage

Overview

Please refer to the documentation, which contains a thorough tutorial for getting started with selfies and detailed descriptions of the functions that selfies provides. We summarize some key functions below.

  • selfies.encoder: Translates a SMILES string into its corresponding SELFIES string.
  • selfies.decoder: Translates a SELFIES string into its corresponding SMILES string.
  • selfies.set_semantic_constraints: Configures the semantic constraints that selfies operates on.
  • selfies.len_selfies: Returns the number of symbols in a SELFIES string.
  • selfies.split_selfies: Tokenizes a SELFIES string into its individual symbols.
  • selfies.get_alphabet_from_selfies: Constructs an alphabet from an iterable of SELFIES strings.
  • selfies.selfies_to_encoding: Converts a SELFIES string into its label and/or one-hot encoding.
  • selfies.encoding_to_selfies: Converts a label or one-hot encoding into a SELFIES string.

Examples

Translation between SELFIES and SMILES representations:

import selfies as sf

benzene = "c1ccccc1"

# SMILES -> SELFIES -> SMILES translation
try:
    benzene_sf = sf.encoder(benzene)  # [C][=C][C][=C][C][=C][Ring1][=Branch1]
    benzene_smi = sf.decoder(benzene_sf)  # C1=CC=CC=C1
except sf.EncoderError:
    pass  # sf.encoder error!
except sf.DecoderError:
    pass  # sf.decoder error!

len_benzene = sf.len_selfies(benzene_sf)  # 8

symbols_benzene = list(sf.split_selfies(benzene_sf))
# ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[=Branch1]']

Very simple creation of random valid molecules:

A key property of SELFIES is the possibility to create valid random molecules in a very simple way -- inspired by a tweet by Rajarshi Guha:

import selfies as sf
import random

alphabet = sf.get_semantic_robust_alphabet()  # Gets the alphabet of robust symbols
rnd_selfies = ''.join(random.sample(list(alphabet), 9))
rnd_smiles = sf.decoder(rnd_selfies)
print(rnd_smiles)

These few lines give crazy molecules, but all of them are valid. They can be used as a starting point for more advanced filtering techniques or for machine learning models (a small filtering sketch follows below).
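
As a concrete sketch of such a filtering step (assuming RDKit is installed; random_molecule is a hypothetical helper, not part of selfies):

import random

import selfies as sf
from rdkit import Chem

alphabet = list(sf.get_semantic_robust_alphabet())

def random_molecule(n_symbols=9):
    # sample a random SELFIES string and return its RDKit-canonicalized SMILES
    rnd_selfies = ''.join(random.sample(alphabet, n_symbols))
    smiles = sf.decoder(rnd_selfies)
    mol = Chem.MolFromSmiles(smiles)  # extra sanity check / canonicalization
    return Chem.MolToSmiles(mol) if mol is not None else None

samples = [random_molecule() for _ in range(10)]
print([s for s in samples if s is not None])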

Integer and one-hot encoding SELFIES:

In this example, we first build an alphabet from a dataset of SELFIES strings, and then convert a SELFIES string into its padded encoding. Note that we use the [nop] (no operation) symbol to pad our SELFIES, which is a special SELFIES symbol that is always ignored and skipped over by selfies.decoder, making it a useful padding character.

import selfies as sf

dataset = ["[C][O][C]", "[F][C][F]", "[O][=O]", "[C][C][O][C][C]"]
alphabet = sf.get_alphabet_from_selfies(dataset)
alphabet.add("[nop]")  # [nop] is a special padding symbol
alphabet = list(sorted(alphabet))  # ['[=O]', '[C]', '[F]', '[O]', '[nop]']

pad_to_len = max(sf.len_selfies(s) for s in dataset)  # 5
symbol_to_idx = {s: i for i, s in enumerate(alphabet)}

dimethyl_ether = dataset[0]  # [C][O][C]

label, one_hot = sf.selfies_to_encoding(
   selfies=dimethyl_ether,
   vocab_stoi=symbol_to_idx,
   pad_to_len=pad_to_len,
   enc_type="both"
)
# label = [1, 3, 1, 4, 4]
# one_hot = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]
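
For the reverse direction, selfies.encoding_to_selfies expects an index-to-symbol dictionary ({int: token}). A minimal sketch, continuing the example above and assuming the argument order mirrors selfies_to_encoding:

idx_to_symbol = {i: s for s, i in symbol_to_idx.items()}  # invert the vocabulary

recovered_sf = sf.encoding_to_selfies(label, idx_to_symbol, enc_type="label")
# recovered_sf = "[C][O][C][nop][nop]"
# the [nop] padding symbols are ignored when the string is passed to sf.decoder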

Customizing SELFIES:

In this example, we relax the semantic constraints of selfies to allow for hypervalences (caution: hypervalence rules are much less understood than octet rules. Some molecules containing hypervalences are important, but generally, it is not known which molecules are stable and reasonable).

import selfies as sf

hypervalent_sf = sf.encoder('O=I(O)(O)(O)(O)O', strict=False)  # orthoperiodic acid
standard_derived_smi = sf.decoder(hypervalent_sf)
# OI (the default constraints for I allow only 1 bond)

sf.set_semantic_constraints("hypervalent")
relaxed_derived_smi = sf.decoder(hypervalent_sf)
# O=I(O)(O)(O)(O)O (the hypervalent constraints for I allow 7 bonds)
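
Beyond the built-in presets, constraints can also be customized per element. A minimal sketch, assuming (as in the documentation) that sf.get_semantic_constraints() returns the currently active constraint dictionary and that calling sf.set_semantic_constraints() with no argument restores the defaults:

constraints = sf.get_semantic_constraints()  # e.g. {..., "I": 1, ...} by default
constraints["I"] = 7                         # let iodine form up to 7 bonds
sf.set_semantic_constraints(constraints)

custom_derived_smi = sf.decoder(hypervalent_sf)
# with 7 bonds allowed on I, this should again give O=I(O)(O)(O)(O)O

sf.set_semantic_constraints()  # restore the default constraints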

Explaining Translation:

You can get an "attribution" list that traces the connection between input and output tokens. For example, let's see which tokens in the SELFIES string [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1] are responsible for the output SMILES tokens.

import selfies as sf

selfies = "[C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]"
smiles, attr = sf.decoder(selfies, attribute=True)
print('SELFIES', selfies)
print('SMILES', smiles)
print('Attribution:')
for smiles_token in attr:
    print(smiles_token)

# output
SELFIES [C][N][C][Branch1][C][P][C][C][Ring1][=Branch1]
SMILES C1NC(P)CC1
Attribution:
AttributionMap(index=0, token='C', attribution=[Attribution(index=0, token='[C]')])
AttributionMap(index=2, token='N', attribution=[Attribution(index=1, token='[N]')])
AttributionMap(index=3, token='C', attribution=[Attribution(index=2, token='[C]')])
AttributionMap(index=5, token='P', attribution=[Attribution(index=3, token='[Branch1]'), Attribution(index=5, token='[P]')])
AttributionMap(index=7, token='C', attribution=[Attribution(index=6, token='[C]')])
AttributionMap(index=8, token='C', attribution=[Attribution(index=7, token='[C]')])

attr is a list of AttributionMaps containing the output token, its index, and input tokens that led to it. For example, the P appearing in the output SMILES at that location is a result of both the [Branch1] token at position 3 and the [P] token at index 5. This works for both encoding and decoding. For finer control of tracking the translation (like tracking rings), you can access attributions in the underlying molecular graph with get_attribution.

More Usages and Examples

Tests

selfies uses pytest with tox as its testing framework. All tests can be found in the tests/ directory. To run the test suite for SELFIES, install tox and run:

tox -- --trials=10000 --dataset_samples=10000

By default, selfies is tested against a random subset (of size dataset_samples=10000) of various datasets:

  • 130K molecules from QM9
  • 250K molecules from ZINC
  • 50K molecules from a dataset of non-fullerene acceptors for organic solar cells
  • 160K+ molecules from various MoleculeNet datasets
  • 36M+ molecules from the eMolecules Database. Due to its large size, this dataset is not included on the repository. To run tests on it, please download the dataset into the tests/test_sets directory and run the tests/run_on_large_dataset.py script.

Version History

See CHANGELOG.

Credits

We thank Jacques Boitreaud, Andrew Brereton, Nessa Carson (supersciencegrl), Matthew Carbone (x94carbone), Vladimir Chupakhin (chupvl), Nathan Frey (ncfrey), Theophile Gaudin, HelloJocelynLu, Hyunmin Kim (hmkim), Minjie Li, Vincent Mallet, Alexander Minidis (DocMinus), Kohulan Rajan (Kohulan), Kevin Ryan (LeanAndMean), Benjamin Sanchez-Lengeling, Andrew White, Zhenpeng Yao and Adamo Young for their suggestions and bug reports, and Robert Pollice for chemistry advice.

License

Apache License 2.0

selfies's People

Contributors

akshat998, alstonlo, darrenwee, dependabot[bot], florianhase, haydn-jones, hellojocelynlu, jannisborn, mariokrenn6240, ncfrey, seyonechithrananda, unixjunkie, vandrw, whitead


selfies's Issues

I'm unable to encode the following SMILES molecule.

Cc1c(C)c(S(=O)(=O)NC(=N)NCCC[C@H](NC(=O)[C@@H]2CCCN2C(=O)[C@H](CCC(=O)NC(c2ccccc2)(c2ccccc2)c2ccccc2)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCCCNC(=O)OC(C)(C)C)NC(=O)[C@H](C)NC(=O)[C@@H]2CCCN2C(=O)[C@@H]2CCCN2C(=O)[C@H](CCCCNC(=O)OC(C)(C)C)NC(=O)[C@H](CCCCNC(=O)OC(C)(C)C)NC(=O)[C@H](COC(C)(C)C)NC(=O)[C@H](CCC(=O)OC(C)(C)C)NC(=O)[C@H](CCCCNC(=O)OC(C)(C)C)NC(=O)[C@H](CCCNC(=N)NS(=O)(=O)c2c(C)c(C)c3c(c2C)CCC(C)(C)O3)NC(=O)[C@H](CCC(=O)NC(c2ccccc2)(c2ccccc2)c2ccccc2)NC(=O)[C@H](CCC(=O)NC(c2ccccc2)(c2ccccc2)c2ccccc2)NC(=O)[C@@H](NC(=O)[C@H](CCCNC(=N)NS(=O)(=O)c2c(C)c(C)c3c(c2C)CCC(C)(C)O3)NC(=O)[C@H](CCC(=O)NC(c2ccccc2)(c2ccccc2)c2ccccc2)NC(=O)[C@H](Cc2cn(C(=O)OC(C)(C)C)cn2)NC(=O)[C@H](CCC(=O)OC(C)(C)C)NC(=O)[C@@H]2CCCN2C(=O)[C@H](COC(C)(C)C)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](COC(c2ccccc2)(c2ccccc2)c2ccccc2)NC(=O)[C@H](COC(C)(C)C)NC(=O)CNC(=O)OC(C)(C)C)C(C)C)C(=O)O)c(C)c2c1OC(C)(C)CC2

Hangs on some invalid SMILES inputs with hanging open parenthesis

I haven't looked into it, but when testing whether invalid SMILES would work properly (expected to return None), I tried this:

import selfies as sf
smiles = 'cc('
enc = sf.encoder(smiles)

which caused the program to hang and CPU usage to go really high.
When I did a KeyboardInterrupt, this is where it got stuck:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    sf.encoder(smiles)
  File "C:\Users\S1024501\AppData\Local\Programs\Python\Python39\lib\site-packages\selfies\encoder.py", line 66, in encoder
    all_selfies.append(_translate_smiles(s))
  File "C:\Users\S1024501\AppData\Local\Programs\Python\Python39\lib\site-packages\selfies\encoder.py", line 178, in _translate_smiles
    selfies, _ = _translate_smiles_derive(smiles_gen, rings, derive_counter)
  File "C:\Users\S1024501\AppData\Local\Programs\Python\Python39\lib\site-packages\selfies\encoder.py", line 230, in _translate_smiles_derive
    N_as_symbols = get_symbols_from_n(branch_len - 1)
  File "C:\Users\S1024501\AppData\Local\Programs\Python\Python39\lib\site-packages\selfies\grammar_rules.py", line 315, in get_symbols_from_n
    n //= base
KeyboardInterrupt

This was reproducible, and also happened for 'co(', 'cccc(', 'cccc(1', 'cccccc(', '('. Any other value of smiles (valid or invalid!) I've tried so far has worked as expected, including the strings 'c(', 'ccc(', 'ccccc(', 'cccc)', 'cccc(c'.
'ccc(c' did actually return a valid SELFIES string for butadiene, which seems reasonable. Interestingly, ')' returned only '' (an empty string rather than None).
Sorry for semi-deliberately breaking your awesome program :D

SELFIES tokens list

Hi @MarioKrenn6240 ,

Would it be possible to publish the old list of SELFIES tokens and the tokens which replace them in the new SELFIES version?

Since STOUT and DECIMER use a fixed set of tokens in their machine learning models, they fail with the new SELFIES 2.0 due to the token changes. With the list of tokens, I could check and replace them.

-Kohulan

Wildcard Error message of encoder (and decoder)

The try block in the encoder seems to check for a wildcard in the molecule, and it works, except that it doesn't print out the appropriate error message regarding the wildcard.

I.e.
smiles = "C[C@H](O)[C@@(*)C1=CC=CC=C1"
The encoder (correctly) returns None as a result.
(Side note: I personally would have used a regex matcher (if re.match('[*]?', smiles): ....) instead, but the '*' statement seems to work!?)
Anyway, no message is printed out pointing to this issue. I am no good with try/except so I can't say what's going on here, other than it doesn't seem to work as implied?

Also, wouldn't it be better to check for the wildcard in the smiles.split loop instead? And while you are doing this, you could also check for R groups, thus eliminating e.g. the other currently open issue (where one can suspect you won't get a reply back....).

And finally, you return None as a result in the case of an error.
If you then submit None to the decoder, Python throws errors. It remains silent, though, if you submit a nonsense string (or simply an empty one, or "None" as a string), with no error message as the code would imply. I get e.g. this:

  File "test.py", line 11, in <module>
    decoded_smiles = sf.decoder(None)  # SELFIES --> SMILES
  File "/home/alex/dev/selfies/selfies-master/selfies/decoder.py", line 35, in decoder
    for s in selfies.split("."):
AttributeError: 'NoneType' object has no attribute 'split'

Dataset used for Fig. 4 in SELFIES paper

In figure 4 of the SELFIES paper, you predict the logP solubility using the latent representation of a molecule. I would like to reproduce this type of result but I can't find a QM9 dataset with the logP target property. The QM9 dataset in this repo only has the SMILES encoding of each molecule. Can you upload the QM9 dataset used to compare predicted and target logP properties?

Potential bug in kekulization

I was using selfies.encoder with the non-kekulized SMILES string NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O, and I got the error: Encoding error 'NC(=O)c1cccc2c1-c1ccc(cc1)-n-c-2=O': kekulization algorithm failed.

However, I am able to kekulize the string with rdkit, using rdkit.Chem.Kekulize(mol); rdkit.Chem.MolToSmiles(mol,kekuleSmiles=True). The resulting SMILES string is NC(=O)C1=CC=CC2=C1C1=CC=C(C=C1)NC2=O which can then be encoded as the SELFIES string [N][C][Branch1_2][C][=O][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=C][C][=C][Branch1_1][Branch1_1][C][=C][Ring1][Branch1_2][N][C][Ring1][Branch2_3][=O] without error. I am just wondering if this is expected behaviour or possibly a bug; I understand that kekulization algorithms can sometimes produce different results.

I am using python 3.7.10, selfies 1.0.3, and rdkit 2018.09.3

Question regarding the output of selfies

Hello,

I have a question regarding the output SELFIES. I trained a simple teacher-forced RNN based on SELFIES of molecules. My trained model generates SELFIES like the example below:

 [C][=S@@expl][=SHexpl][=P][/S+expl][/C@@Hexpl][/P@@expl][\Siexpl][C][C][C][/I][IH2expl]<sos>[C][SnH2expl][Sn+expl][=S@@expl][\S][#N+expl][=I][\I][C][C@Hexpl][Expl/Ring2][P@expl][C][\CH-expl][Branch1_3][Expl=Ring2][C][/B][P][\S][\C@Hexpl][=P@expl][Br+expl][=P][\S+expl][/O][/P][=S@@expl][N+expl][=P][/N][B-expl][=S][N@@+expl][Cexpl][\C@@Hexpl][/N][=N][Oexpl][\O][/C][/NHexpl][/S@expl][B-expl][S@@Hexpl][\S][C@@Hexpl][Sn+expl][SiH3expl][C][\B][S@@expl][Ring2][/C][P+expl][N][C][N][C][C][C][Branch2_3][=S+expl][BH2-expl][=Nexpl][#N][/C@@expl][Nexpl][CH-expl][/B][C][C][\N][S@@+expl][/P@expl][SnH4+2expl][Branch2_2][=P][/C@expl][/S+expl][S@Hexpl][\B][/C][N][Snexpl][BH3-expl][B-expl][\S][=S][N+expl][Sn+3expl][/I][\P][\S+expl][F][Cexpl][/S@expl][Siexpl][B@@-expl][/C@Hexpl][\Cl][/P@expl][O][O-expl][\C@@expl][C-expl][/O+expl][\O-expl][Branch1_3][N][/Cexpl][Expl\Ring1][B@-expl][O][N][SiH3expl][C][SnH6+3expl][=S@@expl][\S@expl][CH-expl][#N+expl][\S@@expl][/Cexpl][SnH4+2expl][=O][S@expl][O][\P][=Snexpl][Cexpl][C][/N+expl][P@@expl][/CH-expl][=IH2expl][N][=P@expl][#C][P@@Hexpl][P+expl][Br][=S@@expl][C-expl][/Br][=N+expl][P][=Siexpl][P][NHexpl][/Br][N+expl][\N+expl][N+expl][O][C][/C][\O-expl][=S@@expl][\S][BH-expl][Branch1_2][SnH2expl][Ring2][/CH-expl][=17Oexpl][\C@expl][=P@Hexpl][=P+expl][O][C][C][NHexpl][C][C][C][C][/P@@expl][N][B@@-expl][C+expl][=S][=P+expl][C][B][\I][=C][C][C][C][O][N][O][Branch2_1][SiH3expl][CHexpl][\S@@expl][/B][=P@Hexpl][Branch2_3][C@@Hexpl][/F][S@+expl][\N][/C][/B][Oexpl][P@@Hexpl][/NHexpl][/O][O][=SHexpl][\Oexpl][=Siexpl][P@+expl][SiH3expl][SnH6+3expl][O-expl][Branch1_2][CH2expl][\S@expl][=SHexpl][PHexpl][\F][\C@@Hexpl][O][C][O][Nexpl][O][C][N][Cl][N@@+expl][P@@Hexpl][O][C][/Cexpl][S@@expl][C][\O][\Br][/O][\N+expl][SHexpl][\Siexpl][=SHexpl][B@-expl][S][Br][\F][\S+expl][C][C][C][=N-expl][BH3-expl][#C-expl][/Cl][\S+expl][Expl=Ring2][CHexpl][C][Expl/Ring2][P@+expl][N-expl][#N][Br][\Snexpl][BH-expl][#C-expl][CH2expl][=Snexpl][/Cl][/P@expl][C][\Oexpl][Expl=Ring1]

However, after passing this SELFIES to the sf.decoder() function, the output SMILES looks significantly shorter:

 C=[S@@]=[SH]=P/[S+]/[C@@H]/[P@@]\[Si]CCC/I

When I convert the SMILES back to SELFIES again, I get a different SELFIES string:

[C][=S@@expl][=SHexpl][=P][/S+expl][/C@@Hexpl][/P@@expl][\Siexpl][C][C][C][/I]

Can I get some explanation why this happens? Thank you so much!

Table 2 rules

Hello authors,

I tried following the rules in Table 2 in a naive way, i.e. just following the grammar, and I seemed to produce unphysical molecules. For example:

X0 -> OX2 -> OOX1 -> OO=O

Could someone clarify what is happening here? The other question I had is about the code. Do the numbers in the Ring and Branch "letters" mean anything? I.e. do the numbers 1 and 3 have specific meanings, e.g. [Branch1_3] versus [Branch3_1]? For rings it always seems to be [Ring1] in most examples that I've run. Does the "1" in Ring1 become superfluous in this case, or again does it have a specific meaning? Thanks!

[=Branch1] vs [Branch1] ?

Hello,
I am working on improving the stoned-selfies protocol and I was wondering why the selfies.encoder sometimes chooses to use the token [=Branch1] instead of [Branch1].

If we take celecoxib:
SMILES= 'CC1=CC=C(C=C1)C2=CC(=NN2C3=CC=C(C=C3)S(=O)(=O)N)C(F)(F)F'
=> SELFIES= '[C][C][=C][C][=C][Branch1][Branch1][C][=C][Ring1][=Branch1][C][=C][C][=Branch2][Ring1][#Branch1][=N][N][Ring1][Branch1][C][=C][C][=C][Branch1][Branch1][C][=C][Ring1][=Branch1][S][=Branch1][C][=O][=Branch1][C][=O][N][C][Branch1][C][F][Branch1][C][F][F]'

Are there new particular rules in selfies 2.0 for branches and rings? I did not find them in the follow-up paper "SELFIES and the future of molecular string representations".

Thank you in advance,
Etienne Reboul

Error handling and general sense on SELFIES generation

Hey! Where can I read about error handling and general SELFIES BS filter?
For example:
sf.encoder('1243124124') - will just hang for a very long time; should it throw an error right away?
sf.encoder('SOMETHINGWRONGHERE') will generate '[S][O][M][E][T][H][I][N][G][W][R][O][N][G][H][E][R][E]'

This is linked to an issue I have processing a list of compounds that returned an error, while all SMILES were verified with RDKit.

--> Acquiring data... Finished acquiring data. Representation: SMILES --> Translating SMILES to SELFIES... 'NoneType' object has no attribute 'find'

question about output molecules

Hi,

I have a general question about your code for the VAE or GAN. I've run your code, and the output molecules are very diverse.
To generate molecules similar to a query molecule, I'm wondering if there is a way to make the output molecules similar to the query molecule, perhaps by changing some parameters for decoding or sampling (instead of using the similarity criteria as a filter)?
Thank you!

Clarify Python Version

Related to my discussion in #84, there are inconsistent versions of Python for selfies:

  • setup.py says 3.5-3.8
  • tox tests only 3.6
  • CI env is 3.7

I would propose targeting all currently supported versions of Python (3.7-3.10). Python 3.6 reached end of support last year and, for example, numpy dropped support for 3.6 in v1.20.0 (2021). We should also align the versions in tox and setup.py to test the stated versions.

Question Regarding Loss Function

Hello, I'm looking to extend your work to another application. We are planning to use a Beta-VAE, but I noticed you already have a constant alpha in front of the KLD; it's also very, very low (0.0001). A conventional VAE loss would have a 1 there; can you provide some intuition into what the function of the low KLD_alpha is before I start playing around?

Thanks

Update for index table documentation needed ??

Hello ,
I am trying to improve on the stoned selfies mutation protocol by comprehensively splitting the SELFIES into the "main" chain and the different branches and rings.
However, I think the current index table in the documentation is for a deprecated version of selfies: https://selfies.readthedocs.io/en/latest/tutorial.html#index-symbols

Is there a function that I can call to give me the index table, like selfies.get_semantic_robust_alphabet() for the alphabet?

Thank you in advance

Mistakes on encoder/decoder nitro group

Hello everyone

I was performing some tests and got mistakes going from SELFIES to SMILES with nitro groups, generating N,N-dihydroxyamine.

The code follows:

import selfies as sf
import rdkit
from rdkit import Chem
from rdkit.Chem import Draw
from IPython import display

def evalSmilesAndSelfies(input):
    # round-trip SMILES -> SELFIES -> SMILES and draw both structures
    strucFromSelfies = sf.decoder(sf.encoder(input))
    ms = [Chem.MolFromSmiles(smi) for smi in (input, strucFromSelfies)]
    fromSmiles = 'fromSmiles:\n' + input
    fromSelfies = 'fromDecodedSelfies:\n' + strucFromSelfies
    names = [fromSmiles, fromSelfies]
    img = Draw.MolsToGridImage(ms, molsPerRow=2, subImgSize=(300, 300), legends=names)
    return display.display(img), print(f'smiles: {input}, selfies: {strucFromSelfies}, encoded: {sf.encoder(input)}')

Results (images omitted):

When I use the charged version, the nitro group is OK.

It may be caused by a mistake in the decoder to SMILES when dealing with

-N(=O)=O

because the SELFIES string before decoding is apparently OK:
[C][N][Branch1_2][C][=O][=O]

thanks

Tokenization of unbonded atoms separated by dot (.)

selfies_to_hot does not seem to tokenize unbonded atoms that are separated by dots correctly.

Consider the following example; I'm using selfies v0.2.4.
For convenience, I copy the code of selfies_to_hot; I added a print statement to show the effect.

import numpy as np

def selfies_to_hot(molecule, largest_smile_len, alphabet):
    """
    Go from a single selfies string to a one-hot encoding.
    """
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    # integer encode input smile
    len_of_molecule = len(molecule) - len(molecule.replace('[', ''))
    for _ in range(largest_smile_len - len_of_molecule):
        molecule += '[epsilon]'

    selfies_char_list_pre = molecule[1:-1].split('][')
    selfies_char_list = []
    for selfies_element in selfies_char_list_pre:
        selfies_char_list.append('[' + selfies_element + ']')
    print(f"Tokenized molecule {selfies_char_list}")
    integer_encoded = [char_to_int[char] for char in selfies_char_list]

    # one hot-encode input smile
    onehot_encoded = list()
    for value in integer_encoded:
        letter = [0 for _ in range(len(alphabet))]
        letter[value] = 1
        onehot_encoded.append(letter)

    return integer_encoded, np.array(onehot_encoded)

Example:

>>> from selfies import encoder, selfies_alphabet
>>> s = 'O=S(=O)([O-])[O-].[Na+].[Na+]'
>>> selfies = encoder(s)
>>> print(selfies)
[O][=S][Branch1_1][epsilon][=O][Branch1_3][epsilon][O-expl][O-expl].[Na+expl].[Na+expl]

This is OK. Now define the selfies_to_hot method somewhere (it does not seem to be part of the package):

>>>selfies_to_hot(selfies, 50, selfies_alphabet)
Tokenized molecule:
 ['[O]', '[=S]', '[Branch1_1]', '[epsilon]', '[=O]', '[Branch1_3]', '[epsilon]', '[O-expl]', '[O-expl].[Na+expl].[Na+expl]', '[epsilon]', '[epsilon]', '[epsilon]', '[epsilon]', '[epsilon]', '[epsilon]', '[epsilon]', '[epsilon]', '[epsilon]']

(It may also print a KeyError for [=S] if this default selfies_alphabet is used, but that's not my concern here)
As we see, [O-expl].[Na+expl].[Na+expl] is parsed as one entity, which does not seem right. I would propose either adapting the encoder to also encapsulate dots in square brackets (...[O-expl][.][Na+expl]...) or changing selfies_to_hot to split based on a regexp instead of split('][') (a small sketch of the latter follows below).
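
A minimal sketch of the regexp-based splitting proposed above (just an illustration; newer versions of selfies also provide selfies.split_selfies for tokenization):

import re

def split_selfies_with_dots(selfies_str):
    # keep bracketed symbols and '.' separators as separate tokens
    # (hypothetical helper, not part of the selfies package)
    return re.findall(r'\[[^\]]*\]|\.', selfies_str)

s = '[O][=S][Branch1_1][epsilon][=O][Branch1_3][epsilon][O-expl][O-expl].[Na+expl].[Na+expl]'
print(split_selfies_with_dots(s))
# ['[O]', '[=S]', '[Branch1_1]', '[epsilon]', '[=O]', '[Branch1_3]', '[epsilon]',
#  '[O-expl]', '.', '[Na+expl]', '.', '[Na+expl]']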

Functions for Quotient Graph and SELFIES

We'd like to add a couple of functions to go between quotient graphs and SELFIES, building on the selfies.kekulize module. This would be included in a shared qgraph.py file.

The functions we would like to add are:

  • qg_encoder -> Converts SELFIES string (input) into a quotient graph (output)
  • qg_decoder -> Converts quotient graph (input) into a SELFIES string (output)

This is applicable to crystal structures, which are not represented as molecular graphs. In reference to the crystal structure Discord, a crystal is represented as an infinite periodic graph (see https://application.wiley-vch.de/books/sample/3527409394_c01.pdf). The infinite periodic graph has a translation action which, when factored out, results in the quotient (or fundamental finite) graph. (See the derivation here: https://link.springer.com/content/pdf/10.1007/s11537-012-1144-4.pdf) Quotient graphs are desirable for crystals because they can represent (by nodes and edges) the infinite periodic graphs in a finite graph, while preserving connectivity.

Limitations of SELFIES to represent stereochemistry

I was wondering what types of stereochemistry can be represented by SELFIES. A similar but not identical question raised in #38 was answered by referring to the CHANGELOG of v1.0. It seems that that version simply brought SELFIES up to parity with SMILES. The latter only supports tetrahedral centers and chirality along double bonds; other forms are not supported: square planar, octahedral, trigonal-bipyramidal. The OpenSMILES specification does seem to include these latter three cases, though. As far as I am aware, axial and planar chirality are not supported in either SMILES or OpenSMILES. What is the support for these cases in SELFIES?

To get SELFIES characters of specific substructure

First of all, thank you for your wonderful research about molecule representation.

While using selfies, I have one question regarding its utilization.
Is there any method that can extract the exact locations of a specific substructure among the SELFIES symbols of a molecule?
I tried to do this by manually checking the result after removing some characters, but I couldn't get the result I wanted.

I would be very grateful if you could give me an answer.
Thanks,
Sejeong.

Regarding of None from smiles

Hi there!

CC(C)(C)C1=CC(=C/C=C/c2cc(C(C)(C)C)[te+]c(C(C)(C)C)c2)C=C(C(C)(C)C)[Se]1
CC(C)(C)C1=CC(=C/C=C/c2cc(C(C)(C)C)[te+]c(C(C)(C)C)c2)C=C(C(C)(C)C)O1
CC(C)(C)C1=CC(=C/C=C/c2cc(C(C)(C)C)[te+]c(C(C)(C)C)c2)C=C(C(C)(C)C)[Te]1
CN(C)c1ccc(-c2cc(-c3ccc(N(C)C)cc3)[te+]c(-c3ccc(N(C)C)cc3)c2)cc1
CCc1nc#cn1CC1CC(c2ccccc2)(c2ccccc2)C(=O)O1
CN(C)c1ccc(-c2cc(-c3ccccc3)[te+]c(-c3ccc(N(C)C)cc3)c2)cc1
CN(C)c1ccc(-c2cc(-c3ccccc3)[te+]c(-c3ccc(N)cc3)c2)cc1
O=C(NCCCCN1CCN(c2cccc(Cl)c2Cl)CC1)c1cc2ccccc2[te]1
COc1ccccc1N1CCN(CCCCNC(=O)c2cc3ccccc3[te]2)CC1
CN(C)c1ccc(-c2cc(-c3ccc(N)cc3)[te+]c(-c3ccc(N4CCOCC4)cc3)c2)cc1
CN(C)c1ccc(-c2cc(-c3ccc(N)cc3)[te+]c(-c3ccc(N)cc3)c2)cc1
Nc1ccc(-c2cc(-c3ccccc3)cc(-c3ccc(N4CCOCC4)cc3)[te+]2)cc1
Nc1ccc(-c2cc(-c3ccc(N4CCOCC4)cc3)cc(-c3ccc(N4CCOCC4)cc3)[te+]2)cc1
Nc1ccc(-c2cc(-c3ccc(N4CCOCC4)cc3)cc(-c3ccc(N)cc3)[te+]2)cc1

With these SMILES, I can't get a SELFIES result.

Could you check this issue, please?

Generative model are producing invalid molecules with SELFIES?

Hi, I love the idea of using SELFIES instead of SMILES for higher robustness. So recently I have been working on a generative model that generates SELFIES instead of SMILES. However, I still got some predicted SELFIES that lead to invalid molecules (they cannot be parsed by RDKit, i.e., Chem.MolFromSmiles() returns None). I also observed that all such molecules contain 'B' (boron). Did I miss anything?
SELFIES version: 1.0.1
rdkit version: 2021.03.1
I used selfies.decoder() to convert SELFIES to SMILES.
Attached are some example molecules.
Many thanks!
FailedMolecules.csv

upgraded selfies?

Has there been a redefinition of how selfies get generated in the last month or so? My old code that uses this fails a bunch of assertion tests:

e.g. an aromatic benzene (c1ccccc1) smiles used to be converted to [c][c][c][c][c][c][Ring1][Branch1_1] and is now converted to kekule benzene: [C][=C][C][=C][C][=C][Ring1][Branch1_2]

Thanks for the info

SELFIE encoding fails with PF6- or hypervalent Phosphorus

Hello everyone,

please find below an example of the transformation of a SMILES of the PF6- anion to a SELFIES.
This transformation does not respect stoichiometry.
Could you advise what I am doing wrong, or if there is a possible workaround?

fail_smiles = "F[P-](F)(F)(F)(F)F"

hypervalent_sf = sf.encoder(fail_smiles, strict=False)  
#is '[F][P-1][Branch1][C][F][Branch1][C][F][Branch1][C][F][Branch1][C][F][F]'

standard_derived_smi = sf.decoder(hypervalent_sf)

sf.set_semantic_constraints("hypervalent")
sf.decoder(hypervalent_sf) # is 'F[P-1](F)(F)CF'

Encoder differences

It seems there were some major updates in the release of package version 1.X.
I used to use v0.2.4; now I tried to switch to 1.0.1. All my unit tests are failing, even though from the release message it seems there were only minor changes.

E.g. with the old version

>>>encoder('c1cnoc1')
'[c][c][n][o][c][Ring1][Ring2]'

The new version instead

>>>encoder('c1cnoc1')
'[C][C][=N][O][C][Expl=Ring1][Branch1_1]'

Is this expected and considered a minor change? I mean mostly to go from Ring2 token to Branch1_1. I have several similar cases (can report them if needed).
Can you maybe comment on whether you prefer users to use 1.0.1 instead of v0.2.4? Are there red-flag errors in the old version? If I update now, I would need to repeat a chunk of work and retrain some generative models (e.g. since the vocabulary seems different). Also, is this version considered stable for the moment? If you plan to update the package again in a few weeks, I would probably wait before bumping up my dependency.

Thanks for the good work :)

arxiv paper related questions

1- In Table 3, what are the letters for each representation, especially for SELFIES? Is the SELFIES alphabet the one described in Table 2? If so, is there a way to generate these letters from the current SELFIES output?

2- In Table 3, the QM9 dataset has 14 characters for the SELFIES alphabet, and the organic semiconductor dataset has 22 characters. Do these rules change from dataset to dataset? In our case, will these rules (number of characters) change if we use a database, for instance from PubChem?

3 - We are having difficulties interpreting Section 4.1 and Table 4. What do the non-terminal symbols and the full alphabet correspond to?

The arxiv paper mentions the possibility of a workshop, and we would actually be interested in a short webinar as well.

Randomization feature

Per discussion here, it would be nice to have a randomization feature native in SELFIES. Currently, we use rdkit randomization of SMILES, but it would be better to not require that extra step of converting to SMILES.

Since MolecularGraph is a directed rooted graph, it's not obvious to me how this can be done. The normal way I think is to permute the adjacency matrix, but that's not possible with a directed rooted graph.

decoder(encoder(in)) != in (specific case and suggestion)

I have found a very specific case in which it appears the decoder applied to an encoded canonical SMILES string does not yield the same input. That said, it is actually the same molecule, so I am not sure whether this is unintended, so to speak. Here is the code to reproduce.

in_smiles = 'Cc1nnc(C2CC23CCN(c2ncccc2Br)CC3)o1'
x = selfies.encoder(in_smiles)
out_smiles = selfies.decoder(x)

In: in_smiles
Out: 'Cc1nnc(C2CC23CCN(c2ncccc2Br)CC3)o1'

In: out_smiles
Out: 'Cc1nnc(C2CC22CCN(c3ncccc3Br)CC2)o1'

Note the subtle difference. However, explicitly canonicalizing them fixes the issue.

from rdkit import Chem
a = Chem.CanonSmiles(out_smiles)
b = Chem.CanonSmiles(in_smiles)
a == b == in_smiles  # True

where

In: a
Out: 'Cc1nnc(C2CC23CCN(c2ncccc2Br)CC3)o1'

I do not know enough about how this encoding works to suggest an explicit solution. However, if you're willing to include rdkit dependencies, this does seem to address the ambiguity! I think further testing would be required to be sure.

SELFIES that can't be parsed by RDKit

Hi Mario,
Here is a list of the invalid SELFIES from our recent preprint as you requested. Among the ~220m SELFIES we generated there seem to have been 22,387 that couldn't be parsed for one reason or another.
I hope it wasn't something wrong on my end. Just to be sure, here is the code I used to identify invalid SELFIES:

from rdkit import Chem
from selfies import decoder

smiles = decoder(selfies)
mol = Chem.MolFromSmiles(str(smiles))
if mol is None:
    raise ValueError("invalid SELFIES: " + str(smiles))

Hope this is helpful.
invalid-SELFIES.txt

Index symbol customization questions/enhancement suggestions

Hi all - I have some questions and enhancement suggestions regarding the index symbols. Specifically whether reusing the same tokens to represent atoms/rings/branches and to calculate state Q makes the syntax more difficult for a neural network to learn, and if it would be possible to customize which symbols represent which indices.

Motivation:
Per the readthedocs the current SELFIES index symbol list is the following:

Index  Symbol
0      [C]
1      [Ring1]
2      [Ring2]
3      [Branch1_1]
4      [Branch1_2]
5      [Branch1_3]
6      [Branch2_1]
7      [Branch2_2]
8      [Branch2_3]
9      [O]
10     [N]
11     [=N]
12     [=C]
13     [#C]
14     [S]
15     [P]

By reusing the same tokens for both determining the state Q and representing atoms, I wonder if the process for a neural network to learn the syntax isn't made more difficult. For example, an embedding layer might try to embed [C] and [=C] close together because both represent carbon atoms, while at the same time trying to embed them far apart because these tokens also represent indices that are far apart (0 vs. 12). These tokens represent different things based on context, and eliminating this context dependence could lead to better performance for translation/representation tasks. Additionally, there may be different sets of optimal symbols to represent indices based on the frequency of tokens within a given dataset. For example, a generative model trained on a dataset without phosphorus or sulfur atoms may incidentally generate molecules that contain these atoms due to decoding errors that generate these tokens in positions that are not being used to determine the state Q.

I understand that reusing the tokens to calculate the state Q allows for the 100% validity of SELFIES, but I'm wondering if it's possible to maintain this 100% validity while making the syntax more easily learned, and possibly customizable based on user needs. My questions are therefore:

  1. How were the symbols that represent each index decided? Frequency based within a large dataset?
  2. Would it be possible to allow users to customize which tokens represent which indices?
  3. Would it be possible to use characters that specifically represent indices (i.e. index tokens) when encoding a SMILES string into a SELFIES, but internally convert these index tokens back into tokens that don't break the 100% validity of SELFIES when decoding back to SMILES?

==== EXAMPLE ===

Below is an example to illustrate what I mean by index tokens. A new index symbol table might look like this:

Index  Internal Symbol  External Symbol
0      [C]              [Index0]
1      [Ring1]          [Index1]
2      [Ring2]          [Index2]
3      [Branch1_1]      [Index3]
4      [Branch1_2]      [Index4]
5      [Branch1_3]      [Index5]
6      [Branch2_1]      [Index6]
7      [Branch2_2]      [Index7]
8      [Branch2_3]      [Index8]
9      [O]              [Index9]
10     [N]              [Index10]
11     [=N]             [Index11]
12     [=C]             [Index12]
13     [#C]             [Index13]
14     [S]              [Index14]
15     [P]              [Index15]

benzene SMILES: c1ccccc1

current benzene SELFIES: [C][=C][C][=C][C][=C][Ring1][Branch1_2]

With the new index table, rather than representing the state Q with token [Branch1_2] in the SELFIES string, we could represent it with [Index4], so the new benzene SELFIES that is encoded with index tokens would be: [C][=C][C][=C][C][=C][Ring1][Index4]

When decoding any of these SELFIES back into SMILES, all "external index symbols" could first be replaced with corresponding "internal index symbols" so that the 100% validity is maintained. Networks may learn the syntax that uses the index symbols more easily, however, because each token only corresponds to one action, rather than representing either a state calculation or an atom depending on the context.

This external/internal symbol idea could be made customizable by allowing users to define which internal index symbols are mapped to external index symbols.
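
A minimal sketch of the proposed external-to-internal conversion before decoding (all names and the [IndexN] tokens are hypothetical; nothing here is part of selfies):

# Hypothetical mapping between "external" index tokens and the internal
# symbols from the table above (illustration only).
INTERNAL_TO_EXTERNAL = {
    "[C]": "[Index0]", "[Ring1]": "[Index1]", "[Ring2]": "[Index2]",
    "[Branch1_1]": "[Index3]", "[Branch1_2]": "[Index4]", "[Branch1_3]": "[Index5]",
    "[Branch2_1]": "[Index6]", "[Branch2_2]": "[Index7]", "[Branch2_3]": "[Index8]",
    "[O]": "[Index9]", "[N]": "[Index10]", "[=N]": "[Index11]",
    "[=C]": "[Index12]", "[#C]": "[Index13]", "[S]": "[Index14]", "[P]": "[Index15]",
}
EXTERNAL_TO_INTERNAL = {v: k for k, v in INTERNAL_TO_EXTERNAL.items()}

def to_internal(selfies_tokens):
    # replace external [IndexN] tokens by internal symbols before decoding
    return ''.join(EXTERNAL_TO_INTERNAL.get(t, t) for t in selfies_tokens)

tokens = ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Index4]']
print(to_internal(tokens))  # [C][=C][C][=C][C][=C][Ring1][Branch1_2]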

Final thoughts:
Apologies if any of these suggestions/questions have been proposed or answered before; I took a look through the original SELFIES paper and the open and closed issues on GitHub and didn't immediately see anything related.

I don't see how the internal vs. external index symbols could break the syntax, but perhaps there is a reason not to do this that I missed. Additionally, adding the extra step of converting external index symbols into internal index symbols before decoding into SMILES adds complexity to the process for what may be no additional gain in performance. I don't know whether a new syntax that uses tokens specifically for state Q calculation is actually more easily understood by neural networks/humans, but I thought it may be worth testing.

Regarding the customization of the index symbols, one problem that I foresee is that index tables would need to be shared to properly decode SELFIES that use custom tables. Again, additional complexity for an unknown amount of performance gain (if any).

Selfies alphabet token changes

sf.encoder('C#[S-]')

outputs

'[C][#S-expl]'

which uses a token [#S-expl] that does not appear on get_semantic_robust_alphabet. The closest token I see is [#S+1expl].

This arises by running the following code:

start_selfies = '[C][#S-1expl]' 
start_smiles = sf.decoder(start_selfies)
end_smiles = rdkit.Chem.MolToSmiles(rdkit.Chem.MolFromSmiles(start_smiles))
print('smiles change', start_smiles, end_smiles)
end_selfies = sf.encoder(end_smiles)
print('selfies change', start_selfies, end_selfies)
smiles change C#[S-1] C#[S-]
selfies change [C][#S-1expl] [C][#S-expl]

Is this intended behavior? I'm having this happen while running STONED and it's changing my alphabet. Any ideas would be appreciated!

Not using <sos> tokens?

Hi,

I was going through the example and I found that you are not using any <sos> or <eos> tokens to mark the beginning and end of a SELFIES. At least I did not see them and maybe I just missed them.
EDIT: As you are only using an RNN for decoding, I am asking more about the <eos> token.

I was wondering what the reasons were to decide against using the above-mentioned tokens?

Greetings,
Janosch

SELFIES for defects

Is it possible to extend selfies to the description of defects in solids, somewhat like Kroeger-Vink type notation? Maybe it is as simple as using SELFIES to define the part of the ideal solid that is removed and giving the SELFIES string for it, using SELFIES to describe the object that is put in its place (assuming that we know its bonding pattern), and somehow defining the connections between them.

Generate molecule string from atom properties

Hello

Is there a way with selfies to generate/synthesise a valid molecule string from a set of properties,
like atom type, hybridization, or number of hydrogen bonds?
Or to generate a molecular string given a set of parameters that one can sample?

Thanks

Irritation with 'selfies_to_encoding' and 'encoding_to_selfies' dictionary

Hi everyone,

I just wanted to draw attention to some irritation/minor bug when using the selfies_to_encoding and encoding_to_selfies functions for integer tokenization of SELFIES.

I followed the instructions in your "Integer and one-hot encoding SELFIES" paragraph and created the vocab_stoi with shape {token: int} as you mentioned:

tokens = sf.get_alphabet_from_selfies(selfies)
vocab_stoi = {token: i for i,token in enumerate(tokens)}

With this dict, the selfies_to_encoding function works fine; however, the encoding_to_selfies function raises a KeyError because it expects a dict with shape {int: token}, as I've seen in your code. An easy workaround would be the creation of a second dict with swapped keys and values, like:

self.vocab_itos = dict((v,k) for k,v in self.vocab_stoi.items())

However, I thought it would be nice to use the same vocab for tokenization and detokenization and avoid confusion concerning these functions. Thank you in advance!

Questions for the grammar

Hi,

Thanks for your effort to keep updating SELFIES. I'm confused by the grammar in the recent release, as the encoding result is different from the original paper for the example molecule 3,4-Methylenedioxymethamphetamine (MDMA).

>>> import selfies as sf
>>> sf.encoder("CNC(C)CC1=CC=C2C(=C1)OCO2")                                                       
'[C][N][C][Branch1][C][C][C][C][=C][C][=C][C][=Branch1][Ring2][=C][Ring1][=Branch1][O][C][O][Ring1][=Branch1]'

I used RDKit version 2020.09.1.

Thanks in advance.

Translating SMILES to SELFIES

Hi all,

Very nice paper, I'm looking forward to learning more about the SELFIES representation and using them in my own generative models. I'm wondering if you could provide an example of translating a SMILES (or a molecular graph) to a SELFIES string? The example of going from SELFIES to SMILES was helpful but I am having trouble developing an intuitive sense for how certain SELFIES tokens are chosen. For instance, if we take this aromatic molecule from the ZINC dataset with an input SMILES of
Clc1ccccc1-c1nc(-c2ccncc2)no1


the resulting SELFIES is
[Cl][C][=C][C][=C][C][=C][Ring1][Branch1_2][C][=N][C][Branch1_1][Branch2_2][C][=C][C][=N][C][=C][Ring1][Branch1_2][=N][O][Ring1][O].

My assumption was that [Ring] tokens were placed before the atoms in that ring but that does not seem to be the case. I am also unclear on the meaning of having two consecutive [Branch] tokens and why there are two [O] tokens when there is only one oxygen atom in the molecule. Any tips for developing my intuition for SELFIES would be greatly appreciated!

Feature request: Retrieve mapping between SMILES and SELFIES tokens

Is it possible to get a map which SMILES tokens were used to generate which SELFIES tokens (or v.v.)?

I am looking for a feature like this:

>>>smiles = 'CCO'
>>>encoder(smiles, get_mapping=True)
([C][C][O], [0,1,2])

In this simple example [0,1,2] would imply that the first SMILES token (C) is mapped to the first selfies token ([C]) and so on.

Motivation:
I think this feature could be very useful to close the gap between RDKit and SELFIES. One example is scaffolds. Say we have a molecule, want to retrieve its scaffold and decorate it with a generative model. With SMILES it's easy (see example below), but with SELFIES it's not possible (as far as I understand).

My questions:

  1. Is it, in principle, possible to obtain such a mapping?
  2. If yes, is there already a way to obtain it with the current package?
  3. If no, is this a feature that seems worth implementing?

Discussion:
Such a mapping would imply a standardized way of splitting the strings into tokens. Fortunately, we have split_selfies already; regarding SMILES, I think that the tokenizer from the Found in Translation paper could be a good choice since it's widely used. (I'm using that tokenizer in the example below.)

==== EXAMPLE ===
This is just the appendix to the post. It's an example for how to retrieve which SMILES tokens constitute the scaffold of a given molecule. As it appears to me, this is currently not possible with SELFIES.

First, some boring setup:

from rdkit import Chem
from selfies import encoder, decoder, split_selfies
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from pytoda.smiles.processing import tokenize_smiles
import re 

# Setup tokenizer
NON_ATOM_CHARS = set(list(map(str, range(1, 10))) + ['/', '\\', '(', ')', '#', '=', '.', ':', '-'])
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)
smiles_tokenizer = lambda smi: [token for token in regexp.split(smi) if token]

Example molecule (left) and RDKit-extracted scaffold (right) (screenshot omitted).

smiles = 'CCOc1[nH]c(N=Cc2ccco2)c(C#N)c1C#N'
mol = Chem.MolFromSmiles(smiles)
atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
scaffold = GetScaffoldForMol(mol)
# List of ints pointing to scaffold atoms as they occur in SMILES
scaffold_atoms = mol.GetSubstructMatches(scaffold)[0]
smiles_tokens =  smiles_tokenizer(smiles)


atom_id = -1
for token in smiles_tokens:
    if token not in NON_ATOM_CHARS:
        # Found atom
        atom_id += 1
        if atom_id in scaffold_atoms:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    else:
        # Non-Atom-Chars
        if (atom_id in scaffold_atoms and atom_id+1 in scaffold_atoms) or atom_id==scaffold_atoms[-1]:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    

Output will be:

C --> not on scaffold
C --> not on scaffold
O --> not on scaffold
c --> on scaffold
1 --> on scaffold
[nH] --> on scaffold
c --> on scaffold
( --> on scaffold
N --> on scaffold
= --> on scaffold
C --> on scaffold
c --> on scaffold
2 --> on scaffold
c --> on scaffold
c --> on scaffold
c --> on scaffold
o --> on scaffold
2 --> on scaffold
) --> on scaffold
c --> on scaffold
( --> not on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold
) --> not on scaffold
c --> on scaffold
1 --> on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold

Trying to achieve the same with SELFIES does not seem to work. This is because selfies.encoder does not fully preserve the order of the tokens passed. It preserves it to a large extent (which is great), but around ring symbols it usually breaks. I feel like I would need to reverse-engineer the context-free grammar to solve this.
Here would be the tokens in SMILES and SELFIES respectively:

C [C]
C [C]
O [O]
c [C]
1 [NHexpl]
[nH] [C]
c [Branch1_1]
( [Branch2_3]
N [N]
= [=C]
C [C]
c [=C]
2 [C]
c [=C]
c [O]
c [Ring1]
o [Branch1_1]
2 [=C]
) [Branch1_1]
c [Ring1]
( [C]
C [#N]
# [C]
N [Expl=Ring1]
) [=C]
c [C]
1 [#N]
C
#
N
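
For reference, newer versions of selfies expose the attribution feature shown in the Usage section above, which provides this kind of token mapping. A minimal sketch, assuming a version whose encoder supports attribute=True:

import selfies as sf

smiles = 'CCO'
selfies_str, attr = sf.encoder(smiles, attribute=True)

for attr_map in attr:
    # each AttributionMap links an output SELFIES token (token, index)
    # to the input SMILES tokens that produced it
    print(attr_map.index, attr_map.token, attr_map.attribution)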

When using sf.encoder to encode something that contains *, [(R6)n], [(Rc)p], [alkyl], [(CH2)n], etc., it will not be resolved and an error will be reported.

Although I can skip them, SMILES like these make up a large part of my data set. The following are some examples of SMILES that cannot be encoded:
C1=CC2=C(C(=CC(=C2)C(N3C4=NC(=NC(=C4N=N3)[R2])[R1])([R4])[R3])[(R6)n])N=C1
C(C(=[Z])C1[(CH2)n]N(C2=[M]C(=[L]C(=[K]2)N([R6a])[R5a])[Y])[(CH2)n]1)(C)m([R3])[R2]
C1CC2C(c(cC1N2C(=O)N3C4=C(C(=C(C(=C4[R1a])[R1b])[R1c])[R1d])[F,Cl,Br,I]C(C3[(R2a)p])[(R2b)r])[R4])[R3]
C1=C(C=C(C(=C1)N2C=C(C3=C(C=C(C=C32)[(Ra)m])O)[(Rb)n])[(Rc)p])O
C1=CC(=C2C(=C1Br)C=C(N2)[(Rb)n])[(Ra)m]
C1=C(C=C(C(=C1)[F,Cl,Br,I])[(Rc)p])O
CN1C(C[][(CH2)n]C1[Ra])[Rb]
C(C[F,Cl,Br,I])C[
]CC1C(C(CC1([R1])[R2])([R3])[R4])[Y]C(C([R7])([R8])[Z])([R5])[R6]
CC(C)S(=O)(=O)C(C[R4])[(CH2)n]C1=CC(=C(C=C1)[R5])[R6]
C1=CC(=C(C=C1[(CH2)n]C(C[R4])S(=O)(=O)NC2=NC(=C([R1])S2)[R2])[R6])[R5]
C1=CC(=C(C=C1[(CH2)n]C(C[R4])C(=O)NC2=NC3=C(N=C(C=C3)[R3])S2)[R6])[R5]
CC(C)C1=CC=CC(=N1)[(R5)n]
CC(C)C1=NC=NC(=N1)[(R5)n]
CC(C)C1[(CH2)p]CC(C[(CH2)q]1)N([R])[][R][B]
CC(C)N1[(CH2)p]CN(C[(CH2)q]1)[R][
]
CC(C)N1[(CH2)r]CC2([(CH2)s]CN([(CH2)q]2)[R][])[(CH2)p]1
C1=C(N=C(N)N=C1)N2[(CH2)p]C3C([(CH2)v]2)[(CH2)q]N([(CH2)t]3)[R][
]
C1=C([R2])[X5][X4][X3]2X1C3[(CH2)n]C([R1])[(CH2)n]3
C1=CN2C(=CN=C2C=C1[R2])C3[(CH2)n]C([R1])[(CH2)n]3
C1=C([R2])[X5][X4][X3]2X1C3[(CH2)n]C(N)[(CH2)n]3
C1CN(C(CN1C2CN(C2)C(=O)[Z])[(R1)s])C(=O)[Y]
C(=C\1/C(=O)N([R2])[Y][F,Cl,Br,I]1)(/C2=[Z6][Z7]=C([Z5]2)[])[R3]
C(=C\1/C(=O)N(C(=[Q])[F,Cl,Br,I]1)[R2])(/C2=[Z1]C(=[Z4][Z3]=[Z2]2)[
])[R3]
C1=C(C=C(C=C1/C(=C/2\C(=O)N(C(=[Q])[F,Cl,Br,I]2)[R2])/[R3])[])[R1]
C(=C\1/C(=O)N(C(=[Q])[F,Cl,Br,I]1)[R2])(/C2=[Z6][Z7]=C([Z5]2)[
])[R3]
C1=C(/C(=C/2\C(=O)N(C(=[Q])[F,Cl,Br,I]2)[R2])/[R3])OC(=C1[R1])[]
C(=C\1/C(=O)N(C(=[Q])[F,Cl,Br,I]1)[R2])(/C2=[Z10]C(=[Z9][Z8]2)[
])[R3]
C1=C(/C(=C/2\C(=O)N(C(=[Q])[F,Cl,Br,I]2)[R2])/[R3])SC(=C1[])[R1]
C1(=C(C2=[D3]D2[(R3)n])[F,Cl,Br,I]C(=C1[R2])C3=[Y][Z]=D6[D5]=[D4]3[(R4)q])[R1]
C1=C(C2=C(C(=C(C3=[Y][Z]=C(C=C3[(R4)q])[L2])[F,Cl,Br,I]2)[R2])[R1])B=[
]C(=C1[(R3)n])[L1]
C1(=C(C2=D3[D2]=[D1]*[L1])SC(=C1[R2])C3=[Y]Z[L2])[R1]
C1=C(C(=NZ[L2])C2=C(C(=C(C3=C(C=C([L1])[]=N3)[(R3)n])S2)[R1])[R2])[(R4)q]
CC1cc[(CH2)n]OC1C
Cc1c(CCNC(=O)[R1b])C2=C(C=CC3=C2[Ea][(CH2)n]O3)[X']1
C=[CH]C1CC(C(C1)[M3][M2][M1]Y[F,Cl,Br,I])[(CH)n]=C
[R4][(Ph)a][
][H][R3]
c1cc(ccc1/C=C/[R3a])c2ccc(cc2)[alkyl]
c1c([H][(H)r][R0])cc(c(c1[Y1])[X0])[Y2]
c1c(cc(c(c1F)C(F)(F)Oc2cc(c(c(c2)[Y1])[X0])F)[Y2])[B][][R0]
c1c([H][H][alkenyl])cc(c(c1[Y1])[X0])[Y2]
CC@H/C=C(\C=C[C@H]1C@([Ra])[R'])/[a]
[XH][Y][(Z)n]SS[(Z)n][Y][XH]
NO[(CH2)n][Y]
C1=CC(=C(C(=C1[F,Cl,Br,I])C2=C(C(=NN(C2=O)[B])[
])O[G])[Z])[Y]
C1=C(C=[F,Cl,Br,I]N(C1=O)C2=C(C=C3C4=C(C5N(C(C4)[R4])[(CH2)r]5)N(C3=[Z]2)[R1])[(R5)s])[L]B([R14])([R13])[R15]
C1=C(C=[F,Cl,Br,I]N(C1=O)C2=CC(=C3C4=C(CC5N(C4[R4])[(CH2)r]5)N(C3=[Z]2)[R1])[(R5)s])[L]B([R14])([R13])[R15]
C1C([(CH2)p]CN1[Q][B][])[]
C1COCC(CC2=C[3H]C3=C2C(=[V]C(=[]3)[R23])[R24])N1C(=S)N
C1(=C(O)[
]=C([R24])[V]=C1[R23])I
C1C(CN2C([R2])[(CH2)m]*[K]C3=[F,Cl,Br,I]C=[]C4=[V]C=C([R1])[U]=C34)OC(=O)N1[G]
C1=C([R1])[U]=C2C(=[F,Cl,Br,I]C=[
]C2=[V]1)[L1]
C1=C([R1])[U]=C2C(=[F,Cl,Br,I]C=[]C2=[V]1)N
C(C1=[F,Cl,Br,I]C=[
]C2=[V]C=C([R1])[U]=C12)[L4]
C1=CC(=CC(=C1)[(R19)w])C(=O)NC2=C(NN=C2C(=O)NC3CCN(CC3[(R11)r])[R14a])[R2]
C1=Nc2c(C(=N1)[X1][])n[]c2[R2]
C1=CC=C(C=C1)[Zf]OC2=CC=C(C=C2)N(C3=NC=Nc4c3n[]c4[R2f])[R3f]
C1CC2=CC3=C(C(=NC=N3)[X1][
])N2C1
C1CCC2=CC3=C(C(=NC=N3)[X1][])N2CC1
C1CN2C3=C(C=[
]2)N=CN=C3N1[Y1][]
C1=C(N(C2=C1N=CN=C2[X1][
])[R2])[R1]
C1=C2C(=NN1[R2])C(=NC=N2)[X1][]
CC(C)C1=NN2C(=C1)[
]2
CC(C)C12=CNN=C1[]2
C=C1C(N(C(=[
])N1[Het])[R1])([R2])[R3]
C1=CC(=CC(=C1)B(C2=CC(=CC=C2)[(R2)n])O)[(R2)n]
CC1=C2C(=C(C)S1)[Y][]O2
c1cc2(C(=O)C3=C([C@]4(C@(C@@H[R5])C(C3C(c2(cc1[(R7)n])[R8])([R2])[R1])([R3])[R4])O)O)[R6]
C12=C(C(=C3C(=C1[R7])[(X)n]3)[R7])C([C@]4([H])C(=C([C@]5(C@(C@@H[R5])C4([R3])[R4])O)O)C2=O)([R2])[R1]
C1=CC(=C2C(=C1[(Ra)m])C=C(C(=N2)[(Rb)n])[F,Cl,Br,I][Ar])[R]
C1=C2C(=C(C(=C1)[(RIII)n])[B]N([R]I)[R]I)N=C(C(=C2[R][V])S(=O)(=O)[
])[R]I
C1(N([R4])[]C([R])[(CH2)p]1)[(R6)q]
C([R])([R2])[
]N([R3])[R4]
C1=CC(=C2C=C(C(=NC2=C1C([R2])([R1])[]N([R3])[R4])[(Rb)n])[F,Cl,Br,I][Ar])[(Ra)m]
C1=CC(=C2C(=C1[(Ra)m])C=C(C(=N2)[(Rb)n])S(=O)(=O)[Ar])[R]
C1=CC(=C2C(=C1[(Ra)m])C=C(C[Ar])C(=N2)[(Rb)n])[R]
C1=CC(=C2C=C(C(=NC2=C1C3CC(N(C3)[R4])[(R6)q])[(Rb)n])S(=O)(=O)[Ar])[(Ra)m]
C1CC(C2=C3C(=C(C=C2)[(Ra)m])C=C(C(=N3)[(Rb)n])S(=O)(=O)[Ar])C(CN(C1)[R4])[(R6)q]
CC(C)(C)(C)s([R3a])[R3b]
C1(=C(C(=C2C(C(C(C(N12)[R3])[R4])[R5])[R6])[R7])[R2])C(=O)C(=O)N([R8])[(A1)n][L1][R1]
C1COCC(N1C2=NC3=C(C(=N2)C4=C([2H]=B[
]=C4[R3])[R4])N=C(N3[F,Cl,Br,I][R2])[R1])[(R2)k]
C1COCCN1C2=NC3=C(C(=N2)C4=C(C(=C5C(=C4[R3])NC(=N5)[(R11)n])[R7])[R4])N=C(N3[F,Cl,Br,I][R2])[R1]
C1=CC(=CC(=C1)CN2C=NC3=C2N=C(N=C3C4=CC=CC(=C4)O)N5CCOCC5)[(R13)p]
C[(CH2)n]N1CCF,Cl,Br,I([R4])[R5]
C[(CH2)n]N1C2CCC1CC(C2)C3=CC=CC=C3
C[C@@]1(CC[Z][(CH2)p]1)[R8]
CC1CC2C(C(C)C3(CCC4C5CC[C@@]6([H])CC(=O)CCC6(C)C5CC4=C(C)[(CH2)n]3)O2)N(C1)[R30]
C1=C(C=C(C2=C1N=C(N2[L1][Ar1])[Ra])[Rm])[(L2)n][Ar2]

thank you for your reply!

Attribution Map Encoder Ordering

The attribution map from encoding is sorted by SMILES token, which makes it difficult to align the SELFIES string with the attribution map. See the comment here.

I will work on this when I have some time 😅

SELFIES encoding bug

Dear @MarioKrenn6240 ,

I have an issue with SELFIES encoding. In the following SMILES, when encoded into SELFIES, the SELFIES contains a '[N]' token. In the original SMILES, there are no 'N' atoms. It would be helpful if you could clarify why that is so. When we decode the SELFIES it does decode back to the original SMILES.

SMILES : 'C1C(C(OC2=CC=CC=C21)C3=CC=CC=C3)O'
SELFIES : '[C][C][Branch2_1][Ring1][Branch1_3][C][Branch1_1][N][O][C][=C][C][=C][C][=C][Ring1][Branch1_2][Ring1][Branch2_3][C][=C][C][=C][C][=C][Ring1][Branch1_2][O]'
Decoded : 'C2C(C(OC1=CC=CC=C12)C3=CC=CC=C3)O'

-Kohulan

Issue with decoding an encoded string.

Hello @MarioKrenn6240 ,

While working on a few of our predicted results I came across a problem.

SELFIES version: 1.0.3

1. for the molecule:https://pubchem.ncbi.nlm.nih.gov/compound/145735106

  • Canonical SMILES string: CI(C)I(C)I(C)C (from pubchem)
  • SELFIES: [C][I][Branch1_1][C][C][I][Branch1_1][C][C][I][Branch1_1][C][C][C]
  • Decoded SMILES: CI

2. for the molecule:https://pubchem.ncbi.nlm.nih.gov/compound/145092330

  • Canonical SMILES string:FI(F)C1=CC=2NC(C)CN(C2C=C1C=3C=NN(C3)C)C (Generated using CDK)
  • SELFIES: [F][I][Branch1_1][C][F][C][=C][C][N][C][Branch1_1][C][C][C][N][Branch2_1][Ring1][C][C][Expl=Ring1][Branch1_3][C][=C][Ring1][O][C][C][=N][N][Branch1_1][Ring2][C][Expl=Ring1][Branch1_1][C][C]
  • Decoded SMILES: FI

3. for the molecule:https://pubchem.ncbi.nlm.nih.gov/compound/144364443

Pubchem canonical SMILES

  • Canonical SMILES string: CC1=CC=IOC2=C(CCCC3=C2C=CC(=C3)C)C=C1
  • SELFIES: [C][C][=C][C][=I][O][C][=C][Branch1_1][P][C][C][C][C][=C][Ring1][Branch1_3][C][=C][C][Branch1_2][Ring2][=C][Ring1][Branch1_2][C][C][=C][Ring2][Ring1][Ring1]
  • Decoded SMILES: CC=CCI

Canonical SMILES generated using CDK

  • Canonical SMILES string: O1I=CC=C(C=CC2=C1C=3C=CC(=CC3CCC2)C)C
  • SELFIES: '[O][I][=C][C][=C][Branch2_1][Ring1][Branch1_3][C][=C][C][=C][Ring1][Branch2_2][C][C][=C][C][Branch1_2][Branch2_3][=C][C][Expl=Ring1][Branch1_2][C][C][C][Ring1][O][C][C]'
  • Decoded SMILES: OI

I didn't quite understand what is happening here, and I am not using any modifications to the code.

Another case:

SELFIES: [Cl][Cl][=C][C][C][=C][C][=C][C][=C][C][C][=C][C][Branch1_1][Ring2][C][Expl=Ring1][=C][=C][Ring1][=N][C][Ring1][O][Expl=Ring1][Branch1_3]
Decoded SMILES: ClCl

It seems to be a minor bug, maybe a fix would help us in our evaluations.

-Kohulan.R

One-hot encoding functions

The functions for one-hot encoding SELFIES strings are really helpful, and I think it might be useful to clean them up and include them in selfies.utils or in a new selfies.data_loader module. Some functions to reverse the encodings would be great too. I'm happy to give that a shot and open a PR if there's interest.
