Comments (8)

ndickson-nvidia commented on August 16, 2024

Yep, I'm calling the same RDKit C++ function that eventually gets called by to_inchikey_non_standard, with the same options, so the only difference from unique_id should be that I'm skipping the MD5 hash:

#include <GraphMol/SmilesParse/SmilesParse.h>
#include <INCHI-API/inchi.h>

RDKit::SmilesParserParams params;
std::unique_ptr<RDKit::RWMol> mol{ RDKit::SmilesToMol(smiles_string, params) };
const std::string inchiKeyString = RDKit::MolToInchiKey(*mol, "/FixedH /SUU /RecMet /KET /15T");

from datamol.

maclandrol commented on August 16, 2024

The main purpose is to differentiate it from an InChIKey, and especially to prevent people from interpreting it as one, since we use a non-standard InChI that tries to improve the uniqueness of tautomeric forms of molecules. Because the format and structure of the original InChIKey might not be respected here, we re-hash to avoid any misinterpretation.

ndickson-nvidia commented on August 16, 2024

The IUPAC source describing the InChIKey format is here: https://www.inchi-trust.org/technical-faq/#13.1

An InChIKey should always be 27 characters, the 2 dash characters are redundant, and the "FV" characters might always be "NA" in the unique_id function (N for non-standard, and A for InChI version 1). Even without removing the FV characters, 25 uppercase characters can be encoded in 16 bytes fairly easily (or 15 bytes slightly less easily).
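
As a rough illustration of the layout described above, here is a minimal Python sketch that splits a key into its fields and checks the size bound. The helper name `split_inchikey` is hypothetical, and the field layout (14 letters, dash, 8 letters plus flag and version letters, dash, one final letter) is an assumption taken from the InChIKey description rather than from datamol or RDKit code.

```python
import re

# Assumed layout: 14 letters, '-', 8 letters, flag letter, version letter,
# '-', one final letter. That is 25 letters plus 2 dashes = 27 characters.
_INCHIKEY_RE = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")


def split_inchikey(key: str) -> dict:
    """Split a 27-character InChIKey into its fields (hypothetical helper)."""
    match = _INCHIKEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    first, second, flag, version, last = match.groups()
    return {
        "letters": first + second + flag + version + last,  # dashes dropped
        "flag": flag,        # 'S' = standard, 'N' = non-standard
        "version": version,  # 'A' = InChI version 1
    }


fields = split_inchikey("VERCQLOBLOLFMW-UHFFFAOYSA-N")
print(len(fields["letters"]), fields["flag"], fields["version"])  # 25 S A

# 25 base-26 digits need fewer than 120 bits, hence the 15-byte bound.
bits_needed = (26 ** 25 - 1).bit_length()
print(bits_needed <= 15 * 8)  # True
```

The key used here is the standard one quoted later in this thread, so its flag is `S`; a key produced with non-standard options would be expected to carry `N` instead.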

hadim commented on August 16, 2024

@maclandrol said it all - the only reason for it is to avoid confusion with the regular inchikey.

@DomInvivo I would actually recommend moving to dm.hash_mol instead, which is a natively supported way of hashing a molecule in RDKit (a contribution from folks at Schrödinger).

hadim commented on August 16, 2024

Closing here, but feel free to re-open!

ndickson-nvidia commented on August 16, 2024

No worries! The context was that Graphium currently uses:

mol = dm.to_mol(mol=smiles)
mol_id = dm.unique_id(mol)

to get a unique ID from a SMILES string for identifying if a molecule occurs in multiple datasets. I'm in the process of trying to move most of Graphium's dataset preparation to C++. It wouldn't need to use Datamol for this anymore, so I don't really need any changes in Datamol, and creating an InChIKey from a molecule is probably much slower than doing an MD5 hash. My current plan is to compact the 25 letters of each InChIKey into a pair of 64-bit integers, (26^13 < 2^64, so it's pretty simple), and use the pairs as keys, instead of using 32-character strings as keys.
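
The packing idea can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, the 13 + 12 letter split is one possible choice, and the actual C++ implementation in Graphium may differ (there the two halves would presumably live in `uint64_t` values, whereas Python integers are unbounded).

```python
def _pack(letters: str) -> int:
    """Interpret a string of A-Z letters as a base-26 number."""
    value = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError(f"unexpected character {ch!r}")
        value = value * 26 + (ord(ch) - ord("A"))
    return value


def _unpack(value: int, length: int) -> str:
    """Inverse of _pack for a known number of letters."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, 26)
        chars.append(chr(ord("A") + digit))
    return "".join(reversed(chars))


def pack_inchikey(key: str) -> tuple:
    """Pack the 25 letters of an InChIKey into two integers below 2**64."""
    letters = key.replace("-", "")  # note: dash positions are not validated
    if len(letters) != 25:
        raise ValueError("expected 25 letters after removing dashes")
    return _pack(letters[:13]), _pack(letters[13:])


def unpack_inchikey(pair: tuple) -> str:
    """Rebuild the dashed 27-character key from a packed pair."""
    high, low = pair
    letters = _unpack(high, 13) + _unpack(low, 12)
    return f"{letters[:14]}-{letters[14:24]}-{letters[24]}"


key = "VERCQLOBLOLFMW-UHFFFAOYSA-N"
pair = pack_inchikey(key)
print(all(part < 2 ** 64 for part in pair))  # True, since 26**13 < 2**64
print(unpack_inchikey(pair) == key)          # True
```

Because each half holds at most 13 base-26 digits, both integers stay below 26^13, which is the bound quoted above.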

hadim commented on August 16, 2024

@ndickson-nvidia OK, sounds good, and thanks for the context. Your approach seems legit and fits into a pair of 64-bit integers, but I just want to raise a point about using the "standard" InChIKey.

It's very subtle, but the "standard" InChIKey cannot differentiate between tautomers (two molecules that look very similar but have a few hydrogens located at different positions of the graph). In certain cases that's totally fine, but in other cases it might not be what you want, depending on the downstream application.

To give you more context about the above, I am copying/pasting here a small blurb I have made in the past:


[... the topic of the convo is about finding a good way to generate a unique ID for a molecular dataset described as SMILES ...]

First, SMILES is likely a suboptimal choice: its length is variable, and the SMILES generation algorithm can differ between cheminformatics libraries and even across RDKit versions (see for example rdkit/rdkit#4919 (comment)).

InChI and InChIKey are international standards that are well respected in the field, and any implementation (RDKit or other software) will generate the same InChI/InChIKey for the same molecule.

That being said, InChI and InChIKey in their default implementation do not preserve the hydrogen layer information. As a consequence, two tautomers of the same molecule will generate the exact same InChIKey. See the example below:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.to_inchikey(mol1)  # VERCQLOBLOLFMW-UHFFFAOYSA-N
dm.to_inchikey(mol2)  # VERCQLOBLOLFMW-UHFFFAOYSA-N

If you want to differentiate between tautomers (and here it really depends on the use case, because sometimes this is not wanted), you can use a non-standard InChIKey that considers the hydrogen layers. Most software does not allow generating such InChIKeys, and you should be very careful because they look similar to standard InChIKeys (so be sure to document that this is not a standard InChIKey).

In datamol, we've added a function to generate such a non-standard InChIKey and added a simple MD5 hash on top of it to prevent confusion with the standard InChIKey. This is what we call the unique_id:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.unique_id(mol1)  # defe59076164fa24a3bbc365dc965b4b
dm.unique_id(mol2)  # 6316e004feb731d6981b5e28d3ae3f98
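
The re-hashing step itself is small. The sketch below shows the general idea using only the standard library; the helper name `md5_of_key` is hypothetical, and the input string is a stand-in (the standard key from the earlier example), not the non-standard InChIKey that datamol would actually feed into the hash, so the digest it prints is not expected to match the IDs above.

```python
import hashlib


def md5_of_key(inchikey: str) -> str:
    """Hash a key string so the result no longer looks like an InChIKey."""
    return hashlib.md5(inchikey.encode("utf-8")).hexdigest()


# Stand-in input for illustration only; a real unique ID would be derived
# from a non-standard InChIKey generated with the non-standard options.
digest = md5_of_key("VERCQLOBLOLFMW-UHFFFAOYSA-N")
print(len(digest))  # 32 hexadecimal characters, with no dashes
```

The output is a 32-character hexadecimal string, which is visibly different from the dashed, uppercase InChIKey layout and therefore hard to mistake for one.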

We developed unique_id a few years ago, but more recently folks at Schrödinger added a new molecular hashing function directly within RDKit that provides the same types of features; see the PR at rdkit/rdkit#5360. You can also find an RDKit UGM talk about it at https://github.com/rdkit/UGM_2022/blob/main/Presentations/nealschneider_introducing_registrationhash_lightning.pdf

That function is also available in datamol:

import datamol as dm

mol1 = dm.to_mol("CN=c1cc[nH]cn1")
mol2 = dm.to_mol("CN=c1ccnc[nH]1")

dm.hash_mol(mol1)  # 94f493d321cbd83c888e5f3f7e2e8054f956499f
dm.hash_mol(mol2)  # 0ef289740c9ba4d7af02c5d86fc2139fe938739a

So to summarize, before deciding on an ID, I would recommend first defining exactly what it means for two molecules to be identical in your particular context, because this is a loose definition (owing to the intrinsically physical, 3D, and dynamic nature of molecular systems).

I would also recommend storing a SMILES for every data point, so you can easily reconstruct the molecular object, in addition to one or more ID columns (inchikey, unique_id, hash_mol, etc.).
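
To make the effect of the chosen identity definition concrete, here is a small Python sketch that keeps the SMILES next to two ID columns and groups records by either one. The ID values are copied from the examples earlier in this comment; the dataset labels "A" and "B" and the helper `group_by` are hypothetical.

```python
from collections import defaultdict

# One row per data point: the SMILES plus one column per ID scheme.
records = [
    {
        "dataset": "A",  # hypothetical dataset label
        "smiles": "CN=c1cc[nH]cn1",
        "inchikey": "VERCQLOBLOLFMW-UHFFFAOYSA-N",
        "unique_id": "defe59076164fa24a3bbc365dc965b4b",
    },
    {
        "dataset": "B",  # hypothetical dataset label
        "smiles": "CN=c1ccnc[nH]1",
        "inchikey": "VERCQLOBLOLFMW-UHFFFAOYSA-N",
        "unique_id": "6316e004feb731d6981b5e28d3ae3f98",
    },
]


def group_by(rows: list, column: str) -> dict:
    """Map each ID value in `column` to the SMILES strings sharing it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["smiles"])
    return dict(groups)


# Standard InChIKey: the two tautomers count as the same molecule.
print(len(group_by(records, "inchikey")))   # 1
# unique_id: the two tautomers are kept apart.
print(len(group_by(records, "unique_id")))  # 2
```

Whether the first or the second grouping is the right one depends on what "identical" should mean for the downstream application.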

Hope it helps!

hadim commented on August 16, 2024

Now back to your use case. Assuming you want to generate mol IDs that are sensitive to the hydrogen layer, I don't think you can use molhash from RDKit, since the implementation is on the Python side.

That being said, the InChIKey code in RDKit lives on the C++ side, so you could easily tune the options to generate a non-standard InChIKey from C++ if you want to. See the equivalent datamol function for help choosing the right parameters:

def to_inchikey_non_standard(

Hope it helps!
