datamol-io / datamol

Molecular Processing Made Easy.

Home Page: https://docs.datamol.io

License: Apache License 2.0

Python 97.20% Jupyter Notebook 2.80%

python molecule rdkit drug-discovery drug-design molecules cheminformatics medicinal-chemistry

datamol's Issues

Activity unit conversion utility functions

The most obvious is XC50 <-> pXC50. I think it's within the scope of datamol but you could argue the opposite.

Let me know what you think @maclandrol
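
A minimal sketch of what such a utility could look like, assuming pXC50 is defined as the negative base-10 logarithm of XC50 expressed in mol/L. The function names and the `unit` argument are hypothetical and not part of the existing datamol API.

```python
import math

# Conversion factors from common concentration units to mol/L.
_UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def xc50_to_pxc50(value: float, unit: str = "nM") -> float:
    """Convert an XC50 concentration to pXC50 = -log10(XC50 in mol/L)."""
    if value <= 0:
        raise ValueError("XC50 must be strictly positive")
    return -math.log10(value * _UNIT_TO_MOLAR[unit])


def pxc50_to_xc50(value: float, unit: str = "nM") -> float:
    """Convert a pXC50 back to an XC50 concentration in the requested unit."""
    return 10.0 ** (-value) / _UNIT_TO_MOLAR[unit]
```

With this definition, an XC50 of 1 nM corresponds to a pXC50 of 9.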

Stereoisomer enumeration should have the unique flag

opts = StereoEnumerationOptions(unique=True)

Enumerate tautomers and stereoisomers

We should backport the function from openff-toolkit for that:

Options in `dm.to_df`

Include includePrivate, includeComputed from Mol.GetPropsAsDict(). Leave default the same as rdkit.

The JobRunner with batch sizes doesn't work with arrays/tensor

This line of code raises an error if the element is not a sequence when using a batch size. This prevents the code from running using numpy arrays or tensors.

https://github.com/datamol-org/datamol/blob/1c2041324c5d5ba660fd71cc46d1dc67cdb979a0/datamol/utils/jobs.py#L157
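
One possible direction, sketched under the assumption that batching only needs `len()` and slicing rather than a strict sequence check. The helper name `iter_batches` is hypothetical and this is not the code currently in `jobs.py`.

```python
from typing import Any, Iterator


def iter_batches(data: Any, batch_size: int) -> Iterator[Any]:
    """Yield consecutive slices of `data` with at most `batch_size` items each.

    Only `len()` and slicing are required, so lists, tuples, NumPy arrays
    and tensors are accepted without an isinstance(..., Sequence) check.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(data), batch_size):
        yield data[start : start + batch_size]
```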

Enable multi-versioning of the doc

The best would be to keep hosting the doc on GH. We can try to use the mkdocs mike plugin.

If that does not work, let's switch to readthedocs.

Viz for atom highlighting

From list of list of atom indices: https://gist.github.com/hadim/3de1e9ed34abf1c126d53798c6face3f
From a SMARTS: https://gist.github.com/maclandrol/24c0372a54e08e1d4e31528fd4d9af79

Chemical structure curation

Found this curation pipeline from chembl very useful and worth being a default curation step.
Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7458899/#CR28
Code: https://github.com/chembl/ChEMBL_Structure_Pipeline

Do you think it's worth adding or reimplementing in datamol?

Add sanitize arg to `read_sdf`

See title.

Add ESOL descriptor

def calc_ap(mol: dm.Mol):
    """Calculate aromatic proportion #aromatic atoms/#atoms total."""
    aromatic_query = dm.from_smarts("a")
    matches = mol.GetSubstructMatches(aromatic_query)
    return len(matches) / mol.GetNumAtoms()


def compute_esol(data: pd.DataFrame):
    """Compute esol from
    https://github.com/PatWalters/solubility/blob/d1536c58afe5e0e7ac4c96e2ffef496d5b98664b/esol.py

    Dataframe must have the following columns: clogp, mw, n_rotatable_bonds
    """

    intercept = 0.26121066137801696
    coef = {
        "mw": -0.0066138847738667125,
        "clogp": -0.7416739523408995,
        "n_rotatable_bonds": 0.003451545565957996,
        "ap": -0.42624840441316975,
    }

    # Compute aromatic proportion
    aromatic_proportion = data["mol"].apply(calc_ap)

    # Compute ESOL
    esol = (
        intercept
        + coef["clogp"] * data["clogp"]
        + coef["mw"] * data["mw"]
        + coef["n_rotatable_bonds"] * data["n_rotatable_bonds"]
        + coef["ap"] * aromatic_proportion
    )

    return esol

Adapt it for use in the dm.descriptors module.

More viz functions

Build an intuitive and simple mol drawing API based on http://rdkit.blogspot.com/2020/04/new-drawing-options-in-202003-release.html
Add more options to the existing viz functions (blackWhite, bondLineWidth, showAtomsNumber, Kekule, etc)
Add more viz functions such as radialscope, tmap, umap, etc

ping @craigmichaelm

Make ipywidgets optional

In https://github.com/datamol-org/datamol/blob/master/datamol/viz/_conformers.py
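
A common way to make a dependency optional is to import it lazily, inside the function that needs it, and to fail with a clear message otherwise. The sketch below uses only the standard library; the helper name `optional_import` is hypothetical.

```python
import importlib
import importlib.util
from types import ModuleType


def optional_import(name: str) -> ModuleType:
    """Import the top-level module `name` on demand.

    Raises an ImportError with an explicit message if it is not installed.
    """
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"'{name}' is required for this feature. Install it to enable it."
        )
    return importlib.import_module(name)
```

Under this assumption, the conformer viewer would call `optional_import("ipywidgets")` when it is first used rather than at module import time.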

Fix types for path

In dm.io.read_sdf the type of urlpath is Union[str, pathlib.Path, TextIO] but it should be Union[str, os.PathLike, TextIO]. The same applies to other functions.
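
To illustrate why `os.PathLike` is the broader annotation, here is a small standard-library sketch. The alias `PathOrFile` and the function `describe_source` are hypothetical names used only for this example.

```python
import io
import os
import pathlib
from typing import TextIO, Union

# Hypothetical alias for the proposed annotation.
PathOrFile = Union[str, os.PathLike, TextIO]


def describe_source(urlpath: PathOrFile) -> str:
    """Return the path as a string, or a marker for file-like objects."""
    if isinstance(urlpath, (str, os.PathLike)):
        # os.fspath accepts str and any object implementing __fspath__,
        # which includes pathlib.Path but is not limited to it.
        return os.fspath(urlpath)
    return "<file-like object>"


print(describe_source(pathlib.Path("data") / "mols.sdf"))
print(describe_source(io.StringIO("")))
```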

Support more fingerprint types

In dm.to_fp, add a fp_type argument in order to choose what fp you want to compute from http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html

Add a `mol_obj` boolean arg to `dm.to_df`

So that one column contains the RDKit mols (when loading from an SDF, inject the molecule read by RDKit directly so conformers are preserved).

Add wrapper functions for various rdkit descriptors

Add parsing of full MDL Mol block specification.

The RDKit SDF/MDL parser does not parse all atom properties specified using the MDL format.

Implementation resources:

Progress bar in `JobRunner` not really working

I feel like the progress bar is filled upon job submission and not when a job has been terminated. I fixed that in a very early version of JobRunner that wasn't relying on joblib.
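
For reference, a minimal sketch of progress reporting tied to job completion instead of submission, using `concurrent.futures` from the standard library rather than joblib. The function name `run_with_progress` and the `on_done` callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Iterable, List, Optional


def run_with_progress(
    fn: Callable[[Any], Any],
    inputs: Iterable[Any],
    on_done: Optional[Callable[[int], None]] = None,
    n_jobs: int = 4,
) -> List[Any]:
    """Run `fn` over `inputs`, reporting progress as each job finishes."""
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        # Remember the submission index so results keep the input order.
        futures = {executor.submit(fn, x): i for i, x in enumerate(inputs)}
        results: List[Any] = [None] * len(futures)
        for n_finished, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_done is not None:
                # Called after a job has completed, not when it is submitted.
                on_done(n_finished)
    return results
```

The callback receives the number of finished jobs, so a progress bar could be advanced from it.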

Missing `standardized_mol` function

The Quick API tour has the following call

mol = dm.standardized_mol(mol)

However it fails as datamol has no attribute standardized_mol.

I did not find any reference to standardized_mol in the code. Is it maybe a planned feature?

Add cxsmiles support when saving to CSV files

ping @MichelML @maclandrol

New rdkit version

Test the new version of rdkit and re-enable CI matrix once the repo is public.

Define security policy and setup code scanning alerts

See https://github.com/datamol-org/datamol/security

security policy
code scanning alerts

Tutorials

Make more tutorials:

working with reactions

Use np.frombuffer to convert fp to array

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import DataStructs

import numpy as np

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

fp_vec = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol,
    radius=2,
    nBits=2048,
    useFeatures=True,
)

%%timeit -n 1000 -r 1000
# 10.1 µs ± 3.62 µs per loop (mean ± std. dev. of 1000 runs, 1000 loops each)
np.frombuffer(fp_vec.ToBitString().encode(), 'u1') - ord('0')

%%timeit -n 1000 -r 100
# 39.6 µs ± 1.45 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)
fp = np.zeros((0,), dtype=int)
DataStructs.ConvertToNumpyArray(fp_vec, fp)

On my machine np.frombuffer is 4 times faster than ConvertToNumpyArray. This is a pretty quick benchmark so I might also be missing something.

Allow to set `total` in `dm.parallelized`

An error is raised when you provide total to tqdm_kwargs. It's convenient to be able to specify it when processing a very large dataframe for example since the most efficient way is to iterate on the rows (meaning the count of the iterator is not known).

Instead of:

rows = data.iterrows()
rows = list(rows)   # memory and CPU intensive
data = dm.parallelized(process_fn, rows, n_jobs=-1, progress=True, arg_type="args")

you would do:

data = dm.parallelized(process_fn, data.iterrows(), n_jobs=-1, progress=True, arg_type="args", total=len(data))
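
A sketch of how the total could be resolved when the input may be a lazy iterator. This is an assumption about one possible implementation, not the current `dm.parallelized` internals, and `resolve_total` is a hypothetical helper name.

```python
from typing import Any, Iterable, Optional


def resolve_total(inputs: Iterable[Any], total: Optional[int] = None) -> Optional[int]:
    """Pick the progress-bar total: explicit value first, then len(), else None."""
    if total is not None:
        return total
    try:
        return len(inputs)  # type: ignore[arg-type]
    except TypeError:
        # Generators and other lazy iterables have no length.
        return None
```

An explicit `total` takes precedence, so the caller never has to materialize the iterator just to get a count.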

Fix conformers generation

smiles = "CCCC"
mol = dm.to_mol(smiles)
dm.conformers.generate(mol, n_confs=1, minimize_energy=True, rms_cutoff=None)

and

They both fail when doing [mol.AddConformer(conf, assignId=True) for conf in confs], probably because RemoveAllConformers() is called just before.

It's a null pointer exception.

Bump CI to latest rdkit version (2021.09.*)

Support CXSmiles

Probably as a bool flag in dm.to_smiles and dm.to_mol? Or in a separate function.

Parallelize cdist and pdist

For large datasets. See https://gist.github.com/rtavenar/a4fb580ae235cc61ce8cf07878810567 for a snippet.

More descriptors

import datamol as dm
from rdkit import Chem


def _in_range(x, min_val: float = -float("inf"), max_val: float = float("inf")):
    """Check if a value is in a range
    Args:
        x: value to check
        min_val: minimum value
        max_val: maximum value
    """
    return min_val <= x <= max_val


def _compute_ring_system(mol: dm.Mol, include_spiro: bool = True):
    """
    Compute the list of ring system in a molecule. This is based on RDKit's cookbook:
    https://www.rdkit.org/docs/Cookbook.html#rings-aromaticity-and-kekulization

    # EN: move to datamol

    Args:
        mol: input molecule
        include_spiro: whether to include spiro rings. Defaults to True.

    Returns:
        ring_system: list of ring system
    """
    ri = mol.GetRingInfo()
    systems = []
    for ring in ri.AtomRings():
        ringAts = set(ring)
        nSystems = []
        for system in systems:
            nInCommon = len(ringAts.intersection(system))
            if nInCommon and (include_spiro or nInCommon > 1):
                ringAts = ringAts.union(system)
            else:
                nSystems.append(system)
        nSystems.append(ringAts)
        systems = nSystems
    return systems


def _compute_charge(mol: dm.Mol):
    """
    Compute the charge of a molecule.

    Args:
        mol: input molecule

    Returns:
        charge: charge of the molecule
    """
    return Chem.rdmolops.GetFormalCharge(mol)


def _compute_refractivity(mol: dm.Mol):
    """
    Compute the refractivity of a molecule.

    Args:
        mol: input molecule

    Returns:
        mr: molecular refractivity of the molecule
    """
    return Chem.Crippen.MolMR(mol)


def _compute_rigid_bonds(mol: dm.Mol):
    """
    Compute the number of rigid bonds in a molecule.

    Args:
        mol: input molecule

    Returns:
        n_rigid_bonds: number of rigid bonds in the molecule
    """
    # rigid bonds are bonds that are not (single and not in rings)
    # I don't think this is the same as the number of non rotatable bonds ?
    # EN: move this to datamol ?
    non_rigid_bonds_count = dm.from_smarts("*-&!@*")
    n_rigid_bonds = mol.GetNumBonds() - len(
        mol.GetSubstructMatches(non_rigid_bonds_count)
    )
    return n_rigid_bonds


def _compute_n_stereo_center(mol: dm.Mol):
    """
    Compute the number of stereocenters in a molecule.

    Args:
        mol: input molecule

    Returns:
        n_stereo_center: number of stereocenters in the molecule
    """
    n_stereo_center = 0
    try:
        Chem.FindPotentialStereo(mol, cleanIt=False)  # type: ignore
        n_stereo_center = Chem.rdMolDescriptors.CalcNumAtomStereoCenters(mol)
    except Exception:
        # Keep the default of 0 if stereo perception fails.
        pass
    return n_stereo_center


def _compute_n_charged_atoms(mol: dm.Mol):
    """
    Compute the number of charged atoms in a molecule.

    Args:
        mol: input molecule

    Returns:
        n_charged_atoms: number of charged atoms in the molecule
    """
    return sum([at.GetFormalCharge() != 0 for at in mol.GetAtoms()])

Unique molecule ID taking into account tautomerism

We are looking into generating a unique molecular ID that will give different IDs if two molecules are different tautomeric forms.

Features should be:

fast: we potentially want to apply it at large scale
consistent mol → ID: we should be able to recompute an ID from a given molecule (it can't be random or auto-incremented)
stable across python/rdkit and datamol versions: a unique ID should still be the same in a couple of years.
differentiate tautomeric forms

An elegant way would be to use a non-standard InChI that includes a hydrogen layer /f. See https://en.wikipedia.org/wiki/International_Chemical_Identifier for details.

Here is a proof of concept that it works:

import datamol as dm

from rdkit import Chem

# SMILES: "N=C(N)O"
inchi3 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2,4H,3H2/b2-1?"

# SMILES: "NC(N)=O"
inchi4 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2-3H2"

mol3 = dm.from_inchi(inchi3)
mol4 = dm.from_inchi(inchi4)

print(dm.to_smiles(mol3))  # N=C(N)O
print(dm.to_smiles(mol4))  # NC(N)=O

# generated inchi and inchikey remain the same since the hydrogen layer has been removed

print(dm.to_inchi(mol3))  # 'InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)'
print(dm.to_inchi(mol4))  # 'InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)'

print(dm.to_inchikey(mol3))  # 'XSQUKJJJFZCRTK-UHFFFAOYSA-N'
print(dm.to_inchikey(mol4))  # 'XSQUKJJJFZCRTK-UHFFFAOYSA-N'


# when computing the inchikey directly from the non-standard inchi with the hydrogen layer
# those are different

print(Chem.inchi.InchiToInchiKey(inchi3))  # XSQUKJJJFZCRTK-ZIALIONUNA-N
print(Chem.inchi.InchiToInchiKey(inchi4))  # XSQUKJJJFZCRTK-UBUOBULFNA-N

But currently I haven't found a way to actually generate that hydrogen layer with rdkit. It seems to me that rdkit can only read the hydrogen layer and not generate it. I think this is on purpose, so that people stick to the standard InChI, since using /f is non-standard.

To make it work we would have to find a way to compute the hydrogen layer, append it to the inchi and then use Chem.inchi.InchiToInchiKey(inchi3) to compute the inchikey from the inchi.

This is just a proposal, feel free to throw alternative ideas.

ping @maclandrol @craigmichaelm @zhu0619

Add to "projects using rdkit"

https://github.com/rdkit/rdkit#projects-using-rdkit

Propagate drawing options to `Draw.MolsToGridImage` in `dm.to_image`

It will allow SMILES or SMARTS without correct aromaticity perception to be displayed without throwing an error.

Batch size does not work as expected

I also get another problem when using batch_size with the JobRunner.

My expectation is that, given a list of 1000 elements and batch_size=10, the job runner should automatically split the list into 100 sub-lists of 10 elements each. Then, I expect each sub-list to be processed sequentially.

However, instead of sequentially passing each element of the sub-list into the function, it passes the full sub-list. Thus, if the function doesn't handle lists, it crashes.

Maybe this is done on purpose, but I don't think it's the right behavior.

To reproduce:

import datamol as dm

def fun(a):
    print(a)

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

dm.utils.parallelized(fun, a, batch_size=2)

# RESULT
[0, 1]
[2, 3]
[6, 7]
[8, 9]
[4, 5]

# EXPECTED (in any order)
0
1
3
4
6
7
5
2
9
8
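
The expected semantics can be sketched with the standard library as follows. This illustrates the behavior described above and is not the JobRunner implementation; `parallelized_batched` and `_apply_to_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def _apply_to_batch(fn: Callable[[Any], Any], batch: Sequence[Any]) -> List[Any]:
    # Each worker receives a whole batch but calls `fn` one element at a time.
    return [fn(item) for item in batch]


def parallelized_batched(
    fn: Callable[[Any], Any],
    inputs: Sequence[Any],
    batch_size: int,
    n_jobs: int = 2,
) -> List[Any]:
    """Apply `fn` per element, dispatching the work in batches."""
    batches = [inputs[i : i + batch_size] for i in range(0, len(inputs), batch_size)]
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        # executor.map preserves the order of the submitted batches.
        per_batch = executor.map(lambda b: _apply_to_batch(fn, b), batches)
        # Flatten so the caller gets one result per input element.
        return [result for batch in per_batch for result in batch]
```

Here `fn` never sees a list: batching only controls how much work each task carries.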

Remove the `datamol.actions` module

I don't think it's used.
It's not tested.

What do you think @maclandrol ?

Search capabilities: exact, substructure, similarity

I was wondering if a search api would be suited for datamol.

A starting API along the lines of:

from typing import Tuple, Union

import datamol as dm


def sim_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    threshold: float = 0.7,
    fp_type: str = "ffp2",
) -> list[Tuple[dm.Mol, float]]:
    pass

def substruct_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    atoms_and_bonds: bool = False,
) -> Union[list[dm.Mol], list[Tuple[dm.Mol, list[int], list[int]]]]:
    pass

def exact_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    unique: bool = True,
) -> Union[dm.Mol, list[dm.Mol]]:
    pass

What do you think?
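
To make the similarity part concrete, here is a toy sketch of a threshold-based Tanimoto search. It assumes fingerprints are already available as sets of on-bit indices, which is a simplification for illustration only; it does not use RDKit or datamol, and the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sim_search_fps(
    fps: Sequence[Set[int]],
    query_fp: Set[int],
    threshold: float = 0.7,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs at or above `threshold`, best first."""
    scored = [(i, tanimoto(fp, query_fp)) for i, fp in enumerate(fps)]
    hits = [(i, s) for i, s in scored if s >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A real implementation would compute the fingerprints from the molecules according to `fp_type` and return the molecules themselves rather than indices.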

Better parallel

The current parallelized function is based on joblib which has pros but also cons. For example joblib tunes the batch size automatically which is nice.

The two major cons are: 1) it's not possible to simply call logger.info (it's probably possible with some tweaking) 2) exceptions are not propagated and so it makes debugging harder (you can always go back to n_jobs=1).

Here is a snippet that uses concurrent.futures as a backend instead of joblib:

from typing import Callable
from typing import Sequence
from typing import Any
from typing import Optional

from concurrent import futures
from loguru import logger


def parallelized(
    fn: Callable,
    inputs_list: Sequence[Any],
    scheduler: str = "processes",
    n_jobs: Optional[int] = -1,
    worker_id_key: Any = None,
    progress: bool = True,
    progress_auto: bool = True,
    log_when_done: bool = False,
):

    if not isinstance(inputs_list[0], dict):
        raise ValueError(f"inputs_list elements must be a dict not {type(inputs_list[0])}")

    if progress_auto:
        from tqdm.auto import tqdm
    else:
        from tqdm import tqdm

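    # concurrent.futures rejects max_workers <= 0, so map -1 to None,
    # which lets the executor choose its own default number of workers.
    max_workers = None if n_jobs is None or n_jobs < 1 else n_jobs

    if scheduler == "processes":
        pool_executor_cls = futures.ProcessPoolExecutor
        executor_kwargs = dict(max_workers=max_workers)
        as_completed_fn = futures.as_completed

    elif scheduler == "threads":
        pool_executor_cls = futures.ThreadPoolExecutor
        executor_kwargs = dict(max_workers=max_workers)
        as_completed_fn = futures.as_completed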

    else:
        raise ValueError(f"Wrong scheduler: {scheduler}")

    if worker_id_key:
        worker_id_fn = lambda x: x[worker_id_key]
    else:
        worker_id_fn = lambda x: x

    with pool_executor_cls(**executor_kwargs) as executor:

        futures_list = {executor.submit(fn, **kwargs): kwargs for kwargs in inputs_list}

        results = []
        tqdm_args = {}
        tqdm_args["total"] = len(inputs_list)
        tqdm_args["disable"] = not progress

        for i, future in enumerate(tqdm(as_completed_fn(futures_list), **tqdm_args)):
            kwargs = futures_list[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as ex:
                # loguru attaches the current traceback via opt(exception=True)
                logger.opt(exception=True).info(f"Error running the following worker '{worker_id_fn(kwargs)}':")
                logger.info("Shutting down the execution...")
                executor.shutdown(True)
                raise ex
            else:
                if log_when_done:
                    logger.info(f"Execution done for '{worker_id_fn(kwargs)}' ({i+1}/{len(inputs_list)})")

    return results

It only supports kwargs as input type.

joblib also does a lot of useful magic under the hood that concurrent.futures is not doing so if we want to ship that function we could add it as dm.parallelized2 (or similar).

Switch to `useRandomCoords` by default for `dm.conformers.generate`

See https://greglandrum.github.io/rdkit-blog/conformers/exploration/2021/01/31/looking-at-random-coordinate-embedding.html for the rationale.

What do you think? @maclandrol @JDavid04

New align mol

Based on manu's gist: https://gist.github.com/maclandrol/dd5695299107c1e167d6d1f671a786d1

Add a `dm.utils.fs` module

A module based on fsspec.

I am a bit hesitant to do that TBH since it's out of the scope of datamol. But it would avoid a lot of the fs module duplication we are starting to have internally, and it is also potentially useful to anyone using datamol since it often involves manipulating remote data (in the same spirit we already have dm.utils.parallelize).

The alternative would be to integrate such a module into fsspec. But it: 1) will take more time 2) will probably have to be modified in order to fit the fsspec one.

@maclandrol any strong opinion on this?

Parallelize `dm.scaffold.fuzzy_scaffolding`

It's probably easy to do as there are a couple of independent for loops in the function.

Make sure all points of community section of Insights tab are green

see https://github.com/datamol-org/datamol/community . Missing are:

Code of conduct
Contributing
Issue templates
Repository admins accept content reports

Wrong type for `mol` in `dm.to_mol`

It's mol: str, while it should be mol: Union[str, dm.Mol].

From datamol 0.6.0 forwards, required rdkit version should be >2021.09.1

An update for the viz.to_image was introduced in 0.6.0. But there was this rdkit bug: rdkit/rdkit#3101 that was only fixed in Release_2021.09.1

https://github.com/datamol-org/datamol/blob/cd4561b7e8cdfb1c1ccc034116d6f900cc957b11/env.yml#L24

[ERROR] Runtime.ImportModuleError: Unable to import module 'app': No module named 'sascorer' Traceback (most recent call last):

[ERROR] Runtime.ImportModuleError: Unable to import module 'app': No module named 'sascorer' Traceback (most recent call last):

Posting so it's logged somewhere.

Is this a new dependency for datamol? Disclaimer: I hadn't updated datamol in my project (lambdomics) for a couple of months (I was at 0.5 before upgrading). sascorer seems to have appeared in commit 8576d4b with the descriptors module, and this module is imported by default when using import datamol as dm.

sascorer seems to be a module coming from RDKit but I'm not sure. Is it just a matter of making sure to import RDKit before datamol if you use both in your project?
