datamol-io / datamol

Molecular Processing Made Easy.

Home Page: https://docs.datamol.io

License: Apache License 2.0

Python 97.20% Jupyter Notebook 2.80%
python molecule rdkit drug-discovery drug-design molecules cheminformatics medicinal-chemistry

datamol's People

Contributors

craigmichaelm, cwognum, deepsource-autofix[bot], deepsourcebot, dessygil, dominvivo, hadim, ishan-kumar2, kkovary, maclandrol, mercuryseries, michelml, pakman450, sauravmaheshkar, shuyana, stwhitfield, therence1, valence-jonnyhsu, zhu0619

datamol's Issues

Options in `dm.to_df`

Expose the includePrivate and includeComputed options of Mol.GetPropsAsDict() in dm.to_df. Keep the defaults the same as in RDKit.

Enable multi-versioning of the docs

Ideally we keep hosting the docs on GitHub. We can try the mike plugin for mkdocs.

If that does not work, let's switch to Read the Docs.

Add ESOL descriptor

import pandas as pd

import datamol as dm


def calc_ap(mol: dm.Mol):
    """Calculate aromatic proportion #aromatic atoms/#atoms total."""
    aromatic_query = dm.from_smarts("a")
    matches = mol.GetSubstructMatches(aromatic_query)
    return len(matches) / mol.GetNumAtoms()


def compute_esol(data: pd.DataFrame):
    """Compute esol from
    https://github.com/PatWalters/solubility/blob/d1536c58afe5e0e7ac4c96e2ffef496d5b98664b/esol.py

    Dataframe must have the following columns: mol, clogp, mw, n_rotatable_bonds
    """

    intercept = 0.26121066137801696
    coef = {
        "mw": -0.0066138847738667125,
        "clogp": -0.7416739523408995,
        "n_rotatable_bonds": 0.003451545565957996,
        "ap": -0.42624840441316975,
    }

    # Compute aromatic proportion
    aromatic_proportion = data["mol"].apply(calc_ap)

    # Compute ESOL
    esol = (
        intercept
        + coef["clogp"] * data["clogp"]
        + coef["mw"] * data["mw"]
        + coef["n_rotatable_bonds"] * data["n_rotatable_bonds"]
        + coef["ap"] * aromatic_proportion
    )

    return esol

This should be adapted for use in the dm.descriptors module.
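One possible shape for a descriptor-style function is a scalar version of the formula above that works on the precomputed values of a single molecule. This is only a sketch: the name `esol_from_descriptors` and its parameters are assumptions, not part of datamol, and the coefficients are copied from the snippet above.

```python
# Coefficients copied from the snippet above.
ESOL_INTERCEPT = 0.26121066137801696
ESOL_COEF = {
    "mw": -0.0066138847738667125,
    "clogp": -0.7416739523408995,
    "n_rotatable_bonds": 0.003451545565957996,
    "ap": -0.42624840441316975,
}


def esol_from_descriptors(
    clogp: float,
    mw: float,
    n_rotatable_bonds: int,
    ap: float,
) -> float:
    """Hypothetical helper: evaluate the linear ESOL model for one molecule.

    `ap` is the aromatic proportion (#aromatic atoms / #atoms total).
    """
    return (
        ESOL_INTERCEPT
        + ESOL_COEF["clogp"] * clogp
        + ESOL_COEF["mw"] * mw
        + ESOL_COEF["n_rotatable_bonds"] * n_rotatable_bonds
        + ESOL_COEF["ap"] * ap
    )
```

Keeping the model free of any dataframe dependency would let the same function be called per molecule or applied row by row.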

Fix types for path

In dm.io.read_sdf the type of urlpath is Union[str, pathlib.Path, TextIO] but it should be Union[str, os.PathLike, TextIO]. The same applies to other functions.
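A small sketch of why os.PathLike is the more general annotation: any object implementing `__fspath__` qualifies, and pathlib.Path is just one of them. The function `normalize_urlpath` and the class `RemoteFile` are hypothetical and only serve the illustration; they are not datamol code.

```python
import io
import os
import pathlib
from typing import TextIO, Union

PathOrFile = Union[str, os.PathLike, TextIO]


class RemoteFile:
    """Hypothetical path-like object that is not a pathlib.Path."""

    def __init__(self, path: str):
        self._path = path

    def __fspath__(self) -> str:
        return self._path


def normalize_urlpath(urlpath: PathOrFile) -> Union[str, TextIO]:
    """Return a plain string for path-like inputs; pass file objects through."""
    if isinstance(urlpath, (str, os.PathLike)):
        # os.fspath accepts str and any object implementing __fspath__.
        return os.fspath(urlpath)
    return urlpath


# All three path flavours are accepted, as is an open text buffer.
print(normalize_urlpath("mols.sdf"))
print(normalize_urlpath(pathlib.Path("mols.sdf")))
print(normalize_urlpath(RemoteFile("s3://bucket/mols.sdf")))
print(type(normalize_urlpath(io.StringIO(""))).__name__)
```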

Missing `standardized_mol` function

The Quick API tour has the following call

mol = dm.standardized_mol(mol)

However it fails as datamol has no attribute standardized_mol.

I did not find any reference to standardized_mol in the code. Is it maybe a planned feature?

New rdkit version

Test the new version of rdkit and re-enable CI matrix once the repo is public.

Tutorials

Make more tutorials:

  • working with reactions

Use np.frombuffer to convert fp to array

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import DataStructs

import numpy as np

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

fp_vec = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol,
    radius=2,
    nBits=2048,
    useFeatures=True,
)
# Each %%timeit block below is a separate notebook cell.

%%timeit -n 1000 -r 1000
# 10.1 µs ± 3.62 µs per loop (mean ± std. dev. of 1000 runs, 1000 loops each)
np.frombuffer(fp_vec.ToBitString().encode(), 'u1') - ord('0')

%%timeit -n 1000 -r 100
# 39.6 µs ± 1.45 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)
fp = np.zeros((0,), dtype=int)
DataStructs.ConvertToNumpyArray(fp_vec, fp)

On my machine np.frombuffer is 4 times faster than ConvertToNumpyArray. This is a pretty quick benchmark so I might also be missing something.

Allow to set `total` in `dm.parallelized`

Passing total in tqdm_kwargs currently raises an error. Being able to set it is convenient when processing a very large dataframe, for example, because the most efficient approach is to iterate over the rows lazily, which means the length of the iterator is not known in advance.

Instead of:

rows = data.iterrows()
rows = list(rows)   # memory and CPU intensive
data = dm.parallelized(process_fn, rows, n_jobs=-1, progress=True, arg_type="args")

you would do:

data = dm.parallelized(process_fn, data.iterrows(), n_jobs=-1, progress=True, arg_type="args", total=len(data))
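To make the idea concrete, here is a minimal standard-library sketch of a mapper that accepts an iterator together with an externally supplied `total`. The helper `parallel_map` and its parameters are hypothetical and do not reflect how dm.parallelized is implemented; `fn(*args)` mirrors arg_type="args".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Optional


def parallel_map(
    fn: Callable[..., Any],
    inputs: Iterable[tuple],
    total: Optional[int] = None,
    n_workers: int = 4,
) -> List[Any]:
    """Hypothetical helper: apply `fn(*args)` to each item of an iterable.

    `total` is only used for progress reporting, so the caller does not need
    to build a list just to learn the length. Note that executor.map still
    submits every item up front.
    """
    if total is None and hasattr(inputs, "__len__"):
        total = len(inputs)  # fall back to len() when it is available

    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        mapped = executor.map(lambda args: fn(*args), inputs)
        for done, result in enumerate(mapped, start=1):
            results.append(result)
            suffix = f"/{total}" if total is not None else ""
            print(f"\rprocessed {done}{suffix}", end="")
    print()
    return results


# A generator has no len(), so the total is passed explicitly.
rows = ((i, i + 1) for i in range(5))
sums = parallel_map(lambda a, b: a + b, rows, total=5)
```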

Fix conformers generation

smiles = "CCCC"
mol = dm.to_mol(smiles)
dm.conformers.generate(mol, n_confs=1, minimize_energy=True, rms_cutoff=None)

and

They both fail at [mol.AddConformer(conf, assignId=True) for conf in confs], probably because RemoveAllConformers() is called just before.

It's a null pointer exception.

Support CXSmiles

Probably as a boolean flag in dm.to_smiles and dm.to_mol, or as a separate function?

More descriptors

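import datamol as dm
from rdkit import Chem
from rdkit.Chem import Crippen, rdMolDescriptors  # make Chem.Crippen / Chem.rdMolDescriptors available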


def _in_range(x, min_val: float = -float("inf"), max_val: float = float("inf")):
    """Check if a value is in a range
    Args:
        x: value to check
        min_val: minimum value
        max_val: maximum value
    """
    return min_val <= x <= max_val


def _compute_ring_system(mol: dm.Mol, include_spiro: bool = True):
    """
    Compute the list of ring systems in a molecule. This is based on RDKit's cookbook:
    https://www.rdkit.org/docs/Cookbook.html#rings-aromaticity-and-kekulization

    # EN: move to datamol

    Args:
        mol: input molecule
        include_spiro: whether to include spiro rings. Defaults to True.

    Returns:
        ring_system: list of ring system
    """
    ri = mol.GetRingInfo()
    systems = []
    for ring in ri.AtomRings():
        ringAts = set(ring)
        nSystems = []
        for system in systems:
            nInCommon = len(ringAts.intersection(system))
            if nInCommon and (include_spiro or nInCommon > 1):
                ringAts = ringAts.union(system)
            else:
                nSystems.append(system)
        nSystems.append(ringAts)
        systems = nSystems
    return systems


def _compute_charge(mol: dm.Mol):
    """
    Compute the charge of a molecule.

    Args:
        mol: input molecule

    Returns:
        charge: charge of the molecule
    """
    return Chem.rdmolops.GetFormalCharge(mol)


def _compute_refractivity(mol: dm.Mol):
    """
    Compute the refractivity of a molecule.

    Args:
        mol: input molecule

    Returns:
        mr: molecular refractivity of the molecule
    """
    return Chem.Crippen.MolMR(mol)


def _compute_rigid_bonds(mol: dm.Mol):
    """
    Compute the number of rigid bonds in a molecule.

    Args:
        mol: input molecule

    Returns:
        n_rigid_bonds: number of rigid bonds in the molecule
    """
    # rigid bonds are bonds that are not (single and not in rings)
    # I don't think this is the same as the number of non rotatable bonds ?
    # EN: move this to datamol ?
    # SMARTS query matching single bonds that are not in a ring
    non_rigid_bond_query = dm.from_smarts("*-&!@*")
    n_rigid_bonds = mol.GetNumBonds() - len(
        mol.GetSubstructMatches(non_rigid_bond_query)
    )
    return n_rigid_bonds


def _compute_n_stereo_center(mol: dm.Mol):
    """
    Compute the number of stereocenters in a molecule.

    Args:
        mol: input molecule

    Returns:
        n_stereo_center: number of stereocenters in the molecule
    """
    n_stereo_center = 0
    try:
        Chem.FindPotentialStereo(mol, cleanIt=False)  # type: ignore
        n_stereo_center = Chem.rdMolDescriptors.CalcNumAtomStereoCenters(mol)
    except Exception:
        pass
    return n_stereo_center


def _compute_n_charged_atoms(mol: dm.Mol):
    """
    Compute the number of charged atoms in a molecule.

    Args:
        mol: input molecule

    Returns:
        n_charged_atoms: number of charged atoms in the molecule
    """
    return sum([at.GetFormalCharge() != 0 for at in mol.GetAtoms()])

Unique molecule ID taking into account tautomerism

We are looking into generating a unique molecular ID that differs between two tautomeric forms of the same molecule.

Features should be:

  • fast: we potentially want to apply it at large scale
  • consistent mol → ID: we should be able to recompute the ID from a given molecule (it cannot be random or auto-incremented)
  • stable across python/rdkit and datamol versions: a unique ID should still be the same in a couple of years.
  • differentiate tautomeric forms

An elegant way would be to use a non-standard InChI that includes the fixed-hydrogen layer /f. See https://en.wikipedia.org/wiki/International_Chemical_Identifier for details.

Here is a proof of concept that it works:

import datamol as dm

from rdkit import Chem

# SMILES: "N=C(N)O"
inchi3 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2,4H,3H2/b2-1?"

# SMILES: "NC(N)=O"
inchi4 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2-3H2"

mol3 = dm.from_inchi(inchi3)
mol4 = dm.from_inchi(inchi4)

print(dm.to_smiles(mol3))  # N=C(N)O
print(dm.to_smiles(mol4))  # NC(N)=O

# generated inchi and inchikey remain the same since the hydrogen layer has been removed

print(dm.to_inchi(mol3))  # 'InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)'
print(dm.to_inchi(mol4))  # 'InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)'

print(dm.to_inchikey(mol3))  # 'XSQUKJJJFZCRTK-UHFFFAOYSA-N'
print(dm.to_inchikey(mol4))  # 'XSQUKJJJFZCRTK-UHFFFAOYSA-N'


# when computing the inchikey directly from the non-standard inchi with the hydrogen layer
# those are different

print(Chem.inchi.InchiToInchiKey(inchi3))  # XSQUKJJJFZCRTK-ZIALIONUNA-N
print(Chem.inchi.InchiToInchiKey(inchi4))  # XSQUKJJJFZCRTK-UBUOBULFNA-N

But so far I haven't found a way to actually generate that hydrogen layer with rdkit. It seems that rdkit can only read the hydrogen layer, not generate it. I think this is on purpose, so that people stick to the standard InChI, since using /f is non-standard.

To make it work we would have to find a way to compute the hydrogen layer, append it to the inchi and then use Chem.inchi.InchiToInchiKey(inchi3) to compute the inchikey from the inchi.


This is just a proposal, feel free to suggest alternative ideas.
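As an illustration of the "consistent" and "stable" requirements only, the sketch below derives a deterministic ID by hashing a fixed-hydrogen InChI string with SHA-256 from the standard library. It assumes such a string is already available (which is the open problem above), it is a stand-in rather than the InChIKey algorithm, and `stable_id` is a hypothetical name.

```python
import hashlib


def stable_id(fixed_h_inchi: str, length: int = 16) -> str:
    """Hypothetical helper: deterministic ID from a fixed-H InChI string.

    SHA-256 gives the same digest on every Python version and platform,
    unlike the built-in hash(). Truncating the digest shortens the ID at
    the cost of a higher collision probability.
    """
    digest = hashlib.sha256(fixed_h_inchi.encode("utf-8")).hexdigest()
    return digest[:length]


# The two tautomers from the proof of concept above.
inchi3 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2,4H,3H2/b2-1?"
inchi4 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2-3H2"

id3 = stable_id(inchi3)
id4 = stable_id(inchi4)
print(id3 != id4)  # the two tautomeric forms get different IDs
```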

ping @maclandrol @craigmichaelm @zhu0619

Batch size does not work as expected

I also get another problem when using batch_size with the JobRunner.

My expectation is that, given a list of 1000 elements and batch_size=10, the job runner should automatically split the list into 100 sub-lists of 10 elements each. Then, I expect each sub-list to be processed sequentially.

However, instead of sequentially passing each element of the sub-list into the function, it passes the full sub-list. Thus, if the function doesn't handle lists, it crashes.

Maybe this is done on purpose, but I don't think it's the right behavior.

To reproduce:

import datamol as dm

def fun(a):
    print(a)

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

dm.utils.parallelized(fun, a, batch_size=2)

# RESULT
[0, 1]
[2, 3]
[6, 7]
[8, 9]
[4, 5]

# EXPECTED (in any order)
0
1
3
4
6
7
5
2
9
8
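The expected behavior could be obtained by unpacking each batch inside the worker, so that the user function still receives one element at a time. Below is a minimal sketch using only the standard library; `make_batches`, `apply_to_batch` and `run_batched` are hypothetical names, and this is not how datamol's JobRunner is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
from typing import Any, Callable, List, Sequence


def make_batches(items: Sequence[Any], batch_size: int) -> List[Sequence[Any]]:
    """Split `items` into consecutive chunks of at most `batch_size` elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


def apply_to_batch(fn: Callable[[Any], Any], batch: Sequence[Any]) -> List[Any]:
    """Call `fn` on every element of the batch, one after the other."""
    return [fn(item) for item in batch]


def run_batched(
    fn: Callable[[Any], Any],
    items: Sequence[Any],
    batch_size: int,
    n_workers: int = 2,
) -> List[Any]:
    """Dispatch batches to workers, but call `fn` once per element."""
    batches = make_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        per_batch = executor.map(lambda batch: apply_to_batch(fn, batch), batches)
        # Flatten so the caller gets one result per input element.
        return list(chain.from_iterable(per_batch))


squares = run_batched(lambda x: x * x, list(range(10)), batch_size=2)
```

With this design the batch size only controls how work is grouped per task; the function signature seen by the user does not change.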

Search capabilities: exact, substructure, similarity

I was wondering if a search API would be a good fit for datamol.

A starting API along the lines of:

from typing import Tuple, Union

import datamol as dm


def sim_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    threshold: float = 0.7,
    fp_type: str = "ffp2",
) -> list[Tuple[dm.Mol, float]]:
    pass


def substruct_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    atoms_and_bonds: bool = False,
) -> Union[list[dm.Mol], list[Tuple[dm.Mol, list[int], list[int]]]]:
    pass


def exact_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    unique: bool = True,
) -> Union[dm.Mol, list[dm.Mol]]:
    pass

What do you think?

Better parallel

The current parallelized function is based on joblib which has pros but also cons. For example joblib tunes the batch size automatically which is nice.

The two major cons are: 1) it's not possible to simply call logger.info (it's probably possible with some tweaking) 2) exceptions are not propagated and so it makes debugging harder (you can always go back to n_jobs=1).

Here is a snippet that uses concurrent.futures as a backend instead of joblib:

from typing import Callable
from typing import Sequence
from typing import Any
from typing import Optional

from concurrent import futures
from loguru import logger


def parallelized(
    fn: Callable,
    inputs_list: Sequence[Any],
    scheduler: str = "processes",
    n_jobs: Optional[int] = -1,
    worker_id_key: Any = None,
    progress: bool = True,
    progress_auto: bool = True,
    log_when_done: bool = False,
):

    if not isinstance(inputs_list[0], dict):
        raise ValueError(f"inputs_list elements must be a dict not {type(inputs_list[0])}")

    if progress_auto:
        from tqdm.auto import tqdm
    else:
        from tqdm import tqdm

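    # concurrent.futures rejects max_workers <= 0, so map the joblib-style
    # n_jobs=-1 to None and let the executor pick its default worker count.
    max_workers = None if n_jobs is None or n_jobs < 0 else n_jobs

    if scheduler == "processes":
        pool_executor_cls = futures.ProcessPoolExecutor
        executor_kwargs = dict(max_workers=max_workers)
        as_completed_fn = futures.as_completed

    elif scheduler == "threads":
        pool_executor_cls = futures.ThreadPoolExecutor
        executor_kwargs = dict(max_workers=max_workers)
        as_completed_fn = futures.as_completed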

    else:
        raise ValueError(f"Wrong scheduler: {scheduler}")

    if worker_id_key:
        worker_id_fn = lambda x: x[worker_id_key]
    else:
        worker_id_fn = lambda x: x

    with pool_executor_cls(**executor_kwargs) as executor:

        futures_list = {executor.submit(fn, **kwargs): kwargs for kwargs in inputs_list}

        results = []
        tqdm_args = {}
        tqdm_args["total"] = len(inputs_list)
        tqdm_args["disable"] = not progress

        for i, future in enumerate(tqdm(as_completed_fn(futures_list), **tqdm_args)):
            kwargs = futures_list[future]
            try:
                result = future.result()
                results.append(result)
            except Exception:
                # loguru does not take exc_info; opt(exception=True) attaches the traceback
                logger.opt(exception=True).info(f"Error running the following worker '{worker_id_fn(kwargs)}':")
                logger.info("Shutting down the execution...")
                executor.shutdown(True)
                raise
            else:
                if log_when_done:
                    logger.info(f"Execution done for '{worker_id_fn(kwargs)}' ({i+1}/{len(inputs_list)})")

    return results

It only supports kwargs as input type.

joblib also does a lot of useful magic under the hood that concurrent.futures is not doing so if we want to ship that function we could add it as dm.parallelized2 (or similar).
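On the exception-handling point, the sketch below shows the behavior that concurrent.futures offers out of the box: Future.result() re-raises, in the caller, the exception that was raised inside the worker. The worker `risky` and the helper `run_all` are hypothetical names used only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterable, Tuple


def risky(x: int) -> int:
    """Hypothetical worker that fails on one specific input."""
    if x == 3:
        raise ValueError(f"cannot process {x}")
    return x * 2


def run_all(values: Iterable[int]) -> Tuple[Dict[int, int], Dict[int, Exception]]:
    """Return (results, errors), where errors maps a failing input to its exception."""
    results: Dict[int, int] = {}
    errors: Dict[int, Exception] = {}
    with ThreadPoolExecutor(max_workers=2) as executor:
        future_to_input = {executor.submit(risky, v): v for v in values}
        for future in as_completed(future_to_input):
            value = future_to_input[future]
            try:
                # result() re-raises the exception raised inside the worker.
                results[value] = future.result()
            except ValueError as exc:
                errors[value] = exc
    return results, errors


results, errors = run_all([1, 2, 3, 4])
```

Because the original exception object reaches the caller, it can be logged with its traceback or re-raised, depending on what the wrapper wants to do.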

Add a `dm.utils.fs` module

A module based on fsspec.

I am a bit hesitant to do that, TBH, since it's out of the scope of datamol. But it would avoid a lot of the fs module duplication we are starting to have internally, and it is also potentially useful to anyone using datamol, since that often involves manipulating remote data (in the same spirit, we already have dm.utils.parallelized).

The alternative would be to integrate such a module into fsspec. But that: 1) will take more time 2) the module will probably have to be modified in order to fit the fsspec one.

@maclandrol any strong opinion on this?

[ERROR] Runtime.ImportModuleError: Unable to import module 'app': No module named 'sascorer' Traceback (most recent call last):

Posting so it's logged somewhere.

Is this a new dependency for datamol? Disclaimer: I hadn't updated datamol in my project (lambdomics) for a couple of months (I was at 0.5 before upgrading). sascorer seems to have appeared in commit 8576d4b with the descriptors module, and this module is imported by default when using import datamol as dm.

sascorer seems to be a module coming from RDKit but I'm not sure. Is it just a matter of making sure to import RDKit before datamol if you use both in your project?

Add binder badge

The default page should be the basics tutorial, with JupyterLab 3 and nglview set up.
