datamol-io / datamol
Molecular Processing Made Easy.
Home Page: https://docs.datamol.io
License: Apache License 2.0
The most obvious is XC50 <-> pXC50. I think it's within the scope of datamol, but you could argue the opposite.
Let me know what you think @maclandrol
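For reference, a minimal sketch of what such a conversion could look like. The function names are hypothetical (not part of datamol), and the sketch assumes XC50 values are expressed in molar units, so that pXC50 = -log10(XC50):

```python
import math


def xc50_to_pxc50(xc50_molar: float) -> float:
    """Convert an XC50 value (assumed to be in mol/L) to pXC50."""
    if xc50_molar <= 0:
        raise ValueError("XC50 must be strictly positive")
    return -math.log10(xc50_molar)


def pxc50_to_xc50(pxc50: float) -> float:
    """Convert a pXC50 value back to XC50 in mol/L."""
    return 10.0 ** (-pxc50)


# 1 µM expressed in molar units
print(xc50_to_pxc50(1e-6))
print(pxc50_to_xc50(6.0))
```

A real implementation would also need a unit argument (nM, µM, ...), since XC50 values are rarely stored in molar.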
opts = StereoEnumerationOptions(unique=True)
We should backport the function from openff-toolkit for that:
Include includePrivate, includeComputed from Mol.GetPropsAsDict(). Leave default the same as rdkit.
This line of code raises an error if the element is not a sequence when using a batch size. This prevents the code from running using numpy arrays or tensors.
The best would be to keep hosting the doc on GH. We can try to use the mkdocs mike plugin.
If that does not work, let's switch to readthedocs.
See also #21
Found this curation pipeline from chembl very useful and worth being a default curation step.
Paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7458899/#CR28
Code: https://github.com/chembl/ChEMBL_Structure_Pipeline
Do you think it's worth adding or reimplement in datamol?
See title.
import datamol as dm
import pandas as pd


def calc_ap(mol: dm.Mol):
    """Calculate aromatic proportion #aromatic atoms/#atoms total."""
    aromatic_query = dm.from_smarts("a")
    matches = mol.GetSubstructMatches(aromatic_query)
    return len(matches) / mol.GetNumAtoms()


def compute_esol(data: pd.DataFrame):
    """Compute esol from
    https://github.com/PatWalters/solubility/blob/d1536c58afe5e0e7ac4c96e2ffef496d5b98664b/esol.py
    Dataframe must have the following columns: mol, clogp, mw, n_rotatable_bonds
    """
    intercept = 0.26121066137801696
    coef = {
        "mw": -0.0066138847738667125,
        "clogp": -0.7416739523408995,
        "n_rotatable_bonds": 0.003451545565957996,
        "ap": -0.42624840441316975,
    }
    # Compute aromatic proportion
    aromatic_proportion = data["mol"].apply(calc_ap)
    # Compute ESOL
    esol = (
        intercept
        + coef["clogp"] * data["clogp"]
        + coef["mw"] * data["mw"]
        + coef["n_rotatable_bonds"] * data["n_rotatable_bonds"]
        + coef["ap"] * aromatic_proportion
    )
    return esol
This needs to be adapted for use in the dm.descriptors module.
ping @craigmichaelm
In dm.io.read_sdf the type of urlpath is Union[str, pathlib.Path, TextIO] but it should be Union[str, os.PathLike, TextIO]. Same for other functions.
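A short sketch of why os.PathLike is the broader annotation: any object implementing __fspath__ (including pathlib.Path) satisfies it. The helper name resolve_urlpath and the RemotePath class are hypothetical, used only for illustration:

```python
import io
import os
import pathlib
from typing import TextIO, Union

PathOrFile = Union[str, os.PathLike, TextIO]


class RemotePath:
    """Hypothetical path-like object that is not a pathlib.Path."""

    def __init__(self, path: str):
        self._path = path

    def __fspath__(self) -> str:
        return self._path


def resolve_urlpath(urlpath: PathOrFile) -> Union[str, TextIO]:
    """Return a string path for path-like inputs; pass file objects through."""
    if isinstance(urlpath, (str, os.PathLike)):
        return os.fspath(urlpath)
    return urlpath


print(resolve_urlpath("mols.sdf"))
print(resolve_urlpath(pathlib.Path("mols.sdf")))
print(resolve_urlpath(RemotePath("mols.sdf")))
```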
In dm.to_fp, add a fp_type argument in order to choose which fingerprint you want to compute from http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html
so one column contains the rdkit mols (when loading from an SDF, inject the molecule read by rdkit directly so conformers are preserved).
The RDKit SDF/MDL parser does not parse all atom properties specified using the MDL format.
Implementation resources:
I feel like the progress bar is filled upon job submission and not when a job has been terminated. I fixed that in a very early version of JobRunner that wasn't relying on joblib.
The Quick API tour has the following call:
mol = dm.standardized_mol(mol)
However, it fails because datamol has no attribute standardized_mol.
I did not find any reference to standardized_mol in the code. Is it maybe a planned feature?
ping @MichelML @maclandrol
Test the new version of rdkit and re-enable CI matrix once the repo is public.
See https://github.com/datamol-org/datamol/security
Make more tutorials:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import DataStructs
import numpy as np
mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
fp_vec = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    mol,
    radius=2,
    nBits=2048,
    useFeatures=True,
)
%%timeit -n 1000 -r 1000
# 10.1 µs ± 3.62 µs per loop (mean ± std. dev. of 1000 runs, 1000 loops each)
np.frombuffer(fp_vec.ToBitString().encode(), 'u1') - ord('0')
%%timeit -n 1000 -r 100
# 39.6 µs ± 1.45 µs per loop (mean ± std. dev. of 100 runs, 1000 loops each)
fp = np.zeros((0,), dtype=int)
DataStructs.ConvertToNumpyArray(fp_vec, fp)
On my machine np.frombuffer is 4 times faster than ConvertToNumpyArray. This is a pretty quick benchmark so I might also be missing something.
An error is raised when you provide total to tqdm_kwargs. It's convenient to be able to specify it when processing a very large dataframe, for example, since the most efficient way is to iterate over the rows (meaning the length of the iterator is not known).
Instead of:
rows = data.iterrows()
rows = list(rows) # memory and CPU intensive
data = dm.parallelized(process_fn, rows, n_jobs=-1, progress=True, arg_type="args")
you would do:
data = dm.parallelized(process_fn, data.iterrows(), n_jobs=-1, progress=True, arg_type="args", total=len(data))
smiles = "CCCC"
mol = dm.to_mol(smiles)
dm.conformers.generate(mol, n_confs=1, minimize_energy=True, rms_cutoff=None)
and
They both fail when doing [mol.AddConformer(conf, assignId=True) for conf in confs], probably because RemoveAllConformers() is called just before.
It's a null pointer exception.
Probably as a bool flag in dm.to_smiles and dm.to_mol? Or in a separate function.
For large datasets. See https://gist.github.com/rtavenar/a4fb580ae235cc61ce8cf07878810567 for a snippet.
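import datamol as dm
from rdkit import Chem

# Explicit submodule imports so Chem.Crippen and Chem.rdMolDescriptors resolve below
from rdkit.Chem import Crippen, rdMolDescriptors  # noqa: F401


def _in_range(x, min_val: float = -float("inf"), max_val: float = float("inf")):
    """Check if a value is in a range
    Args:
        x: value to check
        min_val: minimum value
        max_val: maximum value
    """
    return min_val <= x <= max_val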


def _compute_ring_system(mol: dm.Mol, include_spiro: bool = True):
    """
    Compute the list of ring systems in a molecule. This is based on RDKit's cookbook:
    https://www.rdkit.org/docs/Cookbook.html#rings-aromaticity-and-kekulization
    # EN: move to datamol
    Args:
        mol: input molecule
        include_spiro: whether to include spiro rings. Defaults to True.
    Returns:
        ring_system: list of ring systems
    """
    ri = mol.GetRingInfo()
    systems = []
    for ring in ri.AtomRings():
        ringAts = set(ring)
        nSystems = []
        for system in systems:
            nInCommon = len(ringAts.intersection(system))
            if nInCommon and (include_spiro or nInCommon > 1):
                ringAts = ringAts.union(system)
            else:
                nSystems.append(system)
        nSystems.append(ringAts)
        systems = nSystems
    return systems


def _compute_charge(mol: dm.Mol):
    """
    Compute the charge of a molecule.
    Args:
        mol: input molecule
    Returns:
        charge: charge of the molecule
    """
    return Chem.rdmolops.GetFormalCharge(mol)


def _compute_refractivity(mol: dm.Mol):
    """
    Compute the refractivity of a molecule.
    Args:
        mol: input molecule
    Returns:
        mr: molecular refractivity of the molecule
    """
    return Chem.Crippen.MolMR(mol)


def _compute_rigid_bonds(mol: dm.Mol):
    """
    Compute the number of rigid bonds in a molecule.
    Args:
        mol: input molecule
    Returns:
        n_rigid_bonds: number of rigid bonds in the molecule
    """
    # rigid bonds are bonds that are not (single and not in rings)
    # I don't think this is the same as the number of non rotatable bonds ?
    # EN: move this to datamol ?
    non_rigid_bonds_count = dm.from_smarts("*-&!@*")
    n_rigid_bonds = mol.GetNumBonds() - len(
        mol.GetSubstructMatches(non_rigid_bonds_count)
    )
    return n_rigid_bonds


def _compute_n_stereo_center(mol: dm.Mol):
    """
    Compute the number of stereocenters in a molecule.
    Args:
        mol: input molecule
    Returns:
        n_stereo_center: number of stereocenters in the molecule
    """
    n_stereo_center = 0
    try:
        Chem.FindPotentialStereo(mol, cleanIt=False)  # type: ignore
        n_stereo_center = Chem.rdMolDescriptors.CalcNumAtomStereoCenters(mol)
    except Exception:
        # Fall back to 0 when stereo perception fails
        pass
    return n_stereo_center


def _compute_n_charged_atoms(mol: dm.Mol):
    """
    Compute the number of charged atoms in a molecule.
    Args:
        mol: input molecule
    Returns:
        n_charged_atoms: number of charged atoms in the molecule
    """
    return sum([at.GetFormalCharge() != 0 for at in mol.GetAtoms()])
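To make the include_spiro flag in the ring-system helper above more concrete, here is a dependency-free sketch of the same merging logic. It works on plain tuples of atom indices instead of an RDKit molecule; the function name merge_ring_systems and the two toy ring lists are assumptions made for illustration only:

```python
from typing import Iterable, List, Sequence, Set


def merge_ring_systems(
    rings: Iterable[Sequence[int]], include_spiro: bool = True
) -> List[Set[int]]:
    """Merge rings that share atoms into ring systems.

    Rings sharing a single atom (spiro junction) are merged only when
    include_spiro is True; rings sharing two or more atoms are always merged.
    """
    systems: List[Set[int]] = []
    for ring in rings:
        ring_atoms = set(ring)
        remaining: List[Set[int]] = []
        for system in systems:
            n_common = len(ring_atoms & system)
            if n_common and (include_spiro or n_common > 1):
                ring_atoms |= system
            else:
                remaining.append(system)
        remaining.append(ring_atoms)
        systems = remaining
    return systems


# Two six-membered rings sharing a bond (atoms 4 and 5): fused
fused = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8, 9)]
# Two four-membered rings sharing only atom 3: spiro
spiro = [(0, 1, 2, 3), (3, 4, 5, 6)]

print(len(merge_ring_systems(fused)))
print(len(merge_ring_systems(spiro, include_spiro=True)))
print(len(merge_ring_systems(spiro, include_spiro=False)))
```

With these toy inputs the fused pair always collapses into one system, while the spiro pair is reported as one or two systems depending on the flag.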
We are looking into generating a unique molecular ID that will give different IDs if two molecules are different tautomeric forms.
Features should be:
An elegant way would be to use a non-standard InChI that includes a fixed-hydrogen layer /f. See https://en.wikipedia.org/wiki/International_Chemical_Identifier for details.
Here is a proof of concept that it works:
import datamol as dm
from rdkit import Chem
# SMILES: "N=C(N)O"
inchi3 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2,4H,3H2/b2-1?"
# SMILES: "NC(N)=O"
inchi4 = "InChI=1/CH4N2O/c2-1(3)4/h(H4,2,3,4)/f/h2-3H2"
mol3 = dm.from_inchi(inchi3)
mol4 = dm.from_inchi(inchi4)
print(dm.to_smiles(mol3)) # N=C(N)O
print(dm.to_smiles(mol4)) # NC(N)=O
# generated inchi and inchikey remain the same since the hydrogen layer has been removed
print(dm.to_inchi(mol3)) # 'InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)'
print(dm.to_inchi(mol4)) # 'InChI=1S/CH4N2O/c2-1(3)4/h(H4,2,3,4)'
print(dm.to_inchikey(mol3)) # 'XSQUKJJJFZCRTK-UHFFFAOYSA-N'
print(dm.to_inchikey(mol4)) # 'XSQUKJJJFZCRTK-UHFFFAOYSA-N'
# when computing the inchikey directly from the non-standard inchi with the hydrogen layer
# those are different
print(Chem.inchi.InchiToInchiKey(inchi3)) # XSQUKJJJFZCRTK-ZIALIONUNA-N
print(Chem.inchi.InchiToInchiKey(inchi4)) # XSQUKJJJFZCRTK-UBUOBULFNA-N
But currently I haven't found a way to actually generate that hydrogen layer with rdkit. It seems to me that rdkit can only read the hydrogen layer and not generate it. I think this is on purpose, so that people stick to the standard InChI, since using /f is non-standard.
To make it work we would have to find a way to compute the hydrogen layer, append it to the InChI, and then use Chem.inchi.InchiToInchiKey(inchi3) to compute the InChIKey from the InChI.
This is just a proposal; feel free to suggest alternative ideas.
It will allow SMILES or SMARTS without a correct aromaticity perception to be plotted without throwing an error.
I also get another problem when using batch_size with the JobRunner.
My expectation is that, given a list of 1000 elements and batch_size=10, the job runner should automatically split the list into 100 sub-lists of 10 elements each. Then, I expect each sub-list to be processed sequentially.
However, instead of sequentially passing each element of the sub-list into the function, it passes the full sub-list. Thus, if the function doesn't handle lists, it crashes.
Maybe this is done on purpose, but I don't think it's the right behavior.
To reproduce:
import datamol as dm
def fun(a):
    print(a)
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
dm.utils.parallelized(fun, a, batch_size=2)
# RESULT
[0, 1]
[2, 3]
[6, 7]
[8, 9]
[4, 5]
# EXPECTED (in any order)
0
1
3
4
6
7
5
2
9
8
What do you think @maclandrol ?
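A minimal, sequential sketch of the expected behavior, assuming the runner splits the input into batches and then applies the function to each element of a batch rather than to the batch itself. The helper names are hypothetical and this is not the JobRunner implementation:

```python
from typing import Any, Callable, List, Sequence


def make_batches(items: Sequence[Any], batch_size: int) -> List[Sequence[Any]]:
    """Split items into consecutive chunks of at most batch_size elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


def apply_per_element(fn: Callable[[Any], Any], batch: Sequence[Any]) -> List[Any]:
    """What a worker would do with one batch: call fn on each element in turn."""
    return [fn(element) for element in batch]


def run_batched(fn: Callable[[Any], Any], items: Sequence[Any], batch_size: int) -> List[Any]:
    """Process all batches one after another and flatten the results."""
    results: List[Any] = []
    for batch in make_batches(items, batch_size):
        results.extend(apply_per_element(fn, batch))
    return results


a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(make_batches(a, 2))
print(run_batched(lambda x: x * 2, a, batch_size=2))
```

Slicing also works on numpy arrays, so batching this way would not require the input to be a Python list. In a parallel setting, apply_per_element is the part that would be submitted to each worker.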
I was wondering if a search api would be suited for datamol.
A starting API along the lines of:
from typing import Tuple, Union

import datamol as dm


def sim_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    threshold: float = 0.7,
    fp_type: str = "ffp2",
) -> list[Tuple[dm.Mol, float]]:
    pass


def substruct_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    atoms_and_bonds: bool = False,
) -> Union[list[dm.Mol], list[Tuple[dm.Mol, list[int], list[int]]]]:
    pass


def exact_search(
    mols: list[dm.Mol],
    qmol: list[dm.Mol],
    unique: bool = True,
) -> Union[dm.Mol, list[dm.Mol]]:
    pass
What do you think?
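As a rough illustration of what the threshold argument of sim_search could mean, here is a dependency-free sketch using Tanimoto similarity. It assumes fingerprints are already computed and represented as sets of on-bit indices; the function names and the toy fingerprints are hypothetical, and a real implementation would work on molecules and RDKit fingerprints instead:

```python
from typing import List, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sim_search_sketch(
    fps: Sequence[Set[int]],
    query_fp: Set[int],
    threshold: float = 0.7,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs above the threshold, most similar first."""
    scored = [(i, tanimoto(fp, query_fp)) for i, fp in enumerate(fps)]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints, for illustration only
library = [{1, 2, 3, 4}, {1, 2, 3}, {7, 8, 9}]
hits = sim_search_sketch(library, {1, 2, 3, 4}, threshold=0.7)
print(hits)  # [(0, 1.0), (1, 0.75)]
```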
The current parallelized function is based on joblib, which has pros but also cons. For example, joblib tunes the batch size automatically, which is nice.
The two major cons are: 1) it's not possible to simply call logger.info (it's probably possible with some tweaking) 2) exceptions are not propagated, which makes debugging harder (you can always go back to n_jobs=1).
Here is a snippet that uses concurrent.futures as a backend instead of joblib:
from typing import Callable
from typing import Sequence
from typing import Any
from typing import Optional
from concurrent import futures

from loguru import logger


def parallelized(
    fn: Callable,
    inputs_list: Sequence[Any],
    scheduler: str = "processes",
    n_jobs: Optional[int] = -1,
    worker_id_key: Any = None,
    progress: bool = True,
    progress_auto: bool = True,
    log_when_done: bool = False,
):
    if not isinstance(inputs_list[0], dict):
        raise ValueError(f"inputs_list elements must be a dict not {type(inputs_list[0])}")

    if progress_auto:
        from tqdm.auto import tqdm
    else:
        from tqdm import tqdm

    # concurrent.futures rejects max_workers <= 0, so map joblib-style
    # values such as -1 to None (the executor's default worker count).
    max_workers = n_jobs if n_jobs is not None and n_jobs > 0 else None

    if scheduler == "processes":
        pool_executor_cls = futures.ProcessPoolExecutor
        executor_kwargs = dict(max_workers=max_workers)
        as_completed_fn = futures.as_completed
    elif scheduler == "threads":
        pool_executor_cls = futures.ThreadPoolExecutor
        executor_kwargs = dict(max_workers=max_workers)
        as_completed_fn = futures.as_completed
    else:
        raise ValueError(f"Wrong scheduler: {scheduler}")

    if worker_id_key:
        worker_id_fn = lambda x: x[worker_id_key]
    else:
        worker_id_fn = lambda x: x

    with pool_executor_cls(**executor_kwargs) as executor:
        futures_list = {executor.submit(fn, **kwargs): kwargs for kwargs in inputs_list}
        results = []
        tqdm_args = {}
        tqdm_args["total"] = len(inputs_list)
        tqdm_args["disable"] = not progress
        # Results are collected in completion order, not submission order
        for i, future in enumerate(tqdm(as_completed_fn(futures_list), **tqdm_args)):
            kwargs = futures_list[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as ex:
                # loguru attaches tracebacks through opt(exception=...), not exc_info
                logger.opt(exception=ex).info(f"Error running the following worker '{worker_id_fn(kwargs)}':")
                logger.info("Shutting down the execution...")
                executor.shutdown(True)
                raise
            else:
                if log_when_done:
                    logger.info(f"Execution done for '{worker_id_fn(kwargs)}' ({i+1}/{len(inputs_list)})")
    return results
It only supports kwargs as input type.
joblib also does a lot of useful magic under the hood that concurrent.futures is not doing, so if we want to ship that function we could add it as dm.parallelized2 (or similar).
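To illustrate the exception-propagation point, here is a stripped-down, standard-library-only sketch of the same pattern (kwargs-only inputs, futures.as_completed). The names run_kwargs_jobs and divide are hypothetical, and a thread pool is used only to keep the example self-contained:

```python
from concurrent import futures
from typing import Any, Callable, Dict, List, Sequence


def run_kwargs_jobs(
    fn: Callable[..., Any],
    inputs_list: Sequence[Dict[str, Any]],
    max_workers: int = 2,
) -> List[Any]:
    """Run fn(**kwargs) for each kwargs dict; re-raise the first worker error."""
    results: List[Any] = []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_kwargs = {executor.submit(fn, **kwargs): kwargs for kwargs in inputs_list}
        for future in futures.as_completed(future_to_kwargs):
            # future.result() re-raises any exception raised inside the worker
            results.append(future.result())
    return results


def divide(a: float, b: float) -> float:
    return a / b


# Results arrive in completion order, so sort before comparing
print(sorted(run_kwargs_jobs(divide, [{"a": 4, "b": 2}, {"a": 9, "b": 3}])))

try:
    run_kwargs_jobs(divide, [{"a": 1, "b": 0}])
except ZeroDivisionError as ex:
    print(f"Propagated from worker: {ex!r}")
```

The original exception type reaches the caller, so it can be caught or debugged as in the sequential case.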
See https://greglandrum.github.io/rdkit-blog/conformers/exploration/2021/01/31/looking-at-random-coordinate-embedding.html for the rationale.
What do you guys think? @maclandrol @JDavid04
Based on manu's gist: https://gist.github.com/maclandrol/dd5695299107c1e167d6d1f671a786d1
A module based on fsspec.
I am a bit hesitant about doing that TBH since it's out of the scope of datamol. But it would actually avoid a lot of the fs module duplication we are starting to have internally, and it is also potentially useful to anyone using datamol since it often involves manipulating remote data (in the same spirit, we already have dm.utils.parallelized).
The alternative would be to integrate such a module into fsspec. But it: 1) will take more time 2) will probably have to be modified in order to fit the fsspec one.
@maclandrol any strong opinion on this?
It's probably easy to do as there are a couple of independent for loops in the function.
see https://github.com/datamol-org/datamol/community . Missing are:
It's mol: str, while it should be mol: Union[str, dm.Mol].
An update for viz.to_image was introduced in 0.6.0. But there was this rdkit bug, rdkit/rdkit#3101, which was only fixed in Release_2021.09.1.
https://github.com/datamol-org/datamol/blob/cd4561b7e8cdfb1c1ccc034116d6f900cc957b11/env.yml#L24
[ERROR] Runtime.ImportModuleError: Unable to import module 'app': No module named 'sascorer' Traceback (most recent call last):
Posting so it's logged somewhere.
Is this a new dependency for datamol? Disclaimer: I haven't updated datamol in my project (lambdomics) for a couple of months (I was at 0.5 before upgrading). sascorer seems to have appeared in commit 8576d4b with the descriptors module, and this module is imported by default when using import datamol as dm.
sascorer seems to be a module coming from RDKit but I'm not sure. Is it just a matter of making sure to import RDKit before datamol if you use both in your project?
Same as dm.utils.fs.copy but for files (we should use dm.utils.fs.copy under the hood).
As an option.
see #93 (comment) for context
With default page to the basics tutorials with jlab 3 and nglview setup.
Using JobRunner only allows propagating joblib kwargs. It's currently not possible to propagate tqdm kwargs. This might be useful for args such as leave or desc.