hachmannlab / chemml

ChemML is a machine learning and informatics program suite for the chemical and materials sciences.

Home Page: https://hachmannlab.github.io/chemml

License: BSD 3-Clause "New" or "Revised" License

Python 99.89% Jupyter Notebook 0.11%

quantum-mechanics materials-informatics data-science machine-learning drug-discovery deep-learning

chemml's Introduction

ChemML

ChemML is a machine learning and informatics program suite for the analysis, mining, and modeling of chemical and materials data. Please check the ChemML website for more information.

ChemML documentation: https://hachmannlab.github.io/chemml

Code Design:

ChemML is developed in the Python 3 programming language and makes use of a host of data analysis and ML libraries (accessible through the Anaconda distribution), as well as domain-specific libraries. The development follows a strictly modular and object-oriented design to make the overall code as flexible and versatile as possible.

The library's interface is similar to that of well-known libraries such as scikit-learn.

Latest Version:

To find out about the latest version and release history, click here.

Installation and Dependencies:

You can download ChemML from PyPI via pip.

pip install chemml --user -U

Here is a list of external libraries that will be installed with chemml:

numpy
pandas
tensorflow
scikit-learn
matplotlib
seaborn
lxml
openpyxl
ipywidgets

We strongly recommend installing ChemML in an Anaconda environment. The instructions to create the environment, install ChemML's dependencies, and then install ChemML from the Python Package Index (PyPI) via pip are as follows:

conda create --name chemml_env python=3.8
source activate chemml_env
conda install -c conda-forge openbabel rdkit nb_conda_kernels python-graphviz
pip install chemml
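
After installation, it can help to confirm which of the listed dependencies pip actually resolved. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of ChemML, and the distribution names are copied from the dependency list above.

```python
from importlib import metadata

# Distribution names taken from the dependency list in this README.
CHEMML_DEPENDENCIES = [
    "numpy", "pandas", "tensorflow", "scikit-learn", "matplotlib",
    "seaborn", "lxml", "openpyxl", "ipywidgets",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["chemml"] + CHEMML_DEPENDENCIES)
    for name, version in report.items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Running this inside the activated chemml_env environment lists one line per package, which makes version conflicts easier to spot before importing ChemML.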

Citation:

Please cite the use of ChemML as:

Main citation:

@article{haghighatlari2019chemml,
    title        = {{ChemML}: A machine learning and informatics program package for the analysis, mining, and modeling of chemical and materials data},
    author       = {Haghighatlari, Mojtaba and Vishwakarma, Gaurav and Altarawy, Doaa and Subramanian, Ramachandran and Kota, Bhargava U and Sonpal, Aditya and Setlur, Srirangaraj and Hachmann, Johannes},
    year         = 2020,
    journal      = {Wiley Interdisciplinary Reviews: Computational Molecular Science},
    publisher    = {Wiley Online Library},
    volume       = 10,
    doi          = {https://doi.org/10.1002/wcms.1458},
    pages        = {e1458},
}


Other references:

@article{chemml_review2019,
author = {Haghighatlari, Mojtaba and Hachmann, Johannes},
doi = {https://doi.org/10.1016/j.coche.2019.02.009},
issn = {2211-3398},
journal = {Current Opinion in Chemical Engineering},
month = {jan},
pages = {51--57},
title = {Advances of machine learning in molecular modeling and simulation},
volume = {23},
year = {2019}
}

@article{Hachmann2018,
author = {Hachmann, Johannes and Afzal, Mohammad Atif Faiz and Haghighatlari, Mojtaba and Pal, Yudhajit},
doi = {10.1080/08927022.2018.1471692},
issn = {10290435},
journal = {Molecular Simulation},
number = {11},
pages = {921--929},
title = {Building and deploying a cyberinfrastructure for the data-driven design of chemical systems and the exploration of chemical space},
volume = {44},
year = {2018}
}

@article{vishwakarma2019towards,
title={Towards autonomous machine learning in chemistry via evolutionary algorithms},
author={Vishwakarma, Gaurav and Haghighatlari, Mojtaba and Hachmann, Johannes},
journal={ChemRxiv preprint},
year={2019}
}

License:

ChemML is copyright (C) 2014-2022 Johannes Hachmann, Mojtaba Haghighatlari, Aditya Sonpal, Gaurav Vishwakarma, and Aatish Pradhan; all rights reserved. ChemML is distributed under the 3-Clause BSD License (https://opensource.org/licenses/BSD-3-Clause).

About us:

Maintainers:

- Johannes Hachmann
- Mojtaba Haghighatlari
- Aditya Sonpal
- Aatish Pradhan

University at Buffalo - The State University of New York (UB)

Contributors:

- Doaa Altarawy (MolSSI): scientific advice and software mentor 
- Aditya Sonpal (UB): graph convolution NNs, XAI
- Aatish Pradhan (UB): autoML and Jupyter GUI developer
- Gaurav Vishwakarma (UB): automated model optimization
- Ramachandran Subramanian (UB): Magpie descriptor library port
- Bhargava Urala Kota (UB): library database
- Srirangaraj Setlur (UB): scientific advice
- Venugopal Govindaraju (UB): scientific advice
- Krishna Rajan (UB): scientific advice


- We encourage contributions and feedback. Feel free to fork the repository and open a pull request against the "development" branch.

Acknowledgements:

- ChemML is based upon work supported by the U.S. National Science Foundation under grant #OAC-1751161 and in part by #OAC-1640867.
- ChemML was also supported by start-up funds provided by UB's School of Engineering and Applied Science and UB's Department of Chemical and Biological Engineering, the New York State Center of Excellence in Materials Informatics through seed grant #1140384-8-75163, and the U.S. Department of Energy under grant #DE-SC0017193.
- Mojtaba Haghighatlari received the 2018 Phase-I and 2019 Phase-II Software Fellowships from the Molecular Sciences Software Institute (MolSSI) for his work on ChemML.

chemml's Issues

ChemML can't read XYZ format as string

My molecules are stored in XYZ format in a pandas DataFrame, and I'm trying to pass them to ChemML one at a time. It looks like I cannot pass an XYZ string directly, although I was under the impression that this was possible.

Here is what I'm trying:
My molecule (just a test):

Ethane = """8
	Energy:      -4.7343653
C          0.10289       -0.52365       -0.00000
C         -1.40917       -0.52366       -0.00000
H          0.48726        0.04224       -0.85384
H          0.48726       -1.54605       -0.06316
H          0.48726       -0.06716        0.91700
H         -1.79354       -0.98015       -0.91700
H         -1.79354        0.49874        0.06316
H         -1.79354       -1.08955        0.85384
"""

And here is the raised error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-ebbc3a0998bf> in <module>()
----> 1 mol = Molecule(Ethane, input_type="xyz")

/home/henrique/.local/lib/python3.6/site-packages/chemml/chem/molecule.py in __init__(self, input, input_type, **kwargs)
    275         self._init_attributes()
    276         self._extra_docs()
--> 277         self._load(input, input_type, **kwargs)
    278 
    279     def __repr__(self):

/home/henrique/.local/lib/python3.6/site-packages/chemml/chem/molecule.py in _load(self, input, input_type, **kwargs)
    422         """
    423         if input_type == 'xyz':
--> 424             self._load_pybel(input, input_type, **kwargs)
    425         elif input_type in ['smiles', 'smarts', 'inchi']:
    426             self._load_rdkit(input, input_type, **kwargs)

/home/henrique/.local/lib/python3.6/site-packages/chemml/chem/molecule.py in _load_pybel(self, input, input_type, **kwargs)
    495             else:
    496                 msg = "The input '%s' is not a valid XYZ input file."%input
--> 497                 raise ValueError(msg)
    498 
    499         if pybel_mol is None:

ValueError: The input '8
	Energy:      -4.7343653
C          0.10289       -0.52365       -0.00000
C         -1.40917       -0.52366       -0.00000
H          0.48726        0.04224       -0.85384
H          0.48726       -1.54605       -0.06316
H          0.48726       -0.06716        0.91700
H         -1.79354       -0.98015       -0.91700
H         -1.79354        0.49874        0.06316
H         -1.79354       -1.08955        0.85384
' is not a valid XYZ input file.

If I provide a real XYZ file (with its path and everything) instead of a string, it works.
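
Since the file-path route is reported to work, one possible workaround is to write each XYZ string to a temporary file first. This is only a sketch under that assumption: `xyz_string_to_file` is a hypothetical helper, not a ChemML function, and the commented `Molecule` call simply mirrors the call shown in the traceback above with a path in place of the string.

```python
import os
import tempfile


def xyz_string_to_file(xyz_text, directory=None):
    """Write an XYZ-format string to a temporary .xyz file and return its path."""
    lines = xyz_text.strip("\n").splitlines()
    n_atoms = int(lines[0].strip())
    # XYZ layout: atom count, one comment line, then one line per atom.
    if len(lines) - 2 != n_atoms:
        raise ValueError(f"expected {n_atoms} atom lines, found {len(lines) - 2}")
    fd, path = tempfile.mkstemp(suffix=".xyz", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return path


# Usage with the Ethane string from this issue (requires ChemML and Open Babel):
# from chemml.chem import Molecule
# path = xyz_string_to_file(Ethane)
# mol = Molecule(path, input_type="xyz")
# os.remove(path)  # clean up once the molecule is loaded
```

The atom-count check is there so that a malformed string fails early with a clear message instead of at parse time.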

from chemml import wrapperRUN

Hi,

When I try "from chemml import wrapperRUN" in the Jupyter interpreter, I get the following error message:

ImportError Traceback (most recent call last)
in
----> 1 from chemml import wrapperRUN

ImportError: cannot import name 'wrapperRUN' from 'chemml' (/Users/amir/.local/lib/python3.8/site-packages/chemml/__init__.py)

How can I resolve this issue?

Thanks,
Amir

Wrong import of pybel

In the module chemml.chem.magpie_python.CompositionEntry there is a line that reads "import pybel". However, newer versions of Open Babel no longer support this syntax; it must read "from openbabel import pybel" instead. See https://open-babel.readthedocs.io/en/latest/UseTheLibrary/migration.html
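
A common way to support both layouts is a guarded import that tries the newer location first. The sketch below is an illustration of that pattern, not the fix adopted by ChemML; `import_pybel` is a hypothetical helper name.

```python
def import_pybel():
    """Return the pybel module under either Open Babel layout, or None if absent."""
    try:
        # Open Babel 3.x: pybel lives inside the openbabel package.
        from openbabel import pybel
    except ImportError:
        try:
            # Open Babel 2.x: pybel is a top-level module.
            import pybel
        except ImportError:
            return None
    return pybel


pybel = import_pybel()
if pybel is None:
    print("Open Babel's pybel bindings are not installed.")
```

Trying the openbabel package first avoids accidentally picking up an unrelated top-level module named pybel when both are present.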

Can we save a trained active learning model?

Does ChemML provide a way to save and load a trained model?

TypeError when loading dataset

I am receiving an error when trying to call load_organic_density (ChemML v1.2):

from chemml.datasets import load_organic_density
smi, density, features = load_organic_density()

Here is the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Untitled-1.ipynb Cell 23 line 3
      1 from chemml.datasets import load_organic_density
----> 3 smi, density, features = load_organic_density()

File ~\AppData\Roaming\Python\Python310\site-packages\chemml\datasets\base.py:87, in load_organic_density()
     85 smi = pd.DataFrame(df['smiles'], columns=['smiles'])
     86 density = pd.DataFrame(df['density_Kg/m3'], columns=['density_Kg/m3'])
---> 87 features = df.drop(['smiles', 'density_Kg/m3'],1)
     89 return smi, density, features

TypeError: DataFrame.drop() takes from 1 to 2 positional arguments but 3 were given
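
The message is consistent with pandas 2.0 and later, where DataFrame.drop accepts only the labels argument positionally and the axis must be passed by keyword. The sketch below reproduces the failing line on a toy frame; the column feature_0 and all values are made up for illustration, and only the two label column names come from the traceback.

```python
import pandas as pd

# Toy frame: the two label columns named in the traceback plus one
# made-up feature column. The real dataset is not reproduced here.
df = pd.DataFrame({
    "smiles": ["CC", "CCO"],
    "density_Kg/m3": [1.0, 2.0],
    "feature_0": [0.1, 0.2],
})

# Name the axis instead of passing it as a second positional argument.
features = df.drop(columns=["smiles", "density_Kg/m3"])
# Equivalent spelling: df.drop(["smiles", "density_Kg/m3"], axis=1)

print(list(features.columns))
```

Both keyword spellings also work on older pandas releases, so the change should not require pinning a pandas version.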

Dependency Issue in Installation

Dear contributors,

When I followed the instructions for installing with Anaconda:
conda create --name chemml_env python=3.8
source activate chemml_env
conda install -c conda-forge openbabel rdkit nb_conda_kernels python-graphviz
pip install chemml

I got the following dependency error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

sqlalchemy 2.0.27 requires typing-extensions>=4.6.0, but you have typing-extensions 4.5.0 which is incompatible.
Successfully installed MarkupSafe-2.1.5 absl-py-2.1.0 appnope-0.1.4 asttokens-2.4.1 astunparse-1.6.3 backcall-0.2.0 cachetools-5.3.2 charset-normalizer-3.3.2 chemml-1.3 comm-0.2.1 decorator-5.1.1 et-xmlfile-1.1.0 executing-2.0.1 flatbuffers-23.5.26 future-0.18.3 gast-0.4.0 google-auth-2.28.0 google-auth-oauthlib-1.0.0 google-pasta-0.2.0 grpcio-1.60.1 h5py-3.10.0 idna-3.6 ipython-8.12.3 ipywidgets-8.1.2 jedi-0.19.1 joblib-1.3.2 jupyterlab-widgets-3.0.10 keras-2.13.1 libclang-16.0.6 lxml-5.1.0 markdown-3.5.2 matplotlib-inline-0.1.6 numpy-1.24.3 oauthlib-3.2.2 openpyxl-3.1.2 opt-einsum-3.3.0 parso-0.8.3 pexpect-4.9.0 pickleshare-0.7.5 prompt-toolkit-3.0.43 protobuf-4.25.3 ptyprocess-0.7.0 pure-eval-0.2.2 pyasn1-0.5.1 pyasn1-modules-0.3.0 pygments-2.17.2 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 scikit-learn-1.3.2 scipy-1.10.1 seaborn-0.13.2 stack-data-0.6.3 tensorboard-2.13.0 tensorboard-data-server-0.7.2 tensorflow-2.13.1 tensorflow-estimator-2.13.0 tensorflow-io-gcs-filesystem-0.34.0 termcolor-2.4.0 threadpoolctl-3.3.0 typing-extensions-4.5.0 urllib3-2.2.1 wcwidth-0.2.13 werkzeug-3.0.1 wget-3.2 widgetsnbextension-4.0.10 wrapt-1.16.0

And when I tried to run the Magpie feature module, I got:

2024-02-19 01:04:32.630354: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "/Users/yanjunliu/Documents/3DSC/magpie_feature.py", line 39, in
print(MAGPIE_features('NaCl'))
File "/Users/yanjunliu/Documents/3DSC/magpie_feature.py", line 20, in MAGPIE_features
features = f.generate_features(entries=all_comps)
File "/Users/yanjunliu/opt/anaconda3/envs/3DSC/lib/python3.8/site-packages/chemml/chem/magpie_python/attributes/generators/composition/ValenceShellAttributeGenerator.py", line 64, in generate_features
n_valence[i] = LookUpData.load_property("N"+s+"Valence")
File "/Users/yanjunliu/opt/anaconda3/envs/3DSC/lib/python3.8/site-packages/chemml/chem/magpie_python/data/materials/util/LookUpData.py", line 161, in load_property
values = np.zeros(len(self.element_ids), dtype=np.float)
File "/Users/yanjunliu/opt/anaconda3/envs/3DSC/lib/python3.8/site-packages/numpy/__init__.py", line 305, in __getattr__
raise AttributeError(former_attrs[attr])
AttributeError: module 'numpy' has no attribute 'float'.
np.float was a deprecated alias for the builtin float. To avoid this error in existing code, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Are these errors related? I just need the related functions in chemml to generate magpie features. Thank you very much!

Best regards,

Yanjun Liu
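
Judging from the logs alone, the two errors look unrelated: the pip warning concerns sqlalchemy and typing-extensions, while the AttributeError comes from the np.float alias, which NumPy removed in release 1.24 (the pip log above shows numpy-1.24.3 being installed). A minimal sketch of the source-level change at the failing line follows; `n_elements` is a stand-in for len(self.element_ids), and this is not a confirmed ChemML patch.

```python
import numpy as np

n_elements = 5  # stand-in for len(self.element_ids) in the traceback

# np.float was only an alias for the builtin float, so this allocates
# the same float64 array on both old and new NumPy releases.
values = np.zeros(n_elements, dtype=float)

print(values.dtype)
```

Installing a NumPy release older than 1.24 in the environment would be another way around the error, though whether that is compatible with the other installed packages is not verified here.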

select the test set for the Active Learning manually

At the moment the EMC active learning package initializes the training and test set randomly using tr_ind, te_ind = al.initialize(). Is it possible to initialize the test set manually?

Z-Score

Hello,

I am wondering whether I can apply your Z-score analysis. I tried to find it in this repository, but had no luck.

Thank you so much,

Can't find pybel despite it being installed and working

I'm unable to load ChemML after following the install instructions:

from openbabel import pybel, openbabel
from chemml.chem import Molecule

The error message:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-8e1fb2bc2fb0> in <module>()
      1 from openbabel import pybel, openbabel
----> 2 from chemml.chem import Molecule

/home/henrique/.local/lib/python3.6/site-packages/chemml/chem/__init__.py in <module>()
      9 """
     10 
---> 11 from .molecule import Molecule
     12 from .molecule import XYZ
     13 from .CoulMat import CoulombMatrix

/home/henrique/.local/lib/python3.6/site-packages/chemml/chem/molecule.py in <module>()
      3 import keras # required to go around the protobuf error after importing pybel prior to tensorflow
      4 from rdkit import Chem
----> 5 import pybel
      6 from rdkit.Chem import AllChem
      7 import warnings

ModuleNotFoundError: No module named 'pybel'

Directly feed ChemML with a RDKit molecule (object)

Hello, thanks for the amazing code!
I'm working with a large pandas DataFrame containing organometallic molecules in the .mol V3000 format. RDKit can read it perfectly, but as far as I can see, ChemML's Molecule class can read SMILES, SMARTS, InChI, and XYZ. I can't convert my molecules to one-line notations because that removes their correct geometry, so here is my question:

Since RDKit is reading the molecules correctly, is there a way to feed ChemML with the molecule objects already interpreted by RDKit?

GeneticAlgorithm function fitness argument does not check number of inputs

The number of elements in the fitness tuple is not checked during GeneticAlgorithm function execution.

Example: when performing single objective minimization, the function is expected to work like

GeneticAlgorithm(..., fitness=('min',), ...)

But it can also work as

GeneticAlgorithm(..., fitness=('min', 'min', 'min', ...), ...)

which would be more appropriate for multi-objective.

The function will still execute and return the single-objective result.
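
One way such a check could look is sketched below. `validate_fitness` is a hypothetical helper, not part of ChemML, and it assumes that 'max' is accepted alongside 'min' and that the number of objectives is known before the run starts.

```python
def validate_fitness(fitness, n_objectives):
    """Check that `fitness` names exactly one direction per objective."""
    if not isinstance(fitness, (tuple, list)):
        raise TypeError("fitness must be a tuple or list of 'min'/'max' strings")
    # Assumption: only 'min' and 'max' are valid directions.
    invalid = [f for f in fitness if f not in ("min", "max")]
    if invalid:
        raise ValueError(f"unknown fitness directions: {invalid}")
    if len(fitness) != n_objectives:
        raise ValueError(
            f"fitness has {len(fitness)} entries but the objective returns "
            f"{n_objectives} value(s)"
        )
    return tuple(fitness)


# Single-objective minimization passes; a three-entry tuple would not.
print(validate_fitness(("min",), n_objectives=1))
```

Raising an error instead of silently ignoring the extra entries would surface the mismatch described above at call time.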