
paucablop / chemotools

44.0 44.0 6.0 31.07 MB

Integrate your chemometric tools with the scikit-learn API πŸ§ͺ πŸ€–

Home Page: https://paucablop.github.io/chemotools/

License: MIT License

Python 99.98% Ruby 0.01% TeX 0.02%
artificial-intelligence autoencoders chemometrics deep-learning hacktoberfest ir-spectroscopy machine-learning multivariate-analysis nir-spectroscopy python raman-spectroscopy scikit-learn sklearn spectra spectroscopy

chemotools's Introduction

Hi there πŸ‘‹ I am Pau Cabaneros (paucablop)!

πŸ‘€ I love modelling biological 🧬 and chemical βš› systems to better understand how they can be used to help society 🌍

πŸ”­ I am currently working as a process analytical technology scientist and data scientist in the biotech industry πŸ‘¨β€πŸ”¬

β™₯ I enjoy open source projects and sharing knowledge through code πŸ“š

🌱 I am working to improve my devops and automation skills πŸ€Ήβ€β™€οΈ

ResearchGate Profile


chemotools's People

Contributors

dependabot[bot], paucablop


chemotools's Issues

SPC file format reader

Are there any plans to include an SPC file format reader? Or can you recommend a different Python OSS library that can be used with chemotools?

Improve compatibility with the `set_output` API from `scikit-learn`

Description

All the transformers from chemotools are compatible with scikit-learn; that is the objective of chemotools πŸ‘. One of the most recent releases of scikit-learn introduced the set_output API, which allows the user to select pandas as the output container. This produces a pandas.DataFrame object as output instead of the default numpy.ndarray. This works fine with most chemotools transformers, but there are some specific issues:

πŸ‘‰ The column names are lost after the transformation

When I use a chemotools transformer set up to produce a pandas.DataFrame, it does not keep the column names and produces an output without them. I have compared this with other scikit-learn transformers (such as StandardScaler()), and they do keep the column names in the output.

πŸ‘‰ The API does not work when the transformer reduces the number of features

Some transformers reduce the number of features of the dataset (e.g., they select a subset of its columns). These are the variable selection transformers. I don't really know how to fix this issue.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve compatibility with the set_output API.

How to Contribute

Here are the contributing guidelines.

Contact

[We can have the conversation in the Issue or the Discussion](#45)

Resources

πŸ‘‰ Link to the set_output API from scikit-learn

πŸ‘‰ Link to the problem description

🧰 Add initial `Makefile` version and tiny test CI pipeline

With a simple Makefile, commands like make test could be used to trigger all the tests with the correct specification of arguments, instead of a very long command like pytest --cov=chemotools .\tests -n=auto --cov-report html -x.
Another example would be make install-dev for python -m pip install --upgrade pip setuptools wheel -r .\requirements-dev.txt.

As an added bonus, that would enable the setup of a simple test CI pipeline that runs the tests whenever a pull request on main is opened or updated. Ideally, the CI pipeline would test multiple Python versions at the same time.
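A minimal sketch of what such a Makefile could look like, built from the commands above (treat the exact paths and flags as assumptions):

```makefile
.PHONY: install-dev test

install-dev:
	python -m pip install --upgrade pip setuptools wheel
	python -m pip install -r requirements-dev.txt

test:
	pytest --cov=chemotools tests -n=auto --cov-report html -x
```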

Improve `WhittakerSmooth`, `AirPLS`, and `ArPLS` performance - sparse matrix operations

Description

AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the csc_matrix representation from scipy.sparse to optimize performance, but further improvements are needed.

Improving Attempts

To improve the performance, I have tried just-in-time compilation of some key functions using numba. However, numba does not support the csc_matrix type, so I cannot JIT compile the code. To overcome this, I looked for a numba-compatible representation of sparse matrices, but could not find one. Therefore, I created my own, together with some functions for basic algebra operations on it (code in this Gist). Unfortunately, this did not improve the performance over the current implementation.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms.

Here are some ideas to work on:

  • Find a more efficient way to JIT compile the code using tools like numba.
  • Investigate parallel or distributed computing techniques to speed up the processing of multiple spectra.
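For orientation, the computational core shared by both algorithms is a weighted Whittaker system solved with sparse matrices. A minimal sketch of that core (uniform weights assumed; this is not the chemotools implementation):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, lam=1e4, d=2):
    """Solve (W + lam * D.T @ D) z = W y with sparse matrices throughout.

    W is the identity here (uniform weights); AirPLS/ArPLS iteratively
    reweight W, re-solving this same system each iteration.
    """
    n = y.size
    D = sparse.eye(n, format="csc")
    for _ in range(d):              # build the d-th order difference matrix
        D = D[1:] - D[:-1]
    W = sparse.eye(n, format="csc")
    A = sparse.csc_matrix(W + lam * (D.T @ D))
    return spsolve(A, W @ y)
```

Note that the system matrix is banded with bandwidth 2d + 1, so a banded Cholesky solver such as scipy.linalg.solveh_banded is a promising route to beat the general sparse solve.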

How to Contribute

Here are the contributing guidelines.

Contact

We can have the conversation in the Issue or the Discussion.

Resources

Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms:

  • Paper on AirPLS: Z.-M. Zhang, S. Chen, and Y.-Z. Liang, Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135 (5), 1138-1146 (2010).
  • Paper on ArPLS: S.-J. Baek, A. Park, Y.-J. Ahn, and J. Choo, Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst 140 (1), 250-257 (2015).

Harmonize naming convention with other libraries for better LLM integration

Some of the functions implemented in chemotools are similar to functions available in other libraries, for example the Savitzky-Golay filter, which is inherited from scipy. The argument names of the Savitzky-Golay method in chemotools should be the same as in scipy, so that LLMs have a better chance of making the right suggestions.
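For reference, these are scipy's canonical parameter names for the filter (the signal below is made up for illustration):

```python
import numpy as np
from scipy.signal import savgol_filter

y = np.sin(np.linspace(0, 2 * np.pi, 50))

# scipy's argument names: window_length, polyorder, deriv, delta, mode
z = savgol_filter(y, window_length=11, polyorder=3, deriv=0, mode="interp")
```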

ModuleNotFoundError when loading from chemotools.variable_selection import RangeCut

I wanted to try your Coffee Spectra Classifier exercise in google colab but when I run the line

from chemotools.variable_selection import RangeCut

I receive the following error.


ModuleNotFoundError Traceback (most recent call last)
in <cell line: 7>()
5 from chemotools.scatter import StandardNormalVariate
6
----> 7 from chemotools.variable_selection import RangeCut
8 pipeline = make_pipeline(
9 StandardNormalVariate(),

ModuleNotFoundError: No module named 'chemotools.variable_selection'


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Can you please help?
Thank you

Replace check_array with ._validate_data()

Right now, the input to each method is checked by a custom function called check_input. This function uses the check_array() function from sklearn.utils. However, it is better practice to use the ._validate_data() method inherited from the BaseEstimator class. In addition, ._validate_data() will also set n_features_in_, so we do not have to do it explicitly, making the code more readable.
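A minimal sketch of the pattern (the transformer and its options are illustrative only; note that very recent scikit-learn releases supersede the private ._validate_data() with a public validate_data helper):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ExampleTransformer(TransformerMixin, BaseEstimator):
    """Illustrative only, not a chemotools class."""

    def fit(self, X, y=None):
        # _validate_data checks the array and sets n_features_in_ for us
        X = self._validate_data(X, ensure_2d=True, dtype=np.float64)
        return self

    def transform(self, X):
        # reset=False re-checks that the feature count matches fit time
        X = self._validate_data(X, reset=False, dtype=np.float64)
        return X
```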

Add docstrings

Add docstrings to the following methods:

Scatter

  • Multiplicative signal correction
  • Standard normal variate

Baseline

  • AirPls
  • ArPls
  • Constant baseline correction
  • Cubic spline correction
  • Linear correction
  • Non negativity
  • Polynomial correction
  • Subtract reference

Derivative

  • Norris-Williams
  • Savitzky-Golay

Scale

  • Index scaler
  • Min-max scaler
  • Norm scaler

Smooth

  • Mean filter
  • Median filter
  • Savitzky-Golay filter
  • Whittaker smooth

Variable selection

  • Range cut

Add `pybaseline` support

πŸšΆβž‘οΈπŸƒ Proposed Enhancement

We could add a very general-purpose estimator that integrates the pybaseline package for elaborate baseline correction. The package is great and offers a variety of baseline algorithms, among them erPLS for automated selection of the smoothing parameter.
However, it is purely function-based, and this is where chemotools comes in.

πŸ§‘β€πŸ’» Implementation Details

If we keep a general estimator like

PyBaselineCorrection(algorithm, algorithm_kwargs)

we could integrate the full package with just a single estimator.
We can make the algorithm_kwargs dictionary- and also dataclass-based, because the latter saves the user from having to read the documentation of two packages at the same time πŸ€” If we end up liking an algorithm a lot, we can consider promoting it to its own dedicated estimator.
It will be a lot of tedious copying of specifications, but if that enables a great feature, why not?
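A rough sketch of what the single estimator could look like. Everything here is an assumption: the class name comes from the proposal above, and the calls follow the class-based Baseline API of the pybaselines package:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PyBaselineCorrection(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: one estimator for all pybaselines algorithms."""

    def __init__(self, algorithm="asls", algorithm_kwargs=None):
        self.algorithm = algorithm
        self.algorithm_kwargs = algorithm_kwargs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        from pybaselines import Baseline  # optional dependency, imported lazily

        fitter = Baseline()
        method = getattr(fitter, self.algorithm)
        kwargs = self.algorithm_kwargs or {}
        X = np.atleast_2d(np.asarray(X, dtype=float))
        corrected = np.empty_like(X)
        for i, row in enumerate(X):
            baseline, _ = method(row, **kwargs)  # each method returns (baseline, params)
            corrected[i] = row - baseline
        return corrected
```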

Enable automated window size determination of Savitzky-Golay and Mean filter (maybe also Median filter)

πŸšΆβž‘οΈπŸƒ Proposed Enhancement

Today, I answered this StackExchange question on the Savitzky-Golay filter, and thereby figured out that automated window selection for the Savitzky-Golay filter (and thus the mean filter, because it is a Savitzky-Golay filter with polynomial degree 0 😱) is easily possible via cross-validation.
After WhittakerSmooth gets its facelift (#120), which enables automated smoothing, I think this is a natural way to proceed. With a bit of trickery, the median filter can get this update as well, but the numerics might be a bit more involved.
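Because the Savitzky-Golay filter is a linear smoother, leave-one-out errors can be estimated from the central filter coefficient without refitting. A sketch of that idea (the helper names are made up, and edge effects are ignored):

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

def loocv_score(y, window_length, polyorder):
    # for a linear smoother, LOO residuals are (y - yhat) / (1 - h_ii);
    # here the leverage h_ii is the central filter coefficient
    h = savgol_coeffs(window_length, polyorder)[window_length // 2]
    yhat = savgol_filter(y, window_length, polyorder)
    return np.mean(((y - yhat) / (1.0 - h)) ** 2)

def select_window(y, windows, polyorder=2):
    """Pick the window with the lowest estimated leave-one-out error."""
    return min(windows, key=lambda w: loocv_score(y, w, polyorder))
```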

Implement successor of Savitzky-Golay: The Modified Sinc Smoother

πŸšΆβž‘οΈπŸƒ Proposed Enhancement

In 2023, some researchers put the Savitzky-Golay filter to the test and figured out that, despite its widespread use, it is actually far from being a good smoother.
Their publication Why and How Savitzky-Golay Filters Should Be Replaced introduces a new smoother called the "Modified Sinc Smoother", which addresses all the disadvantages of the Savitzky-Golay filter, first and foremost that it does many things except fully remove noise. The publication really goes deep into the topic and explains the reasons quite nicely.

Given this, an implementation of the Modified Sinc Smoother would be a nice addition to chemotools. It would be a light-weight smoother (not heavy like WhittakerSmooth) with excellent smoothing capabilities.
Fun fact: the Savitzky-Golay and Modified Sinc smoothers share the same parameters (window_length and polyorder; the publication argues that polyorder is the wrong naming convention and should be poly_degree), so the Modified Sinc Smoother could become a simple drop-in replacement for the Savitzky-Golay filter.

πŸ§‘β€πŸ’» Implementation aspects

Both the Savitzky-Golay and the Modified Sinc smoothers belong to the class of so-called Finite Impulse Response (FIR) filters.
Their main principle is to convolve the signal with a filter kernel; the only thing they differ in is the shape of the kernel, so both smoothers can share the same base class. Each one then only provides the computation of its filter coefficients, which keeps the transformer classes lean while reusing a lot of logic that only has to be tested once.
All that's required is a common interface where the individual filters hand over their respective filter coefficients, as depicted below.

(diagram: FIR filters sharing a common base class via their filter coefficients)
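The shared-base idea could look roughly like this (the class names are hypothetical, not an agreed design):

```python
import numpy as np
from scipy.signal import savgol_coeffs

class BaseFIRSmoother:
    """Common convolution logic; subclasses only supply their kernel."""

    def _coefficients(self):
        raise NotImplementedError

    def smooth(self, y):
        kernel = self._coefficients()
        pad = len(kernel) // 2
        # reflect-pad so the output keeps the input length
        ypad = np.pad(np.asarray(y, dtype=float), pad, mode="reflect")
        return np.convolve(ypad, kernel, mode="valid")

class SavitzkyGolaySmoother(BaseFIRSmoother):
    def __init__(self, window_length=11, polyorder=3):
        self.window_length = window_length
        self.polyorder = polyorder

    def _coefficients(self):
        return savgol_coeffs(self.window_length, self.polyorder)
```

A ModifiedSincSmoother would then only override _coefficients with the sinc-based kernel from the publication.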

Implement maximum entropy deconvolution

πŸšΆβž‘οΈπŸƒ Proposed Enhancements

In some spectroscopic fields like UV/Vis- or MIR-spectroscopy (of liquid systems), peaks can show very strong overlap. This limits the usefulness of many spectroscopic analysis techniques, e.g., Multivariate Fitting with reference spectra.
Having the peaks more resolved while keeping the noise suppressed would be a nice additional pre-processing step. Actually, using derivative spectroscopy is only a workaround to achieve just this.
This is easily depicted, e.g., by looking at the derivatives of the MIR spectra of some enzymes for protein analysis, taken from

Baldassarre, et al., Simultaneous Fitting of Absorption Spectra and Their Second Derivatives for an Improved Analysis of Protein Infrared Spectra, Molecules 2015, 20(7), 12599-12622

(figure: absorption spectra and their derivatives from Baldassarre et al.)

The second- and fourth-order derivatives reveal the overlapped peaks, but that's not easily achieved in practice, where noise limits the usefulness of derivation.

However, in LΓ³renz-FonfriΓ‘ & PadrΓ³s, Maximum Entropy Deconvolution of Infrared Spectra: Use of a Novel Entropy Expression Without Sign Restriction, Applied Spectroscopy, 2005, Volume 59, Number 4, a quite powerful technique based on maximum entropy deconvolution was proposed. It can achieve such a peak resolution as well, but is more resistant to noise. All in all, it circumvents the smoothing that would be mandatory as a pre-processing step for taking derivatives.
From a resolution perspective, the results look promising (Figure A is sharpened to Figure C):

(figure: spectrum A sharpened to spectrum C by the deconvolution)

πŸ§‘β€πŸ’» Implementation details

For this approach to work, weights have to be provided (this could be achieved with the functionality added for #44 and #120).
The publication provides some implementation details on how to solve the underlying Nonlinear Optimization problem via a hand-crafted Conjugate Gradient method, but I think scipy.optimize.minimize offers more functionality to solve this in a graceful fashion:

  • the problem incorporates a penalty weight $\lambda$ just like the Whittaker smoother. However, adjusting it to meet the proposed reduced chi-squared criterion requires the evaluation of multiple $\lambda$-values, which will be way more expensive than for the Whittaker smoother, which has a straightforward linear solution. Reformulating the problem to maximize the entropy under the constraint that the reduced chi-squared is roughly 1 should be more adequate and require only a single (but longer) optimization run.
  • this would also allow formulating the Jacobian and Hessian of the system in a more straightforward way. Having this kind of gradient information is crucial for nonlinear optimization if we are aiming for speed. The Jacobian and the Hessian of the chi-squared constraint can then be formulated as sparse matrices/linear operators that define the convolution operations in a very efficient way. On the other hand, the Jacobian and Hessian of the entropy terms can then be computed stand-alone, rather than having to consider the weighted sum of squared residuals term.

This is already a quite deep dive 🀿 into optimization theory, and I hope I can visualise πŸ“Š it in a better way once the basic implementation is settled.
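As a toy illustration of the constrained formulation described above, everything here is an assumption: a plain Shannon-type entropy (not the paper's sign-unrestricted expression), a made-up Gaussian kernel, unit weights, and a generic SLSQP solve instead of a hand-crafted conjugate gradient:

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

rng = np.random.default_rng(0)
n = 40
x = np.linspace(-1.0, 1.0, n)
true = np.exp(-(x**2) / 0.01)                    # one sharp synthetic peak
kernel = np.exp(-(np.arange(-5, 6) ** 2) / 4.0)  # broadening kernel (assumed)
kernel /= kernel.sum()
sigma = 0.01
y = np.convolve(true, kernel, mode="same") + sigma * rng.standard_normal(n)

def neg_entropy(f):
    f = np.clip(f, 1e-12, None)
    return float(np.sum(f * np.log(f)))          # minimizing this maximizes entropy

def red_chi2(f):
    resid = (np.convolve(f, kernel, mode="same") - y) / sigma
    return float(np.mean(resid**2))

res = minimize(
    neg_entropy,
    x0=np.full(n, max(float(y.mean()), 1e-3)),
    method="SLSQP",
    bounds=[(1e-9, None)] * n,
    constraints=[NonlinearConstraint(red_chi2, 0.0, 1.0)],  # reduced chi-squared near 1
)
deconvolved = res.x
```

This replaces the scan over $\lambda$-values with a single constrained run, which is the reformulation suggested in the first bullet point.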

BaselineShift fails on 2d array with `dtype=int`

The BaselineShift class fails on a 2d array with int dtype. The issue lies in the implementation here, and it boils down to this:

x = np.array([[1, 2, 3]])    # dtype int
new_x = np.add(x, 0.5)       # dtype float
x[0] = new_x                 # the floats are cast back to int on assignment
x >>> np.array([[1, 2, 3]])  # the 0.5 baseline shift is silently truncated away

There are several fixes for this, including an easy one, which is to just cast the array being transformed to float. Alternatively, we can use numpy.apply_along_axis, which avoids the issue because it builds a fresh output array from the function's float results instead of assigning into the int input.

I don't know if this is a major issue, but could be useful regardless of its importance.

Update: I see that other augmentation classes have similar implementations and share the same issue. Should I make the changes and open a pull request? Let me know (with the solution of your choice).
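A minimal sketch of the easy fix (the function name is made up; the real class adds a randomized shift):

```python
import numpy as np

def add_baseline_shift(X, shift):
    # cast to float first so int input no longer truncates the shift
    X = np.asarray(X, dtype=np.float64)
    return X + shift
```

With this, add_baseline_shift([[1, 2, 3]], 0.5) keeps the fractional shift instead of silently dropping it.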
