
paucablop / chemotools

44.0 44.0 6.0 31.07 MB

Integrate your chemometric tools with the scikit-learn API πŸ§ͺ πŸ€–

Home Page: https://paucablop.github.io/chemotools/

License: MIT License

Python 99.98% Ruby 0.01% TeX 0.02%
artificial-intelligence autoencoders chemometrics deep-learning hacktoberfest ir-spectroscopy machine-learning multivariate-analysis nir-spectroscopy python raman-spectroscopy scikit-learn sklearn spectra spectroscopy

chemotools's Introduction

Hi there πŸ‘‹ I am Pau Cabaneros (paucablop)!

πŸ‘€ I love modelling biological 🧬 and chemical βš› systems to better understand how they can be used to help society 🌍

πŸ”­ I am currently working as a process analytical technology scientist and data scientist in the biotech industry πŸ‘¨β€πŸ”¬

β™₯ I enjoy open source projects and sharing knowledge through code πŸ“š

🌱 I am working to improve my devops and automation skills πŸ€Ήβ€β™€οΈ

ResearchGate Profile


chemotools's People

Contributors

dependabot[bot], paucablop


chemotools's Issues

SPC file format reader

Are there any plans to include an SPC file format reader? Or can you recommend a different Python OSS library that can be used with chemotools?

Improve compatibility with the `set_output` API from `scikit-learn`

Description

All the transformers from chemotools are compatible with scikit-learn; that is the objective of chemotools πŸ‘. One of the most recent releases of scikit-learn introduced the set_output API, which allows the user to select pandas as the output container. This produces a pandas.DataFrame object as output instead of the default numpy.ndarray. This works fine with most chemotools transformers, but there are some specific issues:

πŸ‘‰ The column names are lost after the transformation

When I use a chemotools transformer set up to produce a pandas.DataFrame, it does not keep the column names and produces an output without them. I have compared this with other scikit-learn transformers (such as StandardScaler()), and they do keep the column names in the output.

πŸ‘‰ The API does not work when the transformer reduces the number of features

Some transformers reduce the number of features of the dataset (e.g., they select a subset of its columns). These are the variable selection transformers. I don't really know how to fix this issue.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve compatibility with the set_output API.

How to Contribute

Here are the contributing guidelines.

Contact

[We can have the conversation in the Issue or the Discussion](#45)

Resources

πŸ‘‰ Link to the set_output API from scikit-learn

πŸ‘‰ Link to the problem description

🧰 Add initial `Makefile` version and tiny test CI pipeline

With a simple Makefile, commands like make test could be used to trigger all the tests with the correct specification of arguments, instead of a very long command like pytest --cov=chemotools .\tests -n=auto --cov-report html -x.
Another example would be make install-dev for python -m pip install --upgrade pip setuptools wheel -r .\requirements-dev.txt.

As an added bonus, that would enable the setup of a simple test CI pipeline that runs the tests whenever a pull request on main is opened or updated. Ideally, the CI pipeline would test multiple Python versions at the same time.
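A minimal sketch of what such a Makefile could look like, built from the commands above (treat the exact paths and flags as assumptions):

```makefile
.PHONY: install-dev test

install-dev:
	python -m pip install --upgrade pip setuptools wheel
	python -m pip install -r requirements-dev.txt

test:
	pytest --cov=chemotools tests -n=auto --cov-report html -x
```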

Improve `WhittakerSmooth`, `AirPLS`, and `ArPLS` performance - sparse matrix operations

Description

AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the csc_matrix representation from scipy.sparse to optimize performance, but further improvements are needed.

Improving Attempts

To improve the performance, I have tried just-in-time compilation of some key functions using numba. However, numba does not support the csc_matrix type, so I cannot JIT compile the code. To overcome this, I looked for a numba-compatible representation of sparse matrices, but could not find one. Therefore, I created my own, together with some functions for basic algebra operations on it (code in this Gist). Unfortunately, this did not improve the performance over the current implementation.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms.

Here are some ideas to work on:

  • Find a more efficient way to JIT compile the code using tools like numba.
  • Investigate parallel or distributed computing techniques to speed up the processing of multiple spectra.
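For orientation, the computational core shared by both algorithms is a weighted Whittaker system solved with sparse matrices. A minimal sketch of that core (uniform weights assumed; this is not the chemotools implementation):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, lam=1e4, d=2):
    """Solve (W + lam * D.T @ D) z = W y with sparse matrices throughout.

    W is the identity here (uniform weights); AirPLS/ArPLS iteratively
    reweight W, re-solving this same system each iteration.
    """
    n = y.size
    D = sparse.eye(n, format="csc")
    for _ in range(d):              # build the d-th order difference matrix
        D = D[1:] - D[:-1]
    W = sparse.eye(n, format="csc")
    A = sparse.csc_matrix(W + lam * (D.T @ D))
    return spsolve(A, W @ y)
```

Note that the system matrix is banded with bandwidth 2d + 1, so a banded Cholesky solver such as scipy.linalg.solveh_banded is a promising route to beat the general sparse solve.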

How to Contribute

Here are the contributing guidelines.

Contact

We can have the conversation in the Issue or the Discussion.

Resources

Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms:

  • Paper on AirPLS: Z.-M. Zhang, S. Chen, and Y.-Z. Liang, Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135 (5), 1138-1146 (2010).
  • Paper on ArPLS: S.-J. Baek, A. Park, Y.-J. Ahn, and J. Choo, Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst 140 (1), 250-257 (2015).

Harmonize naming convention with other libraries for better LLM integration

Some of the functions implemented in chemotools are similar to functions available in other libraries, for example the Savitzky-Golay filter, which is inherited from scipy. The argument names of the Savitzky-Golay method in chemotools should be the same as in scipy, so that LLMs have a better chance of making the right suggestions.
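For reference, these are scipy's canonical parameter names for the filter (the signal below is made up for illustration):

```python
import numpy as np
from scipy.signal import savgol_filter

y = np.sin(np.linspace(0, 2 * np.pi, 50))

# scipy's argument names: window_length, polyorder, deriv, delta, mode
z = savgol_filter(y, window_length=11, polyorder=3, deriv=0, mode="interp")
```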

ModuleNotFoundError when loading from chemotools.variable_selection import RangeCut

I wanted to try your Coffee Spectra Classifier exercise in google colab but when I run the line

from chemotools.variable_selection import RangeCut

I receive the following error.


ModuleNotFoundError Traceback (most recent call last)
in <cell line: 7>()
5 from chemotools.scatter import StandardNormalVariate
6
----> 7 from chemotools.variable_selection import RangeCut
8 pipeline = make_pipeline(
9 StandardNormalVariate(),

ModuleNotFoundError: No module named 'chemotools.variable_selection'


NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Can you please help?
Thank you

Replace check_array with ._validate_data()

Right now, the input to each method is checked by a custom function called check_input. This function uses the check_array() function from sklearn.utils. However, it is better practice to use the ._validate_data() method inherited from the BaseEstimator class. In addition, ._validate_data() will also set n_features_in_, so we do not have to do it explicitly, making the code more readable.
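A minimal sketch of the pattern (the transformer and its options are illustrative only; note that very recent scikit-learn releases supersede the private ._validate_data() with a public validate_data helper):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ExampleTransformer(TransformerMixin, BaseEstimator):
    """Illustrative only, not a chemotools class."""

    def fit(self, X, y=None):
        # _validate_data checks the array and sets n_features_in_ for us
        X = self._validate_data(X, ensure_2d=True, dtype=np.float64)
        return self

    def transform(self, X):
        # reset=False re-checks that the feature count matches fit time
        X = self._validate_data(X, reset=False, dtype=np.float64)
        return X
```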

Add docstrings

Add docstrings to the following methods:

Scatter

  • Multiplicative signal correction
  • Standard normal variate

Baseline

  • AirPls
  • ArPls
  • Constant baseline correction
  • Cubic spline correction
  • Linear correction
  • Non negativity
  • Polynomial correction
  • Subtract reference

Derivative

  • Norris-Williams
  • Savitzky-Golay

Scale

  • Index scaler
  • Min-max scaler
  • Norm scaler

Smooth

  • Mean filter
  • Median filter
  • Savitzky-Golay filter
  • Whittaker smooth

Variable selection

  • Range cut

Add `pybaseline` support

πŸšΆβž‘οΈπŸƒ Proposed Enhancement

We could add a very general-purpose estimator that integrates the pybaseline package for elaborate baseline correction. The package is great and offers a variety of baseline algorithms, among them erPLS for automated selection of the smoothing parameter.
However, it is purely function-based, and this is where chemotools comes in.

πŸ§‘β€πŸ’» Implementation Details

If we keep a general estimator like

PyBaselineCorrection(algorithm, algorithm_kwargs)

we could integrate the full package with just a single estimator.
We can make the algorithm_kwargs dictionary- and also dataclass-based, because the latter saves the user from having to read the documentation of two packages at the same time πŸ€” If we end up liking an algorithm a lot, we can consider promoting it to its own dedicated estimator.
It will be a lot of tedious copying of specifications, but if that enables a great feature, why not?
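A rough sketch of what the single estimator could look like. Everything here is an assumption: the class name comes from the proposal above, and the calls follow the class-based Baseline API of the pybaselines package:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PyBaselineCorrection(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: one estimator for all pybaselines algorithms."""

    def __init__(self, algorithm="asls", algorithm_kwargs=None):
        self.algorithm = algorithm
        self.algorithm_kwargs = algorithm_kwargs

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        from pybaselines import Baseline  # optional dependency, imported lazily

        fitter = Baseline()
        method = getattr(fitter, self.algorithm)
        kwargs = self.algorithm_kwargs or {}
        X = np.atleast_2d(np.asarray(X, dtype=float))
        corrected = np.empty_like(X)
        for i, row in enumerate(X):
            baseline, _ = method(row, **kwargs)  # each method returns (baseline, params)
            corrected[i] = row - baseline
        return corrected
```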

Enable automated window size determination of Savitzky-Golay and Mean filter (maybe also Median filter)

πŸšΆβž‘οΈπŸƒ Proposed Enhancement

Today, I answered this StackExchange question on the Savitzky-Golay filter, and thereby figured out that automated window selection for the Savitzky-Golay filter (and thus the mean filter, because it is a Savitzky-Golay filter with polynomial degree 0 😱) is easily possible via cross-validation.
After WhittakerSmooth gets its facelift (#120), which enables automated smoothing, I think this is a natural way to proceed. With a bit of trickery, the median filter can get this update as well, but the numerics might be a bit more involved.
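Because the Savitzky-Golay filter is a linear smoother, leave-one-out errors can be estimated from the central filter coefficient without refitting. A sketch of that idea (the helper names are made up, and edge effects are ignored):

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

def loocv_score(y, window_length, polyorder):
    # for a linear smoother, LOO residuals are (y - yhat) / (1 - h_ii);
    # here the leverage h_ii is the central filter coefficient
    h = savgol_coeffs(window_length, polyorder)[window_length // 2]
    yhat = savgol_filter(y, window_length, polyorder)
    return np.mean(((y - yhat) / (1.0 - h)) ** 2)

def select_window(y, windows, polyorder=2):
    """Pick the window with the lowest estimated leave-one-out error."""
    return min(windows, key=lambda w: loocv_score(y, w, polyorder))
```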

Implement successor of Savitzky-Golay: The Modified Sinc Smoother

πŸšΆβž‘οΈπŸƒ Proposed Enhancement

In 2023, some researchers put the Savitzky-Golay filter to the test and figured out that, despite its widespread use, it is actually far from being a good smoother.
Their publication Why and How Savitzky-Golay Filters Should Be Replaced introduces a new smoother called the "Modified Sinc Smoother", which addresses all the disadvantages of the Savitzky-Golay filter, first and foremost that it does many things except fully remove noise. The publication really goes deep into the topic and explains the reasons quite nicely.

Given this, an implementation of the Modified Sinc Smoother would be a nice addition to chemotools. It would be a light-weight smoother (not heavy like WhittakerSmooth) with excellent smoothing capabilities.
Fun fact: the Savitzky-Golay and Modified Sinc smoothers share the same parameters (window_length and polyorder; the publication argues that polyorder is the wrong naming convention and should be poly_degree), so the Modified Sinc Smoother could become a simple drop-in replacement for the Savitzky-Golay filter.

πŸ§‘β€πŸ’» Implementation aspects

Both the Savitzky-Golay and the Modified Sinc smoothers belong to the class of so-called Finite Impulse Response (FIR) filters.
Their main principle is to convolve the signal with a filter kernel; the only thing they differ in is the shape of the kernel, so both smoothers can share the same base class. Each one then only provides the computation of its filter coefficients, which keeps the transformer classes lean while reusing a lot of logic that only has to be tested once.
All that's required is a common interface where the individual filters hand over their respective filter coefficients, as depicted below.

(diagram: FIR filters sharing a common base class via their filter coefficients)
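The shared-base idea could look roughly like this (the class names are hypothetical, not an agreed design):

```python
import numpy as np
from scipy.signal import savgol_coeffs

class BaseFIRSmoother:
    """Common convolution logic; subclasses only supply their kernel."""

    def _coefficients(self):
        raise NotImplementedError

    def smooth(self, y):
        kernel = self._coefficients()
        pad = len(kernel) // 2
        # reflect-pad so the output keeps the input length
        ypad = np.pad(np.asarray(y, dtype=float), pad, mode="reflect")
        return np.convolve(ypad, kernel, mode="valid")

class SavitzkyGolaySmoother(BaseFIRSmoother):
    def __init__(self, window_length=11, polyorder=3):
        self.window_length = window_length
        self.polyorder = polyorder

    def _coefficients(self):
        return savgol_coeffs(self.window_length, self.polyorder)
```

A ModifiedSincSmoother would then only override _coefficients with the sinc-based kernel from the publication.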

Implement maximum entropy deconvolution

πŸšΆβž‘οΈπŸƒ Proposed Enhancements

In some spectroscopic fields like UV/Vis- or MIR-spectroscopy (of liquid systems), peaks can show very strong overlap. This limits the usefulness of many spectroscopic analysis techniques, e.g., Multivariate Fitting with reference spectra.
Having the peaks more resolved while keeping the noise suppressed would be a nice additional pre-processing step. Actually, using derivative spectroscopy is only a workaround to achieve just this.
This is easily depicted, e.g., by looking at the derivatives of the MIR spectra of some enzymes for protein analysis, taken from

Baldassarre, et al., Simultaneous Fitting of Absorption Spectra and Their Second Derivatives for an Improved Analysis of Protein Infrared Spectra, Molecules 2015, 20(7), 12599-12622

(figure: absorption spectra and their derivatives from Baldassarre et al.)

The second- and fourth-order derivatives reveal the overlapped peaks, but that's not easily achieved in practice, where noise limits the usefulness of derivation.

However, in LΓ³renz-FonfriΓ‘ & PadrΓ³s, Maximum Entropy Deconvolution of Infrared Spectra: Use of a Novel Entropy Expression Without Sign Restriction, Applied Spectroscopy, 2005, Volume 59, Number 4, a quite powerful technique based on maximum entropy deconvolution was proposed. It can achieve such a peak resolution as well, but is more resistant to noise. All in all, it circumvents the smoothing that would be mandatory as a pre-processing step for taking derivatives.
From a resolution perspective, the results look promising (Figure A is sharpened to Figure C):

(figure: spectrum A sharpened to spectrum C by the deconvolution)

πŸ§‘β€πŸ’» Implementation details

For this approach to work, weights have to be provided (this could be achieved with the functionality added for #44 and #120).
The publication provides some implementation details on how to solve the underlying Nonlinear Optimization problem via a hand-crafted Conjugate Gradient method, but I think scipy.optimize.minimize offers more functionality to solve this in a graceful fashion:

  • the problem incorporates a penalty weight $\lambda$ just like the Whittaker smoother. However, adjusting it to meet the proposed reduced chi-squared criterion requires the evaluation of multiple $\lambda$-values, which will be way more expensive than for the Whittaker smoother, which has a straightforward linear solution. Reformulating the problem to maximize the entropy under the constraint that the reduced chi-squared is roughly 1 should be more adequate and require only a single (but longer) optimization run.
  • this would also allow formulating the Jacobian and Hessian of the system in a more straightforward way. Having this kind of gradient information is crucial for nonlinear optimization if we are aiming for speed. The Jacobian and the Hessian of the chi-squared constraint can then be formulated as sparse matrices/linear operators that define the convolution operations in a very efficient way. On the other hand, the Jacobian and Hessian of the entropy terms can then be computed stand-alone, rather than having to consider the weighted sum of squared residuals term.

This is already a quite deep dive 🀿 into optimization theory, and I hope I can visualise πŸ“Š it in a better way once the basic implementation is settled.
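As a toy illustration of the constrained formulation described above, everything here is an assumption: a plain Shannon-type entropy (not the paper's sign-unrestricted expression), a made-up Gaussian kernel, unit weights, and a generic SLSQP solve instead of a hand-crafted conjugate gradient:

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

rng = np.random.default_rng(0)
n = 40
x = np.linspace(-1.0, 1.0, n)
true = np.exp(-(x**2) / 0.01)                    # one sharp synthetic peak
kernel = np.exp(-(np.arange(-5, 6) ** 2) / 4.0)  # broadening kernel (assumed)
kernel /= kernel.sum()
sigma = 0.01
y = np.convolve(true, kernel, mode="same") + sigma * rng.standard_normal(n)

def neg_entropy(f):
    f = np.clip(f, 1e-12, None)
    return float(np.sum(f * np.log(f)))          # minimizing this maximizes entropy

def red_chi2(f):
    resid = (np.convolve(f, kernel, mode="same") - y) / sigma
    return float(np.mean(resid**2))

res = minimize(
    neg_entropy,
    x0=np.full(n, max(float(y.mean()), 1e-3)),
    method="SLSQP",
    bounds=[(1e-9, None)] * n,
    constraints=[NonlinearConstraint(red_chi2, 0.0, 1.0)],  # reduced chi-squared near 1
)
deconvolved = res.x
```

This replaces the scan over $\lambda$-values with a single constrained run, which is the reformulation suggested in the first bullet point.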

BaselineShift fails on 2d array with `dtype=int`

The BaselineShift class fails on a 2d array with int dtype. The issue lies in the implementation here, and it boils down to this:

x = np.array([[1, 2, 3]])    # dtype int
new_x = np.add(x, 0.5)       # dtype float
x[0] = new_x                 # the floats are cast back to int on assignment
x >>> np.array([[1, 2, 3]])  # the 0.5 baseline shift is silently truncated away

There are several fixes for this, including an easy one, which is to just cast the array being transformed to float. Alternatively, we can use numpy.apply_along_axis, which avoids the issue because it builds a fresh output array from the function's float results instead of assigning into the int input.

I don't know if this is a major issue, but could be useful regardless of its importance.

Update: I see that other augmentation classes have similar implementations and share the same issue. Should I make the changes and open a pull request? Let me know (with the solution of your choice).
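A minimal sketch of the easy fix (the function name is made up; the real class adds a randomized shift):

```python
import numpy as np

def add_baseline_shift(X, shift):
    # cast to float first so int input no longer truncates the shift
    X = np.asarray(X, dtype=np.float64)
    return X + shift
```

With this, add_baseline_shift([[1, 2, 3]], 0.5) keeps the fractional shift instead of silently dropping it.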
