chemotools discussions from paucablop

chemotools Issues

Extended Multiplicative Scatter Correction

Harmonize naming convention with other libraries for better LLM integration

Some of the functions implemented in chemotools are similar to functions available in other libraries. For example, the Savitzky-Golay filter is inherited from scipy. The argument names of the Savitzky-Golay method in chemotools should match those in scipy, so that LLMs have a better chance of making the right suggestions.
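As a rough illustration of what the harmonization could look like, the sketch below wraps `scipy.signal.savgol_filter` and reuses its argument names (`window_length`, `polyorder`, `deriv`, `mode`) unchanged. The class name `HypotheticalSavitzkyGolay` is invented for this example and is not the chemotools class; only the scipy call is real.

```python
import numpy as np
from scipy.signal import savgol_filter


class HypotheticalSavitzkyGolay:
    """Illustrative wrapper (not the chemotools class) that mirrors scipy's names."""

    def __init__(self, window_length=5, polyorder=2, deriv=0, mode="interp"):
        # Same names and meanings as in scipy.signal.savgol_filter.
        self.window_length = window_length
        self.polyorder = polyorder
        self.deriv = deriv
        self.mode = mode

    def transform(self, X):
        # Rows are spectra, so the filter runs along axis=1.
        X = np.asarray(X, dtype=float)
        return savgol_filter(
            X,
            window_length=self.window_length,
            polyorder=self.polyorder,
            deriv=self.deriv,
            axis=1,
            mode=self.mode,
        )


# A quadratic signal is reproduced by a second-order polynomial fit.
X = (np.arange(20.0) ** 2).reshape(1, -1)
smoothed = HypotheticalSavitzkyGolay(window_length=5, polyorder=2).transform(X)
```

With identical names, a call written for scipy can be moved to the transformer by copying the keyword arguments as they are.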

Linear correction

L-Normalize

Add data augmentation module

Knowledge based transformation for data augmentation

Add logger to the different functions

SelectByIndices

Deploy package to conda

Savitzky Golay Filter

Add robust normal variate

Add subtraction of reference spectrum

Standard Normal Variate

Improve `WhittakerSmooth`, `AirPLS`, and `ArPLS` performance - sparse matrix operations

Description

AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the csc_matrix representation from scipy.sparse to optimize performance, but further improvements are needed.
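For context, both algorithms repeatedly solve a weighted penalized least squares system and update the weights between iterations. The sketch below shows only that inner solve with `scipy.sparse`, assuming a second-order difference penalty; the function name `weighted_whittaker_solve` and the parameter `lam` are illustrative and not taken from the chemotools code.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve


def weighted_whittaker_solve(y, lam=100.0, weights=None):
    """Solve (W + lam * D'D) z = W y for one spectrum (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    if weights is None:
        weights = np.ones(n)
    # Second-order difference operator D with shape (n - 2, n).
    D = diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n), format="csc")
    # Diagonal weight matrix W.
    W = diags(weights, 0, format="csc")
    # The system matrix is banded, so the sparse solve stays cheap.
    A = (W + lam * (D.T @ D)).tocsc()
    return spsolve(A, weights * y)


# A straight line with one spike; a zero weight makes the solver ignore the spike.
line = np.linspace(0.0, 1.0, 50)
spiked = line.copy()
spiked[25] += 5.0
w = np.ones(50)
w[25] = 0.0
baseline = weighted_whittaker_solve(spiked, lam=100.0, weights=w)
```

AirPLS and ArPLS differ mainly in how they update `weights` after each solve, so this solve is the part that dominates the run time.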

Improvement Attempts

To improve performance, I tried just-in-time compilation of some key functions using numba. However, numba does not support the csc_matrix type, so I cannot JIT compile the code. To overcome this, I looked for a numba-compatible representation of sparse matrices but could not find one. I therefore created my own, together with some functions for basic algebra operations on it (code in a Gist). Unfortunately, this did not improve performance over the current implementation.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms.

Here are some ideas to work on:

Find a more efficient way to JIT compile the code using tools like numba.
Investigate parallel or distributed computing techniques to speed up the processing of multiple spectra.

How to Contribute

Here are the contributing guidelines.

Contact

We can have the conversation in the Issue or the Discussion.

Resources

Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms:

Paper on AirPLS: Z.-M. Zhang, S. Chen, and Y.-Z. Liang, Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135 (5), 1138-1146 (2010).
Paper on ArPLS: S.-J. Baek, A. Park, Y.-J. Ahn, and J. Choo, Baseline correction using asymmetrically reweighted penalized least squares smoothing.

Remove tuple in input of baseline correction instances

Improve compatibility with the `set_output` API from `scikit-learn`

Description

All the transformers in chemotools are compatible with scikit-learn; that is the objective of chemotools 👍. A recent scikit-learn release introduced the set_output API, which lets the user request pandas output. A transformer then returns a pandas.DataFrame instead of the default numpy.ndarray. This works with most chemotools transformers, but I have run into two specific issues:

👉 The column names are lost after the transformation

When I use a chemotools transformer set up to produce a pandas.DataFrame, it does not keep the column names, and the output has no column names. I have compared this with other scikit-learn transformers (such as StandardScaler()), and they do keep the column names in the output.

👉 The API does not work when the transformer reduces the number of features

Some transformers will reduce the number of features on our dataset (e.g., will select a subset of columns from it). These are under the variable selection transformers. I don't really know how to fix this issue.

Hacktoberfest Challenge

We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve compatibility with the set_output API.

How to Contribute

Here are the contributing guidelines.

Contact

[We can have the conversation in the Issue or the Discussion](#45)

Resources

👉 Link to set_output API from scikit-learn

👉 Link to problem description

Normalize by min/max

Add docstrings

Add docstrings to the following methods:

Scatter

Multiplicative signal correction
Standard normal variate

Baseline

Derivative

Norris Williams
Savitzky Golay

Scale

Index scaler
Min max scaler
Norm scaler

Smooth

Mean filter
Median filter
Savitzky golay filter
Whittaker smooth

Variable selection

Range cut

ModuleNotFoundError when loading from chemotools.variable_selection import RangeCut

I wanted to try your Coffee Spectra Classifier exercise in Google Colab, but when I run the line

from chemotools.variable_selection import RangeCut

I receive the following error.

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 7>()
5 from chemotools.scatter import StandardNormalVariate
6
----> 7 from chemotools.variable_selection import RangeCut
8 pipeline = make_pipeline(
9 StandardNormalVariate(),

ModuleNotFoundError: No module named 'chemotools.variable_selection'

NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.

Can you please help?
Thank you

Savitzky Golay

Migrate to pyproject.toml

Migrate project packaging configuration from setup.py to pyproject.toml

Range Cut

Enable set output API

Factor sklearn checks out of each function

Try to move all the sklearn checks out of the main body of each function to avoid repeated code 😎😎

Substitute check_array by ._validate_data()

Right now, the input to each method is checked by a custom function called check_input, which uses the check_array() function from sklearn.utils. However, it is better practice to use the ._validate_data() method inherited from the BaseEstimator class. In addition, ._validate_data() also sets n_features_in_, so we do not have to do it explicitly, which makes the code more readable.

Mean filter

Multiplicative Scatter Correction

Range Cut by Wavenumbers

Use of TensorLy for N-way data?

Would anyone be interested in adapting functions from TensorLy with a more chemometrics-esque API?

Norris Williams

Splines baseline correction

Improve feature selection integration with sklearn API

Inherit SelectorMixin class from sklearn
Rewrite current selection methods
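The two steps above could look roughly like the sketch below, which assumes a selector defined by column indices. `HypotheticalIndexSelector` is an invented name; `SelectorMixin` and its required hook `_get_support_mask` come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin
from sklearn.utils.validation import check_array, check_is_fitted


class HypotheticalIndexSelector(SelectorMixin, BaseEstimator):
    """Illustrative selector that keeps the columns listed in `indices`."""

    def __init__(self, indices=None):
        self.indices = indices

    def fit(self, X, y=None):
        # Remember the column names when the input is a DataFrame.
        if hasattr(X, "columns"):
            self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        # Boolean mask with True for every selected column.
        mask = np.zeros(self.n_features_in_, dtype=bool)
        if self.indices is None:
            mask[:] = True
        else:
            mask[np.asarray(self.indices, dtype=int)] = True
        self.support_mask_ = mask
        return self

    def _get_support_mask(self):
        # SelectorMixin builds transform, get_support and
        # get_feature_names_out on top of this mask.
        check_is_fitted(self, "support_mask_")
        return self.support_mask_


X = np.arange(12.0).reshape(3, 4)
selector = HypotheticalIndexSelector(indices=[0, 2]).fit(X)
X_selected = selector.transform(X)
```

Because `SelectorMixin` supplies `get_feature_names_out` from the same mask, the selector only has to define the mask itself.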

Improve speed in ArPls()

Improve performance of Whittaker

Use sparse algebra to improve the performance of the filter
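One way sparse algebra can help, assuming the smoother uses unit weights and a second-order difference penalty, is to factorize the system matrix once and reuse the factorization for every spectrum. The sketch below does this with `scipy.sparse.linalg.splu`; the function name `whittaker_smooth_batch` and the parameter `lam` are illustrative, not the chemotools API.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import splu


def whittaker_smooth_batch(spectra, lam=100.0):
    """Smooth every row of `spectra` with one shared sparse LU factorization."""
    spectra = np.asarray(spectra, dtype=float)
    n = spectra.shape[1]
    # Second-order difference operator D with shape (n - 2, n).
    D = diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    # With unit weights the matrix I + lam * D'D is the same for all spectra.
    A = (identity(n) + lam * (D.T @ D)).tocsc()
    lu = splu(A)
    # Solve for all spectra at once: each column is one right-hand side.
    return lu.solve(np.asfortranarray(spectra.T)).T


spectra = np.vstack([np.linspace(0.0, 1.0, 40), np.linspace(2.0, 5.0, 40)])
smoothed = whittaker_smooth_batch(spectra, lam=100.0)
```

This reuse does not carry over directly to AirPLS or ArPLS, where the weights, and therefore the matrix, change per spectrum and per iteration.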

BaselineShift fails on 2d array with `dtype=int`

The BaselineShift class fails on a 2D array with an int dtype. The problem lies in the implementation here and can be reproduced like this:

import numpy as np

x = np.array([[1, 2, 3]])  # dtype int
new_x = np.add(x, 0.5)     # dtype float
x[0] = new_x[0]            # values are cast back to int on assignment
print(x)                   # still [[1 2 3]]: the shift is silently lost

There are several fixes, including an easy one: cast the array being transformed to float. Alternatively, we can use numpy.apply_along_axis, which avoids the issue by first creating an array of zeros (float dtype by default) and then filling in the results.
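Both options can be sketched as follows. The helper names `shift_rows_cast` and `shift_rows_apply` are invented for this example and do not appear in chemotools; they only show the dtype handling.

```python
import numpy as np


def shift_rows_cast(x, shift):
    """Option 1: work on a float copy, then assign row by row."""
    out = np.array(x, dtype=float)  # float copy, so assignment keeps decimals
    for i in range(out.shape[0]):
        out[i] = np.add(out[i], shift)
    return out


def shift_rows_apply(x, shift):
    """Option 2: let numpy build the output array from the row results."""
    return np.apply_along_axis(lambda row: row + shift, 1, x)


x = np.array([[1, 2, 3]])  # dtype int, as in the report
shifted_cast = shift_rows_cast(x, 0.5)
shifted_apply = shift_rows_apply(x, 0.5)
```

Neither helper modifies the integer input array, so the caller's data stays unchanged.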

I don't know if this is a major issue, but could be useful regardless of its importance.

Update: I see that other augmentation classes have a similar implementation and the same issue. Should I make the changes and open a pull request? Let me know (with the solution of your choice).

Include variable importance in projection (only works with PLS-like models)
Include selectivity ratio (only works with PLS-like models)
