paucablop / chemotools
Integrate your chemometric tools with the scikit-learn API
Home Page: https://paucablop.github.io/chemotools/
License: MIT License
Some of the functions implemented in chemotools are similar to functions available in other libraries. For example, the Savitzky-Golay filter is inherited from scipy. The arguments of the Savitzky-Golay method in chemotools should be named and passed the same way as in scipy, so that LLMs have a better chance of making the right suggestions.
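For reference, this is a minimal sketch of the scipy call whose argument names the issue proposes to mirror. The synthetic spectra and the chosen parameter values are assumptions for illustration only; the sketch does not show the chemotools class itself.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic data (assumption): 4 noisy spectra with 50 points each.
rng = np.random.default_rng(0)
spectra = np.sin(np.linspace(0, 3, 50)) + 0.05 * rng.standard_normal((4, 50))

# scipy's own argument names: window_length, polyorder, deriv, axis, mode.
smoothed = savgol_filter(
    spectra,
    window_length=7,   # number of points in the filter window
    polyorder=2,       # order of the fitted polynomial (must be < window_length)
    deriv=0,           # 0 = smoothing only, 1 = first derivative, ...
    axis=1,            # filter along the wavelength axis of each spectrum
    mode="nearest",    # how the edges are padded
)
```

A chemotools transformer that accepts the same names (window_length, polyorder, deriv, mode) could forward them to scipy unchanged.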
Knowledge based transformation for data augmentation
AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) and ArPLS (Asymmetrically Reweighted Penalized Least Squares) are powerful algorithms for removing complex non-linear baselines from spectral signals. However, their computational cost can be significant, especially when processing large numbers of spectra. Currently, we use the csc_matrix representation from scipy.sparse to optimize performance, but further improvements are needed.
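To make the cost concrete, here is a minimal sketch of the weighted penalized least squares solve that both algorithms repeat in every reweighting iteration, written with scipy.sparse. The function name weighted_whittaker and the default lam value are assumptions for illustration; this is not the chemotools implementation.

```python
import numpy as np
from scipy.sparse import csc_matrix, diags
from scipy.sparse.linalg import spsolve


def weighted_whittaker(y, weights, lam=1e4):
    """Hypothetical helper: solve (W + lam * D'D) z = W y for the baseline z."""
    n = y.size
    # Second-order difference operator D with shape (n - 2, n).
    D = diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n), format="csc")
    # Diagonal weight matrix W.
    W = diags(weights, 0, shape=(n, n), format="csc")
    # The system matrix is banded, so a sparse solver handles it efficiently.
    A = csc_matrix(W + lam * (D.T @ D))
    return spsolve(A, weights * y)


# A straight line has zero second differences, so it is returned unchanged.
y = np.linspace(1.0, 2.0, 100)
z = weighted_whittaker(y, np.ones_like(y))
```

AirPLS and ArPLS differ in how they update the weights between solves, so any speed-up of this step benefits both.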
To improve the performance, I have tried just-in-time compilation of some key functions using numba. However, numba does not support the csc_matrix type, so I cannot JIT compile the code. To overcome this issue, I looked for a numba-compatible representation of sparse matrices, but could not find one. Therefore, I created my own, together with some functions for basic algebra operations on it (the code is in a Gist). Unfortunately, this did not improve the performance over the current implementation.
We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve the performance of both algorithms.
Here are some ideas to work on:
numba.
Here are the contributing guidelines.
We can have the conversation in the Issue or the Discussion.
Here are some relevant resources and references for understanding the theory and implementation of the AirPLS and ArPLS algorithms:
All the transformers from chemotools are compatible with scikit-learn; that is the objective of chemotools. One of the most recent releases of scikit-learn introduced the set_output API, which allows the user to request pandas output. A transformer configured this way produces a pandas.DataFrame object instead of the default numpy.ndarray. This works fine with most chemotools transformers, but I have some specific issues:
When I use a chemotools transformer set up to produce a pandas.DataFrame, it does not keep the column names and produces an output without column names. I have compared this with other scikit-learn transformers (such as StandardScaler()), and I have seen that they do keep the column names in the output.
Some transformers reduce the number of features in the dataset (e.g., they select a subset of columns from it). These are the variable selection transformers. I don't really know how to fix this issue.
We invite open source developers to contribute to our project during Hacktoberfest. The goal is to improve compatibility with the set_output API.
Here are the contributing guidelines.
We can have the conversation in the Issue or the Discussion (#45).
Add docstrings to the following methods:
Scatter
Baseline
Derivative
Scale
Smooth
Variable selection
I wanted to try your Coffee Spectra Classifier exercise in Google Colab, but when I run the line
from chemotools.variable_selection import RangeCut
I receive the following error.
ModuleNotFoundError Traceback (most recent call last)
in <cell line: 7>()
5 from chemotools.scatter import StandardNormalVariate
6
----> 7 from chemotools.variable_selection import RangeCut
8 pipeline = make_pipeline(
9 StandardNormalVariate(),
ModuleNotFoundError: No module named 'chemotools.variable_selection'
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
Can you please help?
Thank you
Migrate project packaging configuration from setup.py to pyproject.toml
Try to place all the sklearn checks out of the main body of each function to avoid too much repeated code
Right now, the input of each method is checked by a custom function called check_input. This function uses the check_array() function from sklearn.utils. However, it is better practice to use the ._validate_data() method inherited from the BaseEstimator class. In addition, ._validate_data() also sets n_features_in_, so we do not have to do it explicitly, which makes the code more readable.
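A minimal sketch of this idea follows, with one caveat that is an assumption about scikit-learn versions rather than something stated in the issue: newer releases expose the same validation as the function sklearn.utils.validation.validate_data, so the sketch tries that import first and falls back to the inherited method. MeanCenterSketch is a hypothetical transformer used only for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

try:  # newer scikit-learn releases provide validation as a function
    from sklearn.utils.validation import validate_data
except ImportError:  # older releases only have the inherited private method
    validate_data = None


class MeanCenterSketch(TransformerMixin, BaseEstimator):
    """Hypothetical transformer, used only to illustrate input validation."""

    def _check(self, X, reset):
        # A single place for all input checks, shared by fit and transform.
        if validate_data is not None:
            return validate_data(self, X, reset=reset, dtype=np.float64)
        return self._validate_data(X, reset=reset, dtype=np.float64)

    def fit(self, X, y=None):
        X = self._check(X, reset=True)   # also sets n_features_in_
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        X = self._check(X, reset=False)  # raises if the feature count differs
        return X - self.mean_


model = MeanCenterSketch().fit(np.arange(12).reshape(3, 4))
```

After fit, model.n_features_in_ is set without any explicit assignment in the class, and calling transform with a different number of columns raises a ValueError.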
Would anyone be interested in adapting functions from TensorLy with a more chemometrics-esque API?
Use sparse algebra to improve the performance of the filter
The BaselineShift class fails on an int dtype 2D array. The issue lies in the implementation here. The issue is this:
import numpy as np
x = np.array([[1, 2, 3]])   # dtype int
new_x = np.add(x, 0.5)      # dtype float: [[1.5, 2.5, 3.5]]
x[0] = new_x[0]             # assigning into the int array casts the floats back to int
print(x)                    # [[1 2 3]]: the shift is silently lost
There are several fixes for this, including an easy one, which is to cast the array being transformed to float. Alternatively, we can use numpy.apply_along_axis, which avoids the issue because it first creates a new output array (float in this case) and then writes the results into it.
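Both fixes can be sketched as follows. The helper names shift_with_cast and shift_with_apply are hypothetical, and a constant shift stands in for whatever value the augmentation class actually draws.

```python
import numpy as np


def shift_with_cast(X, shift):
    """Hypothetical helper: copy X as float first, then assign row by row."""
    X_out = np.array(X, dtype=float)  # np.array copies, so X itself is untouched
    for i in range(X_out.shape[0]):
        X_out[i] = X_out[i] + shift
    return X_out


def shift_with_apply(X, shift):
    """Hypothetical helper: let numpy allocate the output array."""
    return np.apply_along_axis(lambda row: row + shift, 1, X)


x = np.array([[1, 2, 3]])  # dtype int, as in the failing case
a = shift_with_cast(x, 0.5)
b = shift_with_apply(x, 0.5)
```

Both a and b hold the shifted float values, and the original int array x is left unchanged.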
I don't know if this is a major issue, but could be useful regardless of its importance.
Update: I see that other classes of augmentation have similar implementation and have the same issue. Should I make changes and do a pull request? Let me know (with the solution of your choice)
plotly plots are cooler to display
Add Weighted Least Squares preprocessing method for baseline correction:
https://pubs.rsc.org/en/content/articlehtml/2015/an/c4an01061b
Are there any plans to include an SPC file format reader? Or can you recommend a different Python OSS library that can be used with chemotools?