Comments (9)
Hi @paucablop. Well, I was working with NMR data and forgot to change the dtype to float (the intensities are integers).
from chemotools.
Hi @acmoudleysa, awesome, thanks a lot for the help ✌️
I have assigned you to the task and opened a branch associated with this issue for the development (87-baselineshift-fails-on-2d-array-with-dtype=int). 🤩
Scope
Following our previous discussion, we should update both the `.fit()` and `.transform()` methods for all the functions in the project.
Testing
After you have implemented the changes, we should also add the corresponding tests in the test folder. Since most of the testing is handled by the `scikit-learn` testing checks for each function, I think it is enough if we only write a single test for one function (of your choice) proving that when an `int` array is given as input, we obtain a float (`np.float64`) array after applying the `.transform()` method.
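A minimal sketch of such a test, assuming only NumPy: `ToyTransformer` is a hypothetical stand-in, since the actual chemotools class is the contributor's choice; a real test would import one of the project's transformers instead.

```python
import numpy as np

# Hypothetical stand-in for a chemotools transformer; a real test would
# import e.g. one of the actual preprocessing classes from chemotools.
class ToyTransformer:
    def fit(self, X):
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # the proposed validation step: coerce int input to float first
        X_ = np.asarray(X, dtype=np.float64).copy()
        return X_ - X_.mean(axis=1, keepdims=True)

def test_transform_returns_float_for_int_input():
    X = np.array([[1, 2, 3], [4, 5, 6]])  # int input, as in the bug report
    out = ToyTransformer().fit(X).transform(X)
    assert np.issubdtype(out.dtype, np.floating)
```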
If you have any questions or doubts, please do not hesitate to reach out! 😄
Once again, thank you very much for your help! 💪
Although I should mention that spectroscopic data are usually not of dtype `int`.
The other classes, like `SNV`, face a similar problem with dtype. If you agree, we can just edit `check_input.py` and convert the input to `float` there. This way we won't have to make changes everywhere.
Hey @acmoudleysa! 🌟 Thanks a lot for bringing this up, and my sincere apologies for the delay in getting back to you! I'm thrilled to dive into this and explore some solutions together. 😊 Let me do a bit of digging, and I'll come back to you with some ideas. Once we've got a plan in place, we can jump right into implementation! 🚀
Out of curiosity, what led you to stumble upon spectra with int? 💪
Thanks for flagging this and looking forward to our next steps together! 🎉
Thank you! After some investigation, I've come to the conclusion that the most effective approach would be to validate the data during the fit and transform steps of each method. I'm in the process of transitioning away from the `check_input.py` functions and instead incorporating `X = self._validate_data(X)` into our workflow. My suggestion is to handle this validation similarly to how it's done in other `scikit-learn` preprocessing methods:

For the `fit()` method:

    X = self._validate_data(X, dtype=FLOAT_DTYPES)

For the `transform()` method, we could substitute the current:

    X = check_input(X)
    X_ = X.copy()

with something like this:

    X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True)

This adjustment should be applied to both the fit and transform methods across all preprocessing functions in `chemotools`. Let me know if you're interested in contributing to this task, and I'll happily assign it to you so we can start discussing the implementation and testing!
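A rough sketch of that pattern, using the public `check_array` plus `FLOAT_DTYPES` instead of the private `_validate_data` so it runs on any scikit-learn version; the dtype handling is equivalent, and `DemoTransformer` is just an illustrative name, not a chemotools class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import FLOAT_DTYPES, check_array

class DemoTransformer(BaseEstimator, TransformerMixin):
    """Illustrative transformer showing the proposed validation calls."""

    def fit(self, X, y=None):
        # fit: validate and coerce int input to a float dtype
        X = check_array(X, dtype=FLOAT_DTYPES)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # transform: same validation, plus copy=True so the caller's
        # array is never modified in place
        X_ = check_array(X, dtype=FLOAT_DTYPES, copy=True)
        return X_ + 1.0  # stand-in for the actual preprocessing
```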
Again, many thanks for your help with this 😊
Hi @paucablop. Yes, you can assign the task to me. I am happy to contribute.
Hey @paucablop! I was going through the source code of the `_validate_data` method of the `BaseEstimator` class, and I realized that on this line, `_check_n_features(X, reset=reset)` already checks the number of features during `transform`, which means we might not need these lines in our `transform` methods (I will check whether all of the preprocessing classes have this line):

    # Check that the number of features is the same as the fitted data
    if X_.shape[1] != self.n_features_in_:
        raise ValueError(f"Expected {self.n_features_in_} features but got {X_.shape[1]}")

The trick is to call `self._validate_data(..., reset=True)` inside the `fit` method, which sets `n_features_in_`, and `self._validate_data(..., reset=False)` inside `transform`, which makes sure the number of features matches (Here)
Hi,

> The trick is to call `self._validate_data(..., reset=True)` inside the `fit` method, which sets `n_features_in_`, and `self._validate_data(..., reset=False)` inside `transform`, which makes sure the number of features matches (Here)
I'm currently working on #44, and there I already replaced it. I used `BaseEstimator._validate_data(..., reset=True)` to be more explicit about which of the parent classes is used, because in the updated implementations the Whittaker smoother and the two PLS-baseline algorithms inherit from 4 classes (see this example).
@paucablop @acmoudleysa Is there anything standing against a dtype conversion that relies on a private class variable? For example, for the Whittaker smoother and the two PLS-baseline algorithms, it is numerically unstable to go for, e.g., `np.float32` for large signals, so the conversion has to go for `np.float64`. For some other estimators, it could make sense to go for `np.float32` for performance, because spectra usually don't have more significant digits (if at all). A class variable could then be used to solve this, like:
    class Estimator(...):
        """
        Estimates something
        """

        __dtype_work: Type = np.float64  # this is a class variable

        def __init__(self, ...):
            ...

        def fit(self, X, ...):
            ...
            if X.dtype == self.__dtype_work:
                X = BaseEstimator._validate_data(self, X, reset=True)
            else:
                # X = X.astype(self.__dtype_work) might be required for some
                # estimators if X is used further in fit
                X = BaseEstimator._validate_data(self, X.astype(self.__dtype_work), reset=True)
            ...
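As a runnable, scikit-learn-free sketch of that class-variable idea (the `_dtype_work` attribute and both class names below are illustrative, not proposed API):

```python
import numpy as np

class DemoSmoother:
    # per-class working dtype; a float32 variant just overrides this
    _dtype_work = np.float64

    def fit(self, X):
        if X.dtype != self._dtype_work:
            # convert once, up front, instead of in every downstream step
            X = X.astype(self._dtype_work)
        self.X_ = X
        return self

class Float32Smoother(DemoSmoother):
    _dtype_work = np.float32  # faster, enough digits for typical spectra

X_int = np.arange(6, dtype=np.int64).reshape(2, 3)
```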
@paucablop @acmoudleysa Regarding the tests, do you think it would make sense to just test all estimators with `pytest.mark.parametrize` and specify the estimator involved and the arrays of integers to fit and transform? If there is only a test for a single estimator, this feels a bit like going against the idea behind automated tests. A test could be as simple as:

    # the following spectra could also be fixtures
    SPECTRUM_1D_00 = ...  # some 1D spectrum of integers
    SPECTRUM_1D_01 = ...  # another 1D spectrum of integers
    SPECTRUM_2D_00 = ...  # some 2D spectrum of integers
    SPECTRUM_2D_01 = ...  # another 2D spectrum of integers

    @pytest.mark.parametrize(
        "combination",
        [
            (WhittakerSmooth, SPECTRUM_1D_00),  # could also hold an expected output, but the test's purpose is not correctness
            (WhittakerSmooth, SPECTRUM_2D_00),
            ...
            (NorrisWilliams, SPECTRUM_1D_01),
            (NorrisWilliams, SPECTRUM_2D_01),
            ...
        ],
    )
    def test_fit_transform_with_int(combination):
        estimator, X = combination
        estimator().fit_transform(X)  # if this call runs without raising, the test passes
Just some ideas. Maybe you have some feedback or further suggestions 🙃