Comments (9)
Hi @paucablop. Well, I was working with NMR data and forgot to change the dtype to float (the intensities are integers).
from chemotools.
Hi @acmoudleysa, awesome, thanks a lot for the help ✌️
I have assigned you to the task and opened a branch associated with this issue for the development (87-baselineshift-fails-on-2d-array-with-dtype=int). 🤩
Scope
Following our previous discussion, we should update both the `.fit()` and `.transform()` methods for all the functions in the project.
Testing
After you have implemented the changes, we should also add the corresponding tests in the test folder. Since most of the testing is handled by the `scikit-learn` testing checks for each function, I think it is enough if we only write a single test for one function (of your choice) proving that when an `int` array is given as input, we obtain a float (`np.float64`) array after applying the `.transform()` method.
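A minimal sketch of such a test, assuming only NumPy: `ToyTransformer` is a hypothetical stand-in, since the actual chemotools class is the contributor's choice; a real test would import one of the project's transformers instead.

```python
import numpy as np

# Hypothetical stand-in for a chemotools transformer; a real test would
# import e.g. one of the actual preprocessing classes from chemotools.
class ToyTransformer:
    def fit(self, X):
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # the proposed validation step: coerce int input to float first
        X_ = np.asarray(X, dtype=np.float64).copy()
        return X_ - X_.mean(axis=1, keepdims=True)

def test_transform_returns_float_for_int_input():
    X = np.array([[1, 2, 3], [4, 5, 6]])  # int input, as in the bug report
    out = ToyTransformer().fit(X).transform(X)
    assert np.issubdtype(out.dtype, np.floating)
```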
If you have any questions or doubts, please do not hesitate to reach out! 😄
Once again, thank you very much for your help! 💪
Although I should mention that spectroscopic data are usually not of dtype `int`.
The other classes, like `SNV`, face a similar problem with dtype. If you agree, we can just edit `check_input.py` and convert the input to `float` there. This way we won't have to make changes everywhere.
Hey @acmoudleysa! 🌟 Thanks a lot for bringing this up, and my sincere apologies for the delay in getting back to you! I'm thrilled to dive into this and explore some solutions together. 😊 Let me do a bit of digging, and I'll come back to you with some ideas. Once we've got a plan in place, we can jump right into implementation! 🚀
Out of curiosity, what led you to stumble upon spectra with int? 💪
Thanks for flagging this and looking forward to our next steps together! 🎉
Thank you! After some investigation, I've come to the conclusion that the most effective approach would be to validate the data during the fit and transform steps of each method. I'm in the process of transitioning away from the `check_input.py` functions and instead incorporating `X = self._validate_data(X)` into our workflow. My suggestion is to handle this validation similarly to how it's done in other `scikit-learn` preprocessing methods:

For the `fit()` method:

    X = self._validate_data(X, dtype=FLOAT_DTYPES)

For the `transform()` method, we could substitute the current:

    X = check_input(X)
    X_ = X.copy()

with something like this:

    X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True)

This adjustment should be applied to both the fit and transform methods across all preprocessing functions in `chemotools`. Let me know if you're interested in contributing to this task, and I'll happily assign it to you so we can start discussing the implementation and testing!
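A rough sketch of that pattern, using the public `check_array` plus `FLOAT_DTYPES` instead of the private `_validate_data` so it runs on any scikit-learn version; the dtype handling is equivalent, and `DemoTransformer` is just an illustrative name, not a chemotools class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import FLOAT_DTYPES, check_array

class DemoTransformer(BaseEstimator, TransformerMixin):
    """Illustrative transformer showing the proposed validation calls."""

    def fit(self, X, y=None):
        # fit: validate and coerce int input to a float dtype
        X = check_array(X, dtype=FLOAT_DTYPES)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # transform: same validation, plus copy=True so the caller's
        # array is never modified in place
        X_ = check_array(X, dtype=FLOAT_DTYPES, copy=True)
        return X_ + 1.0  # stand-in for the actual preprocessing
```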
Again, many thanks for your help with this 😊
Hi @paucablop. Yes, you can assign the task to me. I am happy to contribute.
Hey @paucablop! I was going through the source code of the `_validate_data` method of the `BaseEstimator` class, and I realized that on this line, `_check_n_features(X, reset=reset)` already checks the number of features during `transform`, which means we might not need these lines in our `transform` methods (I will check whether all of the preprocessing classes have this line):

    # Check that the number of features is the same as the fitted data
    if X_.shape[1] != self.n_features_in_:
        raise ValueError(f"Expected {self.n_features_in_} features but got {X_.shape[1]}")

The trick is to call `self._validate_data(..., reset=True)` inside the `fit` method, which sets `n_features_in_`, and `self._validate_data(..., reset=False)` inside `transform`, which makes sure the number of features matches (Here)
Hi,

> The trick is to call `self._validate_data(..., reset=True)` inside the `fit` method, which sets `n_features_in_`, and `self._validate_data(..., reset=False)` inside `transform`, which makes sure the number of features matches (Here)
I'm currently working on #44, and there I already replaced it. I used `BaseEstimator._validate_data(..., reset=True)` to be more explicit about which of the parent classes is used, because in the updated implementations the Whittaker smoother and the two PLS-baseline algorithms inherit from 4 classes (see this example).
@paucablop @acmoudleysa Is there anything standing against a dtype conversion that relies on a private class variable? For example, for the Whittaker smoother and the two PLS-baseline algorithms, it is numerically unstable to go for, e.g., `np.float32` for large signals, so the conversion has to go for `np.float64`. For some other estimators, it could make sense to go for `np.float32` for performance, because spectra usually don't have more significant digits (if at all). A class variable could then be used to solve this, like:
    class Estimator(...):
        """
        Estimates something
        """

        __dtype_work: Type = np.float64  # this is a class variable

        def __init__(self, ...):
            ...

        def fit(self, X, ...):
            ...
            if X.dtype == self.__dtype_work:
                X = BaseEstimator._validate_data(self, X, reset=True)
            else:
                # X = X.astype(self.__dtype_work) might be required for some
                # estimators if X is used further in fit
                X = BaseEstimator._validate_data(self, X.astype(self.__dtype_work), reset=True)
            ...
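As a runnable, scikit-learn-free sketch of that class-variable idea (the `_dtype_work` attribute and both class names below are illustrative, not proposed API):

```python
import numpy as np

class DemoSmoother:
    # per-class working dtype; a float32 variant just overrides this
    _dtype_work = np.float64

    def fit(self, X):
        if X.dtype != self._dtype_work:
            # convert once, up front, instead of in every downstream step
            X = X.astype(self._dtype_work)
        self.X_ = X
        return self

class Float32Smoother(DemoSmoother):
    _dtype_work = np.float32  # faster, enough digits for typical spectra

X_int = np.arange(6, dtype=np.int64).reshape(2, 3)
```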
@paucablop @acmoudleysa Regarding the tests, do you think it would make sense to just test all estimators with `pytest.mark.parametrize` and specify the estimator involved and the arrays of integers to fit and transform? If there is only a test for a single estimator, this feels a bit like going against the idea behind automated tests. A test could be as simple as:

    # the following spectra could also be fixtures
    SPECTRUM_1D_00 = ...  # some 1D spectrum of integers
    SPECTRUM_1D_01 = ...  # another 1D spectrum of integers
    SPECTRUM_2D_00 = ...  # some 2D spectrum of integers
    SPECTRUM_2D_01 = ...  # another 2D spectrum of integers

    @pytest.mark.parametrize(
        "combination",
        [
            (WhittakerSmooth, SPECTRUM_1D_00),  # could also hold an expected output, but the test's purpose is not correctness
            (WhittakerSmooth, SPECTRUM_2D_00),
            ...
            (NorrisWilliams, SPECTRUM_1D_01),
            (NorrisWilliams, SPECTRUM_2D_01),
            ...
        ],
    )
    def test_fit_transform_with_int(combination):
        estimator, X = combination
        estimator().fit_transform(X)  # if this call runs without raising, the test passes
Just some ideas. Maybe you have some feedback or further suggestions 🙃