
Comments (9)

acmoudleysa commented on May 29, 2024

Hi @paucablop. Well, I was working with NMR data and forgot to change the dtype to float (the intensities are integers).


paucablop commented on May 29, 2024

Hi @acmoudleysa awesome, thanks a lot for the help ✌️

I have assigned you to the task and opened a branch associated with this issue for the development (87-baselineshift-fails-on-2d-array-with-dtype=int). 🤩

Scope

Following our previous discussion, we should update both the .fit() and the .transform() methods for all the functions in the project.

Testing

After you have implemented the changes, we should also make sure to implement the corresponding tests in the test folder. Since most of the testing is handled by the scikit-learn checks for each function, I think it is enough if we write a single test for one function (of your choice) proving that when an int array is given as input, we obtain a float array after applying the .transform() method.
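A minimal sketch of such a test, assuming the StandardNormalVariate transformer and its import path purely for illustration (swap in whichever function you choose):

import numpy as np

from chemotools.scatter import StandardNormalVariate  # assumed import path; any transformer works


def test_transform_on_int_input_returns_float():
    # 2D array of integer intensities, mimicking the original bug report
    X = np.arange(12, dtype=int).reshape(2, 6)

    X_t = StandardNormalVariate().fit_transform(X)

    assert np.issubdtype(X_t.dtype, np.floating)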

If you have any questions or doubts, please do not hesitate to reach out! 😄

Once again, thank you very much for your help! 💪


acmoudleysa commented on May 29, 2024

Although I have to mention that spectroscopic data are usually not of dtype int.

Other classes (like SNV) face a similar problem with the dtype. If you agree, we can just edit check_input.py and convert the input to float there. This way we won't have to make changes everywhere.


paucablop commented on May 29, 2024

Hey @acmoudleysa! 🌟 Thanks a lot for bringing this up, and my sincere apologies for the delay in getting back to you! I'm thrilled to dive into this and explore some solutions together. 😊 Let me do a bit of digging, and I'll come back to you with some ideas. Once we've got a plan in place, we can jump right into implementation! 🚀

Out of curiosity, what led you to stumble upon spectra with int values? 💪

Thanks for flagging this and looking forward to our next steps together! 🎉


paucablop commented on May 29, 2024

Thank you! After some investigation, I've come to the conclusion that the most effective approach would be to validate the data during the fit and transform steps of each method. I'm in the process of transitioning away from the check_input.py functions and instead incorporating X = self._validate_data(X) into our workflow. My suggestion is to handle this validation similarly to how it's done in other scikit-learn preprocessing methods:

For the fit() method:

X = self._validate_data(X, dtype=FLOAT_DTYPES)

For the transform() method, we could replace the current:

X = check_input(X)
X_ = X.copy()

with something like this:

X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True)

This adjustment should be applied to both the fit and transform methods across all preprocessing functions in chemotools. Let me know if you're interested in contributing to this task; I'll happily assign it to you, and we can start discussing the implementation and testing!
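A minimal sketch of what one of the preprocessing classes could look like after this change (the class name is illustrative, not an actual chemotools class; FLOAT_DTYPES is the tuple of float dtypes from sklearn.utils.validation):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import FLOAT_DTYPES, check_is_fitted


class SomePreprocessor(TransformerMixin, BaseEstimator):
    """Illustrative transformer showing the proposed validation pattern."""

    def fit(self, X, y=None):
        # casts integer input to float and records the number of features
        X = self._validate_data(X, dtype=FLOAT_DTYPES)
        return self

    def transform(self, X, y=None):
        check_is_fitted(self)
        # replaces check_input(X) + X.copy(): casts to float and returns a copy
        X_ = self._validate_data(X, dtype=FLOAT_DTYPES, copy=True)
        return X_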

Again, many thanks for your help with this 😊


acmoudleysa commented on May 29, 2024

Hi @paucablop. Yes, you can assign the task to me. I am happy to contribute.


acmoudleysa commented on May 29, 2024

> Thank you! After some investigation, I've come to the conclusion that the most effective approach would be to validate the data during the fit and transform steps of each method. [...]

Hey @paucablop! I was going through the source code of the _validate_data method of the BaseEstimator class, and I realized that on this line, _check_n_features(X, reset=reset) already checks the number of features during transform, which means we might not need these lines in our transform methods (I will check whether all of the preprocessing classes have these lines):

        # Check that the number of features is the same as the fitted data
        if X_.shape[1] != self.n_features_in_:
            raise ValueError(f"Expected {self.n_features_in_} features but got {X_.shape[1]}")

The trick is to call self._validate_data(..., reset=True) inside the fit method, which sets n_features_in_, and self._validate_data(..., reset=False) inside transform, which makes sure the number of features matches (Here)
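A minimal sketch of the fit/transform pattern with the reset flag (the class name is again illustrative), showing how it makes the manual shape check above redundant:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import FLOAT_DTYPES


class SomePreprocessor(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # reset=True stores X.shape[1] in self.n_features_in_
        X = self._validate_data(X, dtype=FLOAT_DTYPES, reset=True)
        return self

    def transform(self, X, y=None):
        # reset=False compares X.shape[1] against self.n_features_in_ and raises a
        # ValueError on mismatch, so the explicit shape check is no longer needed
        return self._validate_data(X, dtype=FLOAT_DTYPES, copy=True, reset=False)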


MothNik commented on May 29, 2024

Hi,

@acmoudleysa

> The trick is to call self._validate_data(..., reset=True) inside the fit method, which sets n_features_in_, and self._validate_data(..., reset=False) inside transform, which makes sure the number of features matches (Here)

I'm currently working on #44, and there I have already replaced it. I used BaseEstimator._validate_data(..., reset=True) to be explicit about which of the parent classes is used, because in the updated implementations the Whittaker smoother and the two PLS-baseline algorithms inherit from 4 classes (see this example).

@paucablop @acmoudleysa Is there anything standing against an estimator-specific dtype conversion that relies on a (very) private class variable? For example, for the Whittaker smoother and the two PLS-baseline algorithms it is numerically unstable to go for, e.g., np.float32 for large signals, so the conversion has to go for float64. For some other estimators it could make sense to go for np.float32 for performance, because spectra usually don't have more significant digits (if at all). A class variable could then be used to solve this, like:

from typing import Type

import numpy as np
from sklearn.base import BaseEstimator


class Estimator(...):  # inherits from BaseEstimator among other parent classes
    """
    Estimates something.
    """

    __dtype_work: Type = np.float64  # this is a (name-mangled) class variable

    def __init__(self, ...):
        ...

    def fit(self, X, ...):
        ...
        if X.dtype == self.__dtype_work:
            X = BaseEstimator._validate_data(self, X, reset=True)
        else:
            # the validated (converted) array is assigned back because X might be used further in fit
            X = BaseEstimator._validate_data(self, X.astype(self.__dtype_work), reset=True)

    ...

@paucablop @acmoudleysa Regarding the tests, do you think it would make sense to just test all estimators with pytest.mark.parametrize, specifying the estimator involved and the arrays of integers to fit and transform? If there is only a test for a single estimator, this feels a bit like going against the idea behind automated tests. A test could be as simple as:

import pytest

from chemotools.derivative import NorrisWilliams
from chemotools.smooth import WhittakerSmooth

# the following spectra could also be fixtures
SPECTRUM_1D_00 = ...  # some 1D spectrum of integers
SPECTRUM_1D_01 = ...  # another 1D spectrum of integers
SPECTRUM_2D_00 = ...  # some 2D spectrum of integers
SPECTRUM_2D_01 = ...  # another 2D spectrum of integers


@pytest.mark.parametrize(
    "combination",
    [
        (WhittakerSmooth, SPECTRUM_1D_00),  # could also hold an expected result, but the test's purpose is not correctness
        (WhittakerSmooth, SPECTRUM_2D_00),
        ...,
        (NorrisWilliams, SPECTRUM_1D_01),
        (NorrisWilliams, SPECTRUM_2D_01),
        ...,
    ],
)
def test_fit_transform_with_int(combination):
    estimator_class, X = combination
    # if this call runs without raising an error, the test passes
    estimator_class().fit_transform(X)

Just some ideas. Maybe you have some feedback or further suggestions 🙃

