Comments (8)

lesteve commented on May 18, 2024

It kind of reminds me of #27307 and #27691. In general, copy=False means "it may avoid a copy", but avoiding the copy is not always possible. In other words, copy=False may reduce memory usage, but the fact that the input array is modified should not be relied upon.
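To make this concrete, here is a minimal sketch (purely illustrative; it does not imply any guaranteed behavior) that snapshots the inputs and checks afterwards whether the call modified them in place:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 0.1 * rng.randn(200)

X_before, y_before = X.copy(), y.copy()

# copy=False *may* avoid copying, but whether and how the inputs change
# after the call is not part of the contract.
mutual_info_regression(X, y, copy=False, random_state=0)

print("X modified in place:", not np.array_equal(X, X_before))
print("y modified in place:", not np.array_equal(y, y_before))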

I think there are two complementary ways to improve the situation:

  • change the docstring to wording similar to that in #27691
  • if the code changes you suggest improve the situation, why not? I have to say I am not sure we have any tests that check the copy=False behaviour.

aivarsoo commented on May 18, 2024

Regarding the first way: in this case, I guess the docstring should clarify that only X is affected by copy, while y is always copied, in addition to the issues with sparse matrices.

I am a bit biased here, as I want y to have the same behavior as X for my application, but I am trying to understand whether there is a reason to always copy y.

lesteve commented on May 18, 2024

It would help a lot if you could explain your use case a bit 🙏. In particular, how does it affect you that y is copied but X is not when using copy=False?

aivarsoo commented on May 18, 2024

This issue may go in a bit of a different direction now :) Let me know if I should split it into a new one.

I am updating a third-party sklearn plug-in for feature selection, where we want to add the possibility of selecting features using conditional mutual information, based on the paper. Note that, unlike the paper itself, our solution will use a proprietary solver, but the plug-in itself is open-source.

Mathematically we will aim to solve

$$ \max\limits_{\delta_i} \sum\limits_{i=1}^n I(x_i; y) \delta_i + \sum\limits_{i, j=1}^n I(x_i; y | x_j) \delta_i \delta_j $$

where $x_i$ are features, $y$ is the target variable, and the variables $\delta_i \in \{0, 1\}$ are binary, with an associated constraint $\sum_i \delta_i = K$ for a pre-defined value $K$. If $\delta_i = 1$, then feature $i$ is selected.
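To make the objective concrete, here is a small sketch with placeholder mi / cmi values (the function and variable names are illustrative, not part of any existing API) that evaluates it over all K-subsets by brute force; the actual plug-in would hand this problem to a dedicated solver:

import numpy as np
from itertools import combinations

n, K = 5, 2
rng = np.random.default_rng(0)
mi = rng.random(n)        # placeholder for the values I(x_i; y)
cmi = rng.random((n, n))  # placeholder for the values I(x_i; y | x_j)

def objective(delta):
    # delta is a binary selection vector with sum(delta) == K
    return mi @ delta + delta @ cmi @ delta

# Brute force over all K-subsets, only to illustrate the constraint.
best = max(
    combinations(range(n), K),
    key=lambda idx: objective(np.isin(np.arange(n), idx).astype(float)),
)
print("selected features:", best)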

The conditional mutual information algorithm will mostly follow this implementation, which is in turn based on the sklearn implementation of mutual information estimation.

Essentially, I want to compute the values $I(x_i; y)$, which I can do using the public functions mutual_info_classif and mutual_info_regression from sklearn.feature_selection, or the private function _compute_mi from sklearn.feature_selection._mutual_info.

  1. If I use the public function mutual_info_regression, X and y will be rescaled if they correspond to data from a non-discrete distribution. I will then need to rescale y again to compute $I(x_i; y | x_j)$. If X and y had the same behavior, I could avoid rescaling y twice and, in effect, using different y data for computing $I(x_i; y)$ and $I(x_i; y | x_j)$.
  2. If the function _compute_mi were made public, I could use my custom function to compute the values of $I(x_i; y | x_j)$ and $I(x_i; y)$ simultaneously (see the sketch after this list).
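For the second option, here is a rough sketch of what relying on the private function looks like today (assuming the signature _compute_mi(x, y, x_discrete, y_discrete, n_neighbors) currently found in sklearn.feature_selection._mutual_info; being private, it may change or disappear without deprecation):

import numpy as np
from sklearn.feature_selection._mutual_info import _compute_mi  # private API

rng = np.random.RandomState(0)
x = rng.rand(200)
y = x + 0.1 * rng.randn(200)

# Both variables are continuous here, hence the two False flags
# (x_discrete=False, y_discrete=False), with n_neighbors=3.
mi_xy = _compute_mi(x, y, False, False, 3)
print(mi_xy)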

The second option would also allow using the problem formulation described in the mRMR paper, similar to the existing issue #8889 but with our solver. I was actually considering opening a separate issue for the second option (making _compute_mi public), but decided to start with a smaller one.

I am happy to submit a PR on some of these issues.

lesteve commented on May 18, 2024

Thanks for the details!

So my understanding is that you do something like this:

mutual_info_1 = mutual_info_regression(X, y, copy=False)
# make assumptions on how X and y have been modified and call `mutual_info_regression` again
mutual_info_2 = mutual_info_regression(..., copy=False)

How much do you care about using copy=False? In other words, does using copy=True have a big effect in your use case? One suggestion would be to use copy=True in the first mutual_info_regression call and then copy=False in the second one (if that really makes a difference).
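In code, that suggestion would amount to something like this (a sketch, assuming the second call reuses the same X and y as above):

mutual_info_1 = mutual_info_regression(X, y, copy=True)   # X and y stay valid
# ... anything that still needs the original X and y happens here ...
mutual_info_2 = mutual_info_regression(X, y, copy=False)  # last use of X and y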

My current understanding, which may not be fully accurate: when using copy=False there is no guarantee on how your input values may change; basically, you may avoid a copy at the cost of not being able to use your input array after the function call.

Even if copy=False makes a big difference in your use case, I would rather rely on a private function than on how X and y are modified when using copy=False. I don't think there is any guarantee on the latter, and it may change without warning; at least if we rename or remove _compute_mi you will get an error.

Then there is the question of making _compute_mi public. I am not very familiar with this part of scikit-learn, so I don't have an informed opinion. One thing to keep in mind is that an additional public function could confuse users and/or lead them to use the wrong public function.

aivarsoo commented on May 18, 2024

Thanks for the response!

Example code would look something like

from sklearn.feature_selection import mutual_info_regression

# We rescale X and y inside the sklearn function computing I(x_i; y) for all i:
mutual_info = mutual_info_regression(X, y, copy=False)

# We re-use the rescaled X and y in our own method computing I(x_i; y | x_j)
# for all i and j, where we do not modify X or y:
cond_mutual_info = conditional_mutual_info(X, y)

Note that the mutual_info functions rescale and add noise to X and y, which I wouldn't want to do several times. Also, I want to use the same X and y for all the computations/estimations.

But if I understood you correctly, using copy=False doesn't give me many guarantees about what will happen to X after the call to this method, and I shouldn't rely on this behavior? If so, then I may need to look for a different solution anyway.

On another note, I think importing the private sklearn method wouldn't work in the long run for our plug-in.

lesteve commented on May 18, 2024

But if I understood you correctly, using copy=False doesn't give me many guarantees about what will happen to X after the call to this method, and I shouldn't rely on this behavior?

Yes, I asked other maintainers for input, and there seems to be agreement on this: if you use sklearn_func(X, copy=False), don't reuse X, because there is no guarantee on how its values have changed after the sklearn_func call.
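A short sketch of the safe patterns implied by that rule (with mutual_info_regression standing in for sklearn_func):

# If X and y are still needed afterwards, let scikit-learn do the copy:
mi = mutual_info_regression(X, y, copy=True)

# Or keep explicit copies and hand the expendable ones to the function:
X_work, y_work = X.copy(), y.copy()
mi = mutual_info_regression(X_work, y_work, copy=False)
# ... only X_work / y_work are to be considered consumed here ...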

I am going to label this issue as Documentation, because I think the most reasonable thing to do would be to document this in the glossary. I guess adding a link to the glossary entry in all the functions/classes that use copy may be a good idea too, but quite some work.

aivarsoo commented on May 18, 2024

Many thanks!
