Comments (8)

lesteve commented on May 18, 2024

It kind of reminds me of #27307 and #27691. In general, copy=False means "it may avoid a copy", but avoiding the copy is not always possible. In other words, copy=False may reduce memory usage, but the fact that the input array is modified should not be relied upon.
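To make this concrete, here is a minimal sketch (purely illustrative; it does not imply any guaranteed behavior) that snapshots the inputs and checks afterwards whether the call modified them in place:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 0.1 * rng.randn(200)

X_before, y_before = X.copy(), y.copy()

# copy=False *may* avoid copying, but whether and how the inputs change
# after the call is not part of the contract.
mutual_info_regression(X, y, copy=False, random_state=0)

print("X modified in place:", not np.array_equal(X, X_before))
print("y modified in place:", not np.array_equal(y, y_before))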

I think there are two complementary ways to improve the situation:

  • change the docstring to wording similar to that in #27691
  • if the code changes you suggest improve the situation, why not? I have to say I am not sure we have any tests that check the copy=False behaviour.

aivarsoo commented on May 18, 2024

Regarding the first way: in this case, I guess the docstring should clarify that only X is affected by copy, while y is always copied, in addition to the issues with sparse matrices.

I am a bit biased here, as I want y to have the same behavior as X for my application, but I am trying to understand whether there is a reason to always copy y.

lesteve commented on May 18, 2024

It would help a lot if you could explain your use case a bit 🙏. In particular, how does it affect you that y is copied but X is not when using copy=False?

aivarsoo commented on May 18, 2024

This issue may go in a bit of a different direction now :) Let me know if I should split it into a new one.

I am updating a third-party sklearn plug-in for feature selection, where we want to add the possibility of selecting features using conditional mutual information, based on the paper. Note that, unlike the paper itself, our solution will use a proprietary solver, but the plug-in itself is open-source.

Mathematically we will aim to solve

$$ \max\limits_{\delta_i} \sum\limits_{i=1}^n I(x_i; y) \delta_i + \sum\limits_{i, j=1}^n I(x_i; y | x_j) \delta_i \delta_j $$

where $x_i$ are features, $y$ is the target variable, and the variables $\delta_i \in \{0, 1\}$ are binary, with an associated constraint $\sum_i \delta_i = K$ for a pre-defined value $K$. If $\delta_i = 1$, then feature $i$ is selected.
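To make the objective concrete, here is a small sketch with placeholder mi / cmi values (the function and variable names are illustrative, not part of any existing API) that evaluates it over all K-subsets by brute force; the actual plug-in would hand this problem to a dedicated solver:

import numpy as np
from itertools import combinations

n, K = 5, 2
rng = np.random.default_rng(0)
mi = rng.random(n)        # placeholder for the values I(x_i; y)
cmi = rng.random((n, n))  # placeholder for the values I(x_i; y | x_j)

def objective(delta):
    # delta is a binary selection vector with sum(delta) == K
    return mi @ delta + delta @ cmi @ delta

# Brute force over all K-subsets, only to illustrate the constraint.
best = max(
    combinations(range(n), K),
    key=lambda idx: objective(np.isin(np.arange(n), idx).astype(float)),
)
print("selected features:", best)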

The conditional mutual information algorithm will mostly follow this implementation, which is in turn based on the sklearn implementation of mutual information estimation.

Essentially, I want to compute the values $I(x_i; y)$, which I can do using the public functions mutual_info_classif and mutual_info_regression from sklearn.feature_selection, or the private function _compute_mi from sklearn.feature_selection._mutual_info.

  1. If I use the public function mutual_info_regression, X and y will be rescaled if they correspond to data from a non-discrete distribution. I will then need to rescale y again to compute $I(x_i; y | x_j)$. If X and y had the same behavior, I could avoid rescaling y twice and, in effect, using different y data for computing $I(x_i; y)$ and $I(x_i; y | x_j)$.
  2. If the function _compute_mi were made public, I could use my custom function to compute the values of $I(x_i; y | x_j)$ and $I(x_i; y)$ simultaneously (see the sketch after this list).
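For the second option, here is a rough sketch of what relying on the private function looks like today (assuming the signature _compute_mi(x, y, x_discrete, y_discrete, n_neighbors) currently found in sklearn.feature_selection._mutual_info; being private, it may change or disappear without deprecation):

import numpy as np
from sklearn.feature_selection._mutual_info import _compute_mi  # private API

rng = np.random.RandomState(0)
x = rng.rand(200)
y = x + 0.1 * rng.randn(200)

# Both variables are continuous here, hence the two False flags
# (x_discrete=False, y_discrete=False), with n_neighbors=3.
mi_xy = _compute_mi(x, y, False, False, 3)
print(mi_xy)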

The second option would also allow using the problem formulation described in the mRMR paper, similar to the existing issue #8889 but with our solver. I was actually considering opening a separate issue for the second option (making _compute_mi public), but decided to start with a smaller one.

I am happy to submit a PR on some of these issues.

lesteve commented on May 18, 2024

Thanks for the details!

So my understanding is that you do something like this:

mutual_info_1 = mutual_info_regression(X, y, copy=False)
# make assumptions on how X and y have been modified and call `mutual_info_regression` again
mutual_info_2 = mutual_info_regression(..., copy=False)

How much do you care about using copy=False? In other words, does using copy=True have a big effect in your use case? One suggestion would be to use copy=True in the first mutual_info_regression call and then copy=False in the second one (if that really makes a difference).
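In code, that suggestion would amount to something like this (a sketch, assuming the second call reuses the same X and y as above):

mutual_info_1 = mutual_info_regression(X, y, copy=True)   # X and y stay valid
# ... anything that still needs the original X and y happens here ...
mutual_info_2 = mutual_info_regression(X, y, copy=False)  # last use of X and y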

My current understanding, which may not be fully accurate: when using copy=False there is no guarantee on how your input values may change; basically, you may avoid a copy at the cost of not being able to use your input array after the function call.

Even if copy=False makes a big difference in your use case, I would rather rely on a private function than on how X and y are modified when using copy=False. I don't think there is any guarantee on the latter, and it may change without warning; at least if we rename or remove _compute_mi you will get an error.

Then there is the question of making _compute_mi public. I am not very familiar with this part of scikit-learn, so I don't have an informed opinion. One thing to keep in mind is that an additional public function could confuse users and/or lead them to use the wrong public function.

aivarsoo commented on May 18, 2024

Thanks for the response!

Example code would look something like

from sklearn.feature_selection import mutual_info_regression

# We rescale X and y inside the sklearn function computing I(x_i; y) for all i:
mutual_info = mutual_info_regression(X, y, copy=False)

# We re-use the rescaled X and y in our own method computing I(x_i; y | x_j)
# for all i and j, where we do not modify X or y:
cond_mutual_info = conditional_mutual_info(X, y)

Note that the mutual_info functions rescale and add noise to X and y, which I wouldn't want to do several times. Also, I want to use the same X and y for all the computations/estimations.

But if I understood you correctly, using copy=False doesn't give me many guarantees about what will happen to X after the call to this method, and I shouldn't rely on this behavior? If so, then I may need to look for a different solution anyway.

On another note, I think importing the private sklearn method wouldn't work in the long run for our plug-in.

lesteve commented on May 18, 2024

But if I understood you correctly, using copy=False doesn't give me many guarantees about what will happen to X after the call to this method, and I shouldn't rely on this behavior?

Yes, I asked other maintainers for input, and there seems to be agreement on this: if you use sklearn_func(X, copy=False), don't reuse X, because there is no guarantee on how its values have changed after the sklearn_func call.
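A short sketch of the safe patterns implied by that rule (with mutual_info_regression standing in for sklearn_func):

# If X and y are still needed afterwards, let scikit-learn do the copy:
mi = mutual_info_regression(X, y, copy=True)

# Or keep explicit copies and hand the expendable ones to the function:
X_work, y_work = X.copy(), y.copy()
mi = mutual_info_regression(X_work, y_work, copy=False)
# ... only X_work / y_work are to be considered consumed here ...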

I am going to label this issue as Documentation, because I think the most reasonable thing to do would be to document this in the glossary. I guess adding a link to the glossary entry in all the functions/classes that use copy may be a good idea too, but quite some work.

aivarsoo commented on May 18, 2024

Many thanks!
