Comments (8)
It kind of reminds me of #27307 and #27691. In general, `copy=False` means "a copy may be avoided", but avoiding it is not always possible. In other words, `copy=False` may reduce memory usage, but the fact that the input array is modified should not be relied upon.
I think there are two complementary ways to improve the situation:
- change the docstring to a wording similar to #27691
- if the code changes you suggest improve the situation, why not? I have to say I am not sure we have tests that check the `copy=False` behaviour.
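To illustrate the semantics described above, here is a minimal sketch. `scale_features` is a hypothetical function, not a scikit-learn API; it only demonstrates the "may avoid a copy" contract:

```python
import numpy as np

# Hypothetical function with a copy=False contract like the one described:
# it *may* work in place, but callers should not rely on how the input
# array ends up after the call.
def scale_features(X, copy=True):
    if copy:
        X = X.copy()
    X /= X.std(axis=0)  # mutates the caller's array when copy=False
    return X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
X_before = X.copy()
scale_features(X, copy=False)

# X may now differ from X_before; after the call it should be treated
# as scratch memory rather than as the original data.
modified_in_place = not np.allclose(X, X_before)
```

The point of the sketch: whether the input is mutated is an implementation detail of the function, so the only safe reading of `copy=False` is "the input array is no longer mine".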
from scikit-learn.
Regarding the first way: in this case, I guess the docstring should qualify that only `X` is affected by `copy`, while `y` is always copied, in addition to the issues with sparse matrices.
I am a bit biased here, as I want `y` to have the same behavior as `X` for my application, but I am trying to understand whether there is a reason to always copy `y`?
It would help a lot if you could explain your use case a bit 🙏. In particular, how does it affect you that `y` is copied and `X` is not when using `copy=False`?
This issue may now go in a somewhat different direction :) Let me know if I should split it into a new one.
I am updating a third-party sklearn plug-in for feature selection, where we want to add the possibility of feature selection using conditional mutual information based on the paper. Note that unlike the paper itself, our solution will use a proprietary solver, but the plug-in itself is open-source.
Mathematically we will aim to solve
where
The conditional mutual information algorithm will mostly follow this implementation, which is in turn based on the sklearn implementation for mutual information computation.
Essentially, I want to compute the values using `mutual_info_classif` / `mutual_info_regression` from `sklearn.feature_selection`, or the private method `_compute_mi` from `sklearn.feature_selection._mutual_info`.
- If I use the public method `mutual_info_regression`, `X` and `y` will be rescaled if they correspond to data from a non-discrete distribution. I will then need to rescale `y` again for computing $I(x_i; y \mid x_j)$. If `X` and `y` had the same behavior, then I could avoid rescaling `y` twice, and thus avoid essentially using different `y` data for computing $I(x_i; y)$ and $I(x_i; y \mid x_j)$.
- If we can make the method `_compute_mi` public, then I can use my custom function for computing the values of $I(x_i; y \mid x_j)$ and $I(x_i; y)$ simultaneously.
The second option would also allow using the problem formulation described in the mRMR paper, similar to the existing issue #8889 but using our solver. I was actually considering making the second option (making `_compute_mi` public) a separate issue too, but decided to start with the smaller one.
I am happy to submit a PR for some of these issues.
Thanks for the details!
So my understanding is that you do something like this:
```python
mutual_info_1 = mutual_info_regression(X, y, copy=False)
# make assumptions on how X and y have been modified and call `mutual_info_regression` again
mutual_info_2 = mutual_info_regression(..., copy=False)
```
How much do you care about using `copy=False`? In other words, does using `copy=True` have a big effect in your use case? One suggestion would be to use `copy=True` in the first `mutual_info_regression` call and then `copy=False` in the second one (if that really makes a difference).
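As a sketch of that suggestion (using `mutual_info_regression` for both calls just for illustration; in the real use case the second computation would be the user's own conditional-MI routine):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# First call: copy=True keeps the caller's X and y untouched.
mi_1 = mutual_info_regression(X, y, copy=True, random_state=0)

# Second (final) call: copy=False is safe here only because X and y
# are not reused afterwards.
mi_2 = mutual_info_regression(X, y, copy=False, random_state=0)
```

This keeps the originals intact through every call except the last one, at the cost of one extra copy.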
My current understanding, which may not be fully accurate: when using `copy=False` there is no guarantee on how your input values may change; basically, you may avoid a copy at the cost of not being able to use your input array after the function call.
Even if `copy=False` makes a big difference in your use case, I would rather rely on a private function than on how `X` and `y` are modified when using `copy=False`. I don't think there is any guarantee on this, and it may change without warning; at least if we rename or remove `_compute_mi` you will get an error.
Then there is the question of making `_compute_mi` public. I am not very familiar with this part of scikit-learn, so I don't have an informed opinion. One thing to keep in mind is that an additional public function could confuse users and/or lead them to not use the correct public function.
Thanks for the response!
An example of the code would be something like:
```python
# We rescale X and y in the sklearn method computing I(x_i; y) for all i:
mutual_info = mutual_info_regression(X, y, copy=False)

# We reuse the rescaled X and y in our method computing I(x_i; y | x_j)
# for all i and j, where we do not modify X or y:
cond_mutual_info = conditional_mutual_info(X, y)
```
Note that the mutual_info methods rescale and add noise to `X` and `y`, which I wouldn't want to do several times. Also, I want to use the same `X` and `y` for all the computations/estimations.
But if I understood you correctly, using `copy=False` doesn't give me many guarantees on what will happen to `X` after the call to this method, and I shouldn't rely on this functionality? If so, then I may need to look for a different solution anyway.
On another note, I think importing the private `sklearn` method wouldn't work in the long run for our plug-in.
> But if I understood you correctly, using `copy=False` doesn't give me many guarantees on what will happen to `X` after the call to this method, and I shouldn't rely on this functionality?
Yes, I asked other maintainers for input, and there basically seems to be agreement on this: if you use `sklearn_func(X, copy=False)`, don't reuse `X`, because there is no guarantee on how its values have changed after the `sklearn_func` call.
I am going to label this issue as Documentation, because I think the most reasonable thing to do would be to document this in the glossary. I guess adding a link to the glossary entry in all the functions/classes that use `copy` may be a good idea too, but that is quite some work.
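If the arrays genuinely need to be reused after the call, one defensive pattern consistent with the above (a sketch, not an official recommendation) is to keep pristine copies and hand scikit-learn throwaway buffers:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Keep the originals; let the function rescale / add noise to
# disposable copies instead.
X_work, y_work = X.copy(), y.copy()
mi = mutual_info_regression(X_work, y_work, copy=False, random_state=0)

# X and y are untouched and can be reused, e.g. for I(x_i; y | x_j).
```

This makes exactly one explicit copy per array regardless of what the function does internally, rather than relying on undocumented mutation behavior.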
Many thanks!