Comments (9)
Given that providing the same value for all samples is a bit weird, does anyone remember why this is supported in the first place?
I guess we need to edit the code to raise an error and re-run the tests to check whether passing a scalar int or float is used anywhere in our code base, in particular as part of the public-facing API.
If it's not used anywhere, maybe we can deprecate this.
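For reference, the scalar path just broadcasts the number to a uniform per-sample array; a minimal numpy sketch of that behavior (the function name is illustrative, mimicking what the private validation helper does):

```python
import numbers

import numpy as np


def broadcast_sample_weight(sample_weight, n_samples):
    """Sketch of the current scalar branch: a single number becomes a
    uniform weight vector; None becomes all ones."""
    if sample_weight is None:
        return np.ones(n_samples, dtype=np.float64)
    if isinstance(sample_weight, numbers.Number):
        # The branch under discussion: every sample gets the same weight.
        return np.full(n_samples, sample_weight, dtype=np.float64)
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if sample_weight.ndim != 1 or sample_weight.shape[0] != n_samples:
        raise ValueError("sample_weight must be 1D with n_samples entries")
    return sample_weight


print(broadcast_sample_weight(2.0, 4))  # [2. 2. 2. 2.]
```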
from scikit-learn.
Indeed, PR welcome to fix the documentation.
OK, will start working on it later today.
> Given that providing the same value for all samples is a bit weird, does anyone remember why this is supported in the first place?
My guess is that `_check_sample_weight` supports int/float weights because it is used across many different estimators. For some of them it might make sense to use uniform integer weights. So we'd have to specifically check in `LinearRegression` if someone passed int/float weights.
Or maybe we leave this undocumented and don't introduce a check. Nothing bad or incorrect happens if you pass an int/float as weight; it is just unusual. We need to balance this against helping users who passed an int/float by mistake; for those it would be useful to raise an error.
Following @betatim's point that this technically works, but was probably a mistake, an intermediate solution could be to just warn for now:
Warning: a single number {sample_weight} was provided as "sample_weight", each sample will receive the same weight of {sample_weight}.
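A sketch of that intermediate solution (the function name is illustrative; the real helper is scikit-learn's private `_check_sample_weight`, which also does shape and dtype checks omitted here):

```python
import numbers
import warnings

import numpy as np


def check_sample_weight_with_warning(sample_weight, n_samples):
    """Hypothetical variant of the validation helper: keep the scalar
    broadcast, but warn that a single number is unusual."""
    if isinstance(sample_weight, numbers.Number):
        warnings.warn(
            f'A single number {sample_weight} was provided as "sample_weight"; '
            f"each sample will receive the same weight of {sample_weight}.",
            UserWarning,
        )
        # Same behavior as today: broadcast the scalar to every sample.
        return np.full(n_samples, sample_weight, dtype=np.float64)
    return np.asarray(sample_weight, dtype=np.float64)
```

A user who passed a scalar by mistake sees the warning once, while existing code keeps working unchanged.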
Raising an error in `_check_sample_weight` (which is not used everywhere, so the results might not show the whole picture) when passing a float makes the following tests fail:
sklearn/linear_model/_glm/tests/test_glm.py::test_sample_weights_validation FAILED [ 8%]
sklearn/linear_model/tests/test_base.py::test_raises_value_error_if_sample_weights_greater_than_1d[2-3] FAILED [ 16%]
sklearn/linear_model/tests/test_base.py::test_raises_value_error_if_sample_weights_greater_than_1d[3-2] FAILED [ 25%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-False-None] FAILED [ 33%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-False-csr_matrix] FAILED [ 41%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-False-csr_array] FAILED [ 50%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-True-None] FAILED [ 58%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-True-csr_matrix] FAILED [ 66%]
sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency[42-True-csr_array] FAILED [ 75%]
sklearn/linear_model/tests/test_ridge.py::test_raises_value_error_if_sample_weights_greater_than_1d FAILED [ 83%]
sklearn/utils/tests/test_validation.py::test_check_sample_weight FAILED
Some tests just check that passing a float works. Some check consistency, i.e. passing a float has the same effect as passing None.
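That consistency is easy to see from the weighted normal equations: a constant weight c multiplies both X^T W X and X^T W y by c, leaving the solution unchanged. A quick pure-numpy check (no scikit-learn involved):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)


def weighted_lstsq(X, y, w):
    # Solve (X^T W X) beta = X^T W y with W = diag(w).
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)


beta_uniform = weighted_lstsq(X, y, np.full(20, 5.0))  # constant weight 5
beta_plain = weighted_lstsq(X, y, np.ones(20))         # equivalent to None
print(np.allclose(beta_uniform, beta_plain))  # True
```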
To me, keeping support for float can be handy if we want to extend our tests for consistent sample weight behavior, as discussed in e.g. #15657. But it's not a big problem, since we can just pass an array with all equal elements.
There is one use case where I think supporting a float is convenient: learning on minibatches where, for some reason, you want to apply the same weight to all elements within a batch but different weights between batches. Again, doable by passing an array with all equal elements, but less convenient.
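That minibatch workflow stays doable without scalar support, just slightly more verbose; a sketch of building the explicit array per batch (batch sizes and weights here are made up):

```python
import numpy as np

# Three minibatches with different per-batch weights (illustrative values).
batch_sizes = [4, 2, 3]
batch_weights = [1.0, 0.5, 2.0]

# What you would pass today: an explicit array with equal elements per batch.
sample_weight = np.concatenate(
    [np.full(n, w) for n, w in zip(batch_sizes, batch_weights)]
)
print(sample_weight)  # nine weights: four 1.0, two 0.5, three 2.0
```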
> Some tests just check that passing a float works. Some check consistency, i.e. passing a float has the same effect as passing None.
Is that something that should be done, i.e., use the `_check_sample_weight` function each time the `fit` method of an estimator accepts the `sample_weight` parameter? I suppose that could help ensure consistency of the API.
I think if this was being done from scratch, my inclination would be to only accept either an array or `None`. That seems a tiny bit more coherent with the underlying concept. But since `float` is currently accepted, I also don't see much harm in sticking with that, since nothing is inherently wrong with the current logic. Perhaps the only downside is undetected bugs that could occur. Do any of the maintainers have some intuition for how likely a bug of this kind would be to occur in someone's code? It doesn't seem too likely to me, but I'm quite new to open source.