Comments (10)
I guess you are aware of my aversion to creating new objects.
The rationale for proposing a new object is that kernel matrices are symmetric. So it seems that what we are really missing is a symmetric matrix type in SciPy, right? Is the gain for the scikit large enough to justify creating our own?
from scikit-learn.
Yes I'm getting more and more aware of it ;-)
Shape-aware matrices would indeed be really great in SciPy (maybe a nice proposal for a GSoC?). For example, the inverse of a triangular matrix is also triangular...
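As an illustration of the packed storage being discussed, here is a minimal sketch of a symmetric matrix that stores only the upper triangle, i.e. n(n+1)/2 entries instead of n^2. The class name and interface are hypothetical, not an existing SciPy or scikit-learn API:

```python
import numpy as np

class PackedSymmetricMatrix:
    """Sketch of packed storage for an n x n symmetric matrix.

    Only the upper triangle is stored (n * (n + 1) // 2 entries);
    reads and writes of (i, j) and (j, i) map to the same slot.
    Hypothetical interface, for illustration only.
    """

    def __init__(self, n):
        self.n = n
        self.data = np.zeros(n * (n + 1) // 2)

    def _index(self, i, j):
        if i > j:
            i, j = j, i  # symmetry: K[i, j] == K[j, i]
        # offset of row i in the row-major packed upper triangle,
        # then the column offset within that row
        return i * self.n - i * (i - 1) // 2 + (j - i)

    def __getitem__(self, ij):
        i, j = ij
        return self.data[self._index(i, j)]

    def __setitem__(self, ij, value):
        i, j = ij
        self.data[self._index(i, j)] = value
```

This only saves memory; as noted below, getting BLAS operations (dot, norm, solve) to work directly on the packed layout is the hard part.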
If there's a solution that has all the advantages I mentioned, why not... The most important ones are efficient storage and a built-in cache. Does joblib have good support for caching individual entries of a kernel matrix (an in-memory cache)?
I agree with the principle of getting things done upstream whenever possible, but it has two problems:
- it takes a year or more for a SciPy release to make it into mainstream distributions like Ubuntu
- it will likely take more time to get the code reviewed and accepted, since the code will be more general and abstract
Let's hear other people's opinion about a kernel object.
Joblib does not currently have an in-memory cache (and it does not implement a cache replacement policy, because the branch in which usage patterns are tracked suffers from race conditions or slowdowns when used in multiprocessing environments). However, it's a planned feature. If I am to implement it, I think it will take 6 months, given how tied down I am.
Regarding upstream vs. downstream, I agree that basing our whole strategy on upstream is not a viable option. I was just checking that we agree this functionality should really live upstream. In that case we can code it in a way that gets integrated upstream, submit a patch, and backport it into the scikit. We have already done that a few times (for instance for the connected-components part).
Like Gael, I believe this should be contributed upstream. Having useful triangular packed storage involves a (big) number of BLAS routines, and we cannot afford to ship all of them :-(
BLAS routines will help make elementary computations such as dot products or norms work directly on the triangular packed format, but that's not all there is to it.
- Say I want to implement KernelPerceptron. When I call K[i, j] from inside fit and the maximum cache size is exceeded, I would like to transparently evict the oldest entry in the kernel cache and recompute K[i, j]. This way KernelPerceptron doesn't need to handle the cache itself: that responsibility is delegated to the Kernel object.
- Say I want to implement a custom string kernel. kernel="precomputed" is out of the question for a large n_samples. kernel=callable would allow the user to keep a reference to a cache object, but the user would have to implement it himself. A kernel object would offer a nice way to let the user define how to recompute kernel entries on demand.
I hope we can have more kernel-based estimators in the scikit in the future. I think my proposal should help make things easier.
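To make the proposal above concrete, here is a sketch of what such a kernel object could look like: entries are computed on demand from a user-supplied callable and kept in a bounded in-memory LRU cache, exploiting symmetry so (i, j) and (j, i) share one slot. The class name and API are hypothetical, not an actual scikit-learn interface:

```python
from collections import OrderedDict

class CachedKernel:
    """Sketch of a kernel object with a bounded LRU entry cache.

    K[i, j] is computed on demand via kernel_func and cached; when the
    cache is full, the least recently used entry is evicted. Symmetric
    pairs (i, j) and (j, i) map to a single cache slot.
    Hypothetical interface, for illustration only.
    """

    def __init__(self, kernel_func, X, max_entries=1000):
        self.kernel_func = kernel_func
        self.X = X
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def __getitem__(self, ij):
        i, j = ij
        key = (i, j) if i <= j else (j, i)  # exploit symmetry
        if key in self._cache:
            self._cache.move_to_end(key)    # mark as recently used
            return self._cache[key]
        value = self.kernel_func(self.X[key[0]], self.X[key[1]])
        self._cache[key] = value
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return value
```

An estimator like the hypothetical KernelPerceptron could then index K[i, j] freely without any cache logic of its own, and a custom string kernel would just be a Python callable passed as kernel_func.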
An easier, but less versatile, option for point 2 would be to implement your string kernels inside libsvm (as others have done); then you would get the caching for free, since libsvm implements its own cache (controlled via cache_size).
If I understood correctly, there are three things @mblondel wants to address:
- Storing precomputed kernel matrices efficiently
- Providing an easy interface for kernel caches
- Making custom kernels easy to implement
Did I get that right?
I think storing positive definite matrices in packed form is not that important, since it only reduces memory by a factor of two. Also, as @fabianp pointed out, adding BLAS support for these matrices is out of scope for sklearn.
Handling kernel caching is certainly interesting, but I would not discuss it outside of a concrete implementation. How do kernel PCA and kernel k-means work at the moment? Do these algorithms need access to the full matrix anyway?
I think how kernel caching works is really dependent on the algorithm. For SVMs, I don't think we need to reproduce the libsvm code. If you want to implement a kernel perceptron, how you handle caching there is certainly interesting. I am not sure it will pay off to factor the caching out of the algorithm it is used in.
About custom kernels, I think this is quite independent of handling the kernel caching.
This was discussed in several places, and I think having Python callables to implement different kernels is certainly desirable. This would mean hacking libsvm, though.
Regarding the kernel cache, I agree (in retrospect) that the best cache strategy is algorithm-dependent, so the caching must be handled in the algorithm directly.
Regarding the factor of two, I think it's still nice to save it. Also, this is not only about memory: the kernels in the pairwise module actually do every computation twice.
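For illustration, exploiting symmetry so that each pair is evaluated only once could look like the sketch below. This is not the pairwise module's actual code, just a minimal example of the idea:

```python
import numpy as np

def symmetric_kernel_matrix(X, kernel_func):
    """Compute K[i, j] = kernel_func(X[i], X[j]), evaluating each
    unordered pair only once and mirroring by symmetry.

    Sketch only; the real pairwise module uses vectorized code paths.
    """
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):           # upper triangle only
            K[i, j] = kernel_func(X[i], X[j])
            K[j, i] = K[i, j]           # mirror: K is symmetric
    return K
```

This halves the number of kernel evaluations, which matters when each evaluation is expensive (e.g. a string kernel), independently of whether the result is stored in packed form.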
Closing the issue :)
If the pairwise module computes values twice, that's a bug in the pairwise module.
In order not to do the computations twice, we need Cython utilities (like the ones @dwf had started to create for euclidean_distances).