Comments (10)
I guess you are aware of my aversion to creating new objects.
The rationale for proposing a new object is that kernel matrices are symmetric. So it seems that what we are really missing is a symmetric matrix type in SciPy, right? Is the gain for the scikit large enough to justify creating our own?
from scikit-learn.
Yes I'm getting more and more aware of it ;-)
Shape-aware matrices would indeed be really great in SciPy (maybe a nice proposal for a GSoC?). For example, the inverse of a triangular matrix is also triangular...
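As an illustration of the packed storage being discussed, here is a minimal sketch of a symmetric matrix that stores only the upper triangle, i.e. n(n+1)/2 entries instead of n^2. The class name and interface are hypothetical, not an existing SciPy or scikit-learn API:

```python
import numpy as np

class PackedSymmetricMatrix:
    """Sketch of packed storage for an n x n symmetric matrix.

    Only the upper triangle is stored (n * (n + 1) // 2 entries);
    reads and writes of (i, j) and (j, i) map to the same slot.
    Hypothetical interface, for illustration only.
    """

    def __init__(self, n):
        self.n = n
        self.data = np.zeros(n * (n + 1) // 2)

    def _index(self, i, j):
        if i > j:
            i, j = j, i  # symmetry: K[i, j] == K[j, i]
        # offset of row i in the row-major packed upper triangle,
        # then the column offset within that row
        return i * self.n - i * (i - 1) // 2 + (j - i)

    def __getitem__(self, ij):
        i, j = ij
        return self.data[self._index(i, j)]

    def __setitem__(self, ij, value):
        i, j = ij
        self.data[self._index(i, j)] = value
```

This only saves memory; as noted below, getting BLAS operations (dot, norm, solve) to work directly on the packed layout is the hard part.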
If there's a solution that has all the advantages I mentioned, why not... The most important ones are efficient storage and a built-in cache. Does joblib have good support for caching individual entries of a kernel matrix (an in-memory cache)?
I agree with the principle of getting things done upstream whenever possible, but it has two problems:
- it takes a year or more for a SciPy release to make it into mainstream distributions like Ubuntu
- it will likely take more time to get the code reviewed and accepted, since the code will be more general and abstract
Let's hear other people's opinion about a kernel object.
Joblib does not currently have an in-memory cache (and it does not implement a cache replacement policy, because the branch in which usage patterns are tracked suffers from race conditions or slowdowns when used in multiprocessing environments). However, it's a planned feature. If I am to implement it, I think it will take 6 months, given how tied down I am.
Regarding upstream vs. downstream, I agree that basing our whole strategy on upstream is not a viable option. I was just checking that we agree this functionality should really live upstream. In that case we can code it in a way that gets integrated upstream, submit a patch, and backport it into the scikit. We have already done that a few times (for instance for the connected-components part).
Like Gael, I believe this should be contributed upstream. Having useful triangular packed storage involves a (big) number of BLAS routines, and we cannot afford to ship all of them :-(
BLAS routines will help make elementary computations such as dot products or norms work directly on the triangular packed format, but that's not all there is to it.
- Say I want to implement KernelPerceptron. When I call K[i, j] from inside fit and the maximum cache size is exceeded, I would like to transparently evict the oldest entry in the kernel cache and recompute K[i, j]. This way KernelPerceptron doesn't need to handle the cache itself: that responsibility is delegated to the Kernel object.
- Say I want to implement a custom string kernel. kernel="precomputed" is out of the question for a large n_samples. kernel=callable would allow the user to keep a reference to a cache object, but the user would have to implement it himself. A kernel object would offer a nice way to let the user define how to recompute kernel entries on demand.
I hope we can have more kernel-based estimators in the scikit in the future. I think my proposal should help make things easier.
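To make the proposal above concrete, here is a sketch of what such a kernel object could look like: entries are computed on demand from a user-supplied callable and kept in a bounded in-memory LRU cache, exploiting symmetry so (i, j) and (j, i) share one slot. The class name and API are hypothetical, not an actual scikit-learn interface:

```python
from collections import OrderedDict

class CachedKernel:
    """Sketch of a kernel object with a bounded LRU entry cache.

    K[i, j] is computed on demand via kernel_func and cached; when the
    cache is full, the least recently used entry is evicted. Symmetric
    pairs (i, j) and (j, i) map to a single cache slot.
    Hypothetical interface, for illustration only.
    """

    def __init__(self, kernel_func, X, max_entries=1000):
        self.kernel_func = kernel_func
        self.X = X
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def __getitem__(self, ij):
        i, j = ij
        key = (i, j) if i <= j else (j, i)  # exploit symmetry
        if key in self._cache:
            self._cache.move_to_end(key)    # mark as recently used
            return self._cache[key]
        value = self.kernel_func(self.X[key[0]], self.X[key[1]])
        self._cache[key] = value
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return value
```

An estimator like the hypothetical KernelPerceptron could then index K[i, j] freely without any cache logic of its own, and a custom string kernel would just be a Python callable passed as kernel_func.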
An easier, but less versatile, option for point 2 would be to implement your string kernels inside libsvm (as others have done); then you would get the caching for free, since libsvm implements its own cache (controlled via cache_size).
If I understood correctly, there are three things @mblondel wants to address:
- Storing precomputed kernel matrices efficiently
- Providing an easy interface for kernel caches
- Making custom kernels easy to implement
Did I get that right?
I think storing positive definite matrices in packed form is not that important, since it only reduces memory by a factor of two. Also, as @fabianp pointed out, adding BLAS support for these matrices is out of scope for sklearn.
Handling kernel caching is certainly interesting, but I would not discuss it outside of a concrete implementation. How do kernel PCA and kernel k-means work at the moment? Do these algorithms need access to the full matrix anyway?
I think how kernel caching works is really dependent on the algorithm. For SVMs, I don't think we need to reproduce the libsvm code. If you want to implement a kernel perceptron, how you handle caching there is certainly interesting. I am not sure it will pay off to factor the caching out of the algorithm it is used in.
About custom kernels, I think this is quite independent of handling the kernel caching.
This was discussed in several places, and I think having Python callables to implement different kernels is certainly desirable. This would mean hacking libsvm, though.
Regarding the kernel cache, I agree (in retrospect) that the best cache strategy is algorithm-dependent, so the caching must be handled in the algorithm directly.
Regarding the factor of two, I think it's still nice to save it. Also, this is not only about memory: the kernels in the pairwise module actually do every computation twice.
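For illustration, exploiting symmetry so that each pair is evaluated only once could look like the sketch below. This is not the pairwise module's actual code, just a minimal example of the idea:

```python
import numpy as np

def symmetric_kernel_matrix(X, kernel_func):
    """Compute K[i, j] = kernel_func(X[i], X[j]), evaluating each
    unordered pair only once and mirroring by symmetry.

    Sketch only; the real pairwise module uses vectorized code paths.
    """
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):           # upper triangle only
            K[i, j] = kernel_func(X[i], X[j])
            K[j, i] = K[i, j]           # mirror: K is symmetric
    return K
```

This halves the number of kernel evaluations, which matters when each evaluation is expensive (e.g. a string kernel), independently of whether the result is stored in packed form.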
Closing the issue :)
If the pairwise module computes values twice, that's a bug in the pairwise module.
In order not to do the computations twice, we need Cython utilities (like the ones @dwf had started to create for euclidean_distances).