
Comments (10)

GaelVaroquaux commented on May 4, 2024

I guess you are aware of my aversion to creating new objects.

The rationale for proposing a new object is that kernel matrices are symmetric. So it seems that what we are really missing is a symmetric matrix type in scipy, right? Is the gain for the scikit large enough to justify creating our own?
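
For reference, the symmetry means only n(n+1)/2 entries need to be stored instead of n². A minimal numpy sketch of such packed storage (the `packed_index` helper is illustrative, not an existing scipy API):

```python
import numpy as np

n = 4
rng = np.random.RandomState(0)
A = rng.randn(n, n)
K = A @ A.T                      # a symmetric, kernel-like matrix

iu = np.triu_indices(n)          # upper triangle, including the diagonal
packed = K[iu]                   # n * (n + 1) / 2 entries instead of n ** 2

def packed_index(i, j, n):
    # Row-major position of (i, j), with i <= j, inside the packed upper triangle.
    return i * n - i * (i - 1) // 2 + (j - i)

i, j = 3, 1
assert packed[packed_index(min(i, j), max(i, j), n)] == K[i, j]
```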


mblondel commented on May 4, 2024

Yes I'm getting more and more aware of it ;-)

Shape-aware matrices would indeed be really great in Scipy (maybe a nice proposal for a GSoC?). For example, the inverse of a triangular matrix is also triangular...
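
As a small illustration of that point, using only existing scipy calls (a sketch, nothing scikit-specific):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.RandomState(0)
L = np.tril(rng.randn(4, 4)) + 4 * np.eye(4)   # well-conditioned lower triangular

# Invert L by solving L @ X = I; solve_triangular exploits the structure.
L_inv = solve_triangular(L, np.eye(4), lower=True)

# The inverse is itself lower triangular, a property a shape-aware
# matrix type could propagate automatically.
assert np.allclose(L_inv, np.tril(L_inv))
```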

If there's a solution that has all the advantages I mentioned, why not... The most important ones are efficient storage and a built-in cache. Does joblib have good support for caching individual entries of a kernel matrix (an in-memory cache)?

I agree with the principle of getting things done upstream whenever possible, but it has two problems:

  • it takes one year or more for a Scipy release to make it to mainstream distributions like Ubuntu
  • it will likely take more time to get the code reviewed and accepted since the code will be more general and abstract

Let's hear other people's opinion about a kernel object.


GaelVaroquaux commented on May 4, 2024

Joblib does not currently have an in-memory cache (and it does not implement a cache replacement policy, because the branch in which usage patterns are tracked suffers from race conditions or slowdowns when used in multiprocessing environments). However, it's a planned feature. If I am to implement it, I think it will take 6 months, given how tied down I am.
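
For context, what joblib does offer today is transparent disk-based memoization of whole function calls, along these lines (the cache location and function are illustrative):

```python
import numpy as np
from joblib import Memory

memory = Memory("/tmp/joblib_cache", verbose=0)

@memory.cache
def compute_kernel(X, Y):
    # Memoized at the granularity of the whole call, not per kernel entry.
    return X @ Y.T

X = np.random.RandomState(0).randn(100, 5)
K = compute_kernel(X, X)    # computed, then written to the on-disk cache
K = compute_kernel(X, X)    # served from the cache
```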

With regards to upstream vs downstream, I agree that basing our whole strategy on upstream is not a viable option. I was just checking that we agree that this functionality should really live upstream. In that case, we can code it in a way that gets it integrated upstream, submit a patch, and backport it to the scikit. We have already done that a few times (for instance the connected components part).


fabianp commented on May 4, 2024

Like Gael, I believe this should be contributed upstream. Having useful triangular packed storage involves a (large) number of BLAS routines, and we cannot afford to ship all of them :-(


mblondel commented on May 4, 2024

BLAS routines will let elementary computations such as dot or norm be done directly in the triangular packed format, but that's not all there is to it.

  1. Say I want to implement KernelPerceptron. When I call K[i, j] from inside fit, if the max cache size is exceeded, I would like to transparently delete the oldest entry in the kernel cache and recompute K[i, j]. This way KernelPerceptron doesn't need to handle caching: it is delegated to the Kernel object (see the sketch after this list).

  2. Say I want to implement a custom string kernel. kernel="precomputed" is out of the question for a large n_samples. kernel=callable would allow the user to keep a reference to a cache object, but the user would have to implement it themselves. A kernel object would offer a nice way to let the user define how to recompute kernel entries on demand.
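
A minimal sketch of what such a kernel object could look like, with an in-memory LRU cache over individual entries (the class and all its names are hypothetical, not an existing scikit-learn API):

```python
from collections import OrderedDict

import numpy as np


class LRUKernel:
    """Hypothetical kernel object: computes entries on demand, caches them."""

    def __init__(self, X, kernel_func, max_entries=100000):
        self.X = X
        self.kernel_func = kernel_func
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def __getitem__(self, ij):
        i, j = min(ij), max(ij)              # exploit symmetry: K[i, j] == K[j, i]
        if (i, j) in self._cache:
            self._cache.move_to_end((i, j))  # mark as most recently used
        else:
            if len(self._cache) >= self.max_entries:
                self._cache.popitem(last=False)   # evict the oldest entry
            self._cache[(i, j)] = self.kernel_func(self.X[i], self.X[j])
        return self._cache[(i, j)]


# E.g. inside a kernel perceptron's fit loop:
X = np.random.RandomState(0).randn(50, 3)
K = LRUKernel(X, kernel_func=np.dot)
value = K[3, 7]    # computed once, then served from the in-memory cache
```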

I hope we can have more kernel-based estimators in the scikit in the future. I think my proposal should help make things easier.


fabianp commented on May 4, 2024

An easier option for 2. (but a less versatile one) would be to implement your string kernels inside libsvm (as others have done); then you would get the caching for free, since libsvm implements its own cache (controlled via cache_size).
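
For reference, SVC already accepts a Python callable as the kernel, and cache_size (in MB) sets libsvm's internal kernel cache; note though that with a callable, the full Gram matrix is, as far as I can tell, computed up front, which is exactly the problem for large n_samples. A toy linear kernel stands in for a real string kernel here:

```python
import numpy as np
from sklearn.svm import SVC

def my_kernel(X, Y):
    # Stand-in for a custom kernel; a real string kernel would compare sequences.
    return X @ Y.T

rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = (X[:, 0] > 0).astype(int)

clf = SVC(kernel=my_kernel, cache_size=200)   # 200 MB libsvm kernel cache
clf.fit(X, y)
```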


amueller commented on May 4, 2024

If I understood correctly, there are three things @mblondel wants to address:

  1. Storing precomputed kernel matrices efficiently

  2. Provide an easy interface for kernel caches

  3. Make custom kernels easy to implement

Did I get that right?

I think storing positive definite matrices compactly is not that important, since it only halves the memory usage. Also, as @fabianp pointed out, having BLAS routines for these matrices is out of the scope of sklearn.

Handling kernel caching is certainly interesting, but I would not want to discuss it outside of a concrete implementation. How do kernel PCA and kernel k-means work at the moment? Do these algorithms need access to the full matrix anyway?
I think how kernel caching works really depends on the algorithm. For SVMs, I don't think we need to reproduce the libsvm code. If you want to implement a kernel perceptron, how you handle caching there is certainly interesting. I am not sure whether factoring the caching out of the algorithm it is used in will pay off.

About custom kernels, I think this is quite independent of handling the kernel caching.
This was discussed in several places, and I think having Python callables to implement different kernels is certainly desirable. This would mean hacking libsvm, though.


mblondel commented on May 4, 2024

Regarding the kernel cache, I agree (in retrospect) that the best cache strategy is algorithm-dependent, so the caching must be handled in the algorithm directly.

Regarding the factor of two in memory, I think it's still nice to save it. Also, this is not only about memory: the kernels in the pairwise module actually do all the computations twice.
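
To make that concrete, here is the idea in plain numpy (an illustration, not the pairwise module's actual code): evaluate the kernel only for i <= j and mirror the result, instead of evaluating every (i, j) pair twice.

```python
import numpy as np

def symmetric_kernel_matrix(X, kernel_func):
    # Evaluate kernel_func only on the upper triangle, then mirror it.
    n = X.shape[0]
    K = np.zeros((n, n))
    for i, j in zip(*np.triu_indices(n)):
        K[i, j] = kernel_func(X[i], X[j])
    il = np.tril_indices(n, k=-1)
    K[il] = K.T[il]            # fill the lower triangle by symmetry
    return K

X = np.random.RandomState(0).randn(10, 3)
K = symmetric_kernel_matrix(X, np.dot)
assert np.allclose(K, K.T)
```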

Closing the issue :)


amueller commented on May 4, 2024

If the pairwise module computes values twice, that's a bug in the pairwise module.


mblondel commented on May 4, 2024

In order not to do the computations twice, we need Cython utilities (like the ones that @dwf had started to create for euclidean_distances).

