Comments (13)

piskvorky commented on August 15, 2024

Another resource: FLANN (http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN), which has Python bindings.

They don't mention scalability, but considering it's recent software specialized for approximate k-NN in high-dimensional spaces, this ought to be as good as it gets.

from gensim.

piskvorky commented on August 15, 2024

More resources re. approx sim search:

jtmcmc commented on August 15, 2024

Just curious: considering http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/#survivors, do you feel there is a best library to integrate now, if this were to be done? (I'm thinking of tackling this issue.)

piskvorky commented on August 15, 2024

Hmm, I wonder if it makes sense to integrate some algo fully (as in, implement Annoy directly in Python/C/Cython). Not super difficult, but not trivial either.

Another option is to rely on Annoy as a 3rd party lib, no deep integration, and just make it easier to use the Annoy API from gensim (or vice versa). I remember Annoy was picky about the type of input it accepts etc, the API was a bit unintuitive, so working around that would be a plus. Plus, reliance on C++ and Boost can make Annoy hard to install for many users.

Tackling this 4-year-old issue will be welcome :)

jtmcmc commented on August 15, 2024

Yes, I see how Annoy is a bit hard to implement. I've also found https://github.com/ryanrhymes/panns, which could be a better fit. I'm going to get Annoy installed and try to do some comparisons. Alternatively, the Google Correlate algorithm doesn't seem that complicated to implement, so that could be promising as well.

piskvorky commented on August 15, 2024

I don't think the Annoy algo is that hard to implement. It's pretty straightforward IIRC.

I mean, Erik's C++ implementation is involved, because it's heavily optimized, goes for memory-mapping etc etc. But the algo itself is clean.

Either way, let me know how you progress. Would be great to finally have something efficient in gensim :)
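
To make the "clean algo" remark concrete, here is a minimal pure-Python sketch of a forest of random-projection trees, the general idea behind Annoy. It is an illustration only, not Erik's implementation: the names `build_tree`, `build_forest`, `candidates` and `approx_knn` are hypothetical, the split rule (a hyperplane equidistant from two randomly sampled items) is a simplification, and it uses Euclidean distance with none of the memory-mapping or other optimizations mentioned above.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_tree(ids, vectors, leaf_size=8, rng=random):
    """Recursively split item ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # Sample two items; the splitting plane is equidistant from both of them.
    p, q = (vectors[i] for i in rng.sample(ids, 2))
    normal = [a - b for a, b in zip(p, q)]
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, vectors[i]) < offset]
    right = [i for i in ids if dot(normal, vectors[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, vectors, leaf_size, rng),
        "right": build_tree(right, vectors, leaf_size, rng),
    }


def build_forest(vectors, n_trees=10, leaf_size=8, rng=random):
    """Build several independent trees over all items."""
    ids = list(range(len(vectors)))
    return [build_tree(ids, vectors, leaf_size, rng) for _ in range(n_trees)]


def candidates(tree, query):
    """Descend to the single leaf on the query's side of every split."""
    node = tree
    while "leaf" not in node:
        go_left = dot(node["normal"], query) < node["offset"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def approx_knn(forest, vectors, query, k):
    """Union the candidate leaves of all trees, then rank them exactly."""
    pool = set()
    for tree in forest:
        pool.update(candidates(tree, query))

    def sqdist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    return sorted(pool, key=sqdist)[:k]
```

More trees enlarge the candidate pool and so trade query speed for recall; the result can contain fewer than `k` items when the pooled leaves are small.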

jodaiber commented on August 15, 2024

This would be incredibly useful! Is there any update on approximate sim. search in gensim (i.e. is anyone working on it)?

piskvorky commented on August 15, 2024

The only update is, @erikbern (author of Annoy) left Spotify... but he still works on Annoy, somehow :)

On the other hand, Annoy has shed its dependency on Boost + got several cleanups, fixes and improvements recently, so it's become much more viable as a 3rd party lib.

I think I'd prefer to keep the brute force exact kNN in gensim (for small problems, <1M items) and integrate cleanly with Annoy's approximate kNN for larger datasets.

@jodaiber or do you have other ideas?
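
The split proposed above (exact brute force for small collections, an approximate index for large ones) could look roughly like the sketch below. This is not gensim's API: `exact_knn` and `most_similar` are hypothetical names, `approx_index` stands for any callable wrapping an approximate index such as Annoy, and the 1M threshold is taken from the comment above.

```python
import heapq
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def exact_knn(vectors, query, k):
    """Brute force: score every item by cosine similarity and keep the top k."""
    q = normalize(query)
    sims = (
        (i, sum(a * b for a, b in zip(normalize(v), q)))
        for i, v in enumerate(vectors)
    )
    return heapq.nlargest(k, sims, key=lambda pair: pair[1])


def most_similar(vectors, query, k, approx_index=None, exact_limit=1_000_000):
    """Use exact search for small collections, the approximate index otherwise.

    approx_index is a hypothetical callable (query, k) -> results.
    """
    if approx_index is None or len(vectors) < exact_limit:
        return exact_knn(vectors, query, k)
    return approx_index(query, k)
```

For example, `exact_knn([[1, 0], [0, 1], [1, 1], [-1, 0]], [1, 0.1], 2)` ranks item 0 first and item 2 second.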

erikbern commented on August 15, 2024

I'm all for integrating Annoy. Obv I'm biased though :).

I'm currently running some benchmarks that could be relevant: https://github.com/erikbern/ann-benchmarks

tmylk commented on August 15, 2024

Annoy has been integrated in https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/annoytutorial.ipynb

erikbern commented on August 15, 2024

nice!

piskvorky commented on August 15, 2024

@tmylk can we change the tutorial to use a more meaningful dataset?

How about the GoogleNews word2vec model (3,000,000 x 300 matrix)? Lots of people use that.

tmylk commented on August 15, 2024

I agree that it's a more illustrative example to show benefits of Annoy. It would look great in a blog post. For the tutorial we chose something that easily runs on a laptop.
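
As a rough sense of scale for the laptop concern, the raw size of a dense 3,000,000 x 300 matrix can be estimated as below. The 4 bytes per value (float32) is an assumption, and the figure covers only the raw vectors, not the vocabulary or any index built on top of them.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=4):
    """Raw storage for a dense matrix; 4 bytes per value assumes float32."""
    return rows * cols * bytes_per_value


size = dense_matrix_bytes(3_000_000, 300)
print(f"{size / 1e9:.1f} GB")  # 3.6 GB
```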
