
starkdg / hftrie

8 stars · 2 watchers · 1 fork · 49 KB

index binary vectors for efficient nearest neighbor search

License: Other

CMake 3.44% C++ 96.56%
binary-vector nearest-neighbor-search trie vector-database hamming-space


hftrie's Issues

Performance idea: chunk values in leaf nodes and do a linear lookup within each chunk

Hi there,

I'm opening this issue as a follow-up to your comment on HN a few months ago.

My own cppbktree Python module implements a straightforward BK-tree, which I found to be too slow, especially when trying to look up millions of hashes in a database of billions. Exact lookup is not a problem, but lookup with a distance of around 10-16 out of 64 bits, which is what I need, is.

I found some motivation to revisit this because I was wondering how fast a simple linear lookup could be, maybe even using SIMD, since for large distances it seemed that almost all BK-tree nodes would be visited anyway. It turns out that my hand-crafted AVX SIMD loop is 10% slower than a simple loop over 64-bit values using xor and popcnt. But even that simple loop is ~300x faster than the BK-tree version for large distances and large datasets. The following figure shows the simple linear lookup (solid lines) competing with the BK tree; the BK tree can only compete for very short distances.

[Figure: compare-scalings-cppbktree-linear-lookup]

The idea now would be to mix a BK tree with linear lookup by implementing a kind of chunking per leaf node: intermediary nodes stay BK-tree nodes, while leaves hold flat chunks of values that are scanned linearly. This way, I can combine the large linear-scan speedup for large distances with the fast BK-tree speed for near-exact lookups. This works quite well:

[Figure: compare-scalings-cppbktree-vs-chunked]
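A minimal sketch of the chunked-leaf idea (hypothetical code, not the actual cppbktree or hftrie implementation): each node buffers values in a flat vector and only splits into distance-keyed children once the buffer overflows, so queries with large radii degrade into fast linear xor+popcnt scans over the leaves.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Node {
    static constexpr size_t LEAF_CAP = 4;  // tiny for the demo; 10k-100k in practice
    std::vector<uint64_t> bucket;          // flat chunk of values in this node
    std::map<int, std::unique_ptr<Node>> children;  // keyed by distance to pivot

    static int dist(uint64_t a, uint64_t b) { return __builtin_popcountll(a ^ b); }

    void insert(uint64_t v) {
        if (children.empty() && bucket.size() < LEAF_CAP) {
            bucket.push_back(v);
            return;
        }
        if (children.empty()) {  // overflow: keep bucket[0] as pivot, push the rest down
            std::vector<uint64_t> overflow(bucket.begin() + 1, bucket.end());
            bucket.resize(1);
            for (uint64_t o : overflow) route(o);
        }
        route(v);
    }

    void route(uint64_t v) {
        int d = dist(bucket[0], v);
        auto& child = children[d];
        if (!child) child = std::make_unique<Node>();
        child->insert(v);
    }

    void query(uint64_t q, int r, std::vector<uint64_t>& out) const {
        if (children.empty()) {  // leaf: linear xor+popcnt scan over the chunk
            for (uint64_t v : bucket)
                if (dist(v, q) <= r) out.push_back(v);
            return;
        }
        int d = dist(bucket[0], q);
        if (d <= r) out.push_back(bucket[0]);
        for (const auto& [cd, child] : children)  // BK-tree triangle-inequality pruning
            if (cd >= d - r && cd <= d + r) child->query(q, r, out);
    }
};
```

With a realistic LEAF_CAP the tree stays shallow, pruning still pays off for small radii, and the leaf scans dominate for large radii, which is exactly the trade-off the two figures show.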

I would be very interested in whether this same idea can be applied to your code. It has a CHUNKSIZE parameter, but that seems to be something different. In the end, it should improve performance if there are leaf nodes containing around 10-100k elements that are then searched linearly. (I'm not sure about this, but I think this would correspond to a chunk size of 16-19 bits, which is what your CHUNKSIZE variable represents, if I understand it correctly. That value would have to be applied only to the leaf nodes, while the intermediary nodes keep using a CHUNKSIZE of 4.)
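The leaf-size estimate above can be sanity-checked with a back-of-the-envelope formula (an assumption of uniformly distributed hashes, not a statement about hftrie's actual behavior): with n hashes and leaves addressed by a bits-wide prefix, each leaf holds about n / 2^bits elements.

```cpp
#include <cstdint>

// Expected elements per leaf for n uniformly distributed hashes when
// leaves are addressed by a `bits`-bit prefix. Purely illustrative.
double expected_leaf_size(double n, int bits) {
    return n / static_cast<double>(1ULL << bits);
}
```

For a billion hashes, a 16-bit prefix gives roughly 15k elements per leaf, which lands in the 10-100k range suggested above; each extra prefix bit halves the leaf size.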
