
starkdg / hftrie

8 stars · 2 watchers · 1 fork · 49 KB

index binary vectors for efficient nearest neighbor search

License: Other

CMake 3.44% C++ 96.56%
binary-vector nearest-neighbor-search trie vector-database hamming-space


hftrie's Issues

Performance idea: chunk values in leaf nodes and do a linear lookup within each chunk

Hi there,

I'm opening this issue as a follow-up to your comment on HN a few months ago.

My own cppbktree Python module implements a straightforward BK-tree, which I found to be too slow, especially when trying to look up millions of hashes in a database of billions. Exact lookup is not a problem, but lookup with a distance of around 10-16 out of 64 bits, which is what I need, is.

I found some motivation to revisit this because I was wondering how fast a simple linear lookup could be, maybe even using SIMD, since for large distances it seemed that almost all BK-tree nodes would be visited anyway. It turns out that my hand-crafted AVX SIMD loop is 10% slower than a simple loop over 64-bit values using xor and popcnt. But even that simple loop is ~300x faster than the BK-tree version for large distances and large datasets. The following figure shows the simple linear lookup (solid lines) competing with the BK tree; the BK tree can only compete for very short distances.

[Figure: compare-scalings-cppbktree-linear-lookup]

The idea now would be to mix a BK tree with linear lookup by implementing a kind of chunking per leaf node: intermediary nodes stay BK-tree nodes, while leaves hold flat chunks of values that are scanned linearly. This way, I can combine the large linear-scan speedup for large distances with the fast BK-tree speed for near-exact lookups. This works quite well:

[Figure: compare-scalings-cppbktree-vs-chunked]
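A minimal sketch of the chunked-leaf idea (hypothetical code, not the actual cppbktree or hftrie implementation): each node buffers values in a flat vector and only splits into distance-keyed children once the buffer overflows, so queries with large radii degrade into fast linear xor+popcnt scans over the leaves.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Node {
    static constexpr size_t LEAF_CAP = 4;  // tiny for the demo; 10k-100k in practice
    std::vector<uint64_t> bucket;          // flat chunk of values in this node
    std::map<int, std::unique_ptr<Node>> children;  // keyed by distance to pivot

    static int dist(uint64_t a, uint64_t b) { return __builtin_popcountll(a ^ b); }

    void insert(uint64_t v) {
        if (children.empty() && bucket.size() < LEAF_CAP) {
            bucket.push_back(v);
            return;
        }
        if (children.empty()) {  // overflow: keep bucket[0] as pivot, push the rest down
            std::vector<uint64_t> overflow(bucket.begin() + 1, bucket.end());
            bucket.resize(1);
            for (uint64_t o : overflow) route(o);
        }
        route(v);
    }

    void route(uint64_t v) {
        int d = dist(bucket[0], v);
        auto& child = children[d];
        if (!child) child = std::make_unique<Node>();
        child->insert(v);
    }

    void query(uint64_t q, int r, std::vector<uint64_t>& out) const {
        if (children.empty()) {  // leaf: linear xor+popcnt scan over the chunk
            for (uint64_t v : bucket)
                if (dist(v, q) <= r) out.push_back(v);
            return;
        }
        int d = dist(bucket[0], q);
        if (d <= r) out.push_back(bucket[0]);
        for (const auto& [cd, child] : children)  // BK-tree triangle-inequality pruning
            if (cd >= d - r && cd <= d + r) child->query(q, r, out);
    }
};
```

With a realistic LEAF_CAP the tree stays shallow, pruning still pays off for small radii, and the leaf scans dominate for large radii, which is exactly the trade-off the two figures show.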

I would be very interested in whether this same idea can be applied to your code. It has a CHUNKSIZE parameter, but that seems to be something different. In the end, it should improve performance if there are leaf nodes containing around 10-100k elements that are then searched linearly. (I'm not sure about this, but I think this would correspond to a chunk size of 16-19 bits, which is what your CHUNKSIZE variable represents, if I understand it correctly. That value would have to be applied only to the leaf nodes, while the intermediary nodes keep using a CHUNKSIZE of 4.)
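The leaf-size estimate above can be sanity-checked with a back-of-the-envelope formula (an assumption of uniformly distributed hashes, not a statement about hftrie's actual behavior): with n hashes and leaves addressed by a bits-wide prefix, each leaf holds about n / 2^bits elements.

```cpp
#include <cstdint>

// Expected elements per leaf for n uniformly distributed hashes when
// leaves are addressed by a `bits`-bit prefix. Purely illustrative.
double expected_leaf_size(double n, int bits) {
    return n / static_cast<double>(1ULL << bits);
}
```

For a billion hashes, a 16-bit prefix gives roughly 15k elements per leaf, which lands in the 10-100k range suggested above; each extra prefix bit halves the leaf size.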
