
Comments (10)

alexklibisz commented on August 26, 2024

Hi, thanks for posting an issue.
If you want to use the L2LSH query, you need to use the L2LSH mapping.
Right now you're using the DenseFloat mapping which doesn't implement any logic for LSH.

There's more description of mapping/query compatibility here: http://elastiknn.klibisz.com/api/#mapping-and-query-compatibility

The error message isn't great though. I can probably improve that.
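For reference, a minimal sketch of what an L2 LSH mapping body might look like. The field name, dims, and parameter values here are made up, and the exact JSON keys may differ by plugin version; the linked docs page is authoritative.

```python
# Illustrative only: an L2 LSH mapping body for elastiknn. Parameter names
# follow the terminology in this thread (bands, rows, integer width); check
# the docs page linked above for the exact keys in your plugin version.
mapping = {
    "properties": {
        "my_vector": {
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "dims": 128,          # must match your vectors
                "model": "lsh",
                "similarity": "l2",
                "bands": 100,         # more bands -> higher recall, more compute
                "rows": 1,            # hashes per band
                "width": 3,           # integer width of the projection
            },
        }
    }
}
# With the elasticsearch-py client, something like:
# es.indices.put_mapping(index="my-index", body=mapping)
```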

from elastiknn.

shawnchen63 commented on August 26, 2024

Oops, I did not read the compatibility carefully, my bad. I have managed to implement it and got it working. Thanks for replying!

However, I am not getting high recall values. Previously I was using FAISS's LSH index on my dataset and managed to achieve quite high recall. I am quite new to LSH, so I am not sure how to set the parameters: number of bands, rows, and integer width. With FAISS I varied the num_bits parameter to tune the recall/query-speed trade-off.

How do these parameters relate to each other?


alexklibisz commented on August 26, 2024

Glad it's working.

You can modify the LSH parameters when you create the mapping. These parameters affect both recall and speed; their general effects are documented on the API page of the docs site. L2 LSH params are documented here: http://elastiknn.klibisz.com/api/#l2-lsh-mapping. For example, the number of bands (sometimes called the number of tables, or L): generally, increasing the number of bands increases recall at the cost of additional computation.

And then you can modify the number of "candidates" when you run a search. See notes here: http://elastiknn.klibisz.com/api/#lsh-search-strategy. The rough idea: a "candidate" is a vector that was matched by the approximate LSH search, so its exact similarity is then computed against the query vector. More candidates means higher recall but slower queries.
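The interaction between bands and rows can be sketched with the standard LSH amplification formula (this is textbook LSH theory, not elastiknn's code): if a single hash function collides for a given pair of vectors with probability p, then with `rows` hashes per band (AND) and `bands` bands (OR), the pair becomes a candidate with probability 1 - (1 - p^rows)^bands.

```python
# Standard LSH amplification formula (illustrative, not elastiknn's code).
def candidate_probability(p: float, bands: int, rows: int) -> float:
    """Probability a pair becomes a candidate, given per-hash collision
    probability p, `rows` hashes per band, and `bands` bands."""
    return 1.0 - (1.0 - p ** rows) ** bands

# More bands -> higher candidate probability (recall) at more compute:
assert candidate_probability(0.5, bands=20, rows=4) > candidate_probability(0.5, bands=10, rows=4)
# More rows per band -> stricter matching, fewer candidates:
assert candidate_probability(0.5, bands=20, rows=6) < candidate_probability(0.5, bands=20, rows=4)
```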

For now you'll just have to guess and check on these parameters. I'm working on a benchmark that will report the effect of these parameters for about a dozen common datasets.


shawnchen63 commented on August 26, 2024

Alright, thanks for providing the information. So far I am not getting good enough recall at a decent query speed, but I will continue to experiment. Thanks!


alexklibisz commented on August 26, 2024

@shawnchen63 Thanks for checking out the plugin anyways :).
If you're interested, there are some performance optimizations, especially for exact KNN, in the latest release: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE18

My current priority is actually speeding up the LSH queries. It turns out the main bottleneck is the term query that identifies vectors sharing hashes with the query vector; ironically, this is exactly the part that should make it faster than an exact query. More in issue #76 and PR #78. I think there is still some pretty low-hanging fruit for optimization.

Right now my suggestion would be to generally use more shards, fewer LSH "bands", fewer LSH "rows", and more "candidates" in the query.


shawnchen63 commented on August 26, 2024

Interesting. You mentioned exact KNN, so I tested the recall when using exact search in the plugin. I got lower recall (about 10% lower) than when searching directly in ES using script_score. It could be an error in my code, but for now I don't have time to check. This is my implementation of the exact search using the client API:

...
mapping = Mapping.DenseFloat(dims=dim)
...
qvec = Vec.DenseFloat(embedding.tolist())
query = NearestNeighborsQuery.Exact(field, qvec, Similarity.L2)
...

compared to directly search using script_score in ES:

query = {
    "size": K,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "1 / (1 + l2norm(params.queryVector, 'my_vector'))",
                "params": {"queryVector": embedding.tolist()},
            },
        }
    }
}

Theoretically, there shouldn't be a difference with this method. Or am I wrong?


alexklibisz commented on August 26, 2024

> Theoretically, there shouldn't be a difference in this method. Or am I wrong?

Interesting. There shouldn't be a meaningful difference; there's actually a test suite that runs on every PR and checks exactly this.

One difference is that Elastiknn uses 1 / (l2 distance + 1e-6) for the score, instead of the 1 / (l2 distance + 1) in your script. Maybe that has some effect, in that the denominator can be < 1 in Elastiknn, whereas in your script the denominator has a lower bound of 1?
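For what it's worth, both formulas are strictly decreasing in the distance, so in exact arithmetic they rank vectors identically; a difference could only show up through floating-point rounding near ties. A quick sketch (the distances are made up; the epsilon is the one mentioned above):

```python
# Both score formulas are strictly decreasing in l2 distance, so they produce
# the same ranking in exact arithmetic; only the scale differs.
dists = [0.0, 0.3, 0.3000001, 2.0]
elastiknn_scores = [1 / (d + 1e-6) for d in dists]
script_scores = [1 / (d + 1) for d in dists]

def ranking(scores):
    # Indices sorted from highest to lowest score.
    return sorted(range(len(scores)), key=lambda i: -scores[i])

assert ranking(elastiknn_scores) == ranking(script_scores)
```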

Another thing that might cause a difference is that some vectors will have effectively equivalent similarity to the query vector. That is, l2norm(q, a) and l2norm(q, b) could be equal up to five or six decimal places, so rounding errors could change the exact order at the boundaries. For example, for k=10, maybe the correct ordering of vectors 8-11 is v99, v22, v33, v44, and Elastiknn ordered them v99, v22, v44, v33. That would lower the recall even though v33 and v44 are essentially the same distance from your query vector. A quick sanity check would be to request, say, k=15 and compute the recall as recall(ground_truth_10_neighbors, returned_15_neighbors).
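That sanity check could look something like this (the ids here are made up):

```python
def recall(ground_truth, returned):
    # Order-insensitive: a true neighbor counts if it appears anywhere
    # in the returned set.
    return len(set(ground_truth) & set(returned)) / len(ground_truth)

truth_10 = [f"v{i}" for i in range(10)]
# Suppose the plugin swaps the last two boundary ids for near-ties;
# asking for 15 results recovers the missing true neighbors.
returned_10 = truth_10[:8] + ["v33", "v44"]
returned_15 = returned_10 + truth_10[8:] + ["v55", "v66", "v77"]

assert recall(truth_10, returned_10) == 0.8
assert recall(truth_10, returned_15) == 1.0
```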

I'm curious what you find! :)


alexklibisz commented on August 26, 2024

Also, maybe there's some solid reasoning for using 1 / (l2 distance + 1)? I'm open to changing it. I suppose it bounds the score to (0, 1], which is a bit tidier than (0, 1e6].


alexklibisz commented on August 26, 2024

@shawnchen63 If you're interested in trying it out, I've added an option to the LSH queries called useMLTQuery. It uses a heuristic based on index statistics to figure out which LSH hashes are worth retrieving and which ones aren't. So if you set bands to, say, 600, it might pick the ~300 most promising ones and skip the rest. I haven't tested it extensively on real datasets yet, but it seems to run faster (query run time is basically a function of the number of hashes) and to produce only slightly lower recall for the same settings (bands, etc.).
More in the release notes: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE20
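A hypothetical sketch of the idea (not the plugin's actual code, and treating "promising" as "rarest in the index" is my assumption; the release notes have the real details):

```python
# Hypothetical sketch: given each LSH hash term's document frequency in the
# index, keep only the most selective (rarest) fraction and skip the rest.
def pick_promising(hash_doc_freqs, keep_fraction=0.5):
    ranked = sorted(hash_doc_freqs, key=hash_doc_freqs.get)  # rarest first
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

freqs = {"h1": 5, "h2": 50000, "h3": 12, "h4": 900}
assert pick_promising(freqs) == ["h1", "h3"]
```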


alexklibisz commented on August 26, 2024

Closing this as the problems discussed here are resolved.

