
Comments (10)

alexklibisz commented on August 26, 2024

Hi, thanks for posting an issue.
If you want to use the L2LSH query, you need to use the L2LSH mapping.
Right now you're using the DenseFloat mapping which doesn't implement any logic for LSH.

There's more description of mapping/query compatibility here: http://elastiknn.klibisz.com/api/#mapping-and-query-compatibility

The error message isn't great though. I can probably improve that.
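For reference, a minimal sketch of what an L2 LSH mapping body might look like. The field name, dims, and parameter values here are made up, and the exact JSON keys may differ by plugin version; the linked docs page is authoritative.

```python
# Illustrative only: an L2 LSH mapping body for elastiknn. Parameter names
# follow the terminology in this thread (bands, rows, integer width); check
# the docs page linked above for the exact keys in your plugin version.
mapping = {
    "properties": {
        "my_vector": {
            "type": "elastiknn_dense_float_vector",
            "elastiknn": {
                "dims": 128,          # must match your vectors
                "model": "lsh",
                "similarity": "l2",
                "bands": 100,         # more bands -> higher recall, more compute
                "rows": 1,            # hashes per band
                "width": 3,           # integer width of the projection
            },
        }
    }
}
# With the elasticsearch-py client, something like:
# es.indices.put_mapping(index="my-index", body=mapping)
```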

from elastiknn.

shawnchen63 commented on August 26, 2024

Oops, I did not read the compatibility carefully, my bad. I have managed to implement it and got it working. Thanks for replying!

However, I am not getting high recall values. Previously I was using FAISS's LSH index on my dataset and managed to achieve quite high recall. I am quite new to LSH, so I am not sure how to set the parameters: number of bands, rows, and integer width. With FAISS I varied the num_bits parameter to tune the recall/query-speed trade-off.

How do these parameters relate to each other?


alexklibisz commented on August 26, 2024

Glad it's working.

You can modify the LSH parameters when you create the mapping. These parameters affect both recall and speed; their general effects are documented on the API page of the docs site. L2 LSH params are documented here: http://elastiknn.klibisz.com/api/#l2-lsh-mapping. For example, the number of bands (sometimes called the number of tables, or L): generally, increasing the number of bands increases recall at the cost of additional computation.

And then you can modify the number of "candidates" when you run a search. See notes here: http://elastiknn.klibisz.com/api/#lsh-search-strategy. The rough idea: a "candidate" is a vector that was matched by the approximate LSH search, so its exact similarity is then computed against the query vector. More candidates means higher recall but slower queries.
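The interaction between bands and rows can be sketched with the standard LSH amplification formula (this is textbook LSH theory, not elastiknn's code): if a single hash function collides for a given pair of vectors with probability p, then with `rows` hashes per band (AND) and `bands` bands (OR), the pair becomes a candidate with probability 1 - (1 - p^rows)^bands.

```python
# Standard LSH amplification formula (illustrative, not elastiknn's code).
def candidate_probability(p: float, bands: int, rows: int) -> float:
    """Probability a pair becomes a candidate, given per-hash collision
    probability p, `rows` hashes per band, and `bands` bands."""
    return 1.0 - (1.0 - p ** rows) ** bands

# More bands -> higher candidate probability (recall) at more compute:
assert candidate_probability(0.5, bands=20, rows=4) > candidate_probability(0.5, bands=10, rows=4)
# More rows per band -> stricter matching, fewer candidates:
assert candidate_probability(0.5, bands=20, rows=6) < candidate_probability(0.5, bands=20, rows=4)
```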

For now you'll just have to guess and check on these parameters. I'm working on a benchmark that will report the effect of these parameters for about a dozen common datasets.


shawnchen63 commented on August 26, 2024

Alright, thanks for providing the information. So far I am not getting good enough recall at a decent query speed, but I will continue to experiment. Thanks!


alexklibisz commented on August 26, 2024

@shawnchen63 Thanks for checking out the plugin anyways :).
If you're interested, there are some performance optimizations, especially for exact KNN, in the latest release: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE18

My current priority is actually speeding up the LSH queries. It turns out the main bottleneck is the term query that identifies vectors sharing hashes with the query vector; ironically, this is exactly the part that should make it faster than an exact query. More in issue #76 and PR #78. I think there is still some pretty low-hanging fruit for optimization.

Right now my suggestion would be to generally use more shards, fewer LSH "bands", fewer LSH "rows", and more "candidates" in the query.


shawnchen63 commented on August 26, 2024

Interesting. You mentioned exact KNN, so I tested the recall when using exact search in the plugin. I got lower recall (about 10% lower) than when searching directly in ES using script_score. It could be an error in my code, but for now I don't have time to check. This is my implementation of the exact search using the client API:

...
mapping = Mapping.DenseFloat(dims=dim)
...
qvec = Vec.DenseFloat(embedding.tolist())
query = NearestNeighborsQuery.Exact(field, qvec, Similarity.L2)
...

compared to directly search using script_score in ES:

query = {
    "size": K,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "1 / (1 + l2norm(params.queryVector, 'my_vector'))",
                "params": {"queryVector": embedding.tolist()},
            },
        }
    }
}

Theoretically, there shouldn't be a difference with this method. Or am I wrong?


alexklibisz commented on August 26, 2024

> Theoretically, there shouldn't be a difference in this method. Or am I wrong?

Interesting. There shouldn't be a meaningful difference; there's actually a test suite that runs on every PR and checks exactly this.

One difference is that Elastiknn uses 1 / (l2 distance + 1e-6) for the score, instead of the 1 / (l2 distance + 1) in your script. Maybe that has some effect, in that the denominator can be < 1 in Elastiknn, whereas in your script the denominator has a lower bound of 1?
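For what it's worth, both formulas are strictly decreasing in the distance, so in exact arithmetic they rank vectors identically; a difference could only show up through floating-point rounding near ties. A quick sketch (the distances are made up; the epsilon is the one mentioned above):

```python
# Both score formulas are strictly decreasing in l2 distance, so they produce
# the same ranking in exact arithmetic; only the scale differs.
dists = [0.0, 0.3, 0.3000001, 2.0]
elastiknn_scores = [1 / (d + 1e-6) for d in dists]
script_scores = [1 / (d + 1) for d in dists]

def ranking(scores):
    # Indices sorted from highest to lowest score.
    return sorted(range(len(scores)), key=lambda i: -scores[i])

assert ranking(elastiknn_scores) == ranking(script_scores)
```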

Another thing that might cause a difference is that some vectors will have effectively equivalent similarity to the query vector. That is, l2norm(q, a) and l2norm(q, b) could be equal up to five or six decimal places, so rounding errors could change the exact order at the boundaries. For example, for k=10, maybe the correct ordering of vectors 8-11 is v99, v22, v33, v44, and Elastiknn ordered them v99, v22, v44, v33. That would lower the recall even though v33 and v44 are essentially the same distance from your query vector. A quick sanity check would be to request, say, k=15 and compute the recall as recall(ground_truth_10_neighbors, returned_15_neighbors).
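That sanity check could look something like this (the ids here are made up):

```python
def recall(ground_truth, returned):
    # Order-insensitive: a true neighbor counts if it appears anywhere
    # in the returned set.
    return len(set(ground_truth) & set(returned)) / len(ground_truth)

truth_10 = [f"v{i}" for i in range(10)]
# Suppose the plugin swaps the last two boundary ids for near-ties;
# asking for 15 results recovers the missing true neighbors.
returned_10 = truth_10[:8] + ["v33", "v44"]
returned_15 = returned_10 + truth_10[8:] + ["v55", "v66", "v77"]

assert recall(truth_10, returned_10) == 0.8
assert recall(truth_10, returned_15) == 1.0
```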

I'm curious what you find! :)


alexklibisz commented on August 26, 2024

Also, maybe there's some solid reasoning for using 1 / (l2 distance + 1)? I'm open to changing it. I suppose it bounds the score to (0, 1], which is a bit tidier than (0, 1e6].


alexklibisz commented on August 26, 2024

@shawnchen63 If you're interested in trying it out, I've added an option to the LSH queries called useMLTQuery. It uses a heuristic based on index statistics to figure out which LSH hashes are worth retrieving and which ones aren't. So if you set bands to, say, 600, it might pick the ~300 most promising ones and skip the rest. I haven't tested it extensively on real datasets yet, but it seems to run faster (query run time is basically a function of the number of hashes) and to produce only slightly lower recall for the same settings (bands, etc.).
More in the release notes: https://github.com/alexklibisz/elastiknn/releases/tag/0.1.0-PRE20
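A hypothetical sketch of the idea (not the plugin's actual code, and treating "promising" as "rarest in the index" is my assumption; the release notes have the real details):

```python
# Hypothetical sketch: given each LSH hash term's document frequency in the
# index, keep only the most selective (rarest) fraction and skip the rest.
def pick_promising(hash_doc_freqs, keep_fraction=0.5):
    ranked = sorted(hash_doc_freqs, key=hash_doc_freqs.get)  # rarest first
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

freqs = {"h1": 5, "h2": 50000, "h3": 12, "h4": 900}
assert pick_promising(freqs) == ["h1", "h3"]
```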


alexklibisz commented on August 26, 2024

Closing this as the problems discussed here are resolved.

