Comments (9)
from elastiknn.
Please tell me if I can assist in a meaningful way :)
I've been experimenting with queries a little bit.
A query like this seems to work:
{
  "_source": [
    "caption",
    "date"
  ],
  "query": {
    "bool": {
      "filter": {
        "range": {
          "date": {
            "gte": "2020-06-01T00:00:00",
            "lte": "2020-06-03T00:00:00"
          }
        }
      },
      "must": [
        {
          "elastiknn_nearest_neighbors": {
            "field": "my_vec",
            "vec": {
              "values": [
                1.12032473,
                0.17005706,
                ...
              ]
            },
            "model": "lsh",
            "similarity": "angular",
            "candidates": 100
          }
        }
      ]
    }
  },
  "size": 100
}
...but there are some things I'm not sure about:
- Is the "filter" clause executed before the elastiknn computations?
- Sometimes the query crashes with an "index_out_of_bounds" exception, but after a few consecutive queries with the same payload it just starts to work. I cannot explain this behavior.
- I am not quite sure how the "size" and "candidates" parameters interact with each other, and the number of returned hits does not look consistent with either of them.
from elastiknn.
Hi Alex,
We would LOVE this feature.
Do you have any plans to work on this in the foreseeable future?
Perhaps you could suggest a "workaround" for this in the meantime?
By the way, I did some comparisons between your plugin and the native ES dense_vector implementation, and the plugin's queries appear to be approximately 3 times faster. Great work!
from elastiknn.
I got a decent start on it this morning.
This doc makes it sound like ES might implement this automatically for each query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
But it seems that each query is responsible for implementing the filter on its own.
I'll continue later today or tomorrow.
from elastiknn.
This feedback is definitely meaningful. I have a few questions/comments:
What's the intended behavior for using a knn query in a must clause? Right now all of the knn queries return a float representing the similarity, whereas I believe a must clause should return a boolean.
Candidates should be >= size. Candidates is also only relevant for LSH queries. If we get the pre-filter working, and your filter narrows the docs down to < ~10K documents, then an exact query (i.e. not LSH) should respond in under 100ms. The LSH queries are still not performing as well as I'd like.
Someone else emailed me with a similar index_out_of_bounds exception but I wasn't able to reproduce it. They were also using the query in an unexpected (to me) way, so hopefully we can track that down.
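A minimal sketch of the "candidates should be >= size" rule, assuming the query shape from the example earlier in this thread (field `my_vec`, a `date` range filter, the `lsh` model with `angular` similarity). The helper name `build_filtered_knn_query` is hypothetical and not part of elastiknn; it only builds the request body and does not talk to Elasticsearch.

```python
from typing import Any, Dict, List


def build_filtered_knn_query(
    vec: List[float],
    date_gte: str,
    date_lte: str,
    size: int = 10,
    candidates: int = 100,
) -> Dict[str, Any]:
    """Build a bool query body: a date range filter plus an LSH knn clause.

    Hypothetical helper for illustration; field names follow the thread's example.
    """
    # Per the comment above, candidates should be at least as large as size.
    if candidates < size:
        raise ValueError(f"candidates ({candidates}) should be >= size ({size})")
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": {
                    "range": {"date": {"gte": date_gte, "lte": date_lte}}
                },
                "must": [
                    {
                        "elastiknn_nearest_neighbors": {
                            "field": "my_vec",
                            "vec": {"values": vec},
                            "model": "lsh",
                            "similarity": "angular",
                            "candidates": candidates,
                        }
                    }
                ],
            }
        },
    }
```

Calling it with `size=100, candidates=10` raises a `ValueError` before any request is sent, which makes the mismatch visible on the client side.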
from elastiknn.
> What's the intended behavior for using a knn query in a must clause?
I might be getting it wrong, but my line of thought was that everything would get evaluated to true while the "filter" clause would do its job.
> The LSH queries are still not performing as well as I'd like.
Offtopic, but I'd like to share my observations with LSH.
Ran on OpenDistro 1.8.0 which has elasticsearch-oss-7.7.0 under the hood.
"model": "lsh",
"similarity": "angular",
"bands": 99,
"rows": 1,
"width": 3
Number of vectors: 1 million
Dimensionality: 256
avg query time for 100 candidates: 749 ms
avg query time for 1000 candidates: 1076 ms
avg query time for 10000 candidates: 2417 ms

Number of vectors: 1 million
Dimensionality: 1280
avg query time for 100 candidates: 1262 ms
avg query time for 1000 candidates: 3689 ms
avg query time for 10000 candidates: 9660 ms

Number of vectors: 11.6 million
Dimensionality: 256
avg query time for 100 candidates: 16610 ms
avg query time for 1000 candidates: 19894 ms
avg query time for 10000 candidates: 29535 ms
For my checks I also tried changing the maximum ES heap size to see if it influences the query time. It does not.
from elastiknn.
Thanks for the timings. I'll focus on getting the filter part of the query working now.
A bit more on performance in general:
It makes sense that heap size doesn't affect recall or timing. The query time for exact queries is bottlenecked by raw floating point computation. Reading the vectors used to be the bottleneck, but I fixed that by using a lower-level serialization method internally.
LSH is currently bottlenecked by building the collection of docs that match a given hash value. As a data point, for a corpus of ~1m GloVe word vectors and 100 hashes per vector, it seems that there are about 200k candidates in the corpus that match >= 1 hash for every query vector. I'm doing some research to try to speed that up, either with a different algorithm or with some other method in Lucene.
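To make the bottleneck concrete, here is a simplified sketch of the "collect docs that match a hash, then keep the best candidates" step. It is an illustration in plain Python, not elastiknn's actual Lucene-based implementation; the function names and the in-memory index are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_hash_index(doc_hashes: Dict[int, Iterable[int]]) -> Dict[int, Set[int]]:
    """Map each hash value to the set of doc ids that contain it."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for doc_id, hashes in doc_hashes.items():
        for h in hashes:
            index[h].add(doc_id)
    return index


def top_candidates(
    index: Dict[int, Set[int]], query_hashes: Iterable[int], candidates: int
) -> List[Tuple[int, int]]:
    """Count how many query hashes each doc matches and keep the top `candidates`."""
    counts: Counter = Counter()
    for h in set(query_hashes):
        # Every doc sharing at least one hash with the query gets visited here.
        # With ~200k such docs per query, this loop is where the time goes.
        for doc_id in index.get(h, ()):
            counts[doc_id] += 1
    return counts.most_common(candidates)
```

Only the docs returned by `top_candidates` would then be re-scored with the exact similarity, so the cost is dominated by the counting pass rather than by the final scoring.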
Generally your recall and speed should increase as a function of segments (which is usually a function of shards). The query runs in parallel on each segment, and it considers candidates on each segment. So if you have 10 segments and candidates = 100, you'll consider 1000 candidates with parallelism 10. 5 segments would only consider 500 candidates, so it's more likely you miss some true neighbors.
A bit more about that here: http://elastiknn.klibisz.com/api/#lsh-search-strategy
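The segment arithmetic above can be written out as a tiny sketch. It assumes, as described in this comment, that each segment considers its own `candidates` docs independently; the function name is hypothetical.

```python
def candidates_considered(num_segments: int, candidates_per_segment: int) -> int:
    """Total docs considered when every segment evaluates its own candidates."""
    if num_segments < 0 or candidates_per_segment < 0:
        raise ValueError("both arguments must be non-negative")
    return num_segments * candidates_per_segment


# The two cases from the comment above, with candidates = 100.
for segments in (10, 5):
    total = candidates_considered(segments, 100)
    print(f"{segments} segments x 100 candidates -> {total} considered")
```

With fewer segments the same `candidates` setting examines fewer docs overall, which is why recall can drop unless `candidates` is raised to compensate.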
from elastiknn.
@BeardyBear I managed to get the pre-filtering working using a boolean query with a filter clause and a must clause, very similar to your example. I added a test and some documentation in #109. When you get a chance, have a look at the docs, specifically here: https://github.com/alexklibisz/elastiknn/blob/f114b1348e99dea5c3c33a23f65fa23f84f6cbc3/docs/pages/api.md#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
Once CI is passing I'll merge that PR and that will update the docs on the site.
Docs are updated: http://elastiknn.klibisz.com/api/#running-nearest-neighbors-query-on-a-filtered-subset-of-documents
from elastiknn.
I'll go ahead and close this out but feel free to re-open or just respond if it doesn't work for you @BeardyBear . Also, thanks for your feedback.
from elastiknn.