qdrant / vector-db-benchmark

Framework for benchmarking vector search engines

Home Page: https://qdrant.tech/benchmarks/

License: Apache License 2.0

Python 75.80% Shell 17.80% Dockerfile 0.39% Jupyter Notebook 6.00%
benchmark vector-search vector-search-engine vector-database

vector-db-benchmark's Introduction

vector-db-benchmark


View results

There are various vector search engines available, and each of them may offer a different feature set and different efficiency. But how do we measure performance? There is no single definition: in a specific use case you may care a lot about one aspect while paying little attention to others. This project is a general framework for benchmarking different engines under the same hardware constraints, so you can choose what works best for you.

Running any benchmark requires choosing an engine and a dataset, and defining the scenario against which the engine should be tested. A specific scenario may assume running the server in single-node or distributed mode, a different client implementation, and a particular number of client instances.

How to run a benchmark?

Benchmarks are implemented in server-client mode: the server runs on one machine and the client on another.

Run the server

All engines are served using docker compose. The configuration files are in the engine/servers directory.

To launch the server instance, run the following command:

cd ./engine/servers/<engine-configuration-name>
docker compose up

Containers are expected to expose all necessary ports, so the client can connect to them.

Run the client

Install dependencies:

pip install poetry
poetry install

Run the benchmark:

Usage: run.py [OPTIONS]

  Example: python3 -m run --engines *-m-16-* --datasets glove-*

Options:
  --engines TEXT                  [default: *]
  --datasets TEXT                 [default: *]
  --host TEXT                     [default: localhost]
  --skip-upload / --no-skip-upload
                                  [default: no-skip-upload]
  --install-completion            Install completion for the current shell.
  --show-completion               Show completion for the current shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.

The command allows you to specify wildcards for engines and datasets. Results of the benchmarks are stored in the ./results/ directory.

How to update benchmark parameters?

Each engine has a configuration file, which is used to define the parameters for the benchmark. Configuration files are located in the configuration directory.

Each step in the benchmark process uses a dedicated section of the configuration:

  • connection_params - passed to the client during the connection phase.
  • collection_params - parameters used to create the collection; indexing parameters are usually defined here.
  • upload_params - parameters used to upload the data to the server.
  • search_params - passed to the client during the search phase. The framework allows multiple search configurations for the same experiment run.

Exact values of the parameters are individual for each engine.
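For illustration, here is a hedged sketch of a single experiment entry combining all four sections; the values are modeled on the Qdrant configs quoted later on this page, and the exact keys vary by engine:

{
  "name": "qdrant-m-16-ef-128",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 16, "ef_construct": 128 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 128 } }
  ],
  "upload_params": { "parallel": 16 }
}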

How to register a dataset?

Datasets are configured in the datasets/datasets.json file. The framework will automatically download the dataset and store it in the datasets directory.
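A hedged example entry, with field names taken from the Huggingface dataset spec proposed later on this page (the exact value of type for HDF5 files is an assumption):

{
  "name": "glove-100-angular",
  "vector_size": 100,
  "distance": "cosine",
  "type": "h5",
  "path": "glove-100-angular/glove-100-angular.hdf5",
  "link": "http://ann-benchmarks.com/glove-100-angular.hdf5"
}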

How to implement a new engine?

There are a few base classes that you can use to implement a new engine.

  • BaseConfigurator - defines methods to create collections and set up indexing parameters.
  • BaseUploader - defines methods to upload the data to the server.
  • BaseSearcher - defines methods to search the data.

See the examples in the clients directory.

Once all the necessary classes are implemented, you can register the engine in the ClientFactory.
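A minimal sketch of the three classes, assuming the base classes are importable from engine.base_client (the module paths visible in the tracebacks on this page) and method names inferred from those same tracebacks (recreate, upload_batch, search_one); the actual signatures may differ by version:

from engine.base_client import BaseConfigurator, BaseSearcher, BaseUploader


class MyEngineConfigurator(BaseConfigurator):
    def recreate(self, dataset, collection_params):
        # Drop and re-create the collection, applying the indexing
        # parameters from collection_params.
        ...


class MyEngineUploader(BaseUploader):
    @classmethod
    def upload_batch(cls, ids, vectors, metadata):
        # Insert a single batch of vectors with optional payloads.
        ...


class MyEngineSearcher(BaseSearcher):
    @classmethod
    def search_one(cls, query, top):
        # Return the ids and scores of the top nearest neighbours
        # for a single query vector.
        ...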

vector-db-benchmark's People

Contributors

ankane, eltociear, filipecosta90, generall, joein, kacperlukawski, kshivendu, marevol, mjmbischoff, pre-commit-ci[bot], qbx2, tellet-q, timvisee, trengrj, tsmith023, weaviate-git-bot


vector-db-benchmark's Issues

Implement remote backend

SSH should be used as another backend, so that the services can be launched on separate machines. It should make sure all the clients can see the server. It is possible to use Fabric to simplify running the operations.
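A minimal sketch of what a Fabric-based launcher could look like; the host names and remote paths are illustrative assumptions, not part of the current codebase:

from fabric import Connection

# Hypothetical hosts; in practice these would come from the CLI or a config.
server = Connection("benchmark-server")
client = Connection("benchmark-client")

# Launch the engine on the remote server machine.
server.run("cd vector-db-benchmark/engine/servers/qdrant-single-node && docker compose up -d")

# Run the benchmark from the remote client machine against the server.
client.run("cd vector-db-benchmark && python3 -m run --engines 'qdrant-*' --host benchmark-server")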

missing field `vector_size` during run qdrant benchmark

Using qdrant-client 0.11.1, I cannot run the Qdrant benchmark; an exception happens during recreate_collection.
The code is in vector-db-benchmark/engine/clients/qdrant/configure.py:

def recreate(
    self,
    distance,
    vector_size,
    collection_params,
):
    print("distance {}, vector_size {}, collection_params {}".format(distance, vector_size, collection_params))
    self.client.recreate_collection(
        collection_name=QDRANT_COLLECTION_NAME,
        vectors_config=rest.VectorParams(size=vector_size, distance=self.DISTANCE_MAPPING.get(distance)),
        # vector_size=vector_size,
        # distance=self.DISTANCE_MAPPING.get(distance),
        **self.collection_params
    )

This is my exception:

distance cosine, vector_size 25, collection_params {'optimizers_config': {'memmap_threshold': 10000000}}
Traceback (most recent call last):

  File "run.py", line 56, in <module>
    app()

  File "run.py", line 49, in run
    client.run_experiment(dataset, skip_upload)

  File "/home/ubuntu/workspaceBench/vector-db-benchmark/engine/base_client/client.py", line 86, in run_experiment
    self.configurator.configure(

  File "/home/ubuntu/workspaceBench/vector-db-benchmark/engine/base_client/configure.py", line 20, in configure
    return self.recreate(distance, vector_size, self.collection_params) or {}

  File "/home/ubuntu/workspaceBench/vector-db-benchmark/engine/clients/qdrant/configure.py", line 31, in recreate
    self.client.recreate_collection(

  File "/home/ubuntu/.local/lib/python3.8/site-packages/qdrant_client/qdrant_client.py", line 1191, in recreate_collection
    self.http.collections_api.create_collection(

  File "/home/ubuntu/.local/lib/python3.8/site-packages/qdrant_client/http/api/collections_api.py", line 618, in create_collection
    return self._build_for_create_collection(

  File "/home/ubuntu/.local/lib/python3.8/site-packages/qdrant_client/http/api/collections_api.py", line 193, in _build_for_create_collection
    return self.api_client.request(

  File "/home/ubuntu/.local/lib/python3.8/site-packages/qdrant_client/http/api_client.py", line 68, in request
    return self.send(request, type_)

  File "/home/ubuntu/.local/lib/python3.8/site-packages/qdrant_client/http/api_client.py", line 91, in send
    raise UnexpectedResponse.for_response(response)

qdrant_client.http.exceptions.UnexpectedResponse: Unexpected Response: 422 (Unprocessable Entity)
Raw response content:
b'{"result":null,"status":{"error":"Json deserialize error: missing field `vector_size` at line 1 column 224"},"time":0.0}'

Is there any solution to this problem? Thank you very much.

Qdrant entries in pgvector configuration

Please see the following snippet for the pgvector configuration:

{
  "name": "qdrant-m-32-ef-128",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 32, "ef_construct": 128 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 64 } }, { "parallel": 1, "search_params": { "hnsw_ef": 128 } }, { "parallel": 1, "search_params": { "hnsw_ef": 256 } }, { "parallel": 1, "search_params": { "hnsw_ef": 512 } },
    { "parallel": 100, "search_params": { "hnsw_ef": 64 } }, { "parallel": 100, "search_params": { "hnsw_ef": 128 } }, { "parallel": 100, "search_params": { "hnsw_ef": 256 } }, { "parallel": 100, "search_params": { "hnsw_ef": 512 } }
  ],
  "upload_params": { "parallel": 16 }
},
{
  "name": "qdrant-m-32-ef-256",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 32, "ef_construct": 256 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 64 } }, { "parallel": 1, "search_params": { "hnsw_ef": 128 } }, { "parallel": 1, "search_params": { "hnsw_ef": 256 } }, { "parallel": 1, "search_params": { "hnsw_ef": 512 } },
    { "parallel": 100, "search_params": { "hnsw_ef": 64 } }, { "parallel": 100, "search_params": { "hnsw_ef": 128 } }, { "parallel": 100, "search_params": { "hnsw_ef": 256 } }, { "parallel": 100, "search_params": { "hnsw_ef": 512 } }
  ],
  "upload_params": { "parallel": 16 }
},
{
  "name": "qdrant-m-32-ef-512",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 32, "ef_construct": 512 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 64 } }, { "parallel": 1, "search_params": { "hnsw_ef": 128 } }, { "parallel": 1, "search_params": { "hnsw_ef": 256 } }, { "parallel": 1, "search_params": { "hnsw_ef": 512 } },
    { "parallel": 100, "search_params": { "hnsw_ef": 64 } }, { "parallel": 100, "search_params": { "hnsw_ef": 128 } }, { "parallel": 100, "search_params": { "hnsw_ef": 256 } }, { "parallel": 100, "search_params": { "hnsw_ef": 512 } }
  ],
  "upload_params": { "parallel": 16 }
},
{
  "name": "qdrant-m-64-ef-256",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 64, "ef_construct": 256 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 64 } }, { "parallel": 1, "search_params": { "hnsw_ef": 128 } }, { "parallel": 1, "search_params": { "hnsw_ef": 256 } }, { "parallel": 1, "search_params": { "hnsw_ef": 512 } },
    { "parallel": 100, "search_params": { "hnsw_ef": 64 } }, { "parallel": 100, "search_params": { "hnsw_ef": 128 } }, { "parallel": 100, "search_params": { "hnsw_ef": 256 } }, { "parallel": 100, "search_params": { "hnsw_ef": 512 } }
  ],
  "upload_params": { "parallel": 16 }
},
{
  "name": "qdrant-m-64-ef-512",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 64, "ef_construct": 512 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 64 } }, { "parallel": 1, "search_params": { "hnsw_ef": 128 } }, { "parallel": 1, "search_params": { "hnsw_ef": 256 } }, { "parallel": 1, "search_params": { "hnsw_ef": 512 } },
    { "parallel": 100, "search_params": { "hnsw_ef": 64 } }, { "parallel": 100, "search_params": { "hnsw_ef": 128 } }, { "parallel": 100, "search_params": { "hnsw_ef": 256 } }, { "parallel": 100, "search_params": { "hnsw_ef": 512 } }
  ],
  "upload_params": { "parallel": 16 }
}

It contains entries for Qdrant itself. I think that we should either remove them or update them to be for pgvector.

Elastic vector limit should be 4096 instead of 2048

Weaviate engine support

Weaviate should be implemented as another engine for benchmarking. It should provide a client script exposing all the methods for different operations.

Download converted datasets

We convert raw datasets each time we launch the benchmark on a new machine, which is a time-consuming operation.
A more efficient way is to convert each raw dataset once and store it in the cloud, so that subsequent runs download the prepared dataset.
Converting should only occur as a fallback, when there is no URL to a ready dataset.
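A rough sketch of the intended fallback logic; all helper and attribute names here are hypothetical:

def ensure_dataset(dataset):
    # Prefer the pre-converted snapshot from cloud storage; convert the
    # raw dataset locally only when no prepared URL is registered.
    if dataset.converted_url:  # hypothetical attribute
        download(dataset.converted_url, dataset.local_path)
    else:
        raw_path = download(dataset.raw_url, tmp_dir())
        convert_to_prepared_format(raw_path, dataset.local_path)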

Progress bar for long running operations

We have no observability during loading or searching operations, and no sense of whether everything is going as intended.

However, it might be a tough issue, since it requires looking at the client's logs in real time.

Add milvus

Milvus is tricky to launch in the current project state, since it requires 3 running containers for serving.

Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object

I have launched the server using the docker image, and the vector DB launched fine.
On the same machine, in another session, I am running the vector benchmark. I am using Python with pip 3.10.

[root@9049fa05600b ~]# docker run -p 6333:6333 qdrant/qdrant
(Qdrant ASCII-art banner)

Access web UI at http://localhost:6333/dashboard

[2023-07-18T22:34:56.204Z INFO storage::content_manager::consensus::persistent] Initializing new raft state at ./storage/raft_state
[2023-07-18T22:34:56.236Z INFO qdrant] Distributed mode disabled
[2023-07-18T22:34:56.236Z INFO qdrant] Telemetry reporting enabled, id: a96d4b56-dba5-4f4c-9332-04ab9c9033eb
[2023-07-18T22:34:56.238Z INFO qdrant::tonic] Qdrant gRPC listening on 6334
[2023-07-18T22:34:56.238Z INFO qdrant::tonic] TLS disabled for gRPC API
[2023-07-18T22:34:56.251Z INFO qdrant::actix] TLS disabled for REST API
[2023-07-18T22:34:56.251Z INFO qdrant::actix] Qdrant HTTP listening on 6333
[2023-07-18T22:34:56.251Z INFO actix_server::builder] Starting 7 workers
[2023-07-18T22:34:56.251Z INFO actix_server::server] Actix runtime found; starting in Actix runtime
[2023-07-18T22:35:04.526Z INFO actix_web::middleware::logger] 172.17.0.1 "DELETE /collections/benchmark HTTP/1.1" 200 72 "-" "python-httpx/0.24.1" 0.000457
[2023-07-18T22:35:04.530Z INFO actix_web::middleware::logger] 172.17.0.1 "DELETE /collections/benchmark HTTP/1.1" 200 69 "-" "python-httpx/0.24.1" 0.000119
[2023-07-18T22:35:05.562Z INFO actix_web::middleware::logger] 172.17.0.1 "PUT /collections/benchmark HTTP/1.1" 200 71 "-" "python-httpx/0.24.1" 1.031083

When I am running benchmark test getting below error :

[root@9049fa05600b vector-db-benchmark]# python3.10 -m run --engines *-m-16-* --datasets glove-*
Running experiment: qdrant-mmap-m-16-ef-128 - glove-25-angular
Downloading http://ann-benchmarks.com/glove-25-angular.hdf5...
Moving: /tmp/tmp6q2343mx -> /root/vectordb/vector-db-benchmark/datasets/glove-25-angular/glove-25-angular.hdf5
Experiment stage: Configure
Experiment stage: Upload
1343it [00:00, 18694.49it/s]
Experiment qdrant-mmap-m-16-ef-128 - glove-25-angular interrupted
Traceback (most recent call last):
  File "/root/vectordb/vector-db-benchmark/run.py", line 52, in run
    client.run_experiment(dataset, skip_upload, skip_search)
  File "/root/vectordb/vector-db-benchmark/engine/base_client/client.py", line 70, in run_experiment
    upload_stats = self.uploader.upload(
  File "/root/vectordb/vector-db-benchmark/engine/base_client/upload.py", line 56, in upload
    latencies = list(
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f5fc488fe20>'. Reason: 'TypeError("cannot pickle '_thread.RLock' object")'
Traceback (most recent call last):

  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,

  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)

  File "/root/vectordb/vector-db-benchmark/run.py", line 79, in <module>
    app()

  File "/root/vectordb/vector-db-benchmark/run.py", line 74, in run
    raise e

  File "/root/vectordb/vector-db-benchmark/run.py", line 52, in run
    client.run_experiment(dataset, skip_upload, skip_search)

  File "/root/vectordb/vector-db-benchmark/engine/base_client/client.py", line 70, in run_experiment
    upload_stats = self.uploader.upload(

  File "/root/vectordb/vector-db-benchmark/engine/base_client/upload.py", line 56, in upload
    latencies = list(

  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f5fc488fe20>'. Reason: 'TypeError("cannot pickle '_thread.RLock' object")'

Also, there is no content generated inside the results dir.
Any pointers on how to resolve this issue?
Thanks.

How to embed with batches?

I am trying to do text embedding with the pipeline. How can I improve the speed with a batch setting (or by parallelizing the data collection mapping)?
The current pipe is much slower than feeding the sbert encoder a list of texts directly.


from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = ['this is a document', 'this is another document', 'this is a third document'] * 1000

# with sbert directly
embeddings = model.encode(texts, batch_size=128, show_progress_bar=True)

# with towhee pipe
from towhee import pipe, ops

text_embedding = (
    pipe.input('text')
    .map('text', 'embedding', ops.sentence_embedding.transformers(model_name='all-MiniLM-L6-v2'))
    .output('text', 'embedding')
)

res = text_embedding.batch(texts)

Standardize format of search params in engine configs

Search params across different engines are structured very differently.

In Qdrant configs, we have:

{ "parallel": 1, "search_params": { "hnsw_ef": 64 } }, { "parallel": 1, "search_params": { "hnsw_ef": 128 } }, { "parallel": 1, "search_params": { "hnsw_ef": 256 } }, { "parallel": 1, "search_params": { "hnsw_ef": 512 } },

In Elasticsearch, we have (no nesting):

{ "parallel": 1, "num_candidates": 64 }, { "parallel": 1, "num_candidates": 128 }, { "parallel": 1, "num_candidates": 256 }, { "parallel": 1, "num_candidates": 256 },

In Milvus config, we have:

{ "parallel": 1, "params": { "ef": 128 } }, { "parallel": 1, "params": { "ef": 256 } }, { "parallel": 1, "params": { "ef": 512 } }

In Weaviate configs, we have:

{ "parallel": 1, "vectorIndexConfig": { "ef": 64} }, { "parallel": 1, "vectorIndexConfig": { "ef": 128} }, { "parallel": 1, "vectorIndexConfig": { "ef": 256} }, { "parallel": 1, "vectorIndexConfig": { "ef": 512} },

The parallel field will remain outside, but the rest needs to be passed to the engine, and it should have the same name across engines (I propose params). We need to refactor the code accordingly.
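Under the proposed convention, the four variants above would all take the same shape, e.g.:

{ "parallel": 1, "params": { "hnsw_ef": 64 } },
{ "parallel": 1, "params": { "num_candidates": 64 } },
{ "parallel": 1, "params": { "ef": 64 } }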

Elastic client timeout should be configurable.

Here's a sample traceback for 504 Gateway Timeout server errors on the Elastic client, when the config/vector size leads to longer merge operations.
#103 adds a way of fixing/avoiding this issue.
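A hedged sketch of making the timeout configurable with elasticsearch-py; the ELASTIC_TIMEOUT variable name is an assumption, and #103 contains the actual change:

import os

from elasticsearch import Elasticsearch

# Hypothetical env-based override; defaults to 5 minutes.
TIMEOUT = int(os.getenv("ELASTIC_TIMEOUT", "300"))

client = Elasticsearch("http://localhost:9200", request_timeout=TIMEOUT)

# Long-running calls such as forcemerge can also get a per-request timeout:
client.options(request_timeout=TIMEOUT).indices.forcemerge(
    index="bench", max_num_segments=1, wait_for_completion=True
)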

Experiment elasticsearch-m-32-ef-256 - dbpedia-openai-1M-1536-angular interrupted
Traceback (most recent call last):
  File "/root/vector-db-benchmark/run.py", line 54, in run
    client.run_experiment(
  File "/root/vector-db-benchmark/engine/base_client/client.py", line 109, in run_experiment
    upload_stats = self.uploader.upload(
  File "/root/vector-db-benchmark/engine/base_client/upload.py", line 70, in upload
    post_upload_stats = self.post_upload(distance)
  File "/root/vector-db-benchmark/engine/clients/elasticsearch/upload.py", line 55, in post_upload
    cls.client.indices.forcemerge(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/indices.py", line 1572, in forcemerge
    return self.perform_request(  # type: ignore[return-value]
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 389, in perform_request
    return self._client.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.ApiError: ApiError(504, '{\'ok\': False, \'message\': \'Post "https://172.18.128.211:18270/bench/_forcemerge?max_num_segments=1&wait_for_completion=true": net/http: timeout awaiting response headers\'}')
Traceback (most recent call last):

  File "/root/vector-db-benchmark/run.py", line 84, in <module>
    app()

  File "/root/vector-db-benchmark/run.py", line 79, in run
    raise e

  File "/root/vector-db-benchmark/run.py", line 54, in run
    client.run_experiment(

  File "/root/vector-db-benchmark/engine/base_client/client.py", line 109, in run_experiment
    upload_stats = self.uploader.upload(

  File "/root/vector-db-benchmark/engine/base_client/upload.py", line 70, in upload
    post_upload_stats = self.post_upload(distance)

  File "/root/vector-db-benchmark/engine/clients/elasticsearch/upload.py", line 55, in post_upload
    cls.client.indices.forcemerge(

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/indices.py", line 1572, in forcemerge
    return self.perform_request(  # type: ignore[return-value]

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 389, in perform_request
    return self._client.perform_request(

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(

elasticsearch.ApiError: ApiError(504, '{\'ok\': False, \'message\': \'Post "https://172.18.128.211:18270/bench/_forcemerge?max_num_segments=1&wait_for_completion=true": net/http: timeout awaiting response headers\'}')

Here's another example, on control-plane operations (index creation):

Experiment stage: Configure
Experiment elasticsearch-m-32-ef-128 - deep-image-96-angular interrupted
Traceback (most recent call last):
  File "/root/vector-db-benchmark/run.py", line 54, in run
    client.run_experiment(
  File "/root/vector-db-benchmark/engine/base_client/client.py", line 106, in run_experiment
    self.configurator.configure(dataset)
  File "/root/vector-db-benchmark/engine/base_client/configure.py", line 22, in configure
    return self.recreate(dataset, self.collection_params) or {}
  File "/root/vector-db-benchmark/engine/clients/elasticsearch/configure.py", line 40, in recreate
    self.client.indices.create(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/indices.py", line 509, in create
    return self.perform_request(  # type: ignore[return-value]
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 389, in perform_request
    return self._client.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
elasticsearch.ApiError: ApiError(503, 'process_cluster_event_timeout_exception', 'failed to process cluster event (create-index [bench], cause [api]) within 30s')
Traceback (most recent call last):

  File "/root/vector-db-benchmark/run.py", line 84, in <module>
    app()

  File "/root/vector-db-benchmark/run.py", line 79, in run
    raise e

  File "/root/vector-db-benchmark/run.py", line 54, in run
    client.run_experiment(

  File "/root/vector-db-benchmark/engine/base_client/client.py", line 106, in run_experiment
    self.configurator.configure(dataset)

  File "/root/vector-db-benchmark/engine/base_client/configure.py", line 22, in configure
    return self.recreate(dataset, self.collection_params) or {}

  File "/root/vector-db-benchmark/engine/clients/elasticsearch/configure.py", line 40, in recreate
    self.client.indices.create(

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/indices.py", line 509, in create
    return self.perform_request(  # type: ignore[return-value]

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 389, in perform_request
    return self._client.perform_request(

  File "/usr/local/lib/python3.10/dist-packages/elasticsearch/_sync/client/_base.py", line 320, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(

elasticsearch.ApiError: ApiError(503, 'process_cluster_event_timeout_exception', 'failed to process cluster event (create-index [bench], cause [api]) within 30s')

Collect results into files

Currently, we collect results by parsing the stdout of the clients' containers and later look at them in the console.

In the case of a remote client, all of the client containers' stdout must be sent to the first machine (the one on which we run main.py) for further parsing, which can be expensive.
It also requires regular expressions to parse the logs, which will either become tricky at some point or multiply into several expressions.
For further analysis we still need to write the results to files anyway, so let's do that earlier.

If we want to write results into files on the client's machine, then we need a place to which we can write these files.
For any complicated analysis we may also need to transfer these files manually to the desired machine.

OpenSearch search run should handle rate-limiting / 429 HTTP errors

Ideally, we should handle it and retry or fall back.
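A minimal sketch of a retry wrapper using the backoff library (which the Milvus client on this page already uses); the exact integration point in the OpenSearch client is an assumption:

import backoff
from opensearchpy.exceptions import TransportError


def _give_up(err: Exception) -> bool:
    # Retry only on HTTP 429 (rate limiting); fail fast on anything else.
    return not (isinstance(err, TransportError) and err.status_code == 429)


@backoff.on_exception(backoff.expo, TransportError, giveup=_give_up, max_time=120)
def search_with_retry(client, index, body):
    return client.search(index=index, body=body)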

Sample error at query time:

6038it [01:17, 77.75it/s]
Experiment opensearch-m-16-ef-128 - glove-100-angular interrupted
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/root/vector-db-benchmark/engine/base_client/search.py", line 46, in _search_one
    search_res = cls.search_one(query, top)
  File "/root/vector-db-benchmark/engine/clients/opensearch/search.py", line 52, in search_one
    res = cls.client.search(
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/client/utils.py", line 181, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/client/__init__.py", line 1742, in search
    return self.transport.perform_request(
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/transport.py", line 448, in perform_request
    raise e
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/transport.py", line 409, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/connection/http_urllib3.py", line 290, in perform_request
    self._raise_error(
  File "/usr/local/lib/python3.8/dist-packages/opensearchpy/connection/base.py", line 316, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
opensearchpy.exceptions.TransportError: TransportError(429, '429 Too Many Requests /bench/_search')
"""

Sample error at ingestion:

opensearchpy.exceptions.TransportError: TransportError(429, '429 Too Many Requests /bench/_bulk')
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/root/vector-db-benchmark/engine/base_client/upload.py", line 89, in _upload_batch
    cls.upload_batch(batch)
  File "/root/vector-db-benchmark/engine/clients/opensearch/upload.py", line 43, in upload_batch
    cls.client.bulk(
  File "/usr/local/lib/python3.10/dist-packages/opensearchpy/client/utils.py", line 181, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/opensearchpy/client/__init__.py", line 462, in bulk
    return self.transport.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/opensearchpy/transport.py", line 448, in perform_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/opensearchpy/transport.py", line 409, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.10/dist-packages/opensearchpy/connection/http_urllib3.py", line 290, in perform_request
    self._raise_error(
  File "/usr/local/lib/python3.10/dist-packages/opensearchpy/connection/base.py", line 316, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
opensearchpy.exceptions.TransportError: TransportError(429, '429 Too Many Requests /bench/_bulk')
"""


The above exception was the direct cause of the following exception:


Traceback (most recent call last):

  File "/root/vector-db-benchmark/run.py", line 91, in <module>
    app()

  File "/root/vector-db-benchmark/run.py", line 86, in run
    raise e

  File "/root/vector-db-benchmark/run.py", line 59, in run
    client.run_experiment(

  File "/root/vector-db-benchmark/engine/base_client/client.py", line 108, in run_experiment
    upload_stats = self.uploader.upload(

  File "/root/vector-db-benchmark/engine/base_client/upload.py", line 56, in upload
    latencies = list(

  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value

opensearchpy.exceptions.TransportError: TransportError(429, '429 Too Many Requests /bench/_bulk')

Automate testing of PRs across different engines

Whenever we get external contributions, we have to test the repo ourselves to ensure that the code doesn't break. This can sometimes take a long time because it's done manually and we have limited bandwidth.

It would be great if we could automate this with GitHub Actions (plus our benchmarking servers, if required).

Add recall metric

We calculate precision, but for higher values of K it's more valuable to optimize for recall. Therefore, we should measure it.
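For reference, a minimal sketch of recall@K against ground-truth neighbors; the names are illustrative, not the framework's API:

def recall_at_k(ground_truth_ids, retrieved_ids, k):
    # Fraction of the true top-k neighbors found among the k results.
    truth = set(ground_truth_ids[:k])
    found = set(retrieved_ids[:k])
    return len(truth & found) / k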

ann-filter-datasets arxiv-titles-384-angular doesn't support milvus

When I use vector-db-benchmark to test Milvus filter performance, it raises the exception below.
The dataset is arxiv-titles-384-angular.

But I see that Qdrant did complete the filter testing of Milvus; how can I reproduce that result?

❯ python run.py
Running experiment: milvus-default - arxiv-titles-384-angular-filters
established connection
/Users/mochix/workspace_mqdb_github/vector-db-benchmark/datasets/arxiv-titles-384-angular/arxiv already exists
Experiment stage: Configure
Experiment stage: Upload
E0317 16:08:15.044200000 4343137664 fork_posix.cc:76]                  Other threads are currently calling into gRPC, skipping fork() handlers
191it [00:05, 37.15it/s]
Experiment milvus-default - arxiv-titles-384-angular-filters interrupted
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/mochix/opt/anaconda3/envs/python_sdk_local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/mochix/workspace_mqdb_github/vector-db-benchmark/engine/base_client/upload.py", line 86, in _upload_batch
    cls.upload_batch(ids, vectors, metadata)
  File "/Users/mochix/workspace_mqdb_github/vector-db-benchmark/engine/clients/milvus/upload.py", line 58, in upload_batch
    cls.collection.insert([ids, vectors] + field_values)
  File "/Users/mochix/opt/anaconda3/envs/python_sdk_local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 544, in insert
    if not self._check_insert_data_schema(data):
  File "/Users/mochix/opt/anaconda3/envs/python_sdk_local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 174, in _check_insert_data_schema
    infer_fields = parse_fields_from_data(data)
  File "/Users/mochix/opt/anaconda3/envs/python_sdk_local/lib/python3.8/site-packages/pymilvus/orm/schema.py", line 300, in parse_fields_from_data
    fields.append(FieldSchema("", d_type))
  File "/Users/mochix/opt/anaconda3/envs/python_sdk_local/lib/python3.8/site-packages/pymilvus/orm/schema.py", line 175, in __init__
    raise DataTypeNotSupportException(0, ExceptionsMessage.FieldDtype)
pymilvus.exceptions.DataTypeNotSupportException: <DataTypeNotSupportException: (code=0, message=Field dtype must be of DataType)>
"""


Benchmark Analysis with Various Datasets

Hi, I was trying to benchmark Qdrant with different datasets; however, the script is not running for MNIST, SIFT, or NYTimes.
Are there any changes to be made to run these datasets? If so, please mention them.

Collect clients STDERR

It is tricky to debug errors in client containers.
We remove containers right after they are done, hence we can't observe their logs.
LogCollector also does not collect tracebacks, only metrics matched by regexps.

Vespa.ai support

Unless there's a specific reason it was excluded, I'd be curious about the performance of Vespa.ai.

Dataset format

We convert any file with vectors into .jsonl.
This is expensive in both time and final file size.

Payload and neighbors are not supported yet.

To resolve this issue we need:

  • support payload
  • support neighbors
  • define file format

Currently, we consider two approaches:

  • Store vectors in numpy format
  • Store payload and neighbors as .jsonl

OR

  • Store everything in the same format, e.g. Apache Arrow

running benchmark on x86

Hi ,
I am using python3 -m run --engines *-m-16-* --datasets glove-* to run the benchmark on x86. Please let me know how much time the benchmark might take to complete.

Thanks.

Standardize all `*-default` configs and add `*-debug` with parallel = 1 for easy debugging.

This issue covers two tasks:

  • Many users try *-default as their starting point, but the default config currently differs across engines. We should make it the same for all engines, say m=16, ef=128.
  • It's easier to debug vector-db-benchmark with a debugger when using parallel=1, so we should add a new config for all engines with parallel=1 in upload as well as search (a sketch follows below).
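For example, a hedged *-debug entry for Qdrant under this proposal:

{
  "name": "qdrant-debug",
  "engine": "qdrant",
  "connection_params": {},
  "collection_params": {
    "hnsw_config": { "m": 16, "ef_construct": 128 }
  },
  "search_params": [
    { "parallel": 1, "search_params": { "hnsw_ef": 128 } }
  ],
  "upload_params": { "parallel": 1 }
}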

Process leak between experiments

Hello. I noticed a process leak while continuously monitoring processes with this command: while true ; do ps aux | grep python ; sleep 1 ; done. The process pool keeps growing and, at a certain point, the benchmark script becomes unresponsive and eventually times out. This issue might affect the benchmark results, as the leaked processes consume a significant amount of CPU time.

How to reproduce filtered search benchmark

Hi,
I didn't understand exactly what the metrics in the filtered search benchmark were. How did you measure precision in these benchmarks? What are the differences between the "regular" and "filtered" search?

Also, is there a way to reproduce these benchmarks, as there is for the "pure" search benchmarks? I couldn't find the configuration files for the filtered benchmark.

I would appreciate any help, thanks!

Support pulling embedding from any Huggingface dataset

It would be nice if we could support pulling embeddings from any Huggingface dataset. This would make the project even more useful for external users :)

The spec could look like this:

{
    "name": "SciPhi/AgentSearch-V1",
    "vector_size": 100,
    "distance": "cosine",
    "type": "huggingface",
    "path": "glove-100-angular/glove-100-angular.hdf5",
    "link": "https://huggingface.co/datasets/SciPhi/AgentSearch-V1",
    "schema": {
      "vector_field": "openai",
      "payload": {
        "url": "text"
      }
    }
}

Needs some discussion before implementing

multiTenancyConfig property on weaviate schema requires updating the underlying weaviate-client

sample traceback:

Traceback (most recent call last):
  File "/root/vector-db-benchmark/run.py", line 52, in run
    client.run_experiment(dataset, skip_upload, skip_search)
  File "/root/vector-db-benchmark/engine/base_client/client.py", line 92, in run_experiment
    search_stats = searcher.search_all(
  File "/root/vector-db-benchmark/engine/base_client/search.py", line 70, in search_all
    self.setup_search()
  File "/root/vector-db-benchmark/engine/clients/weaviate/search.py", line 59, in setup_search
    self.client.schema.update_config(WEAVIATE_CLASS_NAME, self.search_params)
  File "/usr/local/lib/python3.9/dist-packages/weaviate/schema/crud_schema.py", line 389, in update_config
    check_class(new_class_schema)
  File "/usr/local/lib/python3.9/dist-packages/weaviate/schema/validate_schema.py", line 85, in check_class
    raise SchemaValidationException(f'"{key}" is not a known class definition key.')
weaviate.exceptions.SchemaValidationException: "multiTenancyConfig" is not a known class definition key.
Traceback (most recent call last):

  File "/root/vector-db-benchmark/run.py", line 79, in <module>
    app()

  File "/root/vector-db-benchmark/run.py", line 74, in run
    raise e

  File "/root/vector-db-benchmark/run.py", line 52, in run
    client.run_experiment(dataset, skip_upload, skip_search)

  File "/root/vector-db-benchmark/engine/base_client/client.py", line 92, in run_experiment
    search_stats = searcher.search_all(

  File "/root/vector-db-benchmark/engine/base_client/search.py", line 70, in search_all
    self.setup_search()

  File "/root/vector-db-benchmark/engine/clients/weaviate/search.py", line 59, in setup_search
    self.client.schema.update_config(WEAVIATE_CLASS_NAME, self.search_params)

  File "/usr/local/lib/python3.9/dist-packages/weaviate/schema/crud_schema.py", line 389, in update_config
    check_class(new_class_schema)

  File "/usr/local/lib/python3.9/dist-packages/weaviate/schema/validate_schema.py", line 85, in check_class
    raise SchemaValidationException(f'"{key}" is not a known class definition key.')

weaviate.exceptions.SchemaValidationException: "multiTenancyConfig" is not a known class definition key.

A PR on the weaviate Python client includes a fix for it: https://github.com/weaviate/weaviate-python-client/pull/345/files#diff-483705e2bda4efc3baf0ca6031a1a949adf864e7ae89813e9139f2843250fd12R19

This means the client needs to be updated to https://github.com/weaviate/weaviate-python-client/releases/tag/v3.22.0

`max_optimization_threads: 0` doesn't disable indexing

While running vector-db-benchmark, I've noticed that:

  • we are updating the collection with max_optimization_threads: 0 before uploading points
  • and then once again with max_optimization_threads: 1 after upload is finished
  • I assume this is done to disable optimization/indexing during points upload
  • but I've noticed that max_optimization_threads: 0 doesn't disable indexing
    • I've disabled the max_optimization_threads: 1 request and had a collection fully indexed with max_optimization_threads: 0

If max_optimization_threads was intended to disable/enable indexing, then vector-db-benchmark can be updated to use indexing_threshold: 0 instead.
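A hedged sketch of the suggested change, expressed against Qdrant's REST API (PATCH /collections/{name}) via httpx, which the logs on this page show the benchmark already uses; the collection name and threshold value are illustrative:

import httpx

# Disable indexing before upload by setting indexing_threshold to 0 ...
httpx.patch(
    "http://localhost:6333/collections/benchmark",
    json={"optimizers_config": {"indexing_threshold": 0}},
)

# ... upload points ...

# ... then restore a non-zero threshold so the optimizer builds the index.
httpx.patch(
    "http://localhost:6333/collections/benchmark",
    json={"optimizers_config": {"indexing_threshold": 20000}},
)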

(Screenshot: note indexed_vectors_count near the top and max_optimization_threads at the bottom.)

Cannot reproduce and results mismatch

Hello,

Thanks for the great work on this benchmark. I have a couple of questions:

It seems that the engine versions are updated frequently, but I don't see any corresponding updates to the front-page graph. Are there plans to address this? If not, it could be misleading.

I've reviewed a fork of this project, and it presented entirely different conclusions without any code changes. Additionally, I've had trouble reproducing the results shown in your front-page graph. Could this be due to the results not being updated in accordance with code changes?

Thank you.

Add PostgreSQL with pgvector to the benchmark

Thank you for this benchmark! It adds a lot of value for those seeking a vector store.

I would greatly appreciate it if PostgreSQL with the pgvector extension could be included in the benchmark.

Using PostgreSQL is extremely convenient, so it would be beneficial to gain insights into how it performs in comparison to other database engines.

It would provide valuable information for users who want to make informed decisions when choosing a vector database for their projects.

backoff strategy should be used for rate-limited errors on milvus or reducing batch_size config

It's common to see the following type of error on non-local setups:

pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>

Full traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/vector-db-benchmark/engine/base_client/upload.py", line 90, in _upload_batch
    cls.upload_batch(ids, vectors, metadata)
  File "/home/ubuntu/vector-db-benchmark/engine/clients/milvus/upload.py", line 68, in upload_batch
    cls.upload_with_backoff(field_values, ids, vectors)
  File "/usr/local/lib/python3.10/dist-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
  File "/home/ubuntu/vector-db-benchmark/engine/clients/milvus/upload.py", line 75, in upload_with_backoff
    cls.collection.insert([ids, vectors] + field_values)
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/orm/collection.py", line 443, in insert
    res = conn.batch_insert(self._name, entities, partition_name,
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 80, in handler
    raise MilvusException(e.code, f"{timeout_msg}, message={e.message}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>
"""


The above exception was the direct cause of the following exception:


Traceback (most recent call last):

  File "/home/ubuntu/vector-db-benchmark/run.py", line 79, in <module>
    app()

  File "/home/ubuntu/vector-db-benchmark/run.py", line 74, in run
    raise e

  File "/home/ubuntu/vector-db-benchmark/run.py", line 52, in run
    client.run_experiment(dataset, skip_upload, skip_search)

  File "/home/ubuntu/vector-db-benchmark/engine/base_client/client.py", line 70, in run_experiment
    upload_stats = self.uploader.upload(

  File "/home/ubuntu/vector-db-benchmark/engine/base_client/upload.py", line 56, in upload
    latencies = list(

  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value

pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>

Given that the Milvus configs don't specify batch_size, we're using 64 vectors per batch, which seems to consistently trigger the error above.
I suggest either respecting the API rate limits with a backoff or reducing the batch size.

Use `delete_client` wherever required

We recently introduced delete_client in the base client classes while adding pgvector in #91. We need to check whether there are other places where this can help.

E.g., replace the closable classes of OpenSearch and Elastic with the delete_client functionality (and check that it works fine).

Replica of shard has state error

The benchmark program often experiences irregular errors and stops running.
By the way, just to let you know: under the same conditions, I obtained performance metrics that were twice as good in a load test conducted one week ago. I am unsure whether this issue is related, though.

Current Behavior

vector-db-benchmark no error

Steps to Reproduce

1. cd vector-db-benchmark
2. python3 run.py --engines qdrant-\* --datasets gist-\* --host $hostip --no-skip-upload --no-skip-search
3. Error logs:

AnnH5Reader read_queries finished.

8it [00:00, 611.09it/s]
Traceback (most recent call last):
  File "run.py", line 58, in run
    client.run_experiment(dataset, skip_upload, skip_search)
  File "/app/qdrant_vectordb_benchmark/vector-db-benchmark-master/engine/base_client/client.py", line 88, in run_experiment
    search_stats = searcher.search_all(
  File "/app/qdrant_vectordb_benchmark/vector-db-benchmark-master/engine/base_client/search.py", line 102, in search_all
    zip(*pool.imap_unordered(search_one, iterable=tqdm.tqdm(queries)))
  File "/usr/local/python3/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7faa182d6670>'. Reason: 'TypeError("cannot pickle '_thread.RLock' object")'
AnnH5Reader read_queries
Experiment qdrant-m-16-ef-128 - gist-960-euclidean interrupted
Traceback (most recent call last):

  File "run.py", line 85, in <module>
    app()

  File "run.py", line 80, in run
    raise e

  File "run.py", line 58, in run
    client.run_experiment(dataset, skip_upload, skip_search)

  File "/app/qdrant_vectordb_benchmark/vector-db-benchmark-master/engine/base_client/client.py", line 88, in run_experiment
    search_stats = searcher.search_all(

  File "/app/qdrant_vectordb_benchmark/vector-db-benchmark-master/engine/base_client/search.py", line 102, in search_all
    zip(*pool.imap_unordered(search_one, iterable=tqdm.tqdm(queries)))

  File "/usr/local/python3/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7faa182d6670>'. Reason: 'TypeError("cannot pickle '_thread.RLock' object")'

4. kubectl logs -f --tail=5 test-qdrant-1

[2023-08-18T01:41:14.436Z WARN  storage::content_manager::consensus_manager] Failed to apply collection meta operation entry with user error: Wrong input: Replica 7025067289929045 of shard 2 has state Some(Active), but expected Some(Initializing)
[2023-08-18T01:41:14.891Z WARN  storage::content_manager::consensus_manager] Failed to apply collection meta operation entry with user error: Wrong input: Replica 4259549066635120 of shard 0 has state Some(Active), but expected Some(Initializing)

Context (Environment)

uname -a

Linux master 5.4.251-1.el7.elrepo.x86_64 #1 SMP Thu Jul 27 18:49:53 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/redhat-release

CentOS Linux release 7.4.1708 (Core)

Parallel client operations

I'd like to have the possibility to run any client operation using several instances of the client. It should be done by providing an optional number of instances as a script parameter: --parallel=8.
