
quickwit-oss / search-benchmark-game


This project is a fork of jason-wolfe/search-index-benchmark-game.

Stars: 67, Forks: 34, Size: 10.48 MB

Search engine benchmark (Tantivy, Lucene, PISA, ...)

Home Page: https://tantivy-search.github.io/bench/

License: MIT License

Shell 0.34% Rust 50.09% Java 13.92% Makefile 7.47% Python 4.15% Go 7.20% CMake 0.73% C++ 5.77% HTML 1.85% JavaScript 7.77% SCSS 0.40% Dockerfile 0.31%

search-benchmark-game's People

Contributors

amallia, fulmicoton, jason-wolfe, jpountz, k-yomo, lengyijun, mosuka, petr-tik, pseitz


search-benchmark-game's Issues

Correctness tests

Please add some correctness tests. One simple idea to do so would be to compute Kendall's Tau between Lucene and the other engines.
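
For example, a minimal sketch (my own illustration; it assumes each engine returns a ranked list of doc ids for the same query, and uses scipy only for the correlation itself):

from scipy.stats import kendalltau

def rank_agreement(reference_ids, candidate_ids):
    # Compare two ranked result lists via the ranks each engine assigns
    # to the documents that both of them returned.
    candidate_set = set(candidate_ids)
    common = [d for d in reference_ids if d in candidate_set]
    reference_ranks = [reference_ids.index(d) for d in common]
    candidate_ranks = [candidate_ids.index(d) for d in common]
    tau, _p_value = kendalltau(reference_ranks, candidate_ranks)
    return tau

# Hypothetical ranked doc ids for one query:
lucene_top = ["d3", "d7", "d1", "d9"]
tantivy_top = ["d3", "d1", "d7", "d9"]
print(rank_agreement(lucene_top, tantivy_top))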

@JMMackenzie @elshize any other ideas?

Usage of shadow is not compatible with latest gradle

Shadow version 2.0.x is not supported with Gradle 5.x, so the build of the Lucene engine fails with:

$ make build
gradle clean shadowJar

FAILURE: Build failed with an exception.

* What went wrong:
Method com/github/jengelman/gradle/plugins/shadow/internal/DependencyFileCollection.getBuildDependencies()Lorg/gradle/api/tasks/TaskDependency; is abstract

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 629ms
Makefile:14: recipe for target 'build' failed
make: *** [build] Error 1

The solution is to update to the latest Shadow version:

$ git diff .
diff --git a/engines/lucene-8.0.0/build.gradle b/engines/lucene-8.0.0/build.gradle
index 3f6042b..d5afc52 100644
--- a/engines/lucene-8.0.0/build.gradle
+++ b/engines/lucene-8.0.0/build.gradle
@@ -1,5 +1,5 @@
 plugins {
-    id 'com.github.johnrengelman.shadow' version '2.0.2'
+    id 'com.github.johnrengelman.shadow' version '5.1.0'
     id 'java'
 }
 

Lucene 9.6 engine actually uses Lucene 8.10.1?

I have been playing with this benchmark lately to better understand where Lucene had room for improvement compared to other engines (thank you!) and noticed that the lucene-9.6 engine actually had a dependency on Lucene 8.10.1:

implementation group: 'org.apache.lucene', name: 'lucene-core', version: '8.10.1'
implementation group: 'org.apache.lucene', name: 'lucene-analyzers-common', version: '8.10.1'
implementation group: 'org.apache.lucene', name: 'lucene-queryparser', version: '8.10.1'

I plan on contributing a lucene-9.8 engine soon; I could remove this lucene-9.6 engine at that time if that looks like a good way forward.

What is meant by maxscore?

The readme says:

Maxscore is not yet used in Top 10. It should give a nice boost once Lucene 8 is released.

I am a noob in search technologies, but modern optimizations catch my curiosity, so if anyone could expand on it that would be great!

Is maxscore used out of the box, or does it require an explicit API (for Lucene)? (If the former, it can be removed from the readme, since the benchmark was updated to Lucene 8 a long time ago.)
Also, it seems that Lucene 8 in fact superseded maxscore with a successor optimization?
-> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148045/
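
In case it helps other readers, here is a toy sketch of the MAXSCORE idea (my own simplified illustration, not Lucene's actual implementation): terms are ordered by their maximum possible score contribution, and once the current top-k threshold is high enough, documents that match only the low-impact "non-essential" terms can be skipped without being scored.

import heapq

def maxscore_top_k(postings, max_scores, k):
    # postings: {term: {doc_id: score}} with precomputed per-document scores
    # max_scores: {term: upper bound on any score contribution of that term}
    terms = sorted(postings, key=lambda t: max_scores[t])   # ascending upper bound
    prefix_bound = [0.0]                                     # prefix_bound[i] = sum of bounds of terms[:i]
    for t in terms:
        prefix_bound.append(prefix_bound[-1] + max_scores[t])

    heap, threshold = [], 0.0                                # min-heap of the current top-k scores
    candidates = sorted({d for p in postings.values() for d in p})
    for doc in candidates:
        # Terms whose cumulative upper bound stays below the threshold are
        # "non-essential": a doc matching only those can never enter the top k.
        first_essential = 0
        while first_essential < len(terms) and prefix_bound[first_essential + 1] <= threshold:
            first_essential += 1
        if not any(doc in postings[t] for t in terms[first_essential:]):
            continue                                         # skipped without scoring
        score = sum(postings[t].get(doc, 0.0) for t in terms)
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
        if len(heap) == k:
            threshold = heap[0]
    return sorted(heap, reverse=True)

As far as I understand, what Lucene 8 actually ships is a block-max variant of this family of optimizations, where score upper bounds are tracked per block of postings rather than once per term.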

Add Elasticsearch

Elasticsearch uses Lucene as a backend, but I'm sure they have added a significant number of optimizations (or slowdowns) on top of it, and the world needs to know :)

It would also be nice to have benchmarks for the speed of reindexing changes from an external database, e.g. Logstash vs. real-time Hibernate Search and ZomboDB.

Specification of the desired tokenizer behavior

Could you please specify the desired behavior of the tokenizer, so the term/result numbers of the benchmark can be replicated exactly?
I'm trying to make Seekstorm compatible with your benchmark, but so far I have been unable to replicate the exact term counts (e.g. for the term "the").

I have changed my tokenizer to pure whitespace tokenization, non-alphanumeric character trimming (removing all non-letter, non-digit characters from both ends of a term), and lower-casing.
Stemming, apostrophe tokenizing/stemming, domain-name tokenizing, hyphen tokenizing, umlaut/accent folding, CJK word segmentation, CJK full-/half-width folding, and CJK traditional/simplified folding have all been disabled.
I even counted the terms before indexing, to rule out any differences arising between indexing and querying.
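
For reference, this is roughly the tokenization I described above (a minimal sketch of my current Seekstorm-side behavior, not a claim about what the benchmark expects):

def trim(term):
    # Remove all non-letter, non-digit characters from both ends of the term.
    start, end = 0, len(term)
    while start < end and not term[start].isalnum():
        start += 1
    while end > start and not term[end - 1].isalnum():
        end -= 1
    return term[start:end]

def tokenize(text):
    # Pure whitespace tokenization, then trimming and lower-casing;
    # no stemming, folding, or CJK handling.
    return [t.lower() for t in (trim(raw) for raw in text.split()) if t]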

The differences are in the permille range, but in your published results (https://tantivy-search.github.io/bench/, collection type count) Tantivy, Lucene, PISA, Bleve and Rucene all return exactly 4,168,066 docs for the term "the". In my own test with Lucene's StandardAnalyzer (no stopwords, after commit, forceMerge and restarting Lucene) I get 4,167,903 docs for the term "the", and in Seekstorm I get 4,167,776 docs.

Any suggestion would be much appreciated.

Show the geometric mean

It would be really nice to add a new row to the benchmark results showing the geometric mean for each library.
It would make quick comparisons easier (although, being a much more coarse-grained analysis, it does not replace comparing individual benchmark queries) and, more importantly, it would make overall progress or regression after an update much easier to detect and quantify.

edit: OK, I'm dumb, there is already an AVERAGE row.
However, do you use the geometric mean instead of the arithmetic mean?
I believe the geometric mean is better suited for such comparisons than the arithmetic mean:
https://www.cse.unsw.edu.au/~cs9242/18/papers/Fleming_Wallace_86.pdf
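
To illustrate the difference (hypothetical per-query timings):

import math

timings_ms = [1.2, 3.5, 0.8, 120.0]  # hypothetical per-query timings for one engine

arithmetic = sum(timings_ms) / len(timings_ms)
geometric = math.exp(sum(math.log(t) for t in timings_ms) / len(timings_ms))

print(round(arithmetic, 1))  # 31.4 -> dominated by the single slow query
print(round(geometric, 1))   # 4.5  -> every query contributes proportionally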

Can't parse wikipedia articles anymore

It looks like the benchmark was changed so that it now supports /home/paul/git/search-index-benchmark-game/corpus.json (whatever format that is), but no longer supports Wikipedia articles. For example, using the lucene-8.0.0 engine, an attempt to index reveals:

$ make idx
---- Indexing Lucene ----
java -server -cp build/libs/search-index-benchmark-game-lucene-1.0-SNAPSHOT-all.jar BuildIndex idx < /ssd/karel/vcs/search-benchmark-game/wiki-articles.json
Exception in thread "main" java.lang.NullPointerException
        at BuildIndex.main(BuildIndex.java:39)
Makefile:17: recipe for target 'idx' failed
make: *** [idx] Error 1

which means a parse error, or more precisely that the id cannot be read from the JSON line. The problem is that Wikipedia articles have no id, but rather url, title and body.

A very similar result is obtained when testing the tantivy-0.9 engine:

$ make index



---- Indexing tantivy ----
export RUST_LOG=info && target/release/build_index "idx" < /ssd/karel/vcs/search-benchmark-game/wiki-articles.json
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgument("Failed to parse document NoSuchFieldInSchema(\"body\")")', src/libcore/result.rs:997:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Makefile:19: recipe for target 'idx' failed
make: *** [idx] Error 101

Again, the code expects just id and text JSON fields...
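
Assuming the expected format really is one JSON object per line with just id and text, a throwaway conversion script could look like this (the target field names are my assumption, based on the errors above):

import json
import sys

# Usage: python convert.py < wiki-articles.json > corpus.json
for i, line in enumerate(sys.stdin):
    doc = json.loads(line)
    text = (doc.get("title", "") + " " + doc.get("body", "")).strip()
    print(json.dumps({"id": str(i), "text": text}))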

Hardware transparency

Capture relevant hardware information (CPU type, cache sizes, ...) and display it in the benchmark.
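
A minimal, Linux-specific sketch of what could be captured (via lscpu; the exact set of fields to record is open):

import json
import platform
import subprocess

def hardware_info():
    # Collect basic platform info plus CPU model and cache sizes from lscpu.
    info = {"machine": platform.machine(), "system": platform.system()}
    try:
        out = subprocess.run(["lscpu"], capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            key, _, value = line.partition(":")
            if key.strip() in ("Model name", "L1d cache", "L2 cache", "L3 cache"):
                info[key.strip()] = value.strip()
    except (OSError, subprocess.CalledProcessError):
        pass  # lscpu not available (e.g. macOS); keep the platform basics only
    return info

print(json.dumps(hardware_info(), indent=2))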

Engines' URLs

It would be nice to display each engine as a link, so that clicking it takes you to the engine's homepage.

Use same BM25 k1/b parameters across engines.

The k1 and b parameters of BM25 can influence what hits may be dynamically pruned and thus performance numbers, so it would be good to use the same values across engines. Currently it looks like engines use their own defaults, which seem to be k1=0.9 and b=0.4 for PISA, and k1=1.2 and b=0.75 for Lucene and Tantivy.
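
To make the effect concrete, here is one common BM25 formulation with the two parameter sets side by side (toy statistics, not measured from the corpus):

import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1, b):
    # idf times a saturated term frequency; k1 controls tf saturation,
    # b controls document length normalization.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

args = dict(tf=3, df=1000, doc_len=120, avg_doc_len=100, n_docs=1_000_000)
print(bm25_term_score(**args, k1=1.2, b=0.75))  # Lucene/Tantivy defaults
print(bm25_term_score(**args, k1=0.9, b=0.4))   # PISA defaults (per the numbers above)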

What's the License for this repo?

Context: I want to compare some workloads between Lucene and Tantivy, and I would love to use and contribute to this repo.

I don't currently see a license for this repo, and I wish it could be something open such as MIT (just like most of the repos under quickwit-oss).

Continuous benchmark

Continuous benchmarking

Add a CI-like job to run the benchmark automatically.

It will help developers, potential users and tantivy-curious people to track performance numbers continuously. Automating also means less stress and hassle for the maintainers/developers of tantivy.

Granularity

We can choose to either run a benchmark on every commit or on every release.

On every commit

Integrate the benchmarking suite into CI on the main tantivy repo. Using Travis CI's after_success build stage, run the benchmark and append the results to results.json in the search-benchmark repo.

Pros:

Commit-specific perf numbers - easier to triage perf regressions.
Will create a more detailed picture of the hot path for the future.
Automated - no need to fiddle with re-running benchmarks locally.

Costs/cons:

Too much noise - some commits are WIP or harm perf for the sake of a refactor. Is it really necessary to keep that data?
Makes every CI job run longer.
Benchmarking should be done on a dedicated machine to guarantee similar conditions. CI jobs run inside uncontrolled layers of abstraction (Docker inside a VM, inside a VM). To control the environment and keep it automated, we would need to dedicate a VPS instance. That is an expense, a potential security vulnerability, and something that needs administration.

On every release

Same as above, only use git tags to tell whether a commit corresponds to a new release.

Pros:

Fewer runs - cheaper on HW, doesn't slow builds down.
Releases are usually semantically important points in history, where we are interested in perf.

Cons/costs:

Still needs dedicated HW to run consistently.
Needs push access to tantivy-benchmark repo.

Presentation

Showing data from every commit might be unnecessarily overwhelming. The current benchmark front-end is clean (imho) and makes it easy to compare results across queries and versions.

On the front-end, we can show 0.6, 0.7, 0.8, 0.9 and latest commit or release.

Power-users or admins can be given the choice to massively extend the table to every commit.

Implementation

A VPS that watches the tantivy main repo, builds a benchmark and commits new results at a decided frequency.
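
A rough sketch of what the release-watching part of that VPS job could look like (repo paths, branch name and the benchmark entry point are all hypothetical):

import subprocess

TANTIVY_CLONE = "/srv/tantivy"              # hypothetical local clone of the tantivy repo
BENCH_CLONE = "/srv/search-benchmark-game"  # hypothetical local clone of this repo
LAST_TAG_FILE = "/srv/last_benchmarked_tag"

def latest_release_tag():
    subprocess.run(["git", "-C", TANTIVY_CLONE, "fetch", "--tags"], check=True)
    out = subprocess.run(
        ["git", "-C", TANTIVY_CLONE, "describe", "--tags", "--abbrev=0", "origin/main"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def main():
    tag = latest_release_tag()
    try:
        last = open(LAST_TAG_FILE).read().strip()
    except FileNotFoundError:
        last = ""
    if tag and tag != last:
        # Hypothetical benchmark entry point; commit/publish the new results afterwards.
        subprocess.run(["make", "bench"], cwd=BENCH_CLONE, check=True)
        with open(LAST_TAG_FILE, "w") as f:
            f.write(tag)

if __name__ == "__main__":
    main()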

Thoughts?

Add index size as benchmark

Index size is a very significant metric.

The BuildIndex code of each engine should do any applicable compaction at the end; the index size is then measured.
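
A minimal sketch of the measurement itself (assuming each engine writes its index under the idx directory it is given):

import os

def index_size_bytes(index_dir):
    # Sum the sizes of all files under the index directory, after the engine
    # has done its final merge/compaction.
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

print(index_size_bytes("idx"))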

Outdated schemas

Schemas are outdated (in the Dropbox JSON there are fields like body, title and url).

Bigger collection

We need to use a bigger collection. Some options are Gov2, ClueWeb and CC-NEWS.

PISA should compute top hits for task TOP_10_COUNT

It seems to me that the pisa-0.8.2 engine forces evaluation of all hits with the TOP_10_COUNT task, but it doesn't collect them into a priority queue as I would expect. So there is really no difference between TOP_10_COUNT and TOP_1000_COUNT for pisa at the moment.

Let's make it collect hits into a priority queue?
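
Something along these lines (a generic sketch, not PISA's API): keep counting every hit, but also maintain a bounded min-heap of the best k scores, so TOP_10_COUNT actually pays the top-k collection cost:

import heapq

def top_k_and_count(scored_hits, k=10):
    # scored_hits: iterable of (doc_id, score) pairs for all matching documents
    heap, count = [], 0
    for doc_id, score in scored_hits:
        count += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), count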

Update README

Apparently the README is outdated and states that tantivy is, generally speaking, faster than Lucene.
This was true before Lucene 8, but it is not anymore.
