wl21st / solr-es-comparison

This project forked from flaxsearch/solr-es-comparison

License: The Unlicense

Python 19.55% Shell 0.69% HTML 4.40% JavaScript 32.26% CSS 5.54% XSLT 37.56%

solr-es-comparison's Introduction

SolrCloud/Elasticsearch comparison tools

This repository contains various Python scripts and config files used by Flax for our performance comparison of Solr and Elasticsearch, presented at the BCS Search Solutions event in November 2014 - the slides for which are at:

http://www.slideshare.net/charliejuggler/lucene-solrlondonug-meetup28nov2014-solr-es-performance

These files are provided for interest only, and we make no claims about their usefulness for any other application.

In order to provide a completely "fair" comparison, the exact same document set is used for both the Solr and Elasticsearch indexes. To avoid the overhead involved in downloading a large document set, we instead used a Markov chain (and a Python implementation by Shabda Raaj) to generate random documents of various sizes from a training document. Our study used data/stoicism.txt (downloaded from gutenberg.org) for training, but any "normal" text of reasonable size should be usable. One open question is how realistic this approach is compared with real documents, but Elasticsearch and Solr did at least receive the same data. Analysis also showed that the Markov-generated text (like natural text) obeyed Zipf's law of word distribution, which supports its validity.
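The generator script itself is not reproduced here, but the word-level Markov chain idea can be sketched as follows (a minimal illustration under assumed design, not the actual generate/generator.py; all names are hypothetical):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the list of words that follow it
    in the training text. Repeated followers stay duplicated, so sampling
    from the list preserves the training text's word frequencies."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, n_words, seed=None):
    """Emit n_words of random text by walking the chain from a random
    starting prefix, restarting at a random prefix on a dead end."""
    rng = random.Random(seed)
    prefix = rng.choice(list(chain))
    out = list(prefix)
    while len(out) < n_words:
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:  # dead end: jump to a random prefix's followers
            followers = chain[rng.choice(list(chain))]
        out.append(rng.choice(followers))
    return " ".join(out[:n_words])
```

Because followers are sampled with their training-text multiplicities, the generated text inherits the training text's frequency skew, which is why a Zipf-like distribution is expected.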

Generating random documents

The generate/generator.py script is used to generate random documents for indexing, which it saves as a gzip file. It takes the following arguments:

-h, --help  show this help message and exit
-n N        number of documents to generate
-o O        output filename
-i I        training text
--min MIN   minimum doc size in words
--max MAX   maximum doc size in words

e.g., to create 1M random documents ranging in size between 10 and 1000 words, based on data/stoicism.txt:

$ cd generate
$ python generator.py -n 1000000 -i ../data/stoicism.txt -o ../data/docs.gz --min 10 --max 1000

Indexing to Elasticsearch

Before indexing, you need to configure the index, e.g. with curl:

$ cd elasticsearch
$ curl -XPUT http://localhost:9200/speedtest -d @<settings file>

(replacing localhost:9200 with the location of your Elasticsearch instance, and <settings file> with the index settings JSON provided in this directory). Then edit the indexer.py script and set ES_URL to point to the speedtest index.

$ time python indexer.py ../data/docs.gz A

The second parameter (A in this case) is used as an ID prefix. You can run several indexers in parallel, using different ID prefixes to prevent ID clashes.
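The indexer script is not shown here, but the ID-prefix scheme it relies on can be sketched as building an Elasticsearch _bulk request body (a hypothetical helper, not the repo's indexer.py; field names are assumptions):

```python
import json

def bulk_payload(docs, index, prefix, start=0):
    """Build an Elasticsearch _bulk request body (newline-delimited JSON),
    prefixing each document ID so that parallel indexers started with
    different prefixes (e.g. 'A', 'B') can never produce clashing IDs."""
    lines = []
    for n, doc in enumerate(docs, start):
        lines.append(json.dumps({"index": {"_index": index, "_id": f"{prefix}{n}"}}))
        lines.append(json.dumps({"text": doc}))
    return "\n".join(lines) + "\n"
```

The resulting string would be POSTed to the cluster's /_bulk endpoint with a Content-Type of application/x-ndjson.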

Indexing to Solr

A solr conf directory is provided in solr. You will need to upload this to SolrCloud using the usual methods (or for single node Solr, copy it over the default config). The indexer.py script needs to be edited to point SOLR_URL to the correct location. Then, the indexer is run in the same way as the Elasticsearch indexer.
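Solr's JSON update handler accepts a plain array of documents, so the same prefix scheme carries over directly. A minimal sketch (not the repo's indexer.py; the URL, batch helper, and the 'id'/'text' field names are assumptions):

```python
import json
import urllib.request

# assumed location; set to match your SolrCloud collection
SOLR_URL = "http://localhost:8983/solr/collection1/update"

def solr_batch(docs, prefix, start=0):
    """Serialise a batch of documents as a JSON array, applying the same
    ID-prefix scheme used for Elasticsearch to avoid clashes between
    parallel indexers."""
    return json.dumps([{"id": f"{prefix}{n}", "text": d}
                       for n, d in enumerate(docs, start)])

def post_batch(batch):
    """POST one batch to Solr's JSON update handler."""
    req = urllib.request.Request(
        SOLR_URL + "?commit=false", data=batch.encode(),
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```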

Running the search tests

The main test script is loadtester.py in loadtest. It takes the arguments:

-h, --help   show this help message and exit
--es ES      Elasticsearch search URL
--solr SOLR  Solr search URL
-i I         input file for words
-o O         output file
--ns NS      number of searches (default is 1)
--nt NT      number of terms (default is 1)
--nf NF      number of filters (default is 0)
--fac        use facets

For example:

$ python loadtester.py \
    --solr "http://localhost:8983/solr/collection1/query" \
    -i ../data/stoicism.txt -o test1.txt --ns 100 --nt 3

The output is simply a text file where each line records the number of documents found and the query time. To get some basic analysis of the results:

$ python analyser.py test1.txt
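Given the two-column output format described above (hit count, then query time, per line), a basic analysis pass can be sketched like this (an illustration of the idea, not the repo's analyser.py):

```python
import statistics

def analyse(lines):
    """Parse result lines of the form '<numfound> <query time>' and
    summarise the query times; blank lines are skipped."""
    times = [float(line.split()[1]) for line in lines if line.strip()]
    return {
        "n": len(times),
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "max": max(times),
    }
```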

The merge2.py and merge3.py scripts can be used to merge the query times of two or three results files and write them as a .csv file for importing into a spreadsheet etc.
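The merging step amounts to zipping the query-time columns of the input files side by side. A generalised sketch (hypothetical helper, not the repo's merge2.py/merge3.py):

```python
import csv

def merge_times(result_files, out_csv):
    """Read the query-time column (second field) from each results file
    and write them as parallel columns in a CSV, one column per file,
    with the filenames as the header row."""
    columns = []
    for path in result_files:
        with open(path) as f:
            columns.append([line.split()[1] for line in f if line.strip()])
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(result_files)  # header: one column per input file
        writer.writerows(zip(*columns))
```

Note that zip() truncates to the shortest file, so runs with different numbers of searches lose trailing rows.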

The qps.py script runs searches repeatedly and prints the QPS to stdout. Multiple instances can be run concurrently to increase the load (there is no multithreading, currently).
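The single-threaded QPS measurement reduces to counting completed searches over a fixed wall-clock window. A sketch of that loop (not the repo's qps.py; the injected search callable stands in for the real HTTP request):

```python
import time

def measure_qps(search, duration=10.0):
    """Call search() in a tight loop for `duration` seconds and report
    queries per second. Single-threaded, so concurrency comes from
    running multiple processes, as the text above describes."""
    end = time.monotonic() + duration
    n = 0
    while time.monotonic() < end:
        search()
        n += 1
    return n / duration
```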

solr-es-comparison's People

Contributors: flaxsearch
