miku / solrbulk

SOLR bulk indexing utility for the command line.

License: GNU General Public License v3.0

Makefile 11.13% Go 75.65% Shell 13.22%
solr indexing code4lib

solrbulk

Motivation:

Sometimes you need to index a bunch of documents really, really fast. Even with Solr 4.0 and soft commits, if you send one document at a time you will be limited by the network. The solution is two-fold: batching and multi-threading. http://lucidworks.com/blog/high-throughput-indexing-in-solr/

solrbulk expects a file with line-delimited JSON as input. Each line represents a single document. solrbulk takes care of reformatting the documents into the bulk JSON format that SOLR understands.
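The reformatting step can be pictured with a minimal sketch. It assumes the bulk format is a plain JSON array of documents, which is one of the request shapes Solr's JSON update handler accepts; the function name `bulkBody` is hypothetical and this is not necessarily how solrbulk builds its requests.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// bulkBody (hypothetical helper) joins line-delimited JSON documents into a
// single JSON array. Blank lines are skipped; a line that is not valid JSON
// yields an error. Note: bufio.Scanner limits lines to 64 KB by default.
func bulkBody(ldj string) (string, error) {
	var docs []string
	scanner := bufio.NewScanner(strings.NewReader(ldj))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		if !json.Valid([]byte(line)) {
			return "", fmt.Errorf("invalid JSON document: %s", line)
		}
		docs = append(docs, line)
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "[" + strings.Join(docs, ",") + "]", nil
}

func main() {
	input := "{\"id\": \"1\", \"state\": \"Alaska\"}\n{\"id\": \"2\", \"state\": \"California\"}\n"
	body, err := bulkBody(input)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Each input line is kept byte-for-byte, so field order and formatting inside a document are preserved.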

solrbulk will send documents in batches and in parallel. The number of documents per batch can be set via -size, the number of workers with -w.
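As an illustration of batching combined with parallel workers, here is a small sketch. The names `batches`, `process` and `send` are hypothetical, and `send` stands in for the HTTP request; the actual solrbulk implementation may be organized differently.

```go
package main

import (
	"fmt"
	"sync"
)

// batches splits docs into slices of at most size elements (cf. -size).
func batches(docs []string, size int) [][]string {
	var out [][]string
	for size > 0 && len(docs) > 0 {
		n := size
		if len(docs) < n {
			n = len(docs)
		}
		out = append(out, docs[:n])
		docs = docs[n:]
	}
	return out
}

// process hands each batch to one of `workers` goroutines (cf. -w) and
// returns the total number of documents handled.
func process(docs []string, size, workers int, send func([]string)) int {
	if workers < 1 {
		workers = 1 // at least one worker, otherwise the queue would block
	}
	queue := make(chan []string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range queue {
				send(batch)
				mu.Lock()
				total += len(batch)
				mu.Unlock()
			}
		}()
	}
	for _, b := range batches(docs, size) {
		queue <- b
	}
	close(queue)
	wg.Wait()
	return total
}

func main() {
	docs := []string{"a", "b", "c", "d", "e"}
	n := process(docs, 2, 2, func(batch []string) {
		// Placeholder: a real worker would POST the batch to the server.
	})
	fmt.Println("documents processed:", n)
}
```

With five documents and a batch size of two, the workers receive three batches (2, 2 and 1 documents); the order in which batches are sent is not deterministic.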

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

This tool has been developed for project finc at Leipzig University Library.

Installation

Installation via Go tools.

$ go install github.com/miku/solrbulk/cmd/solrbulk@latest

There are also DEB and RPM packages available at https://github.com/miku/solrbulk/releases/.

Usage

Flags.

$ solrbulk
Usage of solrbulk:
  -commit int
        commit after this many docs (default 1000000)
  -cpuprofile string
        write cpu profile to file
  -memprofile string
        write heap profile to file
  -no-final-commit
        omit final commit
  -optimize
        optimize index
  -purge
        remove documents from index before indexing (use purge-query to selectively clean)
  -purge-pause duration
        insert a short pause after purge (default 2s)
  -purge-query string
        query to use, when purging (default "*:*")
  -server string
        url to SOLR server, including host, port and path to collection,
        e.g. http://localhost:8983/solr/biblio
  -size int
        bulk batch size (default 1000)
  -update-request-handler-name string
        where solr.UpdateRequestHandler is mounted on the server,
        https://is.gd/s0eirv (default "/update")
  -v    prints current program version
  -verbose
        output basic progress
  -w int
        number of workers to use (default 4)
  -z    unzip gz'd file on the fly

Example

Given a newline delimited JSON file:

$ cat file.ldj
{"id": "1", "state": "Alaska"}
{"id": "2", "state": "California"}
{"id": "3", "state": "Oregon"}
...

$ solrbulk -verbose -server https://192.168.1.222:8085/collection1 file.ldj

The -server parameter contains host, port and path up to, but excluding, the update route, which defaults to /update (since 0.3.4, this can be adjusted via the -update-request-handler-name flag).

For example, if you usually update via https://192.168.1.222:8085/solr/biblio/update the server parameter would be:

$ solrbulk -server https://192.168.1.222:8085/solr/biblio file.ldj
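The relationship between the server value and the update route can be sketched as follows. The helper `updateURL` is hypothetical; it assumes the request URL is simply the server value followed by the handler name, and solrbulk may handle slashes differently.

```go
package main

import (
	"fmt"
	"strings"
)

// updateURL (hypothetical helper) joins the -server value and the update
// handler name. It tolerates a trailing slash on the server and a missing
// leading slash on the handler, and falls back to "/update" when the
// handler is empty.
func updateURL(server, handler string) string {
	if handler == "" {
		handler = "/update"
	}
	return strings.TrimRight(server, "/") + "/" + strings.TrimLeft(handler, "/")
}

func main() {
	fmt.Println(updateURL("https://192.168.1.222:8085/solr/biblio", "/update"))
	// prints https://192.168.1.222:8085/solr/biblio/update
}
```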

Some performance observations

  • Having as many workers as cores is generally a good idea. However, the returns seem to diminish quickly as more cores are added.
  • Disable autoCommit, autoSoftCommit and the transaction log in solrconfig.xml.
  • Use some high number for -commit. solrbulk will issue a final commit request at the end of the processing anyway.
  • For some use cases, the bulk indexing approach is about twice as fast as a standard request to /solr/update.
  • On machines with more cores, try to increase maxIndexingThreads.
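
For reference, the commit and purge operations mentioned in this README correspond to small JSON commands that Solr's update handler accepts. The sketch below only builds the request bodies; the helper names are hypothetical, and whether solrbulk sends exactly these bodies (rather than, say, URL parameters) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteByQuery (hypothetical helper) builds a Solr JSON delete-by-query
// command, e.g. for the default purge query "*:*".
func deleteByQuery(query string) (string, error) {
	b, err := json.Marshal(map[string]interface{}{
		"delete": map[string]string{"query": query},
	})
	return string(b), err
}

// commitCommand (hypothetical helper) builds a Solr JSON commit command
// with default options.
func commitCommand() (string, error) {
	b, err := json.Marshal(map[string]interface{}{"commit": struct{}{}})
	return string(b), err
}

func main() {
	del, err := deleteByQuery("*:*")
	if err != nil {
		panic(err)
	}
	fmt.Println(del) // prints {"delete":{"query":"*:*"}}

	com, err := commitCommand()
	if err != nil {
		panic(err)
	}
	fmt.Println(com) // prints {"commit":{}}
}
```

Either body would be sent as a POST with Content-Type application/json to the update route.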

Elasticsearch?

Try esbulk.
