brainbot-com / es-nozzle

es-nozzle synchronizes directories into ElasticSearch

License: Apache License 2.0

Shell 6.59% Python 7.85% Clojure 85.56%

es-nozzle's Introduction

es-nozzle

es-nozzle can be used to index documents from the local filesystem or from network shares. It's similar in purpose to dadoonet's filesystem river, but it's not an elasticsearch plugin. Instead, es-nozzle uses RabbitMQ to provide a fault-tolerant and scalable system for synchronizing filesystems into an elasticsearch cluster.

Please visit http://brainbot.com/es-nozzle/doc/ for detailed documentation.

source code

The es-nozzle source code is hosted on github: https://github.com/brainbot-com/es-nozzle

es-nozzle is written in Clojure and uses Leiningen as its build system. To build from source, install Leiningen and run lein uberjar.

Downloads

Releases

es-nozzle releases can be downloaded from http://brainbot.com/es-nozzle/download/

Please follow this link to view the documentation of the latest es-nozzle release

Snapshots

Current development snapshots of es-nozzle are available from http://brainbot.com/es-nozzle/snapshots

Please follow this link to view the documentation of the latest snapshot release

License

Distributed under the Apache License, Version 2.0

es-nozzle's People

Contributors

schmir binarymind

es-nozzle's Issues

Allow custom index settings

Document how one would use custom elasticsearch index settings (e.g.
different analyzer settings for different sources, …).
If this turns out to be infeasible at the moment, we need to implement a way
of doing this (e.g. via index templates
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html).
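
As a rough illustration of the index template idea, the following sketch builds the body of a template that sets a default analyzer for all indices matching a name pattern. The function name, the "fs1*" pattern (which assumes index names start with the filesystem key) and the choice of the "german" analyzer are assumptions made for this example only. The "template" key is the older template syntax; newer elasticsearch versions use "index_patterns" instead.

```python
import json


def build_index_template(index_pattern, analyzer_type):
    """Build the body of an index template (hypothetical helper).

    The body would be sent with PUT /_template/<name>; sending it is
    left out so that the sketch runs without a cluster.
    """
    return {
        # Older template syntax; newer versions use "index_patterns".
        "template": index_pattern,
        "settings": {
            "analysis": {
                "analyzer": {
                    # Analyzer used when a field does not name one itself.
                    "default": {"type": analyzer_type}
                }
            }
        },
    }


# Assumption: indices for the [fs1] source have names starting with "fs1".
template_body = build_index_template("fs1*", "german")
print(json.dumps(template_body, indent=2))
```

One template per source would then allow different analyzer settings for different sources, provided the index names can be told apart by pattern.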

filesystem vs filesystem

We need to clarify the use of the term filesystem in the documentation or, better, replace it with something else.

"Document source repository" would be one option.
We should probably also add a glossary.

Create preview thumbnails for documents

Since Tika unfortunately doesn't seem to spend any resources on that (https://issues.apache.org/jira/browse/TIKA-90), I propose we introduce a thumbnailing step into the indexing framework.

The thumbnails should be indexed as a base64 encoded field.

As a first step it would be sufficient to do this for images, but I believe we should think of other document types as well, e.g.:

first page thumbnails for office documents and pdf,
waveforms for audio,
still-images (or gif?) for video.

This leads to further requirements such as also indexing the dimensions and the content_type of the thumbnail.
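
A minimal sketch of what the indexed thumbnail fields could look like, assuming the thumbnail bytes have already been produced by some thumbnailing step. The function and all field names (thumbnail, thumbnail_width, thumbnail_height, thumbnail_content_type) are hypothetical and not part of es-nozzle.

```python
import base64


def thumbnail_fields(image_bytes, width, height, content_type):
    """Return the document fields describing a thumbnail (hypothetical names)."""
    return {
        # base64 text is safe to embed in a JSON document.
        "thumbnail": base64.b64encode(image_bytes).decode("ascii"),
        "thumbnail_width": width,
        "thumbnail_height": height,
        "thumbnail_content_type": content_type,
    }


# Placeholder bytes (just the PNG signature) stand in for a real thumbnail.
fields = thumbnail_fields(b"\x89PNG\r\n\x1a\n", 64, 48, "image/png")
print(fields)
```

Storing the dimensions and content type next to the encoded data lets a client render the thumbnail without decoding it first.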

index size of documents

It is desirable to facet/filter documents by size, so the size of each document needs to be indexed.
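
One possible way to obtain the value is sketched below. The function name and the "size" field name are assumptions for illustration; the size is taken in bytes from the filesystem metadata.

```python
import os
import tempfile


def size_field(path):
    """Return a document fragment holding the file size in bytes."""
    return {"size": os.path.getsize(path)}


# Demonstrate on a throwaway file containing five bytes.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name

sample_doc = size_field(sample_path)
os.remove(sample_path)
print(sample_doc)
```

Indexing the value as a number rather than a string is what would make range filters and range facets on size possible.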

Prepend 'parent' value with filesystem key

For facet views on the parent field, it is desirable to have the filesystem's name as the top-level element. Imagine two filesystems

[fs1]
path = /home/username/documents
…
[fs2]
path = /home/username/ebooks
…

that may both have a direct subfolder named /epub. If you search and facet across both indices, you end up with an ambiguous facet entry with term = "/epub", since facet results do not include the index name.

There are certainly use cases where you want that sort of unification, but I believe the more common requirement is to be able to tell apart documents originating from two (or more) filesystems.

Hence my proposal: the parent field is prefixed with the filesystem's descriptor key, so in this example we would have two entries in the facets, namely term = "/fs1/epub" and term = "/fs2/epub".

(Note: to get back the unifying behavior of the current version, one would use a script field for the facets. A script field could likewise produce the disambiguating behavior proposed here from the current version.)
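
The proposed prefixing can be sketched as follows. The function name is hypothetical, and the handling of the filesystem root (mapping "/" to "/fs1") is an assumption that the proposal itself does not spell out.

```python
def prefix_parent(fs_key, parent):
    """Prepend the filesystem key to a parent path such as "/epub"."""
    relative = parent.strip("/")
    if not relative:
        # Assumption: the filesystem root maps to the key alone.
        return "/" + fs_key
    return "/" + fs_key + "/" + relative


# The two filesystems from the example no longer collide.
print(prefix_parent("fs1", "/epub"))
print(prefix_parent("fs2", "/epub"))
```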

[dev] [git:master] ~/nozzle/ % lein jar  
Warning: specified :main without including it in :aot. 
Implicit AOT of :main will be removed in Leiningen 3.0.0. 
If you only need AOT for your uberjar, consider adding :aot :all into your
:uberjar profile instead.
Release versions may not depend upon snapshots. 
Freeze snapshots to dated versions or set the LEIN_SNAPSHOTS_IN_RELEASE environment variable to override.