
es-nozzle's Introduction


es-nozzle

es-nozzle can be used to index documents from the local filesystem or from network shares. It's similar in purpose to dadoonet's filesystem river, but it's not an elasticsearch plugin. Instead, es-nozzle uses RabbitMQ to provide a fault-tolerant and scalable system for synchronizing filesystems into an elasticsearch cluster.

Please visit http://brainbot.com/es-nozzle/doc/ for detailed documentation.

source code

The es-nozzle source code is hosted on github: https://github.com/brainbot-com/es-nozzle

es-nozzle is written in Clojure and uses Leiningen as its build system. To build from source, install Leiningen and run lein uberjar.

Downloads

Releases

es-nozzle releases can be downloaded from http://brainbot.com/es-nozzle/download/

Please follow this link to view the documentation of the latest es-nozzle release

Snapshots

Current development snapshots of es-nozzle are available from http://brainbot.com/es-nozzle/snapshots

Please follow this link to view the documentation of the latest snapshot release

License

Copyright © 2013-2014 brainbot technologies AG

Distributed under the Apache License, Version 2.0

es-nozzle's People

Contributors

heikoheiko, konradkonrad, schmir, v0lk3r


es-nozzle's Issues

filesystem vs filesystem

we need to clarify the use of filesystem in the documentation or better replace it with something else.

document source repository would be one option.
we should probably also add a glossary.

Create preview thumbnails for documents

Since tika unfortunately doesn't seem to spend any resources on that (https://issues.apache.org/jira/browse/TIKA-90), I propose we introduce some thumbnailing process into the indexing framework.

The thumbnails should be indexed as a base64 encoded field.

As a first step it would be sufficient to do this for images, yet I believe we should think of other document types as well, e.g.:

  • first page thumbnails for office documents and pdf,
  • waveforms for audio,
  • still-images (or gif?) for video.

This leads to further requirements such as also indexing the dimensions and the content_type of the thumbnail.
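A minimal sketch of what the indexed thumbnail data could look like, assuming the thumbnail bytes have already been produced by some other tool. The function name and all field names (thumbnail, thumbnail_width, thumbnail_height, thumbnail_content_type) are hypothetical and not part of es-nozzle:

```python
import base64


def thumbnail_fields(thumbnail_bytes, width, height, content_type):
    """Build the thumbnail part of an index document.

    All field names are hypothetical; es-nozzle does not define them.
    """
    return {
        # elasticsearch documents are JSON, so the raw bytes are base64 encoded
        "thumbnail": base64.b64encode(thumbnail_bytes).decode("ascii"),
        "thumbnail_width": width,
        "thumbnail_height": height,
        "thumbnail_content_type": content_type,
    }


# Placeholder bytes stand in for a real image here.
fields = thumbnail_fields(b"\x89PNG", 64, 48, "image/png")
```

Keeping the dimensions and content type next to the encoded data would let a client render the thumbnail without decoding it first.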

Prepend 'parent' value with filesystem key

For facet views on the parent field, it is desirable to have the filesystem's name as the top-level element. Imagine two filesystems

[fs1]
path = /home/username/documents
…
[fs2]
path = /home/username/ebooks
…

that may both have a direct subfolder named /epub. If you search and facet across both indices, you end up with an ambiguous facet entry with term = "/epub", since facet results do not include the index name.

There are certainly use cases where you want that sort of unification, but I believe the more general assumption is that one wants to be able to tell which of two (or more) filesystems an entry came from.

Hence my proposal is: the parent field is prefixed with the filesystem's descriptor key, so in this example we would have two entries in the facets, namely term = "/fs1/epub" and term = "/fs2/epub".

(Note: to get back the unifying behavior of the current version, one could use a script field for the facets; a script field could likewise produce the disambiguating behavior proposed here from the current version.)
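The proposed prefixing could be sketched roughly like this. The function name qualified_parent is hypothetical, and the handling of the root directory is an assumption that the proposal itself does not spell out:

```python
def qualified_parent(fs_key, parent):
    """Prefix a parent path with the key of the filesystem it belongs to.

    fs_key is the section name from the configuration, e.g. "fs1".
    Assumption: the root directory "/" maps to "/<fs_key>".
    """
    rest = parent.strip("/")
    return "/" + fs_key + ("/" + rest if rest else "")


# The two filesystems from the example no longer collide in a facet:
terms = [qualified_parent("fs1", "/epub"), qualified_parent("fs2", "/epub")]
```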

add flow control

we need some kind of flow control in order to not store an unlimited number of messages in rabbitmq.
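One possible approach, sketched here only as an illustration and not as a description of how es-nozzle works, is a gate with a high and a low watermark: producers stop publishing once the queue depth reaches the high mark and resume only after it has drained to the low mark. The class name and thresholds are hypothetical, and the caller is assumed to obtain queue_depth itself, for instance from the message count RabbitMQ reports for a queue:

```python
class WatermarkGate:
    """Decide whether a producer may publish, based on queue depth.

    Hypothetical helper: pause at `high`, resume at `low`. Using two
    thresholds avoids rapid flapping around a single limit.
    """

    def __init__(self, high, low):
        if not 0 <= low < high:
            raise ValueError("need 0 <= low < high")
        self.high = high
        self.low = low
        self.paused = False

    def may_publish(self, queue_depth):
        if self.paused:
            # stay paused until the queue has drained to the low watermark
            if queue_depth <= self.low:
                self.paused = False
        elif queue_depth >= self.high:
            self.paused = True
        return not self.paused


gate = WatermarkGate(high=1000, low=200)
```

A producer would call may_publish before each batch and sleep for a while when it returns False.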

build proper distribution

we should probably build an archive containing the documentation, a README, the standalone jar archive and a short shell/.bat wrapper.

document permission handling

we need to document how permissions are represented in elasticsearch and add some example queries showing how to search with permissions.

get rid of snapshot dependencies

lein complains

[dev] [git:master] ~/nozzle/ % lein jar  
Warning: specified :main without including it in :aot. 
Implicit AOT of :main will be removed in Leiningen 3.0.0. 
If you only need AOT for your uberjar, consider adding :aot :all into your
:uberjar profile instead.
Release versions may not depend upon snapshots. 
Freeze snapshots to dated versions or set the LEIN_SNAPSHOTS_IN_RELEASE environment variable to override.
