JesterJ

A highly flexible, scalable document ingestion system designed for search.

Builds are run on infrastructure kindly donated by Crave.io

The problem

Frequently, search projects start by feeding a few documents manually to a search engine, often via the "just for testing" built-in processing features of Solr such as SolrCell or post.jar. These features are documented and included to help the user get a feel for what they can do with Solr with a minimum of painful setup. That's how it should be for first explorations.

All too often, users who don't know any better, and who are perhaps misled by the fact that these interfaces are documented in the reference manual (assuming that anything documented must be "the right way" to do it), continue developing their search system by automating the use of those same interfaces. In fairness to those users, some older versions of the Solr Reference Guide failed to identify the "just for testing" nature of these interfaces, sometimes because it took a while for the community to realize the pitfalls associated with them.

Unfortunately, large-scale ingestion of documents for search is non-trivial, and those indexing interfaces are not meant for production use. The usual result is that the approach works "ok" for a small test corpus and then becomes unstable on a larger production corpus. The code written to feed such interfaces often needs to be repeated for several types of documents or for various document formats, which easily leads to duplication and cut-and-paste copying of common functionality. Also, after investing substantial engineering effort to get such a solution working on a large corpus, users often discover that they have no way to recover if indexing fails partway through. In the worst cases the failure is related to the size of the corpus, and failures become increasingly common as the corpus grows, until the chance of completing an indexing run is small and, if the problem is allowed to fester, the system eventually cannot be indexed or upgraded at all. The result is a painful and potentially expensive set of growing pains.

JesterJ's solution

JesterJ endeavors to make it easy to start with a robust, full-featured indexing infrastructure, so that you don't have to reinvent the wheel. JesterJ is meant to be a system you won't need to abandon until you are working with extremely large numbers of documents (and hopefully by that point you are already making good profits that can pay for a large custom solution!). A variety of reusable processing components are provided, and writing your own custom processors is as simple as implementing a four-method interface and following some simple guidelines.

Often the first version of a system for indexing documents into Solr or another search engine is fairly linear and straightforward, but as time passes, features and enhancements add complexity. Other times, the system is complex from the very start, possibly because search is being added to an existing system. JesterJ is designed to handle complex indexing scenarios. Consider the following hypothetical indexing workflow:

JesterJ handles such scenarios with a single centralized processing plan, and will ensure that if the system is unplugged, you won't get a second message about an order received. The default mode for JesterJ is to ensure at-most-once delivery for steps that are not marked safe or idempotent. Safe steps do not have external effects, and idempotent steps may be repeated en route to the final processing end point.
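The delivery rule described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not JesterJ's actual API: the `StepKind` enum, the `RecoveryPolicy` class, and the `shouldReplay` method are hypothetical names invented here, and the sketch assumes the system has recorded whether a step was already started for a given document before the failure.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical classification of a step; not part of JesterJ's API. */
enum StepKind {
    SAFE,        // no external effects
    IDEMPOTENT,  // has external effects, but repeating them is harmless
    OTHER        // has external effects that must not be repeated
}

/** Hypothetical recovery rule illustrating at-most-once delivery. */
final class RecoveryPolicy {
    private static final Set<StepKind> REPEATABLE =
            EnumSet.of(StepKind.SAFE, StepKind.IDEMPOTENT);

    private RecoveryPolicy() { }

    /**
     * Decides whether a step may run for a document after a restart.
     *
     * @param kind           how the step is marked
     * @param alreadyStarted whether the step was recorded as started for
     *                       this document before the failure
     */
    static boolean shouldReplay(StepKind kind, boolean alreadyStarted) {
        if (!alreadyStarted) {
            return true; // never attempted, so this would be the first delivery
        }
        // Attempted before: repeat only if the step is marked safe or idempotent.
        return REPEATABLE.contains(kind);
    }
}
```

In the order example, a step that sends an "order received" message would be neither safe nor idempotent, so under this rule it is not run again once it has been recorded as started.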

See the website and the documentation for more information.

Getting Started

Please see the documentation in the wiki

Project Status

Current release: 1.0-beta2. (Very stale, not recommended)

Recommended: Build from master branch

cd /code/ingest; ./gradlew packageUnoJar
Use /code/ingest/build/libs/jesterj-ingest-1.0-SNAPSHOT-node.jar
Ask on Discord if you have issues with the build

Next Release: 1.0-beta3

This next release will be a true feature-locked beta, suitable for use with all the latest features.

NOTE: The current code and the upcoming 1.0 release are expected to support any design and load that can be serviced by a single machine. Scaling across many machines is a priority for future releases, but is not yet available. JesterJ is explicitly designed to take advantage of machines with many processors. Automatic scaling of threads per step based on load is planned for 1.1 (current estimate), but in 1.0 you can design your plan with duplicates of your slowest step to alleviate bottlenecks. This is better than linear pipeline-based systems, which are limited by whatever step is slowest, and for which the only way to speed up is to duplicate everything, which makes fault tolerance extremely difficult to manage.
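The idea of duplicating only the slowest step can be illustrated with standard `java.util.concurrent` classes. This is a sketch under assumptions, not how a JesterJ plan is actually declared: `DuplicatedStep`, `run`, and the `copies` parameter are hypothetical names, and a fixed thread pool stands in for several copies of one step sharing the same input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical illustration; these names are not JesterJ APIs. */
final class DuplicatedStep {

    private DuplicatedStep() { }

    /**
     * Applies one slow step to every document using {@code copies} parallel
     * workers and returns the results in input order.
     */
    static <I, O> List<O> run(List<I> docs, Function<I, O> slowStep, int copies) {
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        try {
            List<Future<O>> pending = new ArrayList<>();
            for (I doc : docs) {
                Callable<O> task = () -> slowStep.apply(doc);
                pending.add(pool.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : pending) {
                results.add(future.get()); // waits for that document to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for step", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("step failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Only the bottleneck step is multiplied here; the steps before and after it stay single, which is the contrast the note draws with duplicating an entire linear pipeline.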

JDK versions

Presently only JDK 11 has been tested regularly. Any distribution of JDK 11 should work. Support for Java 17 and future LTS versions is planned for future releases.

Discord Server

Discuss features, ask questions, etc. on Discord: https://discord.gg/RmdTYvpXr9

Features:

In this release we have the following features:

Ability to visualize the structure of your plan (.dot or .png format; see the example from the unit tests)
Simple filesystem scanner for locally mounted drives (replacement for post.jar)
JDBC scanner (replacement for Data Import Handler!)
Scanners can remember what documents they've seen (or not, boolean flag)
Scanners can recognize updated content (or not, boolean flag)
Send to Solr processor with tunable batch sizes
Tika processor to extract content from Word/PDF/xml/html, etc (Replacement for SolrCell!)
Stax extract processor for dissecting xml documents directly.
Copy field processor to rename source fields to desired index field
Regexp replace processor to edit field content, or drop fields that don't match
Split field processor to split delimited values for multi-value fields
Drop field processor to get rid of annoying excess fields.
Field template processor for composing field content using a velocity template
URL encode processor to encode the value of a field and make it safe for use in URLs
Fetch URL processor for acquiring or enhancing content by contacting other systems
Log and drop processor for when you identify an invalid document
Date Reformat processor, because dates, formatting... always. (sigh)
Human Readable File Size processor
Solr sender to send documents to Solr in batches.
Pre-Analyze processor to move Solr analysis workload out of Solr (just give it your schema.xml!)
Embedded Cassandra server (no need to install Cassandra yourself!)
Cassandra config and data location configurable, defaults to ~/.jj/cassandra
Support for fault tolerance by writing status change events to the embedded Cassandra server
Initial API/process for user written document processors. (see documentation)
60% test coverage (jacoco)
Simple, single java file to configure everything, non-java programmers need only follow a simple example (for use cases not requiring custom code)
If you DO need custom code, that code can be packaged as an uno-jar to provide all required dependencies and escape from any library versions that JesterJ uses! You only have to deal with your OWN jar hell, not ours! Of course, you can also just rely on whatever we already provide. The classloaders for custom code prefer your uno-jar and then fall back to whatever JesterJ has available on its classpath.
Runnable example to execute a plan that scans a filesystem, and indexes the documents in solr.
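To make a few of the field-level features above concrete, here is a sketch in plain Java of the kind of transformation that split-field, regexp-replace, URL-encode, and human-readable-file-size processors perform. It uses only JDK classes; `FieldTransforms` and its method names are hypothetical and are not JesterJ's processor classes or configuration. The choice of 1024-based units labeled KB, MB, and so on is also an assumption made for the illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical helpers; JesterJ's own processors are configured differently. */
final class FieldTransforms {

    private FieldTransforms() { }

    /** Splits a delimited value into trimmed, non-empty values for a multi-value field. */
    static List<String> split(String value, String delimiterRegex) {
        return Arrays.stream(value.split(delimiterRegex))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    /** Replaces every match of the regular expression in the field content. */
    static String regexpReplace(String value, String regex, String replacement) {
        return value.replaceAll(regex, replacement);
    }

    /** Encodes a field value so it is safe to embed in a URL query string. */
    static String urlEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Formats a byte count using 1024-based units (an assumption of this sketch). */
    static String humanReadableSize(long bytes) {
        if (bytes < 1024) {
            return bytes + " B";
        }
        String[] units = {"KB", "MB", "GB", "TB"};
        double size = bytes / 1024.0;
        int unit = 0;
        while (size >= 1024 && unit < units.length - 1) {
            size /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.1f %s", size, units[unit]);
    }
}
```

For example, `FieldTransforms.split("red, green,,blue", ",")` yields the three values `red`, `green`, and `blue`, which could then populate a multi-value index field.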

TODO for 1.0 final release

Doc updates
Remaining issues
Beta release, testing.

Release 1.0 is intended to be usable for single-node systems, and therefore suitable for small to medium-sized projects (tens of millions or maybe low hundreds of millions of documents).

Road Map

The best guess at any time of what will be in future releases is given by the milestone filters on our issues page.
