JesterJ

A new, highly flexible, highly scalable document ingestion system.

See the web site and the documentation for more information.

Getting Started

Please see the documentation in the wiki.

Project Status

Current release version: 0.1. The current HEAD revision is much improved, so please build from source until 0.2 is released.

Features:

This release has the following features:

  • Embedded Cassandra server
  • Cassandra config and data location configurable, defaults to ~/.jj/cassandra
  • Initial support for fault tolerance via logging statuses to the embedded Cassandra server (WIP)
  • Log4j appender to write to Cassandra where desired
  • Initial API/process for user-written steps (see documentation)
  • 40% test coverage (jacoco)
  • Simple filesystem scanner
  • Copy Field processor
  • Date Reformat processor
  • Human Readable File Size processor
  • Tika processor to extract content
  • Solr sender to send documents to Solr in batches
  • Runnable example to execute a plan that scans a filesystem and indexes the documents in Solr

Release 0.1 is intended to be the smallest functional unit. Plans and steps must be assembled in code and run locally; only a single node is supported. Indexed documents will have fields for modification time, file name, and file size.
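
To make the three indexed fields concrete, here is a minimal sketch of how they might be collected for one file, along with one possible human-readable rendering of the size. It uses only the JDK and is not JesterJ's API: the class name `FileFieldsSketch`, the field keys `file_name`, `file_size`, and `mod_time`, and the `humanReadable` helper are all hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of JesterJ.
final class FileFieldsSketch {

    // Collect file name, size in bytes, and modification time for one file.
    // The field keys are assumptions, not JesterJ's actual field names.
    static Map<String, String> fieldsFor(Path file) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("file_name", file.getFileName().toString());
        fields.put("file_size", Long.toString(Files.size(file)));
        fields.put("mod_time", Files.getLastModifiedTime(file).toInstant().toString());
        return fields;
    }

    // Render a byte count with binary units (1 KiB = 1024 bytes).
    static String humanReadable(long bytes) {
        if (bytes < 1024) {
            return bytes + " B";
        }
        String[] units = {"KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
        double value = bytes;
        int unit = -1;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.1f %s", value, units[unit]);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("jj-sketch", ".txt");
        Files.write(tmp, new byte[2048]);
        System.out.println(fieldsFor(tmp));
        System.out.println(humanReadable(Files.size(tmp)));
        Files.delete(tmp);
    }
}
```

Keeping the raw byte count alongside the formatted string is a deliberate choice in this sketch: the number stays sortable while the formatted value is for display.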

Progress for 0.2

  • JDBC scanner
  • Cassandra based FTI
  • Document hashing to detect changed docs (any scanner)
  • Node and Transport style senders for Elastic
  • Ability to load Java-based config from a jar file (experimental)
  • More processors: Fetch URL, Regex Replace Value, Delete Field, Parse Field as Template, URL Encode Field

The Java config feature is experimental. I wanted to use what I had built for a project, but the lack of externalized configuration was a blocker. It was a quick fix, but it's turning out to be quite pleasant to work with. The downside is that I'm not sure how it would carry forward to later stages of the project, so it might still go away. Feedback welcome.
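
The document hashing item in the 0.2 list can be illustrated with a short sketch. This is not JesterJ's implementation: it assumes SHA-256 over the raw document bytes and keeps the last seen hash per document id in an in-memory map, where a real system would persist it. The names `ChangeDetectorSketch`, `hasChanged`, and `sha256Hex` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of JesterJ.
final class ChangeDetectorSketch {

    // Last hash seen for each document id (in memory for this sketch).
    private final Map<String, String> lastHashById = new HashMap<>();

    // Hex-encoded SHA-256 digest of the document content.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Returns true for a new document or one whose content hash differs
    // from the previous scan, and records the new hash either way.
    boolean hasChanged(String docId, byte[] content) {
        String current = sha256Hex(content);
        String previous = lastHashById.put(docId, current);
        return !current.equals(previous);
    }

    public static void main(String[] args) {
        ChangeDetectorSketch detector = new ChangeDetectorSketch();
        byte[] v1 = "first version".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "second version".getBytes(StandardCharsets.UTF_8);
        System.out.println(detector.hasChanged("doc-1", v1)); // new document
        System.out.println(detector.hasChanged("doc-1", v1)); // same content
        System.out.println(detector.hasChanged("doc-1", v2)); // edited content
    }
}
```

Because the check depends only on an id and the content bytes, the same logic could sit behind any scanner, which matches the "(any scanner)" note in the list.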

TODO for 0.2

  • 50% test coverage
  • Build a demo jar that can be run to demonstrate the Java config usage
  • Demo/tutorial to demonstrate indexing a database and a filesystem simultaneously into Solr
  • Up-to-date docs in the wiki
  • Publish jars on Maven Central

Release 0.2 is intended to be the minimum usable single node system.

TODO for 0.3

  • Serialized format for a plan/steps.
  • JINI Registrar
  • Register Node Service on JINI Registrar
  • Display nodes visible in control web app.
  • JINI Service to accept serialized format
  • Ability to build a plan in web-app.
  • 60% test coverage
  • Availability on Maven Central
  • Build and run the 0.2 scenario via the control web-app.

Release 0.3 is intended to be similar to 0.2 but with a very basic web control UI. At this point it should be possible to install the war file, start a node,

TODO for 1.0

  • Secure connections among nodes and with the web app (credential provider)
  • Ensure nodes namespace their Cassandra data dirs to avoid disasters if more than one node runs per user account
  • Cassandra cluster formation
  • Pass Documents among nodes using JavaSpaces
  • Support for adding helper nodes that scale a step or several steps horizontally.
  • Make the control UI pretty.

Release 1.0 is intended to be the first release to spread work across nodes.

What is FTI?

FTI stands for Fault Tolerant Indexing. For our purposes this means that once a scanner is pointed at a document source, it is guaranteed to eventually do one of the following things with every qualifying document:

  • Process the document and send it to Solr.
  • Log an error explaining why the document processing failed.

It will do this no matter how many nodes fail, or how many times Solr is rebooted.
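
The guarantee can be pictured as a per-document status that only ever ends in one of two terminal states. The following is a single-process sketch of that idea, not JesterJ's design: the status names, the `Sender` interface, and the rule that an `IOException` means "destination unavailable, retry later" while a `RuntimeException` means "this document cannot be processed" are all assumptions made for illustration. A real implementation would also need to persist the statuses so they survive a node failure.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of JesterJ.
final class FtiSketch {

    enum Status { PENDING, INDEXED, ERROR }

    // Stand-in for whatever delivers a document to the search engine.
    interface Sender {
        void send(String docId) throws IOException;
    }

    private final Map<String, Status> statuses = new LinkedHashMap<>();
    private final List<String> errorLog = new ArrayList<>();

    // Called when a scanner finds a qualifying document.
    void discovered(String docId) {
        statuses.putIfAbsent(docId, Status.PENDING);
    }

    // One pass over every document that has not reached a terminal state.
    void runPass(Sender sender) {
        for (Map.Entry<String, Status> entry : statuses.entrySet()) {
            if (entry.getValue() != Status.PENDING) {
                continue;
            }
            try {
                sender.send(entry.getKey());
                entry.setValue(Status.INDEXED);
            } catch (IOException e) {
                // Assumed transient: the document stays PENDING for the next pass.
            } catch (RuntimeException e) {
                // Assumed permanent: record why processing failed.
                errorLog.add(entry.getKey() + ": " + e.getMessage());
                entry.setValue(Status.ERROR);
            }
        }
    }

    Status status(String docId) {
        return statuses.get(docId);
    }

    List<String> errors() {
        return errorLog;
    }

    public static void main(String[] args) {
        FtiSketch fti = new FtiSketch();
        fti.discovered("a.txt");
        fti.discovered("b.txt");
        // First pass: the destination is down, so nothing is lost or failed.
        fti.runPass(id -> { throw new IOException("destination down"); });
        // Second pass: the destination is back, but one document is bad.
        fti.runPass(id -> {
            if (id.equals("b.txt")) {
                throw new IllegalStateException("unparseable");
            }
        });
        System.out.println(fti.status("a.txt") + " " + fti.status("b.txt") + " " + fti.errors());
    }
}
```

The point of the sketch is that an outage never moves a document out of `PENDING`, so repeated passes eventually land every document in either `INDEXED` or `ERROR`.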

Contributors

nsoft, fsparv
