GithubHelp home page GithubHelp logo

rob7hunter / lucandra Goto Github PK

View Code? Open in Web Editor NEW

This project forked from tjake/solandra

0.0 1.0 1.0 21.67 MB

Lucandra = Lucene + Cassandra

License: Apache License 2.0

Shell 0.48% JavaScript 0.99% Java 98.52%

lucandra's Introduction

Lucandra - A Cassandra Backend for Lucene/Solr

By Jake Luciani - http://twitter.com/tjake
==============================================

Lucandra provides a Lucene IndexReader and IndexWriter that interfaces with Cassandra.
Solr is also supported.

Support
=======
Mailing List:
     http://groups.google.com/group/lucandra-user

Useful articles:
     http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

Slides:
     http://www.slideshare.net/otisg/lucandra

You can see Lucandra in action here:
     http://sparse.ly


Bookmarks demo:
==============
Lucandra includes a delicious like bookmarks demo to get you started. to build run the following:

1. Setup Cassandra 0.6 with storage-conf.xml in config

2. ant lucandra.jar
3. ant test -Dcassandra.host=127.0.0.1 -Dcassandra.port=9160 -Dcassandra.framed=false

#edit run-demo with appropriate settings (delicious clone)
4. run_demo.sh -index bookmarks.tsv
5. run_demo.sh -search title:linu*


Solr example:
=============
Lucandra also supports Solr. to build run the following:

1. Setup Cassandra 0.6 with storage-conf.xml in config
2. ant lucandra.jar
3. cd solr-example; java -jar start.jar
4. cd exampledocs; ./post.sh *.xml
5. surf to http://localhost:8983/solr/admin/


Background
==========

Storing an inverted index in Cassandra was the initial use-case for Cassandra at Facebook.
The Cassandra wiki discusses this:

"You can think of each super column name as a term and the columns within as the docids
with rank info and other attributes being a part of it. If you have keys as the userids
then you can have a per-user index stored in this form. This is how the per user index
 for term search is laid out for Inbox search at Facebook."

Initially we implemented Lucene support with supercolumn as described but we ran into
a major scaling issue when we tried to index all of wikipedia.
Turns out Cassandra keeps the supercolumn in memory for a given key.
Also all columns for a key are tied to one cassandra node so we don't gain much scalability this way.

Thankfully Cassandra recently added support for distributed ordered keys that
allows us to use keys to store index terms without supercolumns.

Implementation Notes
======================

The Lucandra Cassandra config looks like this.

<Keyspaces>
  <Keyspace Name="Lucandra">
      <ColumnFamily CompareWith="BytesType" Name="TermVectors"/>
      <ColumnFamily CompareWith="BytesType" Name="Documents"/>
  </Keyspace>
</Keyspaces>

* To get the most out of Lucandra you should use a OrderedPreservingParitioner.
However if you really want to use the RandomPartitioner you can but you'll loose
some of lucene's most powerful features (Wildcards,Sort,Range Queries).

* Documents Ids are currently random and autogenerated.

* Term keys and Document Keys are encoded as follows (using a random binary delimiter)

      Term Key                     col name         value
      "index_name/field/term" => { documentId , position vector }

      Document Key
      "index_name/documentId" => { fieldName , value }


The IndexReader caches terms aggressively during search and tries to avoid lots of back and forth with Cassandra.


What Works
==========
* Real-Time indexing (documents become available almost immediately)
* No optimizing
* Search
* Sort
* Range Queries
* Delete
* Wildcards and other Lucene magic
* Faceting/Highlighting

What's Missing (for now)
==============
* You can't walk the documents with index reader


What's Slow (for now)
=====================
* Indexes with many documents and very dense terms

lucandra's People

Contributors

rob7hunter avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.