tokee / lucene

Experiments with low memory overhead sorting and faceting. Read the Wiki for details.

License: Apache License 2.0

Shell 0.01% JavaScript 0.42% Perl 0.45% Java 99.12%

lucene's Introduction

Lucene README file

INTRODUCTION

Lucene is a Java full-text search engine.  Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

The Lucene web site is at:
  http://lucene.apache.org/

Please join the Lucene-User mailing list by sending a message to:
  [email protected]

FILES

lucene-core-XX.jar
  The compiled lucene library.

lucene-demos-XX.jar
  The compiled simple example code.

luceneweb.war
  The compiled simple example Web Application.

contrib/*
  Contributed code which extends and enhances Lucene, but is not
  part of the core library.  Of special note are the JAR files in the
  analyzers directory, which contain various analyzers that people may
  find useful in place of the StandardAnalyzer.

docs/index.html
  The contents of the Lucene website.

docs/api/index.html
  The Javadoc Lucene API documentation.  This includes the core
  library, the demo, as well as all of the contrib modules.

src/java
  The Lucene source code.

src/demo
  Some example code.

lucene's People

Contributors

cutting, erikhatcher, gsingers, hossman, jvanzyl, kojisekig, markrmiller, mikemccand, rmuir, rubys, s1monw, sigram, uschindler, yonik

lucene's Issues

Startup time is too long

For 10 million terms, the startup time is about 20 minutes on a fast machine. The primary cause is non-sequential term lookup and cache misses. Increasing the cache to hold all terms in memory works fine, but takes just as much memory as a standard Lucene sort.

Solution:

  • Process the array with term ordinals in cache-sized chunks
  • Sort each chunk with merge sort, which results in sequential lookup of the terms as there are no cache misses
  • Perform a merge of all the chunks by using a heap to determine the next term ordinal

Provided the number of chunks is smaller than the cache size, the end result is perfect cache usage and perfect sequential access for the merge sort step. The term access pattern for the heap-based merge at the end will depend on how closely the locale sort order is aligned with Unicode sort order. This should work fairly well for a number of languages and probably not well at all for others. Experiments will show how this turns out.
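
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the Exposed implementation: the class name ChunkedOrdinalSort, the method sortOrdinals and the in-memory String[] standing in for cached term lookup are all assumptions made for the example. The comparator would typically be a java.text.Collator for locale-based sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: names and the String[] term source are assumptions.
final class ChunkedOrdinalSort {

    /** Returns the term ordinals, ordered by comparing the terms they point to. */
    static int[] sortOrdinals(String[] terms, int chunkSize,
                              Comparator<? super String> order) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Boxed ordinals, because Arrays.sort with a Comparator needs objects.
        Integer[] ordinals = new Integer[terms.length];
        for (int i = 0; i < ordinals.length; i++) {
            ordinals[i] = i;
        }
        Comparator<Integer> byTerm = (a, b) -> order.compare(terms[a], terms[b]);

        // Steps 1 and 2: sort each cache-sized chunk on its own. Only the
        // terms of the current chunk are looked up while it is being sorted.
        List<int[]> cursors = new ArrayList<>(); // each entry: {position, end}
        for (int start = 0; start < ordinals.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, ordinals.length);
            Arrays.sort(ordinals, start, end, byTerm); // merge sort variant for objects
            cursors.add(new int[] {start, end});
        }

        // Step 3: merge the sorted chunks. The heap holds one cursor per
        // chunk, ordered by the term at the cursor's current position.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Math.max(1, cursors.size()),
                (x, y) -> byTerm.compare(ordinals[x[0]], ordinals[y[0]]));
        heap.addAll(cursors);

        int[] result = new int[terms.length];
        int out = 0;
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            result[out++] = ordinals[cursor[0]++];
            if (cursor[0] < cursor[1]) {
                heap.add(cursor); // chunk not exhausted: re-insert at its new head
            }
        }
        return result;
    }
}
```

With the terms {"delta", "alpha", "echo", "charlie", "bravo"}, a chunk size of 2 and natural String order, the call returns the ordinals [1, 4, 3, 0, 2]. Terms that compare as equal but sit in different chunks are not guaranteed to keep their original relative order in this sketch.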

Indexes with deletions do not sort correctly

The unit-test for Exposed aka LUCENE-2335 is now capable of testing sorting for indexes with deleted documents. The unit-test fails if there are deletions, stating that the order differs for Exposed vs. the default Lucene sorter. The cause has yet to be determined.

DocID -> Term might be faulty

While experimenting on an in-house index with ~5M terms in the sort field, the mapping from docIDs to Term seems wrong. The culprit is of course the docID->termOrder mapper.
