baeeq / spark-lucenerdd

This project is forked from zouzias/spark-lucenerdd

Spark RDD with Lucene's query capabilities

License: Apache License 2.0

Scala 99.39% Shell 0.61%

spark-lucenerdd

Spark RDD with Apache Lucene's query capabilities.

The main abstractions are special types of RDD called LuceneRDD, FacetedLuceneRDD and ShapeLuceneRDD, which instantiate a Lucene index on each Spark executor. These RDDs distribute search queries and aggregate search results between the Spark driver and its executors. Currently, the following queries are supported:

| Operation | Syntax | Description |
|-----------|--------|-------------|
| Term Query | LuceneRDD.termQuery(field, query, topK) | Exact term search |
| Fuzzy Query | LuceneRDD.fuzzyQuery(field, query, maxEdits, topK) | Fuzzy term search |
| Phrase Query | LuceneRDD.phraseQuery(field, query, topK) | Phrase search |
| Prefix Query | LuceneRDD.prefixSearch(field, prefix, topK) | Prefix search |
| Query Parser | LuceneRDD.query(queryString, topK) | Query parser search |
| Faceted Search | FacetedLuceneRDD.facetQuery(queryString, field, topK) | Faceted search |
| Record Linkage | LuceneRDD.link(otherEntity: RDD[T], linkageFct: T => searchQuery, topK) | Record linkage via Lucene queries |
| Circle Search | ShapeLuceneRDD.circleSearch((x,y), radius, topK) | Search within radius |
| Bbox Search | ShapeLuceneRDD.bboxSearch(lowerLeft, upperRight, topK) | Bounding box search |
| Spatial Linkage | ShapeLuceneRDD.linkByRadius(RDD[T], linkage: T => (x,y), radius, topK) | Spatial radius linkage |
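
The linkage function in the Record Linkage row maps each record to a search query. As a rough sketch, assuming that searchQuery is a query-parser string, such a function could look like the following. The Customer type, the Linkage object and the field names "name" and "city" are hypothetical and not part of spark-lucenerdd.

```scala
// Hypothetical record type, used only for illustration.
case class Customer(name: String, city: String)

object Linkage {
  /** Maps a record to a query-parser string.
    * Assumes the searched index has fields called "name" and "city",
    * that city is a single term, and that no special characters need escaping. */
  val linker: Customer => String =
    c => "name:\"" + c.name + "\" AND city:" + c.city
}
```

A function of this shape would be passed as linkageFct to LuceneRDD.link, together with the RDD of records to link and topK.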

Using the query parser, you can perform prefix queries, fuzzy queries, phrase queries, etc. For more information on using Lucene's query parser, see Query Parser.

For example, to run a prefix query for spar on the field named textField, use LuceneRDD.query("textField:spar*", 10).
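
Query strings like the one above can also be assembled programmatically. The helpers below are a minimal sketch and not part of spark-lucenerdd: the QueryStrings object and its method names are made up for illustration, they follow the classic Lucene query parser syntax, and they assume plain terms that need no escaping.

```scala
// Hypothetical helpers that build Lucene query-parser strings.
// No escaping of special characters is performed.
object QueryStrings {
  /** Prefix query: a trailing * matches every term starting with the prefix. */
  def prefixQuery(field: String, prefix: String): String =
    s"$field:$prefix*"

  /** Fuzzy query: ~N allows up to N edits (the classic parser accepts 0, 1 or 2). */
  def fuzzyQuery(field: String, term: String, maxEdits: Int): String = {
    require(maxEdits >= 0 && maxEdits <= 2, "maxEdits must be 0, 1 or 2")
    s"$field:$term~$maxEdits"
  }

  /** Phrase query: the phrase is wrapped in double quotes. */
  def phraseQuery(field: String, phrase: String): String =
    field + ":\"" + phrase + "\""
}
```

With these, QueryStrings.prefixQuery("textField", "spar") yields the string textField:spar*, which can be passed to LuceneRDD.query together with topK.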

Examples

Here are a few examples using LuceneRDD for full text search, spatial search and record linkage. All examples exploit Lucene's flexible query language. For spatial search, lucene-spatial and jts are required.

For more, check the wiki.

Linking

You can link against this library (for Spark 1.4+) in your program at the following coordinates:

Using SBT:

libraryDependencies += "org.zouzias" %% "spark-lucenerdd" % "x.y.z"

Using Maven:

<dependency>
    <groupId>org.zouzias</groupId>
    <artifactId>spark-lucenerdd_2.11</artifactId>
    <version>x.y.z</version>
</dependency>

This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages org.zouzias:spark-lucenerdd_2.10:0.0.XX

Unlike using --jars, using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.

This library is also published for Scala 2.11, so Scala 2.11 users should replace 2.10 with 2.11 in the spark-shell command above. The Maven snippet already uses the 2.11 artifact.

Compatibility

The project has the following compatibility with Apache Spark:

| spark-lucenerdd | Release Date | Spark compatibility | Notes | Status |
|-----------------|--------------|---------------------|-------|--------|
| 0.1.1-SNAPSHOT | | >= 1.4 | master | Under Development |
| 0.2.0 (stable) | 2016-09-26 | 2.0.0 | tag v0.2.0 | Released |
| 0.1.0 (stable) | 2016-09-26 | 1.4.x, 1.5.x, 1.6.x | tag v0.1.0 | Cross-released with 2.10/2.11 |

Project Status and Limitations

Currently the Lucene index is only stored in memory.

Implicit conversions for the primitive types (Int, Float, Double, Long, String) are supported. Moreover, implicit conversions for all product types (i.e., tuples and case classes) of the above primitives are supported. Implicits for tuples default the field names to "_1", "_2", "_3", ... following Scala's naming conventions for tuples.
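
To make the tuple naming convention concrete, here is a small pure-Scala sketch that pairs each tuple element with its positional name. It only illustrates the naming scheme and is not the library's conversion code; TupleFieldNames and namedFields are hypothetical names.

```scala
// Illustration only: shows which field name each tuple element would get
// under the "_1", "_2", ... convention described above.
object TupleFieldNames {
  /** Pairs every element of a tuple (or any Product) with its positional name. */
  def namedFields(p: Product): Seq[(String, Any)] =
    p.productIterator.zipWithIndex.map {
      case (value, i) => ("_" + (i + 1), value) // positions are 1-based
    }.toList
}
```

For the tuple ("spark", 42, 3.14) this gives the names "_1", "_2" and "_3", so a search on the first element would target the field "_1".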

Custom Case Classes

If you want to use your own case class with LuceneRDD, you can, provided that every member of the class has one of the primitive types (Int, Float, Double, Long, String).

For more details, see LuceneRDDCustomcaseClassImplicits under the tests directory.
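
As a rough illustration of this restriction, the sketch below checks at runtime whether all members of a case class instance are of a supported type. The library itself works through implicit conversions, so this check is only an assumption-laden illustration; Person, Tagged and SupportedMembers are hypothetical names.

```scala
// Hypothetical case classes, used only for illustration.
case class Person(name: String, age: Int, height: Double) // all members supported
case class Tagged(name: String, tags: List[String])       // List[String] is not supported

object SupportedMembers {
  /** True when every member is one of Int, Float, Double, Long or String. */
  def allSupported(p: Product): Boolean =
    p.productIterator.forall {
      case _: Int | _: Float | _: Double | _: Long | _: String => true
      case _ => false
    }
}
```

Here Person satisfies the restriction, while Tagged does not because of its List[String] member.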

Development

Install Java and SBT, then clone and build the project:

git clone https://github.com/zouzias/spark-lucenerdd.git
cd spark-lucenerdd
sbt compile assembly

The above creates an assembly JAR containing the spark-lucenerdd functionality under target/scala-*/spark-lucenerdd-assembly-*.jar.

To make spark-lucenerdd available, assemble the project and add the JAR to your Spark shell or submit scripts.
