Spark RDD with Apache Lucene's query capabilities.

The main abstractions are special types of RDD called `LuceneRDD`, `FacetedLuceneRDD` and `ShapeLuceneRDD`, which instantiate a Lucene index on each Spark executor. These RDDs distribute search queries and aggregate search results between the Spark driver and its executors. Currently, the following queries are supported (a short usage sketch follows the table):
| Operation | Syntax | Description |
|---|---|---|
| Term Query | `LuceneRDD.termQuery(field, query, topK)` | Exact term search |
| Fuzzy Query | `LuceneRDD.fuzzyQuery(field, query, maxEdits, topK)` | Fuzzy term search |
| Phrase Query | `LuceneRDD.phraseQuery(field, query, topK)` | Phrase search |
| Prefix Query | `LuceneRDD.prefixSearch(field, prefix, topK)` | Prefix search |
| Query Parser | `LuceneRDD.query(queryString, topK)` | Query parser search |
| Faceted Search | `FacetedLuceneRDD.facetQuery(queryString, field, topK)` | Faceted search |
| Record Linkage | `LuceneRDD.link(otherEntity: RDD[T], linkageFct: T => searchQuery, topK)` | Record linkage via Lucene queries |
| Circle Search | `ShapeLuceneRDD.circleSearch((x,y), radius, topK)` | Search within radius |
| Bbox Search | `ShapeLuceneRDD.bboxSearch(lowerLeft, upperRight, topK)` | Bounding box search |
| Spatial Linkage | `ShapeLuceneRDD.linkByRadius(RDD[T], linkage: T => (x,y), radius, topK)` | Spatial radius linkage |
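For a flavor of the API, here is a minimal sketch, assuming a `spark-shell` session where `sc` is the SparkContext; the data and field values are made up for illustration, and the import paths follow the package naming of the library's Maven coordinates:

```scala
import org.zouzias.spark.lucenerdd._          // implicit conversions for primitives, tuples and case classes
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical data: tuple fields are indexed under the names "_1", "_2" (see the implicits section below)
val cities = sc.parallelize(Seq(("berlin", "DE"), ("bern", "CH"), ("boston", "US")))
val luceneRDD = LuceneRDD(cities)

// Exact term search on the first tuple field, returning the top 10 hits
val exactHits = luceneRDD.termQuery("_1", "berlin", 10)

// The same index queried through Lucene's query parser (prefix query on "_1")
val prefixHits = luceneRDD.query("_1:ber*", 10)
```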
Using the query parser, you can perform prefix queries, fuzzy queries, etc. For more information on using Lucene's query parser, see Query Parser. For example, a prefix query for `spar` on the field named `textField` can be expressed as `LuceneRDD.query("textField:spar*", 10)`.
Here are a few examples using `LuceneRDD` for full-text search, spatial search and record linkage. All examples exploit Lucene's flexible query language. For spatial search, `lucene-spatial` and `jts` are required. For more, check the wiki.
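Below is a hedged sketch of record linkage via `link`. It assumes, as the table above suggests, that the linkage function maps each record of the other RDD to a Lucene query string; the data and the fuzzy-query construction are illustrative only:

```scala
import org.zouzias.spark.lucenerdd._
import org.zouzias.spark.lucenerdd.LuceneRDD

// Index of "clean" reference records (tuple fields are named "_1", "_2")
val countries = LuceneRDD(sc.parallelize(Seq(("germany", "DE"), ("greece", "GR"))))

// Noisy records to be linked against the index
val noisy = sc.parallelize(Seq("germny", "grece"))

// For each noisy record, issue a fuzzy query on field "_1" and keep the top 3 candidates
val linked = countries.link(noisy, (s: String) => s"_1:${s}~1", 3)
```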
You can link against this library (for Spark 1.4+) in your program at the following coordinates:
Using SBT:
libraryDependencies += "org.zouzias" %% "spark-lucenerdd" % "x.y.z"
Using Maven:
<dependency>
<groupId>org.zouzias</groupId>
<artifactId>spark-lucenerdd_2.11</artifactId>
<version>x.y.z</version>
</dependency>
This library can also be added to Spark jobs launched through `spark-shell` or `spark-submit` by using the `--packages` command line option. For example, to include it when starting the spark shell:
$ bin/spark-shell --packages org.zouzias:spark-lucenerdd_2.10:0.0.XX
Unlike using `--jars`, using `--packages` ensures that this library and its dependencies will be added to the classpath. The `--packages` argument can also be used with `bin/spark-submit`.
This library is cross-published for Scala 2.11, so 2.11 users should replace 2.10 with 2.11 in the commands listed above.
The project has the following compatibility with Apache Spark:
| spark-lucenerdd | Release Date | Spark compatibility | Notes | Status |
|---|---|---|---|---|
| 0.1.1-SNAPSHOT | | >= 1.4 | master | Under development |
| 0.2.0 (stable) | 2016-09-26 | 2.0.0 | tag v0.2.0 | Released |
| 0.1.0 (stable) | 2016-09-26 | 1.4.x, 1.5.x, 1.6.x | tag v0.1.0, cross-released for 2.10/2.11 | Released |
Currently the Lucene index is only stored in memory.
Implicit conversions for the primitive types (Int, Float, Double, Long, String) are supported. Moreover, implicit conversions for all product types (i.e., tuples and case classes) of the above primitives are supported. Implicits for tuples default the field names to "_1", "_2", "_3", ..., following Scala's naming conventions for tuples.
If you want to use your own custom class with `LuceneRDD`, you can do so provided that your class member types are one of the primitive types (Int, Float, Double, Long, String). For more details, see `LuceneRDDCustomcaseClassImplicits` under the tests directory.
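As an illustrative sketch (the case class and data below are hypothetical), a custom class whose members are all supported primitives can be indexed directly, with each member expected to be indexed under its own field name:

```scala
import org.zouzias.spark.lucenerdd._
import org.zouzias.spark.lucenerdd.LuceneRDD

// Hypothetical case class; all members are supported primitive types
case class Person(name: String, age: Int, email: String)

val people = sc.parallelize(Seq(
  Person("alice", 34, "alice@example.com"),
  Person("bob", 28, "bob@example.com")
))

// The implicit conversion indexes each case class field under its own name
val luceneRDD = LuceneRDD(people)

// Query by case class field name
val hits = luceneRDD.termQuery("name", "alice", 10)
```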
Install Java, SBT and clone the project
git clone https://github.com/zouzias/spark-lucenerdd.git
cd spark-lucenerdd
sbt compile assembly
The above will create an assembly jar containing the spark-lucenerdd functionality under `target/scala-*/spark-lucenerdd-assembly-*.jar`.
To make spark-lucenerdd available, you have to assemble the project and add the resulting JAR to your Spark shell or submit scripts.
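For instance, a spark-shell session could pick up the assembly jar with the `--jars` option, e.g. `bin/spark-shell --jars target/scala-2.11/spark-lucenerdd-assembly-x.y.z.jar` (the exact jar name depends on your Scala and project versions).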