A library for querying Elasticsearch with Apache Spark
This library is compatible with Spark-2.4.x and Elasticsearch-6.x
sbt clean assembly
bin/spark-shell --jars spark-elasticsearch-connector-assembly-1.0.0-SNAPSHOT.jar --conf spark.sql.extensions=org.apache.spark.sql.rzlabs.ElasticExtensionsBuilder
In spark-shell, a temp table could be created like this:
val df = spark.read.format("org.rzlabs.elastic")
.option("index", "news")
.option("host", "localhost:9200")
.load
df.createOrReplacTempView("news")
spark.sql("select title, publish_time from news where publish_time >= '2020-01-01T00:00:00' order by publish_time DESC limit 10 offset 10").show
or you can create a hive table:
spark.sql("""
create table news using org.rzlabs.elastic options (
index "news",
host "localhost:9200"
)
""")
option | required | default | description |
---|---|---|---|
host | No | localhost:9200 | Elasticsearch server host with http port |
index | Yes | null | Elaticsearch index name |
type | No | The first type definition in the specific index mappings | the type name in a specific index |
cacheIndexMappings | No | false | If re-pulling the index mappings or not when recreate the DF use the same options |
skipUnknownTypeFields | No | false | If skip unknown type fields or not |
debugTransformation | No | false | Log debug information about the transformations or not |
timeZoneId | No | GMT | Time zone id will affect the Timestamp type field |
dateTypeFormat | No | strict_date_optional_time||epoch_millis | Date type format specified for Date type in Elasticsearch. For more information, please visit: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/date.html |
nullFillNonexistentFieldValue | No | false | If using null value to fill the nonexistent field or not, if true, then use null to fill, otherwise throw ElasticIndexException when a field is nonexistent in a InternalRow |
defaultLimit | No | Int.MaxValue | The value of the setting index.max_result_window of Elasticsearch |
- Support fields projection and pruning.
- Support nested fields schema mapping (to StructField).
- Implement
MATCH
andMATCH_PHRASE
predicates which will be pushed down asmatch
andmatch_phrase
queries. - Implement
OFFSET
keyword parsing which will be pushed down asfrom
parameter to support pagination. Limit
operator will be pushed down assize
parameter.Offset
operator on top of Aggregate will be transformed toCollectOffsetExec
node in planning phase to support global pagination.Like
andRLike
predicates will be pushed down aswildcard
andregexp
queries.And
andOr
predicates will be pushed down asbool
query.EqualTo
predicate will be pushed down asterm
query.LessThan
,GreaterThan
etc. comparison predicates will be pushed down asrange
query.In
predicate will be pushed down asterms
query.Sort
operator will be pushed down assort
parameter.- Support Join/Union operator. You can join or union es index with other datasources without any performance degradation.
- Support
Aggregate
operator pushdown to supportagg
query in Elasticsearch to improve the aggregation performance. - Support
Insert
operation to support writting data to Elasticsearch index. - Support nested fields query.
- ...