GithubHelp home page GithubHelp logo

spark-elasticsearch-connector's Introduction

spark-elasticsearch-connector

A library for querying Elasticsearch with Apache Spark

Compatability

This library is compatible with Spark-2.4.x and Elasticsearch-6.x

Usage

Compile

sbt clean assembly

Using with spark-shell

bin/spark-shell --jars spark-elasticsearch-connector-assembly-1.0.0-SNAPSHOT.jar --conf spark.sql.extensions=org.apache.spark.sql.rzlabs.ElasticExtensionsBuilder

In spark-shell, a temp table could be created like this:

val df = spark.read.format("org.rzlabs.elastic")
  .option("index", "news")
  .option("host", "localhost:9200")
  .load
df.createOrReplacTempView("news")
spark.sql("select title, publish_time from news where publish_time >= '2020-01-01T00:00:00' order by publish_time DESC limit 10 offset 10").show

or you can create a hive table:

spark.sql("""
  create table news using org.rzlabs.elastic options (
    index "news",
    host "localhost:9200"
  )
""")

Options

option required default description
host No localhost:9200 Elasticsearch server host with http port
index Yes null Elaticsearch index name
type No The first type definition in the specific index mappings the type name in a specific index
cacheIndexMappings No false If re-pulling the index mappings or not when recreate the DF use the same options
skipUnknownTypeFields No false If skip unknown type fields or not
debugTransformation No false Log debug information about the transformations or not
timeZoneId No GMT Time zone id will affect the Timestamp type field
dateTypeFormat No strict_date_optional_time||epoch_millis Date type format specified for Date type in Elasticsearch. For more information, please visit: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/date.html
nullFillNonexistentFieldValue No false If using null value to fill the nonexistent field or not, if true, then use null to fill, otherwise throw ElasticIndexException when a field is nonexistent in a InternalRow
defaultLimit No Int.MaxValue The value of the setting index.max_result_window of Elasticsearch

Major features

Currently

  • Support fields projection and pruning.
  • Support nested fields schema mapping (to StructField).
  • Implement MATCH and MATCH_PHRASE predicates which will be pushed down as match and match_phrase queries.
  • Implement OFFSET keyword parsing which will be pushed down as from parameter to support pagination.
  • Limit operator will be pushed down as size parameter.
  • Offset operator on top of Aggregate will be transformed to CollectOffsetExec node in planning phase to support global pagination.
  • Like and RLike predicates will be pushed down as wildcard and regexp queries.
  • And and Or predicates will be pushed down as bool query.
  • EqualTo predicate will be pushed down as term query.
  • LessThan, GreaterThan etc. comparison predicates will be pushed down as range query.
  • In predicate will be pushed down as terms query.
  • Sortoperator will be pushed down as sort parameter.
  • Support Join/Union operator. You can join or union es index with other datasources without any performance degradation.

In the future

  • Support Aggregate operator pushdown to support agg query in Elasticsearch to improve the aggregation performance.
  • Support Insert operation to support writting data to Elasticsearch index.
  • Support nested fields query.
  • ...

spark-elasticsearch-connector's People

Contributors

sharpray avatar

Stargazers

Ludovic Claude avatar  avatar  avatar Hefei Li avatar  avatar  avatar Weichao Xiao avatar

Watchers

James Cloos avatar  avatar

spark-elasticsearch-connector's Issues

support spark 3.x

Add support for the greatest and latest spark versions so we can get the best performance out.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.