GithubHelp home page GithubHelp logo

fangyongs / gazelle_plugin Goto Github PK

View Code? Open in Web Editor NEW

This project forked from oap-project/gazelle_plugin

0.0 0.0 0.0 34.98 MB

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.

License: Apache License 2.0

JavaScript 0.04% Shell 0.06% CMake 0.26% C++ 24.73% Python 0.07% Java 1.75% Scala 69.64% PLpgSQL 3.43% C 0.01%

gazelle_plugin's Introduction

* LEGAL NOTICE: Your use of this software and any required dependent software (the "Software Package") is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the "TPP.txt" or other similarly-named text file included with the Software Package for additional details.
* Optimized Analytics Package for Spark* Platform is under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

Gazelle Plugin

A Native Engine for Spark SQL with vectorized SIMD optimizations

Online Documentation

You can find the all the Gazelle Plugin documents on the project web page.

Introduction

Overview

Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. Apache Arrow provided CPU-cache friendly columnar in-memory layout, its SIMD-optimized kernels and LLVM-based SQL engine Gandiva are also very efficient.

Gazelle Plugin reimplements Spark SQL execution layer with SIMD-friendly columnar data processing based on Apache Arrow, and leverages Arrow's CPU-cache friendly columnar in-memory layout, SIMD-optimized kernels and LLVM-based expression engine to bring better performance to Spark SQL.

Key Features

Apache Arrow formatted intermediate data among Spark operator

Overview

With Spark 27396 its possible to pass a RDD of Columnarbatch to operators. We implemented this API with Arrow columnar format.

Apache Arrow based Native Readers for Parquet and other formats

Overview

A native parquet reader was developed to speed up the data loading. it's based on Apache Arrow Dataset. For details please check Arrow Data Source Note both data source V1 and V2 are supported. Please check the example section for arrow data source

Apache Arrow Compute/Gandiva based operators

Overview

We implemented common operators based on Apache Arrow Compute and Gandiva. The SQL expression was compiled to one expression tree with protobuf and passed to native kernels. The native kernels will then evaluate the these expressions based on the input columnar batch.

Native Columnar Shuffle Operator with efficient compression support

Overview

We implemented columnar shuffle to improve the shuffle performance. With the columnar layout we could do very efficient data compression for different data format.

Please check the operator supporting details here

How to use OAP: Gazelle Plugin

There are three ways to use OAP: Gazelle Plugin,

  1. Use precompiled jars
  2. Building by Conda Environment
  3. Building by Yourself

Use precompiled jars

Please go to OAP's Maven Central Repository to find Gazelle Plugin jars. For usage, you will require below two jar files:

  1. spark-arrow-datasource-standard--jar-with-dependencies.jar is located in com/intel/oap/spark-arrow-datasource-standard//
  2. spark-columnar-core--jar-with-dependencies.jar is located in com/intel/oap/spark-columnar-core// Please notice the files are fat jars shipped with our custom Arrow library and pre-compiled from our server(using GCC 9.3.0 and LLVM 7.0.1), which means you will require to pre-install GCC 9.3.0 and LLVM 7.0.1 in your system for normal usage.

Building by Conda

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to OAP-Installation-Guide for more information. Once finished OAP-Installation-Guide, you can find built spark-columnar-core-<version>-jar-with-dependencies.jar under $HOME/miniconda2/envs/oapenv/oap_jars. Then you can just skip below steps and jump to Get Started.

Building by yourself

If you prefer to build from the source code on your hand, please follow below steps to set up your environment.

Prerequisite

There are some requirements before you build the project. Please check the document Prerequisite and make sure you have already installed the software in your system. If you are running a SPARK Cluster, please make sure all the software are installed in every single node.

Installation

Please check the document Installation Guide

Get started

To enable Gazelle Plugin, the previous built jar spark-columnar-core-<version>-jar-with-dependencies.jar should be added to Spark configuration. We also recommend to use spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar. We will demonstrate an example by using both jar files. SPARK related options are:

  • spark.driver.extraClassPath : Set to load jar file to driver.
  • spark.executor.extraClassPath : Set to load jar file to executor.
  • jars : Set to copy jar file to the executors when using yarn cluster mode.
  • spark.executorEnv.ARROW_LIBHDFS3_DIR : Optional if you are using a custom libhdfs3.so.
  • spark.executorEnv.LD_LIBRARY_PATH : Optional if you are using a custom libhdfs3.so.

For Spark Standalone Mode, please set the above value as relative path to the jar file. For Spark Yarn Cluster Mode, please set the above value as absolute path to the jar file.

More Configuration, please check the document Configuration Guide

Example to run Spark Shell with ArrowDataSource jar file

${SPARK_HOME}/bin/spark-shell \
        --verbose \
        --master yarn \
        --driver-memory 10G \
        --conf spark.plugins=com.intel.oap.GazellePlugin \
        --conf spark.driver.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-<version>-jar-with-dependencies.jar \
        --conf spark.executor.extraClassPath=$PATH_TO_JAR/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:$PATH_TO_JAR/spark-columnar-core-<version>-jar-with-dependencies.jar \
        --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
        --conf spark.driver.cores=1 \
        --conf spark.executor.instances=12 \
        --conf spark.executor.cores=6 \
        --conf spark.executor.memory=20G \
        --conf spark.memory.offHeap.enabled=true \
        --conf spark.memory.offHeap.size=20G \
        --conf spark.task.cpus=1 \
        --conf spark.locality.wait=0s \
        --conf spark.sql.shuffle.partitions=72 \
        --conf spark.executorEnv.ARROW_LIBHDFS3_DIR="$PATH_TO_LIBHDFS3_DIR/" \
        --conf spark.executorEnv.LD_LIBRARY_PATH="$PATH_TO_LIBHDFS3_DEPENDENCIES_DIR"
        --jars $PATH_TO_JAR/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar,$PATH_TO_JAR/spark-columnar-core-<version>-jar-with-dependencies.jar

Here is one example to verify if Gazelle Plugin works, make sure you have TPC-H dataset. We could do a simple projection on one parquet table. For detailed testing scripts, please refer to Solution Guide.

val orders = spark.read.format("arrow").load("hdfs:////user/root/date_tpch_10/orders")
orders.createOrReplaceTempView("orders")
spark.sql("select * from orders where o_orderdate > date '1998-07-26'").show(20000, false)

The result should showup on Spark console and you can check the DAG diagram with some Columnar Processing stage. Gazelle Plugin still lacks some features, please check out the limitations.

Could/K8s Integration

Amazon EMR

Please refer to Gazelle_on_EMR to find details about how to use OAP Gazelle on Amazon EMR Cloud.

Google Cloud Dataproc

Gazelle Plugin now supports to run on Dataproc 2.0, we provide a guide to help quickly install Gazelle Plugin and run TPC-DS with notebooks or scripts.

Please refer to Gazelle_on_Dataproc to find details about:

  1. Create a cluster on Dataproc 2.0 with initialization actions. Gazelle Plugin jars compiled with -Pdataproc-2.0 parameter will installed by Conda in all cluster nodes.

  2. Config for enabling Gazelle Plugin.

  3. Run TPC-DS with notebooks or scripts.

You can also find more information about how to use OAP on Google Dataproc.

Performance data

For advanced performance testing, below charts show the results by using two benchmarks with Gazelle v1.1: 1. Decision Support Benchmark1 and 2. Decision Support Benchmark2. The testing environment for Decision Support Benchmark1 is using 1 master + 3 workers and Intel(r) Xeon(r) Gold 6252 CPU|384GB memory|NVMe SSD x3 per single node with 1.5TB dataset and parquet format.

  • Decision Support Benchmark1 is a query set modified from TPC-H benchmark. We change Decimal to Double since Decimal hasn't been supported in OAP v1.0-Gazelle Plugin. Overall, the result shows a 1.49X performance speed up from OAP v1.0-Gazelle Plugin comparing to Vanilla SPARK 3.0.0. We also put the detail result by queries, most of queries in Decision Support Benchmark1 can take the advantages from Gazelle Plugin. The performance boost ratio may depend on the individual query.

Performance

Performance

The testing environment for Decision Support Benchmark2 is using 1 master + 3 workers and Intel(r) Xeon(r) Platinum 8360Y CPU|1440GB memory|NVMe SSD x4 per single node with 3TB dataset and parquet format.

  • Decision Support Benchmark2 is a query set modified from TPC-DS benchmark. We change Decimal to Doubel since Decimal hasn't been supported in OAP v1.0-Gazelle Plugin. We pick up 10 queries which can be fully supported in OAP v1.0-Gazelle Plugin and the result shows a 1.26X performance speed up comparing to Vanilla SPARK 3.0.0.

Performance

Please notes the performance data is not an official from TPC-H and TPC-DS. The actual performance result may vary by individual workloads. Please try your workloads with Gazelle Plugin first and check the DAG or log file to see if all the operators can be supported in OAP-Gazelle Plugin. Please check the detailed page on performance tuning for TPC-H and TPC-DS workloads.

Memory allocation

The memory usage in Gazelle Plugin is high. The allocations goes to two parts: 1) Java based allocation which is widely used in Arrow Java API. 2) Native side memory allocation used in each native kernel. We investigated the memory allocation behavior and made more turnings here, the memroy footprint is stable during a TPC-DS power run. Memory

Coding Style

Contact

[email protected] [email protected]

gazelle_plugin's People

Contributors

adrian-wang avatar carsonwang avatar chenghao-intel avatar eugene-mark avatar gfl94 avatar haojinintel avatar happycherry avatar hongw2019 avatar ivoson avatar jerrychenhf avatar jikunshang avatar jkself avatar lee-lei avatar lidinghao avatar luciferyang avatar marin-ma avatar moonlit-sailor avatar philo-he avatar rui-mo avatar shaowenzh avatar songzhan01 avatar tigersong avatar weiting-chen avatar xuechendi avatar xwu99 avatar yao531441 avatar yma11 avatar zhixingheyi-tian avatar zhouyuan avatar zhztheplayer avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.