TPC-H Benchmark on Cloudera Impala

You can run TPC-H on your Impala cluster. These scripts are based on "TPC-H Benchmark on Hive".

This README covers the following topics.

  1. How to set up Cloudera Impala
  2. How to generate/prepare the data
  3. How to run the queries

How to set up Cloudera Impala

See the Impala installation guide:
https://ccp.cloudera.com/display/IMPALA10BETADOC/Installing+Impala

You also need to set up a Hive cluster to execute DDL statements (CREATE, DROP).
https://ccp.cloudera.com/display/CDH4DOC/Hive+Installation

If you want to use a virtual machine environment, you can download the demo VM image files.
https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Impala+Demo+VM

How to generate/prepare the data

The data is generated using the DBGEN software available from the TPC-H website.
See the README in the DBGEN install package for details on how to generate the dataset.

After the dataset is generated, it needs to be loaded into the Hadoop Distributed File System (HDFS).
A script for this is provided in the ./data directory. First, move all the generated dataset files into that directory.
Then upload them to HDFS by executing the following command:
$./tpch_prepare_data.sh

After running the script, you can check the data on HDFS with the following command:
$hadoop fs -ls /tpch
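
The staging step above can be sketched as a small shell helper. This is only an illustration: the function name stage_tbl_files and the example paths are hypothetical, and it assumes DBGEN wrote its output as *.tbl files in a single directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): move every DBGEN
# output file (*.tbl) from src_dir into dest_dir and print how many
# files were moved.
stage_tbl_files() {
    src_dir=$1
    dest_dir=$2
    count=0
    for f in "$src_dir"/*.tbl; do
        [ -e "$f" ] || continue    # the glob matched nothing
        mv "$f" "$dest_dir"/
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage (all paths are assumptions):
#   cd /path/to/dbgen && ./dbgen -s 1      # scale factor 1
#   stage_tbl_files /path/to/dbgen ./data
#   cd ./data && ./tpch_prepare_data.sh
#   hadoop fs -ls /tpch
```

Files that do not end in .tbl are left in place, so the DBGEN directory stays usable for generating another dataset.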

Note: You may need to modify the DBGEN makefile as follows.
CC = gcc
DATABASE = SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH
CFLAGS = -O -DDBNAME="dss" -D$(MACHINE) -D$(DATABASE) -D$(WORKLOAD) -D_FILE_OFFSET_BITS=64
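
If you prefer not to edit the file by hand, the four variable assignments can be applied with sed. The following is a minimal sketch, assuming your DBGEN package ships a template named makefile.suite in which these variables appear at the start of a line; the function name patch_dbgen_makefile is hypothetical. The CFLAGS line is not touched and still has to be checked manually.

```shell
#!/bin/sh
# Hypothetical helper: read a DBGEN makefile template and print a copy
# with CC, DATABASE, MACHINE and WORKLOAD set to the values listed above.
# All other lines (including CFLAGS) pass through unchanged.
patch_dbgen_makefile() {
    sed -e 's/^CC[[:space:]]*=.*/CC = gcc/' \
        -e 's/^DATABASE[[:space:]]*=.*/DATABASE = SQLSERVER/' \
        -e 's/^MACHINE[[:space:]]*=.*/MACHINE = LINUX/' \
        -e 's/^WORKLOAD[[:space:]]*=.*/WORKLOAD = TPCH/' \
        "$1"
}

# Example usage (file names are assumptions):
#   patch_dbgen_makefile makefile.suite > makefile
#   make
```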

How to run the queries

You can run the queries by executing the script "tpch_benchmark.sh".
There are some optional settings in benchmark.conf.
For example, edit IMPALA_CMD if you want to connect to a remote Impala server.
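
As an illustration of such a setting, the fragment below builds an IMPALA_CMD value that points at a remote server. It is a sketch under assumptions: benchmark.conf is assumed to be a shell fragment, the host name is a placeholder, and the -i host:port option of impala-shell may not be available in every Impala release, so check your version before relying on it.

```shell
#!/bin/sh
# Assumed layout of a benchmark.conf entry; IMPALA_HOST and IMPALA_PORT
# are hypothetical helper variables, not settings defined by this project.
IMPALA_HOST=impalad.example.com   # placeholder host name
IMPALA_PORT=21000                 # port commonly used by impala-shell
IMPALA_CMD="impala-shell -i ${IMPALA_HOST}:${IMPALA_PORT}"

# Example usage:
#   ./tpch_benchmark.sh
```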

Note: These scripts do NOT support Query 11 and Query 22. Impala 0.1 can't execute CROSS JOIN queries.
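
To make the restriction concrete, the sketch below lists the query numbers that remain when Queries 11 and 22 are left out of the 22 TPC-H queries. The function name supported_queries is hypothetical, and this is not how tpch_benchmark.sh itself selects queries.

```shell
#!/bin/sh
# Hypothetical helper: print the TPC-H query numbers 1..22, one per line,
# skipping the two queries that these scripts do not support.
supported_queries() {
    q=1
    while [ "$q" -le 22 ]; do
        case "$q" in
            11|22) ;;                   # unsupported: skipped
            *) printf '%s\n' "$q" ;;
        esac
        q=$((q + 1))
    done
}

# Example usage:
#   supported_queries
```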

Reference

The original TPC-H on Hive: https://issues.apache.org/jira/browse/HIVE-600
The official TPC-H specification: http://www.tpc.org/tpch/spec/tpch2.14.4.pdf
"DBGEN" which generates the TPC-H test data set: http://www.tpc.org/tpch/spec/tpch_2_14_3.zip

Thanks

TPC-H on Hive developer, Yuntao Jia.
Cloudera Impala developers.
