stratio / tpcds

This project forked from jonathanmace/tpcds

TPC-DS benchmarks including data generation with Spark and queries with Spark

Java 4.15% Smarty 89.06% Scala 5.85% Shell 0.71% Dockerfile 0.22%

tpcds's Introduction

Usage:

To compile, invoke

	mvn clean package

For convenience, set TPCDS_WORKLOAD_GEN to the directory where this git repository is checked out, e.g.:

    export TPCDS_WORKLOAD_GEN=~/tpcds

To generate data with Spark:

	bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSDataGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar
	
To run the workload:

	bin/spark-submit --class edu.brown.cs.systems.tpcds.spark.SparkTPCDSWorkloadGenerator ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-5.0-jar-with-dependencies.jar

The TPC-DS data set is configured through a number of options.  Most of these are inherited from Databricks' spark-sql-perf, which we use to generate the TPC-DS data.

The options of interest are as follows:

 - scaleFactor specifies the dataset size.  A scale factor of n generates approximately n GB of data.  Most data formats compress this quite effectively, so the data will appear smaller on disk (e.g., Parquet or ORC can compress by a factor of approximately 4).
 - dataLocation specifies the location of the dataset.  Typically this will be in HDFS, and you can specify HDFS file locations as normal (e.g., hdfs://<hostname>:<port>/<path>).
 - dataFormat specifies the format in which to store the data.  "parquet" and "orc" are good choices with high compression; "text" is also supported.
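
As a rough worked example of the sizing rule above, here is a minimal sketch that estimates on-disk size from a scale factor. The helper name `estimate_disk_gb` is hypothetical and not part of this project, and the default ratio of 4 is only the approximate Parquet/ORC figure quoted above; real ratios will vary with the data and codec.

```shell
# Hypothetical helper (not part of this repository): rough on-disk size in GB
# for a given scaleFactor, assuming ~1 GB of raw data per unit of scale factor.
estimate_disk_gb() {
    scale_factor="$1"
    compression="${2:-4}"   # assumed compression ratio; 4 is the approximate figure above
    # Integer arithmetic, rounded up so small scale factors do not report 0.
    echo $(( (scale_factor + compression - 1) / compression ))
}

estimate_disk_gb 100      # prints 25  (about 100 GB raw at ~4x compression)
estimate_disk_gb 100 1    # prints 100 (assuming an uncompressed "text" format)
```

The estimate is only useful for sizing HDFS capacity ahead of a run; it says nothing about generation time.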

The full (default) configuration options are as follows:

	tpcds {
	    scaleFactor = 1
	    dataLocation = "hdfs://127.0.0.1:9000/tpcds"
	    dataFormat = "parquet"
	    overwrite = false
	    partitionTables = true
	    useDoubleForDecimal = false
	    clusterByPartitionColumns = false
	    filterOutNullPartitionValues = false
	    numPartitions = 1000
	    usePartitionColumns = false
	}
	
We have provided a couple of useful command line utilities, which are generated into the folder `target/appassembler/bin`:

 - list-queries lists the available queries.  Queries are grouped into benchmarks.  With no arguments, it lists the available benchmarks; with one argument, it either lists the queries in a benchmark or prints a query.  Because several groups have implemented variants of the original TPC-DS queries, we include several of these variants here.  The impala-tpcds-modified-queries are a set of 20 selected queries that several previous works have used for benchmarking with Spark.
 - dsdgen is a wrapper around the dsdgen utility that TPC provides.  This package comes with precompiled dsdgen binaries for Linux and Mac, which we use for data generation.
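
The two invocation modes of list-queries described above might look like the following sketch. It assumes the project has already been built with `mvn clean package` and that TPCDS_WORKLOAD_GEN points at the checkout (falling back to the current directory); the `BIN` variable is introduced here only for illustration.

```shell
# Location of the generated utilities, relative to the checkout.
BIN="${TPCDS_WORKLOAD_GEN:-.}/target/appassembler/bin"

if [ -x "$BIN/list-queries" ]; then
    # No arguments: list the available benchmarks.
    "$BIN/list-queries"
    # One argument naming a benchmark: list the queries in that benchmark.
    "$BIN/list-queries" impala-tpcds-modified-queries
else
    echo "list-queries not found in $BIN; run 'mvn clean package' first" >&2
fi
```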
	

tpcds's People

Contributors

jonathanmace, mafernandez-stratio, jeffra, darroyo-stratio, shamoud
