
spark-sql-perf's Introduction

Spark SQL Performance Tests


This is a performance testing framework for Spark SQL in Apache Spark 2.2+.

Note: This README is still under development. Please also check our source code for more information.

Quick Start

Running from the command line:

$ bin/run --help

spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -m <value> | --master <value>
        the master url to use
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text
        
$ bin/run --benchmark DatasetPerformance

The first run of bin/run will build the library.

Build

Use sbt package or sbt assembly to build the library jar.
Use sbt +package to build for Scala 2.11 and 2.12.
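
If you want to experiment with the library from a shell instead of bin/run, one option is to attach the built jar to spark-shell, as several of the issues below do. A minimal sketch; the exact jar name depends on the Scala and library versions of your build:

# build the jar, then start a shell with it on the classpath
sbt package
spark-shell --jars target/scala-2.12/spark-sql-perf_2.12-<version>.jar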

Local performance tests

The framework contains twelve benchmarks that can be executed in local mode. They are organized into three classes that target different components and functions of Spark (a combined invocation example follows the list):

  • DatasetPerformance compares the performance of the old RDD API with the newer DataFrame and Dataset APIs. These benchmarks can be launched with the command bin/run --benchmark DatasetPerformance
  • JoinPerformance compares the performance of joining different table sizes and shapes with different join types. These benchmarks can be launched with the command bin/run --benchmark JoinPerformance
  • AggregationPerformance compares the performance of aggregating different table sizes using different aggregation types. These benchmarks can be launched with the command bin/run --benchmark AggregationPerformance
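
The options shown in the usage text above can be combined. For example, a hypothetical invocation that runs only the join benchmarks whose names match a filter string, for several iterations (the filter value here is illustrative; it is matched against the query names):

bin/run --benchmark JoinPerformance --filter join --iterations 5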

MLlib tests

To run the MLlib tests, run bin/run-ml yamlfile, where yamlfile is the path to a YAML configuration file describing the tests to run and their parameters.

TPC-DS

Setup a benchmark

Before running any query, a dataset needs to be set up by creating a Benchmark object. Generating the TPCDS data requires dsdgen to be built and available on the machines. We have a fork of dsdgen that you will need. The fork includes changes to generate TPCDS data to stdout, so that this library can pipe them directly to Spark, without intermediate files. Therefore, this library will not work with the vanilla TPCDS kit.

TPCDS kit needs to be installed on all cluster executor nodes under the same path!

The fork can be found in the databricks/tpcds-kit repository on GitHub.

// Generate the data
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d <dsdgenDir> -s <scaleFactor> -l <location> -f <format>"
// Create the specified database
sql(s"create database $databaseName")
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
// Or, if you want to create temporary tables
// tables.createTemporaryTables(location, format)

// For CBO only, gather statistics on all columns:
tables.analyzeTables(databaseName, analyzeColumns = true) 
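
The snippets above assume a tables object already exists. As an alternative to the GenTPCDSData runner, the tables helper can be created and the data generated directly from Scala. This is a minimal sketch based on the TPCDSTables usage quoted in the issues further down; the paths and scale factor are placeholders, and parameter types (e.g. whether scaleFactor is a String or an Int) vary between library versions:

import com.databricks.spark.sql.perf.tpcds.TPCDSTables

// sqlContext comes from the running SparkSession (spark.sqlContext in spark-shell).
val tables = new TPCDSTables(sqlContext,
  dsdgenDir = "/path/to/tpcds-kit/tools", // location of the patched dsdgen on every executor node
  scaleFactor = "1",                      // scale factor in GB (an Int in older versions of the library)
  useDoubleForDecimal = false,            // true to replace DecimalType with DoubleType
  useStringForDate = false)               // true to replace DateType with StringType

tables.genData(
  location = rootDir,                     // the same base path passed to createExternalTables above
  format = "parquet",
  overwrite = true,                       // overwrite data that is already there
  partitionTables = true,                 // create the partitioned fact tables
  clusterByPartitionColumns = true,       // shuffle to coalesce partitions into single files
  filterOutNullPartitionValues = false,   // true to filter out partitions with a NULL key value
  tableFilter = "",                       // "" means generate all tables
  numPartitions = 100)                    // how many dsdgen parts run in parallel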

Run benchmarking queries

After setup, users can use runExperiment function to run benchmarking queries and record query execution time. Taking TPC-DS as an example, you can start an experiment by using

import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS (sqlContext = sqlContext)
// Set:
val databaseName = ... // name of database with TPCDS data.
val resultLocation = ... // place to write results
val iterations = 1 // how many iterations of queries to run.
val queries = tpcds.tpcds2_4Queries // queries to run.
val timeout = 24*60*60 // timeout, in seconds.
// Run:
sql(s"use $databaseName")
val experiment = tpcds.runExperiment(
  queries, 
  iterations = iterations,
  resultLocation = resultLocation,
  forkThread = true)
experiment.waitForFinish(timeout)

By default, the experiment is started in a background thread. For every experiment run (i.e. every call of runExperiment), Spark SQL Perf uses the timestamp of the start time to identify that experiment. Performance results are stored in a sub-directory named by that timestamp under the configured result location (spark.sql.perf.results), for example /tmp/results/timestamp=1429213883272. The performance results are stored in JSON format.

Retrieve results

While the experiment is running you can use experiment.html to get a summary, or experiment.getCurrentResults to get complete current results. Once the experiment is complete, you can still access experiment.getCurrentResults, or you can load the results from disk.

// Get all experiments results.
val resultTable = spark.read.json(resultLocation)
resultTable.createOrReplaceTempView("sqlPerformance")
sqlContext.table("sqlPerformance")
// Get the result of a particular run by specifying the timestamp of that run.
sqlContext.table("sqlPerformance").filter("timestamp = 1429132621024")
// or
val specificResultTable = spark.read.json(experiment.resultPath)

You can get a basic summary by running:

import org.apache.spark.sql.functions._ // for col and substring
import spark.implicits._                // for the 'columnName symbol syntax

experiment.getCurrentResults // or: spark.read.json(resultLocation).filter("timestamp = 1429132621024")
  .withColumn("Name", substring(col("name"), 2, 100))
  .withColumn("Runtime", (col("parsingTime") + col("analysisTime") + col("optimizationTime") + col("planningTime") + col("executionTime")) / 1000.0)
  .select('Name, 'Runtime)

TPC-H

TPC-H can be run similarly to TPC-DS, replacing tpcds with tpch; a rough sketch follows. Also take a look at the data generator and tpch_run notebook code below.
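
As a rough sketch, the TPC-H analogue of the TPC-DS experiment above could look like the following. This assumes the TPCH class mirrors the TPCDS API (the com.databricks.spark.sql.perf.tpch package exists in this repository, but the member names below should be checked against the source):

import com.databricks.spark.sql.perf.tpch.TPCH

val tpch = new TPCH(sqlContext = sqlContext)
sql(s"use $databaseName")           // database containing the generated TPC-H tables
val experiment = tpch.runExperiment(
  tpch.queries,                     // the 22 TPC-H queries (member name assumed, check the source)
  iterations = 1,
  resultLocation = resultLocation,
  forkThread = true)
experiment.waitForFinish(24*60*60)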

Running in Databricks workspace (or spark-shell)

There are example notebooks in src/main/notebooks for running TPCDS and TPCH in the Databricks environment. These scripts can also be run from spark-shell command line with minor modifications using :load file_name.scala.

TPC-multi_datagen notebook

This notebook (or Scala script) can be used to generate both TPCDS and TPCH data at selected scale factors. It is a newer version of the tpcds_datagen notebook below. To use it:

  • Edit the config variables at the top of the script.
  • Run the whole notebook.

tpcds_datagen notebook

This notebook can be used to install dsdgen on all worker nodes, run data generation, and create the TPCDS database. Note that because of the way dsdgen is installed, it will not work on an autoscaling cluster, and num_workers has to be updated to the number of worker instances on the cluster. Data generation may also break if any of the workers is killed - the restarted worker container will not have dsdgen anymore.

tpcds_run notebook

This notebook can be used to run TPCDS queries.

For running parallel TPCDS streams:

  • Create a Cluster and attach the spark-sql-perf library to it.
  • Create a Job using the notebook and attaching to the created cluster as "existing cluster".
  • Allow concurrent runs of the created job.
  • Launch an appropriate number of Runs of the Job to run in parallel on the cluster.

tpch_run notebook

This notebook can be used to run TPCH queries. Data needs to be generated first.



spark-sql-perf's Issues

Is Hive a prerequisite?

Throughout the code I get the feeling that a pre-installed Hive installation is needed. Is this correct? When writing to an external table I can see the Spark driver assumes the destination is a Hive database.

If it is indeed needed, it would be important to note that in the README.md.

java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier CREATE found

Hi:
I am facing the issues below when I try to run this code. For this command:
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier CREATE found

CREATE DATABASE IF NOT EXISTS mydata
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
..............

I used spark-sql-perf-0.2.4, scala-2.10.5, spark-1.6.1;
but this command:
tables.createTemporaryTables("file:///home/wl/tpctest/", "parquet") has no problem.
Can you help me resolve this problem?
Also, the tpcds.createResultsTable() command fails the same way as tables.createExternalTables().

Spark 3.0.0 compile error

Build errors. spark 3.0.0

build.sbt snippet
...
crossScalaVersions := Seq("2.11.12", "2.12.10")
...
sparkVersion := "3.0.0"

[info] Compiling 66 Scala sources to /opt/spark-sql-perf/target/scala-2.12/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.12.10. Compiling...
[info] Compilation completed in 18.491 s
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala:338: value table is not a member of Seq[String]
[error] case UnresolvedRelation(t) => t.table
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:59: value table is not a member of Seq[String]
[error] tableIdentifier.table
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:94: missing argument list for method simpleString in class QueryPlan
[error] Unapplied methods are only converted to functions when a function type is expected.
[error] You can make this conversion explicit by writing simpleString _ or simpleString(_) instead of simpleString.
[error] messages += s"Breakdown: ${node.simpleString}"
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:107: missing argument list for method simpleString in class QueryPlan
[error] Unapplied methods are only converted to functions when a function type is expected.
[error] You can make this conversion explicit by writing simpleString _ or simpleString(_) instead of simpleString.
[error] node.simpleString.replaceAll("#\d+", ""),
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/org/apache/spark/ml/ModelBuilderSSP.scala:62: not enough arguments for constructor NaiveBayesModel: (uid: String, pi: org.apache.spark.ml.linalg.Vector, theta: org.apache.spark.ml.linalg.Matrix, sigma: org.apache.spark.ml.linalg.Matrix)org.apache.spark.ml.classification.NaiveBayesModel.
[error] Unspecified value parameter sigma.
[error] val model = new NaiveBayesModel("naivebayes-uid", pi, theta)
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/org/apache/spark/ml/ModelBuilderSSP.scala:163: not enough arguments for method getCalculator: (impurity: String, stats: Array[Double], rawCount: Long)org.apache.spark.mllib.tree.impurity.ImpurityCalculator.
[error] Unspecified value parameter rawCount.
[error] ImpurityCalculator.getCalculator("variance", Array.fillDouble(0.0))
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/org/apache/spark/ml/ModelBuilderSSP.scala:165: not enough arguments for method getCalculator: (impurity: String, stats: Array[Double], rawCount: Long)org.apache.spark.mllib.tree.impurity.ImpurityCalculator.
[error] Unspecified value parameter rawCount.
[error] ImpurityCalculator.getCalculator("gini", Array.fillDouble(0.0))
[error] ^
[error] 7 errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 178 s, completed Aug 12, 2020, 4:00:12 AM

Spark-sql-perf data generation got very slow

Hi experts @davies,
I am using spark-sql-perf to generate TPC-DS 1TB data with partitionTables enabled, like tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false). But some of the big tables (e.g., store_sales) became slow to complete (about 3 hrs on 4 slave nodes). I observed that all the data was first written to /tpcds_1t/store_sales/_temporary/0 and then moved to /tpcds_1t/store_sales on HDFS, and this 'move' on HDFS took a lot of time to complete... Has anyone come across the same issue? How can it be resolved? BTW, we use the TPC-DS kit from https://github.com/davies/tpcds-kit

Thanks in advance !

build errors due to dependencies

I'm new to scala and spark. Got overwhelmed with build errors due to unresolved dependencies like :

[warn] Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading com.databricks:sbt-databricks;sbtVersion=1.0;scalaVersion=2.12:0.1.3
[error] Not found
[error] Not found
[error] not found: https://repo1.maven.org/maven2/com/databricks/sbt-databricks_2.12_1.0/0.1.3/sbt-databricks-0.1.3.pom
[error] not found: https://dl.bintray.com/spark-packages/maven/com/databricks/sbt-databricks_2.12_1.0/0.1.3/sbt-databricks-0.1.3.pom
[error] not found: https://oss.sonatype.org/content/repositories/releases/com/databricks/sbt-databricks_2.12_1.0/0.1.3/sbt-databricks-0.1.3.pom
[error] not found: https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.databricks/sbt-databricks/scala_2.12/sbt_1.0/0.1.3/ivys/ivy.xml
[error] not found: https://repo.typesafe.com/typesafe/ivy-releases/com.databricks/sbt-databricks/scala_2.12/sbt_1.0/0.1.3/ivys/ivy.xml
[error] Error downloading com.jsuereth:sbt-pgp;sbtVersion=1.0;scalaVersion=2.12:1.0.0
[`

Any help ?

CI reports success despite fatal OutOfMemoryError

Looking at the most recent Travis build we see:

16/03/30 19:02:44 INFO SessionState: Created HDFS directory: file:/tmp/scratch-362b2e98-2401-4066-9a73-06f6dbbc3454/travis/447f0c84-1866-45d9-aeca-53fb780ef096/_tmp_space.db
Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
Exception in thread "Thread-1" java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at sbt.React.react(ForkTests.scala:114)
    at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:74)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-5" java.io.EOFException
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1$React.react(Framework.scala:945)
    at org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1.run(Framework.scala:934)
    at java.lang.Thread.run(Thread.java:745)
[info] Run completed in 35 seconds, 691 milliseconds.
[info] Total number of tests run: 0
[info] Suites: completed 0, aborted 0
[info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
[info] No tests were executed.
[success] Total time: 193 s, completed Mar 30, 2016 7:02:57 PM

suitable executor-memory for spark-sql-perf testing

Is it okay to configure "executor-memory=1G"?
I am running the tpcds.tpcds2_4Queries (q1-q99) tests on 100 GB of data on a Kubernetes cluster. I want to find out the most suitable executor memory for these queries.

Is it possible to run only 1 filter

I am trying to run my benchmark as
./bin/run -b DatasetPerformance -f RDD
which runs only the RDD-related benchmarks, namely: average, back to back filters, back to back map, and range.

But is there a way to run only "RDD: back to back filters"?

point at spark master?

This is probably a really simple question. How can I point bin/run to a running Spark cluster?

Valid Queries

Which TPC-DS queries are supposed to be working? I'm finding that several queries are not parseable, for a variety of reasons. Specifically, the following queries compile without exception:

Set(q25, q65, q50, q88, q64, q99, q46, q76, q97, q19, q17, q57, q75, q49, q90, q89, q78, q63, qSsMax, q71, q4, q93, q74, q34, q28, q82, q85, q96, q51, q40, q13, q62, q79, q59, q29, q48, q37, q52, q73, q68, q3, q15, q55, q39b, q84, q72, q31, q53, q7, q39a, q43, q61, q42, q11, q26, q21, q47, q91)

The following queries throw exceptions when compiling:

Set(q77, q60, q32, q69, q14b, q86, q98, q6, q35, q83, q66, q87, q94, q23b, q8, q27, q5, q38, q14a, q24b, q67, q56, q16, q45, q23a, q18, q12, q81, q92, q41, q33, q70, q24a, q30, q95, q22, q20, q44, q10, q1, q36, q9, q58, q80, q2, q54)

I've attached detailed output with the various exceptions that are thrown.
queries.txt

The code I used to produce this, from the shell, is:

import java.io._
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)
val writer = new PrintWriter(new File("queries.txt"))
tpcds.tpcds1_4QueriesMap.keys.foreach(k => try { tpcds.tpcds1_4QueriesMap{k}.tablesInvolved; writer.write(k + ": success\n\n"); } catch { case e: Exception => writer.write(k + ": failed, reason:\n" + e + "\n\n"); })

Is this expected? Since there are several queries split over multiple files (Simple, ImpalaKit, and TPCDS_1_4), are there a set of queries that are considered "supported" right now?

Can we put all working queries into this test kit? There are 86 out of 99 working in Spark 1.5

The IBM Spark Technology Center has gotten 86 of the 99 TPC-DS business queries to run successfully on Spark 1.4.1 and 1.5.0. The existing spark-sql-perf test kit (on GitHub from Databricks) has a small subset of these. Could we add all of the following queries to the kit so the community can learn from and leverage them?

query01
query02
query03
query04
query05
query07
query08
query09
query11
query12
query13
query15
query16
query17
query20
query21
query22
query24a
query24b
query25
query26
query27
query28
query29
query30
query31
query32
query33
query34
query36
query37
query38
query39a
query39b
query40
query42
query43
query45
query46
query47
query48
query49
query50
query51
query52
query53
query56
query57
query58
query59
query60
query61
query62
query63
query66
query67
query68
query69
query72
query73
query74
query75
query76
query77
query78
query79
query80
query81
query82
query83
query84
query85
query86
query87
query88
query89
query90
query91
query92
query93
query94
query95
query96
query97
query98
query99

Please contact me for the working queries.
Jesse Chen
[email protected]

Improvement ideas - more options

Hi, I'd like to contribute the changes below; I'm looking for second opinions on whether this is the direction we want to go with this benchmark.

I suggest the following options be added, and I'm happy to work on this:

-u for a URI so we can store our data in a database such as DB2 instead of the local file system (concerned here as we'd need to pass the user and pass to the benchmark for some database configurations, wouldn't want this to be getting made available in someone's .bash_history)
-m for a custom master URL so we can use a cluster instead of just local[*]
-wsc to quickly enable WholeStageCodegen
-q to run only select queries e.g. -q 1 2 4 90

We have a variety of whole-stage codegen changes that we'd like to contribute to Spark, and we think this benchmark is the most fitting for our goals. With the above changes we could quickly execute individual queries, use databases, execute across multiple machines (or just not ours), and quickly see the benefits or drawbacks of WholeStageCodegen.

spark-sql-perf questions

I am new to the spark-sql-perf test tool. I have some questions about it and hope you can answer them. Thanks in advance!

  1. I deployed a Spark 1.6.2 cluster. Could I use spark-sql-perf v0.4.3 to run tests? I saw the Spark version is set to 2.0.0 in build.sbt in v0.4.3.

  2. How can I run one specific query (e.g., q19) in spark-shell? I only see the instruction below for running all of the queries:
    import com.databricks.spark.sql.perf.tpcds.TPCDS
    val tpcds = new TPCDS (sqlContext = sqlContext)
    val experiment = tpcds.runExperiment(tpcds.tpcds1_4Queries)

Validating the correctness of results

We are looking to validate whether the returned results of the TPC-DS queries are correct. We used IBM's TPC-DS benchmark suite before, which has a process for validating the correctness of the returned results. I tried to search for such tools in this repo and could not find any... Did I miss anything? Any suggestions would be appreciated. Thanks!

Running on Spark 2.0 /Yarn

I get an error trying to import class Tables from target/scala-2.10/spark-sql-perf_2.10-0.4.7-SNAPSHOT.jar
I use $spark-shell --jars ...
scala> import com.databricks.spark.sql.perf.tpcds.Tables <console>:23: error: object databricks is not a member of package com
I also noticed:
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

Any pointers?

Supports new tpc-ds dsdgen tool

Hello,

I am using:

  1. the latest version of spark-sql-perf, compiled for Spark 1.4.0.
  2. the latest dsdgen taken from the TPC-DS website
  3. Spark 1.4.0

When using genData() to generate data for the store_sales table, the application crashes with the following stack trace (see next post).

The parameters used for genData are: HDFS folder, 'parquet', true, false, false, false, false.

Any suggestions?

TPC-DS table primary key constraint not enforced during data generation

The TPC-DS spec defines primary key constraints for tables (Section 2.2.1.2)

However, it looks like this constraint isn't enforced while generating data. For example, if I run the following two queries on the store_sales table, I get a different number of rows even though ss_item_sk is defined as the primary key.

scala> val df = sql("SELECT count(ss_item_sk) FROM store_sales")
df: org.apache.spark.sql.DataFrame = [count(ss_item_sk): bigint]
scala> df.show
+-----------------+                                                             
|count(ss_item_sk)|
+-----------------+
|          2879789|
+-----------------+
scala> val df = sql("SELECT count(distinct ss_item_sk) FROM store_sales")
df: org.apache.spark.sql.DataFrame = [count(DISTINCT ss_item_sk): bigint]
scala> df.show
+--------------------------+                                                    
|count(DISTINCT ss_item_sk)|
+--------------------------+
|                     18000|
+--------------------------+

The Query and Generate mismatch

I split the data generation and query code into two jar files: first I generate the data, then I query it in parallel. Most tasks take about 20 ms, but some tasks take 700 ms. The reason is that these tasks access non-existent keys (files). Why do the query tasks access non-generated files?

Following is a part of my program.

  1. Generate date code
    val tables = new TPCDSTables(sqlContext,
    dsdgenDir = "/opt/tpcds-kit/tools", // location of dsdgen
    scaleFactor = 70,
    useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
    useStringForDate = false) // true to replace DateType with StringType

    tables.genData(
    location = rootDir,
    format = format,
    overwrite = true, // overwrite the data that is already there
    partitionTables = true, // create the partitioned fact tables
    clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
    filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
    tableFilter = "", // "" means generate all tables
    numPartitions = 100) // how many dsdgen part

  2. Query code
    val tables = new TPCDSTables(sqlContext,
    dsdgenDir = "/opt/tpcds-kit/tools", // location of dsdgen
    scaleFactor = 70,
    useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
    useStringForDate = false) // true to replace DateType with StringType

    // Create the specified database
    sql(s"create database $databaseName")

    // Create metastore tables in a specified database for your data.
    // Once tables are created, the current database will be switched to the specified database.
    tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = false)
    val tpcds = new TPCDS (sqlContext = sqlContext)
    // Set:
    val resultLocation = args(3) // place to write results
    val iterations = 1 // how many iterations of queries to run.
    //val queries = tpcds.tpcds2_4Queries // queries to run.
    val timeout = 24*60*60 // timeout, in seconds.

    def queries = {
    if (args(4) == "all") {
    tpcds.tpcds2_4Queries
    } else {
    val qa = args(4).split(",", 0)
    val qs = qa.toSeq
    tpcds.tpcds2_4Queries.filter(q => {
    qs.contains(q.name)
    })
    }
    }

    println(queries.size)
    sql(s"use $databaseName")
    val experiment = tpcds.runExperiment(
    queries,
    iterations = iterations,
    resultLocation = resultLocation,
    forkThread = true)
    experiment.waitForFinish(timeout)

dataGen fails

Spark 2.0.1,

6/11/02 15:28:45 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 172.16.90.5): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 0 if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, cs_sold_date_sk), StringType), true) AS cs_sold_date_sk#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, cs_sold_date_sk), StringType), true) :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt : :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object) : : +- input[0, org.apache.spark.sql.Row, true]
: +- 0
:- null
+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, cs_sold_date_sk), StringType), true)
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, cs_sold_date_sk), StringType)
+- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, cs_sold_date_sk) +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object) +- input[0, org.apache.spark.sql.Row, true]

if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 1, cs_sold_time_sk), StringType), true) AS cs_sold_time_sk#1 +- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 1, cs_sold_time_sk), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
: :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object) : : +- input[0, org.apache.spark.sql.Row, true] : +- 1 :- null
+- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 1, cs_sold_time_sk), StringType), true)
+- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 1, cs_sold_time_sk), StringType) +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 1, cs_sold_time_sk) +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object) +- input[0, org.apache.spark.sql.Row, true]

TPC-DS.. dataGen.. format

table.genData(tableLocation, format, overwrite, clusterByPartitionColumns,
What value does format take when generating TPC-DS benchmarks?

How does one acquire dsdgen

I've downloaded the TPC-DS package from tpc.org. However, it does not contain anything remotely resembling dsdgen.

sbt.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found

Hi
I am getting the following exception while running it. Please help. Thanks.

shuja@shuja:~/Desktop/data/spark-sql-perf-master$ bin/run --benchmark DatasetPerformance
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
[info] Loading project definition from /home/shuja/Desktop/data/spark-sql-perf-master/project
Missing bintray credentials /home/shuja/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to spark-sql-perf (in build file:/home/shuja/Desktop/data/spark-sql-perf-master/)
[warn] Credentials file /home/shuja/.bintray/.credentials does not exist
[info] Updating {file:/home/shuja/Desktop/data/spark-sql-perf-master/}spark-sql-perf-master...
[info] Resolving org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT
[warn] ==== local: tried
[warn] /home/shuja/.ivy2/local/org.apache.spark/spark-sql_2.10/2.0.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.10/2.0.0-SNAPSHOT/spark-sql_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== Spark Packages Repo: tried
[warn] https://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-sql_2.10/2.0.0-SNAPSHOT/spark-sql_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== apache-snapshots: tried
[warn] https://repository.apache.org/snapshots/org/apache/spark/spark-sql_2.10/2.0.0-SNAPSHOT/spark-sql_2.10-2.0.0-SNAPSHOT.pom
[info] Resolving org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT
[warn] ==== local: tried
[warn] /home/shuja/.ivy2/local/org.apache.spark/spark-hive_2.10/2.0.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.10/2.0.0-SNAPSHOT/spark-hive_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== Spark Packages Repo: tried
[warn] https://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-hive_2.10/2.0.0-SNAPSHOT/spark-hive_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== apache-snapshots: tried
[warn] https://repository.apache.org/snapshots/org/apache/spark/spark-hive_2.10/2.0.0-SNAPSHOT/spark-hive_2.10-2.0.0-SNAPSHOT.pom
[info] Resolving org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT
[warn] ==== local: tried
[warn] /home/shuja/.ivy2/local/org.apache.spark/spark-mllib_2.10/2.0.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-mllib_2.10/2.0.0-SNAPSHOT/spark-mllib_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== Spark Packages Repo: tried
[warn] https://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-mllib_2.10/2.0.0-SNAPSHOT/spark-mllib_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== apache-snapshots: tried
[warn] https://repository.apache.org/snapshots/org/apache/spark/spark-mllib_2.10/2.0.0-SNAPSHOT/spark-mllib_2.10-2.0.0-SNAPSHOT.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found
[warn] :: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT: not found
[warn] :: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.spark:spark-sql_2.10:2.0.0-SNAPSHOT ((sbtsparkpackage.SparkPackagePlugin) SparkPackagePlugin.scala#L186)
[warn] +- com.databricks:spark-sql-perf_2.10:0.4.10-SNAPSHOT
[warn] org.apache.spark:spark-hive_2.10:2.0.0-SNAPSHOT ((sbtsparkpackage.SparkPackagePlugin) SparkPackagePlugin.scala#L186)
[warn] +- com.databricks:spark-sql-perf_2.10:0.4.10-SNAPSHOT
[warn] org.apache.spark:spark-mllib_2.10:2.0.0-SNAPSHOT ((sbtsparkpackage.SparkPackagePlugin) SparkPackagePlugin.scala#L186)
[warn] +- com.databricks:spark-sql-perf_2.10:0.4.10-SNAPSHOT
sbt.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found
unresolved dependency: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT: not found
unresolved dependency: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:127)
at sbt.IvySbt.withIvy(Ivy.scala:124)
at sbt.IvySbt$Module.withModule(Ivy.scala:155)
at sbt.IvyActions$.updateEither(IvyActions.scala:165)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1369)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1365)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$87.apply(Defaults.scala:1399)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$87.apply(Defaults.scala:1397)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1402)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1396)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1419)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1348)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1310)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
error sbt.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found
[error] unresolved dependency: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT: not found
[error] unresolved dependency: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT: not found
[error] Total time: 10 s, completed Aug 17, 2016 5:19:11 PM

Error when calling Benchmark.queries

Hello,

I am using Spark 1.5.2. I successfully generated the data, stored it in HDFS, created a TPCDS instance, and loaded the tables using createTemporaryTables(). But every time I try to run an experiment or access queries, interactiveQueries, etc. (from the Benchmark class), I get the following stack trace:

tpcds.queries
java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.queryExecution()Lorg/apache/spark/sql/execution/QueryExecution;
at com.databricks.spark.sql.perf.Benchmark$Query.toString(Benchmark.scala:543)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$$anonfun$scala$runtime$ScalaRunTime$$inner$1$2.apply(ScalaRunTime.scala:320)
at scala.runtime.ScalaRunTime$$anonfun$scala$runtime$ScalaRunTime$$inner$1$2.apply(ScalaRunTime.scala:320)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
at scala.collection.AbstractIterator.addString(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
at scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:320)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .(:10)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Spark Sql Perf

Hi

I am facing the issues below when I try to run this code. For this command:

  1. val experiment = tpcds.runExperiment(tpcds.interactiveQueries)

WARN TaskSetManager: Stage 268 contains a task of very large size (331 KB). The maximum recommended task size is 100 KB.

// Here it takes a lot of time. When I press the enter key, it drops back to the scala prompt, i.e. (scala>).

  1. tpcds.createResultsTable()

error: value createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS

Can't gen data

I got this error:
Caused by: java.lang.RuntimeException: Could not find dsdgen at /home/dcos/tpcds-kit-master/tools/dsdgen or //home/dcos/tpcds-kit-master/tools/dsdgen. Run install

I have run 'make' under /home/dcos/tpcds-kit-master/tools/.
My Scala code is:
val tables = new Tables(sqlContext, "/home/dcos/tpcds-kit-master/tools", 1)
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false, false, false, false)

Runs fail when tables are cached...

I'm running spark-sql-perf in a scenario where you would cache tables in RAM for performance reasons. Code below.

The result is the BlockManager failing; see the trace below...

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.sql.perf.tpcds.Tables
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tableNames = Array("call_center", "catalog_page",
"catalog_returns", "catalog_sales", "customer",
"customer_address", "customer_demographics", "date_dim",
"household_demographics", "income_band", "inventory", "item",
"promotion", "reason", "ship_mode", "store",
"store_returns", "store_sales", "time_dim", "warehouse",
"web_page", "web_returns", "web_sales", "web_site")
for(i <- 0 to tableNames.length - 1) { sqlContext.read.parquet("hdfs://sparkc:9000/sqldata" + "/" + tableNames{i}).registerTempTable(tableNames{i})
sqlContext.cacheTable(tableNames{i})
}
val tpcds = new TPCDS (sqlContext = sqlContext)
val experiment = tpcds.runExperiment(tpcds.tpcds1_4Queries, iterations = 1)
experiment.waitForFinish(60*60*10)

16/11/04 12:30:40 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 39.0 (TID 1545, 172.16.90.9): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_15_piece0 of broadcast_15
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1280)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:358)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_15_piece0 of broadcast_15
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273)
... 37 more

Spark session getting crashed when running val tpcds = new TPCDS (sqlContext = sqlContext) in spark 2.0.0 preview

Hi,

1) I have created the jar file of spark-sql-perf after building it and started spark-shell with that jar.

2) When I run the commands below in the Spark 2.0.0 preview spark-shell:
    import com.databricks.spark.sql.perf.tpcds.Tables
    val tables = new Tables(sqlContext, "/root/tpcds-kit-master/dsdgen/tools/", 2)
    tables.genData("hdfs://192.168.126.128:8020/tmp/temp1", "parquet", true, true, true, true, true)
    tables.createExternalTables("hdfs://192.168.126.128:8020/tmp/temp1", "parquet", "sparkperf", true)
    import com.databricks.spark.sql.perf.tpcds.TPCDS
    val tpcds = new TPCDS (sqlContext = sqlContext) // This command causes the issue in the Spark 2.0.0 preview.

Please find the error details in the attached file
spark shell command error.txt

Please let me know if any further information is required.

It seems the run is successful but errors show in the terminal

I use ./bin/run --benchmark DatasetPerformance

I can see the Spark web page at port 8080, but errors like the following show up in the terminal.
Besides, do I need to start Spark first?

[error] 16/02/01 22:03:37 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 4185 ms on localhost (1/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4210 ms on localhost (2/8)
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 4193 ms on localhost (3/8)
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 4193 ms on localhost (4/8)
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 4195 ms on localhost (5/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 4203 ms on localhost (6/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 4249 ms on localhost (7/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 4256 ms on localhost (8/8)
[error] 16/02/01 22:03:37 INFO DAGScheduler: ResultStage 0 (foreach at Benchmark.scala:604) finished in 4.287 s
[error] 16/02/01 22:03:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
[error] 16/02/01 22:03:37 INFO DAGScheduler: Job 0 finished: foreach at Benchmark.scala:604, took 4.463525 s

Use CHAR/VARCHAR types in TPCDSTables

The TPC-DS schemas differ between spark-sql-perf's TPCDSTables and spark-master/branch-3.1's TPCDSBase (string vs. char/varchar). For example:

// spark
    "reason" ->
      """
        |`r_reason_sk` INT,
        |`r_reason_id` CHAR(16),
        |`r_reason_desc` CHAR(100)
      """.stripMargin,

// spark-sql-perf
    Table("reason",
      partitionColumns = Nil,
      'r_reason_sk               .int,
      'r_reason_id               .string,
      'r_reason_desc             .string),

To generate TPCDS table data for Spark (master/branch-3.1), it would be nice to use CHAR/VARCHAR types in TPCDSTables.

NOTE: This ticket comes from apache/spark#31886

Getting error when analyzing the columns

I am trying to generate data for TPC-DS. The data is generated, but analyzing the tables throws an exception.

This is the log: it starts analyzing the tables but fails on the first table.
Analyzing table catalog_sales.
Analyzing table catalog_sales columns cs_sold_date_sk, cs_sold_time_sk, cs_ship_date_sk, cs_bill_customer_sk, cs_bill_cdemo_sk, cs_bill_hdemo_sk, cs_bill_addr_sk, cs_ship_customer_sk, cs_ship_cdemo_sk, cs_ship_hdemo_sk, cs_ship_addr_sk, cs_call_center_sk, cs_catalog_page_sk, cs_ship_mode_sk, cs_warehouse_sk, cs_item_sk, cs_promo_sk, cs_order_number, cs_quantity, cs_wholesale_cost, cs_list_price, cs_sales_price, cs_ext_discount_amt, cs_ext_sales_price, cs_ext_wholesale_cost, cs_ext_list_price, cs_ext_tax, cs_coupon_amt, cs_ext_ship_cost, cs_net_paid, cs_net_paid_inc_tax, cs_net_paid_inc_ship, cs_net_paid_inc_ship_tax, cs_net_profit.

org.apache.spark.sql.AnalysisException: Column cs_sold_date_sk does not exist.
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.$anonfun$getColumnsToAnalyze$3(AnalyzeColumnCommand.scala:90)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.$anonfun$getColumnsToAnalyze$1(AnalyzeColumnCommand.scala:90)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.getColumnsToAnalyze(AnalyzeColumnCommand.scala:88)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInCatalog(AnalyzeColumnCommand.scala:116)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:51)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:234)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3702)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3700)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:234)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:671)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:666)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:672)
at com.databricks.spark.sql.perf.Tables$Table.analyzeTable(Tables.scala:280)
at com.databricks.spark.sql.perf.Tables.$anonfun$analyzeTables$2(Tables.scala:352)
at com.databricks.spark.sql.perf.Tables.$anonfun$analyzeTables$2$adapted(Tables.scala:351)
at scala.collection.immutable.List.foreach(List.scala:392)
at com.databricks.spark.sql.perf.Tables.analyzeTables(Tables.scala:351)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-4211249281701897:48)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-4211249281701897:100)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw$$iw.<init>(command-4211249281701897:102)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw.<init>(command-4211249281701897:104)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw.<init>(command-4211249281701897:106)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw.<init>(command-4211249281701897:108)
at line533d73f9645742d2b1ac0c523afb366f27.$read.<init>(command-4211249281701897:110)
at line533d73f9645742d2b1ac0c523afb366f27.$read$.<init>(command-4211249281701897:114)
at line533d73f9645742d2b1ac0c523afb366f27.$read$.<clinit>(command-4211249281701897)
at line533d73f9645742d2b1ac0c523afb366f27.$eval$.$print$lzycompute(:7)
at line533d73f9645742d2b1ac0c523afb366f27.$eval$.$print(:6)
at line533d73f9645742d2b1ac0c523afb366f27.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:745)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1021)
at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:574)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:600)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:570)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:204)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:773)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:726)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$10(DriverLocal.scala:431)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:408)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:653)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:645)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:486)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:598)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:391)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
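
One way to narrow this down (a sketch, assuming the tables were created with createExternalTables and the partition columns simply did not get registered in the metastore) is to check the schema Spark actually sees for the failing table and, if the partition column is missing, recreate the external tables with discoverPartitions = true before re-running analyzeTables:

// Check what the catalog thinks the failing table looks like.
spark.table("catalog_sales").printSchema()

// If cs_sold_date_sk (the partition column) is missing, recreate the tables
// so partitions are discovered, then analyze again.
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)
tables.analyzeTables(databaseName, analyzeColumns = true)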

Spark-sql-perf tutorial

Hi All,

I am new to Spark and Scala. I have the source code for Spark SQL Performance Tests and dsdgen.
Can anyone tell me how to proceed next? I have finished the build by running bin/run --help.
I am trying to execute bin/run --benchmark DatasetPerformance and it is giving me an error. But before that, it would be really great if someone could tell me how to get started with this. I understand the README is still under development. Is there any manual which I can follow?

setup the benchmark

Hi,

I'm using Spark 1.6.0 and I want to run the benchmark. However, I first need to set up the benchmark (I guess).

In the tutorial it is written that we have to execute these lines:

import com.databricks.spark.sql.perf.tpcds.Tables
val tables = new Tables(sqlContext, dsdgenDir, scaleFactor)
tables.genData(location, format, overwrite, partitionTables, useDoubleForDecimal, clusterByPartitionColumns, filterOutNullPartitionValues)
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(location, format, databaseName, overwrite)
// Or, if you want to create temporary tables
tables.createTemporaryTables(location, format)
// Setup TPC-DS experiment
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)

I understood that I have to run spark-shell first in order to run those lines, but the problem is that when I do import com.databricks.spark.sql.perf.tpcds.Tables I get the error "error: object sql is not a member of package com.databricks.spark". In com.databricks.spark there is only the avro package (I don't really know what that is).

Could you help me please? Maybe I misunderstood something.

Thanks
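
That import error usually just means the spark-sql-perf jar is not on the spark-shell classpath. A minimal sketch, assuming the jar was built with sbt assembly (the exact jar path below is an assumption):

$ bin/spark-shell --jars /path/to/spark-sql-perf-assembly.jar

scala> import com.databricks.spark.sql.perf.tpcds.Tables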

dataGen error

Spark 2.2.0

Caused by: java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 0
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, cs_sold_date_sk), StringType), true) AS cs_sold_date_sk#939if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, cs_sold_time_sk), StringType), true) AS cs_sold_time_sk#940if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, cs_ship_date_sk), StringType), true) AS cs_ship_date_sk#941if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, cs_bill_customer_sk), StringType), true) AS cs_bill_customer_sk#942if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, cs_bill_cdemo_sk), StringType), true) AS cs_bill_cdemo_sk#943if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 5, cs_bill_hdemo_sk), StringType), true) AS cs_bill_hdemo_sk#944if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 6, cs_bill_addr_sk), StringType), true) AS cs_bill_addr_sk#945if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 7, cs_ship_customer_sk), StringType), true) AS cs_ship_customer_sk#946if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 8, cs_ship_cdemo_sk), StringType), true) AS cs_ship_cdemo_sk#947if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 9, cs_ship_hdemo_sk), StringType), true) AS cs_ship_hdemo_sk#948if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 10, cs_ship_addr_sk), StringType), true) AS cs_ship_addr_sk#949if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 11, cs_call_center_sk), StringType), true) AS cs_call_center_sk#950if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 12, cs_catalog_page_sk), StringType), true) AS cs_catalog_page_sk#951if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 13, cs_ship_mode_sk), StringType), true) AS cs_ship_mode_sk#952if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 14, cs_warehouse_sk), StringType), true) AS cs_warehouse_sk#953if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 15, cs_item_sk), StringType), true) AS cs_item_sk#954if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 16, cs_promo_sk), StringType), true) AS cs_promo_sk#955if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 17, cs_order_number), StringType), true) AS cs_order_number#956if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 18, cs_quantity), StringType), true) AS cs_quantity#957if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 19, cs_wholesale_cost), StringType), true) AS cs_wholesale_cost#958if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 20, cs_list_price), StringType), true) AS cs_list_price#959if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 21, cs_sales_price), StringType), true) AS cs_sales_price#960if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 22, cs_ext_discount_amt), StringType), true) AS cs_ext_discount_amt#961if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 23, cs_ext_sales_price), StringType), true) AS cs_ext_sales_price#962if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 24, cs_ext_wholesale_cost), StringType), true) AS cs_ext_wholesale_cost#963if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 25, cs_ext_list_price), StringType), true) AS cs_ext_list_price#964if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 26, cs_ext_tax), StringType), true) AS cs_ext_tax#965if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 27, cs_coupon_amt), StringType), true) AS cs_coupon_amt#966if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 28, cs_ext_ship_cost), StringType), true) AS cs_ext_ship_cost#967if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 29, cs_net_paid), StringType), true) AS cs_net_paid#968if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 30, cs_net_paid_inc_tax), StringType), true) AS cs_net_paid_inc_tax#969if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 31, cs_net_paid_inc_ship), StringType), true) AS cs_net_paid_inc_ship#970if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 32, cs_net_paid_inc_ship_tax), StringType), true) AS cs_net_paid_inc_ship_tax#971if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getex
ternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 33, cs_net_profit), StringType), true) AS cs_net_profit#972 at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:173)
at org.apache.spark.sql.Row$class.isNullAt(Row.scala:191)
at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:165)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfCondExpr$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 16 more
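
For what it's worth, an ArrayIndexOutOfBoundsException: 0 while encoding these rows usually means some input lines had no columns at all, which can happen when dsdgen is missing or not executable on one of the executor nodes. A quick check (a sketch; the dsdgen path below is an assumption) is to ask every host whether it can run the binary:

// Report, per host, whether the dsdgen binary is present and executable.
sc.parallelize(1 to 1000, 200).map { _ =>
  (java.net.InetAddress.getLocalHost.getHostName,
   new java.io.File("/path/to/tpcds-kit/tools/dsdgen").canExecute)
}.distinct().collect().foreach(println)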

exception when running tables.genData()

Hi,
Thanks for your devotion in developing this tool first.
My environment: Cloudera CDH 5.6.0 + Spark 1.5.0
I encountered this error when I tried to generate data using tables.genData():

scala> tables.genData("/hdfs2/ds", "text", true, true, true, true, true)
java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset
  at java.lang.Class.getDeclaredMethods0(Native Method)
  at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
  at java.lang.Class.getDeclaredMethod(Class.java:2002)
  at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1431)
  ......

Previously I was just following the README.md; here's what I've done:

scala> import com.databricks.spark.sql.perf.tpcds.Tables
import com.databricks.spark.sql.perf.tpcds.Tables

scala> val tables = new Tables(sqlContext, "/root/tpc-ds/tools", 50)
tables: com.databricks.spark.sql.perf.tpcds.Tables = com.databricks.spark.sql.perf.tpcds.Tables@194765c4

I would like to know what could be going wrong here. Is it because I am using the Cloudera distribution of Spark, or because version 1.5.0 is not supported?
Thanks for your help in advance.

How to use it?

Hi,

I am new to sbt and do not know how to use this package. I downloaded the zip file and unzipped it. When I run 'sbt assembly' or 'sbt package', it requires DBC_USERNAME to be set. Do we have to have a Databricks Cloud account in order to use this package? Could you provide some detailed instructions on how to use this package? Thanks a lot!

-Xing

Cannot create tables in cluster mode - Unable to infer schema for Parquet

When I try to set up the TPCDS dataset on a cluster, I get an error that Spark is not able to infer the Parquet schema. This happens only in cluster mode; in local mode the setup finishes successfully.

I have installed the tpcds-kit on all nodes under the same path, and the location of the data is the same as well.

Specifically, I try:

./bin/spark-shell --jars /root/spark-sql-perf/target/scala-2.11/spark-sql-perf-assembly-0.5.0-SNAPSHOT.jar --master spark://master:7077

scala> import spark.sqlContext.implicits._

scala> import com.databricks.spark.sql.perf.tpcds.TPCDSTables

scala> val tables = new TPCDSTables(spark.sqlContext, "/tmp/tpcds-kit-src/tools", "1", false, false)

scala> tables.genData("/tmp/tpcds-data", "parquet", true, true, true, false, "", 100)

scala> sql("create database tpcds")

scala> tables.createExternalTables("/tmp/tpcds-data", "parquet", "tpcds", true, true)
Creating external table catalog_sales in database tpcds using data stored in /tmp/tpcds-data/catalog_sales.
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:181)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:77)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:142)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:139)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:121)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
  at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:352)
  at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:319)
  at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:302)
  at org.apache.spark.sql.SQLContext.createExternalTable(SQLContext.scala:544)
  at com.databricks.spark.sql.perf.Tables$Table.createExternalTable(Tables.scala:242)
  at com.databricks.spark.sql.perf.Tables$$anonfun$createExternalTables$1.apply(Tables.scala:311)
  at com.databricks.spark.sql.perf.Tables$$anonfun$createExternalTables$1.apply(Tables.scala:309)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at com.databricks.spark.sql.perf.Tables.createExternalTables(Tables.scala:309)
  ... 50 elided
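
In cluster mode this error usually means the generated files under /tmp/tpcds-data only exist on whichever node wrote them, so the process creating the table sees an empty directory when it tries to infer the Parquet schema. A sketch of the same flow against a shared filesystem (the hdfs:// path below is an assumption; any location visible to all nodes, such as HDFS or S3, should do):

val dataDir = "hdfs:///benchmarks/tpcds-data"  // assumed shared path
tables.genData(dataDir, "parquet", true, true, true, false, "", 100)
sql("create database tpcds")
tables.createExternalTables(dataDir, "parquet", "tpcds", true, true)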

AnalysisException when calling genData

Hi,
on Spark 2.3.2:

scala> tables.genData(
     |     location = rootDir,
     |     format = format,
     |     overwrite = true, // overwrite the data that is already there
     |     partitionTables = true, // create the partitioned fact tables
     |     clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
     |     filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
     |     tableFilter = "", // "" means generate all tables
     |     numPartitions = 1) 
org.apache.spark.sql.AnalysisException: cannot resolve '`cs_sold_date_sk`' given input columns: [catalog_sales_text.value]; line 8 pos 2;
'RepartitionByExpression ['cs_sold_date_sk], 200
+- Project [value#64]
   +- SubqueryAlias catalog_sales_text
      +- LogicalRDD [value#64], false
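
An error like cannot resolve 'cs_sold_date_sk' given input columns: [catalog_sales_text.value] suggests the intermediate text table only contains the raw value column, i.e. dsdgen produced no parsable rows (for example because the kit is missing or not executable on the worker nodes). A small smoke test (a sketch; the scratch path is an assumption) is to generate a tiny, non-partitioned table and inspect it before retrying the full run:

// Generate only the small `reason` table, unpartitioned, into a scratch location.
tables.genData("/tmp/tpcds-smoke", "parquet", true, false, true, false, "reason", 1)
spark.read.parquet("/tmp/tpcds-smoke/reason").show(5)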

failed to compile spark-sql-perf for spark 2.1.0

Hello,

I got the following error when trying to build the spark-sql-perf package for Spark 2.1.0. Has anyone seen this before, and does anyone have an idea how to fix it? Thanks.

Steps:
git clone https://github.com/databricks/spark-sql-perf.git
cd spark-sql-perf
vi build.sbt to change the sparkVersion := "2.1.0"
./build/sbt clean package

Compilation errors:
spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:89: Cannot prove that (org.apache.spark.sql.catalyst.trees.TreeNode[_$6], Int) forSome { type $6 } <:< (T, U).
[error] val indexMap = physicalOperators.map { case (index, op) => (op, index) }.toMap
[error] ^
[error] spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:97: value execute is not a member of org.apache.spark.sql.catalyst.trees.TreeNode[_$6]
[error] newNode.execute().foreach((row: Any) => Unit)
[error] ^

Get errors on run -benchmark DatasetPerformance

Hi experts,
I am using:
spark 1.6.1
scala 2.10.5

I get many errors when running the following command:
./bin/run -benchmark DatasetPerformance

The error is:
[info] Compiling 40 Scala sources to /data/tpc-ds/spark-sql-perf-master/target/scala-2.10/classes...
[error] /data/tpc-ds/spark-sql-perf-master/src/main/scala/com/databricks/spark/sql/perf/AggregationPerformance.scala:17: type mismatch;
[error] found : org.apache.spark.sql.DataFrame
[error] required: org.apache.spark.sql.Dataset[_]
[error] }.toDF("a", "b"))
[error] ^

Can you help me?
Thanks a lot.

generating data fails

The initial generation of data seems to fail:

15/09/07 23:54:23 INFO EventLoggingListener: Logging events to file:/tmp/spark-events/1/20150907-235419-84120842-5050-11875-0000

SELECT
 ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk,ss_cdemo_sk,ss_hdemo_sk,ss_addr_sk,ss_store_sk,ss_promo_sk,ss_ticket_number,ss_quantity,ss_wholesale_cost,ss_list_price,ss_sales_price,ss_ext_discount_amt,ss_ext_sales_price,ss_ext_wholesale_cost,ss_ext_list_price,ss_ext_tax,ss_coupon_amt,ss_net_paid,ss_net_paid_inc_tax,ss_net_profit
FROM
  store_sales_text
DISTRIBUTE BY
  ss_sold_date_sk
Exception in thread "main" java.lang.RuntimeException: [6.12] failure: ``union'' expected but `by' found

DISTRIBUTE BY

                        ^
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
        at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
        at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
        at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
        at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
        at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
        at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
        at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
        at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
        at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
        at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
        at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
        at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
        at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
        at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
        at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
        at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
        at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
        at org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:42)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:189)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
        at com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:141)
        at com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:204)
        at com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:202)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at com.databricks.spark.sql.perf.tpcds.Tables.genData(Tables.scala:202)
        at SimpleApp$.tpc(submitApp.scala:110)
        at SimpleApp$.main(submitApp.scala:147)
        at SimpleApp.main(submitApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/09/07 23:54:26 INFO SparkContext: Invoking stop() from shutdown hook
15/09/07 23:54:26 INFO SparkUI: Stopped Spark web UI at http://10.141.3.5:4040
15/09/07 23:54:26 INFO DAGScheduler: Stopping DAGScheduler
I0907 23:54:26.484743 12297 sched.cpp:1286] Asked to stop the driver
I0907 23:54:26.484998 12275 sched.cpp:752] Stopping framework '20150907-235419-84120842-5050-11875-0000'
15/09/07 23:54:26 INFO MesosSchedulerBackend: driver.run() returned with code DRIVER_STOPPED
15/09/07 23:54:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/09/07 23:54:26 INFO MemoryStore: MemoryStore cleared
15/09/07 23:54:26 INFO BlockManager: BlockManager stopped
15/09/07 23:54:26 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/07 23:54:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/09/07 23:54:26 INFO SparkContext: Successfully stopped SparkContext
15/09/07 23:54:26 INFO ShutdownHookManager: Shutdown hook called
15/09/07 23:54:26 INFO ShutdownHookManager: Deleting directory /local/vdbogert/tmp1/spark-d25f80c9-46e9-45e9-a92a-bd29e766a1a7
Reservation 3423559 cancelled
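
The DISTRIBUTE BY clause that genData emits for partitioned tables is HiveQL, which the plain SQLContext parser in Spark 1.5 does not accept; a HiveContext (which extends SQLContext) does. A minimal sketch, assuming a Hive-enabled Spark build; dsdgenDir, scaleFactor, location, format and the remaining flags are placeholders matching the genData call shown in the earlier issue:

import org.apache.spark.sql.hive.HiveContext
import com.databricks.spark.sql.perf.tpcds.Tables

val hiveContext = new HiveContext(sc)
val tables = new Tables(hiveContext, dsdgenDir, scaleFactor)
// With a HiveContext the DISTRIBUTE BY query parses, so partitioned tables can be generated.
tables.genData(location, format, overwrite, true, useDoubleForDecimal,
  clusterByPartitionColumns, filterOutNullPartitionValues)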

Not support partitionTables option

I used tables.genData(location, format, true, true, true, true, true) and hit the following problem. It works fine with tables.genData(location, format, true, false, true, true, true), so does this benchmark not support the partitionTables option?

java.lang.RuntimeException: [7.1] failure: ``union'' expected but identifier DISTRIBUTE found

After reading the source code, I found the problem occurs in this query:

    val query =
      s"""
         |SELECT
         | $columnString
         |FROM
         | $tempTableName
         |$predicates
         |DISTRIBUTE BY
         | $partitionColumnString
       """.stripMargin

Is there any suggestion for solving this problem?

I am running this on Spark 1.5.
