Comments (9)

marmbrus commented on July 3, 2024

It would be great to add these. I think the biggest barrier here is going to be adapting the data generation code to produce all of the fact tables. Right now, we use the streaming mode of dsdgen and I don't think you can do this for the tables that have foreign key dependencies.

from spark-sql-perf.

jfchen commented on July 3, 2024

What do you think about decoupling dsdgen from the test kit itself and
simply providing instructions on how to run dsdgen by itself?

Is it correct to assume that most users are probably using Hive and have
created these tables and loaded data already?

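For our tests, we used externally generated data on HDFS (path passed in),
created a DataFrame, and used the spark-csv data source to load the data, like this:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

// Load a pipe-delimited data file and register it as a temporary table.
def importTable(sqlContext: SQLContext, filename: String, schema: StructType, tablename: String): Unit = {
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("delimiter", "|")
    .load(filename)
  df.registerTempTable(tablename)
}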

A few queries do have modifications. I thought I should mention that, but
they should be good enough for this kit.

I will package the queries up and send them soon.

From: Michael Armbrust [email protected]
To: databricks/spark-sql-perf [email protected]
Cc: Jesse F Chen/San Francisco/IBM@IBMUS
Date: 09/17/2015 12:26 PM
Subject: Re: [spark-sql-perf] Can we put all working queries into this
test kit? There are 86 out of 99 working in Spark 1.5 (#23)

marmbrus commented on July 3, 2024

We don't necessarily need to block adding the queries on adding the data generation, but in my experience generating larger scale factors (SF1500 - SF15000) is actually a significant challenge. So I would definitely like to add support for generating them in the context of a Spark job.

jfchen commented on July 3, 2024

It would definitely be nice to have data generation done in a Spark job. What is the best way to upload a gzip file containing all 86 queries as text files?

marmbrus commented on July 3, 2024

I wouldn't upload them as a zip file. I'd do one of the following:

  • Add the files in src/main/resources/... and create a harness that reads them from the classloader and creates query objects for each. Put this as another trait in the tpcds directory.
  • Hard code them as strings as we have in the other tpcds files
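
As a rough illustration of the first option, here is a minimal sketch of reading query text from the classloader. The resource directory `tpcds/queries`, the `.sql` extension, and the `NamedQuery` and `ResourceQueries` names are all assumptions made for this example; they are not part of spark-sql-perf, and the real harness would create the kit's own query objects instead.

```scala
import scala.io.Source

// Hypothetical stand-in for the kit's query objects; not the real spark-sql-perf class.
case class NamedQuery(name: String, sqlText: String)

object ResourceQueries {
  // Assumed layout: src/main/resources/tpcds/queries/<name>.sql (hypothetical path).
  val resourceDir = "tpcds/queries"

  // Read one query file from the classpath; returns None if the resource is missing.
  def load(name: String, loader: ClassLoader = getClass.getClassLoader): Option[NamedQuery] =
    Option(loader.getResourceAsStream(s"$resourceDir/$name.sql")).map { in =>
      val source = Source.fromInputStream(in, "UTF-8")
      try NamedQuery(name, source.mkString.trim) finally source.close()
    }

  // Load every query that is present, skipping names that have no file.
  def loadAll(names: Seq[String]): Seq[NamedQuery] = names.flatMap(load(_))
}
```

Returning `Option` rather than throwing keeps a missing file from aborting the whole load, which may be useful while only some of the 99 queries are available.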

0x0ece commented on July 3, 2024

Do you have any update on this? I'd be interested in testing out the new queries... Thanks! E.

jfchen commented on July 3, 2024

This is still being worked on. Stay tuned, please. We will implement this using the first option from Michael's comment above, which makes sense.

0x0ece commented on July 3, 2024

News? :) I may have some free time in the next few days. If you could PR the queries, I can have a look at how to add some Scala glue... thanks!

0x0ece commented on July 3, 2024

Great job!