
Comments (29)

sridharpothamsetti commented on July 23, 2024

Hi Galvin,

Try using the code from tag v0.4.3 rather than from branch master; it should work fine. At the same time, comment out the dbc_user_name-related lines in build.sbt to avoid errors. The latest branch also contains ML code.
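For reference, a hedged sketch of the kind of build.sbt lines to comment out; the exact keys depend on your checkout, so treat the names below as illustrative:

// build.sbt -- comment out the Databricks-cloud deployment settings
// (illustrative names; match whatever dbc-prefixed keys your checkout has)
// dbcUsername := sys.env("DBC_USERNAME")
// dbcPassword := sys.env("DBC_PASSWORD")
// dbcApiUrl := sys.env("DBC_URL")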

Thanks.


GalvinYang commented on July 23, 2024

I have executed the TPC-DS 1.4 queries, with 92/99 passing, and I have written a guide to using spark-sql-perf.
Anyone who runs into problems can follow the instructions; here's the link:
https://galvinyang.github.io/2016/07/09/spark-sql-perf%20test/


reshragh commented on July 23, 2024

Hi @GalvinYang
Thanks a ton for your blog. It has been super helpful, especially for someone starting from scratch.
But I am having trouble retrieving results when I follow the README file.
tpcds.createResultsTable() gives me a createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS error.
sqlContext.table("sqlPerformance") gives me org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'sparktest'.
When I try to get the results from a particular run by using sqlContext.table("sqlPerformance").filter("timestamp = 1476844414082"),
I get this: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'sparktest'.
This doesn't make sense because, at the very end of the experiment run, I got Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1476844414082.
Do you have any idea how to solve this?
Thanks in advance!
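One hedged workaround, since the results table is never registered in the metastore: the experiment output is plain JSON under the path from the log line above, so it can be loaded directly without any table lookup.

// sketch, assuming Spark 1.6+; the path is the one your run reported
val results = sqlContext.read.json("/spark/sql/performance/timestamp=1476844414082")
results.printSchema()
results.show()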


hchawla1 commented on July 23, 2024

Can you paste the errors you get when running bin/run --benchmark DatasetPerformance?

This is the default test suite/benchmark class; once you are able to compile and run it, you will see static output.


npaluskar commented on July 23, 2024

The build is incomplete: it gives me the entire log as error messages, so I cannot figure out what is going wrong in the build. Execution gets stuck after a certain step. Please find the attached log.
spark-sql-perf-build-log.txt


hchawla1 commented on July 23, 2024

I don't see any error.

Let the program run to completion; this is not the complete log.


npaluskar commented on July 23, 2024

Hi all, I am getting the following error:
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/Dataset;

I am using Spark 1.6.1 with Scala 2.11.8. Do I need to change the Scala version to get it to work?
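A hedged note on this one: a createDataFrame that returns a Dataset only exists in Spark 2.x, so the error suggests the jar was compiled against a Spark 2.x dependency while running on 1.6.1. A sketch of the build.sbt pins for a 1.6.1 cluster, assuming your checkout defines the sbt-spark-package sparkVersion key the way this project's build.sbt does:

scalaVersion := "2.10.5"
sparkVersion := "1.6.1" // assumption: adjust the key name if your build.sbt differs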


hchawla1 commented on July 23, 2024

NoSuchMethodError usually means that you have an incompatibility between libraries.
I think the default Scala for Spark 1.6.1 is 2.10 (you can try that).


npaluskar commented on July 23, 2024

I tried with both 2.10.4 and 2.10.5, and I am still facing the same issue.


sridharpothamsetti commented on July 23, 2024

Hi,

I am facing the issues below when I try to run this code. Could anyone advise on how to get past them?

  1. For the command bin/run --benchmark DatasetPerformance, it gets stuck for hours, as in the spark-sql-perf-build-log.txt log attached by npaluskar above.
  2. I am also facing the NoSuchMethodError issue with Scala 2.10.4 and Spark 1.6.1. Please let us know the resolution, if any.
  3. If I use the Spark 2.0.0 preview, I am able to generate data and create external tables, but I get stuck at the val tpcds = new TPCDS (sqlContext = sqlContext) statement due to a Scala compiler crash, as mentioned in #70.


npaluskar commented on July 23, 2024

  1. bin/run --benchmark DatasetPerformance getting stuck for hours --> This happened to me when I ran the command a second time. I am not sure why, but it happens every time the command is run a second time, whereas my first run was successful. So you might want to restart the session and try again.
  2. The NoSuchMethodError with Scala 2.10.4 and Spark 1.6.1 --> I am still trying to figure it out.
  3. Getting stuck at val tpcds = new TPCDS (sqlContext = sqlContext) on the Spark 2.0.0 preview, as mentioned in #70 --> I am not aware of this, as I am still stuck at step 2.


hchawla1 commented on July 23, 2024

Can you verify your TPCDS.scala class:

https://github.com/databricks/spark-sql-perf/blob/v0.4.3/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala

Are you using Spark 2.0?


npaluskar commented on July 23, 2024

Yes, TPCDS.scala is the same for me. I am using Spark 1.6.1.


sridharpothamsetti commented on July 23, 2024

Yes, chawla. I am using the same file you mentioned, and I am on Spark 2.0.0.


hchawla1 commented on July 23, 2024

There are more APIs in Spark 2.0 (especially for spark-sql-perf)...

From your spark-sql-perf-master directory, try sbt.
It should give you a command prompt;
then type compile,
and then run --benchmark DatasetPerformance:

spark-sql-perf-master:> sbt

compile
[warn]....
[success]
run --benchmark DatasetPerformance

Or, alternatively, from the spark-sql-perf-master directory try ./bin/run --benchmark DatasetPerformance.


sridharpothamsetti commented on July 23, 2024

Yes. I used sbt to compile and create a jar for spark-sql-perf-master, and used it to start the Spark shell with the command bin/spark-shell --jars /home/cloudera/spark-sql-perf-master/target/scala-2.10/spark-sql-perf_2.10-0.4.8-SNAPSHOT.jar.

./bin/run --benchmark DatasetPerformance ran well this time, as suggested by nachiket,
and I ran the commands below for the experiment:

import com.databricks.spark.sql.perf.tpcds.Tables
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val tables = new Tables(sqlContext, "/home/cloudera/tpcds-kit-master/tools/", 1)
tables.genData("hdfs://192.168.126.130:8020/tmp/temp2", "parquet", false, false, false, false, false)
tables.createExternalTables("hdfs://192.168.126.128:8020/tmp/temp2", "parquet", "sparkperf", false)
// Setup TPC-DS experiment
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)

This last command crashed the compiler and caused the Spark 2.0.0 shell to restart.


sridharpothamsetti commented on July 23, 2024

Hi Nachiket,
I tried with the Spark 2.0.0 preview and Scala 2.11.8 (I changed build.sbt in the spark-sql-perf code and recompiled), and the commands ran fine.
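For reference, a hedged sketch of the build.sbt change described here, under the same sparkVersion-key assumption as above; the exact preview version string may differ in your checkout:

scalaVersion := "2.11.8"
sparkVersion := "2.0.0-preview" // assumption: match the exact Spark artifact you installed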
Thanks.


GalvinYang commented on July 23, 2024

Hi, I have tried spark-sql-perf with Spark 2.0 as above, and it fails at
val tpcds = new TPCDS (sqlContext = sqlContext); the command crashes the compiler and causes the Spark 2.0.0 shell to restart.
I then wanted to try compiling the jar with Scala 2.11.8, changing scalaVersion := "2.10.4" to "2.11.8" in build.sbt.
But it fails at libraryDependencies += "com.typesafe" %% "scalalogging-slf4j" % "1.1.0";
the package cannot be found.
Can anyone give a solution?
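One workaround that may help (hedged, untested here): scalalogging-slf4j 1.1.0 was only ever published for Scala 2.10. For 2.11 the library was renamed, so the dependency can be swapped for the renamed artifact, though the sources may then need matching import changes:

// in build.sbt: replace the 2.10-only artifact with the renamed 2.11 build
libraryDependencies += "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2"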


GalvinYang commented on July 23, 2024

Thanks for your answer. I have checked out v0.4.3 and commented out the dbc-related lines, but it then failed to compile:

[info] Compiling 20 Scala sources to /data/ygmz/sparksqlperf/spark-sql-perf/target/scala-2.10/classes...
[warn] /data/ygmz/sparksqlperf/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/CpuProfile.scala:107: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
[warn]         case Row(stackLines: Seq[String], count: Long) => stackLines.map(toStackElement) -> count :: Nil
[warn]                              ^
[error] /data/ygmz/sparksqlperf/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala:102: object creation impossible, since:
[error] it has 2 unimplemented members.
[error] /** As seen from anonymous class $anon, the missing signatures are as follows.
[error]  *  For convenience, these are usable as stub implementations.
[error]  */
[error]   def bufferEncoder: org.apache.spark.sql.Encoder[com.databricks.spark.sql.perf.SumAndCount] = ???
[error]   def outputEncoder: org.apache.spark.sql.Encoder[Double] = ???
[error]   val average = new Aggregator[Long, SumAndCount, Double] {
[error]                     ^
[warn] one warning found
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 328 s, completed 2016-7-7 11:16:06

How do I get past this?
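That error is the Spark 2.0 Aggregator API: 2.0 added abstract bufferEncoder and outputEncoder members, which this anonymous Aggregator predates. A hedged sketch of the two missing members, following the stubs printed in the error and assuming SumAndCount is a case class, as its definition in this repo appears to be:

// add inside the anonymous Aggregator in DatasetPerformance.scala
import org.apache.spark.sql.{Encoder, Encoders}

override def bufferEncoder: Encoder[SumAndCount] = Encoders.product[SumAndCount]
override def outputEncoder: Encoder[Double] = Encoders.scalaDouble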


baikai commented on July 23, 2024

Hi all,

I am trying to generate TPC-DS data in parallel with spark-sql-perf, but Spark throws exceptions like the one below:
...
scala> tables.genData("hdfs://ocdpCluster/tpcds", "parquet", true, true, false, true, false)
Pre-clustering with partitioning columns with query
SELECT
cs_sold_date_sk,cs_sold_time_sk,cs_ship_date_sk,cs_bill_customer_sk,cs_bill_cdemo_sk,cs_bill_hdemo_sk,cs_bill_addr_sk,cs_ship_customer_sk,cs_ship_cdemo_sk,cs_ship_hdemo_sk,cs_ship_addr_sk,cs_call_center_sk,cs_catalog_page_sk,cs_ship_mode_sk,cs_warehouse_sk,cs_item_sk,cs_promo_sk,cs_order_number,cs_quantity,cs_wholesale_cost,cs_list_price,cs_sales_price,cs_ext_discount_amt,cs_ext_sales_price,cs_ext_wholesale_cost,cs_ext_list_price,cs_ext_tax,cs_coupon_amt,cs_ext_ship_cost,cs_net_paid,cs_net_paid_inc_tax,cs_net_paid_inc_ship,cs_net_paid_inc_ship_tax,cs_net_profit
FROM
catalog_sales_text

DISTRIBUTE BY
cs_sold_date_sk
.
Generating table catalog_sales in database to hdfs://ocdpCluster/tpcds/catalog_sales with save mode Overwrite.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
java.io.FileNotFoundException: Path is not a file: /tpcds/catalog_sales/cs_sold_date_sk=2450815
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
...

How can I resolve this?
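A hedged observation: the failure occurs during the pre-clustering pass (note the "Pre-clustering with partitioning columns" message above). Going by the positional genData calls quoted elsewhere in this thread, the fourth boolean is the cluster-by-partition-columns flag, so one workaround to try is generating without it:

scala> tables.genData("hdfs://ocdpCluster/tpcds", "parquet", true, true, false, false, false)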

Thanks


lordk911 commented on July 23, 2024

I am using spark-sql-perf 0.4.3. I get this error when generating data:
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD


jameszhouyi commented on July 23, 2024

Hi @GalvinYang,
I saw your blog, which was very helpful for understanding the spark-sql-perf tool. Now I have a question: if I use Spark 1.6.2 for the TPC-DS benchmark, does that mean I can't use tags/v0.4.3, since that code is based on Spark 2.0.0, and that I instead have to use an older version (e.g., tags/v0.3.2, also setting scalaVersion := "2.10.4" with sparkVersion := "1.6.2" in build.sbt) to compile the spark-sql-perf jar and launch spark-shell for testing?
Thanks in advance!


GalvinYang commented on July 23, 2024

Hi Zhou,
Sorry for the late reply.
I tried it with Spark 2.0 before because we needed to verify the SQL support in Spark 2.0. If you want to test with Spark 1.6, you can try your approach; if it doesn't work, try different versions.
That said, I don't think it's necessary to test on Spark 1.6, since many people have done so before, and you can find their results on Google.


jameszhouyi commented on July 23, 2024

Hi @GalvinYang ,
Thanks a lot for your reply and blog! Now I can compile the spark-sql-perf jar with tags/v0.3.2 after following the experience in your blog. Your blog is very helpful for us :)


jameszhouyi commented on July 23, 2024

Hi experts,
I am now using spark-sql-perf to generate 1 TB of TPC-DS data with partitionTables enabled, like tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false). But I found that some of the big tables (e.g., store_sales) are slow to complete: all the data is first written to /tpcds_1t/store_sales/_temporary/0 and then moved to /tpcds_1t/store_sales on HDFS, and these 'moves' take a lot of time. Has anyone come across the same issue? How can it be resolved?

Thanks in advance!
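The slow step is the job commit: the default FileOutputCommitter stages all output under _temporary and renames it into place in a single job-level pass, which is expensive for thousands of partition directories. One mitigation worth trying (hedged; it requires Hadoop 2.7+) is the v2 commit algorithm, which moves each task's output into place at task commit instead:

// set before calling genData
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")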


wangli86 commented on July 23, 2024

@GalvinYang
Hi,
I am facing the issue below when I try to run this code. For the command
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)
I get:
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier CREATE found

CREATE DATABASE IF NOT EXISTS mydata
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
..............

I used spark-sql-perf 0.2.4, Scala 2.10.5, and Spark 1.6.1.
But this command:
tables.createTemporaryTables("file:///home/wl/tpctest/", "parquet") has no problem,
and the tpcds.createResultsTable() command fails the same way as tables.createExternalTables().
Can you help me resolve this problem?
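A hedged diagnosis: in Spark 1.6 the plain SQLContext parser does not understand CREATE DATABASE, which createExternalTables issues first, while createTemporaryTables issues no such DDL, which is why it works. The usual fix is to build the Tables helper over a HiveContext, whose parser handles that DDL; a sketch, with the dsdgen path and scale factor as illustrative placeholders:

import org.apache.spark.sql.hive.HiveContext
import com.databricks.spark.sql.perf.tpcds.Tables

val hiveContext = new HiveContext(sc)
val tables = new Tables(hiveContext, "/home/wl/tpcds-kit/tools/", 1)
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)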


ktania commented on July 23, 2024

Hello everyone,

I need some help running the benchmark. While executing the query below, I get the attached exception in the Spark shell. Please help me resolve it.

val experiment = tpcds.runExperiment(tpcds.interactiveQueries)

Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1489665992654
17/03/16 17:37:07 ERROR FileOutputCommitter: Mkdirs failed to create file:/spark/sql/performance/timestamp=1489665992654/_temporary/0
17/03/16 17:37:07 WARN TaskSetManager: Stage 171 contains a task of very large size (330 KB). The maximum recommended task size is 100 KB.
17/03/16 17:37:07 WARN TaskSetManager: Lost task 0.0 in stage 171.0 (TID 5124, 10.6.45.231, executor 0): java.io.IOException: Mkdirs failed to create file:/spark/sql/performance/timestamp=1489665992654/_temporary/0/_temporary/attempt_20170316173707_0171_m_000000_0 (exists=false, cwd=file:/home/taniya/spark/spark-2.1.0-bin-hadoop2.7/work/app-20170316172533-0001/0)

execution.docx

Attached is the full log.


**** The issue is resolved. The error was due to a permissions issue.

Thanks,
Tania


ktania commented on July 23, 2024

@GalvinYang Thanks for your blog. It helped me a lot in getting the test running!
@reshragh I am facing a similar issue viewing the results. Is it resolved for you?

While retrieving results using tpcds.createResultsTable(), I get a createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS error.
And I figured out from the source code that there is no such method as createResultsTable in TPCDS.scala.

sqlContext.table("sqlPerformance") gives me org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sqlperformance' not found in database 'xyz', even though I got Results written to table: 'sqlPerformance' at /spark/sql/performance/timestamp=1489749887680.

I tried to view the results from the console by importing the JSON:

val df = spark.read.json("/spark/sql/performance/timestamp=1489749887680/part-00000-8d5f1472-0846-4ec5-81e1-358a7a271840.json")

df.show()

+--------------------+---------+--------------------+------+-------------+
| configuration|iteration| results| tags| timestamp|
+--------------------+---------+--------------------+------+-------------+
|[8,[file:/home/ta...| 1|[[5.54E-4,Wrapped...|[true]|1489749887680|
|[8,[file:/home/ta...| 2|[[5.55E-4,Wrapped...|[true]|1489749887680|
|[8,[file:/home/ta...| 3|[[6.49E-4,Wrapped...|[true]|1489749887680|
+--------------------+---------+--------------------+------+-------------+

But I am not able to interpret the results from here.
Is there any other way to retrieve the results? Any help is highly appreciated.
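One hedged way to make the nested results column readable: each row's results field is an array of per-query entries, so exploding it and selecting the name and time fields yields a per-query table. The field names below follow this repo's BenchmarkResult class but may differ across versions:

import org.apache.spark.sql.functions._

val df = spark.read.json("/spark/sql/performance/timestamp=1489749887680/part-00000-8d5f1472-0846-4ec5-81e1-358a7a271840.json")
df.withColumn("result", explode(col("results")))
  .select(col("result.name"), col("result.executionTime"))
  .show(100, false)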

Thanks in advance!


dreamerHarshit commented on July 23, 2024

Hi @GalvinYang,

Thanks for your blog. Is it also available in English, or is there another blog like it?

Thanks in advance

