Comments (12)

C-Newman commented on July 19, 2024

I made changes in these three files:
/etc/spark/conf/spark-env.sh export SPARK_EXECUTOR_MEMORY=4G
/etc/spark/conf.dist/spark-env.sh export SPARK_EXECUTOR_MEMORY=4G
/etc/spark/conf/spark-defaults.conf spark.executor.memory=3.5G
Then I restarted Spark, re-ran build-all.sh and run-all.sh, and I still got only 2.1 GB.

lvsoft commented on July 19, 2024

Please don't set configurations in the Spark conf directory; leave it blank or at the defaults. You can set spark.* properties in 99-user_defined_properties.conf, which will override the values in the Spark conf directory.

In your case, it seems you'll need to set spark.storage.memoryFraction to 1 in 99-user_defined_properties.conf to let the Spark context utilize all of your memory. By default spark.storage.memoryFraction is 0.6, and it looks as if your server has only 4 GB of memory, which gives 4 GB * 0.6 = 2.4 GB. That is very close to your 2.1 GB result (once you discount memory occupied by the Linux kernel and other processes).
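
For example, the override could look something like this (a minimal sketch, assuming the whitespace-separated key/value layout used by the other HiBench conf files; the values are only illustrative):

# conf/99-user_defined_properties.conf -- overrides anything in the Spark conf directory
spark.executor.memory           4G
spark.storage.memoryFraction    1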

C-Newman commented on July 19, 2024

Thanks for your quick reply.

These servers have 32 GiB of memory. How does Spark decide I have only ~4 GiB of memory?

Your suggestion of increasing spark.storage.memoryFraction worked; I set it to 0.9, and the executors are now getting 3.1 GB of memory:
15/05/29 08:58:49 INFO MemoryStore: MemoryStore started with capacity 3.1 GB
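
(For reference, a rough sketch of where these numbers come from, assuming Spark 1.x static memory management and that the 4 GB is the --executor-memory 4G that HiBench passes to spark-submit, as seen in the submit commands later in this thread:)

MemoryStore capacity ≈ executor heap * spark.storage.memoryFraction * spark.storage.safetyFraction (default 0.9)
  default fraction 0.6:  4 GB * 0.6 * 0.9 ≈ 2.2 GB   (close to the 2.1 GB reported earlier)
  fraction 0.9:          4 GB * 0.9 * 0.9 ≈ 3.2 GB   (close to the 3.1 GB reported here, after JVM overhead)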

I am no longer getting the "Not enough space to cache" warnings in the container stderr files.

I'm still losing executors, however, and it appears to be related to Akka. The output on the terminal from run-all.sh has:
15/05/29 09:24:55 ERROR YarnScheduler: Lost executor 11 on anders-44: remote Akka client disassociated
15/05/29 09:24:55 WARN TaskSetManager: Lost task 22.0 in stage 7.0 (TID 274, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 25.0 in stage 7.0 (TID 277, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 267, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 18.0 in stage 7.0 (TID 270, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 23.1 in stage 7.0 (TID 282, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 21.0 in stage 7.0 (TID 273, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 269, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 26.0 in stage 7.0 (TID 278, anders-44): ExecutorLostFailure (executor 11 lost)

/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/kmeans/spark/python/bench.log has:
15/05/29 09:24:55 ERROR YarnScheduler: Lost executor 11 on anders-44: remote Akka client disassociated
15/05/29 09:24:55 INFO TaskSetManager: Re-queueing tasks for 11 from TaskSet 7.0
15/05/29 09:24:55 WARN TaskSetManager: Lost task 22.0 in stage 7.0 (TID 274, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 25.0 in stage 7.0 (TID 277, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 267, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 18.0 in stage 7.0 (TID 270, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 23.1 in stage 7.0 (TID 282, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 21.0 in stage 7.0 (TID 273, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 269, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 26.0 in stage 7.0 (TID 278, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 INFO DAGScheduler: Executor lost: 11 (epoch 10)
15/05/29 09:24:55 INFO BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
15/05/29 09:24:55 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(11, anders-44, 51603)
15/05/29 09:24:55 INFO TaskSetManager: Starting task 26.1 in stage 7.0 (TID 290, d, NODE_LOCAL, 1950 bytes)
15/05/29 09:24:55 INFO BlockManagerMaster: Removed 11 successfully in removeExecutor
15/05/29 09:25:02 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@anders-44:47219/user/Executor#1407812609] with ID 13

lvsoft commented on July 19, 2024

Oh, I see. However, I don't know where the 4 GB of memory comes from. Spark's default configuration is 512 MB per executor, and HiBench sets that value to 8 GB by default.

Could you please attach your report/<workload>/spark/python/conf/sparkbench/sparkbench.conf and report/<workload>/spark/python/monitor.html for reference?

The Akka problem is also caused by the lost executors, which may still be an OOM issue. Can you go to the executor node and find <YARN_HOME>/logs/userLogs/<your container>/<executor>/stderr, which should contain the last logs written before the executor was lost? Also check dmesg for Linux kernel messages to make sure the executor was not killed by the kernel due to OOM.
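
On the executor node, those checks could look something like this (the paths are placeholders; the actual container log location depends on yarn.nodemanager.log-dirs in your setup):

# last lines of the failed executor's stderr
tail -n 100 <YARN_HOME>/logs/userLogs/<application_id>/<container_id>/stderr
# check whether the kernel OOM-killer terminated the JVM
dmesg | grep -i -E 'killed process|out of memory'
grep -i oom /var/log/messages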

C-Newman commented on July 19, 2024

Attached is a .tar.gz file with the files you requested.

There was nothing interesting in dmesg, but I did find interesting information in /var/log/messages:

...
May 28 22:01:51 anders-41 abrt: detected unhandled Python exception
May 29 08:58:44 anders-41 abrt: detected unhandled Python exception
May 29 10:00:38 anders-41 abrt: detected unhandled Python exception in '/var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/kmeans.py'
May 29 10:00:38 anders-41 abrtd: New client connected
May 29 10:00:38 anders-41 abrtd: Directory 'pyhook-2015-05-29-10:00:38-3433' creation detected
May 29 10:00:38 anders-41 abrt-server[13789]: Saved Python crash dump of pid 3433 to /var/spool/abrt/pyhook-2015-05-29-10:00:38-3433
May 29 10:00:38 anders-41 abrtd: Executable '/var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/kmeans.py' doesn't belong to any package and ProcessUnpackaged is set to 'no'
May 29 10:00:38 anders-41 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2015-05-29-10:00:38-3433' exited with 1
May 29 10:00:38 anders-41 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2015-05-29-10:00:38-3433'
May 29 10:00:41 anders-41 abrt: detected unhandled Python exception
May 29 10:20:33 anders-41 abrt: detected unhandled Python exception
...

C-Newman commented on July 19, 2024

Should I be concerned about the warning I highlighted below? I see this for most, if not all, of the HiBench jobs:

pagerank with spark/python suffers a similar problem with lost executors, but eventually fails with a different error:
start HadoopPreparePagerank bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Deleted /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Submit MapReduce Job: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop jar /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/autogen/target/autogen-4.0-SNAPSHOT-jar-with-dependencies.jar HiBench.DataGen -t pagerank -b /usr/lib/hadoop-hdfs/HiBench/Pagerank -n Input -m 6 -r 2 -p 500000 -pbalance -pbalance -o text
15/05/29 16:25:44 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/29 16:25:59 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/29 16:26:13 INFO HiBench.PagerankData: Closing pagerank data generator...
finish HadoopPreparePagerank bench
Run pagerank/spark/python
Exec script: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/spark/python/bin/run.sh
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/00-default-properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/10-data-scale-profile.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/99-user_defined_properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/00-pagerank-default.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/10-pagerank-userdefine.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/spark/python/python.conf
start PythonSparkPagerank bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/pagerank/spark/python/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --jars /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/pagerank/spark/python/conf/sparkbench/spark.conf --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/pagerank.py /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input/edges /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output 3
15/05/29 16:29:10 ERROR YarnScheduler: Lost executor 1 on anders-44: remote Akka client disassociated
15/05/29 16:29:10 WARN TaskSetManager: Lost task 8.0 in stage 2.0 (TID 20, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 14, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 22, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 16, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 12, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:12 ERROR YarnScheduler: Lost executor 2 on anders-41: remote Akka client disassociated
15/05/29 16:29:12 WARN TaskSetManager: Lost task 5.0 in stage 2.0 (TID 17, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 13, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 4.1 in stage 2.0 (TID 25, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 15, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 24, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 0.0 in stage 0.1 (TID 26, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 2.0 in stage 0.1 (TID 28, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 1.0 in stage 0.1 (TID 27, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:14 WARN TaskSetManager: Lost task 4.2 in stage 2.0 (TID 34, anders-44): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)

Here's the tail of the log file:
15/05/29 16:29:14 INFO BlockManagerMasterActor: Registering block manager anders-44:32963 with 3.1 GB RAM, BlockManagerId(3, anders-44, 32963)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on anders-44:32963 (size: 11.6 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on anders-44:32963 (size: 8.1 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on anders-44:32963 (size: 64.2 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 4 to sparkExecutor@anders-44:42351
15/05/29 16:29:14 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 4 is 86 bytes
15/05/29 16:29:14 INFO TaskSetManager: Starting task 9.1 in stage 2.0 (TID 37, anders-44, PROCESS_LOCAL, 1255 bytes)
15/05/29 16:29:14 WARN TaskSetManager: Lost task 4.2 in stage 2.0 (TID 34, anders-44): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:177)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)

join with spark/python fails differently:
start PythonSparkJoin bench
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Join/Input
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Join/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Join/Output
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --jars /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/sparkbench/spark.conf --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/python_spark_sql_bench.py PythonJoin /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/../rankings_uservisits_join.hive
15/05/29 16:25:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
15/05/29 16:25:19 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/05/29 16:25:30 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
finish PythonSparkJoin bench

lvsoft commented on July 19, 2024

Sorry, could you upload the files somewhere and post a URL here? I can't see your attachments.

Besides, according to your /var/log/messages, the "abrt: detected unhandled Python exception" entries are generated by ABRT (the Automatic Bug Reporting Tool), which hooks into Python by overriding sys.excepthook.

I'm not sure about ABRT, but it seems it intercepted Python's exception handling and failed to produce a useful report. My suggestion: uninstall ABRT; the Python exceptions should then be printed to the console or logs, which should give you a clue for tracing.
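
On a RHEL/CentOS system (which the abrtd entries above suggest), that could look something like the following; package and service names vary by release, so treat them as illustrative:

# stop the ABRT daemon and remove its Python hook so exceptions print normally
service abrtd stop
yum remove abrt-addon-python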

C-Newman commented on July 19, 2024

Qi Lv --

I was able to determine that I was encountering an OOM problem.

I changed hibench.scale.profile to tiny, and now run-all.sh runs to completion.
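
(That is presumably a one-line change in conf/99-user_defined_properties.conf, something like:)

hibench.scale.profile    tiny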

However, it looks like there is still a problem.

I get this from aggregation with spark/scala:

start ScalaSparkAggregation bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/spark.conf --class com.intel.sparkbench.sql.ScalaSparkSQLBench --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar ScalaAggregation /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/../uservisits_aggre.hive
15/06/05 16:17:47 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
15/06/05 16:17:48 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/06/05 16:17:54 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
finish ScalaSparkAggregation bench

I saw these three commands in the output, so I ran them separately:
$ /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/aggregation/prepare/prepare.sh
$ /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
$ SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/sparkbench.conf /usr/lib/spark/bin/spark-submit --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/spark.conf --class com.intel.sparkbench.sql.ScalaSparkSQLBench --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar ScalaAggregation /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/../uservisits_aggre.hive

It seemed to progress well until this point, where it complained that the uservisits table already exists:

15/06/08 10:38:20 INFO HiveMetaStore: 0: create_table: Table(tableName:uservisits, dbName:DEFAULT, owner:hdfs, createTime:1433774299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:sourceip, type:string, comment:null), FieldSchema(name:desturl, type:string, comment:null), FieldSchema(name:visitdate, type:string, comment:null), FieldSchema(name:adrevenue, type:double, comment:null), FieldSchema(name:useragent, type:string, comment:null), FieldSchema(name:countrycode, type:string, comment:null), FieldSchema(name:languagecode, type:string, comment:null), FieldSchema(name:searchword, type:string, comment:null), FieldSchema(name:duration, type:int, comment:null)], location:/usr/lib/hadoop-hdfs/HiBench/Aggregation/Input/uservisits, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)

15/06/08 10:38:20 INFO audit: ugi=hdfs ip=unknown-ip-addr cmd=create_table: Table(tableName:uservisits, dbName:DEFAULT, owner:hdfs, createTime:1433774299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:sourceip, type:string, comment:null), FieldSchema(name:desturl, type:string, comment:null), FieldSchema(name:visitdate, type:string, comment:null), FieldSchema(name:adrevenue, type:double, comment:null), FieldSchema(name:useragent, type:string, comment:null), FieldSchema(name:countrycode, type:string, comment:null), FieldSchema(name:languagecode, type:string, comment:null), FieldSchema(name:searchword, type:string, comment:null), FieldSchema(name:duration, type:int, comment:null)], location:/usr/lib/hadoop-hdfs/HiBench/Aggregation/Input/uservisits, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)

15/06/08 10:38:20 ERROR RetryingHMSHandler: AlreadyExistsException(message:Table uservisits already exists)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1349)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1426)

The trace continues for quite a while, ending with this:

15/06/08 10:38:20 ERROR Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)
15/06/08 10:38:20 INFO PerfLogger: </PERFLOG method=Driver.execute start=1433774299954 end=1433774300289 duration=335 from=org.apache.hadoop.hive.ql.Driver>
15/06/08 10:38:20 INFO PerfLogger:
15/06/08 10:38:20 INFO PerfLogger: </PERFLOG method=releaseLocks start=1433774300289 end=1433774300289 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/06/08 10:38:20 ERROR HiveContext:
HIVE FAILURE OUTPUT
SET spark.sql.shuffle.partitions=6
OK
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
SET mapreduce.job.maps=6
SET mapreduce.job.reduces=2
SET hive.stats.autogather=false
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)
END HIVE FAILURE OUTPUT
Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:312)

mckellyln commented on July 19, 2024

Is that you, C-Newman?

lvsoft commented on July 19, 2024

@C-Newman Yes, it will complain with AlreadyExistsException if you execute the command line manually right after running the workload via HiBench's script. You'll need to remove the metastore_db folder in your current working directory before you try the command manually; it is created by the previous Spark SQL-related operation and is normally cleaned up by HiBench's workload script.
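
Concretely, before re-running the spark-submit command by hand, something like this from the same working directory should clear the stale local metastore:

# remove the Derby metastore (and its log) left behind by the previous Spark SQL run
rm -rf ./metastore_db ./derby.log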

C-Newman commented on July 19, 2024

mckellyln, yes it's me. I sent a reply to the email address I last had from you, then checked its domain name and it appears to have lapsed.

mckellyln commented on July 19, 2024

@C-Newman, so good to hear from you! I will send you an email offline.
