I made changes in these three files:
/etc/spark/conf/spark-env.sh: export SPARK_EXECUTOR_MEMORY=4G
/etc/spark/conf.dist/spark-env.sh: export SPARK_EXECUTOR_MEMORY=4G
/etc/spark/conf/spark-defaults.conf: spark.executor.memory=3.5G
I then restarted Spark, re-ran build-all.sh and run-all.sh, and I still got 2.1 GB.
from hibench.
Please don't set configurations in the Spark conf directory; leave it blank or at the defaults. You can set spark.* properties in 99-user_defined_properties.conf, which will override values in the Spark conf directory.
In your case, it seems you'll need to set spark.storage.memoryFraction to 1 in 99-user_defined_properties.conf to let the Spark context use all your memory. By default spark.storage.memoryFraction is 0.6, and it seems your server has only 4 GB of memory: 4 GB × 0.6 = 2.4 GB, which is very close to your 2.1 GB result (once you discount memory occupied by the Linux kernel and other processes).
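One detail worth noting: in the Spark 1.x memory model, the MemoryStore capacity is, as far as I recall, roughly heap × spark.storage.memoryFraction × spark.storage.safetyFraction (the safety fraction defaulting to 0.9), which matches the observed numbers even more closely. A minimal sketch of the arithmetic (the helper function is illustrative, not a Spark API):

```python
# Rough Spark 1.x MemoryStore capacity estimate; the helper is illustrative,
# not part of any Spark API.
def storage_capacity_gb(executor_heap_gb,
                        memory_fraction=0.6,   # spark.storage.memoryFraction default
                        safety_fraction=0.9):  # spark.storage.safetyFraction default
    return executor_heap_gb * memory_fraction * safety_fraction

# 4 GB heap with defaults -> 2.16 GB, close to the 2.1 GB reported above
print(round(storage_capacity_gb(4.0), 2))
# fraction raised to 0.9 -> 3.24 GB, close to the 3.1 GB seen later in the thread
print(round(storage_capacity_gb(4.0, memory_fraction=0.9), 2))
```

The small remaining gap comes from JVM overhead: the capacity is computed against the actual runtime heap, which is slightly below the configured 4 GB.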
Thanks for your quick reply.
These servers have 32 GiB of memory. How does Spark decide I have only ~4 GiB?
Your suggestion of increasing spark.storage.memoryFraction worked; I set it to 0.9 and they are now getting 3.1 GB of memory:
15/05/29 08:58:49 INFO MemoryStore: MemoryStore started with capacity 3.1 GB
I am no longer getting the "Not enough space to cache" warnings in the container stderr files.
I'm still losing executors, however, and it appears to be related to Akka. The output on the terminal from run-all.sh has:
15/05/29 09:24:55 ERROR YarnScheduler: Lost executor 11 on anders-44: remote Akka client disassociated
15/05/29 09:24:55 WARN TaskSetManager: Lost task 22.0 in stage 7.0 (TID 274, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 25.0 in stage 7.0 (TID 277, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 267, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 18.0 in stage 7.0 (TID 270, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 23.1 in stage 7.0 (TID 282, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 21.0 in stage 7.0 (TID 273, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 269, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 26.0 in stage 7.0 (TID 278, anders-44): ExecutorLostFailure (executor 11 lost)
/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/kmeans/spark/python/bench.log has:
15/05/29 09:24:55 ERROR YarnScheduler: Lost executor 11 on anders-44: remote Akka client disassociated
15/05/29 09:24:55 INFO TaskSetManager: Re-queueing tasks for 11 from TaskSet 7.0
15/05/29 09:24:55 WARN TaskSetManager: Lost task 22.0 in stage 7.0 (TID 274, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 25.0 in stage 7.0 (TID 277, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 267, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 18.0 in stage 7.0 (TID 270, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 23.1 in stage 7.0 (TID 282, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 21.0 in stage 7.0 (TID 273, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 269, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 WARN TaskSetManager: Lost task 26.0 in stage 7.0 (TID 278, anders-44): ExecutorLostFailure (executor 11 lost)
15/05/29 09:24:55 INFO DAGScheduler: Executor lost: 11 (epoch 10)
15/05/29 09:24:55 INFO BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
15/05/29 09:24:55 INFO BlockManagerMasterActor: Removing block manager BlockManagerId(11, anders-44, 51603)
15/05/29 09:24:55 INFO TaskSetManager: Starting task 26.1 in stage 7.0 (TID 290, d, NODE_LOCAL, 1950 bytes)
15/05/29 09:24:55 INFO BlockManagerMaster: Removed 11 successfully in removeExecutor
15/05/29 09:25:02 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@anders-44:47219/user/Executor#1407812609] with ID 13
Oh, I see. However, I don't know where the 4 GB memory comes from. The default configuration of Spark is 512 MB per executor, and HiBench sets that value to 8 GB by default.
Could you please attach your report/<workload>/spark/python/conf/sparkbench/sparkbench.conf and report/<workload>/spark/python/monitor.html for reference?
The Akka problem is also caused by the executor loss, which may still be an OOM issue. Can you go to the executor node and look for <YARN_HOME>/logs/userLogs/<your container>/<executor>/stderr, which should contain the last logs written before the executor was lost? Also check dmesg for Linux kernel messages to make sure the executor was not killed by the kernel due to OOM.
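To illustrate the kind of check being suggested, here is a small sketch that scans log text for OOM-killer traces. The sample lines are fabricated for the example; real dmesg output varies by kernel version and distribution.

```python
import re

# Fabricated sample lines standing in for dmesg / /var/log/messages output.
sample = """\
java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Out of memory: Kill process 13789 (java) score 521 or sacrifice child
abrtd: New client connected
"""

# Patterns typically present when the kernel OOM killer terminates a process.
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)

hits = [line for line in sample.splitlines() if OOM_PATTERN.search(line)]
print(hits)  # the two kernel OOM lines match; the abrtd line does not
```

In practice the equivalent one-liner is a case-insensitive grep over dmesg output and /var/log/messages on the executor node.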
Attached is a .tar.gz file with the files you requested.
There was nothing interesting in dmesg, but I did find interesting information in /var/log/messages:
...
May 28 22:01:51 anders-41 abrt: detected unhandled Python exception
May 29 08:58:44 anders-41 abrt: detected unhandled Python exception
May 29 10:00:38 anders-41 abrt: detected unhandled Python exception in '/var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/kmeans.py'
May 29 10:00:38 anders-41 abrtd: New client connected
May 29 10:00:38 anders-41 abrtd: Directory 'pyhook-2015-05-29-10:00:38-3433' creation detected
May 29 10:00:38 anders-41 abrt-server[13789]: Saved Python crash dump of pid 3433 to /var/spool/abrt/pyhook-2015-05-29-10:00:38-3433
May 29 10:00:38 anders-41 abrtd: Executable '/var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/kmeans.py' doesn't belong to any package and ProcessUnpackaged is set to 'no'
May 29 10:00:38 anders-41 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2015-05-29-10:00:38-3433' exited with 1
May 29 10:00:38 anders-41 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2015-05-29-10:00:38-3433'
May 29 10:00:41 anders-41 abrt: detected unhandled Python exception
May 29 10:20:33 anders-41 abrt: detected unhandled Python exception
...
Should I be concerned about the warning I highlighted below? I see this for most, if not all, of the HiBench jobs:
pagerank with spark/python suffers a similar problem with lost executors, but finally fails with a different error:
start HadoopPreparePagerank bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Deleted /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Submit MapReduce Job: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop jar /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/autogen/target/autogen-4.0-SNAPSHOT-jar-with-dependencies.jar HiBench.DataGen -t pagerank -b /usr/lib/hadoop-hdfs/HiBench/Pagerank -n Input -m 6 -r 2 -p 500000 -pbalance -pbalance -o text
15/05/29 16:25:44 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/29 16:25:59 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/29 16:26:13 INFO HiBench.PagerankData: Closing pagerank data generator...
finish HadoopPreparePagerank bench
Run pagerank/spark/python
Exec script: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/spark/python/bin/run.sh
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/00-default-properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/10-data-scale-profile.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/conf/99-user_defined_properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/00-pagerank-default.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/10-pagerank-userdefine.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/conf/properties.conf
Parsing conf: /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/pagerank/spark/python/python.conf
start PythonSparkPagerank bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/pagerank/spark/python/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --jars /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/pagerank/spark/python/conf/sparkbench/spark.conf --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/pagerank.py /usr/lib/hadoop-hdfs/HiBench/Pagerank/Input/edges /usr/lib/hadoop-hdfs/HiBench/Pagerank/Output 3
15/05/29 16:29:10 ERROR YarnScheduler: Lost executor 1 on anders-44: remote Akka client disassociated
15/05/29 16:29:10 WARN TaskSetManager: Lost task 8.0 in stage 2.0 (TID 20, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 14, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 22, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 16, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:10 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 12, anders-44): ExecutorLostFailure (executor 1 lost)
15/05/29 16:29:12 ERROR YarnScheduler: Lost executor 2 on anders-41: remote Akka client disassociated
15/05/29 16:29:12 WARN TaskSetManager: Lost task 5.0 in stage 2.0 (TID 17, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 13, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 4.1 in stage 2.0 (TID 25, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 15, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 24, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 0.0 in stage 0.1 (TID 26, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 2.0 in stage 0.1 (TID 28, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:12 WARN TaskSetManager: Lost task 1.0 in stage 0.1 (TID 27, anders-41): ExecutorLostFailure (executor 2 lost)
15/05/29 16:29:14 WARN TaskSetManager: Lost task 4.2 in stage 2.0 (TID 34, anders-44): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
Here's the tail of the log file:
15/05/29 16:29:14 INFO BlockManagerMasterActor: Registering block manager anders-44:32963 with 3.1 GB RAM, BlockManagerId(3, anders-44, 32963)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on anders-44:32963 (size: 11.6 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on anders-44:32963 (size: 8.1 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on anders-44:32963 (size: 64.2 KB, free: 3.1 GB)
15/05/29 16:29:14 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 4 to sparkExecutor@anders-44:42351
15/05/29 16:29:14 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 4 is 86 bytes
15/05/29 16:29:14 INFO TaskSetManager: Starting task 9.1 in stage 2.0 (TID 37, anders-44, PROCESS_LOCAL, 1255 bytes)
15/05/29 16:29:14 WARN TaskSetManager: Lost task 4.2 in stage 2.0 (TID 34, anders-44): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:177)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
join with spark/python fails differently:
start PythonSparkJoin bench
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Join/Input
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Join/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Join/Output
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --jars /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/sparkbench/spark.conf --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/src/main/python/python_spark_sql_bench.py PythonJoin /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/join/spark/python/conf/../rankings_uservisits_join.hive
15/05/29 16:25:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
15/05/29 16:25:19 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/05/29 16:25:30 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
finish PythonSparkJoin bench
Sorry, could you put the files somewhere and post a URL here? I can't see your attachments.
Besides, according to your /var/log/messages, the "May 28 22:01:51 anders-41 abrt: detected unhandled Python exception" entries are generated by ABRT (the Automatic Bug Reporting Tool), which hooks into Python by overriding sys.excepthook.
I'm not sure about ABRT, but it seems it intercepted Python's exception handling and failed to produce a usable report. My suggestion: uninstall ABRT; the Python exceptions should then be printed to the console or logs, which should give you a clue for tracing.
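For anyone unfamiliar with the mechanism described here: Python lets a process-wide crash reporter replace sys.excepthook, which the interpreter calls for exceptions that reach the top level uncaught. A minimal sketch of the idea (the hook body is ours, not ABRT's actual code):

```python
import sys

reports = []

def reporting_hook(exc_type, exc, tb):
    # Stand-in for what a crash reporter like ABRT installs: record the
    # exception instead of printing the default traceback.
    reports.append(f"{exc_type.__name__}: {exc}")

sys.excepthook = reporting_hook

try:
    raise ValueError("boom")
except ValueError:
    # Simulate an uncaught exception reaching the interpreter's top level,
    # where sys.excepthook would be invoked automatically.
    sys.excepthook(*sys.exc_info())

print(reports)  # ['ValueError: boom']
```

This is why, with ABRT installed and misbehaving, the traceback never appears in the console or logs: the hook swallows it before the default handler can print it.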
Qi Lv --
I was able to determine that I was encountering an OOM problem.
I changed hibench.scale.profile to tiny, and now run-all.sh runs to completion.
However, it looks like there is still a problem.
I get this from aggregation with spark/scala:
start ScalaSparkAggregation bench
hdfs rm -r: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
Deleted /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
Export env: SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/sparkbench.conf
Submit Spark job: /usr/lib/spark/bin/spark-submit --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/spark.conf --class com.intel.sparkbench.sql.ScalaSparkSQLBench --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar ScalaAggregation /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/../uservisits_aggre.hive
15/06/05 16:17:47 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0
15/06/05 16:17:48 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/06/05 16:17:54 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hdfs du -s: /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -du -s /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
finish ScalaSparkAggregation bench
I saw these three commands in the output, so I ran them separately:
$ /var/lib/hadoop-hdfs/HiBench/HiBench-master/workloads/aggregation/prepare/prepare.sh
$ /usr/lib/hadoop/bin/hadoop --config /usr/lib/hadoop/etc/hadoop fs -rm -r -skipTrash /usr/lib/hadoop-hdfs/HiBench/Aggregation/Output
$ SPARKBENCH_PROPERTIES_FILES=/var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/sparkbench.conf /usr/lib/spark/bin/spark-submit --properties-file /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/sparkbench/spark.conf --class com.intel.sparkbench.sql.ScalaSparkSQLBench --master yarn-client --num-executors 3 --executor-cores 8 --executor-memory 4G --driver-memory 4G /var/lib/hadoop-hdfs/HiBench/HiBench-master/src/sparkbench/target/sparkbench-4.0-SNAPSHOT-MR2-spark1.3-jar-with-dependencies.jar ScalaAggregation /var/lib/hadoop-hdfs/HiBench/HiBench-master/report/aggregation/spark/scala/conf/../uservisits_aggre.hive
It seemed to progress well until this point, at which it complained that the uservisits table already exists:
15/06/08 10:38:20 INFO HiveMetaStore: 0: create_table: Table(tableName:uservisits, dbName:DEFAULT, owner:hdfs, createTime:1433774299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:sourceip, type:string, comment:null), FieldSchema(name:desturl, type:string, comment:null), FieldSchema(name:visitdate, type:string, comment:null), FieldSchema(name:adrevenue, type:double, comment:null), FieldSchema(name:useragent, type:string, comment:null), FieldSchema(name:countrycode, type:string, comment:null), FieldSchema(name:languagecode, type:string, comment:null), FieldSchema(name:searchword, type:string, comment:null), FieldSchema(name:duration, type:int, comment:null)], location:/usr/lib/hadoop-hdfs/HiBench/Aggregation/Input/uservisits, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)
15/06/08 10:38:20 INFO audit: ugi=hdfs ip=unknown-ip-addr cmd=create_table: Table(tableName:uservisits, dbName:DEFAULT, owner:hdfs, createTime:1433774299, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:sourceip, type:string, comment:null), FieldSchema(name:desturl, type:string, comment:null), FieldSchema(name:visitdate, type:string, comment:null), FieldSchema(name:adrevenue, type:double, comment:null), FieldSchema(name:useragent, type:string, comment:null), FieldSchema(name:countrycode, type:string, comment:null), FieldSchema(name:languagecode, type:string, comment:null), FieldSchema(name:searchword, type:string, comment:null), FieldSchema(name:duration, type:int, comment:null)], location:/usr/lib/hadoop-hdfs/HiBench/Aggregation/Input/uservisits, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{EXTERNAL=TRUE}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null), temporary:false)
15/06/08 10:38:20 ERROR RetryingHMSHandler: AlreadyExistsException(message:Table uservisits already exists)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1349)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1426)
The trace continues for quite a while, ending with this:
15/06/08 10:38:20 ERROR Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)
15/06/08 10:38:20 INFO PerfLogger: </PERFLOG method=Driver.execute start=1433774299954 end=1433774300289 duration=335 from=org.apache.hadoop.hive.ql.Driver>
15/06/08 10:38:20 INFO PerfLogger:
15/06/08 10:38:20 INFO PerfLogger: </PERFLOG method=releaseLocks start=1433774300289 end=1433774300289 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/06/08 10:38:20 ERROR HiveContext:
HIVE FAILURE OUTPUT
SET spark.sql.shuffle.partitions=6
OK
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
SET mapreduce.job.maps=6
SET mapreduce.job.reduces=2
SET hive.stats.autogather=false
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)
END HIVE FAILURE OUTPUT
Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. AlreadyExistsException(message:Table uservisits already exists)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:312)
Is that you, C-Newman?
@C-Newman yes, it will complain with AlreadyExistsException if you execute the command line manually right after running the workload via HiBench's script. You'll need to remove the metastore_db folder in your current working directory before trying the command manually; that folder is created by the previous SparkSQL-related operation and is normally cleaned up by HiBench's workload script.
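The cleanup amounts to deleting the leftover Derby metastore directory before the manual re-run. A hedged sketch (the helper name is ours; the location is assumed to be wherever you launched spark-submit from):

```python
import pathlib
import shutil

def clean_metastore(workdir="."):
    """Remove a stale metastore_db directory left behind by a previous
    SparkSQL run, if one exists in the given working directory."""
    target = pathlib.Path(workdir) / "metastore_db"
    if target.is_dir():
        shutil.rmtree(target)
        return True   # removed a stale metastore
    return False      # nothing to clean

print(clean_metastore())
```

The shell equivalent is simply removing metastore_db from the directory you run the command in.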
mckellyln, yes, it's me. I sent a reply to the email address I last had for you, then checked its domain name, and it appears to have lapsed.
@C-Newman, so good to hear from you! I will send you an email offline.