linkedin / dr-elephant
Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
License: Apache License 2.0
As we start our journey down the road to Tez, it would be good if Dr. Elephant could help us with those jobs, too. This is purely a feature request.
Dr. Elephant is not able to fetch the Spark history logs in a YARN HA cluster by setting namenode_addresses. Below are the configs:
<params>
  <event_log_size_limit_in_mb>100</event_log_size_limit_in_mb>
  <event_log_dir>/user/spark/jobhistory</event_log_dir>
  <spark_log_ext>_1</spark_log_ext>
  <!-- The values specified in namenode_addresses will be used for obtaining Spark logs. The cluster configuration will be ignored. -->
  <namenode_addresses>hahdfs1.hostname:50070, hahdfs2.hostname:50070</namenode_addresses>
</params>
But it works with webhdfs if I point directly at the current active namenode. Below are the configs:
<params>
  <event_log_size_limit_in_mb>100</event_log_size_limit_in_mb>
  <event_log_dir>webhdfs://hahdfs1.hostname.net:50070/user/spark/jobhistory</event_log_dir>
  <event_log_dir>/user/spark/jobhistory</event_log_dir>
  <spark_log_ext>_1</spark_log_ext>
</params>
Error logs:
08-01-2016 21:45:13 ERROR [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /user/spark/jobhistory/application_1460147926973_0091_1
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1636)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:231)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:181)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /user/spark/jobhistory/application_1460147926973_0091_1
at sun.reflect.GeneratedConstructorAccessor25.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:385)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:91)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:656)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:622)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:458)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:487)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:483)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:838)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:853)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:324)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:242)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
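Until namenode_addresses is HA-aware, one workaround sketch (not Dr. Elephant's actual behavior; class and method names here are hypothetical) is to probe each configured address and use the first one that answers as active:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class ActiveNameNode {
    // Given the comma-separated namenode_addresses, return the first address
    // the probe reports as active. The probe is injected so it can be a real
    // HTTP check in production and a stub in tests.
    public static Optional<String> pickActive(List<String> addresses,
                                              Predicate<String> isActive) {
        for (String addr : addresses) {
            String trimmed = addr.trim();
            if (isActive.test(trimmed)) {
                return Optional.of(trimmed);
            }
        }
        return Optional.empty();
    }
}
```

In production the predicate could issue a GET against `http://<addr>/webhdfs/v1/?op=GETFILESTATUS` and treat an HTTP 200 as "active", since a standby namenode rejects WebHDFS read operations.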
The current project structure:
dr-elephant
├── app
├── app-conf
├── conf
├── project
...
what about this structure:
dr-elephant
├── dr-elephant-main
│   ├── app
│   └── conf
└── dr-elephant-dist
    ├── scripts
    └── app-conf
The new structure may help newcomers understand the project quickly and smoothly.
This is the pid:
[root@h0045150 ~]# ps -ef | grep elephant
root 1289 1 1 Apr26 ? 00:16:36 /usr/server/jdk/bin/java -Xms1024m -Xmx1024m -XX:MaxPermSize=256m -XX:ReservedCodeCacheSize=128m -Duser.dir=/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT -Devolutionplugin=enabled -DapplyEvolutions.default=true -Djava.library.path=/usr/server/hadoop/lib/native -Dhttp.port=8079 -Ddb.default.url=jdbc:mysql://localhost/drelephant?characterEncoding=UTF-8 -Ddb.default.user=root -Ddb.default.password=admin123 -cp /root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/dr-elephant.dr-elephant-2.0.3-SNAPSHOT.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-java-jdbc_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-jdbc_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.sbt-link-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.javassist.javassist-3.18.0-GA.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-exceptions-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.templates_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.github.scala-incubator.io.scala-io-file_2.10-0.4.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.github.scala-incubator.io.scala-io-core_2.10-0.4.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.jsuereth.scala-arm_2.10-1.3.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-iteratees_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.scala-stm.scala-stm_2.10-0.7.jar:/root/dr-elep
hant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-json_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-functional_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.play.play-datacommons_2.10-2.2.6.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/joda-time.joda-time-2.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.joda.joda-convert-1.3.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.typesafe.netty.netty-http-pipelining-1.1.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.slf4j.slf4j-api-1.7.5.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/ch.qos.logback.logback-core-1.0.13.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/ch.qos.logback.logback-classic-1.0.13.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.apache.commons.commons-lang3-3.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/com.ning.async-http-client-1.7.18.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/oauth.signpost.signpost-core-1.2.1.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/commons-codec.commons-codec-1.3.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/oauth.signpost.signpost-commonshttp4-1.2.1.2.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.apache.httpcomponents.httpcore-4.0.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/org.apache.httpcomponents.httpclient-4.0.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/commons-logging.commons-logging-1.1.1.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/
lib/xerces.xercesImpl-2.11.0.jar:/root/dr-elephant/dr-elephant-master/dist/dr-elephant-2.0.3-SNAPSHOT/lib/xml-apis.xml-apis-1.4
But when I access 'http://ip:port' from a browser, I get this exception:
2016-04-27 08:58:41,717 - [ERROR] - from play.nettyException in New I/O worker #7
Exception caught in Netty
java.lang.NoClassDefFoundError: Could not initialize class play.api.libs.concurrent.Execution$
at play.core.server.netty.PlayDefaultUpstreamHandler.handleAction$1(PlayDefaultUpstreamHandler.scala:201) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:174) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at com.typesafe.netty.http.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:62) ~[com.typesafe.netty.netty-http-pipelining-1.1.2.jar:na]
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[io.netty.netty-3.8.0.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
Relevant segment of my FetcherConf.xml:
<fetcher>
<applicationtype>spark</applicationtype>
<classname>org.apache.spark.deploy.history.SparkFSFetcher</classname>
<params>
<event_log_dir>hdfs:///var/log/spark/apps</event_log_dir>
</params>
</fetcher>
And now the error:
05-27-2016 21:25:51 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.io.FileNotFoundException: File does not exist: /var/log/spark/apps/application_1464108366156_0167_1.snappy
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:189)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:55)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /var/log/spark/apps/application_1464108366156_0167_1.snappy
at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:390)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$600(WebHdfsFileSystem.java:90)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:661)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:627)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:463)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:492)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:488)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:843)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:858)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$shouldThrottle(SparkFSFetcher.scala:323)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:241)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:189)
... 13 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /var/log/spark/apps/application_1464108366156_0167_1.snappy
at org.apache.hadoop.hdfs.web.JsonUtil.toRemoteException(JsonUtil.java:112)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:358)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$200(WebHdfsFileSystem.java:90)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:613)
... 24 more
Having an issue figuring out this error, not sure where to look.
sudo $DR_RELEASE/bin/start.sh $ELEPHANT_CONF_DIR
Using config dir: /home/kfedotov/dr-elephant-master/app-conf
Using config file: /home/kfedotov/dr-elephant-master/app-conf/elephant.conf
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8080
This is hadoop2.x grid. Add Java library path: /lib/native
Starting Dr. Elephant ....
Dr. Elephant started.
Checking the log:
tail -f logs/application.log
13 16:56:18,412 - [INFO] - from play in main
database [default] connected at jdbc:mysql://localhost/drelephant?characterEncoding=UTF-8
2016-04-13 16:56:19,588 - [ERROR] - from play in main
I have play version 2.2.1
play 2.2.1 built with Scala 2.10.2 (running Java 1.8.0_77), http://www.playframework.com
Thanks
When I configure Dr. Elephant with Spark, why can't I see the Spark jobs?
In com.linkedin.drelephant.util.Utils#parseJavaOptions, we assume that Java options are in the format "-Dfoo=bar -Dfoo2=bar ...". So Dr. Elephant fails to parse options like "-Dcom.sun.management.jmxremote" or "-XX:PermSize=64m", and raises an IllegalArgumentException in the ERROR log.
Should these options be ignored? Or are they valuable for analysis?
Here are two related error logs:
04-19-2016 19:25:36 ERROR com.linkedin.drelephant.util.InfoExtractor : Encountered error while parsing java options into urls: Cannot parse java option string [-Djava.util.logging.config.file=jmx.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=0 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -XX:PermSize=64m -XX:MaxPermSize=256m]. The part [-Dcom.sun.management.jmxremote] does not contain a =.
04-20-2016 00:17:27 ERROR com.linkedin.drelephant.util.InfoExtractor : Encountered error while parsing java options into urls: Cannot parse java option string [-Djava.util.logging.config.file=jmx.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=0 -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.library.path=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -XX:PermSize=64m -XX:MaxPermSize=256m]. Some options does not begin with -D prefix.
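A lenient variant is straightforward to sketch (parseLenient is a hypothetical name, not the project's API): keep well-formed -Dname=value pairs and skip everything else rather than throwing.

```java
import java.util.HashMap;
import java.util.Map;

public class JavaOptionsParser {
    // Collect only well-formed -Dname=value pairs; silently skip valueless
    // -D flags (e.g. -Dcom.sun.management.jmxremote) and non -D options
    // (e.g. -XX:PermSize=64m) instead of raising IllegalArgumentException.
    public static Map<String, String> parseLenient(String options) {
        Map<String, String> props = new HashMap<>();
        for (String token : options.trim().split("\\s+")) {
            if (!token.startsWith("-D")) {
                continue;               // -XX:, -verbose:gc, ...: ignore
            }
            int eq = token.indexOf('=');
            if (eq < 0) {
                continue;               // marker property without a value
            }
            props.put(token.substring(2, eq), token.substring(eq + 1));
        }
        return props;
    }
}
```

Both options quoted in the error logs above would then be skipped quietly instead of aborting the parse.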
Hi,
I am very new to GitHub. I downloaded the zip file "dr-elephant-master.zip" and am trying to install it in my YARN cluster. Do I need to install Dr. Elephant on the RM node, or can I install it on a spare node that is part of our Hadoop cluster?
Secondly, when I try to compile, it complains with the errors below; any help is appreciated.
./compile.sh
Using the default configuration
Hadoop Version : 2.3.0
Spark Version : 1.4.0
Other opts set :
./compile.sh: line 27: play: command not found
./compile.sh: line 94: cd: target/universal: No such file or directory
inflating: dr-elephant-master/.gitignore
inflating: dr-elephant-master/LICENSE
inflating: dr-elephant-master/NOTICE
..
..
inflating: dr-elephant-master/test/rest/RestAPITest.java
chmod: cannot access `dr-elephant-master/bin/dr-elephant': No such file or directory
sed: can't read dr-elephant-master/bin/dr-elephant: No such file or directory
cp: cannot create regular file `dr-elephant-master/bin/': Is a directory
cp: cannot create regular file `dr-elephant-master/bin/': Is a directory
adding: dr-elephant-master/ (stored 0%)
..
..
adding: dr-elephant-master/conf/evolutions/default/1.sql (deflated 72%)
adding: dr-elephant-master/conf/log4j.properties (deflated 45%)
adding: dr-elephant-master/conf/routes (deflated 64%)
While building the latest master (1bd8f98) I've found that the example compile.conf file in the Developer Guide is misleading/incorrect. Instead of:
hadoop_version = 2.3.0 // The Hadoop version to compile with
spark_version = 1.4.0 // The Spark version to compile with
play_opts="-Dsbt.repository.config=app-conf/resolver.conf" // Other play/sbt options
it should be:
hadoop_version=2.3.0 # The Hadoop version to compile with
spark_version=1.4.0 # The Spark version to compile with
play_opts="-Dsbt.repository.config=app-conf/resolver.conf" # Other play/sbt options
Otherwise, reading the incorrect configuration fails silently, like so:
Reading from config file...
hadoop_version=2.6.0 # The Hadoop version to compile with
compile.conf: line 1: hadoop_version: command not found
compile.conf: line 2: spark_version: command not found
and keeps working with the defaults:
Hadoop Version : 2.3.0
Spark Version : 1.4.0
Other opts set :
+ trap exit SIGINT SIGTERM
+++ dirname ./compile.sh
++ cd .
++ pwd
+ project_root=/Users/ljank/sandbox/dr-elephant
+ cd /Users/ljank/sandbox/dr-elephant
+ start_script=/Users/ljank/sandbox/dr-elephant/scripts/start.sh
+ stop_script=/Users/ljank/sandbox/dr-elephant/scripts/stop.sh
+ rm -rf /Users/ljank/sandbox/dr-elephant/dist
+ mkdir dist
+ play_command -Dhadoopversion=2.3.0 -Dsparkversion=1.4.0 clean test compile dist
+ type activator
+ play -Dhadoopversion=2.3.0 -Dsparkversion=1.4.0 clean test compile dist
SparkFSFetcher is a plugin in this project. How could we make sure to get the variable "spark.eventLog.dir" by using "new SparkConf()"?
I find that "new SparkConf()" uses "java.lang.System.getProperties()" to get the Spark config items.
But environments differ in thousands of ways; I can't get this variable on a CDH cluster without spark-submit. Do you have any idea?
I wonder if start.sh could do something like what the spark-submit script does.
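Since new SparkConf() only sees JVM system properties, one sketch (SparkDefaults is a hypothetical helper, not part of the project) is to read spark-defaults.conf directly instead of relying on spark-submit to export its entries:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class SparkDefaults {
    // spark-defaults.conf separates each key from its value with whitespace,
    // a format java.util.Properties already understands, so the file can be
    // loaded directly rather than waiting for spark-submit to turn its
    // entries into system properties.
    public static String eventLogDir(Reader confFile) {
        Properties props = new Properties();
        try {
            props.load(confFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("spark.eventLog.dir");
    }
}
```

On a CDH node the file typically lives under /etc/spark/conf/spark-defaults.conf (an assumption; the path varies by distribution), so start.sh could pass that location in rather than replicating everything spark-submit does.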
Can someone explain to me what the AggregatedMetrics
branch is for? Why isn't stuff just being merged into master?
I got this error while compiling:
[warn] /home/test/dr-elephant/app/org/apache/spark/deploy/history/SparkDataCollection.scala:300: abstract type pattern T is unchecked since it is eliminated by erasure
[warn]   seq.foreach { case (item: T) => list.add(item)}
[error] /home/test/dr-elephant/app/org/apache/spark/deploy/history/SparkFSFetcher.scala:260: too many arguments for method replay: (logData: java.io.InputStream, sourceName: String)Unit
[error]   replayBus.replay(logInput, logPath.toString(), false)
[warn] one warning found
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 21 s, completed 27 May, 2016 4:05:18 PM
Any idea?
When deployed on a production cluster, the analysis speed is slower than the rate at which jobs finish! Our cluster runs around 20 thousand jobs every day, but Dr. Elephant can only analyze around 14 thousand of them.
I have increased the consumer thread number to 30 or larger; it didn't help. CPU, memory and network usage were still low. I found that the bottleneck is the job history server. If Dr. Elephant could fetch data directly from HDFS, the analysis rate might increase.
It's been close to 3 months since the open sourcing of Dr. Elephant, and we have received a tremendous response from the community; many companies have already started adopting Dr. Elephant. At LinkedIn, Dr. Elephant has been running successfully for more than 2 years, analyzing over a hundred thousand jobs every day.
As Dr. Elephant continues to grow, it will be good to track which companies are using Dr. Elephant. This will encourage others to try Dr. Elephant and help build the community.
Please send an email to me, or reply to this issue, with the GitHub handles of the active members working on Dr. Elephant, your company name and a link to a pull request. I'll add this to the README file of Dr. Elephant.
@stiga-huang @krishnap @paulbramsen @tglstory @ljank @plypaul @hongbozeng @liyintang @brandtg @timyitong @chetnachaudhari @cjuexuan @rsprabery @miloveme @anspuli @aNutForAJarOfTuna
Thanks and cheers to everyone.
Akshay Rai
While fetching Spark event logs from HDFS, Dr. Elephant just gets the namenode address from dfs.namenode.http-address at startup. This property may be empty when using HDFS HA.
Anyone have time to add this feature?
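As a starting point, the HA addresses can be derived from the same keys hdfs-site.xml uses. A sketch against a plain Map so it stands alone (in the real feature this would read org.apache.hadoop.conf.Configuration; single nameservice only, for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HaHttpAddresses {
    // Mirrors how hdfs-site.xml models HA: dfs.nameservices names the
    // logical service, dfs.ha.namenodes.<ns> lists the namenode ids, and
    // dfs.namenode.http-address.<ns>.<id> carries each HTTP endpoint.
    public static List<String> httpAddresses(Map<String, String> conf) {
        List<String> result = new ArrayList<>();
        String ns = conf.get("dfs.nameservices");
        if (ns == null) {                       // non-HA: single address key
            String single = conf.get("dfs.namenode.http-address");
            if (single != null) result.add(single);
            return result;
        }
        String ids = conf.getOrDefault("dfs.ha.namenodes." + ns, "");
        for (String id : ids.split(",")) {
            String addr = conf.get("dfs.namenode.http-address." + ns + "." + id.trim());
            if (addr != null) result.add(addr);
        }
        return result;
    }
}
```

Dr. Elephant could then try each returned address rather than depending on the (possibly empty) plain dfs.namenode.http-address key.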
Hi all,
I've encountered some issues like the ones below. Can anyone help with this?
ENV: HDP 2.4.2
BTW, I compiled successfully following the steps below:
play.api.Application$$anon$1: Execution exception[[RuntimeException: Could not find class com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2]]
at play.api.Application$class.handleError(Application.scala:293) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.api.DefaultApplication.handleError(Application.scala:399) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.Option.map(Option.scala:145) [org.scala-lang.scala-library-2.10.5.jar:na]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:257) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:344) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$recoverWith$1.apply(Future.scala:343) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.5.jar:na]
at play.api.libs.iteratee.Execution$$anon$1.execute(Execution.scala:43) [com.typesafe.play.play-iteratees_2.10-2.2.2.jar:2.2.2]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1361) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [org.scala-lang.scala-library-2.10.5.jar:na]
Caused by: java.lang.RuntimeException: Could not find class com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:173) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:103) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.&lt;init&gt;(ElephantContext.java:98) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.instance(ElephantContext.java:91) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.searchPage$.apply(searchPage.template.scala:89) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.searchPage$.render(searchPage.template.scala:152) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.searchPage.render(searchPage.template.scala) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at controllers.Application.search(Application.java:273) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:133) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:na]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:133) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:na]
at play.core.Router$HandlerInvoker$$anon$7$$anon$2.invocation(Router.scala:183) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.Router$Routes$$anon$1.invocation(Router.scala:377) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:56) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.GlobalSettings$1.call(GlobalSettings.java:64) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:91) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:90) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251) ~[org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) ~[org.scala-lang.scala-library-2.10.5.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.5.jar:na]
at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:37) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
... 4 common frames omitted
Caused by: java.lang.ClassNotFoundException: com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366) ~[na:1.7.0_67]
at java.net.URLClassLoader$1.run(URLClassLoader.java:355) ~[na:1.7.0_67]
at java.security.AccessController.doPrivileged(Native Method) ~[na:1.7.0_67]
at java.net.URLClassLoader.findClass(URLClassLoader.java:354) ~[na:1.7.0_67]
at java.lang.ClassLoader.loadClass(ClassLoader.java:425) ~[na:1.7.0_67]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) ~[na:1.7.0_67]
at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ~[na:1.7.0_67]
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:159) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
... 27 common frames omitted
Hi,
My goal is to setup a development environment on my Mac, so that I can add some additional functionality for our needs.
Please let me know the best approach to do that.
I want to do something like what we do with a typical Play project, where I have my project running (play run), I make edits via Eclipse (after doing play eclipse), and the changes are reflected automatically.
If I go with the compile.sh option, I think it's a lot of work, as I have to unzip the created distribution again to execute the code.
I think the issue is originating from Ebean, and the issue may be the way the configuration files have to be passed.
Please let me know the way to set up a dev environment with Dr. Elephant.
When I try to run it as a Play project with the command below,
play -Dconfig.resource=/Users/dr/dr-elephant/app-conf "~run 9000"
I get the following error when I try to navigate to a page like "host:8888/search"
[error] play - Cannot invoke the action, eventually got an error: java.lang.RuntimeException: DataSource user is null?
[error] application -
! @702e5capa - Internal server error, for (GET) [/] ->
play.api.Application$$anon$1: Execution exception[[RuntimeException: DataSource user is null?]]
at play.api.Application$class.handleError(Application.scala:293) ~[play_2.10.jar:2.2.2]
at play.api.DefaultApplication.handleError(Application.scala:399) [play_2.10.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.2]
at scala.Option.map(Option.scala:145) [scala-library-2.10.4.jar:na]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [play_2.10.jar:2.2.2]
Caused by: java.lang.RuntimeException: DataSource user is null?
at com.avaje.ebeaninternal.server.lib.sql.DataSourcePool.(DataSourcePool.java:189) ~[avaje-ebeanorm.jar:na]
at com.avaje.ebeaninternal.server.core.DefaultServerFactory.getDataSourceFromConfig(DefaultServerFactory.java:420) ~[avaje-ebeanorm.jar:na]
at com.avaje.ebeaninternal.server.core.DefaultServerFactory.setDataSource(DefaultServerFactory.java:380) ~[avaje-ebeanorm.jar:na]
at com.avaje.ebeaninternal.server.core.DefaultServerFactory.createServer(DefaultServerFactory.java:163) ~[avaje-ebeanorm.jar:na]
at com.avaje.ebeaninternal.server.core.DefaultServerFactory.createServer(DefaultServerFactory.java:125) ~[avaje-ebeanorm.jar:na]
at com.avaje.ebeaninternal.server.core.DefaultServerFactory.createServer(DefaultServerFactory.java:65) ~[avaje-ebeanorm.jar:na]
I have tried copying the conf folder contents into app-conf and setting the following properties in both elephant.conf and application.conf:
db.default.driver=com.mysql.jdbc.Driver
db_url=localhost
db.default.url="jdbc:mysql://localhost:3306/drelephant"
db.default.user=root
db.default.password=""
db.default.host=localhost
datasource.db.username=root
datasource.db.password=""
datasource.db.databaseUrl="jdbc:mysql://localhost:3306/drelephant"
datasource.db.databaseDriver=com.mysql.jdbc.Driver
db_name=drelephant
db_user=root
db_password=""
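For reference, as far as I can tell the Play 2.2 Ebean plugin only reads the db.default.* keys, and the datasource.db.* and db.default.host entries above appear to be ignored. A minimal application.conf fragment would look like the following (a sketch, not a confirmed fix; credentials and host are illustrative):

```conf
db.default.driver=com.mysql.jdbc.Driver
db.default.url="jdbc:mysql://localhost:3306/drelephant"
db.default.user=root
db.default.password=""
```

Note that the db_url/db_name/db_user/db_password keys shown by start.sh belong to elephant.conf and are read by the startup script, not by Play directly, so both files may need to agree when running via `play run`.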
Right now we do not have any metrics emitted in Dr. E to monitor the queue size and alert on it. We may also need metrics on the time taken to process jobs, and so on.
Hi,
I'm trying to run with Hadoop 2.6 and Spark 1.4, with a non-Snappy codec for the Spark history server. It seems that whenever the application log path is not a folder, the fetcher assumes the log is Snappy-compressed.
Please see a small pull request reflecting a possible change here:
#47
Thanks
Hi:
I'm using Cloudera Hadoop; the Hadoop version is 2.6.0-cdh5.5.0 and the Spark version is 1.5.0-cdh5.5.0.
With this Spark version, must the event log be written with "spark.eventLog.compress=true"?
Does this mean that the uncompressed JSON format cannot be parsed?
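For context, the Spark-side setting in question lives in spark-defaults.conf. A sketch of the relevant entries (paths illustrative, not taken from this cluster):

```conf
# spark-defaults.conf (illustrative)
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///data/spark
# When true, event logs are written with the codec configured via
# spark.io.compression.codec (snappy by default on Spark 1.5).
spark.eventLog.compress  true
```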
05-25-2016 20:01:24 INFO org.apache.spark.deploy.history.SparkFSFetcher$ : Looking for spark logs at logDir: webhdfs://0.0.0.0:50070/data/spark
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner :
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner :
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner : java.security.PrivilegedActionException: java.net.ConnectException: Connection refused
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1636)
at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:99)
at org.apache.spark.deploy.history.SparkFSFetcher.fetchData(SparkFSFetcher.scala:48)
at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:232)
at com.linkedin.drelephant.ElephantRunner$ExecutorThread.run(ElephantRunner.java:151)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:580)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:537)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:605)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:458)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:487)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:483)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:838)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:853)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1400)
at org.apache.spark.deploy.history.SparkFSFetcher.org$apache$spark$deploy$history$SparkFSFetcher$$isLegacyLogDirectory(SparkFSFetcher.scala:185)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:143)
at org.apache.spark.deploy.history.SparkFSFetcher$$anon$1.run(SparkFSFetcher.scala:99)
... 13 more
05-25-2016 20:01:24 ERROR com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1458803017484_64118] into the retry list.
[hdfs@dn01 bin]$ ./start.sh /home/hdfs/dr-elephant/app-conf
Using config dir: /home/hdfs/dr-elephant/app-conf
Using config file: /home/hdfs/dr-elephant/app-conf/elephant.conf
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8080
This is hadoop2.x grid. Add Java library path: /lib/native
Starting Dr. Elephant ....
Dr. Elephant started.
However, it did not start correctly, because db_url, db_name, and db_user are not set yet, so it cannot connect to MySQL.
So when I run stop.sh, it shows:
[hdfs@dn01 bin]$ ./stop.sh
Dr.Elephant is not running.
Hi All - I'm currently working at Paypal and we're having an issue deploying Dr. Elephant. I'd like to start a discussion around the best solution and show what we've done to improve the speed of processing job data.
Currently, DrE is unable to process a large backlog of jobs on a given cluster.
Related thread on the mailing list.
Increasing the number of threads being used to query and process responses from the job history server does not improve the speed of processing, and we can see an ever growing queue of jobs waiting to be processed by DrE.
The impact of increasing the number of executors to a large number (around 240) is shown below. While the queue size initially went down, we noticed that it went back up and hovered around 10k as more jobs were submitted.
$ grep queue dr_elephant.log |tail -10
05-25-2016 17:37:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9085
05-25-2016 17:38:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9081
05-25-2016 17:39:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9078
05-25-2016 17:40:14 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9071
05-25-2016 17:41:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9069
05-25-2016 17:42:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9038
05-25-2016 17:43:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9016
05-25-2016 17:44:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 9010
05-25-2016 17:45:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 8986
05-25-2016 17:46:15 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 8973
At that point we delved into JVisualVM to see where the majority of time was being spent:
The readJsonNode function handles both reading from the Job History Server and parsing the JSON.
The challenge to scalability becomes clear by looking at the network traffic. Each call to the job history server creates a separate TCP connection and gzip is not enabled by default.
Separate TCP Connections:
The number of TCP connections created for each MR job is four plus the number of tasks in that job.
Without Gzip for the request:
GET /ws/v1/history/mapreduce/jobs/job_1464719949755_0001/conf
With Gzip:
When you multiply this across many jobs, the benefits from gzip become quite large.
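To illustrate why gzip matters here: the JSON returned by the job history server is highly repetitive, so it compresses extremely well. On the client side one would typically set `connection.setRequestProperty("Accept-Encoding", "gzip")` on the `HttpURLConnection` and wrap the response stream in a `GZIPInputStream`. The self-contained sketch below (illustrative data, not Dr. Elephant code) round-trips a task-list-shaped payload and prints the size difference:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    public static void main(String[] args) throws IOException {
        // Build a payload shaped like a job history server task list.
        StringBuilder json = new StringBuilder("{\"tasks\":[");
        for (int i = 0; i < 1000; i++) {
            json.append("{\"id\":\"task_").append(i).append("\",\"state\":\"SUCCEEDED\"},");
        }
        json.append("]}");
        byte[] raw = json.toString().getBytes(StandardCharsets.UTF_8);

        // Compress it, as the server would with gzip Content-Encoding.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        byte[] compressed = bos.toByteArray();

        // Decompress and verify the round trip is lossless.
        ByteArrayOutputStream back = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) back.write(buf, 0, n);
        }

        System.out.println("raw bytes: " + raw.length);
        System.out.println("gzip bytes: " + compressed.length);
        System.out.println("lossless: " + java.util.Arrays.equals(raw, back.toByteArray()));
    }
}
```

On this kind of repetitive payload the compressed size is a small fraction of the raw size, which is where the per-job savings come from.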
Patch + visualvm output due to patch forthcoming.
I could see a pretty compelling use case for running Dr Elephant on top of AWS to analyze Spark/MR jobs running on EMR.
What I have in mind is, Dr Elephant would be installed in a stand-alone box, and periodically poll the AWS API to check for alive clusters via DescribeCluster, get the hostname for each of those, and automatically fetch jobs running on each cluster and analyze them. The idea is when you have lots of short-running EMR clusters, you can have 1 centralized location with all the results. Optionally, maybe integrate with AWS Data Pipeline to figure out workflows.
Right now the way we have it setup is to run Dr Elephant on each EMR cluster, but this is far from ideal because we lose the results once the cluster goes down unless we export it, and have to reinstall it on every new cluster. It still works because we can have it running in a long-standing staging environment and make sure things are green before they go to prod. But in order to identify trends over multiple days this breaks down.
I haven't dug into the code yet, but what do you think about this idea? I believe there is currently no way to do such a thing, but have you ever had this request or is it something you would be open to consider in Dr Elephant? Happy to help contributing to that once I start looking at the code if you think it would be valuable.
I've configured Dr. Elephant and started it. From application.log everything looks fine, with no errors or exceptions, but the UI shows no MR/Spark jobs. Am I missing a step or some configuration?
Thanks in advance.
We plan to add monitoring for Dr. Elephant and are evaluating the following tools.
To start with, we would like to measure the following.
It should be possible to plug a monitoring tool into Dr. Elephant based on configuration. Users should have the flexibility to use another tool if they wish. I have checked Jolokia, and it is easy to plug into the application based on config.
In order to monitor the stats of interest, each stat has to be exposed as an attribute of an MBean. The MBean can have several attributes, and the attribute values can be updated from a thread in the application.
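The MBean approach described above can be sketched as follows; the class, attribute names, and ObjectName are hypothetical, not Dr. Elephant's actual code:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class QueueMonitor {

    // Management interface: one attribute per stat of interest.
    public interface QueueStatsMBean {
        int getQueueSize();
    }

    // Simple implementation; an application thread would keep the value updated.
    public static class QueueStats implements QueueStatsMBean {
        private volatile int queueSize;
        public void setQueueSize(int size) { queueSize = size; }
        @Override public int getQueueSize() { return queueSize; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        QueueStats stats = new QueueStats();
        stats.setQueueSize(9085);
        ObjectName name = new ObjectName("com.linkedin.drelephant:type=QueueStats");
        // StandardMBean lets us name the management interface explicitly.
        server.registerMBean(new StandardMBean(stats, QueueStatsMBean.class), name);
        // Any JMX client (jconsole, Jolokia, etc.) can now read the attribute.
        Object size = server.getAttribute(name, "QueueSize");
        System.out.println("QueueSize=" + size);
    }
}
```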
We welcome feedback on this feature. Please suggest other monitoring tools that may be interesting or easier to use/configure. Similarly, other params to be monitored.
I deployed Dr. Elephant successfully.
What is the URL?
Is it http://ip:port/drelephant ?
Hi Guys,
When I click the 'Search' button, I get the errors below. How can I fix this?
Caused by: java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:130) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:90) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.ElephantContext.(ElephantContext.java:86) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.ElephantContext.instance(ElephantContext.java:79) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at views.html.page.searchPage$.apply(searchPage.template.scala:85) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at views.html.page.searchPage$.render(searchPage.template.scala:148) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at views.html.page.searchPage.render(searchPage.template.scala) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at controllers.Application.search(Application.java:206) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:113) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:na]
at Routes$$anonfun$routes$1$$anonfun$applyOrElse$3$$anonfun$apply$3.apply(routes_routing.scala:113) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:na]
at play.core.Router$HandlerInvoker$$anon$7$$anon$2.invocation(Router.scala:183) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.Router$Routes$$anon$1.invocation(Router.scala:377) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.JavaAction$$anon$1.call(JavaAction.scala:56) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.GlobalSettings$1.call(GlobalSettings.java:64) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:91) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.JavaAction$$anon$3.apply(JavaAction.scala:90) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.j.FPromiseHelper$$anonfun$flatMap$1.apply(FPromiseHelper.scala:82) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251) ~[org.scala-lang.scala-library-2.10.4.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) ~[org.scala-lang.scala-library-2.10.4.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) [org.scala-lang.scala-library-2.10.4.jar:na]
at play.core.j.HttpExecutionContext$$anon$2.run(HttpExecutionContext.scala:37) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) ~[com.typesafe.akka.akka-actor_2.10-2.2.0.jar:2.2.0]
... 4 common frames omitted
Caused by: java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.7.0_75]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) ~[na:1.7.0_75]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.7.0_75]
at java.lang.reflect.Constructor.newInstance(Constructor.java:526) ~[na:1.7.0_75]
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:109) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
... 27 common frames omitted
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.7.0_75]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) ~[na:1.7.0_75]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:198) ~[na:1.7.0_75]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) ~[na:1.7.0_75]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.7.0_75]
at java.net.Socket.connect(Socket.java:579) ~[na:1.7.0_75]
at java.net.Socket.connect(Socket.java:528) ~[na:1.7.0_75]
at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.(HttpClient.java:211) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.New(HttpClient.java:308) ~[na:1.7.0_75]
at sun.net.www.http.HttpClient.New(HttpClient.java:326) ~[na:1.7.0_75]
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:997) ~[na:1.7.0_75]
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933) ~[na:1.7.0_75]
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:851) ~[na:1.7.0_75]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2$URLFactory.verifyURL(MapReduceFetcherHadoop2.java:167) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2$URLFactory.(MapReduceFetcherHadoop2.java:161) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2$URLFactory.(MapReduceFetcherHadoop2.java:155) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
at com.linkedin.drelephant.mapreduce.MapReduceFetcherHadoop2.(MapReduceFetcherHadoop2.java:69) ~[default.dr-elephant-2.0.3-SNAPSHOT.jar:2.0.3-SNAPSHOT]
... 32 common frames omitted
Following up on the topic above: since using new versions of Hadoop and Spark causes some problems, opening an issue each time is a slow way to get them fixed.
OK, I figured it out in the dev docs.
Hi
I am wondering how the job history page is used. I tried giving an application ID and a job ID, but both result in 'Unable to find record on job url: '
Do we need a specific setup to get this working?
[root@scflexnode09 dr-elephant-master]# ./compile.sh
Using the default configuration
Hadoop Version : 2.6.0
Spark Version : 1.5.0
Other opts set :
Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
Default action is to extract files in list, except those in xlist, to exdir;
file[.zip] may be a wildcard. -Z => ZipInfo mode ("unzip -Z" for usage).
-p extract files to pipe, no messages -l list files (short format)
-f freshen existing files, create none -t test compressed archive data
-u update files, create if necessary -z display archive comment only
-v list verbosely/show version info -T timestamp archive to latest
-x exclude files that follow (in xlist) -d extract files into exdir
modifiers:
-n never overwrite existing files -q quiet mode (-qq => quieter)
-o overwrite files WITHOUT prompting -a auto-convert any text files
-j junk paths (do not make directories) -aa treat ALL files as text
-U use escapes for all non-ASCII Unicode -UU ignore any Unicode fields
-C match filenames case-insensitively -L make (some) names lowercase
-X restore UID/GID info -V retain VMS version numbers
-K keep setuid/setgid/tacky permissions -M pipe through "more" pager
See "unzip -hh" or unzip.txt for more help. Examples:
unzip data1 -x joe => extract all files except joe from zipfile data1.zip
unzip -p foo | more => send contents of foo.zip via pipe into program more
unzip -fo foo ReadMe => quietly replace existing ReadMe if archive file newer
hadoop classpath
:${ELEPHANT_CONF_DIR}"/' /bin/dr-elephantzip error: Nothing to do! (.zip)
Hi,
Does Dr. Elephant support Cloudera Hadoop?
Does Dr. Elephant support HDFS or YARN HA?
Can anyone help me answer these questions?
Hi guys,
I have set up Dr. Elephant, but I am not able to see any jobs on the dashboard:
Hello there, I've been busy!
I looked through 0 jobs today.
About 0 of them could use some tuning.
About 0 of them need some serious attention!
This might be because it is not able to connect to the history server. Can you please help me figure out what I should change to point it to the history server? Or please suggest whether there could be any other issue.
Hi developers and users,
At LinkedIn, we are planning to refactor the UI of Dr. Elephant to make the interaction more intuitive for users, giving them a flow-level perspective first and then letting them drill down to the MapReduce level. The current design shows results at the MapReduce level, and we build on top of it to show information at the job and flow levels.
We want to discuss and get feedback from the community on what technologies you prefer and would like to use to build the UI components. Currently, Dr. Elephant makes use of Play scala templates to design the views. It is simple and modular for our purpose. Do you have any other suggestions?
The intention is to keep the UI simple, user friendly and easy to develop.
Tagging some contributors of Dr. Elephant:
@shankar37 @nntnag17 @stiga-huang @krishnap @paulbramsen @tglstory @ljank @plypaul @hongbozeng @liyintang @brandtg @timyitong @chetnachaudhari @cjuexuan @rsprabery @miloveme @anspuli @aNutForAJarOfTuna
Thanks,
Akshay
Hi,
How does one integrate Dr. Elephant with Azkaban?
Thanks!
In practice, it is possible that a single node cannot analyze all completed applications.
Is it possible to distribute AnalyticJob to a cluster of nodes?
Hello,
When I open the web UI, the server reports the following errors:
2016-04-21 16:12:57,337 - [INFO] - from play in main Application started (Prod)
2016-04-21 16:12:57,564 - [INFO] - from play in main Listening for HTTP on /0:0:0:0:0:0:0:0:8050
2016-04-21 16:13:04,872 - [ERROR] - from play.nettyException in New I/O worker #1
Exception caught in Netty
java.lang.NoSuchMethodError: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext;
at play.core.Invoker$.(Invoker.scala:24) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.Invoker$.(Invoker.scala) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$Implicits$.defaultContext$lzycompute(Execution.scala:7) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$Implicits$.defaultContext(Execution.scala:6) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$.(Execution.scala:10) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.api.libs.concurrent.Execution$.(Execution.scala) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.handleAction$1(PlayDefaultUpstreamHandler.scala:201) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:174) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at com.typesafe.netty.http.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:62) ~[com.typesafe.netty.netty-http-pipelining-1.1.2.jar:na]
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[io.netty.netty-3.8.0.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
2016-04-21 16:13:05,669 - [ERROR] - from play.nettyException in New I/O worker #2
Exception caught in Netty
java.lang.NoClassDefFoundError: Could not initialize class play.api.libs.concurrent.Execution$
at play.core.server.netty.PlayDefaultUpstreamHandler.handleAction$1(PlayDefaultUpstreamHandler.scala:201) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at play.core.server.netty.PlayDefaultUpstreamHandler.messageReceived(PlayDefaultUpstreamHandler.scala:174) ~[com.typesafe.play.play_2.10-2.2.6.jar:2.2.6]
at com.typesafe.netty.http.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:62) ~[com.typesafe.netty.netty-http-pipelining-1.1.2.jar:na]
at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[io.netty.netty-3.8.0.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[io.netty.netty-3.8.0.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
Could someone help me? Thanks.
Hi, there is an exception when I compile Dr.Elephant.
[info] [SUCCESSFUL ] org.jacoco#org.jacoco.agent;0.7.1.201405082137!org.jacoco.agent.jar (5499ms)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: FAILED DOWNLOADS ::
[warn] :: ^ see resolution messages for details ^ ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.fasterxml.jackson.core#jackson-annotations;2.4.4!jackson-annotations.jar(bundle)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: download failed: com.fasterxml.jackson.core#jackson-annotations;2.4.4!jackson-annotations.jar(bundle)
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:213)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:121)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1144)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1142)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1165)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1163)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1167)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1162)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1170)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1135)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1113)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
error sbt.ResolveException: download failed: com.fasterxml.jackson.core#jackson-annotations;2.4.4!jackson-annotations.jar(bundle)
[error] Total time: 675 s, completed May 25, 2016 7:33:24 PM
Could you help me solve it?
After building Dr. Elephant via compile.sh and uncompressing dr-elephant-*.zip, when starting Dr. Elephant you need to set ELEPHANT_CONF_DIR so that it can find required configuration files, such as FetcherConf.xml, which exist in the app-conf directory of the source tree.
So I think we should add these files to the zip package when building Dr. Elephant.
Hi,
I'm trying to compile Dr. Elephant for Hadoop 2.6.0 and Spark 1.5.2, but I'm getting compilation errors. Can you help me fix the error?
Error Message:
[error] /Users/pkasinathan/workspace/dr-elephant/app/org/apache/spark/deploy/history/SparkDataCollection.scala:217: type mismatch;
[error] found : scala.collection.mutable.HashSet[Int]
[error] required: org.apache.spark.util.collection.OpenHashSet[Int]
[error] addIntSetToJSet(data.completedStageIndices, jobInfo.completedStageIndices)
Full Compile Log:
$ ./compile.sh
Using the default configuration
Hadoop Version : 2.6.0
Spark Version : 1.5.2
Other opts set :
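Since compile.sh prints "Hadoop Version" and "Spark Version" when it starts, the versions are presumably driven by a configuration file passed to the script. A sketch of such a file is below; the key names (`hadoop_version`, `spark_version`) and the invocation are assumptions inferred from that output, not verified against the build script:

```shell
# Hypothetical compile configuration: pin the Hadoop and Spark versions
# the build should compile against, then pass the file to compile.sh.
cat > compile.conf <<'EOF'
hadoop_version=2.6.0
spark_version=1.5.2
EOF
# ./compile.sh compile.conf
```

The type-mismatch itself (`mutable.HashSet[Int]` vs `OpenHashSet[Int]`) suggests the Spark version being compiled against changed the type of `jobInfo.completedStageIndices`, so pinning the exact Spark version the code was written for is the first thing to check.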
Hi guys, we have been trying to install Dr. Elephant, and after a lot of troubleshooting we are stuck at this step:
[root@ip-172-31-37-252 bin]# ./start.sh
Using config dir: /usr/dr-elephant/app-conf
Using config file: /usr/dr-elephant/app-conf/elephant.conf
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8081
This is hadoop2.x grid. Add Java library path: /lib/native
Starting Dr. Elephant ....
Dr. Elephant started.
but the process never starts, and looking in the logs we found this:
2016-04-22 20:34:50,374 - [ERROR] - from com.jolbox.bonecp.hooks.AbstractConnectionHook in main
Failed to obtain initial connection Sleeping for 0ms and trying again. Attempts left: 0. Exception: java.net.ConnectException: Connection refused. Message: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
/hadoop/dr-elephant-2.0.3-SNAPSHOT/logs
Please let us know if anybody has experienced this issue or has any idea.
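"Connection refused" from BoneCP means the JDBC driver never reached a database at the configured `db_url`, so the first thing to rule out is that MySQL is actually listening. A minimal reachability check, using the settings start.sh printed above (`db_url: localhost`, default MySQL port 3306 assumed):

```shell
# Check whether anything is listening where Dr. Elephant will connect.
# Uses bash's /dev/tcp redirection so no mysql client is required.
DB_HOST=localhost   # must match db_url in elephant.conf
DB_PORT=3306        # assumed default MySQL port

if (exec 3<>"/dev/tcp/$DB_HOST/$DB_PORT") 2>/dev/null; then
  echo "reachable: $DB_HOST:$DB_PORT"
else
  echo "connection refused: $DB_HOST:$DB_PORT - is mysqld running, and is db_url correct?"
fi
```

If the port is closed, start mysqld (or fix `db_url`) before starting Dr. Elephant; if it is open, the next suspects are the database name, user, and password in elephant.conf.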
After making a change to log4j, the code currently has to be recompiled for the change to take effect.
A restart of Dr. Elephant should make the change effective right away.
Dr. Elephant has started, but I am not able to see any jobs on the UI.
I have hadoop-2.6.0
spark-1.3.0
When I run Dr. Elephant I am getting:
Reading from config file...
db_url: localhost
db_name: drelephant
db_user: root
http port: 8080
This is hadoop2.x grid. Add Java library path: /Users/Persistent/lib/hadoop-2.6.0/lib/native
Starting Dr. Elephant ....
Dr. Elephant started.
I have started the JobHistory server as well.
On the Dr. Elephant UI I am getting:
Hello there, I've been busy!
I looked through 0 jobs today.
About 0 of them could use some tuning.
About 0 of them need some serious attention!
and my dr_elephant.log file contains:
05-19-2016 09:54:07 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : AnalysisProvider updating its Authenticate Token...
05-19-2016 09:54:14 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463576245175, and current time: 1463631787423
05-19-2016 09:54:14 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463576245175&finishedTimeEnd=1463631787423
05-19-2016 09:54:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463576245175&finishedTimeEnd=1463631787423
05-19-2016 09:54:22 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:54:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:54:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631787424, and current time: 1463631802451
05-19-2016 09:54:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631787424&finishedTimeEnd=1463631802451
05-19-2016 09:54:23 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463631787424&finishedTimeEnd=1463631802451
05-19-2016 09:54:23 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:55:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:55:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631802452, and current time: 1463631862562
05-19-2016 09:55:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631802452&finishedTimeEnd=1463631862562
05-19-2016 09:55:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463631802452&finishedTimeEnd=1463631862562
05-19-2016 09:55:22 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:56:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:56:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631862563, and current time: 1463631922456
05-19-2016 09:56:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631862563&finishedTimeEnd=1463631922456
05-19-2016 09:56:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=FAILED&finishedTimeBegin=1463631862563&finishedTimeEnd=1463631922456
05-19-2016 09:56:22 INFO com.linkedin.drelephant.ElephantRunner : Job queue size is 0
05-19-2016 09:57:22 INFO com.linkedin.drelephant.ElephantRunner : Fetching analytic job list...
05-19-2016 09:57:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Fetching recent finished application runs between last time: 1463631922457, and current time: 1463631982451
05-19-2016 09:57:22 INFO com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is http://127.0.0.1:8088/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1463631922457&finishedTimeEnd=1463631982451
Please give me a solution.
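"Job queue size is 0" means the ResourceManager REST queries in the log above returned no finished applications for the time window. You can rebuild the same query by hand and check what the RM returns; the RM address and timestamps below are copied from the log, and the URL format is exactly what Dr. Elephant logs:

```shell
# Reconstruct the succeeded-apps query from the dr_elephant.log above,
# so it can be run manually against the ResourceManager REST API.
RM=http://127.0.0.1:8088
BEGIN=1463576245175
END=1463631787423
URL="$RM/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=$BEGIN&finishedTimeEnd=$END"
echo "$URL"
# curl -s "$URL"   # expect a non-empty "apps" element if any job
#                  # actually finished inside the time window
```

If the response is `{"apps":null}`, no jobs finished in that window as far as the RM knows, and the thing to verify is that your test jobs ran on this cluster (and that `127.0.0.1:8088` really is your ResourceManager, not a default).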
Hi,
my env:
centos 6.5
java 1.8.0_72
Hadoop 2.6.0-cdh5.4.3
dr-elephant is built from source (master branch)
During application start:
Exception in thread "Thread-6" java.lang.ExceptionInInitializerError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2051)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2016)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2110)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2136)
at org.apache.hadoop.security.Groups.<init>(Groups.java:78)
at org.apache.hadoop.security.Groups.<init>(Groups.java:74)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:303)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:311)
at com.linkedin.drelephant.security.HadoopSecurity.<init>(HadoopSecurity.java:43)
at com.linkedin.drelephant.ElephantRunner.run(ElephantRunner.java:100)
at com.linkedin.drelephant.DrElephant.run(DrElephant.java:34)
Caused by: java.lang.RuntimeException: Bailing out since native library couldn't be loaded
at org.apache.hadoop.security.JniBasedUnixGroupsMapping.<clinit>(JniBasedUnixGroupsMapping.java:46)
... 14 more
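`JniBasedUnixGroupsMapping` bails out when the Hadoop native library (libhadoop.so) cannot be loaded. One common workaround, offered here as an assumption rather than the project's documented fix, is to make the native libraries visible to the JVM before running start.sh:

```shell
# Assumed workaround: expose the Hadoop native libraries on the library
# path so JniBasedUnixGroupsMapping can load libhadoop.so. The
# HADOOP_HOME path below is an example; use your installation's path.
export HADOOP_HOME=/opt/hadoop
export LD_LIBRARY_PATH="$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH"
echo "$LD_LIBRARY_PATH"
```

Note that start.sh already prints "Add Java library path" on other installs above, so the alternative is to check why that path does not contain (or cannot load) the CDH-built natives, e.g. a glibc mismatch between the build host and CentOS 6.5.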
When I hit the Help tab:
[error] play - Cannot invoke the action, eventually got an error: java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2
[error] application -
! @712k9ga21 - Internal server error, for (GET) [/help] ->
play.api.Application$$anon$1: Execution exception[[RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2]]
at play.api.Application$class.handleError(Application.scala:293) ~[com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.api.DefaultApplication.handleError(Application.scala:399) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2$$anonfun$applyOrElse$3.apply(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
at scala.Option.map(Option.scala:145) [org.scala-lang.scala-library-2.10.4.jar:na]
at play.core.server.netty.PlayDefaultUpstreamHandler$$anonfun$2.applyOrElse(PlayDefaultUpstreamHandler.scala:261) [com.typesafe.play.play_2.10-2.2.2.jar:2.2.2]
Caused by: java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.mapreduce.fetchers.MapReduceFetcherHadoop2
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:181) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:103) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.<init>(ElephantContext.java:98) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.instance(ElephantContext.java:91) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.helpPage$.apply(helpPage.template.scala:69) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at views.html.page.helpPage$.render(helpPage.template.scala:90) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
Caused by: java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_72]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_72]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_72]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_72]
at com.linkedin.drelephant.ElephantContext.loadFetchers(ElephantContext.java:160) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
at com.linkedin.drelephant.ElephantContext.loadConfiguration(ElephantContext.java:103) ~[com.linkedin.drelephant.dr-elephant-2.0.5.jar:2.0.5]
Caused by: java.net.UnknownHostException: null
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) ~[na:1.8.0_72]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_72]
at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_72]
at java.net.Socket.connect(Socket.java:538) ~[na:1.8.0_72]
at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[na:1.8.0_72]
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[na:1.8.0_72]
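The root cause here is `UnknownHostException: null` raised while constructing MapReduceFetcherHadoop2, i.e. the fetcher tried to open an HTTP connection to a host that was never set. A plausible culprit (an assumption, not a verified fix for this report) is a missing job history server address in mapred-site.xml, which would leave the fetcher with a null host; the property name is standard Hadoop, the hostname below is an example:

```xml
<!-- Sketch, not a verified fix: MapReduceFetcherHadoop2 talks to the
     MapReduce job history server's web UI. If this property is absent
     from mapred-site.xml on the Dr. Elephant host, the target host can
     end up null, matching the UnknownHostException above. -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>historyserver.example.com:19888</value>
</property>
```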
Can a LinkedIn Github admin please go to https://travis-ci.org/linkedin/dr-elephant/ and click activate? It doesn't cost anything and would be great to see build statuses in PRs. All the supporting infrastructure is in place as of #86. Thank you!