mesos / hadoop
Hadoop on Mesos
For some use cases and environments it's handy to set the HADOOP_HOME variable, especially if you're using Hadoop Streaming and need to access resources inside the Hadoop source directory.
I have a feeling passing this as an environment variable to the executor should do the trick.
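Until something like that lands, one minimal workaround sketch is to export the variable from hadoop-env.sh on every slave, which the Hadoop scripts source on startup (the install path below is a placeholder, not a known default):

```shell
# conf/hadoop-env.sh on each slave; /opt/hadoop is a placeholder path
export HADOOP_HOME=/opt/hadoop
```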
Hey. I'm trying to run the Hadoop JobTracker in containerised bridged network mode (i.e. the default network mode for Docker). My goal is to launch the JobTracker with Marathon and map the ports randomly to my host system, finding the web UI and JobTracker IPC ports through service discovery.
The hostname of the host (not the Docker container hostname, which is a random hash string) is available through the environment variable $HOSTNAME, and using this at runtime when launching the JobTracker in host mode (that is, with --net=host when issuing docker run) works fine. The script I have for doing this is very similar to this one.
When running in bridged mode, I first tried setting the mapred.job.tracker property to localhost:9001, and likewise for the web UI property. However, this disabled external exposure of the port: I got no contact with the container when running the mapping using docker run -p 8080:50030 -p 9001:9001 ...
Changing mapred.job.tracker to $HOST:9001, where $HOST equals the hostname of the Docker container, enabled me to contact the container, and it seems to work alright. The only bummer is that the hostname in the web UI is a random Docker string, which would be nice to override, but never mind. Everything seems to be working, until I look in Mesos.
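For reference, this is the kind of bridged-mode invocation I mean, with Docker's -h flag added as one possible way to replace the random container hostname with the host's (the image name is a placeholder):

```shell
# map the web UI and IPC ports, and give the container the host's hostname
docker run -p 8080:50030 -p 9001:9001 -h "$HOSTNAME" my-jobtracker-image
```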
In Mesos I see a new Hadoop: (RPC port: 9001, WebUI port: 50030) framework trying to register itself every 3 seconds or so, without success. Enabling more debugging on the client side (using export GLOG_v=2), I see the following output when starting up the tracker:
I0210 15:39:21.288871 881 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:21.288862976+00:00
I0210 15:39:21.288918 881 pid.cpp:87] Attempting to parse '[email protected]:5050' into a PID
I0210 15:39:21.288990 881 sched.cpp:234] New master detected at [email protected]:5050
I0210 15:39:21.289201 881 sched.cpp:242] No credentials provided. Attempting to register without authentication
I0210 15:39:21.289227 881 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:21.289579 891 process.cpp:2692] Resuming zookeeper-master-detector(1)@198.41.200.200:58589 at 2015-02-10 15:39:21.289572096+00:00
15/02/10 15:39:21 INFO util.HostsFileReader: Setting the includes file to
15/02/10 15:39:21 INFO util.HostsFileReader: Setting the excludes file to
15/02/10 15:39:21 INFO util.HostsFileReader: Refreshing hosts (include/exclude) list
15/02/10 15:39:21 INFO mapred.JobTracker: Decommissioning 0 nodes
15/02/10 15:39:21 INFO ipc.Server: IPC Server Responder: starting
15/02/10 15:39:21 INFO ipc.Server: IPC Server listener on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 0 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 1 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 2 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 3 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 5 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 6 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 7 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 4 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 8 on 9001: starting
15/02/10 15:39:21 INFO mapred.JobTracker: Starting RUNNING
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 9 on 9001: starting
I0210 15:39:22.289831 882 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:22.289812992+00:00
I0210 15:39:22.289913 882 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:23.290331 892 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:23.290322176+00:00
I0210 15:39:23.290382 892 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:24.290717 881 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:24.290708992+00:00
I0210 15:39:24.290767 881 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:25.291088 893 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:25.291079168+00:00
I0210 15:39:25.291139 893 sched.cpp:481] Sending registration request to [email protected]:5050
And the "Resuming scheduler", "Sending registration request"... output continues forever.
I0210 10:28:00.393909 31003 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:00.394260 31003 master.cpp:1447] Registering framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:00.394639 30999 hierarchical_allocator_process.hpp:329] Added framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:00.395987 31005 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:00.732815 31008 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:01.071971 31002 hierarchical_allocator_process.hpp:405] Deactivated framework 20140618-174325-1209730570-5050-4637-0002
I0210 10:28:01.394320 30997 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:01.394753 30997 master.cpp:1434] Framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 already registered, resending acknowledgement
I0210 10:28:02.394582 31011 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:02.395097 31011 master.cpp:1434] Framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 already registered, resending acknowledgement
I0210 10:28:03.363574 31000 master.cpp:789] Framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 disconnected
I0210 10:28:03.363788 31000 master.cpp:1752] Disconnecting framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.363852 31000 master.cpp:1768] Deactivating framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.363956 31002 hierarchical_allocator_process.hpp:405] Deactivated framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:03.364524 31000 master.cpp:811] Giving framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 0ns to failover
I0210 10:28:03.364547 31008 hierarchical_allocator_process.hpp:563] Recovered cpus(*):15.8; mem(*):192135; ports(*):[31000-32000, 8001-9000]; disk(*):1.51388e+06 (total allocatable: cpus(*):15.8; mem(*):192135; ports(*):[31000-32000, 8001-9000]; disk(*):1.51388e+06) on slave 20150204-135039-1176176138-5050-11013-S0 from framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:03.364784 31007 master.cpp:3713] Framework failover timeout, removing framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.364966 31007 master.cpp:4271] Removing framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.365542 31007 hierarchical_allocator_process.hpp:360] Removed framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:03.394796 31011 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:03.395130 31011 master.cpp:1447] Registering framework 20150204-151306-1176176138-5050-30988-1937 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.395417 31005 hierarchical_allocator_process.hpp:329] Added framework 20150204-151306-1176176138-5050-30988-1937
I0210 10:28:03.396935 30996 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1937 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.583683 31005 http.cpp:478] HTTP request for '/master/state.json'
I0210 10:28:03.737133 30998 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1937 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
This, too, continues in a loop like this forever.
My questions: at what point does the framework actually register with the master? When driver.start() is called, or, looking at the Mesos code, perhaps during initialisation?
Sorry for the long wall of text here, but I didn't want to exclude any (perhaps) important details. I'd appreciate any feedback, including feedback that doesn't necessarily give away "the solution". :-)
I'm having an issue running a Hadoop job on a Mesos cluster. I followed the README and was successful in starting the JobTracker and running the wordcount example Hadoop job on the cluster.
However, when I try to launch a larger job (Camus, exporting data from Kafka to HDFS), I see only one TaskTracker started, allocating only 2 map slots (the default configured for a node) and not using any other nodes (the cluster consists of 5 nodes and I requested 30 map tasks in total).
In my setup I use the Cloudera Hadoop distribution version 2.6.0-cdh5.4.2, Mesos version 0.22.1, and the latest mesos-hadoop-mr1-0.1.1 (git commit c972174).
What am I missing? Or is it intended behavior?
Thanks.
Is Hadoop 2.2 supported? Or when do you think you will support it?
Thanks in advance
Matteo
http://www.redaelli.org/matteo/
Is it really necessary to keep extracting the Hadoop distribution every time (referenced by mapred.mesos.executor.uri)? Spark, in my opinion, does this somewhat smarter by referencing (if no URI is given) the current Hadoop distribution's path to the (equivalent of the) TaskTracker. This often holds on clusters with shared storage among the nodes. This would really shorten the startup time. Or do I have a configuration error, given all those extractions I'm seeing?
Besides the extra job slowdown this incurs, it's also very wasteful disk-space-wise.
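For context, the property in question is set in mapred-site.xml along these lines (the HDFS URI is a placeholder):

```xml
<property>
  <name>mapred.mesos.executor.uri</name>
  <value>hdfs://namenode:9000/hadoop-2.5.0-cdh5.2.0.tar.gz</value>
</property>
```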
logs/hadoop-hadoop-jobtracker-e1bc6944193b.log
2016-04-15 01:27:11,987 INFO org.apache.hadoop.mapred.ResourcePolicy: Launching task Task_Tracker_1 on http://10.102.0.7:31597 with mapSlots=2 reduceSlots=1
2016-04-15 01:27:11,987 INFO org.apache.hadoop.mapred.ResourcePolicy: URI: hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz, name: hadoop-2.5.0-cdh5.2.0.tar.gz
2016-04-15 01:27:12,019 INFO org.apache.hadoop.mapred.ResourcePolicy: Unable to fully satisfy needed map/reduce slots: 1 map slots remaining
2016-04-15 01:27:12,407 INFO org.apache.hadoop.mapred.MesosScheduler: Status update of Task_Tracker_1 to TASK_FAILED with message
2016-04-15 01:27:12,407 INFO org.apache.hadoop.mapred.MesosScheduler: Removing terminated TaskTracker: http://10.102.0.7:31597
2016-04-15 01:27:12,989 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
Mesos cluster Sandbox logs
I0415 01:28:02.248667 1514 logging.cpp:172] INFO level logging started!
I0415 01:28:02.249013 1514 fetcher.cpp:409] Fetcher Info: {"cache_directory":"/tmp/mesos/fetch/slaves/20160414-122017-100689418-5050-344-S2/hadoop","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz"}}],"sandbox_directory":"/var/lib/mesos/slaves/20160414-122017-100689418-5050-344-S2/frameworks/20160415-012211-100689418-5050-1018-0000/executors/executor_Task_Tracker_99/runs/aca23ee7-fd55-4c6c-b01a-c3d292fb2ba9","user":"hadoop"}
I0415 01:28:02.253449 1514 fetcher.cpp:364] Fetching URI 'hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz'
I0415 01:28:02.253487 1514 fetcher.cpp:238] Fetching directly into the sandbox directory
I0415 01:28:02.253516 1514 fetcher.cpp:176] Fetching URI 'hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz'
mesos-fetcher: ../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90: const string& Try::error() const [with T = bool; std::string = std::basic_string]: Assertion `data.isNone()' failed.
*** Aborted at 1460683682 (unix time) try "date -d @1460683682" if you are using GNU date ***
PC: @ 0x7f90ffe5ccc9 (unknown)
*** SIGABRT (@0x5ea) received by PID 1514 (TID 0x7f9105b8d7c0) from PID 1514; stack trace: ***
@ 0x7f91001fb340 (unknown)
@ 0x7f90ffe5ccc9 (unknown)
@ 0x7f90ffe600d8 (unknown)
@ 0x7f90ffe55b86 (unknown)
@ 0x7f90ffe55c32 (unknown)
@ 0x460d43 Try<>::error()
@ 0x450f8d downloadWithHadoopClient()
@ 0x451f7c download()
@ 0x452347 fetchBypassingCache()
@ 0x45315f fetch()
@ 0x4539c5 main
@ 0x7f90ffe47ec5 (unknown)
@ 0x450459 (unknown)
Aborted (core dumped)
Failed to synchronize with slave (it's probably exited)
In some situations, when using this framework on a very small Mesos cluster (enough for only a few tasks), only map slots will be allocated and zero reduce slots. This can cause a deadlock. Although the effect should be reduced by the introduction of #33 (once the map slots become idle, reduce slots would take over), it can still occur in some cases with certain resource allocations.
I guess we should do a better job of allocating a decent map/reduce slot ratio.
Any guide on deploying the CDH5.0.2 tarball with MRv1?
My plan is to deploy pure Hadoop (without Mesos) first, then connect Hadoop with Mesos, but it seems hard to deploy the CDH5.0.2 tarball with MRv1. Any guide on this?
Thanks
Kind of obvious, but this is something I misconfigured, and it took me a really long time to get to the bottom of it. I just assumed that if I configured it to zero, no maximum would apply (which was the desired behaviour).
Perhaps this should be documented, or an exception should be thrown to highlight the configuration error. Alternatively, make zero mean no maximum?
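To make the suggestion concrete, here is a minimal sketch (not the framework's actual code; the semantics are just the ones proposed above) of treating zero as "no maximum" while rejecting clearly invalid values:

```python
# Sketch: interpret a configured slot maximum, making the zero case explicit
# instead of silently applying a maximum of zero slots.

def effective_max_slots(configured: int) -> float:
    """Return the effective slot cap; zero is taken to mean "no maximum"."""
    if configured < 0:
        raise ValueError("slot maximum must be >= 0")
    if configured == 0:
        return float("inf")  # the "zero means unlimited" behaviour desired above
    return configured

print(effective_max_slots(0))   # inf
print(effective_max_slots(50))  # 50
```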
What's the reason you might want to limit the number of slots per TT?
I ran the wordcount demo using hadoop-2.5.0-cdh5.3.2 (hadoop-mapreduce1-project) + mesos-0.23.0-rc4, and found "launched but no heartbeat yet" in the JobTracker log all the time.
In the task's stdout:
CPLUS_INCLUDE_PATH=/opt/lib/boost_1_58_0
MANPATH=/opt/lib/mvapich2.2/share/man:/opt/compiler/gcc-4.8.2/man:/usr/share/man
HOSTNAME=dn-137-211
...
HISTSIZE=1000
HADOOP_HOME=/home/hadoop/hadoop2
HADOOP_DEV_HOME=/home/hadoop/hadoop2
LIBRARY_PATH=/opt/compiler/gcc-4.8.2/lib64:/opt/lib/cuda-6.5/lib
MESOS_DIRECTORY=/home/mesos/slave/slaves/20150918-160900-3549014208-5050-14584-S0/frameworks/20150923-170913-3549014208-5050-6111-0008/executors/executor_Task_Tracker_0/runs/39ece48d-1f91-446b-ae49-72a0a6f66346
FPATH=/opt/lib/mvapich2.2/include
OLDPWD=/home/mesos/slave/slaves/20150918-160900-3549014208-5050-14584-S0/frameworks/20150923-170913-3549014208-5050-6111-0008/executors/executor_Task_Tracker_0/runs/39ece48d-1f91-446b-ae49-72a0a6f66346
SSH_TTY=/dev/pts/3
LC_ALL=C
USER=root
.....
LD_LIBRARY_PATH=/opt/lib/mvapich2.2/lib:/opt/lib/mvapich2.2/lib/shared:/opt/lib/liblmdb-0.9/lib:/opt/lib/protobuf-2.5/lib:/opt/lib/gflag-1.4.0/lib:/opt/lib/glog-0.3.3/lib:/opt/lib/boost_1_58_0/stage/lib:/opt/lib/opencv-2.4.9/lib:/opt/lib/log4cplus-1.2.0-rc3/lib:/opt/tool/intel/lib/intel64:/opt/tool/intel/mkl/lib/intel64:/opt/compiler/gcc-4.8.2/lib64:/opt/lib/cuda-6.5/lib64
MESOS_EXECUTOR_ID=executor_Task_Tracker_0
CPATH=/opt/lib/mvapich2.2/include:/opt/lib/liblmdb-0.9/include:/opt/lib/protobuf-2.5/include:/opt/lib/gflag-1.4.0/include:/opt/lib/glog-0.3.3/include:/opt/lib/opencv-2.4.9/include:/opt/lib/log4cplus-1.2.0-rc3/include:/opt/lib/cuda-6.5/include
HADOOP_MAPARED_HOME=/home/hadoop/hadoop2
PATH=/home/hadoop/spark/bin:/home/hadoop/spark/sbin:/home/hadoop/spark/lib:/opt/scheduler/mesos-0.23.0-rc4/libexec/mesos:/opt/scheduler/mesos-0.23.0-rc4/libexec:/opt/scheduler/mesos-0.23.0-rc4/bin:/opt/scheduler/mesos-0.23.0-rc4/sbin:/opt/scheduler/mesos-0.23.0-rc4/lib:/home/hadoop/hadoop2/sbin:/home/hadoop/hadoop2/bin:/opt/tool/git-2.4.5/bin:/opt/tool/git-2.4.5/bin:....6.5/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/root/bin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin
HDFS_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
MESOS_HOME=/home/mesos
HADOOP_HDFS_HOME=/home/hadoop/hadoop2
PWD=/home/mesos/slave/slaves/20150918-160900-3549014208-5050-14584-S0/frameworks/20150923-170913-3549014208-5050-6111-0008/executors/executor_Task_Tracker_0/runs/39ece48d-1f91-446b-ae49-72a0a6f66346/hadoop-2.5.0-cdh5.3.2
HADOOP_COMMON_HOME=/home/hadoop/hadoop2
F90=gfortran
MESOS_NATIVE_JAVA_LIBRARY=/opt/scheduler/mesos/lib/libmesos-0.23.0.so
JAVA_HOME=/opt/lib/jdk
MESOS_NATIVE_LIBRARY=/opt/scheduler/mesos/lib/libmesos.so
HADOOP_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
HADOOP_OPTS=-Xmx4096m -XX:NewSize=1365m -XX:MaxNewSize=2457m
MESOS_SLAVE_PID=slave(1)@192.168.137.211:5051
MESOS_FRAMEWORK_ID=20150923-170913-3549014208-5050-6111-0008
MESOS_PATH=/opt/scheduler/mesos-0.23.0-rc4
MESOS_CHECKPOINT=0
SHLVL=2
HOME=/root
LIBPROCESS_PORT=0
YARN_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
MESOS_SLAVE_ID=20150918-160900-3549014208-5050-14584-S0
MODULESHOME=/usr/share/Modules
HADOOP_BIN=/home/hadoop/hadoop2/bin
...
_=/bin/env
Error occurred during initialization of VM
Too small initial heap for new size specified
Hi,
I can't get my cluster of 80 CPUs and 200GB+ of mem to allocate the last 8.5 CPUs.
In the logging I can see this repeatedly:
15/03/09 11:56:12 INFO mapred.ResourcePolicy: Declining offer with insufficient resources for a TaskTracker:
cpus: offered 0.8499999940395355 needed at least 0.15000000596046448
mem : offered 20182.0 needed at least 368.0
disk: offered 1859053.0 needed at least 0.0
ports: at least 2 (sufficient)
I'm not sure why hadoop/mesos is declining; every resource demand appears to have been met.
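One possible explanation, sketched with hypothetical per-slot constants (not the framework's real numbers): the fraction of a CPU left on a node can exceed the logged "needed at least" overhead and still be too small to fit a single whole slot, in which case the offer is declined:

```python
# Sketch: an offer must cover the TaskTracker overhead AND at least one
# whole slot. Per-slot and overhead constants here are hypothetical.

def slots_fitting(offer_cpus, offer_mem,
                  cpus_per_slot=1.0, mem_per_slot=1024.0,
                  tt_cpus=0.15, tt_mem=368.0):
    """How many whole slots fit after reserving TaskTracker overhead?"""
    spare_cpus = offer_cpus - tt_cpus
    spare_mem = offer_mem - tt_mem
    if spare_cpus < 0 or spare_mem < 0:
        return 0
    return int(min(spare_cpus // cpus_per_slot, spare_mem // mem_per_slot))

# The offer from the log: ~0.85 CPUs spare on the node.
print(slots_fitting(0.8499999940395355, 20182.0))  # 0 -> declined
```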
I've not got this up and running yet, so I'm just thinking it through. It'd be great (and a requirement for me) to support HA JobTrackers. I think there are a couple of small issues that currently prevent that, though please correct me if I'm wrong. Is anyone else successfully running HA JobTrackers on Mesos?
The framework sets the mapred.job.tracker option to a host:port combo. This isn't how HA JobTrackers work: you list the name:host:port combos elsewhere in the configuration and reference them with an alias here. Given the way the configuration is passed to the TaskTrackers, the only issue I can find here is #27. I can confirm that with this fix applied I can at least get TTs to launch and run with an HA JobTracker config.
Looking forward to anyone's thoughts...
So this is an interesting issue. I've seen (several times now) situations where the Hadoop scheduler will get itself into a deadlock with running jobs. Here's how it goes.
Cluster: Some number of mesos slaves, let's say the resources equate to 100 slots. The underlying scheduler here is the hadoop FairScheduler, not the FIFO one.
At this point, all the cluster resources are being given to the running TaskTrackers. Those resources will not be released until the running job completes, but that job is waiting for some reducers to launch. This is a deadlock: the job will never complete because the TaskTrackers are never released, and vice versa.
I'm wondering if you can suggest anything here @brndnmtthws @florianleibert @benh?
This problem fits quite well with the Task/Executor relationship. In this example I need to keep the executors alive (so they can stream data to the reducers for the shuffle/sort), but I need to free up the "slots", i.e. the task resources. Perhaps the framework could terminate the Task that holds the resources for the slots independently of the TaskTracker itself, and then internally mark that TaskTracker as "going to be killed soon".
We would have to maintain some state internally, because it is not possible to reduce the number of slots on a TaskTracker while it is running, so the hadoop/mesos scheduler needs to proactively not schedule tasks there. Though I don't think this is too complicated to do.
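The "going to be killed soon" bookkeeping could look roughly like this (a Python sketch; class and method names are illustrative, not the actual scheduler's):

```python
# Sketch: mark a TaskTracker as draining so no new tasks are assigned to it,
# while its executor stays alive to serve map output for the shuffle.

class TrackerState:
    def __init__(self, slots):
        self.slots = slots      # slot resources that could be given back
        self.draining = False   # "going to be killed soon"

class SchedulerSketch:
    def __init__(self):
        self.trackers = {}

    def mark_draining(self, host):
        # Here the Task holding the slot resources would be terminated,
        # returning those resources to Mesos; the executor lives on.
        self.trackers[host].draining = True

    def schedulable(self, host):
        return not self.trackers[host].draining

s = SchedulerSketch()
s.trackers["tt1"] = TrackerState(slots=2)
s.mark_draining("tt1")
print(s.schedulable("tt1"))  # False
```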
Output from jStack:
Attaching to process ID 22531, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
Deadlock Detection:
Found one Java-level deadlock:
"IPC Server handler 4 on 7676":
waiting to lock Monitor@0x00007f01ec11b858 (Object@0x00000000831924a0, a org/apache/hadoop/mapred/MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock Monitor@0x00007f01ec32f9d8 (Object@0x00000000830f4310, a org/apache/hadoop/mapred/JobTracker),
which is held by "IPC Server handler 4 on 7676"
Found a total of 1 deadlock.
With the recent enhancements that landed related to freeing up some resources when a TaskTracker becomes idle, Hadoop is a little less greedy about holding onto cluster resources it's not actually using. However, because this is based on the whole TaskTracker being idle, we don't get the best chance of freeing resources when TTs have mixed slots, both map and reduce.
We should launch separate TTs for map and reduce slots. To do this effectively, we probably want to bunch as many map or reduce slots onto each node as possible, as opposed to the current logic, which is to apply the map/reduce slot ratio to each incoming offer. Take the following example...
1 Slot = 1 CPU and 1GB RAM
Offers:
Pending tasks:
Current result:
Ideal Result:
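The contrast between the current per-offer split and the proposed bunching can be sketched like this (Python; the offers and slot counts are hypothetical):

```python
# Sketch: two allocation policies for map/reduce slots across offers.
# 1 slot = 1 CPU and 1 GB RAM, as above; the numbers are made up.

def per_offer_ratio(offers, maps, reduces):
    """Current logic: apply the map/reduce ratio to each incoming offer."""
    total = maps + reduces
    return [{"map": round(s * maps / total),
             "reduce": s - round(s * maps / total)} for s in offers]

def bunched(offers, maps, reduces):
    """Proposed logic: fill each node with one slot type before moving on."""
    plan = []
    for s in offers:
        m = min(s, maps)
        r = min(s - m, reduces)
        maps, reduces = maps - m, reduces - r
        plan.append({"map": m, "reduce": r})
    return plan

offers = [4, 4]  # two nodes, 4 slots each
print(per_offer_ratio(offers, maps=4, reduces=4))  # every TT mixed: 2 map + 2 reduce
print(bunched(offers, maps=4, reduces=4))          # maps on one node, reduces on the other
```

With bunching, a whole TT frees up as soon as one phase ends, instead of every mixed TT staying half-busy.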
This is something I've been thinking about for a while, and I'd be interested whether anyone else (@brndnmtthws @florianleibert?) has any contributions to the idea.
Currently the Hadoop on Mesos framework treats every job equally: each requires equal CPU and equal memory, and runs in the same TaskTracker (Mesos task) environment. This is not actually always the case, and making the framework more intelligent could reap some great benefits...
Of course, spinning up a Hadoop TT for every job might be a little excessive, so the scheduler could be more intelligent and bucket types of jobs to types of task trackers. The Job.Task->TaskTracker assignment would need to change too, I guess.
In doing this the framework starts to become on par with YARN, or even more efficient, as we're able to share TTs between jobs that can share. As far as I'm aware, YARN will launch a little JT and TTs for each job you submit? I'm probably wrong, though.
The third point (roles) is the one I'm most interested in seeing first.
(Perhaps something for the #MesosCon hackathon 😄)
With Mesos 0.27.0, the class org.apache.mesos.Protos.CommandInfo.ContainerInfo is removed (it was deprecated before 0.27.0).
As a result, the mesos-hadoop build fails with Mesos versions >= 0.27. A mesos-hadoop code update is required to match the new Mesos version.
Thank you very much
In kill tasks, lines 128-138: scheduleSuicideTimer doesn't block, so the thread creating the TASK_FINISHED message possibly (and in some cases often) finishes before the executor kills the task. This can leave processes open and cause offers to go out to new tasks before they're available, leading to crashes due to port conflicts etc.
I think the solution is to put a blocking version of scheduleSuicideTimer() in the thread that builds the TASK_FINISHED message.
Thoughts?
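The proposed ordering can be sketched with a blocking handshake (a Python stand-in for the Java executor; the names and the Event-based wait are illustrative, not the real API):

```python
# Sketch: block until the kill completes before building TASK_FINISHED,
# so ports are actually free before new offers can reuse them.
import threading

task_killed = threading.Event()
log = []

def suicide_timer():
    log.append("killing task / freeing ports")
    task_killed.set()

def finish_task():
    threading.Thread(target=suicide_timer).start()
    task_killed.wait(timeout=5)       # blocking, instead of racing ahead
    log.append("sending TASK_FINISHED")

finish_task()
print(log)  # ['killing task / freeing ports', 'sending TASK_FINISHED']
```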
Even after copying hadoop-mesos-0.1.0.jar into the share/hadoop/common/lib folder, I get this error. I can see that the JobTracker is loading this lib; it appears in the JobTracker message "STARTUP_MSG: classpath = ...".
I am using Ubuntu 18.04 with OpenJDK 1.8.0_252.
I'm trying to run this project on an older MapReduce, 1.0.x. If it were up to me I wouldn't be using it, but I have to. I've already successfully set this up under the recommended 2.5.0 version and had no problems at all, so this is probably something specific to 1.0.x.
Trying to get this to work eventually leads me to the following error/log on a Mesos executor:
cat /local/vdbogert/var/lib/mesos/slaves/20150420-104917-33592586-5050-17145-S0/frameworks/20150424-154625-234919178-5050-2360-0000/executors/executor_Task_Tracker_97/runs/latest/stderr
I0424 15:47:49.596091 9570 exec.cpp:132] Version: 0.21.0
I0424 15:47:49.610085 9586 exec.cpp:206] Executor registered on slave 20150420-104917-33592586-5050-17145-S0
15/04/24 15:47:49 INFO mapred.MesosExecutor: Executor registered with the slave
15/04/24 15:47:49 INFO mapred.MesosExecutor: Launching task : Task_Tracker_97
java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobConf.writeXml(Ljava/io/Writer;)V
at org.apache.hadoop.mapred.MesosExecutor.configure(MesosExecutor.java:48)
at org.apache.hadoop.mapred.MesosExecutor.launchTask(MesosExecutor.java:80)
Exception in thread "Thread-1" I0424 15:47:49.731550 9586 exec.cpp:413] Deactivating the executor libprocess
Has anybody got this to work under older Hadoop distros? If not, can someone estimate how much work it would be to solve problems like the above?
And as a last question, how would I solve the above error?
Thanks
My requirement is to configure Hadoop on Mesos, where Mesos and Hadoop will be installed on different servers. I have the below queries on that.
I am a storage guy, and Hadoop/Mesos are totally new to me. Any help with basic information would be really appreciated.
Thanks.
Hello!
I am using Maven 3.3.3, and when I followed the first few instructions in the README, I got this result:
me@server# mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building mesos-hadoop-mr1 0.1.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://clojars.org/repo/org/clojars/brenden/metrics-cassandra/3.1.0/metrics-cassandra-3.1.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.926 s
[INFO] Finished at: 2015-08-05T12:34:48-06:00
[INFO] Final Memory: 12M/239M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project mesos-hadoop-mr1: Could not resolve dependencies for project com.github.mesos:mesos-hadoop-mr1:jar:0.1.1-SNAPSHOT: Failed to collect dependencies at org.clojars.brenden:metrics-cassandra:jar:3.1.0: Failed to read artifact descriptor for org.clojars.brenden:metrics-cassandra:jar:3.1.0: Could not transfer artifact org.clojars.brenden:metrics-cassandra:pom:3.1.0 from/to clojars.org (http://clojars.org/repo): Access denied to: http://clojars.org/repo/org/clojars/brenden/metrics-cassandra/3.1.0/metrics-cassandra-3.1.0.pom , ReasonPhrase:Forbidden. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
My machine is behind a proxy, but the proxy settings are correctly configured and Maven is able to download other dependencies. This one in particular fails, even though I can access it in lynx and can download the pom at that URL using wget.
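One thing that may be worth trying (a guess, not a confirmed fix): some proxies and repositories reject plain-http artifact downloads, and a Maven mirror entry in ~/.m2/settings.xml can redirect the http repo to https. The mirrorOf value below assumes the pom's repository id is clojars.org:

```xml
<!-- ~/.m2/settings.xml: redirect the plain-http clojars repo to https -->
<settings>
  <mirrors>
    <mirror>
      <id>clojars-https</id>
      <mirrorOf>clojars.org</mirrorOf>
      <url>https://clojars.org/repo</url>
    </mirror>
  </mirrors>
</settings>
```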
One of Mesos's promises was being able to run multiple instances of one type of framework, like Hadoop in this case. The original Mesos paper showed two concurrent Hadoop instances.
I've tried to set up an experiment where I have two JobTrackers sharing one and the same Mesos backend, but if one JobTracker's queue is full enough, it takes all the resources and keeps them indefinitely, starving the other JobTracker.
Correct me if I'm wrong, but shouldn't the aim of this Hadoop-on-Mesos backend be to cater to this scenario of multiple instances?
As the image below shows, the pseudo-distributed operation page is not found when clicked.
Currently there is no style enforcement except by committers after a PR, which is inefficient for everyone. The Maven Checkstyle plugin can help with this. I was thinking of something pretty lax to start, just watching for whitespace and unused imports. Thoughts?
I'm happy to work on this, but it's a thorny issue, so I'd like committer feedback first.
I needed framework authentication, so I've implemented it, along with the ability to specify a framework name (it's easier for ACLs). I will submit a PR after updating the documentation.
This is something I've found a little painful while experimenting with jobs. Given that on a fairly quiet cluster, task trackers are being launched and terminated quite quickly, there's no way of accessing the stdout/stderr logs of a task attempt through the web UI.
I'd be interested to know if anyone else has found this frustrating? I know you can grab the logs via the mesos web UI... though it's a little cumbersome.
Hadoop on Mesos + Spark on Mesos + Oozie
The Hadoop MapReduce job runs properly, and a Spark job makes use of the Mesos Docker containerizer. But when Oozie uses a MapReduce job to schedule that Spark job, the Mesos Docker containerizer cannot run.
I'm not sure about this question; any idea how to resolve this problem?
Build mesos.version: '0.23.1'.
Hadoop cluster mesos.version: '1.0.1'.
Mesos task stderr:
I0213 15:13:47.948076 13351 fetcher.cpp:547] Fetched '/etc/docker.tar.gz' to '/var/lib/mesos/slaves/8a2d57b7-3a9b-478d-8fc3-19e3e573ab6c-S18/frameworks/8a2d57b7-3a9b-478d-8fc3-19e3e573ab6c-1166/executors/driver-20170213151343-124182/runs/a8eb3fa4-1919-4b26-a05f-ad3fdcc32aed/docker.tar.gz'
I0213 15:13:48.161221 13629 exec.cpp:161] Version: 1.0.1
I0213 15:13:48.165864 13634 exec.cpp:413] Executor asked to shutdown
The [Executor registered on agent ] message cannot be found. It seems registration failed, so the Mesos Docker containerizer cannot run.
Mesos system mesos-slave.ERROR:
E0213 15:13:48.616093 9896 slave.cpp:2621] Status update acknowledgement (UUID: 5953d089-e259-4f63-9eb6-ca97f73f10f7) for task Task_Tracker_0 of unknown executor
Currently the Hadoop on Mesos framework only supports the old-style container info protos (used for the External Containerizer). We should also add support for the Docker ContainerInfo.