mesos / hadoop
Hadoop on Mesos
For some use cases and environments it's handy to set the HADOOP_HOME variable, especially if you're using Hadoop Streaming and need to access resources inside the Hadoop source directory.
I have a feeling passing this as an environment variable to the executor should do the trick.
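Until something like that lands, one minimal workaround sketch is to export the variable from hadoop-env.sh on every slave, which the Hadoop scripts source on startup (the install path below is a placeholder, not a known default):

```shell
# conf/hadoop-env.sh on each slave; /opt/hadoop is a placeholder path
export HADOOP_HOME=/opt/hadoop
```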
Hey. I'm trying to run the Hadoop JobTracker in containerised bridged network mode (i.e. the default network mode for Docker). My goal is to launch the JobTracker with Marathon and map the ports randomly to my host system, finding the web UI and JobTracker IPC ports through service discovery.
The hostname of the host (not the Docker container hostname, which is a random hash string) is available through the environment variable $HOSTNAME, and using this at runtime when launching the JobTracker in host mode (that is, with --net=host when issuing docker run) works fine. The script I have for doing this is very similar to this one.
When running in bridged mode, I first tried setting the mapred.job.tracker property to localhost:9001, and likewise for the web UI property. However, this disabled external exposure of the port: I got no contact with the container when running the mapping using docker run -p 8080:50030 -p 9001:9001 ...
Changing mapred.job.tracker to $HOST:9001, where $HOST equals the hostname of the Docker container, enabled me to contact the container, and it seems to work alright. The only bummer is that the hostname in the web UI is a random Docker string, which would be nice to override, but never mind. Everything seems to be working, until I look in Mesos.
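For reference, this is the kind of bridged-mode invocation I mean, with Docker's -h flag added as one possible way to replace the random container hostname with the host's (the image name is a placeholder):

```shell
# map the web UI and IPC ports, and give the container the host's hostname
docker run -p 8080:50030 -p 9001:9001 -h "$HOSTNAME" my-jobtracker-image
```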
In Mesos I see a new Hadoop: (RPC port: 9001, WebUI port: 50030) framework trying to register itself every 3 seconds or so, without success. Enabling more debugging on the client side (using export GLOG_v=2), I see the following output when starting up the tracker:
I0210 15:39:21.288871 881 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:21.288862976+00:00
I0210 15:39:21.288918 881 pid.cpp:87] Attempting to parse '[email protected]:5050' into a PID
I0210 15:39:21.288990 881 sched.cpp:234] New master detected at [email protected]:5050
I0210 15:39:21.289201 881 sched.cpp:242] No credentials provided. Attempting to register without authentication
I0210 15:39:21.289227 881 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:21.289579 891 process.cpp:2692] Resuming zookeeper-master-detector(1)@198.41.200.200:58589 at 2015-02-10 15:39:21.289572096+00:00
15/02/10 15:39:21 INFO util.HostsFileReader: Setting the includes file to
15/02/10 15:39:21 INFO util.HostsFileReader: Setting the excludes file to
15/02/10 15:39:21 INFO util.HostsFileReader: Refreshing hosts (include/exclude) list
15/02/10 15:39:21 INFO mapred.JobTracker: Decommissioning 0 nodes
15/02/10 15:39:21 INFO ipc.Server: IPC Server Responder: starting
15/02/10 15:39:21 INFO ipc.Server: IPC Server listener on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 0 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 1 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 2 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 3 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 5 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 6 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 7 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 4 on 9001: starting
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 8 on 9001: starting
15/02/10 15:39:21 INFO mapred.JobTracker: Starting RUNNING
15/02/10 15:39:21 DEBUG ipc.Server: IPC Server handler 9 on 9001: starting
I0210 15:39:22.289831 882 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:22.289812992+00:00
I0210 15:39:22.289913 882 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:23.290331 892 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:23.290322176+00:00
I0210 15:39:23.290382 892 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:24.290717 881 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:24.290708992+00:00
I0210 15:39:24.290767 881 sched.cpp:481] Sending registration request to [email protected]:5050
I0210 15:39:25.291088 893 process.cpp:2692] Resuming [email protected]:58589 at 2015-02-10 15:39:25.291079168+00:00
I0210 15:39:25.291139 893 sched.cpp:481] Sending registration request to [email protected]:5050
And the "Resuming scheduler", "Sending registration request"... output continues forever.
I0210 10:28:00.393909 31003 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:00.394260 31003 master.cpp:1447] Registering framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:00.394639 30999 hierarchical_allocator_process.hpp:329] Added framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:00.395987 31005 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:00.732815 31008 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:01.071971 31002 hierarchical_allocator_process.hpp:405] Deactivated framework 20140618-174325-1209730570-5050-4637-0002
I0210 10:28:01.394320 30997 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:01.394753 30997 master.cpp:1434] Framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 already registered, resending acknowledgement
I0210 10:28:02.394582 31011 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:02.395097 31011 master.cpp:1434] Framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 already registered, resending acknowledgement
I0210 10:28:03.363574 31000 master.cpp:789] Framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 disconnected
I0210 10:28:03.363788 31000 master.cpp:1752] Disconnecting framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.363852 31000 master.cpp:1768] Deactivating framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.363956 31002 hierarchical_allocator_process.hpp:405] Deactivated framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:03.364524 31000 master.cpp:811] Giving framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973 0ns to failover
I0210 10:28:03.364547 31008 hierarchical_allocator_process.hpp:563] Recovered cpus(*):15.8; mem(*):192135; ports(*):[31000-32000, 8001-9000]; disk(*):1.51388e+06 (total allocatable: cpus(*):15.8; mem(*):192135; ports(*):[31000-32000, 8001-9000]; disk(*):1.51388e+06) on slave 20150204-135039-1176176138-5050-11013-S0 from framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:03.364784 31007 master.cpp:3713] Framework failover timeout, removing framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.364966 31007 master.cpp:4271] Removing framework 20150204-151306-1176176138-5050-30988-1936 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.365542 31007 hierarchical_allocator_process.hpp:360] Removed framework 20150204-151306-1176176138-5050-30988-1936
I0210 10:28:03.394796 31011 master.cpp:1383] Received registration request for framework 'Hadoop: (RPC port: 9001, WebUI port: 50030)' at [email protected]:42973
I0210 10:28:03.395130 31011 master.cpp:1447] Registering framework 20150204-151306-1176176138-5050-30988-1937 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.395417 31005 hierarchical_allocator_process.hpp:329] Added framework 20150204-151306-1176176138-5050-30988-1937
I0210 10:28:03.396935 30996 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1937 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
I0210 10:28:03.583683 31005 http.cpp:478] HTTP request for '/master/state.json'
I0210 10:28:03.737133 30998 master.cpp:3843] Sending 1 offers to framework 20150204-151306-1176176138-5050-30988-1937 (Hadoop: (RPC port: 9001, WebUI port: 50030)) at [email protected]:42973
This, too, continues in a loop like this forever.
My questions: at what point does the framework actually register with the master? When driver.start() is called, or, looking at the Mesos code, perhaps during initialisation?
Sorry for the long wall of text here, but I didn't want to exclude any (perhaps) important details. I'd appreciate any feedback, including feedback that doesn't necessarily give away "the solution". :-)
I'm having an issue running a Hadoop job on a Mesos cluster. I followed the README and was successful in starting the JobTracker and running the wordcount example Hadoop job on the cluster.
However, when I try to launch a larger job (Camus, exporting data from Kafka to HDFS), I see only one TaskTracker started, allocating only 2 map slots (the default configured for a node) and not using any other nodes (the cluster consists of 5 nodes and I requested 30 map tasks in total).
In my setup I use the Cloudera Hadoop distribution version 2.6.0-cdh5.4.2, Mesos version 0.22.1, and the latest mesos-hadoop-mr1-0.1.1 (git commit c972174).
What am I missing? Or is it intended behavior?
Thanks.
Is Hadoop 2.2 supported? Or when do you think you will support it?
Thanks in advance
Matteo
http://www.redaelli.org/matteo/
Is it really necessary to keep extracting the Hadoop distribution every time (referenced by mapred.mesos.executor.uri)? Spark, in my opinion, does this somewhat smarter by referencing (if no URI is given) the current Hadoop distribution's path to the (equivalent of the) TaskTracker. This often holds on clusters with shared storage among the nodes. This would really shorten the startup time. Or do I have a configuration error, given all those extractions I'm seeing?
Besides the extra job slowdown this incurs, it's also very wasteful disk-space-wise.
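For context, the property in question is set in mapred-site.xml along these lines (the HDFS URI is a placeholder):

```xml
<property>
  <name>mapred.mesos.executor.uri</name>
  <value>hdfs://namenode:9000/hadoop-2.5.0-cdh5.2.0.tar.gz</value>
</property>
```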
logs/hadoop-hadoop-jobtracker-e1bc6944193b.log
2016-04-15 01:27:11,987 INFO org.apache.hadoop.mapred.ResourcePolicy: Launching task Task_Tracker_1 on http://10.102.0.7:31597 with mapSlots=2 reduceSlots=1
2016-04-15 01:27:11,987 INFO org.apache.hadoop.mapred.ResourcePolicy: URI: hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz, name: hadoop-2.5.0-cdh5.2.0.tar.gz
2016-04-15 01:27:12,019 INFO org.apache.hadoop.mapred.ResourcePolicy: Unable to fully satisfy needed map/reduce slots: 1 map slots remaining
2016-04-15 01:27:12,407 INFO org.apache.hadoop.mapred.MesosScheduler: Status update of Task_Tracker_1 to TASK_FAILED with message
2016-04-15 01:27:12,407 INFO org.apache.hadoop.mapred.MesosScheduler: Removing terminated TaskTracker: http://10.102.0.7:31597
2016-04-15 01:27:12,989 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
Mesos cluster Sandbox logs
I0415 01:28:02.248667 1514 logging.cpp:172] INFO level logging started!
I0415 01:28:02.249013 1514 fetcher.cpp:409] Fetcher Info: {"cache_directory":"/tmp/mesos/fetch/slaves/20160414-122017-100689418-5050-344-S2/hadoop","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz"}}],"sandbox_directory":"/var/lib/mesos/slaves/20160414-122017-100689418-5050-344-S2/frameworks/20160415-012211-100689418-5050-1018-0000/executors/executor_Task_Tracker_99/runs/aca23ee7-fd55-4c6c-b01a-c3d292fb2ba9","user":"hadoop"}
I0415 01:28:02.253449 1514 fetcher.cpp:364] Fetching URI 'hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz'
I0415 01:28:02.253487 1514 fetcher.cpp:238] Fetching directly into the sandbox directory
I0415 01:28:02.253516 1514 fetcher.cpp:176] Fetching URI 'hdfs://10.102.0.6:9000/hadoop-2.5.0-cdh5.2.0.tar.gz'
mesos-fetcher: ../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90: const string& Try::error() const [with T = bool; std::string = std::basic_string]: Assertion `data.isNone()' failed.
*** Aborted at 1460683682 (unix time) try "date -d @1460683682" if you are using GNU date ***
PC: @ 0x7f90ffe5ccc9 (unknown)
*** SIGABRT (@0x5ea) received by PID 1514 (TID 0x7f9105b8d7c0) from PID 1514; stack trace: ***
@ 0x7f91001fb340 (unknown)
@ 0x7f90ffe5ccc9 (unknown)
@ 0x7f90ffe600d8 (unknown)
@ 0x7f90ffe55b86 (unknown)
@ 0x7f90ffe55c32 (unknown)
@ 0x460d43 Try<>::error()
@ 0x450f8d downloadWithHadoopClient()
@ 0x451f7c download()
@ 0x452347 fetchBypassingCache()
@ 0x45315f fetch()
@ 0x4539c5 main
@ 0x7f90ffe47ec5 (unknown)
@ 0x450459 (unknown)
Aborted (core dumped)
Failed to synchronize with slave (it's probably exited)
In some situations, when using this framework on a very small Mesos cluster (enough for only a few tasks), only map slots will be allocated and zero reduce slots. This can cause a deadlock. Although the effect should be reduced by the introduction of #33 (once the map slots become idle, reduce slots would take over), it can still occur in some cases with certain resource allocations.
I guess we should do a better job of allocating a decent map/reduce slot ratio.
Any guide on deploying the CDH5.0.2 tarball with MRv1?
My plan is to deploy pure Hadoop (without Mesos) first, then connect Hadoop with Mesos, but it seems hard to deploy the CDH5.0.2 tarball with MRv1. Any guide on this?
Thanks
Kind of obvious, but this is something I misconfigured, and it took me a really long time to get to the bottom of it. I just assumed that if I configured it to zero, no maximum would apply (which was the desired behaviour).
Perhaps this should be documented, or an exception should be thrown to highlight the configuration error. Alternatively, make zero mean no maximum?
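To make the suggestion concrete, here is a minimal sketch (not the framework's actual code; the semantics are just the ones proposed above) of treating zero as "no maximum" while rejecting clearly invalid values:

```python
# Sketch: interpret a configured slot maximum, making the zero case explicit
# instead of silently applying a maximum of zero slots.

def effective_max_slots(configured: int) -> float:
    """Return the effective slot cap; zero is taken to mean "no maximum"."""
    if configured < 0:
        raise ValueError("slot maximum must be >= 0")
    if configured == 0:
        return float("inf")  # the "zero means unlimited" behaviour desired above
    return configured

print(effective_max_slots(0))   # inf
print(effective_max_slots(50))  # 50
```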
What's the reason you might want to limit the number of slots per TT?
I ran the wordcount demo using hadoop-2.5.0-cdh5.3.2 (hadoop-mapreduce1-project) + mesos-0.23.0-rc4, and found "launched but no heartbeat yet" in the JobTracker log all the time.
In the task's stdout:
CPLUS_INCLUDE_PATH=/opt/lib/boost_1_58_0
MANPATH=/opt/lib/mvapich2.2/share/man:/opt/compiler/gcc-4.8.2/man:/usr/share/man
HOSTNAME=dn-137-211
...
HISTSIZE=1000
HADOOP_HOME=/home/hadoop/hadoop2
HADOOP_DEV_HOME=/home/hadoop/hadoop2
LIBRARY_PATH=/opt/compiler/gcc-4.8.2/lib64:/opt/lib/cuda-6.5/lib
MESOS_DIRECTORY=/home/mesos/slave/slaves/20150918-160900-3549014208-5050-14584-S0/frameworks/20150923-170913-3549014208-5050-6111-0008/executors/executor_Task_Tracker_0/runs/39ece48d-1f91-446b-ae49-72a0a6f66346
FPATH=/opt/lib/mvapich2.2/include
OLDPWD=/home/mesos/slave/slaves/20150918-160900-3549014208-5050-14584-S0/frameworks/20150923-170913-3549014208-5050-6111-0008/executors/executor_Task_Tracker_0/runs/39ece48d-1f91-446b-ae49-72a0a6f66346
SSH_TTY=/dev/pts/3
LC_ALL=C
USER=root
.....
LD_LIBRARY_PATH=/opt/lib/mvapich2.2/lib:/opt/lib/mvapich2.2/lib/shared:/opt/lib/liblmdb-0.9/lib:/opt/lib/protobuf-2.5/lib:/opt/lib/gflag-1.4.0/lib:/opt/lib/glog-0.3.3/lib:/opt/lib/boost_1_58_0/stage/lib:/opt/lib/opencv-2.4.9/lib:/opt/lib/log4cplus-1.2.0-rc3/lib:/opt/tool/intel/lib/intel64:/opt/tool/intel/mkl/lib/intel64:/opt/compiler/gcc-4.8.2/lib64:/opt/lib/cuda-6.5/lib64
MESOS_EXECUTOR_ID=executor_Task_Tracker_0
CPATH=/opt/lib/mvapich2.2/include:/opt/lib/liblmdb-0.9/include:/opt/lib/protobuf-2.5/include:/opt/lib/gflag-1.4.0/include:/opt/lib/glog-0.3.3/include:/opt/lib/opencv-2.4.9/include:/opt/lib/log4cplus-1.2.0-rc3/include:/opt/lib/cuda-6.5/include
HADOOP_MAPARED_HOME=/home/hadoop/hadoop2
PATH=/home/hadoop/spark/bin:/home/hadoop/spark/sbin:/home/hadoop/spark/lib:/opt/scheduler/mesos-0.23.0-rc4/libexec/mesos:/opt/scheduler/mesos-0.23.0-rc4/libexec:/opt/scheduler/mesos-0.23.0-rc4/bin:/opt/scheduler/mesos-0.23.0-rc4/sbin:/opt/scheduler/mesos-0.23.0-rc4/lib:/home/hadoop/hadoop2/sbin:/home/hadoop/hadoop2/bin:/opt/tool/git-2.4.5/bin:/opt/tool/git-2.4.5/bin:....6.5/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/root/bin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin:/home/hadoop/pig/bin:/opt/scheduler/mesos/bin:/opt/scheduler/mesos/sbin
HDFS_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
MESOS_HOME=/home/mesos
HADOOP_HDFS_HOME=/home/hadoop/hadoop2
PWD=/home/mesos/slave/slaves/20150918-160900-3549014208-5050-14584-S0/frameworks/20150923-170913-3549014208-5050-6111-0008/executors/executor_Task_Tracker_0/runs/39ece48d-1f91-446b-ae49-72a0a6f66346/hadoop-2.5.0-cdh5.3.2
HADOOP_COMMON_HOME=/home/hadoop/hadoop2
F90=gfortran
MESOS_NATIVE_JAVA_LIBRARY=/opt/scheduler/mesos/lib/libmesos-0.23.0.so
JAVA_HOME=/opt/lib/jdk
MESOS_NATIVE_LIBRARY=/opt/scheduler/mesos/lib/libmesos.so
HADOOP_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
HADOOP_OPTS=-Xmx4096m -XX:NewSize=1365m -XX:MaxNewSize=2457m
MESOS_SLAVE_PID=slave(1)@192.168.137.211:5051
MESOS_FRAMEWORK_ID=20150923-170913-3549014208-5050-6111-0008
MESOS_PATH=/opt/scheduler/mesos-0.23.0-rc4
MESOS_CHECKPOINT=0
SHLVL=2
HOME=/root
LIBPROCESS_PORT=0
YARN_CONF_DIR=/home/hadoop/hadoop2/etc/hadoop
MESOS_SLAVE_ID=20150918-160900-3549014208-5050-14584-S0
MODULESHOME=/usr/share/Modules
HADOOP_BIN=/home/hadoop/hadoop2/bin
...
_=/bin/env
Error occurred during initialization of VM
Too small initial heap for new size specified
Hi,
I can't get my cluster of 80 CPUs and 200GB+ of mem to allocate the last 8.5 CPUs.
In the logging I can see this repeatedly:
15/03/09 11:56:12 INFO mapred.ResourcePolicy: Declining offer with insufficient resources for a TaskTracker:
cpus: offered 0.8499999940395355 needed at least 0.15000000596046448
mem : offered 20182.0 needed at least 368.0
disk: offered 1859053.0 needed at least 0.0
ports: at least 2 (sufficient)
I'm not sure why hadoop/mesos is declining; every resource demand appears to have been met.
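One possible explanation, sketched with hypothetical per-slot constants (not the framework's real numbers): the fraction of a CPU left on a node can exceed the logged "needed at least" overhead and still be too small to fit a single whole slot, in which case the offer is declined:

```python
# Sketch: an offer must cover the TaskTracker overhead AND at least one
# whole slot. Per-slot and overhead constants here are hypothetical.

def slots_fitting(offer_cpus, offer_mem,
                  cpus_per_slot=1.0, mem_per_slot=1024.0,
                  tt_cpus=0.15, tt_mem=368.0):
    """How many whole slots fit after reserving TaskTracker overhead?"""
    spare_cpus = offer_cpus - tt_cpus
    spare_mem = offer_mem - tt_mem
    if spare_cpus < 0 or spare_mem < 0:
        return 0
    return int(min(spare_cpus // cpus_per_slot, spare_mem // mem_per_slot))

# The offer from the log: ~0.85 CPUs spare on the node.
print(slots_fitting(0.8499999940395355, 20182.0))  # 0 -> declined
```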
I've not got this up and running yet, so I'm just thinking it through. It'd be great (and a requirement for me) to support HA JobTrackers. I think there are a couple of small issues that currently prevent that, though please correct me if I'm wrong. Is anyone else successfully running HA JobTrackers on Mesos?
The framework sets the mapred.job.tracker option to a host:port combo. This isn't how HA JobTrackers work: you list the name:host:port combos elsewhere in the configuration and reference them with an alias here. Given the way the configuration is passed to the TaskTrackers, the only issue I can find here is #27. I can confirm that with this fix applied I can at least get TTs to launch and run with an HA JobTracker config.
Looking forward to anyone's thoughts...
So this is an interesting issue. I've seen (several times now) situations where the Hadoop scheduler will get itself into a deadlock with running jobs. Here's how it goes.
Cluster: Some number of mesos slaves, let's say the resources equate to 100 slots. The underlying scheduler here is the hadoop FairScheduler, not the FIFO one.
At this point, all the cluster resources are being given to the running TaskTrackers. Those resources will not be released until the running job completes, but that job is waiting for some reducers to launch. This is a deadlock: the job will never complete because the TaskTrackers are never released, and vice versa.
I'm wondering if you can suggest anything here @brndnmtthws @florianleibert @benh?
This problem fits quite well with the Task/Executor relationship. In this example I need to keep the executors alive (so they can stream data to the reducers for the shuffle/sort), but I need to free up the "slots", i.e. the task resources. Perhaps the framework could terminate the Task that holds the resources for the slots independently of the TaskTracker itself, and then internally mark that TaskTracker as "going to be killed soon".
We would have to maintain some state internally, because it is not possible to reduce the number of slots on a TaskTracker while it is running, so the hadoop/mesos scheduler needs to proactively not schedule tasks there. Though I don't think this is too complicated to do.
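The "going to be killed soon" bookkeeping could look roughly like this (a Python sketch; class and method names are illustrative, not the actual scheduler's):

```python
# Sketch: mark a TaskTracker as draining so no new tasks are assigned to it,
# while its executor stays alive to serve map output for the shuffle.

class TrackerState:
    def __init__(self, slots):
        self.slots = slots      # slot resources that could be given back
        self.draining = False   # "going to be killed soon"

class SchedulerSketch:
    def __init__(self):
        self.trackers = {}

    def mark_draining(self, host):
        # Here the Task holding the slot resources would be terminated,
        # returning those resources to Mesos; the executor lives on.
        self.trackers[host].draining = True

    def schedulable(self, host):
        return not self.trackers[host].draining

s = SchedulerSketch()
s.trackers["tt1"] = TrackerState(slots=2)
s.mark_draining("tt1")
print(s.schedulable("tt1"))  # False
```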
Output from jStack:
Attaching to process ID 22531, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.45-b02
Deadlock Detection:
Found one Java-level deadlock:
"IPC Server handler 4 on 7676":
waiting to lock Monitor@0x00007f01ec11b858 (Object@0x00000000831924a0, a org/apache/hadoop/mapred/MesosScheduler),
which is held by "pool-1-thread-1"
"pool-1-thread-1":
waiting to lock Monitor@0x00007f01ec32f9d8 (Object@0x00000000830f4310, a org/apache/hadoop/mapred/JobTracker),
which is held by "IPC Server handler 4 on 7676"
Found a total of 1 deadlock.
With the recent enhancements that landed related to freeing up some resources when a TaskTracker becomes idle, Hadoop is a little less greedy about holding onto cluster resources it's not actually using. However, because this is based on the whole TaskTracker being idle, we don't get the best chance of freeing resources when TTs have mixed slots, both map and reduce.
We should launch separate TTs for map and reduce slots. To do this effectively, we probably want to bunch as many map or reduce slots onto each node as possible, as opposed to the current logic, which is to apply the map/reduce slot ratio to each incoming offer. Take the following example...
1 Slot = 1 CPU and 1GB RAM
Offers:
Pending tasks:
Current result:
Ideal Result:
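The contrast between the current per-offer split and the proposed bunching can be sketched like this (Python; the offers and slot counts are hypothetical):

```python
# Sketch: two allocation policies for map/reduce slots across offers.
# 1 slot = 1 CPU and 1 GB RAM, as above; the numbers are made up.

def per_offer_ratio(offers, maps, reduces):
    """Current logic: apply the map/reduce ratio to each incoming offer."""
    total = maps + reduces
    return [{"map": round(s * maps / total),
             "reduce": s - round(s * maps / total)} for s in offers]

def bunched(offers, maps, reduces):
    """Proposed logic: fill each node with one slot type before moving on."""
    plan = []
    for s in offers:
        m = min(s, maps)
        r = min(s - m, reduces)
        maps, reduces = maps - m, reduces - r
        plan.append({"map": m, "reduce": r})
    return plan

offers = [4, 4]  # two nodes, 4 slots each
print(per_offer_ratio(offers, maps=4, reduces=4))  # every TT mixed: 2 map + 2 reduce
print(bunched(offers, maps=4, reduces=4))          # maps on one node, reduces on the other
```

With bunching, a whole TT frees up as soon as one phase ends, instead of every mixed TT staying half-busy.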
This is something I've been thinking about for a while, and I'd be interested whether anyone else (@brndnmtthws @florianleibert?) has any contributions to the idea.
Currently the Hadoop on Mesos framework treats every job equally: each requires equal CPU and equal memory, and runs in the same TaskTracker (Mesos task) environment. This is not actually always the case, and making the framework more intelligent could reap some great benefits...
Of course, spinning up a Hadoop TT for every job might be a little excessive, so the scheduler could be more intelligent and bucket types of jobs to types of task trackers. The Job.Task->TaskTracker assignment would need to change too, I guess.
In doing this the framework starts to become on par with YARN, or even more efficient, as we're able to share TTs between jobs that can share. As far as I'm aware, YARN will launch a little JT and TTs for each job you submit? I'm probably wrong, though.
The third point (roles) is the one I'm most interested in seeing first.
(Perhaps something for the #MesosCon hackathon 😄)
With Mesos 0.27.0, the class org.apache.mesos.Protos.CommandInfo.ContainerInfo is removed (it was deprecated before 0.27.0).
As a result, the mesos-hadoop build fails with Mesos versions >= 0.27. A mesos-hadoop code update is required to match the new Mesos version.
Thank you very much
In kill tasks, lines 128-138: scheduleSuicideTimer doesn't block, so the thread creating the TASK_FINISHED message possibly (and in some cases often) finishes before the executor kills the task. This can leave processes open and cause offers to go out to new tasks before they're available, leading to crashes due to port conflicts etc.
I think the solution is to put a blocking version of scheduleSuicideTimer() in the thread that builds the TASK_FINISHED message.
Thoughts?
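The proposed ordering can be sketched with a blocking handshake (a Python stand-in for the Java executor; the names and the Event-based wait are illustrative, not the real API):

```python
# Sketch: block until the kill completes before building TASK_FINISHED,
# so ports are actually free before new offers can reuse them.
import threading

task_killed = threading.Event()
log = []

def suicide_timer():
    log.append("killing task / freeing ports")
    task_killed.set()

def finish_task():
    threading.Thread(target=suicide_timer).start()
    task_killed.wait(timeout=5)       # blocking, instead of racing ahead
    log.append("sending TASK_FINISHED")

finish_task()
print(log)  # ['killing task / freeing ports', 'sending TASK_FINISHED']
```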
Even after copying hadoop-mesos-0.1.0.jar into the share/hadoop/common/lib folder, I get this error. I can see that the JobTracker is loading this lib; it appears in the JobTracker message "STARTUP_MSG: classpath = ...".
I am using Ubuntu 18.04 with OpenJDK 1.8.0_252.
I'm trying to run this project on an older MapReduce, 1.0.x. If it were up to me I wouldn't be using it, but I have to. I've already successfully set this up under the recommended 2.5.0 version and had no problems at all, so this is probably something specific to 1.0.x.
Trying to get this to work eventually leads me to the following error/log on a Mesos executor:
cat /local/vdbogert/var/lib/mesos/slaves/20150420-104917-33592586-5050-17145-S0/frameworks/20150424-154625-234919178-5050-2360-0000/executors/executor_Task_Tracker_97/runs/latest/stderr
I0424 15:47:49.596091 9570 exec.cpp:132] Version: 0.21.0
I0424 15:47:49.610085 9586 exec.cpp:206] Executor registered on slave 20150420-104917-33592586-5050-17145-S0
15/04/24 15:47:49 INFO mapred.MesosExecutor: Executor registered with the slave
15/04/24 15:47:49 INFO mapred.MesosExecutor: Launching task : Task_Tracker_97
java.lang.NoSuchMethodError: org.apache.hadoop.mapred.JobConf.writeXml(Ljava/io/Writer;)V
at org.apache.hadoop.mapred.MesosExecutor.configure(MesosExecutor.java:48)
at org.apache.hadoop.mapred.MesosExecutor.launchTask(MesosExecutor.java:80)
Exception in thread "Thread-1" I0424 15:47:49.731550 9586 exec.cpp:413] Deactivating the executor libprocess
Has anybody got this to work under older Hadoop distros? If not, can someone estimate how much work it would be to solve problems like the above?
And as a last question, how would I solve the above error?
Thanks
My requirement is to configure Hadoop on Mesos, where Mesos and Hadoop will be installed on different servers. I have the below queries on that.
I am a storage guy, and Hadoop/Mesos are totally new to me. Any help with basic information would be really appreciated.
Thanks.
Hello!
I am using Maven 3.3.3, and when I followed the first few instructions in the README, I got this result:
me@server# mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building mesos-hadoop-mr1 0.1.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://clojars.org/repo/org/clojars/brenden/metrics-cassandra/3.1.0/metrics-cassandra-3.1.0.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.926 s
[INFO] Finished at: 2015-08-05T12:34:48-06:00
[INFO] Final Memory: 12M/239M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project mesos-hadoop-mr1: Could not resolve dependencies for project com.github.mesos:mesos-hadoop-mr1:jar:0.1.1-SNAPSHOT: Failed to collect dependencies at org.clojars.brenden:metrics-cassandra:jar:3.1.0: Failed to read artifact descriptor for org.clojars.brenden:metrics-cassandra:jar:3.1.0: Could not transfer artifact org.clojars.brenden:metrics-cassandra:pom:3.1.0 from/to clojars.org (http://clojars.org/repo): Access denied to: http://clojars.org/repo/org/clojars/brenden/metrics-cassandra/3.1.0/metrics-cassandra-3.1.0.pom , ReasonPhrase:Forbidden. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
My machine is behind a proxy, but the proxy settings are correctly configured and Maven is able to download other dependencies. This one in particular fails, even though I can access it in lynx and can download the pom at that URL using wget.
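One thing that may be worth trying (a guess, not a confirmed fix): some proxies and repositories reject plain-http artifact downloads, and a Maven mirror entry in ~/.m2/settings.xml can redirect the http repo to https. The mirrorOf value below assumes the pom's repository id is clojars.org:

```xml
<!-- ~/.m2/settings.xml: redirect the plain-http clojars repo to https -->
<settings>
  <mirrors>
    <mirror>
      <id>clojars-https</id>
      <mirrorOf>clojars.org</mirrorOf>
      <url>https://clojars.org/repo</url>
    </mirror>
  </mirrors>
</settings>
```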
One of Mesos's promises was being able to run multiple instances of one type of framework, like Hadoop in this case. The original Mesos paper showed two concurrent Hadoop instances.
I've tried to set up an experiment where I have two JobTrackers sharing one and the same Mesos backend, but if one JobTracker's queue is full enough, it takes all the resources and keeps them indefinitely, starving the other JobTracker.
Correct me if I'm wrong, but shouldn't the aim of this Hadoop-on-Mesos backend be to cater to this scenario of multiple instances?
As the image below shows, the pseudo-distributed operation page is not found when clicked.
Currently there is no style enforcement except by committers after a PR, which is inefficient for everyone. The Maven Checkstyle plugin can help with this. I was thinking of something pretty lax to start, just watching for whitespace and unused imports. Thoughts?
I'm happy to work on this, but it's a thorny issue, so I'd like committer feedback first.
I needed framework authentication, so I've implemented it, along with the ability to specify a framework name (it's easier for ACLs). I will submit a PR after updating the documentation.
This is something I've found a little painful while experimenting with jobs. Given that on a fairly quiet cluster, task trackers are being launched and terminated quite quickly, there's no way of accessing the stdout/stderr logs of a task attempt through the web UI.
I'd be interested to know if anyone else has found this frustrating? I know you can grab the logs via the mesos web UI... though it's a little cumbersome.
Hadoop on Mesos + Spark on Mesos + Oozie
The Hadoop MapReduce job runs properly, and a Spark job makes use of the Mesos Docker containerizer. But when Oozie uses a MapReduce job to schedule that Spark job, the Mesos Docker containerizer cannot run.
I'm not sure about this question; any idea how to resolve this problem?
Build mesos.version: '0.23.1'.
Hadoop cluster mesos.version: '1.0.1'.
Mesos task stderr:
I0213 15:13:47.948076 13351 fetcher.cpp:547] Fetched '/etc/docker.tar.gz' to '/var/lib/mesos/slaves/8a2d57b7-3a9b-478d-8fc3-19e3e573ab6c-S18/frameworks/8a2d57b7-3a9b-478d-8fc3-19e3e573ab6c-1166/executors/driver-20170213151343-124182/runs/a8eb3fa4-1919-4b26-a05f-ad3fdcc32aed/docker.tar.gz'
I0213 15:13:48.161221 13629 exec.cpp:161] Version: 1.0.1
I0213 15:13:48.165864 13634 exec.cpp:413] Executor asked to shutdown
The [Executor registered on agent ] message cannot be found. It seems registration failed, so the Mesos Docker containerizer cannot run.
Mesos system mesos-slave.ERROR:
E0213 15:13:48.616093 9896 slave.cpp:2621] Status update acknowledgement (UUID: 5953d089-e259-4f63-9eb6-ca97f73f10f7) for task Task_Tracker_0 of unknown executor
Currently the Hadoop on Mesos framework only supports the old-style container info protos (used for the External Containerizer). We should also add support for the Docker ContainerInfo.