
kafka's Introduction

Apache Mesos Repository Has Moved

Apache Mesos is now a Top-Level Apache project, and we've moved the codebase. The downloads page explains the essential information, but here's the scoop:

Please check out the source code from Apache's git repository:

git clone https://git-wip-us.apache.org/repos/asf/mesos.git

or if you prefer GitHub, use the GitHub mirror:

git clone git://github.com/apache/mesos.git

For issue tracking and patches, we use Apache-maintained infrastructure: the JIRA issue tracker instead of GitHub issues, and Review Board for patches instead of pull requests.

Other information, including documentation and a getting started guide, is available on the Mesos website: http://mesos.apache.org

Thanks!

-- The Mesos developers

kafka's People

Contributors

abiletskyi, aglahe, brndnmtthws, ckorobov, dallasmarlow, dharmeshkakadia, dmitrypekar, edgefox, frankscholten, fuji-151a, hpeduartepe, hylke1982, jacum, joestein, mccraigmccraig, mindscratch, muxator, nnsatyakarthik, olegkovalenko, plaflamme, sargun, serejja, shangd, sherzberg, steveniemitz, tnachen, tobilg


kafka's Issues

Document LIBPROCESS_PORT

Perhaps this is fairly common knowledge, but could the fact that you can set LIBPROCESS_PORT to control the port used to communicate with Mesos be documented? It took me a while to track down a communication issue because I didn't realize this port was being opened. I only tried it because one of my coworkers suggested it, since we had to do the same thing for Chronos.
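For reference, a minimal example of what the documentation could show, assuming the scheduler is started from a shell (the port value is illustrative):

LIBPROCESS_PORT=9898 ./kafka-mesos.sh scheduler ...

This pins the port that libprocess binds to for scheduler/master communication, so it can be opened in a firewall instead of being chosen at random.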

Support offers containing several resources with the same name (cpus, mem, etc) and different roles

If roles are not used, a Mesos offer contains a single resource for each name (cpus, mem, etc.). But if roles are used, it can contain several resources with the same name:

master#-O574 cpus:0.30 cpus(kafka):0.70 mem:50.00 mem(kafka):100.00 disk:35164.00 ports:[31000..32000]

This should be supported in:

  1. ly.stealth.mesos.kafka.Broker#matches
  2. org.apache.mesos.Protos.TaskInfo.Builder#addResources(org.apache.mesos.Protos.Resource.Builder)
    as mentioned in #92
  3. ly.stealth.mesos.kafka.Util.Str#resources

The unit test for Broker.matches should be modified to cover this case.
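A rough sketch of how matching could total scalar resources across roles, assuming the Mesos Java protobuf API (the object and method names here are hypothetical):

import scala.collection.JavaConverters._
import org.apache.mesos.Protos.{Offer, Value}

object OfferResources {
  // Sum every scalar resource with the given name regardless of role,
  // so "cpus:0.30" plus "cpus(kafka):0.70" counts as 1.0 cpus in total.
  def totalScalar(offer: Offer, name: String): Double =
    offer.getResourcesList.asScala
      .filter(r => r.getName == name && r.getType == Value.Type.SCALAR)
      .map(_.getScalar.getValue)
      .sum
}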

broker placement stickiness for slave failure

We need a way so that if a slave dies (e.g. kernel panic) and comes back within a configurable time window, the broker is placed back on the same slave. If the window has passed (meaning the operator has deemed the failure long enough), the broker should be scheduled on another slave. We should also clean up the README for the existing failover timeouts, clarifying that they apply to broker failures and to repeated launch failures on a slave before moving to another.

Defaults for broker memory cause OOM

The default heap size for brokers and the default container memory limit are both 128M. Brokers get OOM-killed with these settings because the JVM uses memory beyond the heap.

Saner defaults would be 1G for the heap and 2G for the memory limit.
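Until the defaults change, a hedged per-broker workaround, assuming the --mem and --heap broker options from the README (the values are illustrative):

./kafka-mesos.sh update 0 --mem 2048 --heap 1024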

Support Multiple Kafka clusters

So one approach, I guess, would be to run multiple instances of this framework against different znodes in zk (for HA mode), but I'd really like to be able to manage multiple Kafka clusters with this.

We have several teams internally that use kafka, and have totally different configuration requirements. Some care more about replication and durability, some care more about performance, some have a bazillion topics and stripe over them with little to no replication.

It would be really nice to run a single instance of this framework to manage all of them.

Kafka command script should conform with DCOS cli style

Currently the flags are defined as:
Usage: {help {command}|status|add|update|remove|start|stop|rebalance}

Since the DCOS kafka CLI is calling the jar directly, we want the syntax to conform with the other services, such as:

dcos kafka --help:

dcos kafka status
dcos kafka add
dcos kafka update
dcos kafka remove
dcos kafka start
dcos kafka stop
dcos kafka rebalance

And then each subcommand should support its individual help command
dcos kafka status --help

dcos kafka status <options.....>
This returns the status of your kafka brokers, etc....

Ability to specify static service ports for brokers

Since Kafka has moved to an approach where clients don't use ZooKeeper to find brokers (which is helpful for Mesos, as brokers land on random Mesos-assigned ports), should the scheduler use Marathon or a Marathon-like setup so we can specify service ports per broker? With service ports and something like Mesos-DNS, we'd have a static broker list we could use with clients. Since clients may not be aware of Mesos-DNS, we can't use it for ports (just names), and since we don't know which ports a broker will land on, the list can change. Right now I "start" some of my Kafka clients by rewriting their configs each time, which is not scalable and is error-prone.

broker state shows active but brokers not started

This should be reproducible if trying to start a broker when the scheduler isn't getting offers. The broker should not be active.

  1. We need at least one more state, perhaps "staging". We should also differentiate why a broker is not launching: a) there are offers, but they don't satisfy what is required (it would be good to see this delta too; sometimes 7.2 CPUs might be just as good as 8), or b) the scheduler isn't getting any offers at all. With one state for each case, we will know where the scheduler is with each broker.

  2. We should have metrics around each state. It is going to be important to monitor this over time. This is somewhat tied to #77

Lost scheduler/Broker

  1. When the slave machine running the scheduler is taken out of the cluster or stops responding, the scheduler is not restarted by Marathon and hangs indefinitely (screenshots attached). Destroy/scale up/down does not work either.
  2. When the slave machine running a broker is taken out of the cluster or stops responding, I see a "Broker Lost" message, but the broker is neither rescheduled on another machine nor stopped; the scheduler still shows the broker as active.

Kafka Broker should use Mesos Slave's hostname as host.name and advertised.host.name?

I hit an issue producing messages on the Kafka broker cluster on Mesos, after I started my scheduler process using Marathon and created Kafka broker clusters using the REST API.

The issue is like:
Error while fetching metadata [{TopicMetadata for topic mytopic ->
No partition metadata for topic mytopic due to kafka.common.LeaderNotAvailableException}] for topic [mytopic]: class kafka.common.LeaderNotAvailableException (kafka.producer.BrokerPartitionInfo).

After debugging it, I realized the problem is that the Kafka broker starts with the Mesos slave VM's hostname, not the Mesos slave's hostname (which is changed to an IP address in my environment).

I made the following change to work around the issue:

diff --git a/src/scala/ly/stealth/mesos/kafka/Scheduler.scala b/src/scala/ly/stealth/mesos/kafka/Scheduler.scala
index b34829e..d9eec9f 100644
--- a/src/scala/ly/stealth/mesos/kafka/Scheduler.scala
+++ b/src/scala/ly/stealth/mesos/kafka/Scheduler.scala
@@ -61,7 +61,9 @@ object Scheduler extends org.apache.mesos.Scheduler {
     val overrides: Map[String, String] = Map(
       "broker.id" -> broker.id,
       "port" -> ("" + port),
-      "zookeeper.connect" -> Config.kafkaZkConnect
+      "zookeeper.connect" -> Config.kafkaZkConnect,
+      "host.name" -> offer.getHostname,
+      "advertised.host.name" -> offer.getHostname
     )

     val options = Util.formatMap(broker.effectiveOptions(overrides))

Any comments? Thanks.

Yang.

Kafka brokers are getting lost

I'm trying to run the Kafka scheduler within Marathon with the following job description. I was able to get the scheduler working, but the brokers come up and then keep getting lost (as seen in the scheduler log and the Mesos dashboard). What am I doing wrong here? Please note that in my cluster, the mesos-master node will never run the mesos-slave role.

{
  "id": "/platform/kafka",
  "instances": 1, 
  "cpus": 2,
  "mem": 2048,
  "ports": [8000],
  "container": {
        "type": "DOCKER",
        "docker": {
            "image": "private-registry/kafka-mesos:201508032203",
            "forcePullImage": true,
            "network": "HOST"
        }
  },  
  "healthChecks": [
    {"path": "/api", "protocol": "HTTP"}
  ],
 "env": {
        "MESOS_NATIVE_JAVA_LIBRARY": "/usr/local/lib/libmesos.so"
  },
  "cmd": "cd /kafka-mesos && ./kafka-mesos.sh scheduler --storage zk:/mesos-kafka --master zk://<zk-node>:2181/mesos --zk <zk-node>:2181 --api http://$HOSTNAME:8000"
}

Also, the instructions [here](https://github.com/mesos/kafka/tree/master/src/docker#running-image-in-marathon) suggest setting the api as master:7000, but I cannot do that since I don't have Mesos running as a slave on the master node. Hence, I had to use $HOSTNAME to get the slave this will be scheduled on.

I'd really appreciate it if someone could help fix this issue.

Provide simple WEB UI

I think it would be very convenient for most users to have a web UI (together with the CLI).
A simple web UI could be embedded inside HttpServer.

It could be a single-page-app, containing just 2 tabs:

  • brokers - allows managing brokers (list|add|update|remove|start|stop)
  • topics - allows managing topics (list|add|update|rebalance)

I know that this would require some additional effort, but it should not be very big and, imho, a basic version could be implemented and delivered in 5-10 days. We could start with just the brokers tab, for instance.

There is no way to set the JMX port

This is sometimes desired. Sometimes it should be dynamic and sometimes static.

So please keep the existing functionality, allow dynamic assignment of the port, and add an override for static assignment. JMX_PORT is set prior to Kafka starting up (e.g. JMX_PORT=<port> ./kafka-server....)
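For example (the port value is illustrative; Kafka's start scripts pick JMX_PORT up from the environment):

JMX_PORT=9999 ./bin/kafka-server-start.sh config/server.properties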

Lost broker attempts to restart on a node with an existing broker on it.

Hello,

I ran into an interesting situation today doing some testing where I killed a broker, expecting it to restart. However, for a number of attempts it tried to restart on a node that already had a running broker on it. This behaviour seems counter-intuitive, since all of the brokers use the same configuration for where the data directory lives; knowing that, my thought was that it should be restarted on a node that isn't already running a broker.

Is there a way to prevent this from occurring? I didn't see anything in the docs, but there's always the chance I missed something.

Add CLI for topics as we already have REST API part

After merge of #95 we have REST API for managing topics.
It would be consistent to add CLI for that part of REST API.

  1. The proposed CLI layout is as follows:
./kafka-mesos.sh topics list ...
./kafka-mesos.sh topics add | update ...

2. Also, for the purpose of consistency, I suppose it would be good to modify the layout of the broker-related commands.

For now we have the following broker-related commands:

./kafka-mesos.sh status
./kafka-mesos.sh add | update | remove ...
./kafka-mesos.sh start | stop ...

It would be more consistent if those commands become:

./kafka-mesos.sh brokers list ...
./kafka-mesos.sh brokers add | update | remove ...
./kafka-mesos.sh brokers start | stop

The second point should be negotiated, since it will break CLI compatibility, but, imho, it would still be easy for clients to adapt.

Launch kafka with docker

Currently the README requires the user to download the Kafka jar to run it, but ideally we should support specifying a Docker image to launch Kafka, along with the Docker options that make sense for Kafka.

if the scheduler stops, don't de-register it as a framework

This happens:

2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-0-63.us-west-2.compute.internal
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Thu Mar 26 10:44:46 UTC 2015
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@733: Client environment:user.name=core
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@741: Client environment:user.home=/home/core
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/core/kafka
2015-04-15 01:44:57,618:4305(0x7f625ebc0700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.0.0.63:2181 sessionTimeout=10000 watcher=0x7f62666bb5d0 sessionId=0 sessionPasswd= context=0x7f6240003d60 flags=0
I0415 01:44:57.618618 4306 sched.cpp:157] Version: 0.23.0
2015-04-15 01:44:57,621:4305(0x7f622affd700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.0.63:2181]
2015-04-15 01:44:57,622:4305(0x7f622affd700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.0.63:2181], sessionId=0x14cba649d870011, negotiated timeout=10000
I0415 01:44:57.622889 4335 group.cpp:313] Group process (group(1)@10.0.0.63:36628) connected to ZooKeeper
I0415 01:44:57.622913 4335 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0415 01:44:57.622921 4335 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0415 01:44:57.623774 4335 detector.cpp:138] Detected a new leader: (id='0')
I0415 01:44:57.623872 4330 group.cpp:659] Trying to get '/mesos/info_0000000000' in ZooKeeper
I0415 01:44:57.624270 4330 detector.cpp:452] A new leading master ([email protected]:5050) is detected
I0415 01:44:57.624322 4330 sched.cpp:254] New master detected at [email protected]:5050
I0415 01:44:57.624477 4330 sched.cpp:264] No credentials provided. Attempting to register without authentication
I0415 01:44:57.625386 4333 sched.cpp:815] Got error 'Completed framework attempted to re-register'
I0415 01:44:57.625399 4333 sched.cpp:1623] Asked to abort the driver
I0415 01:44:57.626027 4333 sched.cpp:856] Aborting framework '20150415-000411-1056964618-5050-1035-0001'
I0415 01:44:57.626668 4341 sched.cpp:1589] Asked to stop the driver
I0415 01:44:57.626737 4336 sched.cpp:831] Stopping framework '20150415-000411-1056964618-5050-1035-0001'
2015-04-15 01:44:57,540 [main] INFO org.eclipse.jetty.server.Server - jetty-9.0.z-SNAPSHOTWrappedArray()
2015-04-15 01:44:57,575 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Started WrappedArray(o.e.j.s.ServletContextHandler@52aa911c{/,null,AVAILABLE})
2015-04-15 01:44:57,587 [main] INFO org.eclipse.jetty.server.ServerConnector - Started WrappedArray(ServerConnector@6649373a{HTTP/1.1}{0.0.0.0:7000})
2015-04-15 01:44:57,587 [main] INFO ly.stealth.mesos.kafka.HttpServer$ - started on port 7000
2015-04-15 01:44:57,625 [Thread-13] INFO ly.stealth.mesos.kafka.Scheduler$ - [error] Completed framework attempted to re-register
2015-04-15 01:44:57,684 [Thread-12] INFO org.eclipse.jetty.server.ServerConnector - Stopped WrappedArray(ServerConnector@6649373a{HTTP/1.1}{0.0.0.0:7000})
2015-04-15 01:44:57,687 [Thread-12] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped WrappedArray(o.e.j.s.ServletContextHandler@52aa911c{/,null,UNAVAILABLE})
2015-04-15 01:44:57,690 [Thread-12] INFO ly.stealth.mesos.kafka.HttpServer$ - stopped

I think the best thing to do here is to not de-register the framework. If a framework id is seen in ZooKeeper, just re-register as you are doing now so it works.

failing to start scheduler when kafka broker zk chroot path does not exist

If the kafka broker zk path is set to something that doesn't yet exist in ZooKeeper, an exception is raised when connecting here: https://github.com/mesos/kafka/blob/master/src/scala/ly/stealth/mesos/kafka/Cluster.scala#L172.

If I am not mistaken, Kafka will create that path when the brokers start if it doesn't exist, so should the scheduler create it first, similar to https://github.com/mesos/kafka/blob/master/src/scala/ly/stealth/mesos/kafka/Cluster.scala#L165?
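A minimal sketch of what that could look like in the scheduler, assuming an org.I0Itec ZkClient is already available (the object and helper names are hypothetical):

import org.I0Itec.zkclient.ZkClient

object ZkChroot {
  // Create the broker chroot (and any missing parents) before connecting to it,
  // mirroring what Kafka itself does when brokers start.
  def ensure(zkClient: ZkClient, chroot: String): Unit =
    if (!zkClient.exists(chroot))
      zkClient.createPersistent(chroot, true) // true = also create parent nodes
}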

gradlew fails on centos 7

Using Gradle 2.4 on CentOS 7 with the most recently updated packages.

$ ./gradlew jar --debug
...
16:22:50.927 [DEBUG] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] ly/stealth/mesos/kafka/Scheduler.scala
16:22:50.928 [DEBUG] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] ly/stealth/mesos/kafka/Util.scala
16:22:51.050 [DEBUG] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] Scalac params = ''
16:22:59.911 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] /root/gradle/kafka/src/scala/ly/stealth/mesos/kafka/BrokerServer.scala:110: error: not found: value getClassLoadingLock
16:22:59.912 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] getClassLoadingLock(name) synchronized {
16:22:59.912 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] ^
16:23:02.469 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] warning: Class java.lang.AutoCloseable not found - continuing with a stub.
16:23:02.470 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] error: error while loading NetworkConnector, class file '/root/.gradle/caches/modules-2/files-2.1/org.eclipse.jetty/jetty-server/9.0.4.v20130625/64ac312bb641da49c73491e8c176bc1efdcd3857/jetty-server-9.0.4.v20130625.jar(org/eclipse/jetty/server/NetworkConnector.class)' is broken
16:23:02.470 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac](class java.lang.NullPointerException/null)
16:23:05.271 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] /root/gradle/kafka/src/scala/ly/stealth/mesos/kafka/Util.scala:134: error: value inheritIO is not a member of ProcessBuilder
16:23:05.272 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] possible cause: maybe a semicolon is missing before `value inheritIO'?
16:23:05.272 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] .inheritIO().redirectOutput(file).start().waitFor()
16:23:05.272 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] ^
16:23:05.573 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] one warning found
16:23:05.574 [WARN] [org.gradle.api.internal.project.ant.AntLoggingAdapter] [ant:scalac] three errors found
16:23:05.575 [DEBUG] [org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter] Finished executing task ':compileScala'
16:23:05.575 [LIFECYCLE] [class org.gradle.TaskExecutionLogger] :compileScala FAILED
16:23:05.593 [INFO] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] :compileScala (Thread[main,5,main]) completed. Took 17.373 secs.
16:23:05.594 [DEBUG] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] Task worker [Thread[main,5,main]] finished, busy: 17.452 secs, idle: 0.014 secs

We should have more metrics

We should first work on places like onBrokerStopped and such. It might be better to use Yammer metrics until the Kafka brokers move over to the Kafka Metrics in the client. We could also consider just writing these to Kafka directly.
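A rough sketch of what a per-state gauge could look like, assuming Yammer metrics is used (the object, metric name, and brokerStates function are hypothetical):

import com.yammer.metrics.Metrics
import com.yammer.metrics.core.Gauge

object BrokerMetrics {
  // Expose how many brokers are currently in the "running" state;
  // brokerStates stands in for the scheduler's view of the cluster.
  def registerRunningGauge(brokerStates: () => Seq[String]): Unit =
    Metrics.newGauge(getClass, "brokers_running", new Gauge[Int] {
      def value: Int = brokerStates().count(_ == "running")
    })
}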

we need a bind address for the scheduler also

As with the brokers, the scheduler is failing its health check because the IP it binds to is the IP of the offer, which is not routable for the services (just the master and slaves). We need to be able to set a bind address for the scheduler. Going forward, the brokers should inherit that setting and have the ability to override it.

Documentation about Persistence

Many other frameworks use HDFS for data persistence, but this framework doesn't mention HDFS, so I was curious how persistence works. Is it currently deployed using a shared SAN?

The docs in README.md mention that the framework will attempt to restart the broker on the same node if it fails. Is this framework using the persistent reservation features in Mesos 0.23?

Build tests fail if port 8000 in use

Not really an issue, but it stumped me for a bit, and it would be nice to have a note in the docs or a check for it.

Full story:
I've got Python's SimpleHTTPServer running on port 8000 on the VM where I was trying to run ./gradlew jar, and the build fails while it is running.

add new admin calls: decommission | frag | rebalance to handle auto scale up/down

This can be used after adding and starting more brokers to even out the partition load across the entire cluster. Folks can add 5000, start 5000, rebalance myHighTrafficeNineAMTraffice. If the rebalance is still going on, we should parse the output more and present it to folks. Once a rebalance starts, we shouldn't allow another one to be executed.

For stopping a broker, the intent is a controlled shutdown and then stop.

We should also add a command to just frag the task, which we should be able to call after we have done a stop and the task is still running.

Broker options not coming back on API status calls

Are there defaults set for broker options like jvm-options? If so, can these always be returned by the /api/brokers/status API call?

They appear to be returned only when manually set via the CLI or API.

We should provide a more configurable override for the log4j.properties in the broker

Changing the log4j settings currently requires packing up another tgz, stopping the broker, copying the tgz to the URI server, and starting again. This could be automated by specifying the file from the command line. We should also have some of the options settable from the command line so they can filter down. We should share this with the scheduler as well and make it easily modifiable.

kafka rebalance with a failed broker needs repair when it comes back online

I started 4 brokers, created a topic with replication factor 3, then added 3 more brokers and did a rebalance. During that, one of the brokers (id==0) died and the status showed errors on some partitions. Mesos restarted broker 0 (woo hoo), but when I ran the rebalance again everything went to replication factor 4, and I need it back at 3, spread evenly.


# ./kafka-mesos.sh add 0..3
Brokers added

brokers:
  id: 0
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

  id: 1
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

  id: 2
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

  id: 3
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

# ./kafka-mesos.sh start 0..3
Brokers 0,1,2,3 started

# bin/kafka-topics.sh --zookeeper master0:2181 --topic TESTING --create --replication-factor 3 --partitions 12
Created topic "TESTING".

# ./kafka-mesos.sh add 4..6
Brokers added

brokers:
  id: 4
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

  id: 5
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

  id: 6
  active: false
  state: stopped
  resources: cpus:0.50, mem:128, heap:128
  failover: delay:10s, maxDelay:60s

# ./kafka-mesos.sh start 4..6
Brokers 4,5,6 started

# bin/kafka-topics.sh --zookeeper master0:2181 --topic TESTING --describe
Topic:TESTING PartitionCount:12 ReplicationFactor:3 Configs:
  Topic: TESTING  Partition: 0  Leader: 1 Replicas: 1,3,0 Isr: 1,3,0
  Topic: TESTING  Partition: 1  Leader: 2 Replicas: 2,0,1 Isr: 2,0,1
  Topic: TESTING  Partition: 2  Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
  Topic: TESTING  Partition: 3  Leader: 0 Replicas: 0,2,3 Isr: 0,2,3
  Topic: TESTING  Partition: 4  Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
  Topic: TESTING  Partition: 5  Leader: 2 Replicas: 2,1,3 Isr: 2,1,3
  Topic: TESTING  Partition: 6  Leader: 3 Replicas: 3,2,0 Isr: 3,2,0
  Topic: TESTING  Partition: 7  Leader: 0 Replicas: 0,3,1 Isr: 0,3,1
  Topic: TESTING  Partition: 8  Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
  Topic: TESTING  Partition: 9  Leader: 2 Replicas: 2,3,0 Isr: 2,3,0
  Topic: TESTING  Partition: 10 Leader: 3 Replicas: 3,0,1 Isr: 3,0,1
  Topic: TESTING  Partition: 11 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
root@master0:/vagrant/


# ./kafka-mesos.sh rebalance 0..6
Rebalance started: 
TESTING
  0: 1,3,0 -> 6,5,0 - running
  1: 2,0,1 -> 0,6,1 - running
  2: 3,1,2 -> 1,0,2 - running
  3: 0,2,3 -> 2,1,3 - running
  4: 1,0,2 -> 3,2,4 - running
  5: 2,1,3 -> 4,3,5 - running
  6: 3,2,0 -> 5,4,6 - running
  7: 0,3,1 -> 6,0,1 - running
  8: 1,2,3 -> 0,1,2 - running
  9: 2,3,0 -> 1,2,3 - running
  10: 3,0,1 -> 2,3,4 - running
  11: 0,1,2 -> 3,4,5 - running

# ./kafka-mesos.sh rebalance status
Rebalance is running: 
TESTING
  0: 1,3,0 -> 6,5,0 - running
  1: 2,0,1 -> 0,6,1 - running
  2: 3,1,2 -> 1,0,2 - running
  3: 0,2,3 -> 2,1,3 - running
  4: 1,0,2 -> 3,2,4 - running
  5: 2,1,3 -> 4,3,5 - running
  6: 3,2,0 -> 5,4,6 - running
  7: 0,3,1 -> 6,0,1 - running
  8: 1,2,3 -> 0,1,2 - running
  9: 2,3,0 -> 1,2,3 - running
  10: 3,0,1 -> 2,3,4 - running
  11: 0,1,2 -> 3,4,5 - running

# ./kafka-mesos.sh rebalance status
Rebalance is idle: 
TESTING
  0: 1,3,0 -> 6,5,0 - error
  1: 2,0,1 -> 0,6,1 - error
  2: 3,1,2 -> 1,0,2 - error
  3: 0,2,3 -> 2,1,3 - done
  4: 1,0,2 -> 3,2,4 - done
  5: 2,1,3 -> 4,3,5 - done
  6: 3,2,0 -> 5,4,6 - done
  7: 0,3,1 -> 6,0,1 - error
  8: 1,2,3 -> 0,1,2 - error
  9: 2,3,0 -> 1,2,3 - done
  10: 3,0,1 -> 2,3,4 - done
  11: 0,1,2 -> 3,4,5 - done

# bin/kafka-topics.sh --zookeeper master0:2181 --topic TESTING --describe
Topic:TESTING PartitionCount:12 ReplicationFactor:5 Configs:
  Topic: TESTING  Partition: 0  Leader: 1 Replicas: 0,5,1,6,3 Isr: 0,5,1,6,3
  Topic: TESTING  Partition: 1  Leader: 2 Replicas: 0,6,1,2 Isr: 2,0,1,6
  Topic: TESTING  Partition: 2  Leader: 3 Replicas: 1,0,2,3 Isr: 3,1,2,0
  Topic: TESTING  Partition: 3  Leader: 2 Replicas: 2,1,3 Isr: 2,3,1
  Topic: TESTING  Partition: 4  Leader: 3 Replicas: 3,2,4 Isr: 2,3,4
  Topic: TESTING  Partition: 5  Leader: 4 Replicas: 4,3,5 Isr: 5,3,4
  Topic: TESTING  Partition: 6  Leader: 5 Replicas: 5,4,6 Isr: 5,6,4
  Topic: TESTING  Partition: 7  Leader: 3 Replicas: 6,0,1,3 Isr: 3,1,6,0
  Topic: TESTING  Partition: 8  Leader: 1 Replicas: 0,1,2,3 Isr: 1,2,3,0
  Topic: TESTING  Partition: 9  Leader: 2 Replicas: 1,2,3 Isr: 2,3,1
  Topic: TESTING  Partition: 10 Leader: 3 Replicas: 2,3,4 Isr: 2,3,4
  Topic: TESTING  Partition: 11 Leader: 5 Replicas: 3,4,5 Isr: 5,3,4


# ./kafka-mesos.sh rebalance 0..6
Rebalance started: 
TESTING
  0: 0,5,1,6,3 -> 2,1,3,4 - running
  1: 0,6,1,2 -> 3,2,4,5 - running
  2: 1,0,2,3 -> 4,3,5,6 - running
  3: 2,1,3 -> 5,4,6,0 - running
  4: 3,2,4 -> 6,5,0,1 - running
  5: 4,3,5 -> 0,6,1,2 - running
  6: 5,4,6 -> 1,0,2,3 - running
  7: 6,0,1,3 -> 2,3,4,5 - running
  8: 0,1,2,3 -> 3,4,5,6 - running
  9: 1,2,3 -> 4,5,6,0 - running
  10: 2,3,4 -> 5,6,0,1 - running
  11: 3,4,5 -> 6,0,1,2 - running

# ./kafka-mesos.sh rebalance status
Rebalance is idle: 
TESTING
  0: 0,5,1,6,3 -> 2,1,3,4 - done
  1: 0,6,1,2 -> 3,2,4,5 - done
  2: 1,0,2,3 -> 4,3,5,6 - done
  3: 2,1,3 -> 5,4,6,0 - done
  4: 3,2,4 -> 6,5,0,1 - done
  5: 4,3,5 -> 0,6,1,2 - done
  6: 5,4,6 -> 1,0,2,3 - done
  7: 6,0,1,3 -> 2,3,4,5 - done
  8: 0,1,2,3 -> 3,4,5,6 - done
  9: 1,2,3 -> 4,5,6,0 - done
  10: 2,3,4 -> 5,6,0,1 - done
  11: 3,4,5 -> 6,0,1,2 - done

# bin/kafka-topics.sh --zookeeper master0:2181 --topic TESTING --describe
Topic:TESTING PartitionCount:12 ReplicationFactor:4 Configs:
  Topic: TESTING  Partition: 0  Leader: 1 Replicas: 2,1,3,4 Isr: 1,2,3,4
  Topic: TESTING  Partition: 1  Leader: 2 Replicas: 3,2,4,5 Isr: 5,2,3,4
  Topic: TESTING  Partition: 2  Leader: 3 Replicas: 4,3,5,6 Isr: 5,6,3,4
  Topic: TESTING  Partition: 3  Leader: 5 Replicas: 5,4,6,0 Isr: 0,5,6,4
  Topic: TESTING  Partition: 4  Leader: 6 Replicas: 6,5,0,1 Isr: 0,5,1,6
  Topic: TESTING  Partition: 5  Leader: 0 Replicas: 0,6,1,2 Isr: 0,1,6,2
  Topic: TESTING  Partition: 6  Leader: 1 Replicas: 1,0,2,3 Isr: 0,1,2,3
  Topic: TESTING  Partition: 7  Leader: 3 Replicas: 2,3,4,5 Isr: 5,2,3,4
  Topic: TESTING  Partition: 8  Leader: 3 Replicas: 3,4,5,6 Isr: 5,6,3,4
  Topic: TESTING  Partition: 9  Leader: 4 Replicas: 4,5,6,0 Isr: 0,5,6,4
  Topic: TESTING  Partition: 10 Leader: 5 Replicas: 5,6,0,1 Isr: 0,5,1,6
  Topic: TESTING  Partition: 11 Leader: 6 Replicas: 6,0,1,2 Isr: 0,1,6,2

scheduler silently fails

I have 2 Kafka schedulers, one for the inbound cluster and one for the aggregate cluster. The first scheduler stays up. The second one silently fails every 7 minutes without any warning or error in the logs. I could create brokers. Could this be because both Kafka clusters use the same ZooKeeper?

No JSON response on failures

The error codes seem to be fine, but there is no JSON output for some API errors:

$ curl -sL http://192.168.33.10:7000/api/brokers/add?id=0       
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 400 </title>
</head>
<body>
<h2>HTTP ERROR: 400</h2>
<p>Problem accessing /api/brokers/add. Reason:
<pre>    Broker 0 already exists</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
</body>
</html>
$ curl -sL http://192.168.33.10:7000/api/brokers/stop?id=999     
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 400 </title>
</head>
<body>
<h2>HTTP ERROR: 400</h2>
<p>Problem accessing /api/brokers/stop. Reason:
<pre>    broker 999 not found</pre></p>
<hr /><i><small>Powered by Jetty://</small></i>
</body>
</html>
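For comparison, a hypothetical JSON error body the API could return for these cases (the shape is illustrative, not the current behavior):

{"code": 400, "error": "Broker 0 already exists"}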

we need a better way to load in metrics libraries and log4j appender libraries

In order to get broker metrics to report (and to use a custom log4j appender), we have to untar Kafka, cp the jar into lib, and re-tar. It would be better for the brokers to download additional files from the scheduler and put them in the lib directory on the slave. We need some configuration to identify which file(s), in addition to the Kafka tgz, should be served up, downloaded, and copied into lib.

Error: master when trying to use the CLI

I'm getting this when trying any of the command-line actions:

tobi@Zwerg ~/code/kafka (master)$ ./kafka-mesos.sh add 0
Error: master

A better error message would be great!
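A hedged guess, based on the kafka-mesos.properties example elsewhere in these issues: "master" here is likely the name of a required option that was never supplied, so setting the scheduler options in kafka-mesos.properties next to the script, or passing --api explicitly for CLI commands, is worth trying, e.g. (the host is illustrative):

./kafka-mesos.sh add 0 --api http://scheduler-host:7000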

3 test failures when building with java 1.8.0

jschroeder@omniscience:~/git/mesos-kafka (master)$ ./gradlew jar
:compileJava UP-TO-DATE
:compileScala
:processResources UP-TO-DATE
:classes
:compileTestJava UP-TO-DATE
:compileTestScala
[ant:scalac] Element '/home/jschroeder/git/mesos-kafka/out/gradle/resources/main' does not exist.
:processTestResources UP-TO-DATE
:testClasses
:test

ly.stealth.mesos.kafka.RebalancerTest > expandTopics FAILED
    org.I0Itec.zkclient.exception.ZkTimeoutException at RebalancerTest.scala:98

ly.stealth.mesos.kafka.RebalancerTest > start_in_progress FAILED
    org.I0Itec.zkclient.exception.ZkTimeoutException at RebalancerTest.scala:86

ly.stealth.mesos.kafka.RebalancerTest > start FAILED
    org.I0Itec.zkclient.exception.ZkTimeoutException at RebalancerTest.scala:74

87 tests completed, 3 failed
:test FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: file:///home/jschroeder/git/mesos-kafka/out/gradle/reports/tests/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 1 mins 50.398 secs

This is with hotspot 1.8.0:

jschroeder@omniscience:~/git/mesos-kafka (master)$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

Works perfectly with 1.7.x.

Unclear on starting broker -- jar locations don't seem correct

I am running the scheduler via Marathon:

    - role: marathon_app
      marathon_app:
        id: "/ops/kafka-mesos-scheduler"
        cmd: >-
          ./kafka-mesos.sh scheduler
          --storage zk:/kafka-mesos
          --master zk://zookeeper.service.consul:{{zookeeper_client_port}}/mesos
          --zk zookeeper.service.consul:{{zookeeper_client_port}}
          --api http://zookeeper.service.consul:7000
        cpus: 0.1
        mem: 128
        instances: 1
        uris:
          - https://s3.amazonaws.com/si-vpc-internal/vimana/ops-kafka-mesos/kafka-mesos-{{ version }}.tgz

Where kafka-mesos.tgz contains:

  • kafka-mesos-0.9.1.1.jar
  • kafka-mesos.sh
  • kafka_2.10-0.8.2.1.tgz

The scheduler starts with no errors that I can find.

Then I start a broker:

curl "http://localhost:7000/api/brokers/add?id=0&cpus=2&mem=2048"
curl "http://localhost:7000/api/brokers/start?id=0"

But the task started by the scheduler seems to want to download artifacts from zookeeper.service.consul.

I0809 04:02:39.610327  2428 fetcher.cpp:214] Fetching URI 'http://zookeeper.service.consul:7000/jar/kafka-mesos-0.9.1.1.jar'
I0809 04:02:39.610530  2428 fetcher.cpp:125] Fetching URI 'http://zookeeper.service.consul:7000/jar/kafka-mesos-0.9.1.1.jar' with os::net
I0809 04:02:39.610546  2428 fetcher.cpp:135] Downloading 'http://zookeeper.service.consul:7000/jar/kafka-mesos-0.9.1.1.jar' to '/mnt/mesos/slaves/20150522-122903-2693333002-5050-7694-S24/frameworks/20150522-122903-2693333002-5050-7694-0000/executors/broker-0-a65870ff-f951-473e-bd26-6e46ae61baf0/runs/502ed53b-85e5-45ef-a62d-fea9f0086fdc/kafka-mesos-0.9.1.1.jar'
E0809 04:06:56.432394  2428 fetcher.cpp:138] Error downloading resource: Couldn't connect to server
Failed to fetch: http://zookeeper.service.consul:7000/jar/kafka-mesos-0.9.1.1.jar
Failed to synchronize with slave (it's probably exited)

It seems like the scheduler is getting its state from ZooKeeper correctly.

$ curl "http://localhost:7000/api/brokers/status?id=0"
{"brokers" : [{"stickiness" : {"period" : "10m", "stopTime" : "2015-08-09 04:12:07.067"}, "id" : "0", "mem" : 2048, "cpus" : 2.0, "heap" : 1024, "failover" : {"delay" : "1m", "maxDelay" : "10m"}, "active" : false}], "frameworkId" : "20150522-122903-2693333002-5050-7694-0000"}

Have I misconfigured something?

Set the broker's zookeeper chroot path

There should be a method to set the chroot path when the brokers use zookeeper. This is normally appended to the zookeeper.connect configuration property.
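For reference, the chroot is whatever follows the host list in a standard Kafka server.properties (the path /kafka-cluster-1 is illustrative):

zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka-cluster-1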

assigning specific port to brokers

When I tried assigning a specific port to a broker, it failed and my offer was declined. Port ranges work:
port=9092 did not work
port=31000..32000 worked

2015-07-22 16:03:07,015 [Thread-207] INFO ly.stealth.mesos.kafka.Scheduler$ - Declined offers:
52.2.20.235#19580 - broker 1: no suitable port
52.6.91.141#19581 - broker 1: no suitable port
52.4.227.57#19582 - broker 1: no suitable port
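A likely explanation for the decline: a broker can only use ports the slave actually offers, and the offers shown in these issues advertise ports:[31000..32000], so a single fixed port only works when it falls inside the offered range, e.g. (illustrative value):

port=31005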

kafka-mesos.sh doesn't work but the API does

I know it is some misconfiguration, but I don't see which.

REST API
Add:

[centos@host-04 kafka-mesos]$ curl http://kafka-mesos-scheduler.service.consul:7000/api/brokers/add?id=0
{"brokers" : [{"id" : "0", "mem" : 128, "cpus" : 0.5, "heap" : 128, "failover" : {"delay" : "10s", "maxDelay" : "60s"}, "active" : false}]}

Start:

[centos@host-04 kafka-mesos]$ curl http://kafka-mesos-scheduler.service.consul:7000/api/brokers/start?id=0
{"status" : "started", "ids" : "0"}

Status:

[centos@host-04 kafka-mesos]$ curl http://kafka-mesos-scheduler.service.consul:7000/api/brokers/status?id=0
 {"brokers" : [{"task" : {"hostname" : "host-05", "state" : "running", "slaveId" : "20150625-135133   -2919893514-15050-12781-S0", "executorId" : "broker-0-99345375-fca8-4340-9a0b-e04fb25ae934",   "attributes" : {"node_id" : "host-05"}, "id" : "broker-0-3a671b4e-ee59-4b2a-a

In the case of kafka-mesos.sh:

[centos@host-04 kafka-mesos]$ kafka-mesos.sh add 1
Error: java.io.IOException: 405 - HTTP method POST is not supported by this URL
[centos@host-04 kafka-mesos]$ kafka-mesos.sh status
Error: java.io.IOException: 405 - HTTP method POST is not supported by this URL
 [centos@host-04 kafka-mesos]$ kafka-mesos.sh status --api http://kafka-mesos-scheduler.service.consul:7000
 Error: java.io.IOException: 405 - HTTP method POST is not supported by this URL

My kafka-mesos.properties:

debug=False
storage=zk:/kafka-mesos
master=zk://zookeeper.service.consul:2181/mesos
zk=zookeeper.service.consul:2181
api=http://kafka-mesos-scheduler.service.consul:7000

What did I miss?

question: What are you doing for storage?

I would like to run Kafka in our Mesos cluster, but it typically won't run on every node. In that case, what strategies for storage are being employed? Just give each slave a dedicated disk for Kafka? Limit Kafka to run only on some nodes with Kafka disks? If a Kafka broker is (re)started on a node that previously ran one, how will that disk affect the new broker?

when the zookeeper host can't be resolved, the error message isn't intuitive about why


050-6268-0001/executors/kafka-inbound.67626d94-2cd8-11e5-a715-56847afe9799/runs/49652666-b2a7-4723-8501-483cfc781fd8
2015-07-17 23:06:12,049:25739(0x7fd0bbf41700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=host:2181 sessionTimeout=10000 watcher=0x7fd11d8281e0 sessionId=0 sessionPasswd=<null> context=0x7fd094000930 flags=0
2015-07-17 23:06:22,097:25739(0x7fd0bbf41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory

F0717 23:06:22.097390 25816 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
    @     0x7fd11dad15cd  google::LogMessage::Fail()
    @     0x7fd11dad330c  google::LogMessage::SendToLog()
    @     0x7fd11dad11bc  google::LogMessage::Flush()
    @     0x7fd11dad13c9  google::LogMessage::~LogMessage()
    @     0x7fd11dad2332  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7fd11d828b81  ZooKeeperProcess::initialize()
    @     0x7fd11da7e541  process::ProcessManager::resume()
    @     0x7fd11da7e7ac  process::schedule()
    @     0x7fd1e7a34df3  start_thread
    @     0x7fd1e713901d  __clone
