Comments (5)
Is it actually stuck? The default reconciliation timeout is pretty high, 30 minutes I think. Can you attach broker logs from before/after the framework restart?
from kafka.
All the brokers are no longer running, but the framework's ZooKeeper state still records a broker in the reconciling state. When the framework restarts, it won't start any broker as long as a single broker state is stuck in reconciling, and the state is no longer recoverable because of the above-mentioned line: if any broker state is in reconciling, the framework won't schedule any further reconciliation, and since the broker is no longer running, the reconciling state stays there forever.
@steveniemitz Any thoughts? You can easily recreate the problem: run a framework with 1 running broker, stop the framework first, then kill the broker, manually change the broker state from running to reconciling in ZooKeeper under /kafka-mesos, and start the framework. It will be stuck.
I would suggest removing the if part and always calling startImpl() in https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/mesos/TaskReconciler.scala#L124, though I'm not sure whether there are any side effects.
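To make the suspected deadlock concrete, here is a minimal toy model in Python. It is not the real TaskReconciler code: the Broker and ToyReconciler classes, the always_start flag, and the assumption that Mesos answers a reconcile request with a "task lost" update are all hypothetical stand-ins. It only illustrates why a persisted reconciling state plus a "skip if already in progress" guard can never resolve after a restart.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Broker:
    id: str
    # Persisted task state, e.g. "running" or "reconciling"; None means no task.
    task_state: Optional[str] = None


class ToyReconciler:
    """Hypothetical model of the startup guard, not the framework's implementation."""

    def __init__(self, brokers: List[Broker], always_start: bool = False):
        self.brokers = brokers
        self.always_start = always_start  # models the "remove the if" suggestion
        self.log: List[str] = []

    def start(self) -> bool:
        in_progress = any(b.task_state == "reconciling" for b in self.brokers)
        if in_progress and not self.always_start:
            # Same message as the WARN line reported in this issue.
            self.log.append("Reconcile already in progress, skipping.")
            return False
        self._start_impl()
        return True

    def _start_impl(self) -> None:
        # Mark every broker that has a task as reconciling and (in a real
        # scheduler) ask Mesos for the current status of those tasks.
        for b in self.brokers:
            if b.task_state is not None:
                b.task_state = "reconciling"


def restart_framework(persisted_state: Optional[str],
                      always_start: bool) -> Tuple[bool, Optional[str]]:
    """Simulate a framework restart with one broker whose task is already gone."""
    broker = Broker("21", persisted_state)
    started = ToyReconciler([broker], always_start).start()
    if started:
        # Assumption: Mesos reports the dead task as lost, so the stale task is dropped.
        broker.task_state = None
    return started, broker.task_state


print(restart_framework("reconciling", always_start=False))  # guard skips, state stays stuck
print(restart_framework("reconciling", always_start=True))   # reconciliation runs, state clears
```

In this toy model the guard is the only thing keeping the state stuck. Whether always running the start path is safe in the real scheduler is exactly the open question about side effects.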
I'll have some time to look at this soon. The reconciliation logic is fairly complicated because there are a bunch of edge cases, so it's not as simple as just removing that if block.
I'd really like to see the logs from before and after the framework restarts; I'm still confused about how you're getting into this state. What version of the framework and Mesos are you running?
Also, when it gets into this state, manually stopping the broker from the CLI should be enough to get it out of reconciling. Have you tried that?
It's very likely to happen during a rolling restart of the Mesos slaves. Say you have 2 slaves, slave01 and slave02, with the broker running on slave01 and the framework running on slave02:
- slave01 gets restarted (broker killed and lost)
- the framework starts reconciling the broker
- while it is still reconciling, slave02 gets restarted
- the framework starts again after slave02 comes back, but is stuck forever due to the reconciling state.
Here is an example state from /api/broker/list
{
  "brokers": [
    {
      "id": "21",
      "active": true,
      "cpus": 1,
      "mem": 3072,
      "heap": 2048,
      "syslog": false,
      "constraints": "hostname=like:.*slave01.*",
      "options": "",
      "log4jOptions": "",
      "jvmOptions": "",
      "stickiness": {
        "period": "864000s",
        "hostname": "slave01.mycluster.com"
      },
      "failover": {
        "delay": "60s",
        "maxDelay": "10m",
        "failures": 0
      },
      "task": {
        "id": "kafka-21-27ae8058-b0b8-48a3-9d94-eac6e30db749",
        "slaveId": "400ac0f4-9bf6-435c-a227-51a9b559e22d-S3",
        "executorId": "kafka-21-a35bc7ce-8bf2-4016-9f43-6f41aad154d9",
        "hostname": "slave01.mycluster.com",
        "endpoint": "slave01.mycluster.com:10023",
        "attributes": {},
        "state": "reconciling"
      },
      "metrics": {
        "timestamp": 0
      },
      "needsRestart": false
    }
  ]
}
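For anyone who wants to check whether they are hitting the same state, a small Python sketch like the one below can pick out brokers whose task is still marked reconciling from a saved /api/broker/list response. The stuck_brokers helper is a hypothetical name, and the sample string is a trimmed copy of the output above; it assumes the response has the same shape as that example.

```python
import json
from typing import List, Optional, Tuple


def stuck_brokers(broker_list_json: str) -> List[Tuple[str, Optional[str]]]:
    """Return (broker id, task hostname) for every broker whose task state is 'reconciling'."""
    data = json.loads(broker_list_json)
    stuck = []
    for broker in data.get("brokers", []):
        # Brokers without a task have no "task" section at all.
        task = broker.get("task") or {}
        if task.get("state") == "reconciling":
            stuck.append((broker["id"], task.get("hostname")))
    return stuck


# Trimmed copy of the /api/broker/list output shown in this comment.
sample = """
{
  "brokers": [
    {
      "id": "21",
      "active": true,
      "task": {
        "hostname": "slave01.mycluster.com",
        "state": "reconciling"
      }
    }
  ]
}
"""

print(stuck_brokers(sample))  # [('21', 'slave01.mycluster.com')]
```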
Stopping the broker (/api/broker/stop) does not work, since it only changes the active field from true to false. The task section sticks around, and when I start the broker again, nothing happens.
The only relevant log I got after framework restart is
2017-06-15 15:16:14,871 WARN TaskReconciler] Reconcile already in progress, skipping.
The logs from before the framework restart don't matter: as long as you kill the framework while a broker is reconciling (for any reason), it will get stuck when you start it again.
We are using Mesos 0.28.2. From my understanding, reconciliation is framework-driven, so as long as the framework skips it in the startup if block, the reconciliation can never complete.