We had several occasions where the mesos slave with running broker got restarted, the

Framework stuck at broker reconciling state

shangd commented on June 3, 2024

Framework stuck at broker reconciling state

from kafka.

Comments (5)

steveniemitz commented on June 3, 2024

Is it actually stuck? The default reconciliation timeout is pretty high, 30 minutes I think. Can you attach broker logs from before/after the framework restart?

shangd commented on June 3, 2024

All the brokers are no longer running, but the framework's ZooKeeper state still records one broker as reconciling. When the framework restarts, it won't start any broker as long as a single broker is stuck in the reconciling state, and the state is no longer recoverable because of the above-mentioned line: if any broker is in the reconciling state, the framework won't schedule any further reconciliation, and since the broker is no longer running, the reconciling state stays there forever.

shangd commented on June 3, 2024

@steveniemitz Any thoughts? You can easily recreate the problem: run a framework with 1 running broker, stop the framework first, then kill the broker, manually change the broker state from running to reconciling in ZooKeeper under /kafka-mesos, and start the framework. It will be stuck.

I would suggest removing the if check and always calling startImpl() in https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/mesos/TaskReconciler.scala#L124
though I'm not sure whether that has any side effects.
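
To make the suggestion concrete, here is a minimal Python sketch of the guard as described in this thread. It is a toy model, not the real TaskReconciler.scala: the class ToyReconciler, the always_start flag, and the requests list are all hypothetical names, and only the "skip if anything is already reconciling" behaviour is modelled. Any side effects of dropping the guard in the real scheduler are not captured here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

RECONCILING = "reconciling"


@dataclass
class ToyReconciler:
    """Toy stand-in for the framework's reconciler (hypothetical, simplified)."""
    states: Dict[str, str]        # broker id -> persisted task state
    always_start: bool = False    # True models the proposed change
    requests: List[List[str]] = field(default_factory=list)

    def start(self) -> bool:
        """Try to begin a reconciliation round; return True if one started."""
        in_progress = any(s == RECONCILING for s in self.states.values())
        if in_progress and not self.always_start:
            # Models the "Reconcile already in progress, skipping." branch.
            return False
        self._start_impl()
        return True

    def _start_impl(self) -> None:
        # Mark every known task as reconciling and record the request that
        # would be sent to the Mesos master (assumed behaviour).
        for broker_id in self.states:
            self.states[broker_id] = RECONCILING
        self.requests.append(sorted(self.states))


# State as persisted before the framework restart: broker 21 was mid-reconcile.
persisted = {"21": RECONCILING}

guarded = ToyReconciler(dict(persisted))
unguarded = ToyReconciler(dict(persisted), always_start=True)

print(guarded.start(), guarded.requests)      # False []
print(unguarded.start(), unguarded.requests)  # True [['21']]
```

With the guard in place, a freshly started instance that loads a persisted reconciling state never issues a request, which matches the stuck behaviour reported here. The always_start variant issues one, but says nothing about whether that is safe in the real code.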

steveniemitz commented on June 3, 2024

I'll have some time to look at this soon. The reconciliation logic is fairly complicated because there are a bunch of edge cases, so it's not as simple as just removing that if block.

I'd really like to see the logs from before and after the framework restarts; I'm still confused about how you're getting into this state. What version of the framework and Mesos are you running?

Also, when it gets into this state, manually stopping it from the CLI should be enough to get it out of reconciling. Have you tried that?

shangd commented on June 3, 2024

It's very likely to happen during a rolling restart of the Mesos slaves. Say you have 2 slaves, slave01 and slave02, with the broker running on slave01 and the framework running on slave02:

slave01 gets restarted (broker killed and lost)
framework starts reconciling with the broker
while it is still reconciling, slave02 gets restarted
framework starts after slave02 comes back, but is stuck forever due to the reconciling state.

Here is an example state from /api/broker/list

{
  "brokers": [
    {
      "id": "21",
      "active": true,
      "cpus": 1,
      "mem": 3072,
      "heap": 2048,
      "syslog": false,
      "constraints": "hostname=like:.*slave01.*",
      "options": "",
      "log4jOptions": "",
      "jvmOptions": "",
      "stickiness": {
        "period": "864000s",
        "hostname": "slave01.mycluster.com"
      },
      "failover": {
        "delay": "60s",
        "maxDelay": "10m",
        "failures": 0
      },
      "task": {
        "id": "kafka-21-27ae8058-b0b8-48a3-9d94-eac6e30db749",
        "slaveId": "400ac0f4-9bf6-435c-a227-51a9b559e22d-S3",
        "executorId": "kafka-21-a35bc7ce-8bf2-4016-9f43-6f41aad154d9",
        "hostname": "slave01.mycluster.com",
        "endpoint": "slave01.mycluster.com:10023",
        "attributes": {},
        "state": "reconciling"
      },
      "metrics": {
        "timestamp": 0
      },
      "needsRestart": false
    }
  ]
}
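
For anyone checking whether they are in the same situation, the response above can be scanned for brokers whose task is still marked as reconciling. The following is a small Python sketch that works on an already-fetched /api/broker/list payload. The helper name reconciling_brokers is made up for illustration, and the sample is a trimmed copy of the response above with only the fields the check needs.

```python
import json
from typing import Any, Dict, List


def reconciling_brokers(payload: Dict[str, Any]) -> List[str]:
    """Return the ids of brokers whose task state is "reconciling"."""
    stuck = []
    for broker in payload.get("brokers", []):
        # Assumption: a broker without a task has no "task" entry (or null).
        task = broker.get("task") or {}
        if task.get("state") == "reconciling":
            stuck.append(broker["id"])
    return stuck


# Trimmed copy of the broker list response shown above.
sample = json.loads("""
{
  "brokers": [
    {"id": "21", "active": true,
     "task": {"hostname": "slave01.mycluster.com", "state": "reconciling"}}
  ]
}
""")

print(reconciling_brokers(sample))  # ['21']
```

A non-empty result long after the broker's host has gone away is the symptom described in this issue.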

Stopping the broker (/api/broker/stop) does not work: it only changes the active field from true to false, while the task section sticks around, and when I start the broker again nothing happens.

The only relevant log I got after framework restart is

2017-06-15 15:16:14,871 WARN           TaskReconciler] Reconcile already in progress, skipping.

The log from before the framework restart won't matter. As long as you time it to kill the framework while a broker is reconciling (for any reason), the framework will get stuck when you start it again.

We are using Mesos 0.28.2. From my understanding, reconciliation is framework-driven, so as long as the framework skips it in the startup if block, the reconciliation can never complete.
