Comments (5)

steveniemitz commented on June 3, 2024

Is it actually stuck? The default reconciliation timeout is pretty high, 30 minutes I think. Can you attach broker logs from before/after the framework restart?

from kafka.

shangd commented on June 3, 2024

All the brokers are no longer running, but the framework's ZooKeeper state still records a broker in the reconciling state. When the framework restarts, it won't start any broker as long as a single broker is stuck in reconciling, and the state is no longer recoverable because of the line mentioned above: if any broker is in reconciling, the framework won't schedule any further reconciliation, and since the broker is no longer running, the reconciling state stays there forever.

shangd commented on June 3, 2024

@steveniemitz Any thoughts? You can easily reproduce the problem: run a framework with 1 running broker, stop the framework, kill the broker, manually change the broker state from running to reconciling in ZooKeeper under /kafka-mesos, then start the framework. It will be stuck.

I would suggest removing the if check and always calling startImpl() in https://github.com/mesos/kafka/blob/master/src/scala/main/ly/stealth/mesos/kafka/scheduler/mesos/TaskReconciler.scala#L124, though I'm not sure whether that has any side effects.
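
To make the suggestion concrete, here is a toy Python model of the behaviour described in this thread. It is not the actual TaskReconciler code (which is Scala); every name in it (`start_impl`, `start_guarded`, `start_always`, the state dictionary) is hypothetical, and the "reconcile pass" is simplified to keeping only the tasks Mesos still knows about.

```python
RECONCILING = "reconciling"


def start_impl(states, alive):
    """Hypothetical reconcile pass: keep only tasks Mesos still knows about.

    `states` maps broker id -> persisted task state, `alive` is the set of
    broker ids whose tasks are actually running. Dead tasks are dropped.
    """
    return {broker: "running" for broker in states if broker in alive}


def start_guarded(states, alive, log):
    """Model of the reported behaviour: skip if any task is already reconciling."""
    if any(state == RECONCILING for state in states.values()):
        log.append("Reconcile already in progress, skipping.")
        return dict(states)  # nothing changes, so the state stays stuck
    return start_impl(states, alive)


def start_always(states, alive, log):
    """Model of the suggested change: always run the reconcile pass."""
    return start_impl(states, alive)


# State left in ZooKeeper by the previous framework run; the broker is dead.
persisted = {"21": RECONCILING}
log = []
print(start_guarded(persisted, set(), log))  # {'21': 'reconciling'}
print(start_always(persisted, set(), log))   # {}
print(log)
```

In this model the guarded variant leaves broker 21 in reconciling on every restart, while the unguarded variant clears it. The model says nothing about the edge cases the real reconciler has to handle.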

steveniemitz commented on June 3, 2024

I'll have some time to look at this soon. The reconciliation logic is fairly complicated because there are a bunch of edge cases, so it's not as simple as just removing that if block.

I'd really like to see the logs from before and after the framework restarts; I'm still confused about how you're getting into this state. What versions of the framework and Mesos are you running?

Also, when it gets into this state, manually stopping it from the CLI should be enough to get it out of reconciling. Have you tried that?

shangd commented on June 3, 2024

It's very likely to happen during a rolling restart of the Mesos slaves. Say you have 2 slaves, slave01 and slave02, with the broker running on slave01 and the framework running on slave02.

  1. slave01 is restarted (the broker is killed and lost)
  2. the framework starts reconciling with the broker
  3. while it is still reconciling, slave02 is restarted
  4. the framework starts again once slave02 is back, but is stuck forever because of the reconciling state.

Here is an example state from /api/broker/list:

{
  "brokers": [
    {
      "id": "21",
      "active": true,
      "cpus": 1,
      "mem": 3072,
      "heap": 2048,
      "syslog": false,
      "constraints": "hostname=like:.*slave01.*",
      "options": "",
      "log4jOptions": "",
      "jvmOptions": "",
      "stickiness": {
        "period": "864000s",
        "hostname": "slave01.mycluster.com"
      },
      "failover": {
        "delay": "60s",
        "maxDelay": "10m",
        "failures": 0
      },
      "task": {
        "id": "kafka-21-27ae8058-b0b8-48a3-9d94-eac6e30db749",
        "slaveId": "400ac0f4-9bf6-435c-a227-51a9b559e22d-S3",
        "executorId": "kafka-21-a35bc7ce-8bf2-4016-9f43-6f41aad154d9",
        "hostname": "slave01.mycluster.com",
        "endpoint": "slave01.mycluster.com:10023",
        "attributes": {},
        "state": "reconciling"
      },
      "metrics": {
        "timestamp": 0
      },
      "needsRestart": false
    }
  ]
}
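
A quick way to spot brokers in this condition is to filter a saved copy of the response for tasks whose state is reconciling. The sketch below is only an illustration: it uses the field names from the response above, works on a JSON string rather than calling the API, and the `stuck_brokers` helper and the second broker ("22") in the sample are made up.

```python
import json


def stuck_brokers(payload):
    """Return the ids of brokers whose task state is 'reconciling'.

    `payload` is the JSON text of a /api/broker/list response.
    Brokers without a task section are ignored.
    """
    data = json.loads(payload)
    return [
        broker["id"]
        for broker in data.get("brokers", [])
        if (broker.get("task") or {}).get("state") == "reconciling"
    ]


# Trimmed-down sample; broker "22" is hypothetical and has no task.
sample = """
{
  "brokers": [
    {"id": "21", "active": true, "task": {"state": "reconciling"}},
    {"id": "22", "active": false}
  ]
}
"""
print(stuck_brokers(sample))  # ['21']
```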

Stopping the broker (/api/broker/stop) does not work, since it only changes the active field from true to false. The task section sticks around, and starting the broker afterwards does nothing.

The only relevant log line I got after the framework restart is:

2017-06-15 15:16:14,871 WARN           TaskReconciler] Reconcile already in progress, skipping.

The logs from before the framework restart don't matter. As long as you time it so that the framework is killed while a broker is reconciling (for any reason), it will get stuck when you start it again.

We are using Mesos 0.28.2. From my understanding, reconciliation is framework-driven, so as long as the framework skips it in the startup if block, the reconciliation can never complete.
