
Comments (7)

gerhard commented on September 27, 2024

There is clear value to be had from a RabbitMQ node informing the thing that manages its lifecycle - e.g. BOSH, K8S, Chef, etc. - when it is safe to be shut down. This was the direction in which the initial checks started going, specifically answering the question of whether the quorum queues and classic mirrored queues are in sync. It was implied that it is safe to shut down the node since there are sufficient healthy replicas in the system.

We are now learning that in practice a different command would be more useful, e.g. rabbitmqctl await_safe_replica_state. I cannot help but see the similarity to a different command, which does the opposite: rabbitmqctl await_startup. 🤔💡 I would like us to go one step further, and imagine the following:

rabbitmqctl shutdown --safe --keep-checking-every 60 --timeout 3600

The --safe flag would check for the following conditions to be true:

  • all queues that should be running on this node are up and running
  • the classic mirrored queues with processes on this node are synchronised (if no master or mirror is running on the node being shut down, we should not take this queue's sync state into account)
  • all quorum queues that span this node have peers which are in either follower or leader state

It is easy to see how we can keep improving the semantics of the --safe flag in the future (e.g. ensure there are no alarms, etc.), but for the first implementation, the above checks feel sufficient.

If any of the --safe conditions are not met, the --keep-checking-every flag controls the interval, in seconds, at which all checks are re-run. In its absence, we could default to a sensible interval - 30s?

If the command does not return with exit status 0 within --timeout seconds, it returns a non-zero exit status once that period has elapsed.
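To make the proposed semantics concrete, here is a minimal sketch of the re-check loop in Python. It is not how rabbitmqctl is implemented: the function name await_safe, its parameters, and the idea of passing each --safe condition in as a callable are all assumptions made for illustration. The clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable

# A check is any zero-argument callable returning True when its condition holds.
# The real conditions (queues running, mirrors synchronised, quorum peers healthy)
# are NOT implemented here; callers would have to supply them.
Check = Callable[[], bool]


def await_safe(
    checks: Iterable[Check],
    interval: float = 30.0,    # hypothetical counterpart of --keep-checking-every
    timeout: float = 3600.0,   # hypothetical counterpart of --timeout
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return 0 once every check passes, or 1 if the timeout elapses first."""
    checks = list(checks)
    deadline = clock() + timeout
    while True:
        # All conditions are re-evaluated on every pass, not only the failed ones.
        if all(check() for check in checks):
            return 0
        remaining = deadline - clock()
        if remaining <= 0:
            return 1
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

With interval=60 and timeout=3600 this mirrors the example invocation above: the checks run immediately, then once a minute, and the caller gets a non-zero status if the node has not become safe within the hour.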

While I can see the benefits of having rabbitmq-upgrade await_online_synchronised_mirrors & rabbitmq-upgrade await_minimum_online_quorum, as well as the combination of the 2 commands in rabbitmq-upgrade await_safe_replica_state, I favour the simplicity of the --safe flag in the context of the shutdown command. This keeps our API small and allows us to make improvements to the semantics of this behaviour without requiring those that use it to make changes on their end. BOSH & K8S can continue using the same rabbitmqctl shutdown --safe, and we make the changes that we learn are necessary to respect the "Yes, I am safe to be shut down" contract.

I would like to hear more opinions on this, especially from those that have experience operating other distributed stateful services, like Cassandra, Oracle, Riak etc. cc @dumbbell @lukebakken @kjnilsson @mkuratczyk

from rabbitmq-cli.

dumbbell commented on September 27, 2024

I agree with @lukebakken. Waiting for enough replicas to be on the same page may never happen if there is a steady ingress of messages. The node should kick all clients and stop listening first.

Also, I disagree with the notion of "pre upgrade". Semantically and from a user point of view, this limits the scope of this state to upgrades only. What if we just want to restart a node for whatever reason? To me, a more appropriate name would be "maintenance mode".


ansd commented on September 27, 2024

I like the --safe flag. However, a couple of advantages of a rabbitmq-upgrade pre_upgrade command would be

  • It nicely complements the existing rabbitmq-upgrade post_upgrade command
  • The rabbitmqctl shutdown command already has a flag called --wait, which is different from the blocking / waiting we want from the --safe flag. This might be confusing to the user.
  • The rabbitmqctl stop command would also need the --safe flag.
  • In the BOSH lifecycle hook, the natural place to call pre_upgrade would be the pre-stop script. In the pre-stop script, we have all the information about whether the deployment is going to be deleted, in which case there's no need to wait for a safe replica state. (In the drain script there is a way to find out whether a node is going to be deleted (via the BOSH_JOB_NEXT_STATE variable), but not whether that is because of a scale-down or because the deployment is going to be deleted.)
  • Other tools such as Cassandra also have a separate command which makes sure it's safe to shut down a node which is going to be upgraded: nodetool drain - "Typically, use this command before upgrading a node to a new version of DSE."


lukebakken commented on September 27, 2024

👍 to pre_upgrade (or something similar). People have requested that RabbitMQ support putting a node into a "drained" state - i.e. close existing connections / channels, do not accept new ones, move queue masters / mirrors off of the node.


michaelklishin commented on September 27, 2024

An "upgrade mode" and upgrade steps in general involve much more than what's suggested in this issue, but we will add pre_upgrade that will only call the above functions for now, much like post_upgrade currently only runs queue master rebalancing.


ansd commented on September 27, 2024

@dumbbell good point: It's true that a user might just want to restart a node without upgrading it (e.g. by using the bosh restart command), in which case pre_upgrade would not be a good name.

The state of today is that we have the two rabbitmq-diagnostics check_if_node_is_xxx_critical commands. My understanding is that their original purpose was for deployment tools to know when it is not safe to shut down for upgrades, in which case they would end up waiting until it's safe.

This waiting logic has been implemented in different deployment tools (BOSH cf-rabbitmq-release and K8s operator).
The motivation behind this issue is to avoid this code duplication for the waiting logic by having a new CLI command.
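For readers who have not seen this waiting logic, the following Python sketch shows roughly what each deployment tool ends up writing. It is not taken from cf-rabbitmq-release or the K8s operator: the names exit_status and wait_until_all_succeed, the attempt count, and the delay are assumptions for illustration. The commands to run are passed in by the caller as argument lists, so the sketch does not spell out the concrete check_if_node_is_xxx_critical command names.

```python
import subprocess
import time
from typing import Callable, Sequence


def exit_status(argv: Sequence[str]) -> int:
    """Run one CLI command and return its exit status (0 means the check passed)."""
    return subprocess.run(list(argv), check=False).returncode


def wait_until_all_succeed(
    commands: Sequence[Sequence[str]],
    attempts: int = 120,      # hypothetical retry budget
    delay: float = 30.0,      # hypothetical pause between passes, in seconds
    run: Callable[[Sequence[str]], int] = exit_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-run every command until all exit with status 0, or attempts run out."""
    for attempt in range(attempts):
        # A pass succeeds only if every command exits 0; the first failure
        # short-circuits the pass and the remaining commands are skipped.
        if all(run(cmd) == 0 for cmd in commands):
            return True
        # Do not sleep after the final failed attempt.
        if attempt + 1 < attempts:
            sleep(delay)
    return False
```

A deployment hook would call wait_until_all_succeed with the two rabbitmq-diagnostics invocations and only proceed to stop the node when it returns True. The run and sleep parameters exist so the loop can be tested without a broker.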

We know that in future, it's best to have something like a proper drain command (or whatever it will be called) which puts a node into "upgrade mode" / "maintenance mode" / "drained state". This drain command would - as @lukebakken and @dumbbell suggest - close connections, move queue masters to a different node, etc. Having such a drain command in future would make the rabbitmq-diagnostics check_if_node_is_xxx_critical commands redundant from the point of view of the deployment tools that want to upgrade a particular node (in a rolling upgrade fashion).

Having now, as suggested in this issue, an additional command which is about wrapping the two rabbitmq-diagnostics check_if_node_is_xxx_critical commands and waiting until their exit status becomes 0 doesn't feel right from my point of view if we know that there will be this drain command at some point in the future.

For me, it's fine to have this code duplication in the deployment tools. It sounds better to me than introducing new commands which might not be needed anymore in the future when this drain command comes out.

But I think there is no right or wrong. The question is whether the RabbitMQ command line interface follows a minimal interface (which argues this duplication is acceptable) or a humane interface (which argues this duplication is bad).


michaelklishin commented on September 27, 2024

What exactly goes into a pre-upgrade command or a "maintenance mode" is up for another discussion (update: now there is rabbitmq/rabbitmq-server#2321). The commands suggested in this issue would remove some duplication that'd have to be present in every deployment tool and can be reused one way or another in a more extensive pre-upgrade command.

