Commands that wait for safe replica state of quorum and classic queues to use before node shutdown about rabbitmq-cli HOT 7 CLOSED

harshac commented on September 27, 2024

Commands that wait for safe replica state of quorum and classic queues to use before node shutdown

from rabbitmq-cli.

Comments (7)

gerhard commented on September 27, 2024

There is clear value to be had from a RabbitMQ node informing the thing that manages its lifecycle - e.g. BOSH, K8S, Chef, etc. - when it is safe to be shut down. This was the direction in which the initial checks started going, specifically answering the question of whether the quorum queues and classic mirrored queues are in sync. It was implied that it is safe to shut down the node since there are sufficient healthy replicas in the system.

We are now learning that in practice a different command would be more useful, e.g. rabbitmqctl await_safe_replica_state. I cannot help but see the similarity to a different command, which does the opposite: rabbitmqctl await_startup. 🤔💡 I would like us to go one step further, and imagine the following:

rabbitmqctl shutdown --safe --keep-checking-every 60 --timeout 3600

The --safe flag would check for the following conditions to be true:

all queues that should be running on this node are up and running
the classic mirrored queues with processes on this node are synchronised (if no master or mirror is running on the node being shut down, we should not take this queue's sync state into account)
all quorum queues that span this node have peers which are in either follower or leader state
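
To make the three conditions above concrete, here is a minimal sketch of them as a single predicate. Everything in it is hypothetical: the NodeView, ClassicMirroredQueue and QuorumQueue types and the is_safe_to_shut_down function are illustration-only names, not RabbitMQ APIs, and the third condition is read as "every peer on another node is a follower or leader", which is only one possible reading of the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClassicMirroredQueue:
    name: str
    replica_nodes: List[str]  # nodes hosting the master or a mirror
    synchronised: bool


@dataclass
class QuorumQueue:
    name: str
    member_states: Dict[str, str]  # node name -> state, e.g. "leader"


@dataclass
class NodeView:
    """Hypothetical snapshot of what one node knows about its queues."""
    node: str
    expected_queues: List[str]
    running_queues: List[str]
    classic_mirrored: List[ClassicMirroredQueue] = field(default_factory=list)
    quorum: List[QuorumQueue] = field(default_factory=list)


def is_safe_to_shut_down(view: NodeView) -> bool:
    # 1. Every queue that should be running on this node is running.
    if not set(view.expected_queues) <= set(view.running_queues):
        return False

    # 2. Classic mirrored queues with a process on this node are synchronised;
    #    queues with no master or mirror here are ignored.
    for q in view.classic_mirrored:
        if view.node in q.replica_nodes and not q.synchronised:
            return False

    # 3. Quorum queues that span this node: every peer on another node
    #    is in follower or leader state (assumed reading, see lead-in).
    for qq in view.quorum:
        if view.node not in qq.member_states:
            continue
        for peer, state in qq.member_states.items():
            if peer != view.node and state not in ("leader", "follower"):
                return False

    return True
```

A caller would build one NodeView per check run and treat a False result as "not yet safe", leaving the retry policy to whatever wraps the predicate.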

It is easy to see how we can keep improving the semantics of the --safe flag in the future (e.g. ensure there are no alarms, etc.), but for the first implementation, the above checks feel sufficient.

If any of the --safe conditions are not met, the --keep-checking-every flag controls the interval, in seconds, at which all checks are re-run. In its absence, we could default to a sensible interval - 30s?

If the command does not return with exit status 0 within --timeout seconds, then it returns a non-zero exit status after this period.
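
The interplay of --keep-checking-every and --timeout described above can be sketched as a small polling loop. This is only an illustration of the proposed semantics, not an existing rabbitmqctl implementation: await_safe and its parameters are hypothetical names, the checks are arbitrary callables, and the defaults of 30 and 3600 seconds are taken from the numbers floated in this comment.

```python
import time
from typing import Callable, Iterable


def await_safe(
    checks: Iterable[Callable[[], bool]],
    interval: float = 30.0,    # proposed --keep-checking-every default
    timeout: float = 3600.0,   # proposed --timeout value from the example
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return 0 once every check passes, or 1 if `timeout` seconds pass first."""
    checks = list(checks)
    deadline = clock() + timeout
    while True:
        # Re-run all checks on every iteration; stop at the first failure.
        if all(check() for check in checks):
            return 0
        remaining = deadline - clock()
        if remaining <= 0:
            return 1
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting; a command-line wrapper would pass the return value straight to the process exit status.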

While I can see the benefits of having rabbitmq-upgrade await_online_synchronised_mirrors & rabbitmq-upgrade await_minimum_online_quorum, as well as the combination of the two commands in rabbitmq-upgrade await_safe_replica_state, I favour the simplicity of the --safe flag in the context of the shutdown command. This keeps our API small and allows us to improve the semantics of this behaviour without requiring those who use it to make changes on their end. BOSH & K8S can continue using the same rabbitmqctl shutdown --safe, and we make the changes that we learn are necessary to respect the "Yes, I am safe to be shut down" contract.

I would like to hear more opinions on this, especially from those that have experience operating other distributed stateful services, like Cassandra, Oracle, Riak etc. cc @dumbbell @lukebakken @kjnilsson @mkuratczyk

dumbbell commented on September 27, 2024

I agree with @lukebakken. Waiting for enough replicas to be on the same page may never happen if there is a steady ingress of messages. The node should kick all clients and stop listening first.

Also, I disagree with the notion of "pre upgrade". Semantically and from a user point of view, this limits the scope of this state to upgrades only. What if we just want to restart a node for whatever reason? To me, a more appropriate name would be "maintenance mode".

ansd commented on September 27, 2024

I like the --safe flag. However, a rabbitmq-upgrade pre_upgrade command would have a few advantages:

It nicely complements the existing rabbitmq-upgrade post_upgrade command
The rabbitmqctl shutdown command already has a flag called --wait, which does something different from the blocking / waiting we want from the --safe flag. This might confuse users.
The rabbitmqctl stop command would also need the --safe flag.
In the BOSH lifecycle hook, the natural place to call pre_upgrade would be the pre-stop script. In the pre-stop script, we have all the information about whether the deployment is going to be deleted, in which case there's no need to wait for a safe replica state. (In the drain script there is a way to find out whether a node is going to be deleted (via the BOSH_JOB_NEXT_STATE variable), but not whether that is because of a scale down or because the deployment is going to be deleted.)
Other tools such as Cassandra also have a separate command which makes sure it's safe to shut down a node which is going to be upgraded: nodetool drain - "Typically, use this command before upgrading a node to a new version of DSE."

lukebakken commented on September 27, 2024

👍 to pre_upgrade (or something similar). People have requested that RabbitMQ support putting a node into a "drained" state - i.e. close existing connections / channels, do not accept new ones, move queue masters / mirrors off of the node.

michaelklishin commented on September 27, 2024

An "upgrade mode" and upgrade steps in general involve much more than what's suggested in this issue, but we will add pre_upgrade that will only call the above functions for now, much like post_upgrade currently only runs queue master rebalancing.

ansd commented on September 27, 2024

@dumbbell good point: a user might just want to restart a node without upgrading it (e.g. by using the bosh restart command), in which case pre_upgrade would not be a good name.

As of today, we have the two rabbitmq-diagnostics check_if_node_is_xxx_critical commands. My understanding is that their original purpose was to let deployment tools know when it is not safe to shut down a node for an upgrade, in which case the tools would wait until it is safe.

This waiting logic has been implemented in different deployment tools (BOSH cf-rabbitmq-release and K8s operator).
The motivation behind this issue is to avoid this code duplication for the waiting logic by having a new CLI command.
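
For illustration, the single check that such waiting logic repeats could look roughly like the sketch below. It assumes the two commands in question are rabbitmq-diagnostics check_if_node_is_mirror_sync_critical and rabbitmq-diagnostics check_if_node_is_quorum_critical, and that an exit status of 0 means the node is not critical. The function names are hypothetical and this is not the code used by cf-rabbitmq-release or the K8s operator.

```python
import subprocess
from typing import Callable, Sequence

# Assumed to be the two check_if_node_is_xxx_critical commands.
CHECKS = (
    ("rabbitmq-diagnostics", "check_if_node_is_mirror_sync_critical"),
    ("rabbitmq-diagnostics", "check_if_node_is_quorum_critical"),
)


def run_exit_status(argv: Sequence[str]) -> int:
    """Run one CLI command, discard its output and return its exit status."""
    completed = subprocess.run(
        list(argv),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def node_is_safe_to_stop(
    run: Callable[[Sequence[str]], int] = run_exit_status,
) -> bool:
    """True only if every check exits with status 0."""
    return all(run(argv) == 0 for argv in CHECKS)
```

A deployment tool would call node_is_safe_to_stop() in a loop with its own interval and timeout, which is exactly the part that each tool currently has to reimplement.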

We know that in the future, it's best to have something like a proper drain command (or whatever it will be called) which puts a node into "upgrade mode" / "maintenance mode" / "drained state". This drain command would - as @lukebakken and @dumbbell suggest - close connections, move queue masters to a different node, etc. Having such a drain command in the future would make the rabbitmq-diagnostics check_if_node_is_xxx_critical commands redundant from the point of view of the deployment tools that want to upgrade a particular node (in a rolling upgrade fashion).

Adding now, as suggested in this issue, a command that merely wraps the two rabbitmq-diagnostics check_if_node_is_xxx_critical commands and waits until their exit status becomes 0 doesn't feel right to me if we know that this drain command will arrive at some point in the future.

For me, it's fine to have this code duplication in the deployment tools. It sounds better to me than introducing new commands which might no longer be needed once this drain command comes out.

But I think there is no right or wrong. The question is whether the RabbitMQ command line interface follows a minimal interface (which argues this duplication is acceptable) or a humane interface (which argues this duplication is bad).

michaelklishin commented on September 27, 2024

What exactly goes into a pre-upgrade command or a "maintenance mode" is up for another discussion (update: now there is rabbitmq/rabbitmq-server#2321). The commands suggested in this issue would remove some duplication that'd have to be present in every deployment tool and can be reused one way or another in a more extensive pre-upgrade command.
