Comments (7)
There is clear value in a RabbitMQ node informing whatever manages its lifecycle (e.g. BOSH, K8S, Chef) when it is safe to shut down. This was the direction in which the initial checks started going, specifically answering the question of whether the quorum queues and classic mirrored queues are in sync. The implication was that it is safe to shut down the node, since there are sufficient healthy replicas in the system.
We are now learning that in practice a different command would be more useful, e.g. rabbitmqctl await_safe_replica_state. I cannot help but see the similarity to a different command, which does the opposite: rabbitmqctl await_startup. 🤔💡 I would like us to go one step further, and imagine the following:
rabbitmqctl shutdown --safe --keep-checking-every 60 --timeout 3600
The --safe flag would check for the following conditions to be true:
- all queues that should be running on this node are up and running
- the classic mirrored queues with processes on this node are synchronised (if no master or mirror is running on the node being shut down, we should not take this queue's sync state into account)
- all quorum queues that span this node have peers which are in either follower or leader state
It is easy to see how we can keep improving the semantics of the --safe flag in the future (e.g. ensure there are no alarms, etc.), but for the first implementation, the above checks feel sufficient.
If any of the --safe conditions are not met, the --keep-checking-every flag controls the interval of a periodic check which re-runs all checks every n seconds. In its absence, we could default to a sensible interval - 30s?
If the command does not return with exit status 0 within --timeout seconds, then it returns a non-zero exit status after this period.
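The proposed interplay of --keep-checking-every and --timeout can be sketched as a small polling loop. This is only an illustration of the semantics described above, not how rabbitmqctl implements anything: the function name await_safe, its parameters, and the exit status 1 on timeout are assumptions made for the example, and the checks are passed in as plain callables rather than real queue inspections.

```python
import time
from typing import Callable, Iterable


def await_safe(
    checks: Iterable[Callable[[], bool]],
    interval: float = 30.0,    # hypothetical default for --keep-checking-every
    timeout: float = 3600.0,   # hypothetical value for --timeout
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Re-run all checks every `interval` seconds until they all pass.

    Returns 0 once every check passes, or 1 (an assumed non-zero status)
    if `timeout` seconds elapse first.
    """
    checks = list(checks)
    deadline = clock() + timeout
    while True:
        # All conditions must hold in the same round for the node to be "safe".
        if all(check() for check in checks):
            return 0
        remaining = deadline - clock()
        if remaining <= 0:
            return 1
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

Injecting sleep and clock keeps the loop testable without real waiting; a caller would normally leave both at their defaults.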
While I can see the benefits of having rabbitmq-upgrade await_online_synchronised_mirrors & rabbitmq-upgrade await_minimum_online_quorum, as well as the combination of the two commands in rabbitmq-upgrade await_safe_replica_state, I favour the simplicity of the --safe flag in the context of the shutdown command. This keeps our API small and allows us to improve the semantics of this behaviour without requiring those that use it to make changes on their end. BOSH & K8S can continue using the same rabbitmqctl shutdown --safe, and we make the changes that we learn are necessary to respect the "Yes, I am safe to be shut down" contract.
I would like to hear more opinions on this, especially from those that have experience operating other distributed stateful services, like Cassandra, Oracle, Riak etc. cc @dumbbell @lukebakken @kjnilsson @mkuratczyk
from rabbitmq-cli.
I agree with @lukebakken. Waiting for enough replicas to be on the same page may never happen if there is a steady ingress of messages. The node should kick all clients and stop listening first.
Also, I disagree with the notion of "pre upgrade". Semantically and from a user point of view, this limits the scope of this state to upgrades only. What if we just want to restart a node for whatever reason? To me, a more appropriate name would be "maintenance mode".
I like the --safe flag. However, a couple of advantages of a rabbitmq-upgrade pre_upgrade command would be:
- It nicely complements the existing rabbitmq-upgrade post_upgrade command.
- The rabbitmqctl shutdown command already has a flag called --wait, which is different from the blocking / waiting we want from the --safe flag. This might be confusing to the user.
- The rabbitmqctl stop command would also need the --safe flag.
- In the BOSH lifecycle hooks, the natural place to call pre_upgrade would be the pre-stop script. In the pre-stop script, we have all the information about whether the deployment is going to be deleted, in which case there's no need to wait for a safe replica state. (In the drain script there is a way to find out whether a node is going to be deleted (via the BOSH_JOB_NEXT_STATE variable), but not whether that is because of a scale-down or because the deployment is going to be deleted.)
- Other tools such as Cassandra also have a separate command which makes sure it's safe to shut down a node which is going to be upgraded: nodetool drain - "Typically, use this command before upgrading a node to a new version of DSE."
👍 to pre_upgrade (or something similar). People have requested that RabbitMQ support putting a node into a "drained" state - i.e. close existing connections / channels, do not accept new ones, move queue masters / mirrors off of the node.
An "upgrade mode" and upgrade steps in general involve much more than what's suggested in this issue, but we will add pre_upgrade that will only call the above functions for now, much like post_upgrade currently only runs queue master rebalancing.
@dumbbell good point: it's true that a user might just want to restart a node without upgrading it (e.g. by using the bosh restart command), in which case pre_upgrade would not be a good name.
The state of today is that we have the two rabbitmq-diagnostics check_if_node_is_xxx_critical commands. My understanding is that their original purpose was for deployment tools to know when it is not safe to shut down for upgrades, in which case they would end up waiting until it's safe.
This waiting logic has been implemented in different deployment tools (BOSH cf-rabbitmq-release and K8s operator).
The motivation behind this issue is to avoid this code duplication for the waiting logic by having a new CLI command.
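For illustration, the check at the core of that duplicated waiting logic might look roughly like the sketch below. It assumes that check_if_node_is_xxx_critical refers to the mirror-sync and quorum variants of the rabbitmq-diagnostics command and that exit status 0 means "not critical"; the helper names run_exit_code and node_is_safe_to_stop are hypothetical and are not taken from cf-rabbitmq-release or the K8s operator.

```python
import subprocess
from typing import Callable, Sequence

# Assumed expansion of "check_if_node_is_xxx_critical".
CRITICAL_CHECKS = (
    ("rabbitmq-diagnostics", "check_if_node_is_mirror_sync_critical"),
    ("rabbitmq-diagnostics", "check_if_node_is_quorum_critical"),
)


def run_exit_code(argv: Sequence[str]) -> int:
    """Run a command, discard its output, and return its exit status."""
    completed = subprocess.run(
        list(argv),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def node_is_safe_to_stop(
    run: Callable[[Sequence[str]], int] = run_exit_code,
) -> bool:
    """Return True only if every critical check exits with status 0."""
    return all(run(argv) == 0 for argv in CRITICAL_CHECKS)
```

A deployment tool would call something like node_is_safe_to_stop() repeatedly from its pre-stop or drain hook until it returns True or the tool's own deadline passes.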
We know that in future, it's best to have something like a proper drain command (or whatever it will be called) which puts a node into "upgrade mode" / "maintenance mode" / "drained state". This drain command would - as @lukebakken and @dumbbell suggest - close connections, move queue masters to a different node, etc. Having such a drain command in future would make the rabbitmq-diagnostics check_if_node_is_xxx_critical commands redundant from the point of view of the deployment tools that want to upgrade a particular node (in a rolling upgrade fashion).
Having now, as suggested in this issue, an additional command which wraps the two rabbitmq-diagnostics check_if_node_is_xxx_critical commands and waits until their exit status becomes 0 doesn't feel right from my point of view if we know that there will be this drain command at some point in the future.
For me, it's fine to have this code duplication in the deployment tools. It sounds better to me than introducing new commands which might not be needed anymore in the future when this drain command comes out.
But I think there is no right or wrong. The question is whether the RabbitMQ command line interface follows a minimal interface (which argues this duplication is acceptable) or a humane interface (which argues this duplication is bad).
What exactly goes into a pre-upgrade command or a "maintenance mode" is up for another discussion (update: now there is rabbitmq/rabbitmq-server#2321). The commands suggested in this issue would remove some duplication that'd have to be present in every deployment tool and can be reused one way or another in a more extensive pre-upgrade command.