This guide covers topics related to RabbitMQ installation upgrades.
It is important to consider a number of things before upgrading RabbitMQ.
Changes between RabbitMQ versions are documented in the change log.
This guide covers two major upgrade scenarios, a single node and a cluster, and the two most commonly used strategies, in-place upgrades and Blue/Green deployment.
An in-place upgrade usually involves the following steps performed by a deployment tool or manually by an operator. Each step is covered in more detail later in this guide. An intentionally oversimplified list of steps would include:
Rolling upgrades between certain versions are not supported. Full Stop Upgrades covers the process for those cases.
The Blue/Green deployment strategy offers the benefit of making the upgrade process safer at the cost of temporarily increasing the infrastructure footprint. The safety aspect comes from the fact that the operator can abort an upgrade by switching applications back to the existing cluster.
The rest of the guide covers each upgrade step in more detail.
When an upgrade jumps multiple release series (e.g. goes from 3.4.x to 3.6.x), it may be necessary to perform an intermediate upgrade first. For example, when upgrading from 3.2.x to 3.7.x, it would be necessary to first upgrade to 3.6.x and then upgrade to 3.7.0.
A full cluster stop may be required for feature version upgrades.
Current release series upgrade compatibility with rolling upgrade:
From | To |
---|---|
3.10.x | 3.11.x |
3.9.x | 3.10.x |
3.8.x | 3.9.x |
3.7.18 | 3.8.x |
Current release series upgrade compatibility with full stop upgrade:
From | To |
---|---|
3.10.x | 3.11.x |
3.9.x | 3.10.x |
3.8.x | 3.9.x |
3.7.27 | 3.9.x |
3.6.x | 3.8.x |
3.6.x | 3.7.x |
3.5.x | 3.7.x |
=< 3.4.x | 3.6.16 |
3.7.18 and later 3.7.x versions support rolling upgrades to 3.8.x using feature flags.
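The full stop table above can be read as a lookup from a source series to the farthest target it lists. The sketch below encodes that lookup in a hypothetical `plan_upgrade` helper (not part of RabbitMQ) and prints one possible sequence of hops. It only handles 3.x series, and it assumes that stopping at a series short of the listed target is also acceptable. That assumption does not come from the table, so the release notes remain the authority.

```shell
# plan_upgrade FROM_SERIES TO_SERIES, e.g. plan_upgrade 3.2 3.7
# Hypothetical helper: prints one hop sequence derived from the
# full stop upgrade table. Only handles 3.x series.
plan_upgrade() {
  cur="${1#3.}"   # minor number of the current series
  tgt="${2#3.}"   # minor number of the target series
  path="3.$cur"
  while [ "$cur" -lt "$tgt" ]; do
    # Farthest series the table lists for each source series.
    case "$cur" in
      [0-4]) max=6 ;;   # "=< 3.4.x" row, the listed target is 3.6.16
      5)     max=7 ;;
      6)     max=8 ;;
      7)     max=9 ;;   # the table lists 3.7.27 as the source version
      8)     max=9 ;;
      9)     max=10 ;;
      10)    max=11 ;;
      *)     echo "no known path from 3.$cur" >&2; return 1 ;;
    esac
    # Assumption: stopping short of the listed target is also fine.
    if [ "$max" -lt "$tgt" ]; then cur="$max"; else cur="$tgt"; fi
    path="$path -> 3.$cur"
  done
  echo "$path"
}

plan_upgrade 3.2 3.7   # prints: 3.2 -> 3.6 -> 3.7
```

The printed path matches the 3.2.x to 3.7.x example given earlier: an intermediate stop at 3.6 followed by the final upgrade.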
We recommend that you upgrade Erlang together with RabbitMQ. Please refer to the Erlang Version Requirements guide.
On-disk data of priority queues currently cannot be migrated in place between 3.6 and 3.7 (or a later series). If an upgrade is performed in place, such queues will start empty (without any messages) after the node restarts.
To migrate an environment with priority queues and preserve their content (messages), a blue-green upgrade strategy should be used.
Unless otherwise specified in release notes, the RabbitMQ plugin API introduces no breaking changes within a release series (e.g. between 3.6.11 and 3.6.16). When upgrading to a new minor version (e.g. 3.7.0), plugins must be upgraded to versions that support the new RabbitMQ version series.
In rare cases patch versions of RabbitMQ can break some plugin APIs. Such cases will be documented in the breaking changes section of the release notes document.
The Community plugins page contains information on RabbitMQ version support for plugins not included in the RabbitMQ distribution.
The RabbitMQ management plugin comes with a Web application that runs in the browser. Clearing the browser cache, local storage, session storage and cookies after an upgrade is recommended.
Sometimes a new feature release drops a plugin or multiple plugins from the distribution. For example, rabbitmq_management_visualiser no longer ships with RabbitMQ as of 3.7.0. Such plugins must be disabled before the upgrade. A node that has a missing plugin enabled will fail to start.
Different versions of RabbitMQ can have different resource usage. That should be taken into account before upgrading: make sure there's enough capacity to run the workload with the new version. Always consult the release notes of all versions between the one currently deployed and the target one to find out about changes that could impact your workload and resource usage.
In RabbitMQ versions before 3.6.7 all management stats in a cluster were collected on a single node (the stats DB node). This put a lot of additional load on this node. Starting with RabbitMQ 3.6.7 each cluster node stores its own stats. It means that metrics (e.g. rates) for each node are stored and calculated locally. Therefore all nodes will consume a bit more memory and CPU resources to handle that. The benefit is that there is no single overloaded stats node.
When an HTTP API request comes in, the stats are aggregated on the node which handles the request. If HTTP API requests are not distributed between cluster nodes, it can put some additional load on that node's CPU and memory resources. In practice stats database-related overload is a thing of the past.
Individual node resource usage change is workload-specific. The best way to measure it is by reproducing a comparable workload in a temporary QA environment before upgrading production systems.
In RabbitMQ versions before 3.6.11 memory used by the node was calculated using a runtime-provided mechanism that's not very precise. The actual memory allocated by the OS process usually was higher.
Starting with RabbitMQ 3.6.11 a number of strategies are available. On Linux, macOS, and BSD systems, operating system facilities will be used to compute the total amount of memory allocated by the node. It is possible to go back to the previous strategy, although that's not recommended. See the Memory Usage guide for details.
After upgrading from a version prior to 3.6.11 to 3.6.11 or later, the memory usage reported by the management UI will increase. The effective node memory footprint didn't actually change but the calculation is now more accurate and no longer underreports.
Nodes that often hovered around their RAM high watermark will see more frequent memory alarms and publishers will be blocked more often. On the upside this means that RabbitMQ nodes are less likely to be killed by the out-of-memory (OOM) mechanism of the OS.
When upgrading a single node installation, simply stop the node, install the new version and start it again. The node will perform all the necessary local database migrations on start. Depending on the nature of the migrations and the data set size, this can take some time.
A data directory backup is performed before applying any migrations. The backup is deleted after a successful upgrade. Upgrades can therefore temporarily double the amount of disk space the node's data directory uses.
Client (application) connections will be dropped when the node stops. Applications need to be prepared to handle this and reconnect.
With some distributions (e.g. the generic binary UNIX) you can install a newer version of RabbitMQ without removing or replacing the old one, which can make upgrade faster. You should make sure the new version uses the same data directory.
RabbitMQ does not support downgrades; it's strongly advised to back up the node's data directory before upgrading.
Depending on what versions are involved in an upgrade, a RabbitMQ cluster may provide an opportunity to perform upgrades without cluster downtime using a procedure known as a rolling upgrade. A rolling upgrade is when nodes are stopped, upgraded and restarted one by one, with the rest of the cluster still running while each node is being upgraded.
If rolling upgrades are not possible, the entire cluster should be stopped, then restarted. This is referred to as a full stop upgrade.
Client (application) connections will be dropped when each node stops. Applications need to be prepared to handle this and reconnect.
Rolling upgrades are possible only between compatible RabbitMQ and Erlang versions.
RabbitMQ 3.8.0 comes with a feature flag subsystem which is responsible for determining if two versions of RabbitMQ are compatible. If they are, then two nodes with different versions can live in the same cluster: this allows a rolling upgrade of cluster members without shutting down the cluster entirely.
The upgrade from RabbitMQ 3.7.x to 3.8.x is also permitted, but not from older minor or major versions.
To learn more, please read the feature flags documentation.
With RabbitMQ up to and including 3.7.x, when upgrading from one major or minor version of RabbitMQ to another (e.g. from 3.0.x to 3.1.x, or from 2.x.x to 3.x.x), the whole cluster must be taken down for the upgrade. Clusters that include nodes running different release series are not supported.
Rolling upgrades from one patch version to another (e.g. from 3.6.x to 3.6.y) are supported except when indicated otherwise in the release notes. It is strongly recommended to consult release notes before upgrading.
Some patch releases known to require a cluster-wide restart:
A RabbitMQ node will fail to [re-]join a peer running an incompatible version.
When upgrading Erlang it's advised to run all nodes on the same major series (e.g. 19.x or 20.x). Even though it is possible to run a cluster with mixed Erlang versions, they can have incompatibilities that will affect cluster stability.
Running mixed Erlang versions can result in internal inter-node communication protocol incompatibilities. When a node detects such an incompatibility it will refuse to join its peer (cluster).
Upgrading to a new minor or patch version of Erlang usually can be done using a rolling upgrade.
It is important to let the node being upgraded fully start and sync all data from its peers before proceeding to upgrade the next one. You can check for that via the management UI. Confirm that:
During a rolling upgrade, client connection recovery will make sure that connections are rebalanced. Primary queue replicas will migrate to other nodes. In practice this will put more load on the remaining cluster nodes. This can impact performance and stability of the cluster. It's not recommended to perform rolling upgrades under high load.
Starting with RabbitMQ 3.8.8, nodes can be put into maintenance mode to prepare them for shutdown during rolling upgrades.
Maintenance mode is a special node operation mode introduced in RabbitMQ 3.8.8. The mode is explicitly turned on and off by the operator using the CLI commands covered below. For mixed-version cluster compatibility, this feature must be enabled using a feature flag once all cluster members have been upgraded to a version that supports it:
rabbitmqctl enable_feature_flag maintenance_mode_status
To put a node under maintenance, use rabbitmq-upgrade drain:
rabbitmq-upgrade drain
As with all other CLI commands, this command can be invoked against an arbitrary node (including remote ones) using the -n switch:
# puts node rabbit@node2.cluster.rabbitmq.svc into maintenance mode
rabbitmq-upgrade drain -n rabbit@node2.cluster.rabbitmq.svc
When a node is in maintenance mode, it will not be available for serving client traffic and will try to transfer as many of its responsibilities as practically possible and safe.
Currently this involves the following steps:
A node in maintenance mode will not be considered for new primary queue replica placement, regardless of queue type and the queue leader locator policy used.
This feature is expected to evolve based on the feedback from RabbitMQ operators, users, and RabbitMQ core team's own experience with it.
A node in maintenance mode is expected to be shut down, upgraded or reconfigured, and restarted in a short period of time (say, 5-30 minutes). Nodes are not expected to be running in this mode for long periods of time.
A node in maintenance mode can be revived, that is, brought back into its regular operational state, using rabbitmq-upgrade revive:
rabbitmq-upgrade revive
As with all other CLI commands, this command can be invoked against an arbitrary node (including remote ones) using the -n switch:
# revives node rabbit@node2.cluster.rabbitmq.svc from maintenance
rabbitmq-upgrade revive -n rabbit@node2.cluster.rabbitmq.svc
When a node is revived or restarted (e.g. after an upgrade), it will again accept client connections and be considered for primary queue replica placements.
It will not recover previous client connections as RabbitMQ never initiates connections to clients, but clients will be able to reconnect to it.
If the maintenance mode status feature flag is enabled, node maintenance status will be reported in rabbitmq-diagnostics status and rabbitmq-diagnostics cluster_status.
If the feature flag is not enabled, the status will be reported as unknown.
Here's an example rabbitmq-diagnostics status output of a node under maintenance:
Status of node rabbit@hostname ...
Runtime

OS PID: 25531
OS: macOS
Uptime (seconds): 48540
Is under maintenance?: true

# ...
Compare this to this example output from a node in regular operating mode:
Status of node rabbit@hostname ...
Runtime

OS PID: 25531
OS: macOS
Uptime (seconds): 48540
Is under maintenance?: false

# ...
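Put together, the maintenance mode commands above bracket the actual upgrade of a node. A minimal sketch of that sequence follows. The function names are hypothetical, the package upgrade and service start in between are environment-specific and left out, and `rabbitmqctl shutdown` and `rabbitmqctl await_startup` are used here as the stop and wait steps, which is an assumption about the tooling in use rather than a requirement of this guide.

```shell
# Hypothetical helper: drain a node, then stop it so it can be upgraded.
prepare_node_for_upgrade() {
  node="$1"
  # Move responsibilities away from the node before stopping it.
  rabbitmq-upgrade drain -n "$node" || return 1
  rabbitmqctl shutdown -n "$node"
}

# Hypothetical helper: after the node has been upgraded and started again
# (not shown), block until it has finished booting.
wait_for_upgraded_node() {
  rabbitmqctl await_startup -n "$1"
}
```

No explicit `revive` call is needed in this flow, because a restarted node leaves maintenance mode on its own, as described above.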
When an entire cluster is stopped for upgrade, the order in which nodes are stopped and started is important.
RabbitMQ will automatically update its data directory if necessary when upgrading between major or minor versions. In a cluster, this task is performed by the first disc node to be started (the "upgrader" node).
Therefore when upgrading a RabbitMQ cluster using the "full stop" method, a disc node must start first. Starting a RAM node first is not going to work: the node will log an error and stop.
During an upgrade, the last disc node to go down must be the first node to be brought online. Otherwise the started node will emit an error message and fail to start up. Unlike an ordinary cluster restart, upgrading nodes will not wait for the last disc node to come back online.
While not strictly necessary, it is a good idea to decide ahead of time which disc node will be the upgrader, stop that node last, and start it first. Otherwise changes to the cluster configuration that were made between the upgrader node stopping and the last node stopping will be lost.
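The ordering rule above (stop the chosen upgrader last, start it first) can be written down before the maintenance window so that nobody has to work it out under pressure. A small sketch, using a hypothetical `full_stop_order` helper and made-up node names:

```shell
# Hypothetical helper: given the chosen upgrader (a disc node) followed by
# the remaining nodes, print the order to stop and then start them in.
full_stop_order() {
  upgrader="$1"; shift
  echo "stop: $* $upgrader"    # upgrader goes down last
  echo "start: $upgrader $*"   # and comes back up first
}

full_stop_order rabbit@a rabbit@b rabbit@c
# prints:
#   stop: rabbit@b rabbit@c rabbit@a
#   start: rabbit@a rabbit@b rabbit@c
```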
There are some minor things to consider during the upgrade process when stopping and restarting nodes.
Known bugs in the Erlang runtime can affect upgrades. The most common issues involve nodes hanging during shutdown, which blocks subsequent upgrade steps:
Please note that both issues affect old and no longer supported versions of Erlang.
A node that suffers from the above bugs will fail to shut down and will stop responding to inbound connections, including those of CLI tools. Such a node's OS process has to be terminated (e.g. using kill -9 on UNIX systems).
Please note that in the presence of many messages it can take a node several minutes to shut down cleanly, so if a node responds to CLI tool commands it could be performing various shutdown activities such as moving enqueued messages to disk.
The following commands can be used to verify whether a node is experiencing the above bugs. An affected node will not respond to CLI connections in a reasonable amount of time when performing the following basic commands:
rabbitmq-diagnostics ping
rabbitmq-diagnostics status
Quorum queues depend on a quorum of nodes to be online for any queue operations to succeed. This includes successful new leader election should a cluster node that hosts some leaders shut down.
In the context of rolling upgrades this means that a quorum of nodes must be present at all times during an upgrade. If this is not the case, quorum queues will become unavailable and will not be able to satisfy their data safety guarantees.
Latest RabbitMQ releases provide a health check command that would fail should any quorum queues on the target node lose their quorum in case the node was to be shut down:
# Exits with a non-zero code if one or more quorum queues will lose online quorum
# should target node be shut down
rabbitmq-diagnostics check_if_node_is_quorum_critical
For example, consider a three node cluster with nodes A, B, and C. If node B is currently down and there are quorum queues with leader replica on node A, this check will fail if executed against node A. When node B comes back online, the same check would succeed because the quorum queues with leader on node A would have a quorum of replicas online.
Quorum queue quorum state can be verified by listing queues in the management UI or using rabbitmq-queues:
rabbitmq-queues -n rabbit@to-be-stopped quorum_status <queue name>
In environments that use classic mirrored queues, it is important to make sure that all mirrored queues on a node have a synchronised follower replica (mirror) before stopping that node.
RabbitMQ will not promote unsynchronised queue mirrors on controlled queue leader shutdown when default promotion settings are used. However, if a queue leader encounters any errors during shutdown, an unsynchronised queue mirror might still be promoted. It is generally a safer option to synchronise all classic mirrored queues with replicas on a node before shutting the node down.
Latest RabbitMQ releases provide a health check command that would fail should any classic mirrored queues on the target node have no synchronised mirrors:
# Exits with a non-zero code if target node hosts leader replica of at least one queue
# that has out-of-sync mirror.
rabbitmq-diagnostics check_if_node_is_mirror_sync_critical
For example, consider a three node cluster with nodes A, B, and C. If there are classic mirrored queues with the only synchronised replica on node A (the leader), this check will fail if executed against node A. When one of the other replicas is re-synchronised, the same check would succeed because there would be at least one replica suitable for promotion.
Classic mirrored queue replica state can be verified by listing queues in the management UI or using rabbitmqctl:
# For queues with non-empty `mirror_pids`, you must have at least one
# `synchronised_mirror_pids`.
#
# Note that mirror_pids is a new field alias introduced in RabbitMQ 3.11.4
rabbitmqctl -n rabbit@to-be-stopped list_queues --local name mirror_pids synchronised_mirror_pids
If there are unsynchronised queues, either enable automatic synchronisation or trigger it using rabbitmqctl manually.
RabbitMQ shutdown process will not wait for queues to be synchronised if a synchronisation operation is in progress.
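In a scripted rolling upgrade, the two health checks described above can serve as a gate before each node is stopped. The sketch below wraps them in a hypothetical `safe_to_stop` helper; only the two `rabbitmq-diagnostics` commands come from this guide, the rest is illustrative.

```shell
# Hypothetical helper: succeeds only if both health checks pass for NODE,
# i.e. stopping it would neither cost a quorum queue its online quorum
# nor leave a classic mirrored queue without a synchronised mirror.
safe_to_stop() {
  node="$1"
  rabbitmq-diagnostics -n "$node" check_if_node_is_quorum_critical &&
    rabbitmq-diagnostics -n "$node" check_if_node_is_mirror_sync_critical
}

# Example gate in an upgrade script:
#   safe_to_stop rabbit@to-be-stopped || { echo "not safe to stop" >&2; exit 1; }
```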
Some upgrade scenarios can cause mirrored queue leaders to be unevenly distributed between nodes in a cluster. This will put more load on the nodes with more queue leaders. For example a full-stop upgrade will make all queue leaders migrate to the "upgrader" node - the one stopped last and started first. A rolling upgrade of three nodes with two mirrors will also cause all queue leaders to be on the same node.
You can move a queue leader for a queue using a temporary policy with ha-mode: nodes and ha-params: [<node>]. The policy can be created via the management UI or the rabbitmqctl command:
# apply a temporary policy that pins the queue to the new leader node
rabbitmqctl set_policy --apply-to queues --priority 100 move-my-queue '^<queue>$' '{"ha-mode":"nodes", "ha-params":["<new-master-node>"]}'
# remove the temporary policy once the leader has moved
rabbitmqctl clear_policy move-my-queue
A queue leader rebalancing script is available. It rebalances queue leaders for all queues.
The script has certain assumptions (e.g. the default node name) and can fail to run on some installations. The script should be considered experimental. Run it in a non-production environment first.
A queue leader rebalance command is available. It rebalances queue leaders for all queues, or those that match the given name pattern. Leaders for mirrored queues and quorum queues are also rebalanced by the post-upgrade command.
There is also a third-party plugin that rebalances queue leaders. The plugin has some additional configuration and reporting tools, but is not supported or verified by the RabbitMQ team. Use at your own risk.
In order to reduce or eliminate the downtime, applications (both producers and consumers) should be able to cope with a server-initiated connection close. Some client libraries offer automatic connection recovery to help with this:
In most client libraries there is a way to react to a connection closure, for example:
The recovery procedure for many applications follows the same steps:
Topology recovery includes the following actions, performed for every channel:
This algorithm covers the majority of use cases and is what the aforementioned automatic recovery feature implements.
During a rolling upgrade when a node is stopped, clients connected to this node will be disconnected using a server-sent connection.close method and should reconnect to a different node. This can be achieved by using a load balancer or proxy in front of the cluster or by specifying multiple server hosts if client library supports this feature.
Many client libraries support host lists, for example:
If the value of the environment variable COMPUTERNAME does not equal HOSTNAME (upper vs lower case, or other differences) please see the Windows Quirks guide for instructions on how to upgrade RabbitMQ.
Patch releases contain bugfixes and features which do not break compatibility with plugins and clusters. Rarely there are exceptions to this statement: when this happens, the release notes will indicate when two patch releases are incompatible.
Minor version releases contain new features and bugfixes which do not fit a patch release.
As soon as a new minor version is released (e.g. 3.7.0), previous version series (3.6) will have patch releases for critical bug fixes only.
There will be no new patch releases for versions after EOL.
Version 3.5.x reached its end of life on 2017-09-11; 3.5.8 is the last patch for 3.5. It's recommended to always upgrade at least to the latest patch release in a series.
The release notes may indicate specific additional upgrade steps. Always consult the release notes of all versions between the one currently deployed and the target one.
Some upgrade paths, e.g. from 3.4.x to 3.7.x, will require an intermediate upgrade. See the RabbitMQ Version Upgradability section above.
Check if the current Erlang version is supported by the new RabbitMQ version. See the Erlang Version Requirements guide. If not, Erlang should be upgraded together with RabbitMQ.
It's generally recommended to upgrade to the latest Erlang version supported to get all the latest bugfixes.
If you are using Debian or RPM packages, you must ensure that all dependencies are available, in particular the correct version of Erlang. You may have to set up additional third-party package repositories to achieve that.
Please read recommendations for Debian-based and RPM-based distributions to find the appropriate repositories for Erlang.
A rolling upgrade may be possible if the Erlang and RabbitMQ version changes support it.
See the Upgrading Multiple Nodes section above.
Make sure nodes are healthy and there are no network partitions and no disk or memory alarms in effect.
RabbitMQ management UI, CLI tools or HTTP API can be used for assessing the health of the system.
The overview page in the management UI displays effective RabbitMQ and Erlang versions, multiple cluster-wide metrics and rates. From this page ensure that all nodes are running and they are all "green" (w.r.t. file descriptors, memory, disk space, and so on).
We recommend recording the number of durable queues, the number of messages they hold and other pieces of information about the topology that are relevant. This data will help verify that the system operates within reasonable parameters after the upgrade.
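One way to capture such a baseline is to summarise the output of `rabbitmqctl list_queues`. The sketch below assumes tab-separated output with the columns `name`, `durable` and `messages` in that order; `summarise_durable_queues` is a hypothetical helper, not a RabbitMQ command. Rows whose second column is not `true`, including any header row, are ignored.

```shell
# Hypothetical helper: reads "name<TAB>durable<TAB>messages" rows on stdin
# and prints the number of durable queues and the messages they hold.
summarise_durable_queues() {
  awk -F '\t' '
    $2 == "true" { queues++; messages += $3 }
    END { printf "durable_queues=%d messages=%d\n", queues, messages }
  '
}

# Intended use, run once before and once after the upgrade:
#   rabbitmqctl list_queues --quiet name durable messages | summarise_durable_queues
```

Comparing the two summaries gives a quick signal that no durable queues went missing, although message counts will naturally drift if applications keep running.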
Use node health checks to vet individual nodes.
Queues in flow state or blocked/blocking connections might be ok, depending on your workload. It's up to you to determine whether this is a normal situation or whether the cluster is under unexpected load, and therefore whether it's safe to continue with the upgrade.
However, if there are queues in an undefined state (a.k.a. NaN or "ghost" queues), you should first start by understanding what is wrong before starting an upgrade.
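The node health checks mentioned above can be scripted so that every node is vetted the same way before the upgrade starts. A minimal sketch, assuming the `ping`, `check_running` and `check_local_alarms` checks are available in the installed `rabbitmq-diagnostics` version; `cluster_is_healthy` is a hypothetical helper name. It does not detect network partitions or queues in an undefined state, which still need a manual look.

```shell
# Hypothetical helper: run basic health checks against each node given
# and stop at the first node that fails one of them.
cluster_is_healthy() {
  for node in "$@"; do
    rabbitmq-diagnostics -q -n "$node" ping &&
      rabbitmq-diagnostics -q -n "$node" check_running &&
      rabbitmq-diagnostics -q -n "$node" check_local_alarms || {
        echo "health check failed on $node" >&2
        return 1
      }
  done
}

# Example:
#   cluster_is_healthy rabbit@a rabbit@b rabbit@c || exit 1
```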
The upgrade process can require additional resources. Make sure there are enough resources available to proceed, in particular free memory and free disk space.
It's recommended to have at least half of the system memory free before the upgrade. The default memory high watermark is 0.4 so it should be ok, but you should still double-check. Starting with RabbitMQ 3.6.11 the way nodes calculate their total RAM consumption has changed.
When upgrading from an earlier version, it is required that the node has enough free disk space to fit at least a full copy of the node data directory. Nodes create backups before proceeding to upgrade their database. If disk space is depleted, the node will abort upgrading and may fail to start until the data directory is restored from the backup.
For example, if you have 10 GiB of free system memory and the Erlang process (i.e. beam.smp) memory footprint is around 6 GiB, then it can be unsafe to proceed. Likewise w.r.t. disk if you have 10 GiB of free space and the data directory (e.g. /var/lib/rabbitmq) takes 10 GiB.
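The disk half of this check is easy to automate. The sketch below compares the size of the data directory with the free space on the filesystem that holds it, and treats the 10 GiB versus 10 GiB case from the example above as unsafe by requiring strictly more free space than data. Both function names are hypothetical, and the exact margin to demand is a judgment call rather than something this guide specifies.

```shell
# Pure comparison: succeeds only if free space (KiB) strictly exceeds
# the size of the data directory (KiB).
fits_backup() {
  used_kb="$1"; free_kb="$2"
  [ "$free_kb" -gt "$used_kb" ]
}

# Hypothetical helper: measure DATA_DIR and the filesystem it lives on.
has_room_for_backup() {
  data_dir="$1"
  used="$(du -sk "$data_dir" | awk '{print $1}')"
  free="$(df -Pk "$data_dir" | awk 'NR == 2 {print $4}')"
  fits_backup "$used" "$free"
}

# Example:
#   has_room_for_backup /var/lib/rabbitmq || echo "not enough disk space" >&2
```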
When upgrading a cluster using the rolling upgrade strategy, be aware that queues and connections can migrate to other nodes during the upgrade.
If queues are mirrored to a subset of the cluster only (as opposed to all nodes), new mirrors will be created on running nodes when the to-be-upgraded node shuts down. If clients support connections recovery and can connect to different nodes, they will reconnect to the nodes that are still running. If clients are configured to create exclusive queues, these queues might be recreated on different nodes after client reconnection.
To handle such migrations, make sure you have enough spare resources on the remaining nodes so they can handle the extra load. Depending on the load balancing strategy, all the connections from the stopped node can go to a single node, so that node should be able to handle up to twice as many connections. It's generally good practice to run a cluster with N+1 redundancy (resource-wise), so you always have the capacity to handle a single node being down.
It's always good to have a backup before upgrading. See the backup guide for instructions.
To make a proper backup, you may need to stop the entire cluster. Depending on your use case, you may make the backup while the cluster is stopped for the upgrade.
It's recommended to upgrade the Erlang version together with RabbitMQ, because both actions require a restart and recent RabbitMQ versions work better with recent Erlang versions.
Depending on cluster configuration, you can use either single node upgrade, rolling upgrade or full-stop upgrade strategy.
Like you did before the upgrade, verify the health and data to make sure your RabbitMQ nodes are in good shape and the service is running again.
If the new version provides new feature flags, you can now enable them, provided you have upgraded all nodes and are sure you do not want to roll back. See the feature flags guide.
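As a sketch of that last step, the commands below list the feature flags and then enable all of them using `rabbitmqctl`. The wrapper name `enable_new_feature_flags` is hypothetical. Because enabled feature flags cannot be turned off again, this should only run once every node is on the new version.

```shell
# Hypothetical helper: show the current flag state, then enable every flag.
# Only run this once all nodes are upgraded and a rollback is ruled out.
enable_new_feature_flags() {
  rabbitmqctl list_feature_flags || return 1
  rabbitmqctl enable_feature_flag all
}
```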