https://blueprints.launchpad.net/fuel/+spec/rabbitmq-disable-mirroring-for-rpc
RabbitMQ restarts caused by excessive load impact OpenStack stability. To reduce the load on RabbitMQ, it is proposed to disable mirroring for RPC queues and leave it enabled only for Ceilometer queues.
Note: the feature will be experimental in 8.0 and will be disabled by default.
When a significant load is put on OpenStack, it in turn causes high load on RabbitMQ. As a result, some nodes in the RabbitMQ cluster fail, which causes downtime for Nova, Neutron, Cinder and other OpenStack services while they reconnect to the remaining RabbitMQ nodes.
We observe this issue at scale, especially when OpenStack is deployed with DVR enabled and the boot_and_delete_server_with_secgroups Rally test is run on 200 nodes.
To mitigate the described problem, it is proposed to disable mirroring for OpenStack RPC queues. That way RPC messages will have a smaller impact on RabbitMQ, as it will not need to replicate them to all cluster nodes. In a 3-controller cluster, where each mirrored message is currently copied to all three nodes, that should reduce the load on each RabbitMQ node by a factor of 3.
Obviously, any failover will cause loss of RPC messages (and OpenStack instability as a result), but we will gain stability by reducing the number of failovers.
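The factor-of-3 estimate above can be reproduced with a back-of-the-envelope model. The sketch below is only an illustration: it assumes queue masters are spread evenly across the cluster and counts nothing but stored message copies, and the function name and message count are made up for the example.

```python
def per_node_message_load(published, cluster_size, replicas):
    """Average number of message copies each node has to store.

    Assumption: queue masters are distributed evenly over the cluster,
    so the total number of copies is shared equally between nodes.
    """
    if not 1 <= replicas <= cluster_size:
        raise ValueError("replicas must be between 1 and cluster_size")
    return published * replicas / cluster_size


published = 9000  # hypothetical number of RPC messages

# Mirroring to all nodes: every message is copied to each of the 3 nodes.
mirrored = per_node_message_load(published, cluster_size=3, replicas=3)
# No mirroring: every message lives only on the node hosting its queue.
unmirrored = per_node_message_load(published, cluster_size=3, replicas=1)

print(mirrored / unmirrored)
```

Under these assumptions the ratio equals the cluster size. Real load also depends on CPU, network and disk behaviour that this model ignores, which is why the effect still has to be measured.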
We consider Ceilometer notifications to be important to users, so we do not want to reduce safety for Ceilometer messages. To clarify: the notifications are a common part of billing solutions, so users become upset when they disappear. The most problematic scenario is when notifications accumulate in a queue because of a Ceilometer outage. A subsequent failover in that case causes a major loss of billing data. That scenario is not relevant for RPC, as RPC messages are short-lived: if a service does not process an RPC message within a minute, the RPC operation times out and the message becomes irrelevant.
None
None
None
The whole change will be made in the RabbitMQ OCF script, where the queue policy is defined.
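As a rough illustration of what such a queue policy could look like, the sketch below builds the arguments for a `rabbitmqctl set_policy` call that mirrors only notification queues. The policy name `ha-notif` and the queue name pattern are assumptions made for this example, not values taken from the OCF script. The script itself is a shell script, and Python is used here only to keep the example easy to test.

```python
import json
import re

# Hypothetical pattern: assumes Ceilometer-related queues are named
# "notifications.*", "metering.*" or "event.*".
NOTIFICATION_PATTERN = r"^(event|metering|notifications)\."


def build_set_policy_args(name, pattern, definition, apply_to="queues"):
    """Return the argument list for `rabbitmqctl set_policy`.

    The command is only constructed here, never executed.
    """
    return [
        "rabbitmqctl", "set_policy",
        "--apply-to", apply_to,
        name, pattern, json.dumps(definition, sort_keys=True),
    ]


def is_mirrored(queue_name, pattern=NOTIFICATION_PATTERN):
    """Check whether a queue name would be matched by the policy pattern."""
    return re.search(pattern, queue_name) is not None


args = build_set_policy_args(
    "ha-notif",  # hypothetical policy name
    NOTIFICATION_PATTERN,
    {"ha-mode": "all", "ha-sync-mode": "automatic"},
)

print(" ".join(args))
print(is_mirrored("notifications.info"))  # notification queue, mirrored
print(is_mirrored("compute.node-1"))      # RPC queue, left unmirrored
```

With a policy of this shape, any queue whose name does not match the pattern gets no mirroring policy at all, which is the behaviour proposed for RPC queues.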
Instead of disabling HA completely, we could use ha-mode=exactly with the count set to 2, 3, etc. But that would be much less effective than disabling HA, since some replication would still take place.
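To make the comparison concrete, the following sketch writes out the policy definitions for the alternatives and counts how many copies of each message they imply. `ha-mode` and `ha-params` are the standard RabbitMQ mirroring policy keys. The helper names and the 3-node cluster are assumptions for illustration only.

```python
def mirroring_definition(mode, count=None):
    """Build a RabbitMQ mirroring policy definition as a dict."""
    if mode == "all":
        return {"ha-mode": "all"}
    if mode == "exactly":
        if count is None or count < 1:
            raise ValueError("ha-mode=exactly needs a positive count")
        return {"ha-mode": "exactly", "ha-params": count}
    raise ValueError("unsupported mode: %s" % mode)


def copies_per_message(definition, cluster_size):
    """Number of nodes holding each message (master plus mirrors)."""
    if definition is None:  # no mirroring policy applies to the queue
        return 1
    if definition["ha-mode"] == "all":
        return cluster_size
    # "exactly" cannot place more replicas than there are nodes
    return min(definition["ha-params"], cluster_size)


cluster_size = 3  # hypothetical 3-controller deployment
for label, definition in [
    ("ha-mode=all", mirroring_definition("all")),
    ("ha-mode=exactly, count 2", mirroring_definition("exactly", 2)),
    ("no mirroring", None),
]:
    print(label, copies_per_message(definition, cluster_size))
```

In this simple model a count of 2 still stores every message twice, so it saves only a third of the copies made by full mirroring, while disabling mirroring saves two thirds.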
The change does not affect OpenStack environment upgrades. Our current upgrade procedure (for 8.0) keeps 7.0 and 8.0 controllers separate, so RabbitMQ nodes from 7.0 and 8.0 will not be joined into the same cluster. As a result, we will not end up with a RabbitMQ cluster whose nodes apply conflicting policies.
None
None
None
The change should positively affect OpenStack stability under load.
None
None
None
The change should be noted in the release notes.
None
As noted in the work items, the change needs to be tested on 200 nodes to confirm that it helps reduce the load on RabbitMQ.
None