.. This work is licensed under a Creative Commons Attribution 3.0 Unported
   License. http://creativecommons.org/licenses/by/3.0/legalcode

===================================================
Disable queue mirroring for RPC queues in RabbitMQ
===================================================

https://blueprints.launchpad.net/fuel/+spec/rabbitmq-disable-mirroring-for-rpc

RabbitMQ restarts caused by excessive load impact OpenStack stability. In
order to reduce the load on RabbitMQ, it is proposed to disable mirroring for
RPC queues and leave it enabled only for Ceilometer queues.

Note: the feature will be experimental in 8.0 and will be disabled by default.

-------------------
Problem description
-------------------

When a significant load is put on OpenStack, it in turn causes a high load on
RabbitMQ. As a result, some nodes in the RabbitMQ cluster fail, which causes
downtime for Nova, Neutron, Cinder and other OpenStack services while they
reconnect to the remaining RabbitMQ nodes. We observe this issue at scale,
especially when OpenStack is deployed with DVR enabled and the
boot_and_delete_server_with_secgroups Rally test is run on 200 nodes.

----------------
Proposed changes
----------------

In order to mitigate the described problem, it is proposed to disable
mirroring for OpenStack RPC queues. That way RPC messages will have a smaller
impact on RabbitMQ, as it will not need to replicate them to all cluster
nodes. In case of a 3-controller cluster, that should reduce the load on each
RabbitMQ node by a factor of 3. Obviously, any failover will cause RPC message
loss (and, as a result, OpenStack instability), but we will gain stability by
reducing the number of failovers.

We consider Ceilometer notifications to be important to users, so we do not
want to reduce safety for Ceilometer messages. To clarify: the notifications
are a common part of billing solutions, so users become upset when they
disappear. The most problematic scenario is when notifications accumulate in a
queue because of a Ceilometer outage - a subsequent failover in that case
causes a major loss of billing data. That case is not relevant for RPC, as RPC
messages are short-lived: if a service does not process an RPC message within
a minute, the RPC operation times out and the message becomes irrelevant.

Web UI
======

None

Nailgun
=======

None

Data model
----------

None

REST API
--------

None

Orchestration
=============

None

RPC Protocol
------------

None

Fuel Client
===========

None

Plugins
=======

None

Fuel Library
============

The whole change will be made in the RabbitMQ OCF script, where the queue
policy is defined.
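The sketch below shows roughly what such a policy switch could look like. It
is illustrative only: the policy names, the queue-name pattern and the exact
``rabbitmqctl`` invocations used by the OCF script are assumptions, not the
final implementation.

.. code-block:: bash

    # Current behaviour (simplified): one HA policy that mirrors all queues
    # across the cluster.
    rabbitmqctl set_policy ha-all '.' \
        '{"ha-mode": "all", "ha-sync-mode": "automatic"}'

    # Proposed behaviour (sketch): drop the catch-all policy and mirror only
    # the notification queues consumed by Ceilometer, leaving RPC queues
    # non-mirrored. The queue-name pattern is an assumption and has to match
    # the notification topics actually configured in the environment.
    rabbitmqctl clear_policy ha-all
    rabbitmqctl set_policy ha-notifications '^(notifications|metering)\.' \
        '{"ha-mode": "all", "ha-sync-mode": "automatic"}'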
------------
Alternatives
------------

Instead of disabling HA completely, we can use ha-mode=exactly with the mirror
count (ha-params) set to 2, 3, etc. But that will be much less effective than
disabling HA, since some replication will still take place.

--------------
Upgrade impact
--------------

The change does not affect OpenStack environment upgrade. Our current upgrade
procedure (for 8.0) keeps 7.0 and 8.0 controllers separate, so RabbitMQ nodes
from 7.0 and 8.0 will not be joined into the same cluster. As a result, we
will not have a RabbitMQ cluster with constantly changing policies.

---------------
Security impact
---------------

None

--------------------
Notifications impact
--------------------

None

---------------
End user impact
---------------

None

------------------
Performance impact
------------------

The change should positively affect OpenStack stability under load.

-----------------
Deployment impact
-----------------

None

----------------
Developer impact
----------------

None

---------------------
Infrastructure impact
---------------------

None

--------------------
Documentation impact
--------------------

The change should be noted in the release notes.

--------------
Implementation
--------------

Assignee(s)
===========

Primary assignee:
  dmitrymex

Other contributors:
  None

Mandatory design review:
  bogdando, sgolovatiuk, vkuklin

Work Items
==========

1. Implement the change in the OCF script.

2. Test it at scale and verify that it significantly reduces CPU and/or memory
   consumption (200 nodes, DVR enabled, the
   boot_and_delete_server_with_secgroups Rally test).

3. Perform destructive testing for messaging / RabbitMQ and make sure the
   failover time does not get worse. Specific scenario to test (a minimal
   client/server sketch is given in the appendix at the end of this document):

   * Start up an oslo.messaging client and server.
   * Make the client do periodic RPC calls to the server, one call per second.
   * Find the node hosting the queue used by the server and kill that node.
   * See how many requests fail before the client and server reconnect and
     recreate the queue.

4. Merge the change if it helps.

Dependencies
============

None

-----------
Testing, QA
-----------

As noted in the work items, the change needs to be tested on 200 nodes to
confirm that it helps reduce the load on RabbitMQ.

Acceptance criteria
===================

* The change considerably reduces the load on RabbitMQ in the scenario
  described in work item #2. There should be no RPC errors during normal
  operation (with all nodes working correctly).
* In case of a failover, the recovery time must not increase. This is measured
  by work item #3.

----------
References
----------

None
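------------------------------
Appendix: failover test sketch
------------------------------

For reference, below is a minimal sketch of the oslo.messaging client/server
pair described in work item #3. It is illustrative only and is not part of the
proposed change: the transport URL, topic, server name and timeout are
assumptions and have to be adjusted to the environment under test.

.. code-block:: python

    # failover_probe.py -- minimal RPC client/server pair for work item #3.
    # Run "python failover_probe.py server" on one node and
    # "python failover_probe.py client" on another, then kill the RabbitMQ
    # node hosting the server's queue and count the timed-out calls.
    import sys
    import time

    import eventlet
    eventlet.monkey_patch()

    import oslo_messaging
    from oslo_config import cfg

    # Assumption: point this at the deployed RabbitMQ cluster.
    TRANSPORT_URL = 'rabbit://guest:guest@10.20.0.2:5672/'
    TARGET = oslo_messaging.Target(topic='failover_test',
                                   server='failover_test_server')


    class PingEndpoint(object):
        """RPC endpoint that simply echoes its argument back."""
        def ping(self, ctxt, arg):
            return arg


    def run_server(transport):
        server = oslo_messaging.get_rpc_server(transport, TARGET,
                                               [PingEndpoint()],
                                               executor='eventlet')
        server.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            server.stop()
            server.wait()


    def run_client(transport):
        client = oslo_messaging.RPCClient(transport, TARGET, timeout=10)
        failed = 0
        while True:
            try:
                client.call({}, 'ping', arg='ping')
            except oslo_messaging.MessagingTimeout:
                # Each timeout is a call lost while RabbitMQ fails over.
                failed += 1
                print('timed out RPC calls so far: %d' % failed)
            time.sleep(1)


    if __name__ == '__main__':
        transport = oslo_messaging.get_transport(cfg.CONF, url=TRANSPORT_URL)
        if sys.argv[1:] == ['server']:
            run_server(transport)
        else:
            run_client(transport)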