.. This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Push Notifications for Agents
=============================

RFE: https://bugs.launchpad.net/neutron/+bug/1516195

Launchpad blueprint:

https://blueprints.launchpad.net/neutron/+spec/push-notifications

The current method we use to get information from the server to the agents is
driven by notifications and error-triggered calls to the server by the agents.
During normal operation, the server sends out a notification that a specific
object has changed (e.g. a port) and the agent responds by querying the server
for information about that port. If the agent encounters a failure while
processing changes, it starts over and re-queries the server in the process.

The load that this agent-driven approach places on the server is very
unpredictable because it depends on the changes to object states on the
Neutron server. For example, a single network update results in a query from
every L2 agent with a port on that network.

This blueprint aims to change the pattern we use to get information to the
agents so that it is primarily based on pushing the object state out in the
change notifications. For anything not converted to this method of retrieval
(e.g. initial agent startup still needs to poll), the AMQP timeout handling
will be fixed to use an exponential back-off to prevent the agents from
stampeding the server.

Problem Description
===================

An outage of a few agents and their subsequent recovery can lead to all of
the agents drowning the Neutron servers with requests. This can cause the
Neutron servers to fail to respond in time, which results in more retry
requests building up, leaving the entire system useless until an operator
intervenes. This is caused by three problems:

* We don't make optimal use of server notifications. There are times when the
  server sends a notification to an agent to inform it that something has
  changed, and the agent then has to make a call back to the server to get
  the relevant details. This means a single L3 rescheduling event of a set of
  routers due to a failed L3 agent can result in N more calls to the server,
  where N is the number of routers. Compounding this issue, a single agent
  may make multiple calls to the server for a single operation (e.g. the L2
  agent makes one call for port info and then another for security group
  info).

* The agents give up on a request after a short period of time and then
  retry it or issue an even more expensive request (e.g. if synchronizing
  info for one item fails, a major issue is assumed, so a request to sync all
  items is issued). So by the time the server finishes fulfilling a request,
  the client is no longer waiting for the response and it goes in the trash.
  As this compounds, it leaves the server processing a massive queue of
  requests that no longer have listeners for their responses.

* Related to the second item, the agents are aggressive in their retry
  mechanisms. If a request times out, it is immediately retried with the same
  timeout value; in other words, there is no back-off mechanism. (This has
  now been addressed by https://review.openstack.org/#/c/280595/, which adds
  back-off, sleep, and jitter; see the sketch after this list.)
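The back-off behaviour looks roughly like the following sketch. This is
illustrative only and not the exact code from the review above;
``call_with_backoff`` and its parameters are hypothetical names.

.. code-block:: python

    import random
    import time

    import oslo_messaging


    def call_with_backoff(client, context, method, timeout=10,
                          max_timeout=600, **kwargs):
        """Retry an RPC call, doubling the timeout on each failure."""
        while True:
            cctxt = client.prepare(timeout=timeout)
            try:
                return cctxt.call(context, method, **kwargs)
            except oslo_messaging.MessagingTimeout:
                if timeout >= max_timeout:
                    raise
                timeout = min(timeout * 2, max_timeout)
                # Sleep with random jitter so a fleet of recovering agents
                # doesn't retry in lock-step and stampede the server.
                time.sleep(random.uniform(0, timeout))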
Proposed Change
===============

Eliminate the expensive cases where calls are made to the Neutron server in
response to a notification generated by the server. In most of the cases
where the agent is just asking for regular Neutron objects (e.g. ports,
networks), we can leverage the RPC callbacks mechanism introduced in
Liberty [1] to have the server send the entire changed object as part of the
notification so the agent has the information it needs.

The main targets for this will be the security group info call, the
get_device_details call, and the sync_routers call. Others will be included
if the change is trivial once these three are done. The DHCP agent already
relies on push notifications, so it will just be updated to use the revision
number to detect the out-of-order events it is currently susceptible to.

For the remaining calls that cannot easily be converted to the callbacks
mechanism (e.g. the security groups call, which blends several objects, the
initial synchronization mechanism, and agent-generated calls), a nicer
timeout mechanism will be implemented with an exponential back-off and
timeout increase so a heavily loaded server is not continuously hammered to
death.

Changes to RPC callback mechanism
---------------------------------

The current issue with the RPC callback mechanism and sending objects as
notifications is the lack of both server operation ordering guarantees and
AMQP message ordering guarantees. To illustrate the first issue, examine the
following order of events that can happen when two servers update the same
port:

* Server 1 commits update to DB
* Server 2 commits update to DB
* Server 2 sends notification
* Server 1 sends notification

If the agent receives the notifications in the order in which they are
delivered to AMQP, it will think the state delivered by Server 1 is the
current state when it is actually the state committed by Server 2.

We also have the same issue when oslo.messaging doesn't guarantee message
order (e.g. ZeroMQ). Even if Server 1 sends immediately after its commit and
before Server 2 commits and sends, one or more of the agents could end up
seeing Server 2's message before Server 1's.

To handle this, we will add a revision number, implemented as a monotonic
counter, to each object. This counter will be incremented on any update so
any agent can immediately identify stale messages.

To address deletes arriving before updates, agents will be expected to keep a
set of the UUIDs that have been deleted. Upon receiving an update, the agent
will check this set for the object's UUID and ignore the update if it is
present, since deletes are permanent and UUIDs cannot be re-used. If we do
make IDs recyclable in the future, this can be replaced with a strategy to
confirm ID existence with the server, or we can add another internal UUID
that cannot be specified.

Note that this doesn't guarantee message ordering for the agent, because that
is a property of the messaging backend [2], but it does give the agent the
necessary information to re-order messages when it receives them so it can
determine which one reflects the more recent state of the DB. A sketch of
this agent-side bookkeeping follows.
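A minimal sketch of that bookkeeping is below. ``ResourceCache`` and its
method names are illustrative, not existing agent code; the real consumer
would live in the RPC callbacks subscriber.

.. code-block:: python

    class ResourceCache(object):
        """Track the newest revision seen per object and remember deletes
        so stale, out-of-order notifications can be discarded."""

        def __init__(self):
            self._revisions = {}   # uuid -> highest revision_number seen
            self._deleted = set()  # UUIDs of deleted objects (permanent)

        def record_delete(self, uuid):
            self._deleted.add(uuid)
            self._revisions.pop(uuid, None)

        def is_stale(self, uuid, revision_number):
            if uuid in self._deleted:
                # Deletes are permanent and UUIDs are never re-used, so any
                # update that arrives after a delete is stale by definition.
                return True
            # Anything at or below the revision we already processed
            # represents older (or duplicate) DB state.
            return revision_number <= self._revisions.get(uuid, -1)

        def record_update(self, uuid, revision_number):
            self._revisions[uuid] = revision_number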
Data Model Impact
-----------------

A 'revision_number' column will be added to the standard attributes table.
This column will be a simple big integer used as a monotonic counter that is
updated whenever the object is updated on the Neutron server. This revision
number can then be used by the agents to automatically discard any object
states that are older than the state they already have.

This revision_number will use the version counter feature that is built into
SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/versioning.html

Each time an object is updated, the server will perform a compare-and-swap
operation based on the revision number. This ensures that each update must
start with the current revision number or it will fail with a StaleDataError.
The API layer can catch this error with the current DB retry mechanism and
start over with the latest revision number.

While SQLAlchemy will automatically bump the revision for us when the record
for an object is updated (e.g. a standard attribute description field), it
will not do so when a related object changes (e.g. adding an IP address to
the port or changing its status). So we will have to manually trigger the
revision bump (either via a PRECOMMIT callback or inline code) for any
operations that we want to bump the revision number.
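As a rough sketch of how the SQLAlchemy feature would be wired up (the model
and column names here are illustrative, not the final schema):

.. code-block:: python

    import sqlalchemy as sa
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()


    class StandardAttribute(Base):
        __tablename__ = 'standardattributes'

        id = sa.Column(sa.BigInteger, primary_key=True)
        description = sa.Column(sa.String(255))
        revision_number = sa.Column(sa.BigInteger, nullable=False)

        __mapper_args__ = {
            # SQLAlchemy adds "WHERE revision_number = <value read>" to
            # every UPDATE of this row and raises StaleDataError when no
            # row matches, giving the compare-and-swap described above.
            'version_id_col': revision_number,
        }

With this in place, a writer that raced with another update fails with
StaleDataError because zero rows match the UPDATE's WHERE criteria, and the
API layer's retry logic can restart the operation with fresh state.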
What this guarantees:

- An object carried in a notification is newer (from a DB state perspective)
  than any object with a lower revision number. So any objects with lower
  revision numbers can safely be ignored since they represent stale DB state.

What this doesn't guarantee:

- Message ordering 'on the wire'. An AMQP listener may still receive an older
  state after a message it has already received. It is up to the listener to
  look at the revision number to determine whether a message is stale.

- That each intermediate state is transmitted. If a notification mechanism
  reads the DB to get the full object to send, the DB state may have
  progressed, so it will notify with the latest state rather than the state
  that triggered the original notification. This is acceptable for all of our
  use cases since we only care about the current state of the object to wire
  up the dataplane. It is also effectively what we have now, since the DB
  state can change between when the agent gets a notification and when it
  actually asks the server for details.

- Reliability of the notifications themselves. This doesn't address the
  issue we currently have where a dropped notification goes undetected.

Notifications Impact
--------------------

Existing notifications will become significantly more data-rich. The hope
here is to eliminate many of the expensive RPC calls that each agent makes
and have each agent derive all state from notifications, with one sync method
for recovery/initialization that we can focus on optimizing.

This will result in more data being sent up front by the server to the
messaging layer, but it eliminates the data that would be sent in response to
a call request from the agent in the current pattern. For a single agent, the
only gain is the elimination of the notification and call messages; but for
multiple agents interested in the same resource, it eliminates extra DB calls
and extra messages from the server to fulfill those calls.

This pattern will result in fewer messages sent to oslo.messaging because it
eliminates the calls from the agents: the same payload is preemptively
broadcast once instead of being cast multiple times to each requesting agent.

Performance Impact
------------------

A higher ratio of Neutron agents per server, afforded by a large reduction in
sporadic queries by the agents. This comes at the cost of effectively
serializing operations on an individual object due to the compare-and-swap
operation on the server. For example, if two server threads try to update a
single object concurrently and both read the current state of the object at
the same time, one will fail on commit with a StaleDataError, which will be
retried by the API layer. Previously both updates would succeed because the
UPDATE statement had no compare-and-swap WHERE criteria. However, this is a
very reasonable performance cost to pay considering that concurrent updates
to the same API object are not common.

Other Deployer Impact
---------------------

N/A - the upgrade path will maintain normal N-1 backward compatibility on the
server, so all of the current RPC endpoints will be left untouched for one
cycle.

Developer Impact
----------------

The development guidelines will need to change to discourage the
implementation of new direct server calls.

The notifications will have to send out oslo versioned objects since
notifications don't have RPC versions. So, at a minimum, we need to switch to
oslo versioned objects in the notification code even if we can't get them
fully implemented everywhere else. To do this we can leverage the RPC
callbacks mechanism.

Alternatives
------------

Maintain the current information retrieval pattern and just adjust the
timeout mechanism for everything to include back-offs, or use cast/cast
instead of calls. This would allow the system to automatically recover from
self-induced death by stampede, but it would not make the performance any
more predictable.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  kevinbenton
  Ihar Hrachyshka

Work Items
----------

* Exponential back-off for timeouts on agents.

* Implement a 'revision' extension to add the revision_number column to the
  data model and expose it as a standard attribute.

* Write tests to ensure revisions are incremented as expected.

* Write (at least one) test that verifies a StaleDataError is triggered in
  the event of concurrent updates.

* Update the DHCP agent to make use of the new 'revision' field to discard
  stale updates. This will serve as the proof of concept for this approach
  since the DHCP agent is currently exposed to operating on stale data from
  out-of-order messages.

* Replace the use of sync_routers calls on the L3 agents for the most
  frequent operations (e.g. floating IP associations) with RPC callbacks once
  the OVO work allows it.

* Stand up a grenade partial job to make sure agents using different OVO
  versions maintain N-1 compatibility.

* Update the devref for callbacks.

Possible Future Work
--------------------

* Switch to a cast/cast pattern so the agent isn't blocked waiting on the
  server.

* Set up a periodic system based on these revision numbers to let the agents
  figure out whether they have lost updates from the server (e.g. periodic
  broadcasts of revision numbers and UUIDs, sums of collections of revisions,
  etc.).

* Add an 'RPC pain multiplier' option that causes all calls to the Neutron
  server to be duplicated X number of times. That way we can set it to
  something like 200 in the gate, which will force us to make every call
  reasonably performant.

* Allow the HTTP API to perform compare-and-swap updates by accepting an
  If-Match header carrying the revision number, which would cause the update
  to fail if the version changed; see the sketch after this list.
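A client-side sketch of how that last item could look, assuming the revision
number is carried in a standard If-Match header and a mismatch yields 412
Precondition Failed (the header format, status code, and function name are
assumptions of this sketch, not a settled design):

.. code-block:: python

    import requests


    def update_port_if_unchanged(endpoint, token, port_id, revision, body):
        """Update a port only if its revision_number still matches."""
        resp = requests.put(
            '%s/v2.0/ports/%s' % (endpoint, port_id),
            headers={'X-Auth-Token': token,
                     'If-Match': 'revision_number=%d' % revision},
            json={'port': body})
        if resp.status_code == 412:
            # Someone else updated the object first; the caller should
            # re-read the port and decide whether to retry.
            return None
        resp.raise_for_status()
        return resp.json()['port']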
Testing
=======

* The grenade partial job will be important to ensure we maintain N-1
  backward compatibility with agents from the previous release.

* API tests will be added to ensure the basic operation of the revision
  numbers.

* Functional and unit tests will cover the agents' reactions to the new
  payloads.

Documentation Impact
====================

User Documentation
------------------

N/A

Developer Documentation
-----------------------

Devref guidelines on the pattern for getting information to agents and on the
acceptability criteria for calls to the server. The RPC callbacks devref will
need to be updated with the notification strategy.

References
==========

1. http://git.openstack.org/cgit/openstack/neutron/tree/doc/source/devref/rpc_callbacks.rst
2. https://www.rabbitmq.com/semantics.html#ordering