RFE: https://bugs.launchpad.net/neutron/+bug/1516195
Launchpad blueprint: https://blueprints.launchpad.net/neutron/+spec/push-notifications
The current method we use to get information from the server to the agents is driven by notification-triggered and error-triggered calls to the server by the agent. During normal operation, the server sends out a notification that a specific object has changed (e.g. a port) and the agent responds by querying the server for information about that port. If the agent encounters a failure while processing changes, it starts over and re-queries the server in the process.
The load on the server from this agent-driven approach can be very unpredictable depending on the changes to object states on the neutron server. For example, a single network update will result in a query from every L2 agent with a port on that network.
This blueprint aims to change the pattern we use to get information to the agents to primarily be based on pushing the object state out in the change notifications. For anything not changed to leverage this method of retrieval (e.g. initial agent startup still needs to poll), the AMQP timeout handling will be fixed to ensure it has an exponential back-off to prevent the agents from stampeding the server.
An outage of a few agents and their recovery can lead to all of the agents drowning the neutron servers with requests. This can cause the neutron servers to fail to respond in time, which results in more retry requests building up, leaving the entire system useless until operator intervention.
This is caused by 3 problems:

* Notifications from the server trigger follow-up queries from every interested agent, so a single change can fan out into many expensive server calls.
* The AMQP timeout handling has no back-off, so agents that fail to get a timely response simply retry, adding more load to an already struggling server.
* When an agent hits an error while processing changes, it starts over and re-queries the server, multiplying the traffic generated during recovery.
Eliminate expensive cases where calls are made to the neutron server in response to a notification generated by the server. In most of these cases where the agent is just asking for regular neutron objects (e.g. ports, networks), we can leverage the RPC callbacks mechanism introduced in Liberty[1] to have the server send the entire changed object as part of the notification so the agent has the information it needs.
The main targets for this will be the security group info call, the get_device_details call, and the sync_routers call. Others will be included if the change is trivial once these three are done. The DHCP agent already relies on push notifications, so it will just be updated to use the revision number to detect the out of order events it’s susceptible to now.
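To make the intent concrete, here is a rough sketch of what a push-style update for a port could look like. The push_api object, its method name, and the payload layout are illustrative assumptions for this spec, not the actual RPC callbacks API.

.. code-block:: python

    # Hypothetical sketch: after a successful commit, the server broadcasts
    # the full object state (including its revision number) instead of a bare
    # "port X changed" event that every agent would have to query back about.
    def notify_port_updated(context, port, push_api):
        payload = {
            'id': port['id'],
            'revision_number': port['revision_number'],
            # The fields agents previously had to call back to the server for:
            'status': port['status'],
            'fixed_ips': port['fixed_ips'],
            'security_groups': port['security_groups'],
        }
        # Broadcast once; every interested agent consumes it without making a
        # follow-up call to the server.
        push_api.push(context, 'port', 'updated', payload)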
For the remaining calls that cannot easily be converted into the callbacks mechanism (e.g. the security groups call which blends several objects, the initial synchronization mechanism, and agent-generated calls), a nicer timeout mechanism will be implemented with an exponential back-off and timeout increase so a heavily loaded server is not continuously hammered to death.
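As a rough illustration (not the actual implementation), the back-off could look something like the following. The do_rpc_call callable and the timeout values are hypothetical; oslo_messaging.MessagingTimeout is assumed to be the exception raised when the server does not respond in time.

.. code-block:: python

    import random
    import time

    import oslo_messaging


    def call_with_backoff(do_rpc_call, base_timeout=10, max_timeout=600):
        """Retry an RPC call with an exponentially increasing timeout.

        A random sleep between attempts keeps a fleet of agents from
        retrying in lock-step against an already overloaded server.
        """
        timeout = base_timeout
        while True:
            try:
                return do_rpc_call(timeout=timeout)
            except oslo_messaging.MessagingTimeout:
                time.sleep(random.uniform(0, timeout))
                timeout = min(timeout * 2, max_timeout)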
The current issue with the RPC callback mechanism and sending objects as notifications is a lack of server operation ordering guarantees and AMQP message ordering guarantees.
To illustrate the first issue, examine the following order of events that can happen when two servers update the same port:

1. Server 1 begins an update of the port.
2. Server 2 begins a different update of the same port.
3. Server 1 commits its change to the database.
4. Server 2 commits its change to the database; this is now the port's current state.
5. Server 2 sends its notification to AMQP.
6. Server 1 sends its notification to AMQP.
If the agent receives the notifications in the order in which they are delivered to AMQP, it will think the state delivered by Server 1 is the current state when it is actually the state committed by Server 2.
We also have the same issue when oslo.messaging doesn't guarantee message ordering (e.g. with the ZeroMQ driver). Even if Server 1 sends immediately after its commit and before Server 2 commits and sends, one or more of the agents could still end up seeing Server 2's message before Server 1's.
To handle this, we will add a revision number, implemented as a monotonic counter, to each object. This counter will be incremented on any update so any agent can immediately identify stale messages.
To address deletes arriving before updates, agents will be expected to keep a set of the UUIDs that have been deleted. Upon receiving an update, the agent will check this set for the object's UUID and, if it is present, ignore the update, since deletes are permanent and UUIDs cannot be re-used. If we do make IDs recyclable in the future, this can be replaced with a strategy to confirm ID existence with the server, or we can add another internal UUID that cannot be specified.
Note that this doesn't guarantee message ordering for the agent, because that is a property of the messaging backend, but it does give the agent the information needed to re-order messages as it receives them so it can determine which one reflects the more recent state of the DB.
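Putting the deleted-ID set and the revision comparison together, the agent-side bookkeeping could look roughly like this; the class and field names are illustrative, not the actual agent code.

.. code-block:: python

    class ResourceCache(object):
        """Tracks the newest known state of each resource seen in notifications."""

        def __init__(self):
            self._resources = {}       # resource id -> latest known state (dict)
            self._deleted_ids = set()  # ids of resources that have been deleted

        def record_delete(self, resource_id):
            self._deleted_ids.add(resource_id)
            self._resources.pop(resource_id, None)

        def record_update(self, resource):
            """Return True if this update is new information, False if stale."""
            if resource['id'] in self._deleted_ids:
                # A delete already arrived; deletes are permanent, so ignore.
                return False
            current = self._resources.get(resource['id'])
            if current and current['revision_number'] >= resource['revision_number']:
                # We already hold a state at least as new as this message.
                return False
            self._resources[resource['id']] = resource
            return True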
A ‘revision_number’ column will be added to the standard attributes table. This column will be a simple big integer used as a monotonic counter that is updated whenever the object is updated on the neutron server. The agents can then use this revision number to automatically discard any object state that is older than the state they already have.
This revision_number will use the version counter feature that is built into SQLAlchemy (http://docs.sqlalchemy.org/en/latest/orm/versioning.html). Each time an object is updated, the server will perform a compare-and-swap operation based on the revision number. This ensures that each update must start with the current revision number or it will fail with a StaleDataError. The API layer can catch this error with the current DB retry mechanism and start over with the latest revision number.
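A minimal sketch of how the SQLAlchemy version counter could be wired up on the standard attribute model (the table and column details here are illustrative, not the final schema):

.. code-block:: python

    import sqlalchemy as sa
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()


    class StandardAttribute(Base):
        __tablename__ = 'standardattributes'

        id = sa.Column(sa.BigInteger, primary_key=True)
        revision_number = sa.Column(sa.BigInteger, nullable=False,
                                    server_default='0')

        # Use revision_number as SQLAlchemy's version counter.  Every flush of
        # a dirty row then emits roughly:
        #   UPDATE standardattributes SET ..., revision_number = :new
        #   WHERE id = :id AND revision_number = :old
        # and raises StaleDataError if no row matched, i.e. a concurrent
        # update already bumped the counter.
        __mapper_args__ = {'version_id_col': revision_number}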
While SQLAlchemy will automatically bump the revision for us when the record for an object is updated (e.g. a standard attr description field), it will not do so when a related object changes (e.g. adding an IP address to the port or changing its status). So we will have to manually trigger the revision bump (either via a PRECOMMIT callback or inline code) for any operation that should bump the revision number.
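One possible shape of that manual bump, assuming the standard attribute row also carries an updated_at timestamp and the owning object exposes it through a standard_attr relationship (both assumptions made purely for illustration):

.. code-block:: python

    import datetime


    def bump_revision(port):
        """Force a revision bump when a related object (e.g. a fixed IP) changes.

        Touching any column on the versioned standard attribute row marks it
        dirty, so SQLAlchemy's version counter performs its compare-and-swap
        UPDATE of revision_number in the same transaction.
        """
        port.standard_attr.updated_at = datetime.datetime.utcnow()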
What this guarantees:

* Every committed change to an object results in a higher revision number, so an agent comparing two received states of the same object can always tell which one is newer and discard the stale one.

What this doesn't guarantee:

* Ordered delivery of the notifications themselves; as noted above, that remains a property of the messaging backend.
Making existing notifications significantly more data-rich. The hope here is to eliminate many of the expensive RPC calls that each agent makes and have each agent derive all state from notifications with one sync method for recovery/initialization that we can focus on optimizing.
This will result in more data being sent up front by the server to the messaging layer, but it will eliminate the data that would otherwise be sent in response to a call request from the agent under the current pattern. For a single agent, the only gain is that the separate notification and call messages are collapsed into a single push; but for multiple agents interested in the same resource, it also eliminates the extra DB queries and the extra messages from the server needed to fulfill those calls.
This pattern will result in fewer messages sent through oslo.messaging: the calls from the agents are eliminated, and the payload they would have requested is broadcast once instead of being cast separately to each requesting agent.
Higher ratio of neutron agents per server afforded by a large reduction in sporadic queries by the agents.
This comes at a cost of effectively serializing operations on an individual object due to the compare and swap operation on the server. For example, if two server threads try to update a single object concurrently and both read the current state of the object at the same time, one will fail on commit with a StaleDataError which will be retried by the API layer. Previously both of these would succeed because the UPDATE statement would have no compare-and-swap WHERE criteria. However, this is a very reasonable performance cost to pay considering that concurrent updates to the same API object are not common.
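For illustration, the API-layer retry could be as simple as the following decorator; Neutron's actual retry machinery is more involved, so treat this purely as a sketch.

.. code-block:: python

    import functools

    from sqlalchemy.orm import exc as orm_exc


    def retry_on_stale_data(max_attempts=10):
        """Re-run an update that lost a compare-and-swap race."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_attempts):
                    try:
                        return func(*args, **kwargs)
                    except orm_exc.StaleDataError:
                        if attempt == max_attempts - 1:
                            raise
                        # The decorated function runs again, re-reading the
                        # object (and its new revision number) before
                        # retrying the update.
            return wrapper
        return decorator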
N/A - upgrade path will maintain normal N-1 backward compatibility on the server so all of the current RPC endpoints will be left untouched for one cycle.
Need to change development guidelines to avoid the implementation of new direct server calls.
The notifications will have to send out oslo versioned objects since notifications don’t have RPC versions. So at a minimum we need to switch to oslo versioned objects in the notification code if we can’t get them fully implemented everywhere else. To do this we can leverage the RPC callbacks mechanism.
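As a sketch of what a versioned notification payload might look like with oslo.versionedobjects (the object and field names here are illustrative, not Neutron's actual object definitions):

.. code-block:: python

    from oslo_versionedobjects import base as ovo_base
    from oslo_versionedobjects import fields as ovo_fields


    @ovo_base.VersionedObjectRegistry.register
    class PortInfo(ovo_base.VersionedObject):
        # The object carries its own version, independent of any RPC version,
        # so consumers can negotiate or backlevel the payload.
        VERSION = '1.0'

        fields = {
            'id': ovo_fields.UUIDField(),
            'revision_number': ovo_fields.IntegerField(),
            'status': ovo_fields.StringField(),
        }


    # PortInfo(id=..., revision_number=5, status='ACTIVE').obj_to_primitive()
    # yields a dict that includes the version, suitable for placing on the
    # wire as part of the notification.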
Maintain the current information retrieval pattern and just adjust the timeout mechanism for everything to include back-offs or use cast/cast instead of calls. This will allow a system to automatically recover from self-induced death by stampede, but it will not make the performance any more predictable.