Author: Carl Baldwin <email@example.com>
Copyright: 2014 Hewlett-Packard Development Company, L.P.
On agent restart, the L3 agent loops through all routers to be sure they’re in sync with the database. This task can take over an hour on a heavily loaded system because of rootwrap, sudo and other inefficiencies, and it locks out RPC processing until it is done. From a user’s perspective, the system appears completely unresponsive to floating IP and port changes.
On agent restart, the L3 agent immediately kicks off a periodic task called _sync_routers_task. This task grabs a semaphore which locks out the _rpc_loop until it is done. This makes the L3 agent unresponsive to new work coming in via RPC. Floating IPs need to wait to become active or inactive, router gateways don’t get plugged or unplugged, and subnet ports cannot be manipulated. This gives a poor impression to a user who has just made an API call to get something done.
This blueprint proposes unifying _sync_routers_task and _rpc_loop into a single processing loop. This single loop will give priority to RPC messages. In other words, an RPC message about a given router will bump that router ahead of all of the routers that are in the queue from the _sync_routers_task code path.
The justification for prioritizing in this way is that _sync_routers_task requests maintenance updates to routers. It is meant to catch the somewhat unlikely case that a change was made to a router while the agent was down. RPC messages generally represent changes to the system that are being requested through the API in the moment. When you consider this, it is clear that RPC messages should be given precedence to improve the user experience.
To be fair, the _sync_routers_task is also helpful if the system reboots after a crash. In this case, these updates are more than just maintenance. However, in this case, each router on the system is already down. It is still prudent to respond to user requests with priority.
Each update will carry a timestamp so that they can be prioritized by time if there are many updates at once with the same priority.
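The ordering described above (priority first, then timestamp within the same priority) can be sketched as follows. This is a minimal illustration, not the blueprint's implementation: the class name RouterUpdate, the two priority constants, and the use of the standard library queue.PriorityQueue are all assumptions made for the example.

```python
import queue
import time

# Hypothetical priority levels: a lower number is served first.
PRIORITY_RPC = 0
PRIORITY_SYNC_ROUTERS_TASK = 1


class RouterUpdate:
    """One pending update for a router (illustrative names only)."""

    def __init__(self, router_id, priority, timestamp=None):
        self.router_id = router_id
        self.priority = priority
        # Default to the time the update was received.
        self.timestamp = time.time() if timestamp is None else timestamp

    def __lt__(self, other):
        # Compare by priority first, then by age within the same priority.
        return ((self.priority, self.timestamp) <
                (other.priority, other.timestamp))


updates = queue.PriorityQueue()
updates.put(RouterUpdate("router-a", PRIORITY_SYNC_ROUTERS_TASK))
updates.put(RouterUpdate("router-b", PRIORITY_RPC))
first = updates.get()  # the RPC update for router-b comes out first
```

With this ordering, an RPC update that arrives after a large batch of sync updates is still handed to the next free worker.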
The current L3 implementation allows processing many routers in parallel. In fact, there is virtually no bound on the number of routers that can be processed in parallel except for the limit of 1000 green threads. In practice, though, it is not useful to process more than 4-8 routers in parallel: there are enough contention points between the threads that most of them end up starved anyway.
This blueprint implementation will create a _process_routers_loop to process all updates. This loop will use a green thread pool of fixed size. The loop will continuously spawn worker threads to ensure that the maximum number of workers are either processing a router or waiting on the queue for the next update to come in.
The size of the worker pool can easily be made configurable in a follow-on to this blueprint if there is enough demand. However, based on testing done at scale, the size will initially be set to 8.
A new worker will spawn and immediately call _process_router_update. This method will immediately look to the queue for the next router update. It will block (friendly green thread style block) until one is available.
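The worker behaviour described above can be sketched with a fixed number of workers that each block on a shared queue. This is a simplified illustration under stated assumptions: it uses OS threads from the standard library rather than green threads, keeps workers alive instead of respawning them, and the names process_router_update, run_workers, and the None shutdown sentinel are hypothetical.

```python
import queue
import threading

WORKER_POOL_SIZE = 8  # the initial pool size stated above


def process_router_update(updates, handle):
    """Worker body: block for the next update, handle it, repeat."""
    while True:
        update = updates.get()  # blocks until an update is available
        if update is None:      # hypothetical shutdown sentinel
            return
        handle(update)


def run_workers(updates, handle, size=WORKER_POOL_SIZE):
    """Start a fixed number of workers reading from the same queue."""
    workers = [
        threading.Thread(target=process_router_update,
                         args=(updates, handle))
        for _ in range(size)
    ]
    for worker in workers:
        worker.start()
    return workers
```

Because every worker is either handling an update or blocked in get(), the number of routers processed in parallel never exceeds the pool size.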
At this point, there are some timing and coordination issues to consider. The ExclusiveRouterProcessor class was designed to take care of these.
First, since there are multiple workers and many routers, we don’t want to have multiple workers touching the same router at once. To avoid this, the queue implementation will return an instance of ExclusiveRouterProcessor that guarantees that worker has exclusive access to the router, even if other update messages come in while it is being processed. This worker will be considered the master for this router until updates are finished.
Second, there is the possibility that a new update for the router will come in and bubble to the front of the priority queue while the router is being processed with outdated information. To handle this case, the worker that picks up this new update will try to get exclusive access to the router by creating an instance of ExclusiveRouterProcessor. This instance will realize that it is not the master processor and will simply append the update to the list of updates that the master instance will process.
When the master instance is done processing the router, it will check its queue of updates to see if the router needs to be processed again. If another update is found, it will process the router again, fetching new information from the DB. It will loop until there are no more updates. This covers the case where a user is actively making updates to a router over a period of time. The router simply needs to be processed several times in a row to respond to these updates until the user is finished.
It is important to note here that the new update must have bubbled all the way to the front of the priority queue, and a worker must have taken it off the queue, before the router processor will loop on the router and process it multiple times in a row. Without this important distinction, the algorithm would be subject to a denial of service attack where 8 routers could completely starve all of the other routers on the system.
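The master and non-master roles described above can be sketched as follows. This is an illustrative stand-in, not the real ExclusiveRouterProcessor: the class name ExclusiveProcessorSketch and its methods are assumptions, and the sketch relies on cooperative scheduling (as with green threads), so it takes no locks.

```python
class ExclusiveProcessorSketch:
    """Illustrative stand-in for ExclusiveRouterProcessor."""

    _masters = {}  # router_id -> current master instance

    def __init__(self, router_id):
        self.router_id = router_id
        self.pending = []
        # The first instance created for a router becomes its master.
        self.master = self._masters.setdefault(router_id, self)

    def is_master(self):
        return self.master is self

    def queue_update(self, update):
        # Updates always land on the master's list, whoever receives them.
        self.master.pending.append(update)

    def updates(self):
        # Only the master yields work; it loops until its list is empty,
        # including updates appended while it was processing.
        if self.is_master():
            while self.pending:
                yield self.pending.pop(0)

    def release(self):
        # Give up exclusive access once all updates are handled.
        if self.is_master():
            del self._masters[self.router_id]
```

A worker that turns out not to be the master only appends its update and then returns to the queue, so it never blocks waiting for the router to become free.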
The complexity in this class is mostly around guaranteeing that there is only one master processor for any given router. The rest is around the timing issues discussed next.
Each update carries a timestamp that is initialized to the time when the update was received.
The ExclusiveRouterProcessor class keeps one timestamp per router, set to the time just before the database query that fetched the latest data about the router. This timestamp is not recorded until the router has been processed using those data; it is recorded by calling the fetched_and_processed method. This is very important because the timestamp records the age of the data that was last used to complete an update to the router on the system. This accounts for the time delta between when the data were fetched and when the router is finally updated.
In the case of _sync_routers_task, the same timestamp is used for the update and for the age of the data, since the system will immediately run a query to get all of the router data after the updates are created. However, the new implementation will still update all routers. They will just be updated with a lower priority than the RPC-generated updates.
An update will be processed for a router if and only if the update timestamp is newer than the most recent router_data_timestamp.
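The timestamp rule can be sketched as a small bookkeeping class. This is a minimal illustration: the class name RouterDataTimestamps and the should_process method are hypothetical, timestamps are plain floats for simplicity, and keeping the stored value from moving backwards is an assumption of this sketch rather than something the text above states.

```python
class RouterDataTimestamps:
    """Tracks, per router, the age of the data last used to update it."""

    def __init__(self):
        self._stamps = {}

    def _current(self, router_id):
        # A router that was never processed counts as infinitely old.
        return self._stamps.get(router_id, float("-inf"))

    def fetched_and_processed(self, router_id, fetch_time):
        # Called only after the router was processed with data fetched
        # at fetch_time. Assumption: the stored value never goes backwards.
        self._stamps[router_id] = max(self._current(router_id), fetch_time)

    def should_process(self, router_id, update_timestamp):
        # Process only if the update is newer than the data already applied.
        return update_timestamp > self._current(router_id)


stamps = RouterDataTimestamps()
stamps.fetched_and_processed("router-a", fetch_time=10.0)
stale = stamps.should_process("router-a", update_timestamp=9.0)   # False
fresh = stamps.should_process("router-a", update_timestamp=11.0)  # True
```

An update received before the last fetch is skipped because the data already applied to the router reflect it.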
Speeding up the L3 agent has been a work item for some time. Progress has been made in this area during the Icehouse time frame. For example, sudo was found to have an inefficiency that added 100 milliseconds or more to each invocation. This affected the L3 agent’s ability to plumb routers in a timely manner.
In the Juno timeframe, we will get a new daemon mode for rootwrap which will speed up the agent a great deal.
The bottom line is that speeding up the agent will not be enough. On an agent hosting hundreds of routers, there will still be a significant delay caused by the _sync_routers_task which will affect the end user experience.
Improved responsiveness to L3 changes made through the API following an agent restart.
This change will allow deployers of large scale cloud deployments using L3 agent to breathe easier. They will be able to deploy updates to the code base, restart the L3 agents and not worry about the effect it has on the system’s overall responsiveness.
No new gate tests will be required as this does not change functionality. The implementation will be fully unit tested, including new tests to cover the functionality of the priority queue and router processor.