Highly available (HA) routers are a new feature that was merged as part of the l3-high-availability blueprint. HA routers are scheduled on multiple L3 agents; however, the cloud operator has no way of knowing where the active instance is.
A cloud operator can see which L3 agents are hosting a router, but not which of them holds the active instance. Legacy routers may be manually moved from one agent to another. With HA routers, the equivalent is moving the active instance, but that is not currently possible. The first step is to know where the active instance is, which this blueprint addresses. Setting the location of the active instance is out of scope and will be addressed in the future.
The operator might want to perform node maintenance, which is easier if routers can first be manually moved off the node. Likewise, the operator might want to see the state of routers after a failover (did the active instance actually fail over?).
The l3-agent-list-hosting-router command currently shows all L3 agents hosting the router. It will now also show the HA state (active, standby or fault) of that router on every agent.
+-----------+------+----------------+-------+----------+
| id        | host | admin_state_up | alive | ha_state |
+-----------+------+----------------+-------+----------+
| 534c4b37- | net1 | True           | :-)   | active   |
| da2730c6- | net2 | True           | :-)   | standby  |
| 7abcd991- | net3 | True           | xxx   | fault    |
+-----------+------+----------------+-------+----------+
Keepalived offers no way to query the current VRRP state. The only way to learn it is through notifier scripts, which are executed when a state transition occurs and receive the new state (master, backup or fault).
Every time we reconfigure keepalived (when the router is created or updated), we tell it to execute a Python script that is maintained as part of the repository.
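To illustrate the reconfiguration step, the sketch below renders a VRRP instance stanza that points keepalived's notify_master, notify_backup and notify_fault directives at a state change script. Only those three directive names come from keepalived; the function name, the script path and the arguments passed to the script are hypothetical and chosen for illustration.

```python
def render_vrrp_instance(name, interface, vrid, priority, script_path, router_id):
    """Render a keepalived vrrp_instance stanza with notifier scripts.

    script_path and the arguments appended to it are hypothetical; a real
    deployment would use whatever script the repository ships.
    """
    lines = [
        "vrrp_instance %s {" % name,
        "    state BACKUP",
        "    interface %s" % interface,
        "    virtual_router_id %d" % vrid,
        "    priority %d" % priority,
    ]
    # keepalived runs the matching command on every state transition.
    for state in ("master", "backup", "fault"):
        lines.append('    notify_%s "%s %s %s"'
                     % (state, script_path, router_id, state))
    lines.append("}")
    return "\n".join(lines)
```

Passing the router ID and the new state on the command line keeps the script stateless: it does not need to work out which router it was invoked for.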
The script will write the new state to disk and notify the L3 agent of the transition via a Unix domain socket.
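A minimal sketch of such a transition script, assuming it records the new state on disk (the agent later reads HA states from disk) and notifies the L3 agent over a Unix domain socket. The directory layout, the file name, the message format and all function names are hypothetical.

```python
import os
import socket


def record_state(state_dir, router_id, state):
    """Write the new VRRP state to disk so it survives agent restarts."""
    router_dir = os.path.join(state_dir, router_id)
    os.makedirs(router_dir, exist_ok=True)
    path = os.path.join(router_dir, "state")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(state)
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp_path, path)
    return path


def notify_agent(socket_path, router_id, state):
    """Send one 'router_id state' line to the agent's Unix domain socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(("%s %s\n" % (router_id, state)).encode("utf-8"))


def handle_transition(state_dir, socket_path, router_id, state):
    # Persist first: if the agent is down, the state is still on disk
    # for it to pick up when it starts again.
    record_state(state_dir, router_id, state)
    notify_agent(socket_path, router_id, state)
```

Writing to disk before notifying means a failed notification (for example, the agent is not running) loses nothing that the agent cannot recover at startup.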
The L3 agent will start and stop the metadata proxy when it receives a notification. This saves memory by enabling the proxy only on the active instance, which can be important at scale as every proxy takes 20+ MB.
The L3 agent will batch these state change notifications. Once an event is received by the agent, it collects all further events over a period of T seconds. When the timer goes off, it sends all of the state changes to the server in a single RPC message containing a map of router ID to VRRP state on that specific agent. If a router changes state multiple times during the batching period, the agent will only send the most up-to-date state. Additionally, every time the agent starts it gets a list of routers scheduled on the agent. The agent will now loop through these routers, collect their HA states from disk and update the server. This catches any state changes that occurred while the agent was down.
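The batching behaviour can be sketched as follows. This is an illustration under stated assumptions rather than the agent's actual code: the class name, its methods and the `send` callable (standing in for the RPC call to the server) are all hypothetical.

```python
import threading


class BatchNotifier:
    """Collect router state changes and report them once per T seconds."""

    def __init__(self, interval, send):
        self._interval = interval  # T, in seconds
        self._send = send          # callable taking {router_id: state}
        self._pending = {}
        self._timer = None
        self._lock = threading.Lock()

    def queue_event(self, router_id, state):
        with self._lock:
            # A later event for the same router overwrites the earlier
            # one, so only the most recent state is reported.
            self._pending[router_id] = state
            if self._timer is None:
                # The first event of a batch starts the timer.
                self._timer = threading.Timer(self._interval, self.flush)
                self._timer.daemon = True
                self._timer.start()

    def flush(self):
        with self._lock:
            states, self._pending = self._pending, {}
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
        if states:
            # One message carries every change seen during the batch.
            self._send(states)
```

For example, queueing `("r1", "master")`, `("r2", "master")` and `("r1", "backup")` within one period results in a single call to `send` with `{"r1": "backup", "r2": "master"}`.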
Sending the RPC message will be retried if the management network is temporarily down or the agent is disconnected from it.
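One possible shape for the retry logic is sketched below. The retry count, the doubling delay and the function name are assumptions made for illustration; the text above only states that the send is retried.

```python
import time


def send_with_retry(send, payload, attempts=5, delay=1.0, sleep=time.sleep):
    """Call send(payload), retrying on failure with a doubling delay.

    send stands in for the RPC call. A real agent would catch the
    messaging layer's specific exceptions rather than Exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except Exception:
            if attempt == attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay *= 2
```

Because the payload is a map of router ID to latest state, resending it after a failure is safe: the server ends up with the same data regardless of how many attempts were needed.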
The server will then persist the information carried by the RPC message. The tables are already set up for this: each router has an entry in the HA bindings table per agent it is scheduled to, and the record contains the VRRP state on that specific agent. The controller will also persist the time at which the last state change was received, so that in a split-brain situation the admin can tell which is the ‘real’ master by comparing the timestamps.
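The server-side update could look roughly like the sketch below, which uses sqlite3 purely so that the example is self-contained. The table name, the column names and the function names are hypothetical and do not reflect the real Neutron schema.

```python
import sqlite3


def create_table(conn):
    # One row per (router, agent) binding, as described above.
    conn.execute(
        "CREATE TABLE ha_bindings ("
        " router_id TEXT, agent_id TEXT, state TEXT,"
        " state_changed_at REAL,"
        " PRIMARY KEY (router_id, agent_id))")


def update_states(conn, agent_id, states, now):
    """Persist {router_id: state} for one agent in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "UPDATE ha_bindings SET state = ?, state_changed_at = ?"
            " WHERE router_id = ? AND agent_id = ?",
            [(state, now, router_id, agent_id)
             for router_id, state in states.items()])
```

If two agents both report ‘active’ for the same router, the `state_changed_at` values show which report is the more recent one.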
Optionally, the server will look for dead agents (those that have not sent heartbeats in a while) and mark their HA routers as down. This helps with the main use case: a hypervisor dies, and so cannot report any state changes, and another hypervisor takes over all of its routers. In this case the API will return ‘active’ for all routers on both machines until the server notices that the first agent died and marks its routers as down.
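The dead agent check can be sketched as a pure function over the bindings and the last heartbeat times. The function name, the parameter names and the data shapes are assumptions for illustration, and how a binding is then marked as down is left to the caller.

```python
def bindings_on_dead_agents(bindings, heartbeats, now, agent_down_time):
    """Return the (router_id, agent_id) bindings whose agent looks dead.

    bindings: iterable of (router_id, agent_id) tuples
    heartbeats: {agent_id: timestamp of the last heartbeat}
    agent_down_time: seconds without a heartbeat before an agent is
        considered dead
    """
    dead = {agent_id for agent_id, last_seen in heartbeats.items()
            if now - last_seen > agent_down_time}
    return [(router_id, agent_id)
            for router_id, agent_id in bindings
            if agent_id in dead]
```

Until this check runs, a router hosted on a dead agent and on the agent that took over will appear as active on both.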
The HA state of every router-to-agent binding is persisted in the L3HARouterAgentPortBinding table. This state is currently unused. A DB migration will be necessary to add timestamps as well as the ‘fault’ state, as currently only ‘active’ and ‘standby’ can be persisted.
l3-agent-list-hosting-router will now return an extra column that can be ‘active’, ‘standby’ or ‘fault’ for HA routers, or None for other types of routers.
keepalived runs as root, as does the transition script that it invokes. The transition script talks to the agent via a Unix domain socket.
python-neutronclient will support the new ha_state column. It will show ‘active’, ‘standby’ or ‘fault’ when a proper response is received. ‘-’ will be displayed if None is received from an old server or for non-HA routers.
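The display rule can be sketched as below. The function name and the dictionary shape of an agent entry are hypothetical; treating a missing ha_state key the same as None is an assumption made here to cover older servers.

```python
def agent_rows(agents):
    """Build display rows, showing '-' when no HA state is available."""
    rows = []
    for agent in agents:
        # An old server may omit ha_state entirely; non-HA routers
        # report None. Both are displayed as '-'.
        ha_state = agent.get("ha_state")
        rows.append((agent["id"], agent["host"],
                     "-" if ha_state is None else ha_state))
    return rows
```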
Assuming two L3 agents and 1,000 routers hosted on each, a failover from node 1 to node 2 should induce only a single RPC call from node 2 to the server, and a single DB transaction.
Instead of neutron-keepalived-state-change notifying the agent via a Unix domain socket, the agent could poll for the state of all HA routers every T seconds. It would then diff the new states against a cached copy and notify the server of any changes. One could argue that this is simpler to implement and maintain, but it is less performant.
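The diff step of this polling alternative is small enough to sketch directly. The function name is hypothetical, and both arguments are assumed to be maps of router ID to state.

```python
def diff_states(cached, current):
    """Return {router_id: state} for routers whose state is new or changed."""
    return {router_id: state
            for router_id, state in current.items()
            if cached.get(router_id) != state}
```

The agent would call this once per polling period, send the result to the server if it is non-empty, and then replace the cached copy with the current states. The cost is that every HA router must be inspected every T seconds, even when nothing has changed.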
L3 HA cannot be tested in Tempest without multi-node support. L3 HA is the first candidate to be tested when in-tree integration tests are introduced via the integration-tests blueprint.
The L3 agent already has functional testing in place. Two new tests will be added:
The RPC and DB methods will be tested with unit tests.
The changes to the API and CLI require documentation.
The CLI client documentation must be updated.
The Neutron API change must be documented.