High Availability for Ironic Inspector¶
Ironic inspector is a service that allows bare metal nodes to be introspected dynamically; it currently isn’t redundant. The goal of this blueprint is to suggest conceptual changes to the inspector service that would make it redundant while maintaining both the current inspection feature set and the API.
Inspector is a compound service consisting of the inspector API service, the firewall and the DHCP (PXE) service. Currently, all three components run as a single instance on a shared host per OpenStack deployment. A failure of the host or any of the services renders introspection unavailable and prevents the cloud administrator from enrolling new hardware or from booting already enrolled bare metal nodes. Furthermore, Inspector isn’t designed to cope well with the amount of hardware required for Ironic bare metal usage at large scale. With a site size of 10k bare metal nodes in mind, we aim at the inspector sustaining a batch load of a couple of hundred introspection/enroll requests interleaved with a couple of minutes of silence, while maintaining a couple of thousand firewall black list items. We refer to this use case as bare metal to tenant.
Below we describe the current Inspector service architecture with some Inspector process instance failure consequences.
Node introspection is a sequence of asynchronous steps, controlled by the inspector API service, that take various amounts of time to finish. One could describe these steps as states of a transition system, advanced by events as follows:
starting: the initial state; the system is advanced into this state by receiving an introspect API request. Introspection configuration and set-up steps are performed while in this state.
waiting: introspection image is booting on the node. The system advances to this state automatically.
processing: introspection image has booted and collected necessary information from the node. This information is being processed by plug-ins to validate node status. The system is advanced to this state having received the continue REST API request.
finished: introspection is done, node powered-off. The system is advanced to this state automatically.
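The states listed above can be modelled as a small transition table. The following is a minimal sketch, not the transition function of this specification: only the four state names come from the text, while the event names (`introspect`, `wait`, `continue`, `finish`) and the `IllegalTransition` exception are hypothetical and chosen for illustration.

```python
from enum import Enum


class State(Enum):
    STARTING = "starting"
    WAITING = "waiting"
    PROCESSING = "processing"
    FINISHED = "finished"


# Hypothetical event names; only the states come from the text above.
TRANSITIONS = {
    (None, "introspect"): State.STARTING,            # introspect API request
    (State.STARTING, "wait"): State.WAITING,         # automatic
    (State.WAITING, "continue"): State.PROCESSING,   # continue REST API request
    (State.PROCESSING, "finish"): State.FINISHED,    # automatic
}


class IllegalTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def advance(state, event):
    """Return the state that follows `state` after `event`."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise IllegalTransition(f"{event!r} is not allowed in {state}") from None
```

Keeping the table as plain data makes it easy to check a request against the current state before any side effect takes place.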
In case of an API service failure, nodes between the starting and finished states will lose their state and may require manual intervention to recover. No more nodes can be processed either, because the API service runs as a single instance per deployment.
To minimize interference with normally deployed nodes, inspector deploys temporary firewall rules so only nodes being inspected can access its PXE boot service. It is implemented as a blacklist containing MAC addresses of nodes kept by ironic service but not by inspector. This is required because the MAC address isn’t known before a node boots for the first time.
Depending on the spot in which the API service fails while the firewall and DHCP services are intact, firewall configuration may get out of sync and may lead to interference with normal node booting:
firewall chain set-up (init phase): Inspector’s dnsmasq service is exposed to all nodes
firewall synchronization periodic task: new nodes added to Ironic aren’t blacklisted
node introspection finished: the node won’t be blacklisted
On the other hand, no boot interference is expected if all services (inspector, firewall and DHCP) run on the same host, as all services are lost together. Losing the API service during the clean-up periodic task should not matter, as the nodes concerned will be kept blacklisted during service downtime.
DHCP (PXE) service¶
The Inspector service doesn’t manage the DHCP service directly; rather, it only requires that DHCP is properly set up and shares the host with the API service and the firewall. We would nevertheless like to briefly describe the consequences of the DHCP service failing.
In case of a DHCP service failure inspected nodes won’t be able to boot the introspection ramdisk and eventually fail to get inspected because of a timeout. The nodes may loop retrying to boot depending on their firmware configuration.
A fail-over of DHCP from active to back-up host (dnsmasq usually) would
manifest with booting nodes under introspection timing out or nodes
already booted (with a lease of an address) getting into an address
conflict with another node booting. There’s not much to help the
former situation besides retrying. To prevent the latter from
happening, the configuration of DHCP service for the introspection
purpose should consider disjoint address pools served by the DHCP
instances such as recommended in IP address allocation between
section of the DHCP Failover Protocol RFC. We also recommend using
dhcp-sequential-ip in the dnsmasq configuration file to avoid
conflicts within the address pools. See related bug report for more
details on the issue. The introspection being an ephemeral matter,
synchronization of the leases between the DHCP instances isn’t
necessary if restarting introspection isn’t an issue.
Other Inspector parts¶
periodic introspection status clean-up, removing old inspection data and finishing timed-out introspections
synchronizing set of nodes with ironic
limiting node power-on rate with a shared lock and a timeout
In considering the problem of high availability, we propose a solution that consists of a distributed, shared-nothing, active-active implementation of all services that comprise the ironic inspector. From the user’s point of view, we suggest that the API service be served through a load balancer, such as HAProxy, in order to maintain a single entry point for the API service (e.g. a floating IP address).
HA Node Introspection decomposition¶
Node introspection being a state transition system, we focus on decentralizing it. We therefore replicate the current introspection state of a particular node through the distributed store to all inspector process instances. We suggest that both the automatic state advancing requests and the API state advancing requests are performed asynchronously by independent workers.
Each inspector process provides a pool of asynchronous workers that get state transition requests from a queue. We use separate queue.get and queue.consume calls to avoid losing state transition requests due to worker failures. This, however, introduces at-least-once delivery semantics to the requests. We therefore rely on the transition-function to handle the request delivery gracefully. We suggest two kinds of state-transition handling with regard to the at-least-once delivery semantics:
Strict (non-reentrant-task) Transition Specification¶
Reentrant Task Transition Specification¶
A strict transition protecting a state change may lead to a situation in which the state of introspection does not correspond to the real state of the node: this happens if a worker partitions right after having successfully executed the task but just before consuming the request from the queue. As a consequence, the transition request, not having been consumed, will be encountered again by (another) worker. One can refer to this behavior as a reentrancy glitch or Déjà vu.
Since the goal is to protect the inspected node from going through the
same task again, we rely on the state transition system to handle this
situation by navigating to the
error state instead.
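The effect of a redelivered request on the two kinds of transitions can be illustrated with a toy model. This is only a sketch under assumptions: the event names `power_on` and `sync_firewall`, the two tables and the `deliver` helper are hypothetical, and a reentrant transition is modelled simply as one that is safe to repeat.

```python
ERROR = "error"

# Hypothetical events and states, used only for illustration.
STRICT = {("starting", "power_on"): "waiting"}
REENTRANT = {("waiting", "sync_firewall"): "waiting"}


def advance(state, event):
    """Return the next state; an unexpected strict event leads to error."""
    if (state, event) in STRICT:
        return STRICT[(state, event)]
    if (state, event) in REENTRANT:
        return REENTRANT[(state, event)]  # safe to repeat
    return ERROR


def deliver(store, node, event, times):
    """Simulate at-least-once delivery of one request `times` times."""
    for _ in range(times):
        store[node] = advance(store[node], event)
    return store[node]
```

Delivering the strict `power_on` request twice leaves the node in the error state rather than powering it on again, whereas the reentrant request may be repeated without changing the outcome.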
Removing a node¶
The Ironic synchronization periodic task puts node delete requests on the queue. Workers perform the following steps to handle them:
Failure of store removing the node isn’t a concern here as the periodic task will try again later. It is therefore safe to always consume the request here.
Shutting Down HA Inspector Processes¶
All inspector process instances register a SIGTERM callback. To notify inspector worker threads, the SIGTERM callback sets the sigterm_flag upon signal delivery. The flag is process-local and its purpose is to allow inspector processes to perform a controlled/graceful shutdown. For this mechanism to work, potentially blocking operations (such as queue.get) have to be used with a configurable timeout value within the workers. All sleep calls throughout the process instance should be interruptible, possibly implemented as sigterm_flag.wait(sleep_time) or similar.
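The shutdown mechanism described above can be sketched with the standard library alone. This is a minimal illustration, assuming `threading.Event` as the process-local flag and an in-process `queue.Queue` standing in for the real queue service; the function names are hypothetical.

```python
import queue
import signal
import threading

sigterm_flag = threading.Event()  # process-local shutdown flag


def on_sigterm(signum, frame):
    """SIGTERM callback: only sets the flag; workers notice it themselves."""
    sigterm_flag.set()


def install_handler():
    """Register the callback; must be called from the main thread."""
    signal.signal(signal.SIGTERM, on_sigterm)


def worker(requests, handled, timeout=0.1):
    """Poll the queue with a timeout so the flag is checked regularly."""
    while not sigterm_flag.is_set():
        try:
            task = requests.get(timeout=timeout)
        except queue.Empty:
            continue  # timed out: poll the queue again
        handled.append(task)


def interruptible_sleep(seconds):
    """Sleep that returns True early if shutdown was requested."""
    return sigterm_flag.wait(seconds)
```

Because the worker never blocks for longer than `timeout`, the delay between signal delivery and the worker stopping is bounded by that value plus the duration of the task in progress.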
Getting a request¶
any worker instance may execute any request the queue contains
worker gets state transition or node delete request from the queue
if the SIGTERM flag is set, worker stops
if queue.get timed out (task is None), worker polls the queue again
lock the BM node related to the request
if locking failed worker polls the queue again not consuming the request
Calculating new node state¶
worker instantiates a state transition system instance for current node state
if instantiating failed (e.g. no such node in the store) worker performs Retrying a request
worker advances the state transition system
if the state machine is jammed (illegal state transition request) worker performs Consuming a request
Updating node state¶
The introspection state is kept in the store, visible to all worker instances.
worker saves node state in the store
if saving node state in the store failed (such as node has been removed) worker performs Retrying a request
Executing a task¶
worker performs the task bound to the transition request
if the task result is a transition request worker puts it on the queue
Consuming a request¶
worker consumes the state transition request from the queue
worker releases related node lock
worker continues from the beginning
Retrying a request¶
worker releases node lock
worker continues from the beginning not consuming the request to retry later
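The steps from Getting a request to Retrying a request can be put together in one function that handles at most one request. The sketch below is a simplified model under assumptions: `PeekQueue`, the `TRANSITIONS` and `TASKS` tables, the event names and the returned outcome strings are all hypothetical, local `threading.Lock` objects stand in for the distributed node locks, and a plain dictionary stands in for the store.

```python
import collections
import threading


class PeekQueue:
    """Toy in-memory queue with separate get() and consume() calls."""

    def __init__(self):
        self._items = collections.deque()

    def put(self, request):
        self._items.append(request)

    def get(self):
        # Return the oldest request without removing it (None if empty).
        return self._items[0] if self._items else None

    def consume(self, request):
        self._items.remove(request)


# Hypothetical transition table and task bindings.
TRANSITIONS = {("waiting", "continue"): "processing",
               ("processing", "finish"): "finished"}
TASKS = {"continue": lambda node: "finish"}  # task yields a follow-up event


def handle_one(requests, store, locks, sigterm_flag):
    """Process at most one request; return what happened to it."""
    if sigterm_flag.is_set():
        return "stopped"
    request = requests.get()
    if request is None:
        return "empty"                     # nothing queued: poll again
    node, event = request
    lock = locks.setdefault(node, threading.Lock())
    if not lock.acquire(blocking=False):
        return "locked"                    # request stays on the queue
    try:
        if node not in store:
            return "retried"               # Retrying a request
        new_state = TRANSITIONS.get((store[node], event))
        if new_state is None:
            requests.consume(request)      # jammed: Consuming a request
            return "jammed"
        store[node] = new_state            # Updating node state
        task = TASKS.get(event)
        follow_up = task(node) if task else None   # Executing a task
        if follow_up is not None:
            requests.put((node, follow_up))
        requests.consume(request)          # Consuming a request
        return "consumed"
    finally:
        lock.release()                     # node lock is always released
```

A worker thread would call `handle_one` in a loop until it returns `"stopped"`. Note that the state is saved before the task runs, which is what makes a redelivered strict request detectable.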
Introspection State-Transition System¶
Node introspection state is managed by a worker-local instance of a state transition system. The state transition function is as follows.
the initial state
the terminal/accepting state
the automatic event originating in State
strict/non-reentrant transition event
HA Singleton Periodic task decomposition¶
The Ironic inspector service houses a couple of periodic tasks. At any point, at most one “instance” of each periodic task flavor should be running, no matter how many process instances there are. For this purpose, the processes form a periodic task distributed management party.
Process instances register a
SIGTERM callback that, the signal
being delivered, makes the process instance leave the party and switch
The process instances install a watch on the party. Upon the party
shrinkage, the processes reset their periodic task, if they have one
set, triggering the
reset_flag and participate in new distributed
periodic task management leader election. Party growth isn’t of
concern to the processes.
Because the periodic task is reset upon party shrinkage, a custom flag has to be used to stop the periodic task instead of the sigterm_flag. Otherwise, setting the sigterm_flag because of a party change would stop the whole service.
The leader process executes the periodic task loop. Upon exception or partitioning (mind the partitioning-concerns), the leader stops by flipping the sigterm_flag in order for the inspector service to stop. The periodic task loop is stopped eventually as it performs reset_flag.wait(period) instead of sleeping.
The periodic task management should happen in a separate asynchronous thread instance, one per periodic task. Losing leader due to its error (or partitioning) isn’t a concern — a new one will eventually be elected and a couple of periodic task runs will be wasted (including those that died together with the leader).
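The leader's periodic task loop can be sketched as follows. This is a minimal illustration, assuming a `threading.Event` as the reset flag; the leader election itself is not shown and is represented by the hypothetical `is_leader` callable.

```python
import threading


def periodic_loop(task, period, reset_flag, is_leader):
    """Run `task` every `period` seconds while this process is the leader.

    Returns the number of completed runs. `is_leader` is a stand-in for
    the result of the distributed leader election.
    """
    runs = 0
    while is_leader() and not reset_flag.is_set():
        task()
        runs += 1
        # Interruptible sleep: returns True as soon as the flag is set.
        if reset_flag.wait(period):
            break
    return runs
```

Each periodic task would run this loop in its own thread, so that resetting one task on party shrinkage does not affect the others.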
HA Periodic clean-up decomposition¶
Clean-up should be implemented as independent HA singleton periodic tasks with configurable time period, one for each of the introspection timeout and ironic synchronization tasks.
Introspection timeout periodic task¶
To finish introspections that are timing-out:
select nodes for which the introspection is timing out
for each node:
put a request to time-out the introspection on the queue for a worker to process
Ironic synchronization periodic task¶
To remove nodes no longer tracked by Ironic:
select nodes that are kept by Inspector but not kept by Ironic
for each node:
put a request to delete the node on the queue for a worker to process
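Both clean-up tasks follow the same select-then-enqueue pattern. The sketch below assumes in-memory collections in place of the database and the Ironic inventory, and a `queue.Queue` in place of the real queue; the function names and the request tuple layout are hypothetical.

```python
import queue
import time


def timed_out(started_at, timeout, now=None):
    """Return nodes whose introspection started more than `timeout` s ago."""
    now = time.time() if now is None else now
    return sorted(n for n, t in started_at.items() if now - t > timeout)


def stale_nodes(inspector_nodes, ironic_nodes):
    """Return nodes kept by Inspector but no longer kept by Ironic."""
    return sorted(set(inspector_nodes) - set(ironic_nodes))


def enqueue(requests, kind, nodes):
    """Put one request per node on the queue for a worker to process."""
    for node in nodes:
        requests.put((kind, node))
```

The periodic tasks only select nodes and enqueue requests; the state change or removal itself is left to the workers, which hold the node locks.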
HA Reboot Throttle Decomposition¶
As a workaround for some hardware, reboot request rate should be
limited. For this purpose, a single distributed lock instance should
be utilized. At any point in time, only a single worker may hold the
lock while performing the reboot (power-on) task. Upon acquiring the
lock, the reboot state transition sleeps in an interruptible fashion
for a configurable quantum of time. If the sleep was indeed
interrupted, the worker should raise an exception stopping the reboot
procedure and the worker itself. This interruption should happen as
part of the graceful shutdown mechanism. This should be implemented
utilizing the same
SIGTERM flag/event workers use to check for
Process partitioning isn’t a concern here because all workers sleep while holding the lock. Partitioning therefore slows down the reboot pace by the amount of time a lock takes to expire. It should be possible to disable the reboot throttle altogether through the configuration.
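A possible shape of the throttle, sketched under assumptions: a local `threading.Lock` stands in for the single distributed lock, the order of sleeping before powering on is a guess, and `throttled_reboot`, `ShutdownRequested` and the `power_on` callable are hypothetical names.

```python
import threading

reboot_lock = threading.Lock()    # stand-in for the single distributed lock
sigterm_flag = threading.Event()  # process-local shutdown flag


class ShutdownRequested(Exception):
    """Raised when the throttle sleep is interrupted by a shutdown."""


def throttled_reboot(power_on, node, quantum):
    """Power on `node`, holding the lock for at least `quantum` seconds."""
    if quantum <= 0:
        return power_on(node)  # throttle disabled through configuration
    with reboot_lock:
        # Interruptible sleep while holding the lock (assumed ordering).
        if sigterm_flag.wait(quantum):
            raise ShutdownRequested(node)
        return power_on(node)
```

Since every worker sleeps while holding the lock, the overall power-on rate is limited to roughly one node per `quantum` seconds.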
HA Firewall decomposition¶
The PXE boot environment is configured and active on all inspector hosts. The firewall protection of the PXE environment is active on all inspector hosts, blocking the hosts’ PXE service. At any given point in time, at most one inspector host’s PXE service is available, and it is available to all inspected nodes.
The general policy is allow-all, and each node that is not being inspected has a block-exception to the general policy. Due to its size, the black-list is maintained locally on all inspector hosts, pulling items from ironic periodically or asynchronously from a pub–sub channel.
Nodes that are being introspected are white-listed in a separate set of firewall rules. Nodes that are being discovered for the first time fall through the black-list due to the general allow-all black-list policy.
Nodes that the HA firewall is supposed to allow access to the PXE service are kept in a distributed store or obtained asynchronously from a pub–sub channel. Process instance workers add (subtract) firewall rules to (from) the distributed store as necessary or announce the changes on the pub–sub channels. Firewall rules are (port_ID, port_MAC) tuples to be white-/black-listed.
Process instances use custom chains to implement the firewall: the white-list chain and the black-list chain. Falling through the white-list chain, a packet “proceeds” to the black-list chain. Falling through the black-list chain, a packet is allowed to access the PXE service port. A node port rule may be present both in the white-list and the black-list chain at the same time if the node is being introspected.
On start-up, the processes poll Ironic to build their black-list chains for the first time and set up a local periodic Ironic black-list synchronisation task or set callbacks on the black-list pub–sub channel.
Process instances form a distributed firewall management party that they watch for changes. Process instances register a SIGTERM callback that, the signal being delivered, makes the process instance leave the party and reset the firewall, completely blocking their PXE service.
Upon the party shrinkage, processes reset their firewall white-list chain, the pass rule in the black-list chain, and the rule set watch (should they have one set) and participate in a distributed firewall management leader election. Party growth isn’t of concern to the processes.
The leader process’ black-list chain contains the pass rule while the other processes’ black-list chains don’t. Having been elected, the leader process builds the white-list and registers a watch on the distributed store or a white-list pub–sub channel callback in order to keep the white-list firewall chain up-to-date. Other process instances don’t maintain a white-list chain; that chain is empty for them.
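The chain traversal and the leader-only pass rule described above can be modelled as a pure function over sets of MAC addresses. This is only a decision model under assumptions, not firewall code: the function name and arguments are hypothetical, and non-leaders are expected to pass an empty white-list.

```python
def pxe_allowed(mac, white_list, black_list, is_leader):
    """Decide whether a packet from `mac` reaches the PXE service port."""
    if mac in white_list:
        return True    # white-list chain: node is being introspected
    if mac in black_list:
        return False   # black-list chain: known node, not being inspected
    # Fell through both chains: the pass rule exists only on the leader.
    return is_leader
```

A node present in both chains is admitted because the white-list chain is evaluated first, and an unknown MAC address (a node discovered for the first time) is admitted only by the leader.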
Upon any exception (or process instance partitioning), a process resets its firewall to completely protect its PXE service.
Periodic white-list store polling and the white-list pub–sub channel callbacks are mutually optional facilities to enhance the responsiveness of the firewall, and the user may prefer enabling one or the other or both simultaneously as necessary. The same holds for the black-list Ironic polling and the black-list pub–sub channel callbacks.
To assemble the blacklist of MAC addresses, the processes may need to poll the ironic service periodically for node information. A cache/proxy of this information might be kept optionally to reduce the load on Ironic.
The firewall management should be implemented as a separate asynchronous thread in each inspector process instance. The firewall being lost due to a leader failure isn’t a concern: a new leader will eventually be elected. Some nodes being introspected may, however, experience a timeout in the waiting state and fail the introspection.
Periodic Ironic–firewall node synchronization and white-list store polling should be implemented as independent threads with a configurable period, 0<=period<=15s, so that the window between introducing a node to ironic and blacklisting it in the inspector firewall is kept below the user’s resolution.
As an optimization, the implementation may consider offloading the MAC address rules of node ports from firewall chains into IP sets
HA HTTP API Decomposition¶
We assume a Load Balancer (HAProxy) shielding the user from the inspector service. All the inspector API process instances should export the same REST API. Each API Request should be handled in a separate asynchronous thread instance (as is the case now with the Flask framework). At any point in time, any of the process instances may serve any request.
Upon a connection exception or worker process partitioning, the affected entity should retry establishing the connection before announcing failure. The retry count and timeout should be configurable for each of the ironic, database, distributed store, lock and queue services. The timeout should be interruptible, possibly implemented as waiting for sigterm_flag.wait(timeout). Should the retrying fail, the affected entity breaks the inspector worker service altogether, setting the flag, to avoid damage to resources; most of the time, other worker service entities would be equally affected by the partition anyway. The user may consider restarting the affected worker service process instance when the partitioning issue is resolved.
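The retry policy above can be sketched as a small helper. It is a minimal illustration, assuming that connection failures surface as the built-in `ConnectionError` and that `connect` is any callable establishing the connection; `connect_with_retry` is a hypothetical name.

```python
import threading

sigterm_flag = threading.Event()  # process-local shutdown flag


def connect_with_retry(connect, retries, timeout):
    """Call `connect` up to `retries` + 1 times, waiting between attempts."""
    for attempt in range(retries + 1):
        try:
            return connect()
        except ConnectionError:
            # Stop on the last attempt or if a shutdown interrupts the wait.
            if attempt == retries or sigterm_flag.wait(timeout):
                break
    sigterm_flag.set()  # give up: stop the whole worker service
    raise ConnectionError("service unreachable; worker is shutting down")
```

Waiting on the flag instead of sleeping keeps the retry delay interruptible, so a worker that is already shutting down does not linger in the retry loop.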
Partitioning of HTTP API service instances isn’t a concern as those are stateless and accessed through a load balancer.
HA Worker Decomposition¶
We’ve briefly examined the TaskFlow library as an alternative tasking mechanism. Currently, TaskFlow supports only directed acyclic graphs as the dependency structure between particular steps. The Inspector service, however, has to support restarting the introspection for a particular node, bringing loops into the graph; see transition-function. Moreover, TaskFlow does not support propagating external events to a running flow, such as the continue call from the bare metal node. Because of that, the overall state of the introspection of a particular node has to be maintained explicitly if TaskFlow is adopted. TaskFlow, too, requires tasks to be reentrant/idempotent.
Data model impact¶
A state transition request item is introduced; it should contain these attributes (as an oslo.versioned object):
A clean-up request item is introduced removing a node. Attributes comprising the request:
Two channels are introduced: firewall white-list and black-list. The message format is as follows:
port ID, MAC address
Node state column is introduced to the node table.
HTTP API impact¶
API service is provided by dedicated processes.
Client (CLI) impact¶
Performance and scalability impact¶
We hope this change brings in desired redundancy and scaling for the inspector service. We however expect the change to have a negative network utilization impact as the introspection task requires a queue and a DLM to coordinate.
The inspector firewall facility requires periodic polling of the ironic service inventory in each inspector instance. Therefore we expect increased load on the ironic service.
Firewall facility leader partitioning causes a boot service outage for the election period. Some nodes may therefore time out while booting.
Each time the firewall leader updates the host’s firewall, node information is polled from the ironic service. This may introduce delays in firewall availability. If a node being introspected is removed from the ironic service, the change will not propagate to Inspector until the introspection finishes.
New services introduced that might require hardening and protection:
distributed locking facility
Inspector Service Configuration¶
distributed locking facility, queue, firewall pub–sub channels and load balancer introduce new configuration options, especially URLs/hosts and credentials
worker pool size, integral,
queue.get(timeout); 0.0s<timeout; timeout.default==3.0s
clean-up introspection report expiration threshold
clean-up introspection time-out threshold
ironic firewall black-list synchronization polling period
0.0s<=period<=30.0s; period.default==15.0s; period==0.0 to disable
firewall white-list store watcher polling period
0.0s<=period<=30.0s; period.default==15.0s; period==0.0 to disable
bare metal reboot throttle,
0.0s<=value; value.default==0.0s, disabling this feature altogether
for each of the ironic service, database, distributed locking facility and the queue, a connection retry count and connection retry timeout should be configured
all inspector hosts should share same configuration, save for the update situation
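The numeric constraints listed above can be captured in a small validated structure. This is a sketch only: the option names are hypothetical and not proposed configuration keys, while the ranges and defaults are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAOptions:
    """Hypothetical option names; ranges and defaults follow the list above."""

    queue_get_timeout: float = 3.0       # 0.0s < timeout
    blacklist_sync_period: float = 15.0  # 0.0s <= period <= 30.0s
    whitelist_poll_period: float = 15.0  # 0.0s <= period <= 30.0s
    reboot_throttle: float = 0.0         # 0.0s <= value

    def __post_init__(self):
        if not self.queue_get_timeout > 0.0:
            raise ValueError("queue_get_timeout must be positive")
        for name in ("blacklist_sync_period", "whitelist_poll_period"):
            if not 0.0 <= getattr(self, name) <= 30.0:
                raise ValueError(f"{name} must be within 0.0s..30.0s")
        if self.reboot_throttle < 0.0:
            raise ValueError("reboot_throttle must not be negative")

    def enabled(self, name):
        """A period or throttle of 0.0 switches the facility off."""
        return getattr(self, name) > 0.0
```

Validating at construction time means a misconfigured host fails at start-up rather than during an introspection.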
New services and minimal Topology¶
floating IP address shared by load balancers
load balancers, wired for redundancy
WSGI HTTP API instances (httpd), addressed by load balancers in a round-robin fashion
3 inspector hosts each running a worker process instance, dnsmasq instance and iptables
distributed synchronization facility hosts, wired for redundancy, accessed by all inspector workers
queue hosts, wired for redundancy, accessed by all API instances and workers
database cluster, wired for redundancy, accessed by all API instances and workers
NTP set up and configured for all the services
Please note, all inspector hosts require access to the PXE LAN for bare metal nodes to boot.
Considering service update, we suggest the following procedure to be adopted for each inspector host, one at a time:
HTTP API services:
remove selected host from the load balancer service
stop the HTTP API service on the host
upgrade the service and configuration files
start the HTTP API service on the host
enroll the host to the load balancer service
for each worker host:
stop the worker service instance on the host
update the worker service and configuration files
start the worker service
Shutting down the inspector worker service may hang for some time due
to worker threads executing a long synchronous procedure or waiting in
queue.get(timeout) method while polling for new task.
This approach may lead to introspection (task) failures for nodes that are being handled on inspector host under update. Especially changes of the transition function (new states etc) may induce introspection errors. Ideally, the update should therefore happen with no ongoing introspections. Failed node introspections may be restarted.
A couple of periodic task “instances” may be lost due to the updated leader partitioning each time a host is updated. The HA firewall may be lost for the leader election period each time a host is updated; the expected delay should be less than 10 seconds so that booting of inspected nodes isn’t affected.
Upgrade from non-HA Inspector Service¶
Because the non-HA inspector service is a single-process entity and because the HA services aren’t internally backwards compatible with it (to allow taking-over running node inspections), to perform an upgrade, the non-HA service has to be stopped first while no inspections are ongoing. Data migration is necessary before the upgrade. As the new services require the queue and the DLM for their operation those have to be introduced before the upgrade. The worker services have to be started before HTTP API services. Having started, the HTTP API services have to be introduced to the load balancer.
We consider the following implementations for the facilities we rely on:
load balancer: HAProxy
queue: Oslo messaging
pub–sub firewall channels: Oslo messaging
store: a database service
distributed synchronization facility: Tooz
HTTP API service: WSGI and httpd
replace current locking with Tooz DLM
introduce state machine
split API service and introduce conductors and queue
split cleaning into a separate timeout and synchronization handlers and introduce leader-election to these periodic procedures
introduce leader-election to the firewall facility
introduce the pub–sub channels to the firewall facility
We require proper inspector grenade testing before landing HA so we avoid breaking users as much as possible.
All work items should be tested as separate patches both with functional and unit tests as well as upgrade tests with Grenade.
Having landed all the required work items it should be possible to test Inspector with focus on redundancy and scaling.
During the analysis process we considered these blueprints:
Joshua Harlow’s comment that Tooz should implement the at-least-once semantics not Oslo.messaging