Splitting Inspector into an API and a Worker¶
https://bugs.launchpad.net/ironic-inspector/+bug/1525218
This work is part of the High Availability for Ironic Inspector spec. One of the items needed to achieve inspector HA and scalability is splitting the single inspector service into API and worker services. This spec focuses on a detailed description of the essential part of that work: the internal communication between these services.
Problem description¶
inspector is a monolithic service consisting of the API, background processing, the firewall and the DHCP management. As a result, inspector is not capable of dealing well with a sizeable number of ironic bare metal nodes and does not fit large-scale deployments. Introducing new services to solve these issues also brings some complexity, as it requires a mechanism for internal communication between the services.
Proposed change¶
Node introspection is a sequence of asynchronous tasks. A task can be described as an FSM transition of the inspector state machine [1], triggered by events such as:
starting(wait) -> waiting
waiting(process) -> processing
waiting(timeout) -> error
processing(finish) -> finished
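For illustration, here is a minimal sketch of these transitions using the automaton library commonly used for OpenStack state machines; the state and event names mirror the list above, while the actual inspector implementation may differ:

    # A sketch only: states and events follow the transition list above.
    from automaton import machines

    fsm = machines.FiniteMachine()
    for state in ('starting', 'waiting', 'processing'):
        fsm.add_state(state)
    for state in ('finished', 'error'):
        fsm.add_state(state, terminal=True)

    fsm.add_transition('starting', 'waiting', 'wait')
    fsm.add_transition('waiting', 'processing', 'process')
    fsm.add_transition('waiting', 'error', 'timeout')
    fsm.add_transition('processing', 'finished', 'finish')

    fsm.initialize('starting')
    fsm.process_event('wait')        # starting(wait) -> waiting
    fsm.process_event('process')     # waiting(process) -> processing
    fsm.process_event('finish')      # processing(finish) -> finished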
API requests that are executed in the background can be considered asynchronous tasks. It is these tasks that allow splitting the service into API and Worker parts, the former creating tasks and the latter consuming them. The communication between these service parts requires a medium, the queue, and together these three subjects comprise the message queue paradigm. OpenStack projects use the oslo.messaging library, typically backed by messaging middleware speaking the open AMQP standard. This messaging layer enables services that run on multiple servers to talk to each other.
Each inspector worker provides a pool of worker threads that get state transition requests from the API service via the queue. The API service invokes methods on workers, and each invocation eventually becomes a task. In other words, there is the client role, carried out by the API service, and the server role, carried out by the worker threads respectively. Servers make oslo.messaging RPC interfaces available to clients.
Client - inspector API¶
inspector API will implement a simple oslo.messaging client, which will connect to the messaging transport and send messages carrying state transition events.
There are two ways that a method can be invoked, see [2]:
cast - the method is invoked asynchronously and no result is returned to the caller.
call - the method is invoked synchronously and the result is returned to the caller.
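For illustration, a minimal oslo.messaging client sketch; the topic, method names and arguments are placeholders for this spec, not a final interface:

    import oslo_messaging as messaging
    from oslo_config import cfg

    # Load the transport configured for the service (e.g. RabbitMQ).
    transport = messaging.get_rpc_transport(cfg.CONF)

    # Placeholder topic; the real inspector topic is an implementation detail.
    target = messaging.Target(topic='ironic-inspector-worker')
    client = messaging.RPCClient(transport, target)

    context = {}  # request context, empty for illustration

    # cast: asynchronous, no result is returned to the caller.
    client.cast(context, 'inspect', node_id='<node uuid>')

    # call: synchronous, blocks until the worker returns a result.
    status = client.call(context, 'get_status', node_id='<node uuid>')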
inspector endpoints that invoke RPC:
| Method | RPC type | API | Worker |
|---|---|---|---|
| POST /introspection/<node_id> | cast | check provision state, validate power interface, set starting state, <RPC> cast inspect | add lookup attributes, update pxe filters, set pxe as boot device, reboot node, set waiting state |
| POST /continue | cast | node lookup, check provision state, <RPC> cast process | set processing state, run processing hooks, apply rules, update pxe filters, save introspection data, power off node, set finished state |
| POST /introspection/<node_id>/abort | cast | find node in cache, <RPC> cast abort | force power off, update pxe filters, set error state |
| POST /introspection/<node_id>/data/unprocessed | cast | find node in cache, <RPC> cast reapply | get introspection data, set reapplying state, run processing hooks, save introspection data, apply rules, set finished state |
The resulting workflow for introspection looks like:
Client API Worker Node Ironic
+ + + + +
| <HTTP>Start | | | |
+--inspection---> | | |
| X Validate power| | |
| X interface, | | |
| X initiate task | | |
| X for inspection| | |
| X | | |
| X <RPC> Cast | | |
X +- inspection---> | |
X | X Update pxe | |
X | X filters, set | |
X | X lookup attrs | |
X | X | |
X | X <HTTP> Set pxe | |
X | +-boot dev,reboot+--------------->
X | | | Reboot |
X | | <---------------+
X | | DHCP, boot, X |
X | | Collect data X |
X | | X |
X |Send inspection data to inspector |
X <---------------+----------------+ |
X X Node lookup, | | |
X X verify collect| | |
X X failures | | |
X X | | |
X X <RPC> Cast | | |
X +-process data--> | |
X | X Run process | |
X | X hooks, apply | |
X | X rules, update | |
X | X filters | |
X | X <HTTP> Set power off |
X | +----------------+--------------->
X | | | Power off |
X + + <------------- +
Server - inspector worker¶
An RPC server exposes inspector endpoints containing a set of methods, which may be invoked remotely by clients over a given transport. The transport driver will be loaded according to the user's messaging configuration. See [3] for more details on configuration options.
An inspector worker will implement a separate oslo.service process with its own pool of green threads. The worker will periodically consume and handle messages from the clients.
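A minimal sketch of such a server, reusing the placeholder topic and method from the client sketch above; the endpoint class and server name are hypothetical:

    import oslo_messaging as messaging
    from oslo_config import cfg


    class WorkerEndpoint(object):
        """Placeholder endpoint exposing state transition methods."""

        def inspect(self, ctxt, node_id):
            # Run the introspection task for the node (body omitted).
            pass


    transport = messaging.get_rpc_transport(cfg.CONF)
    # 'server' identifies this particular worker on the shared topic.
    target = messaging.Target(topic='ironic-inspector-worker',
                              server='worker-host-1')
    server = messaging.get_rpc_server(transport, target, [WorkerEndpoint()],
                                      executor='eventlet')
    server.start()
    server.wait()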
RPC reliability¶
For each message sent by the client via cast (asynchronously), an acknowledgement is sent back immediately and the message is removed from the queue. As a result, there is no guarantee that a worker will handle the introspection task.
This model, known as at-most-once delivery, does not guarantee message processing for asynchronous tasks if the worker processing them dies. Supporting HA may require some additional functionality to confirm that a task message was processed.
If a worker dies (its connection is closed or lost) while processing inspection data, the task request message will disappear and the introspection task will hang in the processing state until the timeout happens.
Alternatives¶
Implement our own Publisher/Consumer functionality with the Kombu library. This approach has some benefits:
supports at-least-once delivery semantics. For each message retrieved and handled by a consumer, an acknowledgement is sent back to the message producer. If this acknowledgement is not received within a certain amount of time, the message is resent:
API                         Worker thread
 +                               +
 |                               |
 +------------------------------>+
 |                               +--------+
 |                               | Process|
 |                               | Request|
 |                               |        |
 |                               +<-------+
 |              ACK              |
 +<------------------------------+
 |                               |
 +                               +
If a consumer dies without sending an ack, the message wasn't processed, and if there are other consumers online at the same time, the message will be reprocessed.
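A minimal Kombu consumer sketch with manual acknowledgements illustrating this pattern; the exchange, queue, broker URL and handler are hypothetical:

    from kombu import Connection, Exchange, Queue

    task_exchange = Exchange('inspector', type='direct')
    task_queue = Queue('introspection', task_exchange,
                       routing_key='introspection')

    def on_message(body, message):
        handle_introspection(body)  # hypothetical task handler
        # Ack only after successful processing: if the consumer dies
        # before this point, the broker redelivers the message.
        message.ack()

    with Connection('amqp://guest:guest@localhost//') as connection:
        with connection.Consumer(task_queue, callbacks=[on_message]):
            while True:
                connection.drain_events()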
On the other hand, this approach has considerable drawbacks:
Implementing our own Publisher/Consumer means the complexity of supporting new functionality ourselves, and a lack of supported backends, such as 0MQ, compared to oslo.messaging.
Worse deployer UX. Message backend configuration in inspector would differ from other services (including ironic), which brings some pain to deployers.
Data model impact¶
None
HTTP API impact¶
The /continue endpoint will return 202 ACCEPTED instead of 200 OK.
Client (CLI) impact¶
None
Ironic python agent impact¶
None
Performance and scalability impact¶
The proposed change will allow users to scale ironic-inspector, both the API and Worker services, horizontally after some further work in the future; for more details refer to High Availability for Ironic Inspector.
Security impact¶
The newly introduced services require additional protection. The messaging service used as the transport layer, e.g. RabbitMQ, should rely on transport-level cryptography; see [4] for more details.
Deployer impact¶
The newly introduced message bus layer will require a message broker to connect the inspector API and workers. The most popular broker implementation used in OpenStack installations is RabbitMQ; see [5] for more details.
To achieve resiliency, multiple API service and worker service instances should be deployed on multiple physical hosts.
There are also new configuration options being added, see [3].
Developer impact¶
Developers will need to consider the new architecture and the inspector API and Worker communication details when adding new features that are required to be handled as background tasks.
Upgrades and Backwards Compatibility¶
The current inspector service is a single process, so deployers might need to add more services: the newly added inspector Worker and the messaging transport backend (RabbitMQ). The console script ironic-inspector could be changed to run both API and Worker services with an in-memory backend for the messaging transport. This allows running ironic-inspector in a backward-compatible manner: both services on a single host without the message broker.
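A sketch of how the combined script could select the in-memory transport (oslo.messaging's fake driver); the exact wiring is an assumption, not part of this spec:

    import oslo_messaging as messaging
    from oslo_config import cfg

    # 'fake://' selects oslo.messaging's in-memory transport; it only
    # works when the API and the Worker run inside the same process.
    transport = messaging.get_rpc_transport(cfg.CONF, url='fake://')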
Dependencies¶
None
Testing¶
All new functionality will be tested with both functional and unit tests. The already running Tempest tests, as well as upgrade tests with Grenade, will also cover the added features.
Functional tests run both Inspector API and Worker with an in-memory backend.
Having all the work items done will eventually allow setting up a multi-node devstack and testing Inspector in cluster mode.