Splitting Inspector into an API and a Worker

https://bugs.launchpad.net/ironic-inspector/+bug/1525218

This work is part of the High Availability for Ironic Inspector spec. One of the steps towards inspector HA and scalability is splitting the single inspector service into separate API and worker services. This spec focuses on the essential part of that work: the internal communication between these services.

Problem description

inspector is a monolithic service consisting of the API, background processing, the firewall and DHCP management. As a result, inspector cannot cope well with a sizeable number of ironic bare metal nodes and doesn’t fit large-scale deployments. Introducing new services to solve these issues also brings some complexity, as it requires a mechanism for communication between the services.

Proposed change

Node introspection is a sequence of asynchronous tasks. A task can be described as a transition of the inspector FSM [1] triggered by an event. Examples, written as state(event) -> new state:

  • starting(wait) -> waiting

  • waiting(process) -> processing

  • waiting(timeout) -> error

  • processing(finish) -> finished
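The transitions listed above can be sketched as a lookup table keyed by the current state and the event. This is only an illustration covering the four transitions named in this spec; the names TRANSITIONS, InvalidTransition and next_state are hypothetical and do not come from the inspector code base.

```python
# Transition table: (current state, event) -> next state.
# Only the four transitions listed in this spec are included.
TRANSITIONS = {
    ("starting", "wait"): "waiting",
    ("waiting", "process"): "processing",
    ("waiting", "timeout"): "error",
    ("processing", "finish"): "finished",
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def next_state(state, event):
    """Return the state reached from `state` when `event` fires."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(
            f"event {event!r} is not valid in state {state!r}"
        ) from None


# Walk the successful introspection path.
state = "starting"
for event in ("wait", "process", "finish"):
    state = next_state(state, event)
print(state)  # finished
```

Each background task in the rest of this spec corresponds to one such lookup, followed by the side effects attached to the new state.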

API requests that are executed in the background can be considered asynchronous tasks. It is these tasks that allow splitting the service into API and Worker parts, the former creating tasks and the latter consuming them. Communication between these parts requires a medium, the queue; together, these three components make up the message queue paradigm. OpenStack projects use AMQP, an open standard for messaging middleware, and access it through the oslo.messaging library, which enables services running on multiple servers to talk to each other.

Each inspector worker provides a pool of worker threads that receive state transition requests from the API service via the queue. The API service invokes methods on workers, and each invocation eventually becomes a task. In other words, the API service acts as the client and the worker thread acts as the server. Servers make oslo.messaging RPC interfaces available to clients.

Client - inspector API

The inspector API will implement a simple oslo.messaging client, which will connect to the messaging transport and send messages carrying state transition events.

There are two ways that a method can be invoked, see [2]:
  • cast - the method is invoked asynchronously and no result is returned to the caller.

  • call - the method is invoked synchronously and the result is returned to the caller.
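The difference between the two invocation styles can be illustrated with a standard-library stand-in for the message queue. This sketch does not use oslo.messaging; the functions cast and call below only mimic the semantics described above, and the handler table and message layout are assumptions made for the example.

```python
import queue
import threading

requests = queue.Queue()  # stand-in for the message queue


def worker():
    """Consume (method, args, reply_queue) messages until a None sentinel."""
    handlers = {"inspect": lambda node_id: f"inspecting {node_id}"}
    while True:
        message = requests.get()
        if message is None:
            break
        method, args, reply = message
        result = handlers[method](**args)
        if reply is not None:  # only `call` carries a reply queue
            reply.put(result)


def cast(method, **args):
    """Asynchronous: enqueue the request and return immediately."""
    requests.put((method, args, None))


def call(method, timeout=5, **args):
    """Synchronous: enqueue the request and block until the worker replies."""
    reply = queue.Queue(maxsize=1)
    requests.put((method, args, reply))
    return reply.get(timeout=timeout)


thread = threading.Thread(target=worker)
thread.start()
cast_result = cast("inspect", node_id="node-1")  # nothing comes back
call_result = call("inspect", node_id="node-2")  # waits for the result
requests.put(None)  # stop the worker
thread.join()
```

With cast the caller learns nothing about the outcome, which is why every endpoint in this spec that uses it must report progress through the node state instead.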

inspector endpoints that invoke RPC:

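+-------------------------------------------+----------+--------------------------------+----------------------------------+
| Method                                    | RPC type | API                            | Worker                           |
+===========================================+==========+================================+==================================+
| POST /introspection/<node_id>             | cast     | check provision state,         | add lookup attributes, update    |
|                                           |          | validate power interface,      | pxe filters, set pxe as boot     |
|                                           |          | set starting state,            | device, reboot node, set         |
|                                           |          | <RPC> cast inspect             | waiting state                    |
+-------------------------------------------+----------+--------------------------------+----------------------------------+
| POST /continue                            | cast     | node lookup, check provision   | set processing state, run        |
|                                           |          | state, <RPC> cast process      | processing hook, apply rules,    |
|                                           |          |                                | update pxe filters, save         |
|                                           |          |                                | introspection data, power off    |
|                                           |          |                                | node, set finished state         |
+-------------------------------------------+----------+--------------------------------+----------------------------------+
| POST /introspection/<node_id>/abort       | cast     | find node in cache,            | force power off, update pxe      |
|                                           |          | <RPC> cast abort               | filters, set error state         |
+-------------------------------------------+----------+--------------------------------+----------------------------------+
| POST /introspection/<id>/data/unprocessed | cast     | find node in cache,            | get introspection data, set      |
|                                           |          | <RPC> cast reapply             | reapplying state, run processing |
|                                           |          |                                | hooks, save introspection data,  |
|                                           |          |                                | apply rules, set finished state  |
+-------------------------------------------+----------+--------------------------------+----------------------------------+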

The resulting workflow for introspection looks like:

Client           API            Worker           Node           Ironic
  +               +               +                +               +
  | <HTTP>Start   |               |                |               |
  +--inspection--->               |                |               |
  |               X Validate power|                |               |
  |               X interface,    |                |               |
  |               X initiate task |                |               |
  |               X for inspection|                |               |
  |               X               |                |               |
  |               X  <RPC> Cast   |                |               |
  X               +- inspection--->                |               |
  X               |               X Update pxe     |               |
  X               |               X filters, set   |               |
  X               |               X lookup attrs   |               |
  X               |               X                |               |
  X               |               X <HTTP> Set pxe |               |
  X               |               +-boot dev,reboot+--------------->
  X               |               |                |     Reboot    |
  X               |               |                <---------------+
  X               |               |    DHCP, boot, X               |
  X               |               |   Collect data X               |
  X               |               |                X               |
  X               |Send inspection data to inspector               |
  X               <---------------+----------------+               |
  X               X Node lookup,  |                |               |
  X               X verify collect|                |               |
  X               X failures      |                |               |
  X               X               |                |               |
  X               X   <RPC> Cast  |                |               |
  X               +-process data-->                |               |
  X               |               X Run process    |               |
  X               |               X hooks, apply   |               |
  X               |               X rules, update  |               |
  X               |               X filters        |               |
  X               |               X     <HTTP> Set power off       |
  X               |               +----------------+--------------->
  X               |               |                |  Power off    |
  X               +               +                <-------------  +

Server - inspector worker

An RPC server exposes inspector endpoints containing a set of methods, which may be invoked remotely by clients over a given transport. The transport driver will be loaded according to the user’s messaging configuration. See [3] for more details on configuration options.

An inspector worker will be implemented as a separate oslo.service process with its own pool of green threads. The worker will periodically consume and handle messages from the clients.
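A rough sketch of how a worker might dispatch incoming messages to endpoint methods is shown here. It uses only the standard library: a ThreadPoolExecutor with native threads stands in for the green thread pool, a queue.Queue stands in for the transport, and WorkerEndpoint, serve and the message layout are hypothetical names chosen for the example rather than the oslo.messaging API.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class WorkerEndpoint:
    """Hypothetical endpoint: each public method may be invoked remotely."""

    def __init__(self):
        self.states = {}

    def inspect(self, node_id):
        self.states[node_id] = "waiting"

    def abort(self, node_id):
        self.states[node_id] = "error"


def serve(endpoint, messages, pool_size=4):
    """Dispatch queued (method, kwargs) messages to endpoint methods."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while True:
            message = messages.get()
            if message is None:  # sentinel: stop consuming
                break
            method, kwargs = message
            handler = getattr(endpoint, method, None)
            if method.startswith("_") or not callable(handler):
                continue  # ignore private or unknown method names
            pool.submit(handler, **kwargs)
    # Leaving the `with` block waits for in-flight tasks to finish.


messages = queue.Queue()
messages.put(("inspect", {"node_id": "node-1"}))
messages.put(("abort", {"node_id": "node-2"}))
messages.put(None)
endpoint = WorkerEndpoint()
serve(endpoint, messages)
print(endpoint.states)
```

Handing each message to the pool keeps the consuming loop free to pick up the next request while earlier tasks are still running.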

RPC reliability

For each message sent by the client via cast (asynchronously), an acknowledgement is sent back immediately and the message is removed from the queue. As a result, there is no guarantee that a worker will handle the introspection task.

This model, known as at-most-once delivery, doesn’t guarantee that an asynchronous task is processed if the worker handling it dies. Supporting HA may require some additional functionality to confirm that the task message was processed.

If a worker dies (its connection is closed or lost) while processing inspection data, the task request message will be lost and the introspection task will hang in the processing state until the timeout expires.
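The timeout that eventually clears a hung task could look roughly like the following periodic check. This is a sketch under assumptions: the task record layout, the function name expire_stale and the choice to move timed-out tasks to the error state from either waiting or processing are illustrative, not taken from the inspector implementation.

```python
import time

ACTIVE_STATES = ("waiting", "processing")


def expire_stale(tasks, timeout, now=None):
    """Mark active tasks older than `timeout` seconds as failed.

    `tasks` maps node id -> {"state": str, "started_at": float}.
    Returns the ids of the tasks that were moved to the error state.
    """
    now = time.monotonic() if now is None else now
    expired = []
    for node_id, task in tasks.items():
        too_old = now - task["started_at"] > timeout
        if task["state"] in ACTIVE_STATES and too_old:
            task["state"] = "error"
            expired.append(node_id)
    return expired


tasks = {
    "node-1": {"state": "processing", "started_at": 0.0},   # worker died
    "node-2": {"state": "processing", "started_at": 50.0},  # still in time
}
expired = expire_stale(tasks, timeout=60, now=100.0)
print(expired)  # ['node-1']
```

Such a check bounds how long a lost task stays visible as in progress, but it does not recover the lost work.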

Alternatives

Implement our own Publisher/Consumer functionality with the Kombu library. This approach has some benefits:

  • support for at-least-once delivery semantics. For each message retrieved and handled by a consumer, an acknowledgement is sent back to the message producer. If this acknowledgement is not received within a certain amount of time, the message is resent:

    API               Worker thread
     +                      +
     |                      |
     +--------------------->+
     |                      |
     |             +--------+
     |             |        |
     |      Process|        |
     |      Request|        |
     |             |        |
     |             +------->+
     |         ACK          |
     +<---------------------+
     |                      |
     +                      +
    

    If a consumer dies without sending an ack, the message is considered unprocessed and, if other consumers are online at the same time, it will be redelivered to one of them.
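The redelivery behaviour described in this alternative can be illustrated with a toy in-memory broker. This sketch does not use Kombu; the Broker class and its methods are hypothetical, and redelivery is triggered explicitly here, whereas a real broker would trigger it on connection loss or on an acknowledgement timeout.

```python
from collections import deque


class Broker:
    """Toy broker that keeps each message until it is acknowledged."""

    def __init__(self):
        self.ready = deque()  # messages waiting for a consumer
        self.unacked = {}     # delivery tag -> message handed out
        self._next_tag = 0

    def publish(self, body):
        self.ready.append(body)

    def get(self):
        """Hand out the next message together with a delivery tag."""
        body = self.ready.popleft()
        self._next_tag += 1
        self.unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        """The consumer finished processing: forget the message."""
        del self.unacked[tag]

    def requeue_unacked(self):
        """A consumer was lost: make its messages available again."""
        while self.unacked:
            _, body = self.unacked.popitem()
            self.ready.appendleft(body)


broker = Broker()
broker.publish({"method": "process", "node_id": "node-1"})

# The first consumer takes the message and dies before acknowledging it.
tag, body = broker.get()
broker.requeue_unacked()

# A second consumer receives the same message, processes it and acks.
processed = []
tag, body = broker.get()
processed.append(body["node_id"])
broker.ack(tag)
```

Because a message can be delivered more than once under this model, the handlers would need to tolerate repeated processing of the same request.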

On the other hand, this approach has considerable drawbacks:

  • Implementing our own Publisher/Consumer. This means the complexity of supporting new functionality, and fewer supported backends than oslo.messaging offers (0MQ, for example).

  • Worse deployer’s UX. Message backend configuration in inspector will differ from other services (including ironic), which brings some pain to deployers.

Data model impact

None

HTTP API impact

The /continue endpoint will return 202 Accepted instead of 200 OK.

Client (CLI) impact

None

Ironic python agent impact

None

Performance and scalability impact

The proposed change will allow users to scale ironic-inspector, both API and Worker, horizontally after some more work in the future. For more details, refer to High Availability for Ironic Inspector.

Security impact

The newly introduced services require additional protection. The messaging service used as the transport layer, e.g. RabbitMQ, should rely on transport-level cryptography; see [4] for more details.

Deployer impact

The newly introduced message bus layer will require some message broker to connect the inspector API and workers. The most popular broker implementation used in OpenStack installations is RabbitMQ, see [5] for more details.

To achieve resiliency, multiple API service and worker service instances should be deployed on multiple physical hosts.

New configuration options are also being added, see [3].

Developer impact

Developers will need to consider the new architecture and the communication between the inspector API and Worker when adding new features that must be handled as background tasks.

Upgrades and Backwards Compatibility

The current inspector service is a single process, so deployers might need to add more services: the newly added inspector Worker and the messaging transport backend (RabbitMQ). The ironic-inspector console script could be changed to run both the API and Worker services with an in-memory backend for the messaging transport. This would allow running ironic-inspector in a backward-compatible manner, with both services on a single host and no message broker.

Implementation

Assignee(s)

  • aarefiev (Anton Arefiev)

Work Items

  • Add base service functionality;

  • Introduce Client/Servers workers;

  • Implement API/Worker managers;

  • Split service into API and Worker;

  • Implement support for these services in Devstack;

  • Use WSGI [6] to implement the API service.

Dependencies

None

Testing

All new functionality will be tested with both functional and unit tests. The existing Tempest tests, as well as upgrade tests with Grenade, will also cover the added features.

Functional tests run both Inspector API and Worker with an in-memory backend.

Completing all the work items will eventually make it possible to set up a multi-node devstack and test Inspector in cluster mode.