Add maintenance reason field

https://blueprints.launchpad.net/ironic/+spec/maintenance-reason

When a node is put into maintenance (manually or automatically), Ironic and the operator should know why.

Problem description

Ironic can put a node into “maintenance” mode, so that it is ignored for the purposes of scheduling and verifying state. However:

  • When Ironic automatically puts a node into maintenance mode, it sets the reason in the last_error field, which may get overwritten by other tasks later.

  • When an operator manually puts a node into maintenance mode, they have no way to record why, either for other operators or as a reminder to themselves later.

Proposed change

The following should be enough to solve this problem:

  • A maintenance_reason field should be added to the nodes table, as the canonical place to store the reason the node was put into maintenance mode. This should be an internal attribute not directly editable by calling the node.update API.

  • A new API endpoint should be added to make maintenance mode easier to manage. This endpoint can turn maintenance mode on, with an optional reason, or turn it off, which clears the reason. Changing maintenance mode using the old methods should still be allowed for backwards compatibility.

  • The node.update API should be modified to clear the maintenance reason when it is used to turn maintenance mode off.
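The intended behaviour of the three changes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ironic code: the Node class and the set_maintenance, clear_maintenance and node_update functions are hypothetical stand-ins for the real node object, the new endpoint and the existing update path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an Ironic node record."""
    uuid: str
    maintenance: bool = False
    maintenance_reason: Optional[str] = None


def set_maintenance(node: Node, reason: Optional[str] = None) -> None:
    """Turn maintenance mode on, recording an optional reason."""
    node.maintenance = True
    node.maintenance_reason = reason


def clear_maintenance(node: Node) -> None:
    """Turn maintenance mode off and drop any stored reason."""
    node.maintenance = False
    node.maintenance_reason = None


def node_update(node: Node, **changes) -> None:
    """Sketch of the legacy update path.

    maintenance_reason is treated as internal, so direct edits are
    rejected. Turning maintenance off clears the reason so that it
    cannot go stale.
    """
    if "maintenance_reason" in changes:
        raise ValueError("maintenance_reason is not directly editable")
    for key, value in changes.items():
        if not hasattr(node, key):
            raise AttributeError(key)
        setattr(node, key, value)
    if changes.get("maintenance") is False:
        node.maintenance_reason = None
```

With this sketch, calling set_maintenance(node, "bad disk") followed by node_update(node, maintenance=False) leaves the node with maintenance off and no reason stored, which is the backwards-compatible behaviour the last bullet asks for.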

Alternatives

Alternatively, operators could store this in another system, such as a CMDB.

While this would work, it would not allow Ironic to automatically set a maintenance reason when it puts a node into maintenance mode itself. Work would need to be done to make Ironic notify the operator or integrate with the other system, and the operator might still have to enter the reason in the other system by hand.

Data model impact

This will add a maintenance_reason field to the nodes table, with an accompanying database migration. This field will default to NULL, which will also be the value when no reason was given, or when the maintenance reason has been cleared by taking the node out of maintenance mode.
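As a rough illustration of the data model change, the sketch below uses Python's standard sqlite3 module rather than Ironic's actual migration tooling. The two-column nodes table and the migrate and reason_for helpers are assumptions made for the example; only the maintenance_reason column and its NULL default come from this spec.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a toy 'nodes' table that predates the new column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes ("
        "uuid TEXT PRIMARY KEY, "
        "maintenance INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO nodes (uuid) VALUES ('node-1')")
    return conn


def migrate(conn: sqlite3.Connection) -> None:
    """Add the nullable maintenance_reason column; existing rows get NULL."""
    conn.execute(
        "ALTER TABLE nodes ADD COLUMN maintenance_reason TEXT DEFAULT NULL"
    )


def reason_for(conn: sqlite3.Connection, uuid: str):
    """Return the stored reason for a node, or None if there is none."""
    row = conn.execute(
        "SELECT maintenance_reason FROM nodes WHERE uuid = ?", (uuid,)
    ).fetchone()
    return row[0]


conn = make_db()
migrate(conn)
before = reason_for(conn, "node-1")  # NULL for rows that existed already

# Entering maintenance with a reason, then leaving it, returns to NULL.
conn.execute(
    "UPDATE nodes SET maintenance = 1, maintenance_reason = ? WHERE uuid = ?",
    ("bad disk", "node-1"),
)
during = reason_for(conn, "node-1")
conn.execute(
    "UPDATE nodes SET maintenance = 0, maintenance_reason = NULL "
    "WHERE uuid = ?",
    ("node-1",),
)
after = reason_for(conn, "node-1")
```

Because the column is nullable and defaults to NULL, existing rows need no backfill when the migration runs.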

REST API impact

One new endpoint will be added, with two methods:

  • PUT /v1/nodes/<uuid>/maintenance

    • Puts a node into maintenance mode, with an optional reason.

    • Method type: PUT

    • Normal response code: 202

    • Expected errors:

      • 404 if the node with <uuid> does not exist.

      • 400 if a conductor for the node’s driver cannot be found.

    • URL: /v1/nodes/<uuid>/maintenance

    • URL parameters: None.

    • JSON body: {"reason": "Some reason."}, or {} or an empty body for no reason.

    • Response body is empty if successful.

  • DELETE /v1/nodes/<uuid>/maintenance

    • Takes a node out of maintenance mode and clears the reason.

    • Method type: DELETE

    • Normal response code: 202

    • Expected errors:

      • 404 if the node with <uuid> does not exist.

      • 400 if a conductor for the node’s driver cannot be found.

    • URL: /v1/nodes/<uuid>/maintenance

    • URL parameters: None.

    • JSON body: None.

    • Response body is empty if successful.

The maintenance_reason field should be added to the node details API.
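The two requests described above could be built as follows. This is a minimal sketch using only the Python standard library; the helper names and the base URL are placeholders chosen for the example, authentication headers are omitted, and the requests are constructed but not sent.

```python
import json
import urllib.request
from typing import Optional


def maintenance_on_request(
    base_url: str, node_uuid: str, reason: Optional[str] = None
) -> urllib.request.Request:
    """Build the PUT request that puts a node into maintenance mode."""
    # An empty JSON object means "no reason", as described above.
    body = {"reason": reason} if reason is not None else {}
    return urllib.request.Request(
        url=f"{base_url}/v1/nodes/{node_uuid}/maintenance",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def maintenance_off_request(
    base_url: str, node_uuid: str
) -> urllib.request.Request:
    """Build the DELETE request that takes a node out of maintenance."""
    return urllib.request.Request(
        url=f"{base_url}/v1/nodes/{node_uuid}/maintenance",
        method="DELETE",
    )


# Placeholder endpoint and node identifier, for illustration only.
BASE = "http://ironic.example.com:6385"
on_req = maintenance_on_request(BASE, "1234", "bad disk")
off_req = maintenance_off_request(BASE, "1234")
```

Passing either object to urllib.request.urlopen would send it; per the description above, a successful call returns 202 with an empty body, while a 404 or 400 would surface as urllib.error.HTTPError.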

RPC API impact

None.

Driver API impact

None.

Nova driver impact

None.

Security impact

None.

Other end user impact

Support for this will be added in python-ironicclient. The CLI will look like:

usage: ironic node-set-maintenance [--reason <reason>]
                                   <node id> <maintenance mode>

Set maintenance mode on or off.

Positional arguments:
  <node id>           UUID of node
  <maintenance mode>  Supported states: 'on' or 'off'

Optional arguments:
  --reason <reason>   The reason for setting maintenance mode to "on"; not
                      valid when setting to "off".
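The argument handling for this command could look roughly like the argparse sketch below. It is not the python-ironicclient implementation: the build_parser and parse_maintenance_args names are hypothetical, and raising ValueError for an invalid combination is a simplification made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage text above."""
    parser = argparse.ArgumentParser(
        prog="ironic node-set-maintenance",
        description="Set maintenance mode on or off.",
    )
    parser.add_argument("node_id", metavar="<node id>", help="UUID of node")
    parser.add_argument(
        "mode",
        metavar="<maintenance mode>",
        choices=["on", "off"],
        help="Supported states: 'on' or 'off'",
    )
    parser.add_argument(
        "--reason",
        metavar="<reason>",
        default=None,
        help='The reason for setting maintenance mode to "on"; '
             'not valid when setting to "off".',
    )
    return parser


def parse_maintenance_args(argv):
    """Parse argv and reject --reason when the mode is 'off'."""
    args = build_parser().parse_args(argv)
    if args.mode == "off" and args.reason is not None:
        # A real CLI would more likely report this as a usage error.
        raise ValueError("--reason is not valid when setting mode to 'off'")
    return args


args = parse_maintenance_args(["1234", "on", "--reason", "bad disk"])
```

Rejecting --reason together with "off" on the client side keeps the CLI consistent with the DELETE method, which takes no body.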

Scalability impact

None.

Performance Impact

None.

Other deployer impact

Deployers may wish to start using this feature once it is deployed; otherwise there should be no impact.

Developer impact

None.

Implementation

Assignee(s)

Primary assignee:

jroll

Other contributors:

lucasagomes

Work Items

  • Add maintenance_reason to the nodes table with a migration.

  • Set maintenance_reason when automatically setting maintenance mode.

  • Add the new API endpoints.

  • Clear maintenance_reason when using node.update to set maintenance mode off.

  • Add client support for the new API endpoints.

  • Add Tempest tests for the new API endpoints.

Dependencies

None.

Testing

Tempest tests should be added for the new API endpoints.

Upgrades and Backwards Compatibility

This change will be backwards compatible with existing clients, as they may still use the node.update call to set maintenance on or off. Updating via the node.update call will not be deprecated in v1, since there isn’t any reasonable programmatic way to inform users of its deprecation. It will be deprecated in v2.

To avoid having an outdated maintenance reason, using the node.update call to set maintenance mode off will clear the maintenance reason.

Documentation Impact

The new API endpoints and client methods should be documented.

References

None.