Support server power state update through external event

https://blueprints.launchpad.net/nova/+spec/nova-support-instance-power-update

This spec aims to give operators more flexibility over the _sync_power_states periodic task in nova (which aligns server states between the database and the hypervisor), with respect to use cases for baremetal instances (ironic). It proposes to make the “source of truth” of this periodic power sync configurable, for example to allow the physical instance to be the source of truth so that nova updates its database rather than enforcing the database state onto the physical instance.

Problem description

As part of the periodic power sync between nova and ironic, when a physical instance goes down, for example during a power outage or when a hardware team with direct physical access to the machine carries out repairs, nova puts the instance into the SHUTDOWN state in its database, since the hypervisor is regarded as the source of truth. However, when the physical instance comes up again through non-nova-API methods such as IPMI access or the power button, nova will put it into the SHUTDOWN state again, since the database is regarded as the source of truth here (asynchronous). This can cause operational inconvenience and inconsistency between cloud operators and repair teams. Currently, the only way to avoid this is to disable the power synchronisation completely, which is not recommended.

Note that ironic allows a node to be put into maintenance mode, which excludes that node from nova’s _sync_power_states periodic task. This covers predictable events such as scheduled repairs but does not help with unforeseen events such as power failures.

Use Cases

As an operator, I would like my physical instance’s power state to remain RUNNING, and not be put into SHUTDOWN by nova, once it comes back up via IPMI access or direct physical access after a system repair or a power outage.

Proposed change

To make nova hear the physical instance come up (or go down) and regard it as the source of truth, the idea is to add a power-update event name to the os-server-external-events nova API. Ironic will send this event whenever the power state of the physical instance changes, i.e. when the physical instance comes up (or goes down) on the ironic side and ironic trusts the hardware instead of the database as the source of truth. Nova will listen for the power-update event from ironic using the existing external-events API endpoint, as discussed in the nova-ironic cross-project session at the Denver 2018 PTG.

On the nova side, once such an event for a physical instance is received from ironic, it will be routed to the virt driver. In the virt driver we will add a new driver.power_update_event method, which will be left unimplemented (NotImplemented) for all driver types except ironic. If we receive a power-update event for an instance backed by a non-ironic driver, we will log an error. In the ironic driver, this method will update the vm_state and power_state fields of that instance in the nova database to ACTIVE and RUNNING respectively (or to STOPPED and SHUTDOWN). Note that before the call is routed to the driver, nova will handle the notifications and instance actions for the power update in the same way as for the normal start/stop operations.
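The routing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not nova's implementation: the Instance class, the string state constants, the TARGET_STATES mapping and the handle_power_update helper are all hypothetical stand-ins, and only the power_update_event method name comes from this spec.

```python
import logging

LOG = logging.getLogger(__name__)

# Illustrative stand-ins only; nova's real constants and objects differ.
ACTIVE, STOPPED = "active", "stopped"        # vm_state values
RUNNING, SHUTDOWN = "RUNNING", "SHUTDOWN"    # power_state values

# Maps the event tag to the (vm_state, power_state) pair described above.
TARGET_STATES = {
    "POWER_ON": (ACTIVE, RUNNING),
    "POWER_OFF": (STOPPED, SHUTDOWN),
}


class Instance:
    """Hypothetical minimal instance record."""

    def __init__(self, uuid, vm_state, power_state):
        self.uuid = uuid
        self.vm_state = vm_state
        self.power_state = power_state

    def save(self):
        # A real implementation would persist to the nova database.
        pass


class ComputeDriver:
    """Base driver: power-update is unsupported by default."""

    def power_update_event(self, instance, target_power_state):
        raise NotImplementedError()


class IronicDriver(ComputeDriver):
    """Only the ironic driver applies the reported power state."""

    def power_update_event(self, instance, target_power_state):
        vm_state, power_state = TARGET_STATES[target_power_state]
        instance.vm_state = vm_state
        instance.power_state = power_state
        instance.save()


def handle_power_update(driver, instance, target_power_state):
    """Route the event to the driver and report whether it was applied."""
    try:
        driver.power_update_event(instance, target_power_state)
    except NotImplementedError:
        LOG.error("power-update is not supported for instance %s",
                  instance.uuid)
        return False
    return True
```

With these stand-ins, a POWER_ON event handled by IronicDriver moves a stopped instance to ACTIVE/RUNNING, while the same event handled by the base driver leaves the record untouched and logs an error.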

Even with this proposed change, depending on the order in which events occur, we could still have race conditions where the periodic task is already running and overrides the power-update event. However, this window is quite small. To prevent the periodic task and the power-update event from stepping on each other, a lock can be shared between them.

Alternatives

There have been failed attempts at fixing this problem in the past, such as allowing admins to decide what action to take when the states conflict, or allowing admins to reboot instances when the states conflict.

Data model impact

A new event name, power-update, will be added to the objects.InstanceExternalEvent.name enum.

REST API impact

The proposed JSON request body for the new “power-update” event is:

{
    "events": [
        {
            "name": "power-update",
            "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
            "tag": "POWER_ON"
        }
    ]
}

Definition of fields:

name

Name of the event. (“power-update” for this feature).

server_uuid

Server UUID of the physical instance whose power_state needs to be updated in the database.

tag

The target power state. Its value will be either “POWER_ON” (which maps to “RUNNING” in nova) or “POWER_OFF” (which maps to “SHUTDOWN” in nova).
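As a rough illustration of how a client such as ironic might assemble this request body, the sketch below builds and validates the payload. The build_power_update_request helper and the VALID_TARGET_POWER_STATES tuple are hypothetical names introduced here; only the field names and the two tag values come from this spec. The resulting body would be sent to the existing os-server-external-events endpoint.

```python
import json

# The two tag values defined by this spec.
VALID_TARGET_POWER_STATES = ("POWER_ON", "POWER_OFF")


def build_power_update_request(server_uuid, target_power_state):
    """Build the request body for a single power-update event."""
    if target_power_state not in VALID_TARGET_POWER_STATES:
        raise ValueError(
            "target_power_state must be one of %s"
            % (VALID_TARGET_POWER_STATES,))
    return {
        "events": [
            {
                "name": "power-update",
                "server_uuid": server_uuid,
                "tag": target_power_state,
            }
        ]
    }


body = build_power_update_request(
    "3df201cf-2451-44f2-8d25-a4ca826fc1f3", "POWER_ON")
print(json.dumps(body, indent=4))
```

Rejecting unknown tag values on the client side is a design choice of this sketch; the spec itself only defines which values are meaningful.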

The proposed JSON response body for the new “power-update” event is:

{
    "events": [
        {
            "code": 200,
            "name": "power-update",
            "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
            "status": "completed",
            "tag": "POWER_ON"
        }
    ]
}

Definition of fields:

name

Name of the event. (“power-update” for this feature).

status
Event status. Possible values:
  • “completed” if accepted by Nova

  • “failed” if a failure is encountered

code
Event result code. Possible values:
  • 200 means accepted

  • 400 means the request is missing a required parameter

  • 404 means the server could not be found

  • 422 means the event cannot be processed because the instance was found not to be associated with a host.

server_uuid

Same value as provided in original request.

tag

Same value as provided in original request.
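A caller could inspect the per-event status and code fields of the response along these lines. This is a sketch only: the failed_events helper is a hypothetical name, and it assumes the response has already been decoded from JSON into a dict shaped like the example above.

```python
def failed_events(response_body):
    """Return (server_uuid, code) for every event that was not accepted."""
    failures = []
    for event in response_body.get("events", []):
        accepted = (event.get("status") == "completed"
                    and event.get("code") == 200)
        if not accepted:
            failures.append((event.get("server_uuid"), event.get("code")))
    return failures


response = {
    "events": [
        {
            "code": 200,
            "name": "power-update",
            "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
            "status": "completed",
            "tag": "POWER_ON",
        }
    ]
}
print(failed_events(response))
```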

This powering up/down of instances on the nova side will be visible to users (by default, admins and owners of the server) through the GET /servers/{server_id}/os-instance-actions and GET /servers/{server_id}/os-instance-actions/{request_id} API calls.

Security impact

None.

Notifications impact

None.

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:

<tssurya>

Other contributors:

<wiebalck>

Work Items

  1. Add the new external-event type.

  2. Make the necessary changes in the compute API and manager to update the power and vm states of the instance on receiving an event from ironic.

  3. Add the new microversion and config option.

Dependencies

  • The client-side changes needed for ironic to send the events when the physical instance comes up or goes down.

Testing

Add unit and functional tests to verify that the new power-update event works.

Documentation Impact

Update the compute API reference documentation with the new power-update event.

References

History

Revisions

Release Name    Description
Train           Introduced