Support server power state update through external event¶
https://blueprints.launchpad.net/nova/+spec/nova-support-instance-power-update
This spec aims to give operators more flexibility over the _sync_power_states
periodic task in nova (which aligns the server states between the database and
the hypervisor) for baremetal (ironic) instances. It proposes to make the
“source of truth” of this periodic power sync configurable, so that, for
example, the physical instance can be the source of truth and nova updates its
database rather than enforcing the database state onto the physical instance.
Problem description¶
As a part of this periodic power sync between nova and ironic, when a physical
instance goes down, for example during a power outage or when the hardware
team with direct physical access to the machine does system repairs, nova puts
the instance into the SHUTDOWN state in its database, since the hypervisor is
regarded as the source of truth. However, when the physical instance comes up
again through non-nova-API methods such as IPMI access or the power button,
nova puts it into the SHUTDOWN state again, since in this direction the
database is regarded as the source of truth (the two cases are handled
asymmetrically). This can cause operational inconvenience and inconsistency
between cloud operators and repair teams. Currently the only way to avoid this
is to disable the power synchronisation completely, which is not recommended.
Note that ironic allows a node to be put into maintenance mode, which excludes
that node from nova’s _sync_power_states periodic task. This covers
predictable events like scheduled repairs but does not help with unforeseen
events such as power failures.
Use Cases¶
As an operator, I would like my physical instance’s power state to be RUNNING,
and not be put into SHUTDOWN by nova, once it comes back up via IPMI access or
direct physical access after a system repair or a power outage.
Proposed change¶
To make nova hear the physical instance come up (or go down) and regard it as
the source of truth, the idea is to add a power-update event name to the
os-server-external-events nova API. Ironic will send this event whenever the
power state of the physical instance changes, i.e. when the physical instance
comes up (or goes down) on the ironic side and ironic trusts the hardware
instead of the database as the source of truth. Nova will listen for the
power-update event from ironic using the existing external-events API
endpoint, as discussed in the nova-ironic cross-project session at the
Denver 2018 PTG.
On the nova side, once such an event for a physical instance is received from
ironic, it will be routed to the virt driver. In the virt driver we will add a
new driver.power_update_event method, which will be in a NotImplemented state
for all driver types except ironic. So if we receive a power-update for an
instance backed by a non-ironic driver, we will log an error. In the ironic
driver this method will update the vm_state and power_state fields of that
instance to ACTIVE and RUNNING (or STOPPED and SHUTDOWN) in the nova database.
Note that before routing the call to the driver, the notifications and instance
actions for the power update will be handled by nova in the same way as for
the normal start/stop operations.
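The routing described above can be sketched roughly as follows. This is a minimal illustration and not nova's actual code: the Instance, ComputeDriver, IronicDriver and OtherDriver classes are simplified stand-ins, and route_power_update and STATE_MAP are hypothetical names. Only the power_update_event method name and the state values come from this spec.

```python
# Illustrative stand-ins only; these are not nova's real classes.

# Maps the target power state carried by the event to the
# (vm_state, power_state) pair written to the database.
STATE_MAP = {
    "POWER_ON": ("ACTIVE", "RUNNING"),
    "POWER_OFF": ("STOPPED", "SHUTDOWN"),
}


class Instance:
    """Minimal stand-in for an instance record in the nova database."""

    def __init__(self, uuid, vm_state, power_state):
        self.uuid = uuid
        self.vm_state = vm_state
        self.power_state = power_state


class ComputeDriver:
    """Base driver: the new method is not implemented by default."""

    def power_update_event(self, instance, target_power_state):
        raise NotImplementedError()


class IronicDriver(ComputeDriver):
    """Only the ironic driver acts on the power-update event."""

    def power_update_event(self, instance, target_power_state):
        if target_power_state not in STATE_MAP:
            raise ValueError("unknown target power state: %s"
                             % target_power_state)
        instance.vm_state, instance.power_state = STATE_MAP[target_power_state]


class OtherDriver(ComputeDriver):
    """Any non-ironic driver: inherits the unimplemented method."""


def route_power_update(driver, instance, target_power_state, errors):
    """Route the event to the driver; record an error if unsupported."""
    try:
        driver.power_update_event(instance, target_power_state)
    except NotImplementedError:
        errors.append("power-update is not supported for instance %s"
                      % instance.uuid)
        return False
    return True
```

With this sketch, a POWER_ON event for an ironic-backed instance moves it to ACTIVE and RUNNING, while the same event for an instance on any other driver leaves the record untouched and appends one error message.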
Even with this proposed change, depending on the order in which events occur,
we could still have race conditions where the periodic task is already running
and overrides the power-update event. However, this window is quite small. To
prevent the periodic task and the power-update event from stepping on each
other, a lock can be shared between them.
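One way to picture the shared lock is a per-instance lock that both code paths take before touching the instance record. The sketch below uses only Python's standard threading module and a plain dict standing in for the database. The names lock_for, handle_power_update and sync_power_state are hypothetical, and the spec does not prescribe this particular locking mechanism.

```python
import threading

# Guards the lock registry itself so that two threads asking for the
# same UUID always receive the same lock object.
_registry_guard = threading.Lock()
_instance_locks = {}


def lock_for(instance_uuid):
    """Return the single lock associated with this instance UUID."""
    with _registry_guard:
        return _instance_locks.setdefault(instance_uuid, threading.Lock())


def handle_power_update(db, instance_uuid, power_state):
    """Event path: record the power state reported by ironic."""
    with lock_for(instance_uuid):
        db[instance_uuid] = power_state


def sync_power_state(db, instance_uuid, read_hypervisor_state):
    """Periodic path: compare database and hypervisor under the same lock.

    Both states are read while holding the lock, so a concurrent
    power-update event cannot interleave with the comparison.
    Returns True when the two states agree.
    """
    with lock_for(instance_uuid):
        return db[instance_uuid] == read_hypervisor_state(instance_uuid)
```

Because both functions go through lock_for, the periodic task never compares against a database value that an in-flight event is about to change.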
Alternatives¶
There have been failed attempts at fixing this problem in the past, such as allowing admins to decide what action to take when the states conflict, or allowing admins to reboot instances when the states conflict.
Data model impact¶
A new event name called power-update will be added to the
objects.InstanceExternalEvent.name enum.
REST API impact¶
The proposed JSON request body for the new “power-update” event is:
{
    "events": [
        {
            "name": "power-update",
            "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
            "tag": target_power_state
        }
    ]
}
Definition of fields:
- name
Name of the event. (“power-update” for this feature).
- server_uuid
Server UUID of the physical instance whose power_state needs to be updated in the database.
- tag
The target_power_state value will be either “POWER_ON” (which maps to “RUNNING” in nova) or “POWER_OFF” (which maps to “SHUTDOWN” in nova).
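As an illustration of the request format above, the following sketch builds the body for a single power-update event and rejects any tag other than the two documented values. The helper name build_power_update_body and the TARGET_TO_NOVA_STATE mapping are hypothetical and exist only for this example.

```python
import json

# Allowed tag values and the nova power state each one maps to,
# as listed in the field definitions above.
TARGET_TO_NOVA_STATE = {
    "POWER_ON": "RUNNING",
    "POWER_OFF": "SHUTDOWN",
}


def build_power_update_body(server_uuid, target_power_state):
    """Return the request body for one power-update event."""
    if target_power_state not in TARGET_TO_NOVA_STATE:
        raise ValueError("tag must be POWER_ON or POWER_OFF, got %r"
                         % (target_power_state,))
    return {
        "events": [
            {
                "name": "power-update",
                "server_uuid": server_uuid,
                "tag": target_power_state,
            }
        ]
    }


if __name__ == "__main__":
    body = build_power_update_body(
        "3df201cf-2451-44f2-8d25-a4ca826fc1f3", "POWER_ON")
    print(json.dumps(body, indent=4))
```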
The proposed JSON response body for the new “power-update” event is:
{
    "events": [
        {
            "code": 200,
            "name": "power-update",
            "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
            "status": "completed",
            "tag": target_power_state
        }
    ]
}
Definition of fields:
- name
Name of the event. (“power-update” for this feature).
- status
Event status. Possible values:
“completed” if accepted by Nova
“failed” if a failure is encountered
- code
Event result code. Possible values:
200 means accepted
400 means the request is missing a required parameter
404 means the server could not be found
422 means the event cannot be processed because the instance was found to not be associated with a host.
- server_uuid
Same value as provided in original request.
- tag
Same value as provided in original request.
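A caller could interpret the per-event result codes listed above along the following lines. This is a sketch only: CODE_MEANINGS and failed_events are hypothetical names, and the response is assumed to be already decoded from JSON into a Python dict.

```python
# Meanings of the per-event result codes, as listed above.
CODE_MEANINGS = {
    200: "accepted",
    400: "request is missing a required parameter",
    404: "server could not be found",
    422: "instance is not associated with a host",
}


def failed_events(response_body):
    """Return (server_uuid, code, meaning) for every non-200 event."""
    failures = []
    for event in response_body["events"]:
        code = event["code"]
        if code != 200:
            failures.append((event["server_uuid"], code,
                             CODE_MEANINGS.get(code, "unknown code")))
    return failures
```

An empty list means every event in the response was accepted.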
This powering up/down of instances on the nova side will be made visible to
users (by default admins and owners of the server) through the
GET /servers/{server_id}/os-instance-actions and
GET /servers/{server_id}/os-instance-actions/{request_id} API calls.
Security impact¶
None.
Notifications impact¶
None.
Other end user impact¶
None
Performance Impact¶
None
Other deployer impact¶
None
Developer impact¶
None
Upgrade impact¶
None
Implementation¶
Assignee(s)¶
- Primary assignee:
<tssurya>
- Other contributors:
<wiebalck>
Work Items¶
Add the new external-event type.
Make the necessary changes in the compute API and manager for the update of the power and vm states of the instance on receiving an event from ironic.
Add the new microversion and config option.
Dependencies¶
The client side changes needed for the events to be sent by ironic when the physical instance comes up or goes down.
Testing¶
Unit and functional tests to verify that the new power-update event works.
Documentation Impact¶
Update the compute API reference documentation with the new power-update event.
References¶
History¶
Release Name | Description
---|---
Train | Introduced