This blueprint suggests reworking the Ironic provisioning state machine to fix some current shortcomings and to make it easier for drivers and external orchestration agents to manage nodes in Ironic.
NOTE: This blueprint describes the functionality we intend the new state machine to have. Actual implementation of this spec, including detailed upgrade paths and technical arcana, will be handled by other specs.
The current Ironic state machine has a few shortcomings:
Current state machine:
[Diagram: current state machine. States shown: NOSTATE//NONE, DEPLOYWAIT//DEPLOYDONE, DEPLOYING//DEPLOYDONE, DEPLOYFAIL//NONE, ACTIVE//NONE, DELETING//DELETED, ERROR//NONE. Requests shown: R:active, R:rebuild, R:deleted.]
Legend for the current state machine:
Ironic’s API presents two fields for the provision_state of a node: current and target. Thus, in this diagram, all states are represented as “CURRENT-slash-slash-TARGET” state.
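The CURRENT//TARGET notation can be illustrated with a small formatter. This is only a sketch: the function name is hypothetical, and treating a missing value as NOSTATE or NONE is an assumption that mirrors the labels used in the diagram.

```python
from typing import Optional


def state_label(current: Optional[str], target: Optional[str]) -> str:
    """Render a node's provision state pair in CURRENT//TARGET form."""
    # Assumption: an unset current state is shown as NOSTATE and an unset
    # target as NONE, matching the labels used in the diagram.
    return f"{current or 'NOSTATE'}//{target or 'NONE'}"


print(state_label("DEPLOYING", "DEPLOYDONE"))
print(state_label("ACTIVE", None))
print(state_label(None, None))
```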
Descriptions of the states for the current state machine can be found here <https://github.com/openstack/ironic/blob/stable/icehouse/ironic/common/states.py>.
New state machine:
[Diagram: new state machine. Stable states shown: ENROLL, MANAGEABLE, AVAILABLE, ACTIVE, RESCUE. Transitional states shown: VERIFY*/MANAGEABLE, CLEAN*/MANAGEABLE, INSPECT*/MANAGEABLE, CLEAN*/AVAILABLE, DEPLOY*/ACTIVE, DELET*/AVAILABLE, RESCU*/RESCUE, UNRESCU*/ACTIVE. Requests shown: R:manage, R:clean, R:inspect, R:provide, R:active, R:rebuild, R:delete, R:rescue, R:unrescue.]
Legend for the new state machine:
STATE* indicates an active state, a momentary state, and a fail state. The active state has an -ING suffix, the momentary state has an -ED suffix, and the fail state has a -FAIL suffix. In the active state, Ironic is doing something to the node.
TARGET indicates the target state that Ironic will try to transition the node to from the active state. TARGET must be a stable state.
Descriptions of the new states:
Once Ironic has verified that it can manage the node using the driver and credentials passed in at node create time, the node will be transitioned to MANAGEABLE and (optionally) powered off. From MANAGEABLE, nodes can transition to:
Nodes in the CLEANING state are being scrubbed in preparation for being made AVAILABLE. Good candidates for CLEANING tasks include:
No matter what tasks are performed during CLEANING, the apparent configuration of the system must not change. For instance, if you tear down a set of RAID volumes to securely erase each physical disk separately, you must rebuild the RAID volumes you tore down.
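The RAID example above can be sketched as a context manager that remembers the volumes, tears them down for the duration of the task, and always rebuilds them afterwards. Everything here is hypothetical (the `node` dictionary, its keys, and the helper name); a real driver would talk to the RAID controller rather than mutate a dictionary.

```python
from contextlib import contextmanager


@contextmanager
def raid_torn_down(node):
    """Tear down a node's RAID volumes for a clean task, then rebuild them."""
    saved = list(node["raid_volumes"])  # remember the apparent configuration
    node["raid_volumes"] = []           # expose the physical disks individually
    try:
        yield
    finally:
        node["raid_volumes"] = saved    # rebuild exactly what was torn down


# Hypothetical node record, for illustration only.
node = {"raid_volumes": [("raid1", ("sda", "sdb"))], "erased": []}
before = list(node["raid_volumes"])

with raid_torn_down(node):
    for disk in ("sda", "sdb"):
        node["erased"].append(disk)     # stand-in for a secure erase of each disk

# After cleaning, the apparent configuration matches what it was before.
assert node["raid_volumes"] == before
```

Putting the rebuild in a `finally` clause means the volumes are restored even if the erase step raises.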
When a node is in the CLEANING state, the conductor is either executing a clean step out-of-band or preparing the environment (building PXE configuration files, configuring DHCP, etc.) to boot the ramdisk.
Like nodes in CLEANING, nodes in CLEANWAIT are being prepared to become AVAILABLE. The difference is that in CLEANWAIT the conductor is waiting for the ramdisk to boot or for an in-band clean step to finish.
The cleaning process of a node in CLEANWAIT can be interrupted via the abort API call.
Nodes in the AVAILABLE state are cleaned, preconfigured, and ready to be provisioned. From AVAILABLE, nodes can transition to:
Nodes in DEPLOYING are being actively prepared to run a workload on them. This should mainly consist of running a series of short-lived tasks, such as:
Tasks for DEPLOYING should be handled in a manner similar to how they are handled for CLEANING (details to be addressed in a different spec).
Like nodes in DEPLOYING, nodes in DEPLOYWAIT are being deployed. The difference is that in DEPLOYWAIT the conductor is waiting for the ramdisk to boot or to execute parts of the deployment that need to run in-band on the node (for example, installing the bootloader or writing the image to the disk when iSCSI is not used).
The deployment of a node in the DEPLOYWAIT provision state can be interrupted via the deleted API call.
Nodes in ACTIVE have a workload running on them. Ironic may collect out-of-band sensor information (including power state) on a regular basis, but will otherwise leave them alone. Nodes in ACTIVE can transition to:
RESCUE exists to allow Ironic to be aware of a node that would be otherwise running a workload, but that is booted into a different operating environment for maintenance or troubleshooting reasons. From RESCUE, nodes can transition to:
No reasonable ones that we could think of at the summit.
Under the current state machine, NOSTATE is represented by a NULL in the database. This will require a database migration to change all NULLs to “AVAILABLE” along with special-case API handling during the migration. The additional states should not require changes to the data model.
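The data change described above amounts to a single UPDATE. The sketch below runs it against an in-memory SQLite database; the `nodes` table layout and the sample rows are assumptions made for illustration, and a real migration would go through the project's own migration tooling rather than raw SQL.

```python
import sqlite3

# Assumed, simplified schema: one row per node, NULL meaning the old NOSTATE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (uuid TEXT PRIMARY KEY, provision_state TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("node-1", None), ("node-2", "ACTIVE"), ("node-3", None)],
)


def migrate_nostate(conn):
    """Rewrite NULL provision states to AVAILABLE; return the rows changed."""
    cur = conn.execute(
        "UPDATE nodes SET provision_state = 'AVAILABLE' "
        "WHERE provision_state IS NULL"
    )
    conn.commit()
    return cur.rowcount


changed = migrate_nostate(conn)
rows = conn.execute(
    "SELECT uuid, provision_state FROM nodes ORDER BY uuid"
).fetchall()
print(changed, rows)
```

Nodes that already have a non-NULL state, such as `node-2` here, are left untouched.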
We will provide the following verbs to manage the node lifecycle in the state machine:
| Verb     | Initial State | Intermediate States                        | End State  |
|----------|---------------|--------------------------------------------|------------|
| manage   | ENROLL        | VERIFYING -> VERIFIED                      | MANAGEABLE |
| clean    | MANAGEABLE    | CLEANING -> CLEANED                        | MANAGEABLE |
| inspect  | MANAGEABLE    | INSPECTING -> INSPECTED                    | MANAGEABLE |
| provide  | MANAGEABLE    | CLEANING -> CLEANED                        | AVAILABLE  |
| active   | AVAILABLE     | DEPLOYING -> DEPLOYED                      | ACTIVE     |
| rebuild  | ACTIVE        | DEPLOYING -> DEPLOYED                      | ACTIVE     |
| rescue   | ACTIVE        | RESCUING -> RESCUED                        | RESCUE     |
| unrescue | RESCUE        | UNRESCUING -> UNRESCUED                    | ACTIVE     |
| deleted  | ACTIVE        | DELETING -> DELETED -> CLEANING -> CLEANED | AVAILABLE  |
| deleted  | RESCUE        | DELETING -> DELETED -> CLEANING -> CLEANED | AVAILABLE  |
| deleted  | DEPLOYWAIT    | DELETING -> DELETED -> CLEANING -> CLEANED | AVAILABLE  |
The API will remain backwards compatible with the active, rebuild, and deleted verbs.
Unless otherwise required for backwards compatibility, the verbs must be called when the node is in the Initial State, and Ironic will perform all actions and transitions needed to move through the Intermediate States to the End State.
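The verb table and the rule that a verb must be called from its Initial State can be sketched as a lookup keyed on (verb, state). This is an illustrative model, not the planned implementation: the names `TRANSITIONS`, `InvalidTransition`, and `apply_verb` are hypothetical, only the happy path is modeled (no -FAIL states), and the backwards-compatibility exceptions are ignored.

```python
from typing import Dict, List, Tuple

# (verb, initial state) -> (intermediate states, end state),
# transcribed from the verb table above.
_CLEAN = ["CLEANING", "CLEANED"]
_DEPLOY = ["DEPLOYING", "DEPLOYED"]
_DELETE = ["DELETING", "DELETED"] + _CLEAN

TRANSITIONS: Dict[Tuple[str, str], Tuple[List[str], str]] = {
    ("manage", "ENROLL"): (["VERIFYING", "VERIFIED"], "MANAGEABLE"),
    ("clean", "MANAGEABLE"): (_CLEAN, "MANAGEABLE"),
    ("inspect", "MANAGEABLE"): (["INSPECTING", "INSPECTED"], "MANAGEABLE"),
    ("provide", "MANAGEABLE"): (_CLEAN, "AVAILABLE"),
    ("active", "AVAILABLE"): (_DEPLOY, "ACTIVE"),
    ("rebuild", "ACTIVE"): (_DEPLOY, "ACTIVE"),
    ("rescue", "ACTIVE"): (["RESCUING", "RESCUED"], "RESCUE"),
    ("unrescue", "RESCUE"): (["UNRESCUING", "UNRESCUED"], "ACTIVE"),
    ("deleted", "ACTIVE"): (_DELETE, "AVAILABLE"),
    ("deleted", "RESCUE"): (_DELETE, "AVAILABLE"),
    ("deleted", "DEPLOYWAIT"): (_DELETE, "AVAILABLE"),
}


class InvalidTransition(Exception):
    """Raised when a verb is called from a state the table does not list."""


def apply_verb(state: str, verb: str) -> List[str]:
    """Return the states a node passes through, ending in the End State."""
    try:
        intermediate, end = TRANSITIONS[(verb, state)]
    except KeyError:
        raise InvalidTransition(
            f"verb {verb!r} is not valid from state {state!r}"
        ) from None
    return [*intermediate, end]


# Walk a freshly enrolled node through to ACTIVE.
state = "ENROLL"
for verb in ("manage", "provide", "active"):
    path = apply_verb(state, verb)
    print(f"{verb}: {state} -> {' -> '.join(path)}")
    state = path[-1]
```

Keying on the pair rather than on the verb alone keeps `deleted`, which is valid from three different states, from needing special handling.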
Since we are adding new states, older API clients may behave unexpectedly when they encounter a node in a state they do not understand.
This spec has no direct impact beyond what is mentioned in the REST API impact section, but the to-be-written specs that actually implement the new states will have significant RPC and REST API impact.
Yes. Large swaths of driver code will need a refactor to cooperate with the new per-node state machines.
NOSTATE has been renamed to AVAILABLE. This will require some glue code and creating an upgrade path.
Probably not, assuming perfect coding.
Probably nothing significant.
Nodes will not automatically transition from ENROLL to MANAGEABLE. Deployers must assign drivers and add credentials to the node and then call the manage API before Ironic can manage the node.
Nodes will not automatically transition from MANAGEABLE to AVAILABLE; deployers will need to do that via the API before nodes can be scheduled.
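The two manual steps a deployer takes, manage and then provide, might look like the requests below. This spec does not define the request format, so the sketch assumes the new verbs are submitted the same way as existing provision targets: a PUT to the node's `states/provision` resource with a JSON body naming the target. The helper name and node identifier are hypothetical, and the code only builds request descriptions without sending anything.

```python
import json


def provision_request(node_id: str, verb: str) -> dict:
    """Describe the HTTP request that asks Ironic to apply `verb` to a node."""
    # Assumption: new verbs reuse the existing provision-state endpoint.
    return {
        "method": "PUT",
        "path": f"/v1/nodes/{node_id}/states/provision",
        "body": json.dumps({"target": verb}),
    }


# Hypothetical node identifier, for illustration only.
requests_to_send = [provision_request("node-1", v) for v in ("manage", "provide")]
for req in requests_to_send:
    print(req["method"], req["path"], req["body"])
```

The second request only makes sense once the node has reached MANAGEABLE, so a caller would poll the node's provision state between the two.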
Current and new Ironic drivers will need rework to comply with the new state machine.
Specs need to be written to hash out the implementation details that the new state machine implies.
Almost every blueprint that touches on the Ironic drivers will be affected, but this blueprint is vendor-agnostic.
None for this spec, but the implementation specs will need to address testing impacts of the changes they recommend.
None for this spec, but the implementation specs will need to address upgrade and backwards compatibility.
This spec should be used as initial documentation for the new state machine.