When a node has finished a workload, driver interfaces should have the opportunity to run a set of tasks immediately after tear down and before the node is available for scheduling again.
The cleaning features will be added behind a config option, to ensure operators have the choice to disable the feature if it is unnecessary for their deployment. For example, some operators may want to disable cleaning on every request and only clean occasionally via ZAPPING.
Add a decorator @clean_step(priority) to decorate steps that should be run as part of CLEANING. priority determines the order in which steps run: the function with the highest priority runs first, followed by the one with the second highest priority, and so on. If priority is set to 0, the step will not be executed. The argument should be a config option, e.g. priority=CONF.$interface.$stepname_priority, to give the operator more control over the order in which steps run (if at all).
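As a rough illustration of the decorator described above, the following is a minimal sketch, not the final implementation. The attribute names _is_clean_step and _clean_step_priority, the FakeDeploy class, and its step names are hypothetical; in a real driver the priority would come from a config option rather than a literal.

```python
def clean_step(priority):
    """Mark a method as a clean step with the given priority.

    The marker attributes set here are hypothetical names chosen for
    this sketch; the spec does not define how steps are tagged.
    """
    def decorator(func):
        func._is_clean_step = True
        func._clean_step_priority = priority
        return func
    return decorator


class FakeDeploy:
    """Hypothetical interface used only to show the decorator in use."""

    # A real driver would pass something like
    # CONF.$interface.$stepname_priority instead of a literal.
    @clean_step(priority=10)
    def erase_disks(self, task):
        return "erased"

    # A priority of 0 marks the step as disabled.
    @clean_step(priority=0)
    def update_firmware(self, task):
        return "updated"
```

The decorator only records metadata; deciding which steps are enabled and in what order they run is left to whatever collects the decorated functions.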
Add a new function get_clean_steps() to the base Interface classes. The base implementation will get a list of functions decorated with @clean_step, determine which are enabled, and then return a list of dictionaries representing each step, sorted by priority.
The return value of get_clean_steps() will be a list of dicts with the 3 keys: step, priority and interface, described below:
- 'step': 'function_name',
- 'priority': 'an int or float, used for sorting, described below',
- 'interface': 'interface_name'
Only steps with a priority greater than 0 (enabled steps) will be returned.
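One possible shape for the base get_clean_steps() is sketched below, assuming steps are tagged with hypothetical marker attributes by the decorator. BaseInterface, DeployInterface, interface_type, and the step names are illustrative assumptions, not names defined by this spec.

```python
import inspect


def clean_step(priority):
    """Tag a method as a clean step (marker attributes are hypothetical)."""
    def decorator(func):
        func._is_clean_step = True
        func._clean_step_priority = priority
        return func
    return decorator


class BaseInterface:
    interface_type = "base"  # hypothetical attribute naming the interface

    def get_clean_steps(self, task=None):
        """Return enabled clean steps as dicts, highest priority first."""
        steps = []
        for name, method in inspect.getmembers(self, inspect.ismethod):
            if not getattr(method, "_is_clean_step", False):
                continue
            priority = method._clean_step_priority
            # Only steps with a priority greater than 0 are enabled.
            if priority > 0:
                steps.append({"step": name,
                              "priority": priority,
                              "interface": self.interface_type})
        return sorted(steps, key=lambda s: s["priority"], reverse=True)


class DeployInterface(BaseInterface):
    interface_type = "deploy"

    @clean_step(priority=10)
    def erase_disks(self, task):
        pass

    @clean_step(priority=20)
    def reset_bios(self, task):
        pass

    @clean_step(priority=0)  # disabled, so it is not returned
    def update_firmware(self, task):
        pass
```

Calling DeployInterface().get_clean_steps() in this sketch yields the reset_bios entry first and the erase_disks entry second, and omits update_firmware.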
Add a new function execute_clean_step(clean_step) to the base Interfaces, which takes one of the dictionaries returned by get_clean_steps() as an argument and executes the specified step.
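A minimal sketch of how execute_clean_step() might dispatch to the named step follows. It assumes the method also receives the task, and that a step dict is rejected when it belongs to a different interface; ManagementInterface, interface_type, reset_bios, and the dict-based task are hypothetical stand-ins.

```python
class ManagementInterface:
    """Hypothetical interface used to illustrate step dispatch."""

    interface_type = "management"  # hypothetical attribute

    def reset_bios(self, task):
        return "bios reset on %s" % task["node"]

    def execute_clean_step(self, task, step):
        """Look up the method named in the step dict and run it."""
        if step["interface"] != self.interface_type:
            # Assumption: a step owned by another interface is an error.
            raise ValueError("step %(step)s belongs to %(interface)s" % step)
        method = getattr(self, step["step"])
        return method(task)


# Usage: the task is a plain dict here purely for illustration.
result = ManagementInterface().execute_clean_step(
    {"node": "node-1"},
    {"step": "reset_bios", "priority": 20, "interface": "management"},
)
```

A fuller implementation would probably also verify that the named method is actually a decorated clean step before calling it.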
Create a new function in the conductor: clean(task) to run all enabled clean steps. It will get a list of all enabled steps and execute them by priority. The conductor will track the current step in a new field on the node called clean_step.
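The conductor loop could look roughly like the sketch below. Node, Task, and FakeInterface are simplified stand-ins rather than the real objects, and clearing clean_step and setting CLEANED at the end are assumptions about how completion is recorded.

```python
class Node:
    def __init__(self):
        self.provision_state = "CLEANING"
        self.clean_step = None  # the new field tracking the current step


class Task:
    def __init__(self, node, interfaces):
        self.node = node
        self.interfaces = interfaces  # maps interface name to object


class FakeInterface:
    """Hypothetical interface that records the steps it executes."""

    def __init__(self, name, steps, log):
        self.name, self.steps, self.log = name, steps, log

    def get_clean_steps(self, task):
        return [{"step": s, "priority": p, "interface": self.name}
                for s, p in self.steps if p > 0]

    def execute_clean_step(self, task, step):
        self.log.append(step["step"])


def clean(task):
    """Run every enabled clean step, highest priority first."""
    steps = []
    for interface in task.interfaces.values():
        steps.extend(interface.get_clean_steps(task))
    steps.sort(key=lambda s: s["priority"], reverse=True)
    for step in steps:
        task.node.clean_step = step  # record progress on the node
        task.interfaces[step["interface"]].execute_clean_step(task, step)
    task.node.clean_step = None  # assumption: cleared once cleaning is done
    task.node.provision_state = "CLEANED"
```

Recording the step on the node before executing it means an interrupted clean leaves behind a record of which step was in flight.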
In the event of a tie for priority, the tie breaker will be the interface implementing the function, in the order power, management, deploy interfaces. So if the power and deploy interface both implement a step with priority 10, power’s step will be executed first, then the deploy interface’s step.
If there is a tie for priority within a single interface (for example, an operator inadvertently sets two steps to the same priority), the conductor will fail to load that interface while starting up and will log errors about the overlapping priorities.
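The two tie rules above can be sketched as a sort key plus a start-up check. INTERFACE_ORDER, order_steps, and check_unique_priorities are hypothetical names, and raising ValueError stands in for whatever failure the conductor would actually report.

```python
from collections import Counter

# Tie-break order between interfaces: power, then management, then deploy.
INTERFACE_ORDER = {"power": 0, "management": 1, "deploy": 2}


def order_steps(steps):
    """Sort by descending priority, breaking ties by interface order."""
    return sorted(
        steps,
        key=lambda s: (-s["priority"], INTERFACE_ORDER[s["interface"]]),
    )


def check_unique_priorities(interface_name, steps):
    """Reject an interface whose enabled steps share a priority."""
    counts = Counter(s["priority"] for s in steps)
    duplicates = sorted(p for p, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            "interface %s has overlapping clean step priorities: %s"
            % (interface_name, duplicates)
        )
```

With a power step and a deploy step both at priority 10, order_steps() places the power step first, matching the example in the text.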
Cleaning will use the CLEANING, CLEANED, and CLEANFAIL states that will be added in the new state machine spec. These states occur between DELETED and AVAILABLE. Because cleaning happens after DELETED, Nova delete commands will not take hours waiting for cleaning to finish.
CLEANED will be used much like DELETED: generally as a target provision state. A node will be in CLEANED state after CLEANING completes and until the conductor gets a chance to move it to AVAILABLE.
Nodes may be put into CLEANING via an API call (described below) only from MANAGED or CLEANFAIL states. MANAGED allows an operator to clean a node before it is available for scheduling. This ensures new nodes are at the same baseline as other, already added nodes.
The ZAPPING API will allow nodes to go through a single clean step or a list of clean_steps from the MANAGED state. These will be operator-driven steps requested via the API, as opposed to the automated CLEANING that occurs after tear_down, which is described in this spec.
Make the Nova Virt Driver look for CLEANING, CLEANED, and CLEANFAIL states in _wait_for_provision_state() so the node can be removed from a user's list of active nodes more quickly. Failures to clean should not be errors for the user and need to be resolved by an operator.
If a clean fails, the node will be put into CLEANFAIL state, have last_error set appropriately, and be put into maintenance. The node will not be powered off, as a power cycle could damage a node. The operator can then fix the node, and put the node’s target_provision_state back to CLEANED via the API to retry cleaning or skip to AVAILABLE.
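The failure path might be handled along the lines of this sketch. Node is a simplified stand-in, and handle_clean_failure, run_step, and the error message format are hypothetical.

```python
class Node:
    """Simplified stand-in for a node record."""

    def __init__(self):
        self.provision_state = "CLEANING"
        self.last_error = None
        self.maintenance = False
        self.power_state = "power on"


def handle_clean_failure(node, step, exc):
    """Record a failed clean step without touching the power state."""
    node.provision_state = "CLEANFAIL"
    node.last_error = "Clean step %s failed: %s" % (step["step"], exc)
    node.maintenance = True
    # The node is deliberately not powered off here, since a power
    # cycle could damage it.


def run_step(node, step, func):
    """Run one step, returning False and marking the node on failure."""
    try:
        func()
    except Exception as exc:
        handle_clean_failure(node, step, exc)
        return False
    return True
```

Leaving the power state alone keeps the node in whatever condition the failed step left it, so the operator can inspect it before retrying.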
CLEANING will not be performed on rebuilds.
Cleaning of a node will need to be available via RPC, so the API servers can put a node into CLEANING from MANAGED or CLEANFAIL states.
At the end of a tear down, the conductor will RPC call() the do_node_clean() method of the conductor.
As the states will first be added as no-ops in the new state machine spec, upgrading won’t be a problem.
The BaseDriver will have get_clean_steps() and execute_clean_step() functions added and implemented.
get_clean_steps(): Return the clean steps this interface can perform on a node.
- param task: a task from TaskManager.
- returns: a list of dictionaries as noted above.
execute_clean_step(): Execute the given clean step on the task.node.
- param task: a task from TaskManager.
- param step: a step from get_clean_steps().
- raises: an exception if the step fails.
Testing will be similar to other driver interfaces and each interface will be expected to test their implementation thoroughly.
Existing interfaces can choose not to implement the new API with no effect, as default implementations will be added in the base classes.