Host maintenance strategy¶
https://blueprints.launchpad.net/watcher/+spec/cluster-maintaining
Problem description¶
Sometimes we need to maintain compute nodes, update hardware or software, and so on, without interrupting user’s applications.
Use Cases¶
As an openstack operator, sometimes I want to maintain one compute node without interrupting user’s applications.
Proposed change¶
There will be a new goal and strategy for cluster-maintenance.
Add one new goal - “Cluster Maintenance”
Add one new strategy for this goal - “Host Maintenance”
The new strategy executes as follows
First, get the compute node which needs maintenance. This input parameter is provided by the administrator. Call change_nova_service_state action to set the maintaining node in “maintaining” state (disabled with disable_reason ‘watcher_maintaining’).
Then, call migrate action to migrate all instances on the maintaining node to other nodes. Migrate active instances use “live-migrate” and others use “cold-migrate”. Calculate free cpus/memory/disk of a node to determine whether one instance or all instances from the maintaining node can migrate to. This strategy just consider how to migrate all instances of the maintaining node, further optimization rely on other strategies. There are two methods to migrate the instances of the maintaining node: Method No.1, migrate all instances on the maintaining node intensively to one unused host.The ‘unused’ host means disable but not power-off node for Watcher. If there are more than one “unused” hosts, choose one from them by random. (This method won’t result in more VMs migration among other hosts.) Method No.2, just migrate all instances on the maintaining node dispersedly to other nodes. Method No.1 is priority. Only if Method No.1 fails, Method No.2 will execute. If both methods fail, this audit fails and raise exception with no solution produced.
After the maintenance finished, the administrator needs to activate the maintaining node by cli ‘nova service-enable’ to change the node’s state from “maintaining” to “enabled” manually, which will make the compute node rejoin into compute resource.
Alternatives¶
None
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None
Performance Impact¶
None
Other deployer impact¶
None
Developer impact¶
None
Implementation¶
Assignee(s)¶
Primary assignee:sue
Work Items¶
Add strategy and goal for cluster_maintenance
Update change_nova_service_state action, to make it available to maintain one compute node.
Dependencies¶
https://blueprints.launchpad.net/watcher/+spec/extend-node-status
Testing¶
Unit tests
Documentation Impact¶
A documentation explaining how to use this new optimization strategy.
References¶
None
History¶
None