Include the URL of your launchpad blueprint:
https://blueprints.launchpad.net/fuel/+spec/graceful-stop-restart-deployment
As an operator, I want to be able to stop the deployment process and restart it, so that I can change an erroneous configuration or fix whatever environmental or infrastructure issues arise and then start the deployment again.
Examples are:
- Some nodes failed during OS provisioning because of an intermittent bug or for an unknown reason
- Some nodes went offline during the deployment because of an intermittent connectivity issue
- The operator discovers that cluster settings, networks, plugins, enabled services, etc. need to be adjusted or corrected
For all such cases, the following UX must be available:
- The user encounters a case where the cloud deployment needs to be stopped and additional measures taken to ensure its further success
- The user presses the “Stop deployment” button in the UI
- The user applies the changes required to prevent the failure: fixes the servers, changes deployment configuration parameters, etc.
- The user presses the “Deploy Changes” button
- Fuel proceeds with the deployment, taking into account the stage that the cluster has already reached (for example, OS provisioned), and re-runs all tasks on the corresponding nodes
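The restart step above can be illustrated with a minimal sketch. It assumes that a restart skips provisioning for nodes whose OS is already provisioned and re-runs the deployment tasks on every node. The function name `plan_restart` and the node fields `name` and `provisioned` are hypothetical and are not part of Nailgun.

```python
def plan_restart(nodes):
    """Build a per-node restart plan (hypothetical helper, not Nailgun API).

    Assumption: a node that is already provisioned skips provisioning,
    while deployment tasks are re-run on every node.
    """
    plan = {}
    for node in nodes:
        steps = [] if node["provisioned"] else ["provision"]
        steps.append("deploy")
        plan[node["name"]] = steps
    return plan


plan = plan_restart([
    {"name": "node-1", "provisioned": True},
    {"name": "node-2", "provisioned": False},
])
```

Here `node-1` would only be redeployed, while `node-2` would be provisioned first and then deployed.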
Currently, Fuel's “Stop Deployment” implementation is buggy: it actually resets the cluster, which breaks real life-cycle management scenarios. For example, stopping the deployment while compute nodes are being added destroys the cluster completely. With task-based deployment and the deployment tasks history feature in place, a graceful stop should be relatively easy to implement.
A new node status, ‘stopped’, and a composite cluster status, ‘partially_deployed’, are going to be introduced. A graceful cluster stop sends a signal to the orchestrator telling it to stop further deployment graph traversal and report the corresponding statuses.
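One possible way to derive the composite cluster status from node statuses is sketched below. The precedence rules are an assumption made for illustration only; the spec does not define them. The function name `cluster_status` is hypothetical, and the status names other than ‘stopped’ and ‘partially_deployed’ are used purely as examples.

```python
def cluster_status(node_statuses):
    """Derive a composite cluster status (illustrative rules, not the spec's).

    Assumption: any 'stopped' node makes the cluster 'partially_deployed';
    otherwise an 'error' node makes it 'error'; all 'ready' nodes make it
    'operational'; anything else is treated as still deploying.
    """
    statuses = set(node_statuses)
    if not statuses:
        return "new"
    if "stopped" in statuses:
        return "partially_deployed"
    if "error" in statuses:
        return "error"
    if statuses == {"ready"}:
        return "operational"
    return "deployment"
```

For example, a cluster with one ‘ready’ node and one ‘stopped’ node would be reported as ‘partially_deployed’ under these assumed rules.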
Its place in the current cluster and node state machine is described here:
The ‘stopped’ status should be supported on the UI side.
A new node status, ‘stopped’, is going to be introduced. The Nailgun RPC receiver is also going to be altered to support the ‘stopped’ task status.
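A rough sketch of how a receiver might accept the new status follows. The message layout (`nodes`, `uid`, `status`) and the function name `handle_deploy_response` are assumptions made for illustration and do not reproduce the actual Nailgun RPC receiver.

```python
# Statuses the hypothetical receiver accepts; 'stopped' is the new one.
KNOWN_NODE_STATUSES = {"ready", "error", "stopped"}


def handle_deploy_response(message, nodes):
    """Apply node statuses from an orchestrator message to node records.

    `message` and `nodes` are plain dicts standing in for the RPC payload
    and the database records (assumed shapes, not the real Nailgun API).
    Returns the task status carried by the message, if any.
    """
    for report in message.get("nodes", []):
        status = report["status"]
        if status not in KNOWN_NODE_STATUSES:
            raise ValueError(f"unexpected node status: {status}")
        nodes[report["uid"]]["status"] = status
    return message.get("status")


nodes = {"1": {"status": "deploying"}, "2": {"status": "deploying"}}
task_status = handle_deploy_response(
    {"status": "stopped",
     "nodes": [{"uid": "1", "status": "ready"},
               {"uid": "2", "status": "stopped"}]},
    nodes,
)
```

After this call, node `2` carries the ‘stopped’ status and the task status returned to the caller is also ‘stopped’.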
None
None
The orchestrator will support the new ‘stopped’ status for nodes: it will wait for the particular deployment engine to finish its execution on all running nodes and then report the status back to Nailgun. Unlike the classic stop deployment, the orchestrator now stops processing new tasks but allows already running tasks to finish.
The RPC receivers in Nailgun and Astute should support the ‘stop deployment’ signal.
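The graceful stop behavior can be sketched as a task loop that checks a stop flag only between tasks. This is a simplified, sequential illustration in Python, not the Astute implementation; `run_tasks` and the task names are hypothetical.

```python
import threading


def run_tasks(tasks, stop_event):
    """Run (name, action) pairs in order with a graceful stop.

    The stop flag is checked only before a task starts, so a task that is
    already running always finishes; no new task starts after the signal.
    """
    finished = []
    for name, action in tasks:
        if stop_event.is_set():
            return "stopped", finished
        action()
        finished.append(name)
    return "ready", finished


stop = threading.Event()
tasks = [
    ("provision", lambda: None),
    ("netconfig", stop.set),   # the stop signal arrives while this task runs
    ("deploy", lambda: None),  # never started
]
status, finished = run_tasks(tasks, stop)
```

In this run the `netconfig` task completes even though the stop signal arrives during it, and `deploy` is never started, so the loop reports ‘stopped’ together with the two finished tasks.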
None
None
None
None
Supported only by 9.0 clusters.
None
None
Ability to stop the cluster without ruining it
None
None
The same as for users: the ability to stop a deployment, change something, and start it again, which increases development velocity.
None
The “Stop Deployment” action documentation should be updated.
Related to deployment tasks history feature [0]
The new stop/restart behavior needs to be covered by test cases according to the acceptance criteria.
Deployment of the cluster should simply wait for the particular deployment task executors to exit and then report back to Nailgun. The user should be able to restart successfully by running regular cluster actions, which should not fail because of any artifacts introduced by the deployment stop action.