TripleO - Ansible upgrade Workflow with UI integration

Launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/major-upgrade-workflow

During the Pike cycle the minor update and some parts of the major upgrade are significantly different from any previous cycle, in that they are not delivered onto nodes via a Heat stack update. Rather, the Heat stack update is used only to collect, but not execute, the relevant ansible tasks defined in each of the service manifests as upgrade_tasks or update_tasks accordingly. These tasks are then written as stand-alone ansible playbooks in the stack outputs.

These ‘config’ playbooks are then downloaded using the openstack overcloud config download utility and finally executed to deliver the actual upgrade or update. See bugs 1715557 and 1708115 for more information (or pointers/reviews) about this mechanism as used during the P cycle.
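
The "collect, but do not execute" step described above can be pictured with a small sketch. It is not the Heat implementation: the manifest layout, the function names and the `hosts` value are assumptions made for illustration. It only shows tasks from several service manifests being gathered under one key into a single stand-alone play, which is then serialised (as JSON here, which YAML parsers generally accept).

```python
import json


def build_playbook(service_manifests, task_key, hosts):
    """Gather the tasks stored under task_key in every manifest into one play.

    service_manifests is a hypothetical mapping of service name to a dict
    that may contain 'upgrade_tasks' and/or 'update_tasks' lists.
    Nothing is executed here; the tasks are only collected.
    """
    tasks = []
    for service in sorted(service_manifests):
        tasks.extend(service_manifests[service].get(task_key, []))
    # A playbook is a list of plays; this sketch emits a single play.
    return [{"hosts": hosts, "tasks": tasks}]


def render_playbook(playbook):
    """Serialise the playbook so it can be stored as a stack output."""
    return json.dumps(playbook, indent=2)


# Illustrative input only; real service manifests carry far more detail.
example_manifests = {
    "keystone": {"upgrade_tasks": [{"name": "Stop keystone"}]},
    "nova": {
        "upgrade_tasks": [{"name": "Stop nova"}],
        "update_tasks": [{"name": "Update nova packages"}],
    },
}
upgrade_playbook = build_playbook(example_manifests, "upgrade_tasks", "Controller")
```

Keeping collection separate from execution is what allows the generated playbook to be inspected, or re-run, independently of the stack update.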

For Queens, and as discussed at the Denver PTG, we aim to extend this approach to include the controlplane upgrade too. That is, instead of using Heat SoftwareConfig and Deployments to invoke ansible, we should collect the upgrade_tasks for the controlplane nodes into ansible playbooks that can then be invoked to deliver the actual upgrade.

Problem Description

Whilst it has continually improved in each cycle, the complexity of the upgrade, and the difficulty of debugging or understanding what has been executed at any given point, is still one of the biggest complaints from operators about the TripleO upgrades workflow. In the P cycle, as discussed above, the minor version update and some parts of the ‘non-controller’ upgrade have already moved to the model being proposed here, i.e. generate ansible playbooks with an initial heat stack update and then execute them.

If we are to use this approach for all parts of the upgrade, including the controlplane services, then we will also need a mistral workbook that can handle the download and execution of the ansible-playbook invocations. With this kind of ansible driven workflow, executed by a mistral action/workbook, we can for the first time consider integration with the UI for upgrades/updates. This aligns well with the effort by the UI team for feature parity in CLI/UI for Queens. It should also be noted that there is already some work under way to add the required mistral actions, at least for the minor update for Pike deployments, in changes 487488 and 487496.

Implementing a fully ansible-playbook delivered workflow for the entire major upgrade workflow will offer a number of benefits:

  • very short initial heat stack update to generate the playbooks
  • easier to follow and understand what is happening at a given step of the upgrade
  • easier to debug and re-run any particular step of the upgrade
  • implies full python-tripleoclient and mistral workbook support for the ansible-playbook invocations
  • can consider integrating upgrades/updates into the UI, for the first time

Proposed Change

We will need an initial heat stack update to populate the upgrade_tasks_playbook into the overcloud stack output (the CLI for this step is just a suggestion).

The first step of the upgrade will be used to deliver any required common upgrade initialisation, such as switching repos to the target version, installing any new packages required during the upgrade, and populating the upgrades playbooks.

Then the operator will run the upgrade targeting specific nodes:

  • openstack overcloud upgrade --nodes [overcloud-novacompute-0, overcloud-novacompute-1] or openstack overcloud upgrade --nodes "Compute"

This will download and execute the ansible playbooks on the specified set of nodes. Ideally we will make it possible to specify a role name, with the playbooks being invoked in a rolling fashion on each node of that role.
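
As a rough illustration of the rolling behaviour, the sketch below expands either a role name or an explicit node list into one ansible-playbook invocation per node, and stops at the first node that fails. The inventory mapping, the file names and the function names are hypothetical; only the `-i` and `--limit` options of ansible-playbook are relied upon. It is not the proposed mistral workflow, just one way to picture it.

```python
import subprocess


def resolve_nodes(target, inventory):
    """Expand a role name into its nodes; pass an explicit node list through."""
    if isinstance(target, str):
        return list(inventory[target])
    return list(target)


def rolling_commands(target, inventory, playbook, inventory_file):
    """Build one ansible-playbook invocation per node, in order."""
    return [
        ["ansible-playbook", "-i", inventory_file, playbook, "--limit", node]
        for node in resolve_nodes(target, inventory)
    ]


def run_rolling(commands, runner=subprocess.call):
    """Run the commands one at a time, stopping at the first failure.

    Returns the command that failed, or None if every node succeeded.
    """
    for command in commands:
        if runner(command) != 0:
            return command
    return None


# Hypothetical role-to-node mapping and file names, for illustration only.
example_inventory = {
    "Compute": ["overcloud-novacompute-0", "overcloud-novacompute-1"],
}
example_commands = rolling_commands(
    "Compute", example_inventory, "upgrade_tasks_playbook.yaml", "inventory.yaml"
)
```

Stopping at the first failing node leaves the remaining nodes untouched, so the operator can fix the problem and re-run against only the nodes that still need upgrading.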

One of the required changes is to convert all the service templates to have ‘when’ conditionals instead of the current ‘stepN’. For Pike we did this in the client to avoid breaking the heat driven upgrade workflow still in use for the controlplane during the Ocata to Pike upgrade. This will allow us to use the ‘ansible-native’ loop control we are currently using in the generated ansible playbooks.
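
The conversion can be sketched as follows. This assumes that each task currently marks its step with a tag of the form `stepN` and that the generated playbooks loop over an integer variable named `step`; both are assumptions for illustration, and the helper name is hypothetical. The function is a stand-in for the idea, not the client code used for Pike.

```python
import re

# Matches the assumed step marker, e.g. "step0", "step1", ...
STEP_TAG = re.compile(r"^step(\d+)$")


def step_tag_to_when(task):
    """Return a copy of task with any 'stepN' tag turned into a 'when' test."""
    converted = dict(task)
    tags = converted.pop("tags", [])
    if isinstance(tags, str):
        tags = [tags]

    remaining = []
    conditions = []
    for tag in tags:
        match = STEP_TAG.match(tag)
        if match:
            conditions.append("step|int == %d" % int(match.group(1)))
        else:
            remaining.append(tag)

    # Tags that are not step markers are preserved unchanged.
    if remaining:
        converted["tags"] = remaining

    if conditions:
        existing = converted.get("when")
        if existing is None:
            existing = []
        elif not isinstance(existing, list):
            existing = [existing]
        # Ansible treats a list of 'when' conditions as a logical AND.
        converted["when"] = conditions + existing
    return converted


example_task = {"name": "Stop keystone", "tags": "step1"}
converted_task = step_tag_to_when(example_task)
```

With the step expressed as a `when` test, a single task list can be included once per step value and each task runs only in its own step.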

Other End User Impact

There will be significant changes to the workflow and CLI the operator uses for the major upgrade, as documented above.

Performance Impact

The initial Heat stack update will not deliver any of the puppet or docker config to nodes, since the DeploymentSteps will be disabled as we currently do for the Pike minor update. This will mean a much shorter heat stack update - exact numbers TBD but ‘minutes not hours’.

Developer Impact

Should make it easier for developers to debug particular parts of the upgrades workflow.

Implementation

Assignee(s)

Contributors:

  • Marios Andreou (marios)
  • Mathieu Bultel (matbu)
  • Sofer Athlang Guyot (chem)
  • Steve Hardy (shardy)
  • Carlos Ccamacho (ccamacho)
  • Jose Luis Franco Arza (jfrancoa)
  • Marius Cornea (mcornea)
  • Yurii Prokulevych (yprokule)
  • Lukas Bezdicka (social)
  • Raviv Bar-Tal (rbartal)
  • Amit Ugol (amitu)

Work Items

  • Remove steps and add ‘when’ conditionals for all the ansible upgrade tasks, minor update tasks, deployment steps and post_upgrade_tasks.
  • Need mistral workflows that can invoke the required stages of the workflow (--init and --nodes). There is some existing work in this direction in 463765.
  • CLI/python-tripleoclient changes required. Related to the previous item, there is some work started on this in 463728.
  • UI work - we will need to collaborate with the UI team for the integration. We have never had UI-driven upgrades or updates.
  • CI: Implement a simple job (one node, just a controller, which does the heat-setup-output and runs ansible --nodes Controller) with a keystone-only upgrade. Then iterate on this as we add service upgrade_tasks.
  • Docs!

Testing

We will aim to land a ‘keystone-only’ job asap which will be updated as the various changes required to deliver this spec are closer to merging. For example we may deploy only a very small subset of services (e.g. first keystone) and then iterate as changes related to this spec are proposed.

Documentation Impact

We should also track changes in the documented upgrades workflow since, as described here, it is going to change significantly, both internally and in the interface exposed to an operator.

References

Check the source for links