Fast-forward upgrades

https://blueprints.launchpad.net/tripleo/+spec/fast-forward-upgrades

Fast-forward upgrades move an environment from release N to N+X in a single step, where X is greater than 1 (typically 3). This spec outlines how such upgrades can be orchestrated by TripleO between the Newton and Queens releases.

Problem Description

OpenStack upgrades are often seen by operators as problematic [1] [2]. Whilst TripleO upgrades have improved greatly over recent cycles, many operators are still reluctant to upgrade with each new release.

This often leads to a situation where environments remain on the release used when first deployed. Eventually this release will come to the end of its supported life (EOL), forcing operators to upgrade to the next supported release. There can also be restrictions imposed on an environment that simply do not allow upgrades to be performed before a given release reaches EOL, again forcing operators to wait until that point.

While it is possible to then linearly upgrade to a supported release with the cadence of upstream releases, downstream distributions providing long-term support (LTS) releases may not be able to provide the same path once the initially installed release reaches EOL. Operators in such a situation may also want to avoid running multiple lengthy linear upgrades to reach their desired release.

Proposed Change

Overview

TripleO support for fast-forward upgrades will first target N to N+3 upgrades between the Newton and Queens releases:

Newton    Ocata     Pike       Queens
+-----+   +-----+   +-----+    +-----+
|     |   | N+1 |   | N+2 |    |     |
|  N  | ---------------------> | N+3 |
|     |   |     |   |     |    |     |
+-----+   +-----+   +-----+    +-----+

This will give the impression of the Ocata and Pike releases being skipped, with the fast-forward upgrade moving the environment directly from Newton to Queens. In reality, as OpenStack projects with the supports-upgrade tag are only required to support N to N+1 upgrades [3], the upgrade will still need to move through each release, completing database migrations and a limited set of other tasks.

Caveats

Before outlining the suggested changes to TripleO it is worth highlighting the following caveats for fast-forward upgrades:

  • The control plane is inaccessible for the duration of the upgrade.
  • The data plane and active workloads must remain available for the duration of the upgrade.

Prerequisites

Prior to the start of the overcloud fast-forward upgrade, the following prerequisite tasks must be completed:

  • Rolling minor update of the overcloud on N

This is a normal TripleO overcloud update [4] and should bring each node in the environment up to the latest supported version of the underlying OS, pulling in the latest packages. Operators can then reboot the nodes as required, ensuring that the latest kernel, openvswitch, QEMU and any other reboot-dependent packages are loaded before proceeding with the upgrade. This can happen well in advance of the overcloud fast-forward upgrade and should remove the need for additional reboots during the upgrade.
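
A minimal sketch of this prerequisite, assuming a stack named overcloud and the Newton era minor update command (the exact invocation and reboot sequence will vary per environment and are illustrative only):

# Rolling minor update of the overcloud packages on Newton
openstack overcloud update stack overcloud -i --templates

# Reboot each node in turn as required, for example after migrating or
# verifying any workloads hosted on that node
sudo reboot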

  • Upgrade undercloud from N to N+3

The undercloud also needs to be upgraded to N+3 ahead of any overcloud upgrade. Again, this can happen well in advance of the overcloud upgrade. For the time being this will be a series of traditional, linear N to N+1 upgrades until the target N+3 Queens release is reached.
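
A hedged sketch of this linear undercloud path, assuming a yum based undercloud; the repository switch is deliberately left abstract as it is distribution specific:

# Repeat for each release in turn: Ocata, then Pike, then Queens
# 1. Switch the undercloud package repositories to the next release (distribution specific)
# 2. Update the client and upgrade the undercloud
sudo yum -y update python-tripleoclient
openstack undercloud upgrade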

  • Container images uploaded prior to the start of the upgrade

With the introduction of containerised TripleO overclouds in Pike, operators will need to upload the required container images prior to the fast-forward upgrade if they wish to end up with a containerised Queens overcloud.
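
For example, with a Queens tripleoclient the images could be prepared and pushed to the undercloud registry along the following lines (file names are illustrative):

# Generate the container image parameters and the list of images to upload
openstack overcloud container image prepare \
    --output-env-file ~/container-parameters.yaml \
    --output-images-file ~/overcloud_containers.yaml

# Push the images to the undercloud registry
openstack overcloud container image upload \
    --config-file ~/overcloud_containers.yaml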

High level flow

At a high level the following actions will be carried out by the fast-forward upgrade to move the overcloud from N to N+3:

  • Stop all control and compute services across all roles

This should bring down the control plane but allow any workloads to continue running without interruption.

  • Upgrade a single host from N to N+1 then N+1 to N+2

As alluded to earlier, OpenStack projects currently only support N to N+1 upgrades, so fast-forward upgrades still need to cycle through each release in order to complete data migrations and any other tasks required before those migrations can run. This part of the upgrade is limited to a single host to ensure it completes as quickly as possible.

By default a single node of the Controller role will be selected for this part of the upgrade; however, this should be fully configurable by the operator driving the upgrade.

  • Upgrade and deployment of all roles to N+3

The final action in the fast-forward upgrade will be a traditional N to N+1 upgrade between N+2 and N+3, followed by the deployment of all roles on N+3. This final deployment ensures the overcloud is containerised.

Implementation

As with updates [5] and upgrades [6], specific fast-forward upgrade Ansible tasks associated with the first two actions above will be introduced into the tripleo-heat-templates service templates for each service, as role_data outputs under fast_forward_upgrade_tasks.

As with upgrade_tasks these tasks will be broken into various steps:

  1. Stop all control-plane services
  2. Quiesce the control-plane
  3. Perform a package update and install new packages
  4. Add and potentially start services needed for migration tasks
  5. Perform any migration tasks, e.g. DB sync commands
  6. Optionally validate and cleanup after migration tasks complete

As with update_tasks each task will use a simple when conditional to identify which step and release(s) it is associated with, ensuring these tasks are executed at the correct point in the upgrade.

For example, a step 2 fast_forward_upgrade_tasks entry for Ocata is listed below:

fast_forward_upgrade_tasks:
  - name: Example Ocata step 2 task
    command: /bin/foo bar
    when:
      - step == 2
      - release == 'ocata'
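
Within a service template such tasks would sit under the role_data output alongside the service's existing outputs; a minimal sketch, with the service name and surrounding keys purely illustrative:

outputs:
  role_data:
    description: Role data for the example service
    value:
      service_name: example
      fast_forward_upgrade_tasks:
        - name: Example Ocata step 2 task
          command: /bin/foo bar
          when:
            - step == 2
            - release == 'ocata'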

These tasks will then be collated into role-specific Ansible playbooks via the RoleConfig output of the overcloud heat template, with step and release variables being fed in to ensure the tasks are executed in the correct order. These playbooks should also include the required upgrade_tasks and deploy_tasks, as highlighted above, to ensure baremetal overclouds are migrated to containers.
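
The exact shape of the generated playbooks is still to be settled, but a hedged sketch of how the release and step variables might drive the collated tasks for a role is shown below (file paths and structure are illustrative, not the final RoleConfig output):

- hosts: Controller
  gather_facts: false
  tasks:
    # Feed the release and step variables into the collated service tasks,
    # stepping through each intermediate release in order
    - include_tasks: Controller/fast_forward_upgrade_tasks.yaml
      vars:
        release: ocata
        step: 1
    - include_tasks: Controller/fast_forward_upgrade_tasks.yaml
      vars:
        release: ocata
        step: 2
    # ... and so on through step 6 for ocata, then steps 1 to 6 for pike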

As with major upgrades [8], a new mistral workflow and tripleoclient command will be introduced to generate and execute the associated Ansible tasks:

openstack overcloud fast-forward-upgrade --templates [..path to latest THT..] \
                           [..original environment arguments..] \
                           [..new container environment arguments..]

Operators will also be able to generate [7], download and review the playbooks ahead of time using the latest version of tripleo-heat-templates with the following commands:

openstack overcloud deploy --templates [..path to latest THT..] \
                           [..original environment arguments..] \
                           [..new container environment arguments..] \
                           --setup-heat-outputs
openstack overcloud config download --config-dir ~/overcloud-config
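
Once downloaded, the rendered playbooks could also be syntax checked or dry run directly with ansible-playbook against the TripleO dynamic inventory, for example (the playbook path within the config directory is illustrative):

ansible-playbook -i /usr/bin/tripleo-ansible-inventory \
    ~/overcloud-config/fast_forward_upgrades_playbook.yaml --check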

Dev workflow

A new tripleo-quickstart-extras libvirt/snapshot Ansible role will be introduced to snapshot the initial undercloud and overcloud domains when these are deployed using libvirt.
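
Such a role would most likely wrap the standard libvirt snapshot commands, for example (domain and snapshot names are illustrative):

# Snapshot the freshly deployed domains
virsh snapshot-create-as undercloud clean-deploy
virsh snapshot-create-as overcloud-controller-0 clean-deploy

# Revert to the snapshot after a failed fast-forward upgrade attempt
virsh snapshot-revert undercloud clean-deploy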

A new tripleo-fast-forward-upgrade Ansible role will be introduced to automate the fast-forward upgrade process, including the initial overcloud minor update, the undercloud upgrade to N+3 and the fast-forward upgrade itself. Again, when the environment is deployed using libvirt, snapshots of the undercloud and overcloud domains will be taken between each step, hopefully making development less painful.

Developers working on fast_forward_upgrade_tasks will be able to utilise the above roles to snapshot their environments ahead of testing new or modified tasks. Further, as these tasks can be rendered as Ansible playbooks, developers can also run a subset of them against specific nodes using custom playbooks during their development work, as sketched below. Examples of how to do this will be documented, hopefully ensuring a smooth development experience for anyone looking to contribute tasks for specific services.
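
For example, a developer could wrap the rendered tasks for a single role and step in a small custom playbook limited to a specific node; a sketch, with all names and paths illustrative:

- hosts: overcloud-controller-0
  gather_facts: false
  vars:
    release: ocata
    step: 2
  tasks:
    # Run only the Ocata step 2 tasks rendered for the Controller role
    # against this single node
    - include_tasks: Controller/fast_forward_upgrade_tasks.yaml

The playbook could then be run with ansible-playbook and the TripleO dynamic inventory while iterating on individual tasks.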

Alternatives

  • Continue to force operators to upgrade linearly through each major release
  • Parallel cloud migrations.

Security Impact

N/A

Other End User Impact

  • The control plane will be down for the duration of the upgrade
  • The data plane and workloads will remain up.

Performance Impact

N/A

Other Deployer Impact

N/A

Developer Impact

  • Third party service template providers will need to provide fast_forward_upgrade_tasks in their THT service templates.

Implementation

Assignee(s)

Primary assignees:

  • lyarwood
  • lbezdick

Other contributors:

  • shardy

Work Items

  • Introduce fast_forward_upgrades_playbook.yaml to RoleConfig
  • Introduce fast_forward_upgrade_tasks in each service template

Dependencies

N/A

Testing

  • Third party CI jobs will need to be created to test Newton to Queens using RDO given the upstream EOL of stable/newton with the release of Pike.
  • An additional third party CI job will be required to verify that a Queens undercloud can correctly manage a Newton overcloud, allowing the separation of the undercloud upgrade and fast-forward upgrade discussed under prerequisites.

Documentation Impact

  • This will require extensive developer and user documentation to be written, most likely in a new section of the docs specifically detailing the fast-forward upgrade flow.