Instance High Availability

Launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/instance-ha

Operators and customers very often ask for the ability to automatically resurrect VMs that were running on a compute node that failed (due to hardware failures, networking issues or general server problems). Currently we only have a downstream-only procedure, consisting of many manual steps, to configure Instance HA: https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/paged/high-availability-for-compute-instances/chapter-1-overview

What we would like to implement here is an opt-in, automated deployment of a cloud that has Instance HA support.

Problem Description

Currently, if a compute node has a hardware failure or a kernel panic, all the instances that were running on that node are lost, and manual intervention is needed to resurrect them on another compute node.

Proposed Change

Overview

The proposed change is to add a few puppet-tripleo profiles that configure the pacemaker resources needed for Instance HA. Unlike in previous iterations, we won’t need to move nova-compute resources under pacemaker’s management. We managed to achieve the same result without touching the compute nodes (except for setting up pacemaker_remote on the computes, but that support exists already).

Alternatives

There are a few specs that are modeling host recovery:

  • Host Recovery - https://review.openstack.org/#/c/386554/
  • Instances auto evacuation - https://review.openstack.org/#/c/257809

The first spec uses pacemaker in a very similar way, but it is too new and too high level to comment on at this point in time. The second one has been stalled for a long time, and it looks like there is no consensus yet on the approaches needed. We are actively working on both specs, and the long-term goal is to morph the Instance HA deployment into whichever spec gets accepted. In any case, we have discussed this with SUSE and NTT and agreed on a long-term plan, of which this spec is the first step for TripleO.

Security Impact

No additional security impact.

Other End User Impact

End users are not impacted, except that VMs can be resurrected automatically on a non-failed compute node.

Performance Impact

There is no performance impact compared to a current deployment.

Other Deployer Impact

This change does not affect default deployments. It adds a boolean and some additional profiles so that a deployer can have a cloud configured with Instance HA support out of the box.

  • One top-level parameter to enable the Instance HA deployment
  • Although fencing configuration is already supported by TripleO, we will need to improve bits and pieces so that we won’t need an extra command to generate the fencing parameters.
  • Upgrades will be impacted by this change in the sense that we will need to make sure to test them when Instance HA is enabled.
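
As a rough illustration of the opt-in behaviour described above, the sketch below builds the parameter section of a deployment environment. The parameter names (EnableInstanceHA, EnableFencing, FencingConfig) are assumptions made for this example, since this spec does not name them, and the rule that Instance HA requires fencing devices is likewise an assumption drawn from the fencing item above.

```python
# Hypothetical parameter names: this spec only says "one top-level parameter".
INSTANCE_HA_PARAM = "EnableInstanceHA"
FENCING_ENABLE_PARAM = "EnableFencing"
FENCING_CONFIG_PARAM = "FencingConfig"


def build_environment(enable_instance_ha=False, fencing_devices=None):
    """Return an environment dict in which Instance HA is strictly opt-in.

    With the default arguments nothing is added, so a default deployment
    stays unchanged.
    """
    parameter_defaults = {}
    if enable_instance_ha:
        # Assumption: Instance HA cannot work without fencing devices.
        if not fencing_devices:
            raise ValueError("Instance HA requires fencing to be configured")
        parameter_defaults[INSTANCE_HA_PARAM] = True
        parameter_defaults[FENCING_ENABLE_PARAM] = True
        parameter_defaults[FENCING_CONFIG_PARAM] = {
            "devices": list(fencing_devices)
        }
    return {"parameter_defaults": parameter_defaults}
```

Calling build_environment() with no arguments returns an empty parameter section, which mirrors the statement that default deployments are not affected.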

Developer Impact

No developer impact is planned.

Implementation

Assignee(s)

Primary assignee:
michele
Other contributors:
cmsj, abeekhof

Work Items

  • Make the fencing configuration fully automated (this is mostly done already; we still need tripleo-quickstart (oooq) integration and some optimization)
  • Add the logic and needed resources on the control-plane
  • Test the upgrade path when Instance HA is configured

Testing

Testing this manually is fairly simple:

  • Deploy with Instance HA configured and two compute nodes
  • Spawn a test VM
  • Crash the compute node where the VM is running
  • Observe the VM being resurrected on the other compute node

Testing this in CI is doable but might be a bit more challenging due to resource constraints.
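
The last manual step (observing the VM come back on the other compute node) could be automated along the following lines. This is only a sketch: get_host is a hypothetical callable supplied by the test, which would return the name of the compute node currently hosting the instance (for example by querying Nova), and the timeout values are arbitrary.

```python
import time


def wait_for_evacuation(get_host, failed_host, timeout=300.0, interval=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the instance reports a host other than failed_host.

    get_host is a caller-supplied (hypothetical) function returning the
    current host name of the instance, or None if it is not known yet.
    Returns the new host name, or raises TimeoutError once the deadline
    has passed.
    """
    deadline = clock() + timeout
    while True:
        host = get_host()
        if host and host != failed_host:
            return host
        if clock() >= deadline:
            raise TimeoutError(
                "instance was not moved off %s within %s seconds"
                % (failed_host, timeout)
            )
        sleep(interval)
```

The sleep and clock arguments are injectable so that the helper itself can be unit tested without waiting in real time.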

Documentation Impact

A section under advanced configuration is needed explaining the deployment of a cloud that supports Instance HA.