High Availability for Virtual Machines

Problem description

Enterprise customers are moving their application workloads onto OpenStack Cloud. However, not all applications can be re-architected into a Cloud-native model at once. Some applications are deployed on a VM in a Pet model. This requires high availability of such VM. Even though VM volumes can be stored on a shared storage system, such as NFS or Ceph, to improve the availability, VM state on each hypervisor is not easily replicated to other hypervisor. Therefore, the system must be able to recover or rescue the VM from a failure events preferably in an automated and cost effective manner.

User Stories

As a cloud operator, I would like to provide my users with highly reliable VM to meet high SLA requirement. Potentially there are few types of failure events that can occurs with OpenStack Cloud. We need to make sure such events can be detected and recovered by the system. Possible failure events include: * VM is down. * VM provisioning process (nova-compute service) is down. * Host/Hypervisor is down. * Network is down * Attached Cinder Volume failure * Availability Zone/Data Center/Region failure

Usage Scenarios Examples

  • VM is down

Monitor the VM. Detect VM down failure and notify system to recover the VM.

  • VM provisioning process is down

Monitor the provisioning process service (nova-compute service). Detect process failure and notify system to restart the service.

  • Host/Hypervisor is down

Monitor the hypervisor. Detect hypervisor failure and evaculate all VMs from failure host. Restart the VMs on new hosts that enable an application workload to resume a process if the VM state is stored in a volume even though it loses the state on memory. A shared storage can be used for instance volume as these volumes survive outside the hypervisors.

Opportunity/Justification

Many enterprise customers requires HA VM feature in order to satisfy their workload SLA. HA VM is a critical requirements for NTT customers.

Requirements

  • An ability to monitor VM failure.
  • An ability to monitor provisioning process failure.
  • An ability to monitor hypervisor failure.
  • An ability to restart VM due to VM failure.
  • An ability to restart provisioning process.
  • An ability to automatically evacuate VMs from a failure hosts and restart

the VMs on available host.

Gaps

To be determined.

Affected By

To be determined.

Glossary

To be determined.