Enterprise customers are moving their application workloads onto OpenStack Cloud. However, not all applications can be re-architected into a Cloud-native model at once. Some applications are deployed on a VM in a Pet model. This requires high availability of such VM. Even though VM volumes can be stored on a shared storage system, such as NFS or Ceph, to improve the availability, VM state on each hypervisor is not easily replicated to other hypervisor. Therefore, the system must be able to recover or rescue the VM from a failure events preferably in an automated and cost effective manner.
As a cloud operator, I would like to provide my users with highly reliable VM to meet high SLA requirement. Potentially there are few types of failure events that can occurs with OpenStack Cloud. We need to make sure such events can be detected and recovered by the system. Possible failure events include: * VM is down. * VM provisioning process (nova-compute service) is down. * Host/Hypervisor is down. * Network is down * Attached Cinder Volume failure * Availability Zone/Data Center/Region failure
Monitor the VM. Detect VM down failure and notify system to recover the VM.
Monitor the provisioning process service (nova-compute service). Detect process failure and notify system to restart the service.
Monitor the hypervisor. Detect hypervisor failure and evaculate all VMs from failure host. Restart the VMs on new hosts that enable an application workload to resume a process if the VM state is stored in a volume even though it loses the state on memory. A shared storage can be used for instance volume as these volumes survive outside the hypervisors.
Many enterprise customers requires HA VM feature in order to satisfy their workload SLA. HA VM is a critical requirements for NTT customers.
the VMs on available host.
To be determined.
To be determined.
https://github.com/ntt-sic/masakari https://etherpad.openstack.org/p/automatic-evacuation
To be determined.