High Availability for Virtual Machines
======================================

*Problem description*
---------------------
Enterprise customers are moving their application workloads onto OpenStack
Cloud. However, not all applications can be re-architected into a
Cloud-native model at once. Some applications are deployed on a VM in a Pet
model. This requires high availability of such VM. Even though VM volumes can
be stored on a shared storage system, such as NFS or Ceph, to improve the
availability, VM state on each hypervisor is not easily replicated to other
hypervisor. Therefore, the system must be able to recover or rescue the VM
from a failure events preferably in an automated and cost effective manner.

User Stories
------------
As a cloud operator, I would like to provide my users with highly reliable
VM to meet high SLA requirement. Potentially there are few types of failure
events that can occurs with OpenStack Cloud. We need to make sure such events
can be detected and recovered by the system. Possible failure events include:
* VM is down.
* VM provisioning process (nova-compute service) is down.
* Host/Hypervisor is down.
* Network is down
* Attached Cinder Volume failure
* Availability Zone/Data Center/Region failure


Usage Scenarios Examples
------------------------
* VM is down
Monitor the VM. Detect VM down failure and notify system to recover the VM.

* VM provisioning process is down
Monitor the provisioning process service (nova-compute service). Detect
process failure and notify system to restart the service.

* Host/Hypervisor is down
Monitor the hypervisor. Detect hypervisor failure and evaculate all VMs from
failure host. Restart the VMs on new hosts that enable an application
workload to resume a process if the VM state is stored in a volume even
though it loses the state on memory. A shared storage can be used for
instance volume as these volumes survive outside the hypervisors.

Opportunity/Justification
-------------------------
Many enterprise customers requires HA VM feature in order to satisfy their
workload SLA. HA VM is a critical requirements for NTT customers.

Related User Stories
--------------------
To be determined.


*Requirements*
--------------
* An ability to monitor VM failure.
* An ability to monitor provisioning process failure.
* An ability to monitor hypervisor failure.
* An ability to restart VM due to VM failure.
* An ability to restart provisioning process.
* An ability to automatically evacuate VMs from a failure hosts and restart
the VMs on available host.

*Gaps*
------
To be determined.


*Affected By*
-------------
To be determined.

*External References*
---------------------
https://github.com/ntt-sic/masakari
https://etherpad.openstack.org/p/automatic-evacuation

Glossary
--------
To be determined.