Uniform Airflow Migration Strategy¶
Airflow (Unit: CFM) is a cooling related telemetry which can be used to measure the cooling status of server.
This spec proposes a new Watcher migration strategy based on the airflow of servers. This strategy makes decisions to migrate VMs to make the airflow uniform.
Note: “server” in this document means “hypervisor”.
In current Data Center infrastructure, the cooling air supply to servers can be different , depends on the workload and inlet temperature. When a server is overloaded or the supply air is too hot, the airflow can reach the threshold. In this case, we need to move the VM instance to other servers.
As an administrator, I want to be able to trigger an audit that controls the airflow and perform VM instance load balancing.
In order to:
Reduce the total power consumption spent on cooling.
Increase the lifespan of the data center because cooling effectiveness is a first order factor.
Not relevant because Watcher is not in the big tent so far.
Watcher already has its decision framework, so this strategy should be a new class which extend the base strategy class.
Set the threshold through the template. see: https://blueprints.launchpad.net/watcher/+spec/optimization-threshold
Create a new Python class to extend the “BaseStrategy” class.
Use the Telemetry client to get Airflow, Inlet temperature, System Power metrics of hypervisors.
Use the Nova objects framework to get free CPU/Memory/Disk of hypervisors.
An algorithm to detect if the threshold of Airflow has been reached, by default it uses the average value in 5 minutes to compare with the threshold ,and can be configurable like threshold. Here it needs 3 thresholds: Airflow, Inlet temperature, and System power. When the threshold of Airflow has been reached, it will check:
Whether the threshold of both Inlet temperature and System power been reached, if so, it will choose the first VM to migrate.
Whether the inlet temperature and system power both lower than threshold if so, it means there might be someting wrong with the hardware, it will migrate all VMs of the hypervisor.
At the end, it will filter the viable targets according to the free resource information of hypervisors from previous step.
Data model impact¶
REST API impact¶
Other end user impact¶
There used to be some performance issues regarding the query of metrics from the Telemetry database. This is one of the reasons why it was rarely used in production environment. These issues may now be solved thanks to an abstraction layer which enables anybody to change the underlying metrics storage backend easily. There is also a performance issue when you query the Nova DB to get CPU usage metrics.
Other deployer impact¶
- Primary assignee:
function to use Telemetry client to get Airflow,Inlet temperature,System power of hypervisors.
function to filter servers by Nova basic metrics(free CPU/Memory/Disk).
Rewrite execute function to add the algorithm to detect the threshold and to choose the target hypervisors, generate action plans.
Unit tests and functional test, will use a fake metrics set for running functional test.
A documentation explaining how to use this new optimization strategy.