Virt driver pinning guest vCPUs to host pCPUs

https://blueprints.launchpad.net/nova/+spec/virt-driver-cpu-pinning

This feature aims to improve the libvirt driver so that it is able to strictly pin guest vCPUs to host pCPUs. This provides the concept of “dedicated CPU” guest instances.

Problem description

If a host permits overcommit of CPUs, there can be prolonged periods during which a guest vCPU is not scheduled by the host because another guest is competing for CPU time. This means that workloads executing in a guest can see unpredictable latency, which may be unacceptable for the type of application being run.

Depending on the workload being executed, the end user or admin may wish to have control over how the guest uses hyperthreads. To maximise cache efficiency, the guest may wish to be pinned to thread siblings. Conversely the guest may wish to avoid thread siblings (ie pin to only one sibling of each core) or even avoid hosts with threads entirely.

Proposed change

The flavor extra specs will be enhanced to support two new parameters (an illustrative example follows the list):

  • hw:cpu_policy=shared|dedicated

  • hw:cpu_threads_policy=avoid|separate|isolate|prefer
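
For illustration, the two keys would sit alongside any other extra specs defined on a flavor. The flavor name and chosen values below are purely illustrative:

    # Illustrative only: extra specs as they might be set on a hypothetical
    # "m1.dedicated" flavor.
    extra_specs = {
        "hw:cpu_policy": "dedicated",        # pin guest vCPUs to host pCPUs
        "hw:cpu_threads_policy": "isolate",  # keep thread siblings unused
    }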

If the policy is set to ‘shared’ no change will be made compared to the current default guest CPU placement policy: the guest vCPUs will be allowed to freely float across host pCPUs, albeit potentially constrained by NUMA policy. If the policy is set to ‘dedicated’ then the guest vCPUs will be strictly pinned to a set of host pCPUs. In the absence of an explicit vCPU topology request, the virt drivers typically expose all vCPUs as sockets with 1 core and 1 thread. When strict CPU pinning is in effect the guest CPU topology will be set up to match the topology of the pCPUs to which it is pinned. ie if a 2 vCPU guest is pinned to a single host core with 2 threads, then the guest will get a topology of 1 socket, 1 core, 2 threads.
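
To make the topology matching concrete, the following sketch derives a guest CPU topology from a pinning decision. The host_siblings layout and guest_topology helper are invented for illustration; the real libvirt driver code may differ:

    # Sketch: derive a guest CPU topology from the host pCPUs a guest is pinned to.
    # host_siblings maps each host core to the set of its thread sibling pCPUs.
    host_siblings = {0: {0, 4}, 1: {1, 5}, 2: {2, 6}, 3: {3, 7}}

    def guest_topology(pinned_pcpus, host_siblings):
        # Which host cores do the pinned pCPUs span?
        cores_used = [s for s in host_siblings.values() if s & pinned_pcpus]
        return {
            "sockets": 1,  # a single socket is assumed here for simplicity
            "cores": len(cores_used),
            "threads": max(len(s & pinned_pcpus) for s in cores_used),
        }

    # A 2 vCPU guest pinned to both threads of host core 0 gets the topology
    # from the example above: 1 socket, 1 core, 2 threads.
    print(guest_topology({0, 4}, host_siblings))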

The threads policy will control how the scheduler / virt driver place guests with respect to CPU threads. It will only apply if the CPU policy is ‘dedicated’. A sketch of the resulting pinning choices follows the list below.

  • avoid: the scheduler will not place the guest on a host which has hyperthreads.

  • separate: if the host has threads, each vCPU will be placed on a different core. ie no two vCPUs will be placed on thread siblings.

  • isolate: if the host has threads, each vCPU will be placed on a different core and no vCPUs from other guests will be able to be placed on the same core. ie one thread sibling is guaranteed to always be left unused.

  • prefer: if the host has threads, vCPUs will be placed on the same core, so they are thread siblings.
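
The sketch below illustrates how pinning candidates might be chosen under each policy, given a host's thread-sibling layout. The siblings structure and pick_pcpus helper are hypothetical; the eventual scheduler / virt driver logic may differ:

    # Sketch: candidate pCPUs for a guest under each threads policy.
    # siblings lists the thread sibling pCPUs of each host core.
    siblings = [[0, 4], [1, 5], [2, 6], [3, 7]]
    host_has_threads = any(len(s) > 1 for s in siblings)

    def pick_pcpus(policy, num_vcpus):
        if policy == "avoid" and host_has_threads:
            return None  # the host is rejected outright
        if policy in ("separate", "isolate"):
            # One vCPU per core; 'isolate' additionally requires that the
            # remaining sibling of each chosen core is never given to any guest.
            return [s[0] for s in siblings[:num_vcpus]]
        if policy == "prefer":
            # Pack vCPUs onto thread siblings of the same core where possible.
            return [cpu for s in siblings for cpu in s][:num_vcpus]
        return None

    print(pick_pcpus("separate", 2))  # [0, 1]: no two vCPUs share a core
    print(pick_pcpus("prefer", 2))    # [0, 4]: both vCPUs on siblings of core 0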

The image metadata properties will also allow specification of the threads policy

  • hw_cpu_threads_policy=avoid|separate|isolate|prefer

This will only be honoured if the flavor does not already have a threads policy set. This ensures the cloud administrator can have absolute control over threads policy if desired.
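
A minimal sketch of that precedence, assuming simple dict-style access to the flavor extra specs and image properties (the names below are illustrative, not the actual Nova objects):

    # Sketch: the flavor's threads policy, when set, always wins over the image's.
    def effective_threads_policy(flavor_extra_specs, image_properties):
        return (flavor_extra_specs.get("hw:cpu_threads_policy")
                or image_properties.get("hw_cpu_threads_policy"))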

The scheduler will have to be enhanced so that it considers the usage of CPUs by existing guests. Use of a dedicated CPU policy will have to be accompanied by the setup of host aggregates to split the hosts into two groups, one allowing overcommit of shared pCPUs and the other only allowing dedicated CPU guests. ie we do not want a situation where dedicated CPU and shared CPU guests run on the same host. It is likely that the administrator will already need to set up host aggregates for the purpose of using huge pages for guest RAM. The same grouping will be usable for both dedicated RAM (via huge pages) and dedicated CPUs (via pinning).

The compute host already has a notion of CPU sockets which are reserved for execution of base operating system services. This facility will be preserved unchanged. ie dedicated CPU guests will only be placed on CPUs which are not marked as reserved for the base OS.
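
As a rough illustration of the accounting described in the previous two paragraphs, the sketch below computes the pCPUs still available for a new dedicated CPU guest on a host, excluding both the CPUs reserved for the base OS and those already pinned by existing guests. All names are invented; the real host state and scheduler filter code may differ:

    # Sketch: which host pCPUs remain available for a new dedicated CPU guest?
    def free_dedicated_pcpus(online_pcpus, reserved_for_host_os, pinned_by_guests):
        return set(online_pcpus) - set(reserved_for_host_os) - set(pinned_by_guests)

    # e.g. an 8 pCPU host with 2 CPUs kept for the host OS and 4 already pinned:
    free = free_dedicated_pcpus(range(8), {0, 1}, {2, 3, 4, 5})
    # A 2 vCPU dedicated guest fits only if len(free) >= 2; here free == {6, 7}.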

Alternatives

There is no alternative way to ensure that a guest has predictable execution latency, free of cache effects from other guests running on the host, that does not involve CPU pinning.

The proposed solution is to use host aggregates for grouping compute hosts into those for dedicated vs overcommit CPU policy. An alternative would be to allow compute hosts to have both dedicated and overcommit guests, splitting them onto separate sockets. ie if there were four sockets, two sockets could be used for dedicated CPU guests while two sockets could be used for overcommit guests, with usage determined on a first-come, first-served basis. A problem with this approach is that there is no strict workload isolation even if separate sockets are used. Cache effects can still be observed, and the guests will also contend for memory access, so the overcommit guests can negatively impact the performance of the dedicated CPU guests even when on separate sockets. So while this would be simpler from an administrative POV, it would not give the performance guarantees that are important for NFV use cases.

It would nonetheless be possible to enhance the design in the future so that overcommit & dedicated CPU guests could co-exist on the same host, for those use cases where admin simplicity is more important than perfect performance isolation. It is believed that it is better to start with the simpler to implement design based on host aggregates for the first iteration of this feature.

Data model impact

No impact.

The new data items are stored in the existing flavor extra specs data model and in the host state metadata model.

REST API impact

No impact.

The existing APIs already support arbitrary data in the flavor extra specs.

Security impact

No impact.

Notifications impact

No impact.

The notifications system is not used by this change.

Other end user impact

There are no changes that directly impact the end user, other than the fact that their guest should have more predictable CPU execution latency.

Performance Impact

The scheduler will incur a small further overhead if a threads policy is set on the image or flavor. This overhead will be negligible compared to that introduced by the enhancements to support NUMA policy and huge pages. It is anticipated that dedicated CPU guests will typically be used in conjunction with huge pages.

Other deployer impact

The cloud administrator will gain the ability to define flavors which offer dedicated CPU resources. The administrator will have to place hosts into groups using aggregates such that the scheduler can separate placement of guests with dedicated vs shared CPUs. Although not required by this design, it is expected that the administrator will commonly use the same host aggregates to group hosts for both CPU pinning and large page usage, since these concepts are complementary and expected to be used together. This will minimise the administrative burden of configuring host aggregates.

Developer impact

It is expected that most hypervisors will have the ability to set up dedicated pCPUs for guests vs shared pCPUs. The flavor parameter is simple enough that any Nova driver would be able to support it.

Implementation

Assignee(s)

Primary assignee:

berrange

Other contributors:

ndipanov

Work Items

  • Enhance the libvirt driver to support setup of strict CPU pinning for guests when the appropriate policy is set in the flavor (see the sketch after this list)

  • Enhance the scheduler to take account of threads policy when choosing which host to place the guest on.
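
For the libvirt work item, strict pinning would be expressed via libvirt's <cputune>/<vcpupin> elements in the domain XML. The sketch below renders such a block from a vCPU-to-pCPU mapping and is illustrative only, not the actual driver code:

    # Sketch: render a libvirt <cputune> block from a vCPU -> pCPU mapping.
    def cputune_xml(pinning):
        pins = "\n".join(
            "  <vcpupin vcpu='%d' cpuset='%d'/>" % (vcpu, pcpu)
            for vcpu, pcpu in sorted(pinning.items()))
        return "<cputune>\n%s\n</cputune>" % pins

    print(cputune_xml({0: 6, 1: 7}))
    # <cputune>
    #   <vcpupin vcpu='0' cpuset='6'/>
    #   <vcpupin vcpu='1' cpuset='7'/>
    # </cputune>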

Dependencies

Testing

It is unknown at this time whether the gate hosts have sufficient pCPUs available to allow this feature to be effectively tested by Tempest.

Documentation Impact

The new flavor parameters available to the cloud administrator need to be documented along with recommendations about effective usage. The docs will also need to mention the compute host deployment prerequisites, such as the need to set up aggregates.

References

Current “big picture” research and design for the topic of CPU and memory resource utilization and placement. CPU pinning is a subset of this work.