socket PCI NUMA affinity policy
Nova’s current support for NUMA affinity for PCI devices is limited in the kinds of affinity that it can express. Either a PCI device has affinity for a NUMA node, or no affinity at all. This makes one of two assumptions about the underlying host NUMA topology. Either there is only a single NUMA node per socket, or - for cluster on die topologies with multiple nodes per socket - there are enough CPUs in each NUMA node to fit reasonably large VMs that require strict PCI NUMA affinity. The latter assumption is no longer true, and Nova needs a more nuanced way to express PCI NUMA affinity.
Consider a guest with 16 CPUs, a PCI device, and a required PCI NUMA affinity policy. Such a policy requires the guest to “fit” entirely into the host NUMA node to which the PCI device is affined. Until recently, this was a reasonable expectation: more than 16 CPUs per NUMA node was the norm, even in hosts with multiple NUMA nodes per socket.
With more recent hardware like AMD’s Zen2 architecture, this is no longer the case. Depending on the BIOS configuration, there could be as few as 8 CPUs per NUMA node. This effectively makes a 16-CPU guest with a required PCI NUMA affinity policy un-schedulable, as no host NUMA node can fit the entire guest.
Zen2 BIOSes have an L3AsNUMA configuration option, which creates a NUMA node for every level 3 cache. Up to 4 cores can share an L3 cache, with 2 SMT threads per core, which gives 4 x 2 = 8 CPUs per NUMA node. See the AMD Developer Documentation [1] for more details.
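The arithmetic behind the 8-CPU figure, and the reason a 16-CPU guest with strict per-node affinity cannot be placed, can be checked with a short sketch. The function names and parameters are illustrative only and are not part of Nova.

```python
def cpus_per_numa_node(cores_per_l3: int, threads_per_core: int) -> int:
    """Logical CPUs in one NUMA node when every L3 cache is its own node."""
    return cores_per_l3 * threads_per_core


def fits_in_one_node(guest_cpus: int, node_cpus: int) -> bool:
    """A strict per-node policy needs the whole guest inside one host node."""
    return guest_cpus <= node_cpus


# Figures from the text: up to 4 cores per L3 cache, 2 SMT threads per core.
node_cpus = cpus_per_numa_node(cores_per_l3=4, threads_per_core=2)
print(node_cpus)                        # CPUs available in one host NUMA node
print(fits_in_one_node(16, node_cpus))  # whether a 16-CPU guest fits
```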
As an NFV cloud operator, I want to make full use of my hardware (AMD Zen2, or Intel with cluster on die enabled) with minimal performance penalties.
This spec proposes a new value for the hw:pci_numa_affinity_policy flavor extra spec (and the hw_pci_numa_affinity_policy image property). The value is socket. It indicates that the instance’s PCI device has to be affined to the same socket as the host CPUs that the instance is pinned to. If no such devices are available on any compute host, the instance fails to schedule. In that sense, socket is the same as required, except that the PCI device must belong to the same socket, rather than the same host NUMA node. If the instance spans multiple NUMA nodes, the PCI device must belong to the same socket as at least one of the NUMA nodes that the instance is pinned to.
To better understand the new policy, consider some examples.
In the following oversimplified diagram, an instance with
hw_pci_numa_affinity_policy=socket can be pinned to NUMA node N0 or N1,
but not N2 or N3:

    +----------+ +----------+
    | N0    N1 | | N2    N3 |
    |  +---PCI | |          |
    | Socket 0 | | Socket 1 |
    +----------+ +----------+
Remaining with the same diagram, if the instance has two guest NUMA nodes
instead, it can be pinned to any of the following pairs, as they all have at
least one guest NUMA node pinned to the PCI device’s socket.
N0 and N1
N0 and N2
N0 and N3
N1 and N2
N1 and N3
The instance cannot be pinned to N2 and N3, as they’re both on a different socket from the PCI device.
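The two examples above can be reproduced with a small sketch that enumerates the allowed placements. The node-to-socket mapping is read off the diagram; the function name is hypothetical and this is not Nova's implementation.

```python
from itertools import combinations

# Host layout from the diagram: host NUMA node -> socket.
NODE_SOCKET = {"N0": 0, "N1": 0, "N2": 1, "N3": 1}
PCI_SOCKET = 0  # the PCI device is attached to Socket 0 in the diagram


def socket_policy_allows(pinned_nodes, pci_socket=PCI_SOCKET):
    """True if at least one pinned host node shares the PCI device's socket."""
    return any(NODE_SOCKET[node] == pci_socket for node in pinned_nodes)


# Instance with a single NUMA node.
single = [node for node in NODE_SOCKET if socket_policy_allows([node])]
# Instance with two NUMA nodes.
pairs = [pair for pair in combinations(NODE_SOCKET, 2)
         if socket_policy_allows(pair)]

print(single)
print(pairs)
```

The pair (N2, N3) is the only two-node placement that is filtered out, since neither node is on Socket 0.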
The implementation requires knowing the socket affinity of host CPUs and PCI
devices. For CPUs, the libvirt driver obtains that information from libvirt’s
host capabilities and saves it in a new field in the
NUMACell object. For
PCI devices, the existing PCIDevice.numa_node field can be used to look up the
corresponding NUMACell object and obtain its socket affinity.
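As a rough illustration of where the socket information comes from, the sketch below reads the socket_id attribute that libvirt reports for each host CPU in its capabilities XML and maps each NUMA cell to a socket. The XML is a trimmed, hand-written sample; the helper names are hypothetical and the assumption that every CPU in a cell shares one socket is made only for this example. It is not the libvirt driver's actual code.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample in the shape of libvirt's host capabilities.
CAPS_XML = """
<capabilities>
  <host>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='2'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='2'>
            <cpu id='2' socket_id='1' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='1' core_id='1' siblings='3'/>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>
</capabilities>
"""


def cell_sockets(caps_xml):
    """Map each host NUMA cell ID to the socket ID reported for its CPUs."""
    root = ET.fromstring(caps_xml.strip())
    result = {}
    for cell in root.iterfind("./host/topology/cells/cell"):
        sockets = {int(cpu.get("socket_id"))
                   for cpu in cell.iterfind("./cpus/cpu")}
        # Assumption for this sketch: a NUMA cell never spans sockets.
        if len(sockets) != 1:
            raise ValueError("cell %s spans sockets" % cell.get("id"))
        result[int(cell.get("id"))] = sockets.pop()
    return result


def pci_device_socket(pci_numa_node, sockets):
    """Resolve a PCI device's socket via its NUMA node; None if unknown."""
    return sockets.get(pci_numa_node)


sockets = cell_sockets(CAPS_XML)
print(sockets)
print(pci_device_socket(1, sockets))
```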
The socket affinity information is then used in
numa_fit_instance_to_host(), specifically when it calls down
to the PCI manager’s PciDeviceStats.support_requests().
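The socket check performed at that point might look roughly like the following sketch. Cell and Pool are hypothetical stand-ins for the host NUMA cell and the PCI device pool, and rejecting pools without NUMA affinity is an assumption of this example rather than something the spec states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """Hypothetical stand-in for a host NUMA cell with its new socket field."""
    id: int
    socket: int


@dataclass
class Pool:
    """Hypothetical stand-in for a pool of PCI devices on one NUMA node."""
    numa_node: Optional[int]
    count: int


def filter_pools_for_socket(pools, host_cells, pinned_cell_ids):
    """Keep pools whose socket matches a socket the instance is pinned to."""
    socket_of = {cell.id: cell.socket for cell in host_cells}
    pinned_sockets = {socket_of[cell_id] for cell_id in pinned_cell_ids}
    return [
        pool for pool in pools
        # Assumption: a pool with no NUMA affinity has no known socket,
        # so it cannot satisfy the socket policy.
        if pool.numa_node in socket_of
        and socket_of[pool.numa_node] in pinned_sockets
    ]


host_cells = [Cell(0, 0), Cell(1, 0), Cell(2, 1), Cell(3, 1)]
pools = [Pool(numa_node=0, count=1)]

print(filter_pools_for_socket(pools, host_cells, [1]))     # same socket
print(filter_pools_for_socket(pools, host_cells, [2, 3]))  # other socket
```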
There are no alternatives with a similar level of simplicity. A more complex model could include numeric NUMA distances and/or PCI root complex electrical connection vs memory mapping affinity.
At the implementation level, an alternative to looking up the PCI device socket
affinity every time is to save it in a new field in the PCIDevice object.
This is ruled out because it adds a database migration, and is less flexible.
Another alternative for the same purpose is to use the
extra_info field in
PCIDevice. It is a JSON blob that can accept arbitrary new entries. One of
the original purposes of Nova objects was to avoid unversioned dicts flying
over the wire. Relying on JSON blobs inside objects goes against this. In
addition, socket affinity is applicable to all PCI devices, and so does not
belong in a device-specific extra_info blob.
Data model impact
A socket integer field is added to the NUMACell object. No database changes are necessary here, as the object is stored as a JSON blob. The field is populated at runtime by the libvirt driver.
REST API impact
No API changes per se, and definitely no new microversion. A new socket
value is added to the list of possible values for the
hw:pci_numa_affinity_policy flavor extra spec and the
hw_pci_numa_affinity_policy image property. The flavor extra spec
validation logic is extended to support the new value.
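A minimal sketch of what the extended validation amounts to is shown below. The constant and function names are hypothetical and this is not Nova's extra spec validator; required, preferred and legacy are the existing values, and socket is the one this spec adds.

```python
# Existing values plus the new "socket" value proposed by this spec.
PCI_NUMA_AFFINITY_POLICIES = ("required", "preferred", "legacy", "socket")


def validate_pci_numa_affinity_policy(extra_specs):
    """Return the requested policy, or None if the extra spec is not set.

    Raises ValueError for values outside the allowed list.
    """
    value = extra_specs.get("hw:pci_numa_affinity_policy")
    if value is None:
        return None
    if value not in PCI_NUMA_AFFINITY_POLICIES:
        raise ValueError(
            "invalid hw:pci_numa_affinity_policy %r; expected one of %s"
            % (value, ", ".join(PCI_NUMA_AFFINITY_POLICIES)))
    return value


print(validate_pci_numa_affinity_policy(
    {"hw:pci_numa_affinity_policy": "socket"}))
```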
Other end user impact
There is minimal impact on Nova performance. Documentation on the performance
impact of using the new
socket NUMA affinity policy on various
architectures may be necessary.
Other deployer impact
Only the libvirt driver supports PCI NUMA affinity policies. This spec builds on that support.
The current (pre-Wallaby) implementation only recognizes required, preferred
and legacy as values for
hw_pci_numa_affinity_policy, with the latter being the catch-all default.
Therefore, instances with
hw_pci_numa_affinity_policy=socket cannot be
permitted to land on pre-Wallaby compute hosts: the
socket value would not
be recognized, and they would be incorrectly treated as having the legacy
policy.
To ensure that only Wallaby compute hosts receive instances with
hw_pci_numa_affinity_policy=socket, a new trait is reported by the Wallaby
libvirt driver to indicate that it supports the new policy. A corresponding
request pre-filter is added.
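The pre-filter logic can be pictured as follows: when the requested policy is socket, the trait is added to the set of traits a compute host must report. This is a sketch only. The function name is hypothetical, the trait name is the provisional one from this spec, and giving the flavor extra spec precedence over the image property is an assumption made for the example.

```python
# Provisional trait name from this spec; it may change during implementation.
SOCKET_TRAIT = "COMPUTE_SOCKET_NUMA_AFFINITY"


def required_traits_for(flavor_extra_specs, image_properties):
    """Return the extra traits a host must report for this request."""
    # Assumption: the flavor extra spec takes precedence over the image
    # property when both are set.
    policy = (flavor_extra_specs.get("hw:pci_numa_affinity_policy")
              or image_properties.get("hw_pci_numa_affinity_policy"))
    if policy == "socket":
        return {SOCKET_TRAIT}
    return set()


print(required_traits_for({"hw:pci_numa_affinity_policy": "socket"}, {}))
print(required_traits_for({}, {"hw_pci_numa_affinity_policy": "legacy"}))
```

Hosts that predate the new policy never report the trait, so a request carrying it cannot land on them.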
- Primary assignee:
- Feature liaison:
- Add a socket integer field to the NUMACell object.
- Libvirt driver starts populating the new socket field.
- Update PciDeviceStats._filter_pools(), as called by PciDeviceStats.support_requests(), to support the new socket NUMA affinity policy.
- Add COMPUTE_SOCKET_NUMA_AFFINITY trait (name can be adjusted during implementation) and corresponding pre-filter.
- Extend the flavor extra spec validation to allow the new socket value.
While there are aspirations for AMD Zen2 hardware in a third party CI, that is too far in the future to have any impact on this spec. Functional tests will have to do.
The behavior of the new
socket NUMA affinity policy will be documented.
Documentation on the performance impact of using the new socket NUMA
affinity policy on various architectures may be necessary.