PCI NUMA Policies¶
https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes
In the Juno release the "I/O based NUMA scheduling" spec was implemented [1]. This modified the scheduling algorithm so that users could only boot instances with PCI devices if the instance was scheduled on at least one of the NUMA nodes associated with those PCI devices, or if the PCI devices reported no NUMA affinity information. Before this, nova booted instances with PCI devices without checking NUMA affinity. However, this hard-coded behaviour causes problems if not every NUMA node has its own PCI device: in that case, nova will not allow booting an instance on the NUMA nodes without PCI devices.
Problem description¶
In its current iteration, nova boots instances with PCI devices on the same NUMA nodes that these PCI devices are associated with. This is good for performance, as it ensures there is limited cross-NUMA node memory traffic. However, if a user has an environment with two NUMA nodes and only one PCI device (for example, an SR-IOV card associated with the first NUMA node), they would only be able to boot instances with a single NUMA node and SR-IOV ports on that first NUMA node. In this case, the user cannot use half of the CPUs and RAM because these resources are placed on the second NUMA node. The user should be able to boot instances on different NUMA nodes, even if it makes performance worse.
In addition, the current behavior doesn't always provide the best performance because an instance can use a PCI device when there is no information about the affinity of NUMA nodes with this PCI device. This can lead to a situation whereby the PCI device is not on the NUMA node that the instance's CPUs and RAM are on. The scheduling mechanism should be more flexible: the user should be able to choose between maximum-performance behavior and maximum chance of successfully launching the instance.
Of course, this ability should be configurable, and the current scheduling behaviour must remain the default.
Use Cases¶
As an operator who cares about obtaining maximum performance from my PCI devices, I want to ensure my PCI devices are always NUMA affinitized, even if this results in lower resource usage.
As an operator who cares about maximum usage of resources, I want to ensure that an instance has the best chance of being scheduled successfully, even if this results in slightly lower performance for some instances.
As an operator of a deployment with a mix of NUMA-aware and non-NUMA-aware hosts, I want to ensure my PCI devices are always NUMA affinitized if NUMA information is available. However, I still want to be able to schedule instances on the non-NUMA-aware hosts.
Alternatively, as an operator with an existing deployment using PCI devices, I don’t want nova to pull the rug from under my feet and suddenly refuse to schedule to hosts with no NUMA information when it used to.
Proposed change¶
This spec makes the NUMA affinity of PCI devices used by instances configurable. To this end, we will add a new key, numa_policy, to the [pci] alias JSON configuration option. This option can have one of three values.
required
This value will mean that nova will boot instances with PCI devices only if at least one of the instance's NUMA nodes is associated with these PCI devices. It also means that if the NUMA node info for some PCI devices could not be determined, those PCI devices would not be consumable by the instance. This provides maximum performance.
preferred
This value will mean that nova-scheduler will choose a compute host with minimal consideration for the NUMA affinity of PCI devices. nova-compute will attempt a best-effort selection of PCI devices based on NUMA affinity; however, if this is not possible, nova-compute will fall back to scheduling on a NUMA node that is not associated with the PCI device.
Note that even though the NUMATopologyFilter will not consider NUMA affinity, the weigher proposed in the Reserve NUMA Nodes with PCI Devices Attached spec [2] can be used to maximize the chance that a chosen host will have NUMA-affinitized PCI devices.
legacy
This is the default value and it describes the current nova behavior. Usually we have information about the association of PCI devices with NUMA nodes. However, some PCI devices do not provide such information. The legacy value will mean that nova will boot instances with PCI devices if either:
- The PCI device is associated with at least one of the NUMA nodes on which the instance will be booted
- There is no information about PCI-NUMA affinity available
This is required because the configuration option will apply globally to an instance, which may have multiple devices attached, and not all of these devices may have NUMA affinity. An example of such a device is the FPGA integrated onto the dies of recent Intel Xeon chips, which hooks into the QPI bus and therefore has no NUMA affinity [3].
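To make these three behaviours concrete, here is a minimal, self-contained Python sketch of how a host's PCI device pools might be filtered under each policy. The names (PciPool, usable_pools) are hypothetical illustrations, not nova's actual code:

from typing import List, NamedTuple, Optional, Set


class PciPool(NamedTuple):
    """A group of identical PCI devices on one host NUMA node."""
    numa_node: Optional[int]  # None if the device reports no NUMA affinity
    count: int


def usable_pools(pools: List[PciPool],
                 instance_nodes: Set[int],
                 policy: str) -> List[PciPool]:
    """Return the pools a PCI request may consume, best candidates first."""
    affined = [p for p in pools if p.numa_node in instance_nodes]
    no_info = [p for p in pools if p.numa_node is None]

    if policy == 'required':
        # Strict: only devices with known, matching affinity.
        return affined
    if policy == 'legacy':
        # Current default: matching affinity, or no affinity info at all.
        return affined + no_info
    if policy == 'preferred':
        # Best effort: prefer affined devices, but fall back to any device.
        remote = [p for p in pools
                  if p.numa_node is not None
                  and p.numa_node not in instance_nodes]
        return affined + no_info + remote
    raise ValueError('unknown numa_policy: %s' % policy)

For example, given one pool on NUMA node 0 and one pool with no NUMA information, an instance pinned to node 1 would get no devices under required, only the no-info pool under legacy, and every pool (least-affine last) under preferred.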
The end result will be an option that looks something like this:
[pci]
alias = '{
    "name": "QuickAssist",
    "product_id": "0443",
    "vendor_id": "8086",
    "device_type": "type-PCI",
    "numa_policy": "legacy"
}'
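For comparison, an operator requiring strict NUMA affinity for the same device class would simply switch the policy value:

[pci]
alias = '{
    "name": "QuickAssist",
    "product_id": "0443",
    "vendor_id": "8086",
    "device_type": "type-PCI",
    "numa_policy": "required"
}'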
Alternatives¶
Change placement behavior so that instances which do not need PCI devices are never booted on NUMA nodes with PCI devices. This would maximize the possibility that an instance requiring PCI devices could find a suitable host to boot on. However, it would severely limit our flexibility: attempting to boot many instances without PCI devices would leave a large number of hosts with PCI devices unused. Furthermore, once all NUMA nodes without PCI devices were saturated, boots of instances that do not need PCI devices would fail.
Change placement behavior to avoid booting instances without PCI devices on NUMA nodes with PCI devices if possible. This is a softer version of the first alternative and has actually been addressed by the ‘reserve-numa-with-pci’ spec [4].
Make the PCI NUMA strictness part of the device request. This level of granularity would likely be sufficient, but it would necessitate yet another set of flavor extra specs and image metadata options. This isn't something we want.
Data model impact¶
A new field, numa_policy, will be added to the InstancePCIRequest object. As this object is stored as a JSON blob in the database, no DB migrations are necessary to add the new field to this object.
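As a rough sketch of what this might look like (the version bump and the companion fields shown are illustrative assumptions, not the actual nova patch), the field could be declared on the object with oslo.versionedobjects:

from oslo_versionedobjects import base
from oslo_versionedobjects import fields


class InstancePCIRequest(base.VersionedObject):
    # Version 1.1: added numa_policy (version number is illustrative)
    VERSION = '1.1'

    fields = {
        'count': fields.IntegerField(),
        'alias_name': fields.StringField(nullable=True),
        # New field. Declared nullable so that requests serialized as
        # JSON blobs before the upgrade, which lack this key, still
        # deserialize cleanly -- which is why no DB migration is needed.
        'numa_policy': fields.EnumField(
            valid_values=['required', 'legacy', 'preferred'],
            nullable=True),
    }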
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None
Performance Impact¶
If the required policy is selected, the performance of instances with PCI devices will be more consistent in deployments with non-NUMA-aware compute hosts present, because nova would no longer use these hosts. However, this will also result in a smaller number of hosts available on which to schedule instances. If all hosts correctly provide NUMA information, performance will be unchanged.
If the preferred policy is selected, the performance of instances with PCI devices may be worse for some instances, because nova can now schedule an instance on a host with non-NUMA-affinitized PCI devices. However, this will also result in a larger number of hosts available on which to schedule instances, maximizing flexibility for operators who don't require maximum performance. The PCI weigher proposed in the Reserve NUMA Nodes with PCI Devices Attached spec [2] can be used to minimize the risk of performance impacts.
If the legacy policy is selected, the existing nova behaviour will be retained and performance will remain unchanged.
From a scheduling perspective, this may introduce a delay if the required policy is selected and there are a large number of hosts with PCI devices that do not report NUMA affinity. On the other hand, using the preferred policy will result in improved scheduling performance, as the ability to schedule is no longer tied to the availability of free CPUs on a NUMA node associated with the PCI device.
Other deployer impact¶
None
Developer impact¶
None
Implementation¶
Assignee(s)¶
- Primary assignee:
Stephen Finucane (stephenfinucane)
- Other contributors:
Sergey Nikitin (snikitin)
Work Items¶
- Add the new field to the [pci] alias option
- Add the new field to the InstancePCIRequest object
- Change the NUMA node selection process to take the new policy into account
- Update user docs
Dependencies¶
None
Testing¶
Scenario tests will be added to validate these modifications.
Documentation Impact¶
This feature will not add a new scheduling filter, but it will change the behaviour of the NUMATopologyFilter. We should add documentation describing the new key for the [pci] alias option.
References¶
History¶
Release Name | Description
-------------|------------
Queens       | Introduced