Reserve NUMA nodes with PCI devices attached¶
Since Juno, instances bound to PCI devices must be scheduled to at least one NUMA node associated with the PCI device [1]. Unfortunately, the scheduler was not enhanced to ensure instances without a PCI device would not occupy these NUMA nodes unnecessarily. This spec proposes to optimize the scheduler to ensure these NUMA nodes are reserved, thus increasing the number of PCI-attached instances deployers can boot in conjunction with non-PCI instances.
The NUMA locality of I/O devices is an important characteristic to consider when configuring a high performance, low latency system for NFV workloads. The ‘I/O (PCIe) Based NUMA Scheduling’ blueprint optimized instance placement by ensuring that instances bound to a PCI device, via PCI passthrough requests, are placed so that their PCI devices and CPUs share a NUMA node. However, the scheduler uses nodes linearly, even when only a select few of a host's nodes are associated with special resources like PCI devices. As a result, instances without any PCI requirements can fill host NUMA nodes with PCI devices attached, which results in scheduling failures for PCI-bound instances.
As an operator, I want to reserve nodes with PCI devices, which are typically expensive and very limited resources, for guests that actually require them.
As a user launching instances that require PCI devices, I want the cloud to ensure that they are available.
Enhance both the filter scheduler and resource tracker to prefer non-PCI NUMA nodes for non-PCI instances.
If an instance is bound to a PCI device, then existing behavior dictates that the NUMA node associated with the PCI device will be used at a minimum.
If an instance is not bound to a PCI device, then hosts without PCI devices will be preferred. If no host matching this and other requirements exists, then hosts with PCI devices will be used but NUMA nodes without associated PCI devices will be preferred.
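The node-level half of this preference can be illustrated with a minimal sketch. It is not the Nova implementation: `NUMACell`, `order_cells` and `pick_cell` are hypothetical names, and the model tracks only free CPUs and a PCI device count per node.

```python
from dataclasses import dataclass


@dataclass
class NUMACell:
    """Hypothetical, simplified stand-in for a host NUMA node."""
    id: int
    free_cpus: int
    pci_devices: int  # number of PCI devices attached to this node


def order_cells(cells, instance_wants_pci):
    """Return cells in the order they should be tried for an instance.

    Non-PCI instances try cells without PCI devices first. PCI-bound
    instances keep the original (linear) order.
    """
    if instance_wants_pci:
        return list(cells)
    # sorted() is stable, so cells keep their linear order within each group.
    return sorted(cells, key=lambda cell: cell.pci_devices > 0)


def pick_cell(cells, cpus, instance_wants_pci):
    """Pick the first cell in preference order with enough free CPUs."""
    for cell in order_cells(cells, instance_wants_pci):
        if instance_wants_pci and cell.pci_devices == 0:
            continue  # PCI-bound instances still need a PCI-attached node
        if cell.free_cpus >= cpus:
            return cell
    return None


cells = [NUMACell(id=0, free_cpus=4, pci_devices=1),
         NUMACell(id=1, free_cpus=4, pci_devices=0)]
chosen = pick_cell(cells, cpus=2, instance_wants_pci=False)
print(chosen.id)  # 1: the node without PCI devices is preferred
```

With purely linear ordering the non-PCI instance would have landed on node 0. Here node 0 is used for a non-PCI instance only when no other node has enough free CPUs.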
Instances with PCI devices must still be scheduled on nodes with a PCI device attached. Enabling some sort of “soft affinity” where this is no longer a requirement is outside of the scope of this blueprint.
Add a configuration option that allows instances to schedule to nodes other than those associated with the PCI device(s). This will ensure instances can fully utilize resources, but will not solve the problem of non-PCI instances occupying preferred NUMA nodes. This should be seen as a complement, rather than an alternative.
Ensure PCI devices are placed in PCI slots associated with the highest-numbered NUMA node. PCI-based instances will always use these, while non-PCI instances are assigned to nodes linearly (and therefore, lowest first). However, this would mean moving tens or even thousands of PCI devices and would require a spreading, rather than packing, based approach to host scheduling.
Use host aggregates instead. This doesn’t require any new functionality but it will fail in the scenario where a host does not have uniform PCI availability across all nodes or where instances consume all PCI devices on a host but not all CPUs. In both cases, a given amount of resources on said hosts will go to waste.
Use host aggregates instead. This doesn’t require any new functionality but it would necessitate restricting the capacity of a deployment in a very static fashion for the sake of maximizing the chance that PCI instances will schedule successfully.
Host aggregates make sense for something like separating pinned instances from unpinned, because scheduling a non-pinned instance would effectively defeat the whole purpose of using pinning in the first place (the unpinned instance would float across all available host cores, including pinned cores, negating the performance improvements that pinning provides). This is a strict requirement. For the PCI case, on the other hand, nothing bad will happen if we schedule a non-PCI instance on a PCI-capable host: we’ll just have less capacity on PCI hosts for instances that need them. That doesn’t mean trying to restrict non-PCI instances from using PCI-capable hosts is a bad thing to do: making the scheduler “smarter” and maximizing the chance that an instance will be scheduled successfully is always going to be a win. However, artificially limiting the amount of resources available to you _is_ a very bad thing. Regardless of whether you have uniform hardware or not, it is unlikely that you will have uniform workloads, and it is very likely that the ratio of PCI to non-PCI workloads you have will vary with time. This makes host aggregates a poor solution to this problem.
Data model impact¶
REST API impact¶
Other end user impact¶
An additional weigher will be added, which will assess the number of PCI devices on each node of a host. As all weighers are enabled by default, this will result in a slight increase in latency during host weighing. However, this impact will be negligible compared to both the performance enhancement that using correctly-affinitized PCI devices brings and the cost savings from fully utilizing all available hardware.
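As a rough illustration of what such a weigher might compute, the sketch below ranks hosts by a PCI-based weight. It is a standalone model, not Nova's weigher API: `HostState`, `pci_weight` and `rank_hosts` are hypothetical names, the weight is simply the count of free PCI devices, and filters are assumed to have already removed hosts that cannot satisfy the request.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical, simplified view of a compute host."""
    name: str
    free_pci_devices: int


def pci_weight(host, requested_pci_devices):
    """Return a raw weight for a host; higher means more preferred."""
    if requested_pci_devices > 0:
        # PCI-bound instance: prefer hosts with more free PCI devices.
        return float(host.free_pci_devices)
    # Non-PCI instance: prefer hosts with fewer (ideally no) PCI devices.
    return -float(host.free_pci_devices)


def rank_hosts(hosts, requested_pci_devices):
    """Order hosts from most to least preferred for this request."""
    return sorted(hosts,
                  key=lambda h: pci_weight(h, requested_pci_devices),
                  reverse=True)


hosts = [HostState("a", 2), HostState("b", 0), HostState("c", 1)]
print([h.name for h in rank_hosts(hosts, 0)])  # ['b', 'c', 'a']
print([h.name for h in rank_hosts(hosts, 1)])  # ['a', 'c', 'b']
```

The cost is one arithmetic operation per candidate host plus a sort, which is consistent with the expectation that the added latency is small.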
Other deployer impact¶
The PCI weigher will be added to
However, deployers may wish to enable this manually using the filter_scheduler.weight_classes configuration option.
- Primary assignee:
Add a new PCIWeigher weigher class to prefer hosts without PCI devices when there are no PCI devices attached to the instance and vice versa
Modify scheduling code to prefer cores on NUMA nodes without attached PCI devices when there are no PCI devices attached to the instance
Functional tests which fake out libvirt resource reporting but actually exercise the scheduler
A new weigher will be added. This should be documented.