PCI NUMA Policies

https://blueprints.launchpad.net/nova/+spec/share-pci-between-numa-nodes

In the Juno release the “I/O based NUMA scheduling” spec was implemented [1]. This modified the scheduling algorithm so that instances with PCI devices could only boot if at least one of the instance’s NUMA nodes was associated with those PCI devices, or if the PCI devices exposed no NUMA affinity information at all. Before this, nova booted instances with PCI devices without checking NUMA affinity. However, this hard-coded behaviour causes problems when not every NUMA node has its own PCI device: in that case, nova will refuse to boot an instance on the NUMA nodes without PCI devices.

Problem description

In its current iteration, nova boots instances with PCI devices on the same NUMA nodes that those PCI devices are associated with. This is good for performance, as it ensures there is limited cross-NUMA node memory traffic. However, consider a user with an environment with two NUMA nodes and only one PCI device, for example an SR-IOV card associated with the first NUMA node. Such a user can only boot single-NUMA-node instances with SR-IOV ports on the first NUMA node, meaning half of the host’s CPUs and RAM, those placed on the second NUMA node, cannot be used for these instances. The user should be able to boot instances on the other NUMA node, even if it makes performance worse.

In addition, the current behavior doesn’t always provide the best performance either, because an instance can use a PCI device for which no NUMA affinity information is available. This can lead to a situation where the PCI device is not on the NUMA node that the instance’s CPUs and RAM are on. The scheduling mechanism should be more flexible: the user should be able to choose between maximum-performance behavior and the maximum chance of successfully launching the instance.

Of course, this ability should be configurable, and the current scheduling behaviour must remain the default.

Use Cases

  • As an operator who cares about obtaining maximum performance from my PCI devices, I want to ensure my PCI devices are always NUMA affinitized, even if this results in lower resource usage.

  • As an operator who cares about maximum usage of resources, I want to ensure that an instance has the best chance of being scheduled successfully, even if this results in slightly lower performance for some instances.

  • As an operator of a deployment with a mix of NUMA-aware and non-NUMA-aware hosts, I want to ensure my PCI devices are always NUMA affinitized if NUMA information is available. However, I still want to be able to schedule instances on the non-NUMA-aware hosts.

    Alternatively, as an operator with an existing deployment using PCI devices, I don’t want nova to pull the rug from under my feet and suddenly refuse to schedule to hosts with no NUMA information when it used to.

Proposed change

This spec makes the NUMA affinity of the PCI devices used by instances configurable. To this end, we will add a new key, numa_policy, to the [pci] alias JSON configuration option. This option can have one of three values.

required

This value will mean that nova will boot an instance with PCI devices only if at least one of the instance’s NUMA nodes is associated with those PCI devices. If the NUMA node information for a PCI device cannot be determined, that device will not be consumable by the instance. This provides maximum performance.

preferred

This value will mean that nova-scheduler will choose a compute host with minimal consideration for the NUMA affinity of PCI devices. nova-compute will attempt a best-effort selection of PCI devices based on NUMA affinity; however, if this is not possible, nova-compute will fall back to placing the instance on a NUMA node that is not associated with the PCI device.

Note that even though the NUMATopologyFilter will not consider NUMA affinity, the weigher proposed in the Reserve NUMA Nodes with PCI Devices Attached spec [2] can be used to maximize the chance that a chosen host will have NUMA-affinitized PCI devices.

legacy

This is the default value and it describes the current nova behavior. Usually we have information about the association of PCI devices with NUMA nodes; however, some PCI devices do not provide it. The legacy value will mean that nova will boot instances with PCI devices if either:

  • The PCI device is associated with at least one of the NUMA nodes on which the instance will be booted

  • There is no information about PCI-NUMA affinity available

This is required because the configuration option will apply globally to an instance, which may have multiple devices attached, and not all of these devices may have NUMA affinity. An example of such devices is the FPGAs integrated onto the dies of recent Intel Xeon chips, which hook into the QPI bus and therefore have no NUMA affinity [3].
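To make the three values concrete, the following is a minimal, illustrative Python sketch (not the actual nova code) of the device-selection rule each policy implies. The function and argument names are invented for illustration: device_node is the NUMA node a PCI device reports (None if unknown), and instance_nodes is the set of host NUMA nodes assigned to the instance.

def device_is_consumable(policy, device_node, instance_nodes):
    """Illustrative only: would this PCI device satisfy the policy?"""
    if policy == 'required':
        # Affinity information must exist and must match one of the
        # instance's NUMA nodes.
        return device_node is not None and device_node in instance_nodes
    if policy == 'legacy':
        # Current behaviour: a match, or no affinity information at all.
        return device_node is None or device_node in instance_nodes
    if policy == 'preferred':
        # Any device will do; nova-compute merely prefers affinitized
        # devices when more than one candidate is available.
        return True
    raise ValueError('unknown numa_policy: %s' % policy)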

The end result will be an option that looks something like this:

[pci]
alias = '{
  "name": "QuickAssist",
  "product_id": "0443",
  "vendor_id": "8086",
  "device_type": "type-PCI",
  "numa_policy": "legacy"
}'
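A flavor can then request the device in the usual way, via the existing pci_passthrough:alias extra spec (for example, "pci_passthrough:alias": "QuickAssist:1"); the policy recorded in the alias is what will be carried on the resulting PCI request, as described under Data model impact below.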

Alternatives

  • Change placement behavior so that instances which do not need PCI devices are never booted on NUMA nodes with PCI devices. This would maximize the possibility that an instance requiring PCI devices could find a suitable host to boot on. However, it would severely limit our flexibility: attempting to boot many instances without PCI devices would leave a large number of PCI-device-equipped hosts unused, and once all NUMA nodes without PCI devices were saturated, boots of instances that do not need PCI devices would fail.

  • Change placement behavior to avoid booting instances without PCI devices on NUMA nodes with PCI devices if possible. This is a softer version of the first alternative and has actually been addressed by the ‘reserve-numa-with-pci’ spec [4].

  • Make the PCI NUMA strictness part of the device request. This level of granularity would likely be sufficient, but it would necessitate yet another set of flavor extra specs and image metadata options. This isn’t something we want.

Data model impact

A new field, numa_policy, will be added to the InstancePCIRequest object. As this object is stored as a JSON blob in the database, no DB migrations are necessary to add the new field to this object.
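As a rough sketch, assuming nova’s usual oslo.versionedobjects pattern (the version number and the subset of fields shown here are illustrative, not the real object definition), the change amounts to one extra entry in the object’s fields dict:

from oslo_versionedobjects import base, fields

class InstancePCIRequest(base.VersionedObject):
    # Hypothetical version bump to carry the new field.
    VERSION = '1.2'

    fields = {
        'count': fields.IntegerField(),
        'alias_name': fields.StringField(nullable=True),
        # New: 'required', 'legacy' or 'preferred'. None means the
        # policy was never set, in which case legacy semantics apply.
        'numa_policy': fields.StringField(nullable=True),
    }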

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

If the required policy is selected, the performance of instances with PCI devices will be more consistent in deployments where non-NUMA-aware compute hosts are present, because nova would no longer use these hosts. However, this will also result in a smaller number of hosts available on which to schedule instances. If all hosts correctly provide NUMA information, performance will be unchanged.

If the preferred policy is selected, the performance of instances with PCI devices may be worse for some instances, because nova can now schedule an instance on a host with non-NUMA-affinitized PCI devices. However, this will also result in a larger number of hosts available on which to schedule instances, maximizing flexibility for operators who don’t require maximum performance. The PCI weigher proposed in the Reserve NUMA Nodes with PCI Devices Attached spec [2] can be used to minimize the risk of performance impacts.

If the legacy policy is selected, the existing nova behaviour will be retained and performance will remain unchanged.

From a scheduling perspective, this may introduce a delay if the required policy is selected and there are a large number of hosts with PCI devices that do not report NUMA affinity. On the other hand, using the preferred policy should improve scheduling performance, as the ability to schedule is no longer tied to the availability of free CPUs on a NUMA node associated with the PCI device.

Other deployer impact

None

Developer impact

None

Implementation

Assignee(s)

Primary assignee:

Stephen Finucane (stephenfinucane)

Other contributors:

Sergey Nikitin (snikitin)

Work Items

  • Add new field to the [pci] alias option

  • Add new field to the InstancePCIRequest object

  • Change the NUMA node selection process to take the new policy into account

  • Update user docs

Dependencies

None

Testing

Scenario tests will be added to validate these modifications.

Documentation Impact

This feature will not add a new scheduling filter, but it will change the behaviour of NUMATopologyFilter. We should add documentation to describe the new key for the [pci] alias option.

References

[1] “I/O based NUMA scheduling” spec (implemented in Juno)
[2] “Reserve NUMA Nodes with PCI Devices Attached” spec
[3] Intel Xeon processors with integrated FPGAs
[4] ‘reserve-numa-with-pci’ spec

History

Revisions

  Release Name    Description
  ------------    -----------
  Queens          Introduced