PCI Passthrough Groups

https://blueprints.launchpad.net/nova/+spec/pci-passthrough-groups

This spec allows operators to create a flavor using a PCI alias to request a group of PCI devices. These groups of PCI devices are tracked as a single indivisible unit within Placement. The default custom resource class used to track these PCI groups is derived from the PCI group type name, and the name of the inventory is derived from the PCI group name. The pci_alias config already supports mapping to a specific placement resource class.

Problem description

Some PCI devices only make sense to be consumed as a group. When you assign the grouped PCI devices to a VM, all of the devices in the group as always consumed together by a single VM. Currently Nova does not understand any grouping other than NUMA affinity.

While there are some cases where a device could be consumed by multiple different groups, that are dynamically picked based on demand, we are ignoring these use cases for now. In particular, we make the simplifying restriction that a tracked PCI device can only be a member of a single group, and when a PCI device is a member of a group, it can only be used as part of that PCI group.

Use Cases

Some GPUs expose both a graphics physical function and an audio function. In order to support passing through both devices, we need to ensure that we pass through a matching pair of devices. This spec would allow a device group to be created such that operators configure the matching pairs of audio and graphics devices, and users can request one (or more) of those pairs via the usual PCI alias.

Note, we are currently excluding the use case of users requesting either the pair of devices or just the graphics device, as that would result in additional complexity that should be considered in a separate follow on specification.

Let us consider the specific case of the Graphcore C200 device, where a set of PCI cards are connected together via IPU-Link: https://docs.graphcore.ai/projects/C600-datasheet/en/latest/product-description.html#ipu-link-cables

Each physical card presents two PCI devices. The card can be used independently of other cards if a matched pair of devices are presented to the VM. PCI groups allows this device to be correctly passed through to VMs by ensuring a matched pair of PCI devices are always assigned to each VM.

In addition, some servers can be statically configured to group either two devices, four devices or eight devices as a single group. These can all be statically configured using PCI group to ensure we always respect the non-PCI connectivity between those PCI devices.

Proposed change

The key parts of this change include:

  • extend [pci]device_spec to model groups of PCI devices

  • devices are linked by both a group type name, and a specific group name

  • the group type name is used to generate a custom resource class, i.e. CUSTOM_PCI_GROUP<group_type_name>. Note this is just the default that changes when you specify a group type name, and it can be overrriden by explicitly specifying a different resource_class tag.

  • Each group is registered in placement, in a similar way to a device. Each group being a separate resource provider with a single inventory item for the associated group type custom resource type, with a name that is generated from the group_name rather than the PCI device address

  • extend [pci]alias simply mapps to the resource class mentioned above, such as CUSTOM_PCI_GROUP_<group_type_name>.

  • PCI tracker will have the group_name and group_type_name added to each device that is being tracked, such that we can look up a group of devices associated with each specific named group tracked in placement.

There will be configuration validation checks:

  • pci groups are only supported when PCI devices are tracked in placement

  • all device groups must have two or more PCI devices

  • each physical PCI device can only be in one group, and must only be tracked in placement once

For example, lets consider the following PCI devices:

  • 4e:00.0 Processing accelerators: Graphcore Ltd Device 0003

  • 4f:00.0 Processing accelerators: Graphcore Ltd Device 0003

  • 89:00.0 Processing accelerators: Graphcore Ltd Device 0003

  • 8a:00.0 Processing accelerators: Graphcore Ltd Device 0003

The two physical cards, spread across two NUMA nodes can be presented in two possible ways: either two groups or a single group, depending on the use cases. For example, two separate devices would be::

[pci]
device_spec = {"address": ":4e:00.0", group_name:"graphcore_1", group_type:"c200_x1"}
device_spec = {"address": ":4f:00.0", group_name:"graphcore_1", group_type:"c200_x1"}
device_spec = {"address": ":4e:00.0", group_name:"graphcore_2", group_type:"c200_x1"}
device_spec = {"address": ":4f:00.0", group_name:"graphcore_2", group_type:"c200_x1"}
alias = {"name":"c200_x1", resource_class:"CUSTOM_PCI_GROUP_C200_X1"}

But exposing the two cards, exposed as four PCI devices, as a single unit of 4 PCI devices, would look like this::

[pci]
device_spec = {"address": ":4e:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
device_spec = {"address": ":4f:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
device_spec = {"address": ":4e:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
device_spec = {"address": ":4f:00.0", group_name:"graphcore_1", group_type:"c200_x2"}
alias = {"name":"c200_x2", resource_class:"CUSTOM_PCI_GROUP_C200_X2"}

Alternatives

For some simple cases, NUMA affinity can simulate what is required. But currently hardware like Graphcore C200 does not work well with Nova.

Data model impact

PCI tracker needs to be extended to include group_name and group_type for each PCI device.

REST API impact

No impact

Security impact

No impact

Notifications impact

No impact

Other end user impact

No impact

Performance Impact

No impact

Other deployer impact

The device spec configuration gets some extra options to help define groups, and the default resource class changes when you use the new device_group tags, as discussed above.

Developer impact

None

Upgrade impact

Devices that are exposed as a group must be not currently tracked in placement when starting to expose them as a group.

Once new compute nodes will report the new resoruce classes, which should naturally gate the need for older compute nodes to know what to do with the new PCI device configuration.

Implementation

Assignee(s)

Primary assignee:

johngarbutt

Other contributors:

nathanharper

Feature Liaison

Feature liaison:

gibi?

Work Items

  • Update pci device config to support pci groups

  • Update PCI device tracker to know about pci groups

  • Attach groups of devices when device alias requests a resource class that maps to a PCI device group

  • Update placement with the avilable resources from the described pci groups

Dependencies

None

Testing

Add a functional test, similar to vgpu tests.

Documentation Impact

Configuration changes need to be documented correctly.

References

None

History

Optional section intended to be used each time the spec is updated to describe new design, API or any database schema updated. Useful to let reader understand what’s happened along the time.

Revisions

Release Name

Description

2024.1 Caracal

Introduced