Off-path SmartNIC DPU Port Binding with OVN

https://blueprints.launchpad.net/neutron/+spec/off-path-smartnic-dpu-port-binding-with-ovn

Off-path SmartNIC DPUs introduce an architecture change where network agents responsible for NIC switch configuration and representor interface plugging run on a separate SoC with its own CPU, memory and that runs a separate OS kernel. The side-effect of that is that hypervisor hostnames no longer match SmartNIC DPU hostnames which are seen by ovs-vswitchd and OVN [3] agents while the existing port binding code relies on that. The goal of this specification is to introduce changes necessary to extend the existing hardware offload code to cope with the hostname mismatch and related design challenges while reusing the rest of the code. To do that, PCI(e) add-in card tracking is introduced for cards with unique serial numbers so that it can be used to determine the correct hostname of a SmartNIC DPU which is responsible for a particular VF. Additionally, more information is suggested to be passed in the “binding:profile” during a port update to facilitate representor port plugging.

Nova spec: https://review.opendev.org/c/openstack/nova-specs/+/787458

Problem Description

Terminology

  • Data Processing Unit (DPU) - an embedded system that includes a CPU, a NIC and possibly other components on its board which integrates with the main board using some I/O interconnect (e.g. PCIe);

  • Off-path SmartNIC DPU architecture [1] [2] - an architecture where NIC cores are responsible for programming a NIC Switch and are bypassed when rules programmed into the NIC Switch are enough to make a decision on where to deliver packets. Normally, NIC cores only participate in packet forwarding for the “slow path” only and the “fast path” is handled in hardware like an ASIC;

  • On-path SmartNIC DPU architecture [1] [2] - an architecture where NIC cores participate in processing of every packet going through the NIC as a whole. In other words, NIC cores are always on the “fast path” of all packets;

  • NIC Switch (or eSwitch) - a programmable embedded switch present in various types of NICs (SR-IOV-capable NICs, off-path SmartNICs). Typically relies on ASICs for packet processing;

  • switchdev [4] - in-kernel driver model for switch devices which offload the forwarding (data) plane from the kernel.

  • Representor ports [5] - a concept introduced in the switchdev model which models netdevs representing switch ports. This applies to NIC switch ports (which can be physical uplink ports, PFs or VFs);

  • devlink [6] - a kernel API to expose device information and resources not directly related to any device class, such as chip-wide/switch-ASIC-wide configuration;

  • PCI/PCIe Vital Product Data (VPD) - a standard capability exposed by PCI(e) endpoints which, among other information, includes a unique serial number (read-only, persistent, factory-generated) of a card shared by all functions exposed by it. Present in PCI local bus 2.1+ and PCIe 4.0+ specifications.

Detailed overview

The related Nova specification [7] is an integral part of the cross-project effort and its problem description is assumed to be a part of this specification.

There are also relevant changes to OVS [8] and OVN itself [9] [10] that are needed in order to facilitate the implementation of this specification in Neutron. The result of upstream OVN discussions has been that a separate ovn-vif [11] component is going to be introduced and the corresponding change [10] is in the late stages of the review process at the time of writing. Support in other ML2 mechanism drivers is possible, but we would like to target OVN first and leave other ML2 drivers out of scope for Yoga cycle.

In the context of off-path SmartNIC DPUs, the main problems that need to be addressed in this change are:

  • SmartNIC DPU hostname selection such that the correct host is selected for representor port plugging. This corresponds to selecting the right chassis (transport node) in OVN terms;

  • VF representor selection corresponding the VF and a PF it is associated with selected at the hypervisor host side by Nova.

Some practical challenges are as follows:

  • If PCIe is used to access the NIC at a SmartNIC DPU host, PCI addresses seen by the hypervisor host and the SmartNIC DPU host differ as they have their own root complexes and see isolated PCIe topologies (while communicating with the same NIC). Therefore, PCI addresses seen from the SmartNIC DPU host side have no meaning at the hypervisor host side;

  • A NIC Switch is not exposed to PFs at the hypervisor host side. Meanwhile, PF representor numbers are tied to a particular NIC Switch (to a controller number) at the SmartNIC DPU host side. There may be multiple controllers per SmartNIC DPU host in a general case so simply passing PF logical numbers from the host side is not enough to identify a PF representor at the transport node side;

  • VF logical numbering seen via devlink port attributes depends on a device driver implementation at the hypervisor host and does not always match what is seen at the SmartNIC DPU host via the NIC Switch. It is possible to retrieve VF logical numbers tied to PFs at the hypervisor side based on the VF PCI address assignment process governed by the PCI SIG SR-IOV specification;

  • During port binding, the OVN mechanism driver needs to handle the vnic type VNIC_REMOTE_MANAGED which it currently does not.

The following diagram illustrates various components that play a role in the context of this specification:

                       ┌────────────────────────────────────┐
                       │  Hypervisor                        │    LoM Ports
                       │  ┌───────────┐       ┌───────────┐ │   (on-board,
                       │  │ Instance  │       │  Nova     │ ├──┐ optional)
                       │  │ (QEMU)    │       │ Compute   │ │  ├─────────┐
                       │  │           │       │           │ ├──┘         │
                       │  └───────────┘       └───────────┘ │            │
                       │                                    │            │
                       └────────────────┬─┬───────┬─┬──┬────┘            │
                                        │ │       │ │  │                 │
                                        │ │       │ │  │ Control Traffic │
                           Instance VF  │ │       │ │  │ PF associated   │
                                        │ │       │ │  │ with an uplink  │
                                        │ │       │ │  │ port or a VF.   │
                                        │ │       │ │  │ (used to replace│
                                        │ │       │ │  │  LoM)           │
   ┌────────────────────────────────────┼─┼───────┼─┼──┼─┐               │
   │   SmartNIC DPU Board               │ │       │ │  │ │               │
   │   (transport node)                 │ │       │ │  │ │               │
   │  ┌──────────────┐ Control traffic  │ │       │ │  │ │               │
   │  │   App. CPU   │ via PFs or VFs  ┌┴─┴───────┴─┴┐ │ │               │
   │  ├──────────────┤  (DC Fabric)    │             │ │ │               │
   │  │ovn-controller├─────────────────┼─┐           │ │ │               │
   │  ├──────────────┤                 │ │           │ │ │               │
   │  │ovs-vswitchd  │     Port        │ │NIC Switch │ │ │               │
   │  ├──────────────┤   Representors  │ │  ASIC     │ │ │               │
   │  │    br-int    ├─────────────────┤ │           │ │ │               │
   │  │              ├─────────────────┤ │           │ │ │               │
   │  └──────────────┘                 │ │           │ │ │               │
   │                                   │ │           │ │ │               │
   │                                   └─┼───┬─┬─────┘ │ │               │
 ┌─┴──────┐Initial NIC Switch            │   │ │       │ │               │
─┤OOB Port│configuration is done via     │   │ │uplink │ │               │
 └─┬──────┘the OOB port to create        │   │ │       │ │               │
   │       ports for control traffic.    │   │ │       │ │               │
   └─────────────────────────────────────┼───┼─┼───────┼─┘               │
                                         │   │ │       │                 │
                                      ┌──┼───┴─┴───────┼────────┐        │
                                      │  │             │        │        │
                                      │  │   DC Fabric ├────────┼────────┘
                                      │  │             │        │
                                      └──┼─────────────┼────────┘
                                         │             │
                                         │         ┌───┴──────┐
                                         │         │          │
                                     ┌───▼──┐  ┌───▼───┐ ┌────▼────┐
                                     │OVN SB│  │Neutron│ │Placement│
                                     └──────┘  │Server │ │         │
                                               └───────┘ └─────────┘

Proposed Change

Identifying a SmartNIC DPU Host and VF Representor

The related Nova specification [7] introduces add-in-card serial number collection via PCIe VPD and forwarding of that information in port updates to Neutron, therefore, it becomes possible to match that against what is seen at the SmartNIC DPU host side. Those serial numbers are unique, read-only, and assigned at the manufacturing time (per the PCI and PCIe specs). The board serial number can be stored in the OVN SB database as an external-id for a particular OVN chassis:

external_ids:ovn-cms-options='card-serial-number=UNIQUEBOARDSERIAL'

Note that there can be other means of accessing the NIC from the SmartNIC DPU host side (e.g. platform devices or other I/O types) so querying the serial number can be done either by extracting PCIe VPD via sysfs or by using devlink-info API and getting board.serial_number (which does not depend on a particular I/O interconnect type). However, this is a concern for OVN itself since Neutron does not have any agents running at the SmartNIC DPU side - it only needs to look up a chassis by a card-serial-number regardless of how it got collected at the SmartNIC DPU host.

The responsibility of placing it there can be given to:

  • A deployer who will be responsible of configuring the ovn-controller agent to be responsible for a particular transport node;

  • ovn-controller itself based on some default behavior of looking up a card serial number based on the switchdev-capable devices available on the SmartNIC DPU host.

Regardless of the chosen method of populating this value, Neutron will then be able to lookup which OVN chassis should handle representor plugging and flow programming for a particular port update that has a card serial included.

VF logical numbers are relevant in the context of a particular PF. In turn, PF logical numbers are tied to a particular controller. Since a NIC Switch is not exposed to the hypervisor host, it cannot determine the controller logical number as visible by the SmartNIC DPU host and while PF numbers could be inferred indirectly from their PCI address function numbers, this information is not enough to identify a PF representor at the SmartNIC DPU host side. To address that, a PF MAC seen by the hypervisor host can be passed to the SmartNIC DPU host. While port representors have a different MAC address from the port they represent, it is possible to use generic in-kernel API to retrieve a MAC address of a function via its representer (as of kernel 5.9 [12], subject to the switchdev-aware device driver support, for example [13]). While Neutron will not be responsible for doing this lookup (OVN will), it needs to accept and forward this information to OVN.

As a result, the binding:profile attribute for a port updated by Nova is expected to contain the following information:

binding:profile={
    "pci_vendor_info",
    "pci_slot",
    "physical_network",
    "card_serial_number": "UNIQUEBOARDSERIAL",
    "pf_mac_address": "de:ad:be:ef:ca:fe",
    "vf_num": 42
}

With this information both the right OVN chassis hostname and port representor at the SmartNIC DPU host side can be identified.

The OVN mechanism driver also needs to be changed (where relevant) to use the hostname looked up based on a serial number instead of relying on the hypervisor hostname passed in binding_host_id.

After receiving a port update from Nova, Neutron needs to create relevant Logical Switch Ports in the OVN database (the final approach to communicating the necessary information to OVN is to be determined based on the outcome of ovn-vif discussions in upstream OVN):

Logical_Switch_Port
  options:requested-chassis=fqdn-of-a-smartnic-dpu-host
  options:plug-type=representor
  options:plug-mtu-request=1500
  options:plug:representor:pf-mac=de:ad:be:ef:ca:fe
  options:plug:representor:vf-num=0

If vf-num is not specified, then a PF representor should be plugged instead of a VF representor.

Port Binding and Port Capabilities

Currently, to support hardware offload, the bind_port method of the OVN mechanism driver is able to bind ports of type direct if they also have the “switchdev” capability supplied in their binding:profile attribute:

binding:profile='{"capabilities": ["switchdev"]}'

This capability is discovered by Nova from PCI(e) endpoints via devlink - when an NIC Switch is visible via a PF, the “switchdev” capability is added to both PFs and VFs tracked in Nova. SmartNIC DPUs do not expose the NIC Switch to the hypervisor host, therefore, this capability is not discovered. To address that, the related Nova specification [7] relies on a new port type for the purposes of working with SmartNIC DPUs called VNIC_REMOTE_MANAGED. The VNIC_SMARTNIC VNIC type already present in neutron-lib has been considered but it was found later that the fact that scheduling happens after resource request creation makes it to run into a conflict with the Ironic use-case).

The OVN mechanism driver needs to be extended to also handle VNIC_REMOTE_MANAGED ports. It will allow the port binding code to pick the right code-path and trigger the representor port plugging logic in OVN.

Implementation

Primary Assignees

  • Dmitrii Shcherbakov (lp: ~dmitriis, oftc || libera: dmitriis)

  • Frode Nordahl (lp: ~fnordahl, oftc || libera: fnordahl)

Dependencies

  • OVS [8] and OVN [9] [10] [11] changes.

  • A related Nova specification [7] depends on this spec;

References