NUMA-aware vSwitches

https://blueprints.launchpad.net/nova/+spec/numa-aware-vswitches

vSwitches such as Open vSwitch (with kernel or DPDK datapath), Lagopus, or Contrail DPDK vRouter all have some level of NUMA affinity. This can occur because they use one or more physical network interfaces, usually connected via PCIe, or they use userspace processes that are affined to a given NUMA node. This NUMA affinity is not currently taken into account when creating an instance. This can result in up to a 50% performance drop [1].

Problem description

In I/O (PCIe) based NUMA scheduling, nova tackled the problem of NUMA affinity for PCIe devices when using the libvirt driver. This was done by utilizing the NUMA information for these PCIe devices that was provided by libvirt. However, the use of software switching solutions complicates matters compared to switching that is done in hardware. In these cases, we do not pass through entire PCI devices or virtual functions but rather a VIF object. Nonetheless, the vSwitch will utilize physical hardware to access physical networks and will have an affinity to specific NUMA nodes based on this. This NUMA affinity is not currently accounted for, which can result in cross-NUMA node traffic and significant packet processing performance penalties.

Important

This spec focuses solely on traffic between instances and physical NICs or “physical to virtual (PV)” traffic. Traffic between instances, or “virtual to virtual (VV)” traffic, is not within the scope of this spec.

Worked Example: OVS-DPDK

For those interested in the nitty-gritty of how this works, let’s use an example using one possible vSwitch solution, OVS-DPDK. Consider the following network topology:

../../../_images/numa-aware-vswitches-1.png

In the above, we have two guests, each with a number of interfaces, along with a number of host interfaces. When using OVS-DPDK, the guest interfaces would typically be dpdkvhostuser or dpdkvhostuserclient ports [2], while the physical network interfaces in the host would be dpdk ports [3]. Each of these ports has a number of transmit (Tx) and receive (Rx) queues to actually move packets to and from each interface.

OVS-DPDK uses a number of Poll Mode Driver processes (PMDs) [4] to do the actual packet processing and each queue is assigned to a PMD. If DPDK was compiled with libnumactl and CONFIG_RTE_LIBRTE_VHOST_NUMA=y [5], this assignment is NUMA aware and a queue will be assigned to the same NUMA node as either:

  • The physical network interface the queue is from, for a dpdk interface

  • One of the NUMA nodes associated with the guest, for a dpdkvhostuser or dpdkvhostuserclient interface

If multiple queues exist (i.e. multi-queue), assigning is done per queue, ensuring that an individual PMD does not become a bottleneck. Where multiple PMDs exist for a given NUMA node, as is often the case, assigning is done in a round-robin fashion. If there is no local (in the sense of NUMA locality) PMD to service the queue, the queue will be assigned to a remote PMD.
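
The assignment rules described above can be sketched as a simplified model. This is not the OVS-DPDK implementation: the function name, the tuple layout, and the choice to round-robin over all PMDs when no local PMD exists are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import cycle


def assign_queues(queues, pmds):
    """Assign each queue to a PMD, preferring PMDs on the queue's NUMA node.

    queues: list of (queue_name, numa_node) tuples
    pmds:   non-empty list of (pmd_name, numa_node) tuples
    Returns a dict mapping queue_name -> (pmd_name, is_local).
    """
    by_node = defaultdict(list)
    for pmd_name, node in pmds:
        by_node[node].append(pmd_name)

    # One round-robin iterator per NUMA node, plus one over all PMDs that
    # is only used when a queue has no NUMA-local PMD (an assumption).
    local_rr = {node: cycle(names) for node, names in by_node.items()}
    remote_rr = cycle([name for name, _ in pmds])

    assignment = {}
    for queue_name, node in queues:
        if node in local_rr:
            assignment[queue_name] = (next(local_rr[node]), True)
        else:
            # No local PMD: fall back to a remote one (cross-NUMA traffic).
            assignment[queue_name] = (next(remote_rr), False)
    return assignment


# Two PMDs on node 0 only; the vhost-user queue lives on node 1.
pmds = [("pmd0", 0), ("pmd1", 0)]
queues = [("dpdk0-rxq0", 0), ("dpdk0-rxq1", 0), ("vhu0-rxq0", 1)]
result = assign_queues(queues, pmds)
```

In this hypothetical layout the two dpdk queues are spread across the local PMDs, while the guest queue on node 1 ends up on a remote PMD, which is exactly the situation this spec aims to avoid by placing the guest differently.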

So, let’s further build upon the above example and add this NUMA affinity into the mix. We will say there are two NUMA nodes and both the guests and the NICs are split between these nodes. This gives us something like so:

../../../_images/numa-aware-vswitches-2.png

What we want to avoid is cross-NUMA node traffic. We’re not particularly concerned with VM <-> VM traffic (more on this in a bit), so the problematic traffic would probably look something like this:

../../../_images/numa-aware-vswitches-3.png

Note

In OVS-DPDK, egress traffic (VM -> Phy) is not particularly costly as the PMD dequeuing packets from the guest queue will be located on the same NUMA node as the guest queue itself. Meanwhile, the enqueue onto the physical network interface’s queue is efficient thanks to DMA. However, the return path is much more costly as the enqueue to the guest’s queue cannot benefit from DMA.

This is the root cause of the issue: clearly the guest should have been put on the same NUMA node as the NIC it’s using, as seen below. This is what we wish to resolve with this spec.

../../../_images/numa-aware-vswitches-4.png

Use Cases

  • As a user, I want to ensure I’m getting the maximum benefit from my highly tuned, high-performance vSwitch solution

  • As an operator, I don’t want to revert to custom hacks like dynamically adjusting vcpu_pin_set just so I can deliver the performance my customers demand.

Proposed change

As we want to provide a generic solution that works for multiple network backends, the solution should be built upon neutron and its concepts. neutron does not expose things like Open vSwitch’s bridges, but rather provides network objects like networks, subnets, and ports. Ports aren’t particularly helpful to us as we’re not attaching a given port directly to the guest. Instead, we will utilize networks or, more specifically, networks that utilize physical NICs. There are two types of network technology in play: flat and VLAN, a.k.a. L2 physical or non-tunneled networks, and GRE or VXLAN, a.k.a. L3 tunneled networks.

Note

Other mechanisms, such as OPLEX (Cisco proprietary), are supported in addition to VLAN, VXLAN et al. through the use of non-default Neutron backends. These are considered out-of-scope for this spec. We also consider local, which does not provide external network connectivity, out-of-scope.

Note

The network technologies discussed here are distinct from the network types neutron offers, such as provider networks [6] [7] and tenant networks [8] [9], and are different again from the architectures available, such as pre-created networking or self-serviced networking. The network types and architectures focus more on who can configure a given network (users for tenant networks, admin for both) and where that configuration is found (for provider networks, generally configuration files, but an admin can explicitly override this). The table below illustrates this.

Common name        User-configurable?   L2 (non-tunneled)   L3 (tunneled)
Provider network   No                   Yes                 No
Tenant network     Yes                  Yes (not common)    Yes (common)

This entire spec focuses on the technologies as these are what determine how traffic will egress from a compute node, which is the primary concern here.

Physical networks

Physical networks, or physnets, are identified through the use of an arbitrary label. In the case of the Open vSwitch (OVS) ML2 driver, these labels are mapped to a given OVS bridge containing the physical interfaces using the [ovs] bridge_mappings neutron configuration option. For example:

openvswitch_agent.ini
[ovs]
bridge_mappings = provider:br-provider

This will map the physnet provider to an OVS bridge br-provider. It is expected that this bridge will contain a logical interface (you can use bonded NICs to provide failover). A similar configuration option exists for the Linux Bridge ML2 driver: [linux_bridge] physical_interface_mappings.

Tunneled networks

Networks with a tunnel overlay, or tunneled networks, may also provide external network connectivity. There can be many tunneled networks but only one logical interface (you can use bonded NICs to provide failover) on a host should be handling traffic for these networks. This interface is configured using the [ovs] local_ip neutron configuration option. For example:

openvswitch_agent.ini
[ovs]
local_ip = OVERLAY_INTERFACE_IP_ADDRESS

This will result in all VXLAN or GRE traffic using the interface whose IP corresponds to OVERLAY_INTERFACE_IP_ADDRESS. A similar configuration option exists for the Linux Bridge ML2 driver for VXLAN traffic: [vxlan] local_ip. This driver does not support GRE traffic.

It is possible for both physical and tunneled networks to be used on the same host. Given all of this, we propose the following changes.

Changes

Configuration options

We propose adding a new configuration option and multiple dynamically-generated configuration groups.

  • The [neutron] physnets configuration option will list all physnets for which you wish to provide NUMA affinity.

  • The [neutron_tunnel] configuration group will allow configuration of the tunneled networks. Only one configuration group is required for these since all tunneled networks must share a logical interface. This group will contain a single configuration option, numa_nodes, which lists the host NUMA node(s) to which tunneled networks are affined.

  • Multiple [neutron_physnet_$PHYSNET] configuration groups, one per $PHYSNET in [neutron] physnets, will allow configuration of these physnets. Each of these configuration groups will contain a single configuration option, numa_nodes, which lists the host NUMA node(s) to which networks using this physnet are affined.

The groups will all be generated dynamically, which is required as the values of $PHYSNET in [neutron_physnet_$PHYSNET] are arbitrary and can only be identified from the corresponding values in [neutron] physnets.

This will result in configuration files like the below:

nova.conf
 [neutron]
 physnets = physnet0,physnet1

 [neutron_physnet_physnet0]
 numa_nodes = 0

 [neutron_physnet_physnet1]
 numa_nodes = 0,1

 [neutron_tunnel]
 numa_nodes = 1

where:

[neutron] physnets

A list of strings corresponding to the names of all neutron provider network physnets configured on this host.

Note

Using a combination of neutron’s hierarchical port binding and multiprovider network features, it is possible for neutron to dynamically generate network segments with a given physnet. This is considered out of scope for this feature.

[neutron_physnet_$PHYSNET]

A section of opts corresponding to one of the physnets defined in [neutron] physnets. This in turn has the following keys:

numa_nodes

A list of integers of NUMA nodes associated with this backend. It is defined as a list to cater for cross NUMA bonds and multipath routing. If this is empty, the physnet has no NUMA affinity assigned to it.

Note

A smart enough vSwitch with an active-active cross-NUMA node bond could use the NUMA affinity of the VM interface as an input to the hash for selecting the bond peer, to ensure no cross-NUMA traffic for inter-host traffic. Alternatively, a dumb one could have os-vif hardcode it using an OpenFlow multipath action and some other Open vSwitch fanciness to set the MAC affinity to the NUMA-local bond peer. This is outside of the scope of this spec.

[neutron_tunnel]

The same as [neutron_physnet_$PHYSNET] but for the interface used by tunnel networks.

As noted previously, a host may use physnet networks, tunneled networks or a combination of both. As a result, not all configuration values may be specified on a given host.
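
To make the shape of this configuration concrete, here is a minimal sketch of how the sample nova.conf above could be turned into a physnet-to-NUMA-node mapping. It uses the standard library's configparser as a stand-in for nova's real option handling (nova uses oslo.config, and the groups would be registered dynamically); the helper names are hypothetical.

```python
import configparser

SAMPLE = """
[neutron]
physnets = physnet0,physnet1

[neutron_physnet_physnet0]
numa_nodes = 0

[neutron_physnet_physnet1]
numa_nodes = 0,1

[neutron_tunnel]
numa_nodes = 1
"""


def _parse_nodes(raw):
    """Turn a comma-separated string such as '0,1' into a sorted int list."""
    return sorted({int(item) for item in raw.split(",") if item.strip()})


def load_network_affinity(text):
    """Return (physnet -> NUMA nodes, tunnel NUMA nodes) from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    physnets = {}
    names = parser.get("neutron", "physnets", fallback="")
    for name in (n.strip() for n in names.split(",") if n.strip()):
        group = "neutron_physnet_" + name
        # A listed physnet with no group (or an empty list) has no affinity.
        raw = parser.get(group, "numa_nodes", fallback="")
        physnets[name] = _parse_nodes(raw)

    # The tunnel group is optional: hosts may only use physnet networks.
    tunnel = _parse_nodes(
        parser.get("neutron_tunnel", "numa_nodes", fallback=""))
    return physnets, tunnel


physnets, tunnel = load_network_affinity(SAMPLE)
```

Because every lookup falls back to an empty value, a host that defines only physnets, only a tunnel group, or neither still parses cleanly and simply reports no affinity for the missing parts.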

This configuration will be generated by an orchestration tool at deployment time in the same way that general host network configuration is generated. This is because identifying the NUMA affinity for an arbitrary network is a difficult problem that grows in complexity with the design of the networks and VNFs. The orchestration tool is responsible for configuring how physical NICs are used by the various neutron networks and with only a little extra work can extend this to include NUMA affinity (for example, by combining information from tools like ethtool with information from sysfs).
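
As a rough illustration of the sysfs part of that work, the sketch below reads the numa_node attribute that the Linux kernel exposes for a network interface's underlying device and combines the values for a set of bonded NICs. The function names and the sysfs_root parameter are hypothetical, and a real deployment tool would need more care: for example, a NIC bound to a DPDK-compatible driver may no longer appear under /sys/class/net and would have to be looked up by PCI address instead.

```python
import os
import tempfile


def nic_numa_node(ifname, sysfs_root="/sys"):
    """Return the NUMA node of a network interface, or None if unknown."""
    path = os.path.join(
        sysfs_root, "class", "net", ifname, "device", "numa_node")
    try:
        with open(path) as handle:
            node = int(handle.read().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when no affinity information is available.
    return node if node >= 0 else None


def bond_numa_nodes(ifnames, sysfs_root="/sys"):
    """Return the sorted, combined NUMA nodes of several (bonded) NICs."""
    nodes = {nic_numa_node(name, sysfs_root) for name in ifnames}
    nodes.discard(None)
    return sorted(nodes)


# Demonstrate against a throwaway directory that mimics the sysfs layout.
with tempfile.TemporaryDirectory() as root:
    for name, node in (("eth0", "0\n"), ("eth1", "1\n"), ("veth0", "-1\n")):
        device_dir = os.path.join(root, "class", "net", name, "device")
        os.makedirs(device_dir)
        with open(os.path.join(device_dir, "numa_node"), "w") as handle:
            handle.write(node)
    bond_nodes = bond_numa_nodes(["eth0", "eth1"], sysfs_root=root)
    unknown = nic_numa_node("veth0", sysfs_root=root)
    missing = nic_numa_node("nope", sysfs_root=root)
```

The combined list for a cross-NUMA bond is the kind of value a deployment tool could write into the corresponding numa_nodes option.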

NUMA affinity will be provided for all networks with a physnet and a defined [neutron_physnet_$PHYSNET] group and for all tunneled networks, assuming [neutron_tunnel] is defined. This will be stored as part of a host’s NUMACell object. If a given network is not defined in any such option, no NUMA affinity will be provided. As with other devices that provide NUMA affinity, namely PCI or SR-IOV [10] devices, attempting to boot an instance may fail in a number of cases. These are described later.

Network API Client

The network API client in nova currently provides a method, create_pci_requests_for_sriov_ports [11], which retrieves information about physnets and VNIC types of each port requested when booting an instance for the purpose of creating PCI requests for SR-IOV ports. This will be made more generic and extended to get information about the tunneled status of any requested networks. This information will be stored (but not persisted) in a new object, InstanceNUMANetworks, for later consumption.

Note

As with the existing SR-IOV feature, this will only handle the case where an entire network, as opposed to a segment of the network, is associated with the physnet.
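
The following sketch shows one way the per-network information could be folded into a single summary for the instance. The class here is a plain dataclass standing in for the proposed InstanceNUMANetworks object, and its physnets and tunneled fields are assumed for illustration. The dictionary keys are neutron's provider network attributes; treating only vxlan and gre as tunneled follows the technologies named in this spec.

```python
from dataclasses import dataclass, field

# Network types treated as tunneled in this sketch (per this spec's scope).
TUNNELED_TYPES = frozenset({"vxlan", "gre"})


@dataclass
class InstanceNUMANetworks:
    """Illustrative stand-in; the field layout is an assumption."""
    physnets: set = field(default_factory=set)
    tunneled: bool = False


def summarize_networks(networks):
    """Combine the physnet/tunneled attributes of all requested networks."""
    summary = InstanceNUMANetworks()
    for net in networks:
        physnet = net.get("provider:physical_network")
        if physnet:
            summary.physnets.add(physnet)
        if net.get("provider:network_type") in TUNNELED_TYPES:
            summary.tunneled = True
    return summary


requested = [
    {"provider:physical_network": "physnet0",
     "provider:network_type": "vlan"},
    {"provider:physical_network": None,
     "provider:network_type": "vxlan"},
]
summary = summarize_networks(requested)
```

Only whole-network attributes are consulted, matching the note above that per-segment physnets are not handled.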

Scheduling

In addition to the configuration change, we will need to start including the list of requested physnets and tunneled networks, stored in the new InstanceNUMANetworks object, as a field of the RequestSpec object. This field will not be persisted in the database. We require this information so we can use it to build the NUMA topology of the instance on the host as part of scheduling but it does not need to be stored after this. We also need to extend the limits dictionary returned by the NUMATopologyFilter to include information about the networks. This is necessary so we have something to reference during the claim stage on the compute. In both cases, the cached information will need to be updated in the event of a move operation.

Virtualization driver

Only the libvirt driver currently supports the full breadth of NUMA affinity features required for this feature. While other drivers, notably Hyper-V, do support some NUMA features, these are related to guest NUMA topologies and not placement of vCPU processes across host NUMA nodes. The NUMA fitting code, found in numa_fit_instance_to_host, will be updated to consume a new InstanceNUMANetworks object and use this to determine which host NUMA nodes to use. The InstanceNUMANetworks object will be built from information in the RequestSpec objects or scheduler limits and passed in by the caller of this function.
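
A much-reduced sketch of the network part of that decision is shown below. It is not numa_fit_instance_to_host: it ignores CPU, memory and page-size fitting entirely, and the function name, its arguments, and the rule that each required network must be local to at least one of the chosen host nodes are assumptions for illustration.

```python
from itertools import combinations


def candidate_node_sets(host_nodes, physnet_nodes, tunnel_nodes,
                        wanted_physnets, wanted_tunneled, guest_nodes=1):
    """List host NUMA node combinations that cover the requested networks.

    physnet_nodes: dict of physnet name -> list of affined host NUMA nodes
    tunnel_nodes:  list of host NUMA nodes affined to tunneled networks
    """
    # Networks with no configured affinity impose no constraint.
    required = [set(physnet_nodes[name]) for name in wanted_physnets
                if physnet_nodes.get(name)]
    if wanted_tunneled and tunnel_nodes:
        required.append(set(tunnel_nodes))

    matches = []
    for chosen in combinations(sorted(host_nodes), guest_nodes):
        # Every required network must be local to one of the chosen nodes.
        if all(need & set(chosen) for need in required):
            matches.append(chosen)
    return matches


host_nodes = [0, 1]
physnet_nodes = {"physnet0": [0], "physnet1": [0, 1]}
tunnel_nodes = [1]

single = candidate_node_sets(
    host_nodes, physnet_nodes, tunnel_nodes, {"physnet0"}, False)
conflict = candidate_node_sets(
    host_nodes, physnet_nodes, tunnel_nodes, {"physnet0"}, True)
spanning = candidate_node_sets(
    host_nodes, physnet_nodes, tunnel_nodes, {"physnet0"}, True,
    guest_nodes=2)
```

With the sample configuration from earlier, a single-node guest on physnet0 can only land on node 0, a single-node guest that also needs a tunneled network has no valid node, and a two-node guest can span both nodes to satisfy both networks.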

Potential issues

As noted above, this is a rather difficult issue to solve sanely and, as a result, there are a number of situations that need special workarounds. The first arises where physical NICs from different NUMA nodes are bonded. This will not break the world but can only be expected to deliver poor performance. This exact situation can arise in SR-IOV too and generally occurs because:

  • The operator has misconfigured something

  • The operator does not care about optimal performance

  • The operator wants to provide resiliency by spreading work across NICs and cannot or will not use two NICs from the same NUMA node

Given that we can’t determine if this was intentional or not, we have defined numa_nodes as a list rather than a single integer. This will allow us to capture the combined NUMA affinity of the bonded NICs.

The second situation is guests with complex, multi-NIC configurations. Most VNFs utilize multiple interfaces, some of which will be used for VM management interfaces (SSH, logging, SNMP, …) while the remainder will be used for actual data plane interfaces. NUMA affinity doesn’t matter for the former so we can ignore these. However, if the latter ports have different affinities, it can be impossible to determine a natural NUMA affinity. This can be seen below:

../../../_images/numa-aware-vswitches-5.png

This exact situation occurs with PCI and SR-IOV devices and here, as with those devices, we expect users to define a multi-node guest topology or to avoid specifying NUMA affinity for the non-data plane interfaces.

The third situation is VM to VM traffic. Current NFV deployments do not generally care about VM to VM traffic as it’s highly unlikely that two related VNFs would ever be intentionally collocated on a given host. As such, VM to VM traffic is considered out of scope for this spec.

The fourth and final situation is interfaces that are attached after creating the instance. Interfaces for which NUMA affinity is required must be requested when creating instances and any interfaces attached after this will not have NUMA affinity applied to them. This is required to limit the scope of this feature and will be resolved in a future spec. In the meantime, a resize operation or similar can be used after attaching new interfaces to gain NUMA affinity for these interfaces.

Alternatives

There are multiple possible solutions to this issue of various levels of complexity.

We could store the static backend-NUMA mapping in neutron instead and pass this over the wire as part of the VIF request. As neutron is responsible for all things network’y, storing this information in nova might appear to break the separation of concerns between the two services. However, this is only partially true. While neutron is responsible for creating network elements such as ports and bridges, it is nova that is responsible for wiring them up to instances. This does not get nova into the business of actually creating the underlying network elements, but rather it will now store additional metadata about said elements. Moving this configuration into neutron would complicate matters for little to no gain.

An even more elaborate alternative would be to dynamically generate this information in neutron, rather than hardcoding it in either nova or neutron. This information could be exposed via a new extension or similar. This would appear to incur the least amount of effort on a deployer’s side and would seem like the most intelligent thing to do. However, this is not perfect either. Implementing this functionality would require changes to nearly every backend in neutron and may not even be possible for some backends and some particularly complex configurations. As noted previously, we have access to information on the overall network design at deployment time and discarding that in favour of dynamic generation introduces unnecessary magic to the formula.

We could go all in on placement from a neutron side and start storing information about the various backends from neutron. Nova would then be able to use this information in order to make its scheduling decisions. This is technically the best solution as it solves both this issue and a number of other issues, such as bandwidth-aware scheduling and discovery of the types of networks available on a compute node before landing on that node. However, we don’t currently have a clear story for modelling NUMA resources in placement [12]. Given the tenant dataplane performance issues being faced right now, this seems untenable. However, that’s not to say we can’t migrate some of this information to placement in the future. This was discussed during the Rocky PTG in Dublin [13].

OpenDaylight (ODL) uses per host configuration stored in OVS (like bridge mappings), which are then read by ODL and populated into Neutron DB as pseudo agents. We could store the NUMA:physnet affinity per network on a per-host basis in OVS, eventually propagating this info to the VIF request for nova. However, this isn’t feasible for tenant networks as the NUMA topology of the host is not visible to a tenant. In addition, this relies on specific features of the network backend which might not be available in other backends.

Finally, and this is the nuclear option, we could simply embrace the fact that scheduling decisions in the NFV space are tough to do automatically and it might be better to delegate this to another system. To this end, we could simply expose a “deploy on this host NUMA node” knob and let users at it. This would simplify our lives and satisfy the requests of these users, but it is distinctly uncloudy and unlikely to gain any traction.

Data model impact

A new object, NUMANetworkInfo, will be added to store information about networks associated with a host NUMA node. This will be mapped to a parent NUMACell object using a network_info field and will be used to compare against the requested networks for one or more instances during scheduling and claiming.

A new object, InstanceNUMANetworks, will be added to store information about whether the networks requested for or attached to an instance were physnets or tunneled networks. It will be populated by a more generic version of the create_pci_requests_for_sriov_ports function. It will not be persisted and is merely stored as a way to get this information from the API to the scheduler for use in filtering.

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

There will be a negligible increase in the time taken to schedule an instance on account of the additional NUMA affinity checks. However, network performance of some vSwitch solutions (OVS-DPDK, for example) will increase by up to 100%.

Other deployer impact

Deployment tooling will need to be enhanced to set the new configuration options and groups at deployment time.

Developer impact

This change will require changes to the scheduler but only the libvirt driver will provide the information necessary to properly schedule these requests. Additional drivers do not currently provide sufficient awareness of host NUMA information for this change to be incorporated in a meaningful manner. If this changes in the future, the drivers will need to be extended to consume the configuration option and request spec attributes.

Upgrade impact

The new configuration options will need to be created in each nova-compute service’s nova.conf file. In addition, any database migrations will need to be run.

Implementation

Assignee(s)

Primary assignee:

stephenfinucane

Other contributors:

sean-k-mooney

Work Items

  1. Add support for the new configuration option and dynamic configuration groups.

  2. Create a NUMANetworkInfo object. This will be used to store the NUMA-network configuration provided in nova.conf, namely, if any physnets or tunneled networks are affinitized to this cell.

  3. Add a network_info field to the NUMACell object, which will include network information about the host node in the form of a NUMANetworkInfo object. This is required so nova-scheduler and nova-conductor can access this information during scheduling and claiming, respectively.

  4. Create an InstanceNUMANetworks object. This will be used to store the combined physnets or tunneled attributes of the networks requested for the instance.

  5. Make create_pci_requests_for_sriov_ports more generic. This can be used to fetch most of the information we need to populate the new InstanceNUMANetworks object.

  6. Add a numa_networks field to the RequestSpec object, which will be the InstanceNUMANetworks object generated from the call to create_pci_requests_for_sriov_ports by the API service. This information will be used by the NUMATopologyFilter to filter out hosts that could not satisfy the request. This field will not be persisted as it is only needed by the scheduler when scheduling new instances.

  7. Modify the numa_fit_instance_to_host function to accept a new numa_networks argument, which will be an instance of InstanceNUMANetworks, and consider these when building NUMA topologies.

  8. Modify the NUMATopologyFilter to populate a numa_networks field for the limits dictionary. This will be populated with an instance of InstanceNUMANetworks built using the value of RequestSpec.numa_networks. Pass the InstanceNUMANetworks object to numa_fit_instance_to_host.

  9. Modify the serialized models stored in Instance.info_cache.network_info to include something akin to the physnet and tunneled fields for each network. This is required for claims during any move operation.

  10. Modify nova-conductor to build an InstanceNUMANetworks object from the Instance.info_cache.network_info field, which can be used to populate the RequestSpec object for any move operation.

  11. Modify how claims are done on the compute to account for network info from the limits (passed down from the scheduler) when building the instance’s NUMA topology.

Dependencies

None. numa-aware-live-migration is required to support live migration of instances with any CPU pinning and NUMA topology functionality enabled. However, given that live migration is currently broken for all NUMA use cases, the lack of live migration support for this particular NUMA’y use case should not be considered a blocker.

Testing

Like anything else that utilizes real hardware, this cannot be tested in the upstream CI. Instead, we will need to rely on unit tests, functional tests (with hardware mocked out), and downstream or third-party CIs to provide testing.

Documentation Impact

The feature will need to be documented in full, though we do get the configuration documentation for free (thanks oslo_config.sphinxext).

History

Revisions

Release Name   Description
Rocky          Introduced