I/O (PCIe) based NUMA scheduling

https://blueprints.launchpad.net/nova/+spec/input-output-based-numa-scheduling

I/O based NUMA scheduling will add support for intelligent NUMA node placement for guests that have been assigned a host PCI device, avoiding unnecessary memory transactions

Problem description

Currently it is common for virtualisation host platforms to exhibit multi NUMA node characteristics.

An optimal configuration would be where the guests assigned PCI device and RAM allocation are associated with the same NUMA node. This will ensure there is no cross NUMA node memory traffic.

To reach a remote NUMA node the memory request must traverse the inter CPU link and use the remote NUMA nodes associated memory controller to access the remote node. This incurs a latency penalty on remote NUMA node memory access which is not desirable for NFV workloads.

Openstack needs to offer more fine grained control of NUMA configuration in order to deliver higher performing, lower latency guest applications. The default guest placement policy is to use any available pCPU or NUMA node.

Proposed change

Libvirt now provides the numa node a PCI device is associated with, we will use this information to populate the nova DB. For versions of libvirt that do not provide this information we will add a fall back mechanism to query the host for this info.

Logic will be added to nova scheduler to allow it decide on which host is best able satisfy a guests PCI NUMA node requirements.

Logic, similar to what will be implemented in the nova scheduler will be added to the libvirt driver to allow it decide on which NUMA node to place the guest.

Alternatives

Libvirt supports integration with a NUMA daemon (numad) that monitors NUMA topology and usage. It attempts to locate guests for optimum NUMA locality, dynamically adjusting to changing system conditions.

This is insufficient because we need this intelligence in nova for host selection and node deployment.

Data model impact

The PciDevice model will be extended to add a field identifying the NUMA node that PCI device is associated with.

numa_node = Column(Integer, nullable=False, default=”-1”)

A DB migration script will use ALTER_TABLE to add a new column to the pci_devices table in nova DB.

REST API impact

There will be no change to the REST API.

Security impact

This blueprint does not introduce any new security issues.

Notifications impact

This blueprint does not introduce new notifications.

Other end user impact

This blueprint adds no other end user impact.

Performance Impact

The benefits of associating a guests PCI device and RAM allocation with the same NUMA node will provides an optimal configuration that will give improved I/O throughput and reduced memory latencies, compared with the default libvirt guest placement policy.

This feature will add some scheduling overhead, but this overhead will deliver improved performance on the host.

The optimisation described here is dependent on the guest CPU and RAM allocation being associated with the same NUMA node. This feature is described in the “Virt driver guest NUMA node placement & topology” blueprint referenced in the dependency section.

Other deployer impact

To use this feature the deployer must use HW that is capable of reporting numa related info to the OS.

Developer impact

This blueprint will have no developer impact.

Implementation

Assignee(s)

Primary assignee:

James Chapman

Other contributors:

Przemyslaw Czesnowicz Sean Mooney Adrian Hoban

Work Items

  • Add a NUMA node attribute to the pci_device object

  • Use libvirt to discover hosts PCI device NUMA node association

  • Enable nova compute synchronise PCI device NUMA node associations with nova DB

  • Enable libvirt driver configure guests with requested PCI device NUMA node associations

  • Enable the nova scheduler decide on which host is best able to support a guest

  • Enable libvirt driver decide on which NUMA node to place a guest

Dependencies

The blueprint listed below will define a policy used by the scheduler to decide on which host to place a guest. We plan to respect this policy while extending it to add support for a PCI devices NUMA node association.

Virt driver guest NUMA node placement & topology * https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

The blueprint listed below will support use cases requiring SR-IOV NICs to participate in neutron managed networks.

Enable a nova instance to be booted up with neutron SRIOV ports * https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov

Testing

Scenario tests will be added to validate these modifications.

Documentation Impact

This feature will not add a new scheduling filter, but as it depends on the bp mentioned in the dependency section we will need to extend their filter. We will add documentation as required.

References

Support for NUMA and VCPU topology configuration * https://blueprints.launchpad.net/nova/+spec/virt-driver-guest-cpu-memory-placement

Virt driver guest NUMA node placement & topology * https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement

Enable a nova instance to be booted up with neutron SRIOV ports * https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov