NUMA Topology with Resource Providers

https://blueprints.launchpad.net/nova/+spec/numa-topology-with-rps

Now that Nested Resource Providers is a thing in both Placement API and Nova compute nodes, we could use the Resource Providers tree for explaining the relationship between a root Resource Provider (root RP) ie. a compute node, and one or more Non-Uniform Memory Access (NUMA) nodes (aka. cells), each of them having separate resources, like memory or PCI devices.

Note

This spec only targets to model resource capabilities for NUMA nodes in some general and quite abstract manner. We won’t address in this spec how we should model NUMA-affinized hardware like PCI devices or GPUs and will discuss these relationships in a later spec.

Problem description

The NUMATopologyFilter checks a number of resources, including emulator threads policies, CPU pinned instances and memory page sizes. Additionally, it does two different verifications :

  • whether some host can fit the query because it has enough capacity

  • which resource(s) should be used for this query (eg. which pCPUs or NUMA node)

With NUMA topologies modeled as Placement resources, those two questions could be answered by the Placement service as potential allocation candidates that the filter would only be responsible for choosing between them in some very specific cases (eg. PCI device NUMA affinity, CPU pinning and NUMA anti-affinity).

Accordingly, we could model the host memory and the CPU topologies as a set of resource providers arranged in a tree, and just directly allocate resources for a specific instance from a resource provider subtree representing a NUMA node and its resources.

That said, non resource-related features (like choosing a specific CPU pin within a NUMA node for a vCPU) would still be only done by the virt driver, and are not covered by this spec.

Use Cases

Consider the following NUMA topology for a “2-NUMA nodes, 4 cores” host with no Hyper-Threading:

+--------------------------------------+
|                  CN1                 |
+-+---------------+--+---------------+-+
  |     NUMA1     |  |     NUMA2     |
  +-+----+-+----+-+  +-+----+-+----+-+
    |CPU1| |CPU2|      |CPU3| |CPU4|
    +----+ +----+      +----+ +----+

Here, CPU1 and CPU2 would share the same memory through a common memory controller, while CPU3 and CPU4 would share their own memory.

Ideally, applications that require low-latency memory access from multiple vCPUs on the same instance (for parallel computing reasons) would like to ensure that those CPU resources are provided by the same NUMA node, or some performance penalties would occur (if your application is CPU-bound or I/O-bound of course). For the moment, if you’re an operator, you can use flavor extra specs to indicate a desired guest NUMA topology for your instance like:

$ openstack flavor set FLAVOR-NAME \
    --property hw:numa_nodes=FLAVOR-NODES \
    --property hw:numa_cpus.N=FLAVOR-CORES \
    --property hw:numa_mem.N=FLAVOR-MEMORY

See all the NUMA possible extra specs for a flavor.

Note

The example above is only needed when you want to not evenly divide your virtual CPUs and memory between NUMA nodes, of course.

Proposed change

Given there are a lot of NUMA concerns, let’s do an iterative approach about the model we agree.

NUMA nodes being nested Resource Providers

Given virt drivers can amend a provider tree given by the compute node ResourceTracker, then the libvirt driver could create child providers for each of the 2 sockets representing separate NUMA node.

Since CPU resources are tied to a specific NUMA node, it makes sense to model the corresponding resource classes as part of the child NUMA Resource Providers. In order to facilitate querying NUMA resources, we propose to decorate the NUMA child resource providers with a specific trait named HW_NUMA_ROOT that would be on each NUMA node. That would help to know which hosts would be NUMA-aware and which others are not.

Memory is a bit tougher to represent. The granularity of a NUMA node having an amount of attached memory is somehow a first approach but we’re missing the point that the smallest allocatable unit you can assign with Nova is really a page size. Accordingly, we should rather model our NUMA subtree with children Resource Providers that represent the smallest unit of memory you can allocate, ie. a page size. Since a pagesize is not a consumable amount but rather a qualitative information that helps us to allocate MEMORY_MB resources, we propose three traits :

  • MEMORY_PAGE_SIZE_SMALL and MEMORY_PAGE_SIZE_LARGE would allow us to know whether the memory page size is default or optionally configured.

  • CUSTOM_MEMORY_PAGE_SIZE_<X> where <X> is an integer would allow us to know the size of the page in KB. To make it clear, even if the trait is a custom one, it’s important to have a naming convention for it so the scheduler could ask about page sizes without knowing all the traits.

                                +-------------------------------+
                                |  <CN_NAME>                    |
                                |  DISK_GB: 5                   |
                                +-------------------------------+
                                |  (no specific traits)         |
                                +--+---------------------------++
                                   |                           |
                                   |                           |
            +-------------------------+                   +--------------------------+
            | <NUMA_NODE_O>           |                   | <NUMA_NODE_1>            |
            | VCPU: 8                 |                   | VCPU: 8                  |
            | PCPU: 16                |                   | PCPU: 8                  |
            +-------------------------+                   +--------------------------+
            | HW_NUMA_ROOT            |                   | HW_NUMA_ROOT             |
            +-------------------+-----+                   +--------------------------+
              /                 |    \                                          /+\
              +                 |     \_____________________________          .......
              |                 |                                   \
+-------------+-----------+   +-+--------------------------+   +-------------------------------+
| <RP_UUID>               |   | <RP_UUID>                  |   | <RP_UUID>                     |
| MEMORY_MB: 1024         |   | MEMORY_MB: 1024            |   |MEMORY_MB: 10240               |
| step_size=1             |   | step_size=2                |   |step_size=1024                 |
+-------------------------+   +----------------------------+   +-------------------------------+
|MEMORY_PAGE_SIZE_SMALL   |   |MEMORY_PAGE_SIZE_LARGE      |   |MEMORY_PAGE_SIZE_LARGE         |
|CUSTOM_MEMORY_PAGE_SIZE_4|   |CUSTOM_MEMORY_PAGE_SIZE_2048|   |CUSTOM_MEMORY_PAGE_SIZE_1048576|
+-------------------------+   +----------------------------+   +-------------------------------+

Note

As we said above, we don’t want to support children PCI devices for Ussuri at the moment. Other current children RPs for a root compute node, like ones for VGPU resources or bandwidth resources would still have their parent be the compute node.

NUMA RP

Resource Provider names for NUMA nodes shall follow a convention of nodename_NUMA# where nodename would be the hypervisor hostname (given by the virt driver) and where NUMA# would literally be a string made of ‘NUMA’ postfixed by the NUMA cell ID which is provided by the virt driver.

Each NUMA node would be then a child Resource Provider, having two resource classes :

  • VCPU: for telling how many virtual cores (not able to be pinned) the NUMA node has.

  • PCPU: for telling how many possible pinned cores the NUMA node has.

A specific trait should be decorating it as we explained : HW_NUMA_ROOT.

Memory pagesize RP

Each NUMA RP should have child RPs for each possible memory page size per host, and having a single resource class :

  • MEMORY_MB: for telling how much memory the NUMA node has in that specific page size.

This RP would be decorated by two traits :

  • either MEMORY_PAGE_SIZE_SMALL (default if not configured) or MEMORY_PAGE_SIZE_LARGE (if large pages are configured)

  • the size of the page size : CUSTOM_MEMORY_PAGE_SIZE_# (where # is the size in KB - default to 4 as the kernel defaults to 4KB page sizes)

Compute node RP

The root Resource Provider (ie. the compute node) would only provide resources for classes that are not NUMA-related. Existing children RPs for vGPUs or bandwidth-aware resources should still have this parent (until we discuss about NUMA affinity for PCI devices).

Optionally configured NUMA resources

Given there are NUMA workloads but also non-NUMA workloads, it’s also important for operators to just have compute nodes accepting the latter. That said, having the compute node resources to be split between multiple NUMA nodes could be a problem for those non-NUMA workloads if they want to keep the existing behaviour.

For example, say an instance with 2 vCPUs and one host having 2 NUMA nodes but each one only accepting one VCPU, then the Placement API wouldn’t accept that host (given each nested RP only accepts one VCPU). For that reason, we need to have a configuration for saying which resources should be nested. To reinforce the above, that means a host would be either NUMA or non-NUMA, hence non-NUMA workloads being set on a specific NUMA node if host is set so. The proposal we make here will be :

[compute]
enable_numa_reporting_to_placement = <bool> (default None for Ussuri)

For below, we will tell hosts as “NUMA-aware” ones that have this option be True. For hosts that have this option to False they are explicitely asked to have a legacy behaviour and will be called “non-NUMA-aware”.

Depending on the value of the option, Placement would accept or not a host for the according request. The resulting matrix can be:

+----------------------------------------+----------+-----------+----------+
| ``enable_numa_reporting_to_placement`` | ``None`` | ``False`` | ``True`` |
+========================================+==========+===========+==========+
| NUMA-aware flavors                     | Yes      | No        | Yes      |
+----------------------------------------+----------+-----------+----------+
| NUMA-agnostic flavors                  | Yes      | Yes       | No       |
+----------------------------------------+----------+-----------+----------+

where Yes means that there could be allocation candidates from this host, while No means that no allocation candidates will be returned.

In order to distinghish compute nodes that have the False value instead of None, we will decorate the former with a specific trait name HW_NON_NUMA. Accordingly, we will query Placement by adding this forbidden trait for not getting nodes that operators explicitly don’t want them to support NUMA-aware flavors.

Note

By default, the value for that configuration option will be None for upgrade reasons. By the Ussuri timeframe, operators will have to decide which hosts they want to support NUMA-aware instances and which should be dedicated for ‘non-NUMA-aware’ instances. A nova-status pre-upgrade check command will be provided that will warn them to decide before upgrading to Victoria, if the default value is about to change as we could decide later in this cycle. Once we stop supporting None (in Victoria or later), the HW_NON_NUMA trait would no longer be needed so we could stop querying it.

Note

Since we allow a transition period for helping the operators to decide, we will also make clear that this is a one-way change and that we won’t provide a backwards support for turning a NUMA-aware host into a non-NUMA-aware host.

See the Upgrade impact section for further details.

Note

Since the discovery of a NUMA topology is made by virt drivers, it makes the population of those nested Resource Providers to necessarly be done by each virt driver. Consequently, while the above configuration option is said to be generic, the use of this option for populating the Resource Providers tree will only be done by the virt drivers. Of course, a shared module could be imagined for the sake of consistency between drivers, but this is an implementation detail.

The very simple case: I don’t care about a NUMA-aware instance

For flavors just asking for, say, vCPUs and memory without asking them to be NUMA-aware, then we will make a single Placement call asking to not land them on a NUMA-aware host:

resources=VCPU:<X>,MEMORY_MB=<Y>
&required=!HW_NUMA_ROOT

In this case, even if NUMA-aware hosts have enough resources for this query, the Placement API won’t provide them but only non-NUMA-aware ones (given the forbidden HW_NUMA_ROOT trait). We’re giving the possibility to the operator to shard their clouds between NUMA-aware hosts and non-NUMA-aware hosts but that’s not really changing the current behaviour as of now where operators create aggregates to make sure non-NUMA-aware instances can’t land on NUMA-aware hosts.

See the Upgrade impact session for rolling upgrade situations where clouds are partially upgraded to Ussuri and where only a very few nodes are reshaped.

Asking for NUMA-aware vCPUs

As NUMA-aware hosts have a specific topology with memory being in a grand-child RP, we basically need to ensure we can translate the existing expressiveness in the flavor extra specs into a Placement allocation candidates query that asks for parenting between the NUMA RP containing the VCPU resources and the memory pagesize RP containing the MEMORY_MB resources.

Accordingly, here are some examples:

  • for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=2:

    resources_MEM1=MEMORY_MB:4096
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:4
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    &resources_MEM2=MEMORY_MB:4096
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC2=VCPU:4
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_MEM2,_PROC2,_NUMA2
    &group_policy=none
    

Note

We use none as a value for group_policy which means that in this example, allocation candidates can all be from PROC1 group meaning that we defeat the purpose of having the resources separated into different NUMA nodes (which is the purpose of hw:numa_nodes=2). This is OK as we will also modify the NUMATopologyFilter to only accept allocation candidates for a host that are in different NUMA nodes. It will probably be implemented in the nova.virt.hardware module but that’s an implementation detail.

  • for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=1:

    resources_MEM1=MEMORY_MB:8192
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:8
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    
  • for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=2&hw:numa_cpus.0=0,1&hw:numa_cpus.1=2,3,4,5,6,7:

    resources_MEM1=MEMORY_MB:4096
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:2
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    &resources_MEM2=MEMORY_MB:4096
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC2=VCPU:6
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_MEM2,_PROC2,_NUMA2
    &group_policy=none
    
  • for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=2&hw:numa_cpus.0=0,1&hw:numa_mem.0=1024 &hw:numa_cpus.1=2,3,4,5,6,7&hw:numa_mem.1=7168:

    resources_MEM1=MEMORY_MB:1024
    &required_MEM1=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC1=VCPU:2
    &required_NUMA1=HW_NUMA_ROOT
    &same_subtree=_MEM1,_PROC1,_NUMA1
    &resources_MEM2=MEMORY_MB:7168
    &required_MEM2=MEMORY_PAGE_SIZE_SMALL
    &resources_PROC2=VCPU:6
    &required_NUMA2=HW_NUMA_ROOT
    &same_subtree=_MEM2,_PROC2,_NUMA2
    &group_policy=none
    

As you can understand, the VCPU and MEMORY_MB values will be a result of the division of respectively the flavored vCPUs and the flavored memory by the value of hw:numa_nodes (which is actually already calculated and provided as NUMATopology object information in the RequestSpec object).

Note

The translation mechanism from a flavor-based request into Placement query will be handled by the scheduler service.

Note

Since memory is provided as grand-child, we need to always ask for a MEMORY_PAGE_SIZE_SMALL which is the default.

Asking for specific memory page sizes

Operators defining a flavor of 2 vCPUs, 4GB of RAM and hw:mem_page_size=2MB,hw:numa_nodes=2 will see that the Placement query will become:

resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_MEM1=CUSTOM_MEMORY_PAGE_SIZE_2048
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_MEM2=CUSTOM_MEMORY_PAGE_SIZE_2048
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none

If you only want large page size support without really specifying which size (eg. by specifying hw:mem_page_size=large instead of, say, 2MB), then the above same request for large pages would translate into:

resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_MEM1=MEMORY_PAGE_SIZE_LARGE
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_MEM2=MEMORY_PAGE_SIZE_LARGE
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none

Asking the same with hw:mem_page_size=small would translate into:

resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_MEM1=MEMORY_PAGE_SIZE_SMALL
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_MEM2=MEMORY_PAGE_SIZE_SMALL
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none

And eventually, asking with hw:mem_page_size=any would mean:

resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none

Note

As we said for vCPUs, given we query with group_policy=none, allocation candidates would be within the same NUMA node but that’s fine since we also said that the scheduler filter would then no agree with them if there is a hw:numa_nodes=X there.

The fallback case for NUMA-aware flavors

In the Optionally configured NUMA resources section, we said that we would want to accept NUMA-aware flavors to land on hosts that have the enable_numa_reporting_to_placement option set to None. Since we can’t yet build a OR query for allocation candidates, we propose to make another call to Placement. In this specific call (we name it a fallback call), we want to get all non-reshaped nodes that are not explicitly said to not support NUMA. In this case, the request is fairly trivial since we decorated them with the HW_NON_NUMA trait:

resources=VCPU:<X>,MEMORY_MB=<Y>
&required=!HW_NON_NUMA,!HW_NUMA_ROOT

Then we would get all compute nodes that have the None value ( including nodes that are still running the Train release in a rolling upgrade fashion).

Of course, we would get nodes that could potentially not accept the NUMA-aware flavor but we rely on the NUMATopologyFilter for not selecting them, exactly like what we do in Train.

There is some open question about whether we should do the fallback call only if the NUMA-specific call is not getting candidates or if we should generate the two calls either way and merge the results. The former is better for performance reasons since we avoid a potentially unnecessary call but would generate some potential spread/pack affinity issues. Here we all agree on the fact we can leave the question unresolved for now and defer the resolution to the implementation phase.

Alternatives

Modeling of NUMA resources could be done by using specific NUMA resource classes, like NUMA_VCPU or NUMA_MEMORY_MB that would only be set for children NUMA resource providers, and where VCPU and MEMORY_MB resource classes would only be set on the root Resource Provider (here the compute node).

If the Placement allocations candidates API was also able to provide a way to say ‘you can split the resources between resource providers’, we wouldn’t need to carry a specific configuration option for a long time. All hosts would then be reshaped to be NUMA-aware but then non-NUMA-aware instances could potentially land on those hosts. That wouldn’t change the fact that for optimal capacity, operators need to shard their clouds between NUMA workloads and non-NUMA ones, but from a Placement perspective, all hosts would be equal. This alternative proposal has largely already been discussed in a spec but the outcome consensus was that it was very difficult to implement and potentially not worth the difficulty.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None, flavors won’t need to be modified since we will provide a translation mechanism. That said, we will explicitly explain in the documentation that we won’t support any placement-like extra specs in flavors.

Performance Impact

Only when changing the configuration option to True, a reshape is done.

Other deployer impact

Operators would want to migrate some instances from hosts to anothers before explicitely enabling or disabling NUMA awareness on their nodes since they will have to consider the capacity usage accordingly as they will have to shard their cloud. This being said, this would only be necessary for clouds that weren’t yet already dividing NUMA-aware and non-NUMA-aware workloads between hosts thru aggregates.

Developer impact

None, except virt driver maintainers.

Upgrade impact

As described above, in order to prevent a flavor update during upgrade, we will provide a translation mechanism that will take the existing flavor extra spec properties and transform them into Placement numbered groups query.

Since there will be a configuration option for telling that a host would become NUMA-aware, the corresponding allocations accordingly have to change hence the virt drivers be responsible for providing a reshape mechanism that will eventually call the Placement API /reshaper endpoint when starting the compute service. This reshape implementation will absolutely need to consider the Fast Forward Upgrade (FFU) strategy where all controlplane is down and should possibly document any extra step required for FFU with an eventual removal in a couple of releases once all deployers no longer need this support.

Last but not the least, we will provide a transition period (at least during the Ussuri timeframe) where operators can decide which hosts to dedicate to NUMA-aware workloads. A specific nova-status pre-upgrade check command will warn them to do so before upgrading to Victoria.

Implementation

Assignee(s)

  • bauzas

  • sean-k-mooney

Feature Liaison

bauzas

Work Items

  • libvirt driver passing NUMA topology through update_provider_tree() API

  • Hyper-V driver passing NUMA topology through update_provider_tree() API

  • Possible work on the NUMATopologyFilter to look at the candidates

  • Scheduler translating flavor extra specs for NUMA properties into Placement queries

  • nova-status pre-upgrade check command

Dependencies

None.

Testing

Functional tests and unittests.

Documentation Impact

None.