NUMA-aware live migration

https://blueprints.launchpad.net/nova/+spec/numa-aware-live-migration

When an instance with NUMA characteristics is live-migrated, those characteristics are not recalculated on the destination compute host. In the CPU pinning case, using the source host’s pin mappings on the destination can lead to multiple instances being pinned to the same pCPUs. In the case of hugepage-backed instances, which are NUMA-localized, an instance needs to have its NUMA mapping recalculated on the destination compute host during a live migration.

Problem description

In the following paragraphs the term NUMA is incorrectly used to signify any guest characteristic that is expressed in the InstanceNUMATopology object, for example CPU pinning and hugepages. CPU pinning can be achieved without a guest NUMA topology, but because no better term than NUMA is available it will continue to be used.

The problem can best be described with three examples.

The first example is live migration with CPU pinning. An instance with a dedicated CPU policy and pinned CPUs is live-migrated. Its pin mappings are naively copied over to the destination host. This creates two problems. First, its pinned pCPUs aren’t properly claimed on the destination. This means that, should a second instance with pinned CPUs land on the destination, both instances’ vCPUs could be pinned to the same pCPUs. Second, any existing pin mappings on the destination are ignored. If another instance already exists on the destination, both instances’ vCPUs could be pinned to the same pCPUs. In both cases, the dedicated CPU policy is violated, potentially leading to unpredictable performance degradation.
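The overlap in the first example can be illustrated with a minimal sketch. The dictionaries and the overlapping_pcpus helper are hypothetical simplifications for illustration, not nova data structures: each instance is reduced to a vCPU-to-pCPU mapping.

```python
# Hypothetical, simplified model: each instance is a dict of vCPU -> pCPU.
def overlapping_pcpus(pins_a, pins_b):
    """Return the set of host pCPUs used by both pin mappings."""
    return set(pins_a.values()) & set(pins_b.values())


# Instance already running on the destination host.
resident = {0: 2, 1: 3}
# Instance arriving from the source with its pin mapping copied verbatim.
migrated = {0: 3, 1: 4}

conflicts = overlapping_pcpus(resident, migrated)
print(sorted(conflicts))  # [3]: pCPU 3 is now shared, violating the dedicated policy
```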

The second example is instances with hugepages. There are two hosts, each with two NUMA nodes and 8 1GB hugepages per node. Two identical instances are booted on the two hosts. Their virtual NUMA topology is one virtual NUMA node and 8 1GB memory pages. They land on their respective host’s NUMA node 0, consuming all 8 of its pages. One instance is live-migrated to the other host. The libvirt driver enforces strict NUMA affinity and does not regenerate the instance XML. Both instances end up on the same host’s NUMA node 0, which has no free pages left, and the live-migrated instance fails to run.

The third example is an instance with a virtual NUMA topology (but without hugepages). If an instance affined to its host’s NUMA node 2 is live-migrated to a host with only two NUMA nodes, and thus without a NUMA node 2, it will fail to run.
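The second and third examples both come down to reusing a source NUMA node id on the destination. The sketch below uses hypothetical names (HostNode, pick_node) and a deliberately naive first-fit policy. It is not nova's fitting logic, only an illustration of why the placement has to be recalculated against the destination.

```python
from dataclasses import dataclass


@dataclass
class HostNode:
    # Hypothetical stand-in for one host NUMA node.
    node_id: int
    free_1g_pages: int


def pick_node(host_nodes, pages_needed):
    """Return the id of the first node with enough free pages, or None."""
    for node in host_nodes:
        if node.free_1g_pages >= pages_needed:
            return node.node_id
    return None


# Destination from the second example: node 0 is full, node 1 is untouched.
dest = [HostNode(0, 0), HostNode(1, 8)]
print(pick_node(dest, 8))  # 1: recalculating finds room that node 0 lacks

# Third example: the source's node id 2 does not exist on a two-node host.
source_node_id = 2
print(source_node_id in {n.node_id for n in dest})  # False
```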

The first two of these examples are known bugs [1] [2].

Use Cases

As a cloud administrator, I want to live migrate instances with CPU pinning without the pin mappings overlapping on the destination compute host.

As a cloud administrator, I want live migration of hugepage-backed instances to work and for the instances to successfully run on the destination compute host.

As a cloud administrator, I want live migration of instances with an explicit NUMA topology to work and for the instances to successfully run on the destination compute host.

Proposed change

Currently, the scheduler does not claim any NUMA resources. While work has started to model NUMA topologies as resource providers in placement [3], this spec intentionally ignores that work and does not depend on it. Instead, the current method of claiming NUMA resources will continue to be used. Specifically, NUMA resources will continue to be claimed by the compute host’s resource tracker.

At the cell conductor (live migration isn’t supported between cells, so the superconductor is not involved) and compute level, the relevant parts of the current live migration flow can be summarized by the following oversimplified pseudo sequence diagram:

+-----------+                        +---------+                             +-------------+ +---------+
| Conductor |                        | Source  |                             | Destination | | Driver  |
+-----------+                        +---------+                             +-------------+ +---------+
      |                                   |                                         |             |
      | check_can_live_migrate_destination|                                         |             |
      |---------------------------------------------------------------------------->|             |
      |                                   |                                         |             |
      |                                   |           check_can_live_migrate_source |             |
      |                                   |<----------------------------------------|             |
      |                                   |                                         |             |
      |                                   | migrate_data                            |             |
      |                                   |---------------------------------------->|             |
      |                                   |                                         |             |
      |                                   |                            migrate_data |             |
      |<----------------------------------------------------------------------------|             |
      |                                   |                                         |             |
      | live_migration(migrate_data)      |                                         |             |
      |---------------------------------->|                                         |             |
      |                                   |                                         |             |
      |                                   | pre_live_migration(migrate_data)        |             |
      |                                   |---------------------------------------->|             |
      |                                   |                                         |             |
      |                                   |                            migrate_data |             |
      |                                   |<----------------------------------------|             |
      |                                   |                                         |             |
      |                                   | live_migration(migrate_data)            |             |
      |                                   |------------------------------------------------------>|
      |                                   |                                         |             |

migrate_data is a LiveMigrateData object. This spec proposes to add an object field containing an InstanceNUMATopology object. The source will include the instance’s existing NUMA topology in the migrate_data that its check_can_live_migrate_source returns to the destination. The destination’s virt driver will fit this InstanceNUMATopology to the destination’s NUMATopology and claim the resources using the resource tracker. It will then send the updated InstanceNUMATopology back to the conductor as part of the existing migrate_data that check_can_live_migrate_destination returns. The updated InstanceNUMATopology will continue to be propagated as part of migrate_data, eventually reaching the source. The source’s libvirt driver will use this updated InstanceNUMATopology when generating the instance XML to be sent to the destination for the live migration. The proposed flow is summarised in the following diagram:

+-----------+                                                   +---------+                                +-------------+                                      +---------+
| Conductor |                                                   | Source  |                                | Destination |                                      | Driver  |
+-----------+                                                   +---------+                                +-------------+                                      +---------+
      |                                                              |                                            |                                                  |
      | check_can_live_migrate_destination                           |                                            |                                                  |
      |---------------------------------------------------------------------------------------------------------->|                                                  |
      |                                                              |                                            |                                                  |
      |                                                              |              check_can_live_migrate_source |                                                  |
      |                                                              |<-------------------------------------------|                                                  |
      |                                                              |                                            |                                                  |
      |                                                              | migrate_data + InstanceNUMATopology        |                                                  |
      |                                                              |------------------------------------------->|                                                  |
      |                                                              |                                            | --------------------------------------------\    |
      |                                                              |                                            |-| Fit InstanceNUMATopology to NUMATopology, |    |
      |                                                              |                                            | | fail live migration if unable             |    |
      |                                                              |                                            | |-------------------------------------------|    |
      |                                                              |    migrate_data + new InstanceNUMATopology |                                                  |
      |<----------------------------------------------------------------------------------------------------------|                                                  |
      |                                                              |                                            |                                                  |
      | live_migration(migrate_data + new InstanceNUMATopology)      |                                            |                                                  |
      |------------------------------------------------------------->|                                            |                                                  |
      |                                  --------------------------\ |                                            |                                                  |
      |                                  | pre_live_migration call |-|                                            |                                                  |
      |                                  |-------------------------| |                                            |                                                  |
      |                                                              |                                            |                                                  |
      |                                                              | live_migration(migrate_data + new InstanceNUMATopology)                                       |
      |                                                              |---------------------------------------------------------------------------------------------->|
      |                                                              |                                            |            ------------------------------------\ |
      |                                                              |                                            |            | generate NUMA XML for destination |-|
      |                                                              |                                            |            |-----------------------------------| |
      |                                                              |                                            |                                                  |

Exchanging instance NUMA topologies is done early (in check_can_live_migrate_source rather than pre_live_migration) in order to fail as fast as possible if the destination cannot fit the instance. What happens when the compute hosts are not both running the updated handshake code is discussed in the Upgrade impact section.
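The destination side of the proposed handshake can be sketched as follows. All names here (InstanceTopology, MigrateData, Destination, fit_and_claim, MigrationPreCheckError) are hypothetical stand-ins, and the fitting policy is a trivial lowest-free-pCPU choice. The real InstanceNUMATopology, LiveMigrateData and resource tracker are considerably richer. The point is only the order of operations: fit, fail early if impossible, claim, return the updated topology.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class MigrationPreCheckError(Exception):
    """Hypothetical error: the destination cannot fit the instance."""


@dataclass
class InstanceTopology:
    # vCPU -> pCPU pin mapping; a stand-in for InstanceNUMATopology.
    cpu_pinning: Dict[int, int]


@dataclass
class MigrateData:
    # Stand-in for LiveMigrateData carrying the proposed new field.
    numa_topology: Optional[InstanceTopology] = None


class Destination:
    def __init__(self, free_pcpus):
        self.free_pcpus = set(free_pcpus)

    def fit_and_claim(self, migrate_data):
        """Re-pin the instance onto free pCPUs and claim them, or fail."""
        vcpus = sorted(migrate_data.numa_topology.cpu_pinning)
        if len(vcpus) > len(self.free_pcpus):
            # Fail before anything is claimed or any data is copied.
            raise MigrationPreCheckError("destination cannot fit instance")
        chosen = sorted(self.free_pcpus)[:len(vcpus)]
        self.free_pcpus -= set(chosen)  # the claim
        migrate_data.numa_topology = InstanceTopology(dict(zip(vcpus, chosen)))
        return migrate_data


dest = Destination(free_pcpus=[4, 5, 6])
data = dest.fit_and_claim(MigrateData(InstanceTopology({0: 2, 1: 3})))
print(data.numa_topology.cpu_pinning)  # {0: 4, 1: 5}
```

In this sketch the source's pins (2 and 3) are discarded in favour of pCPUs the destination actually has free, and a second two-vCPU instance would be rejected because only one pCPU remains.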

Currently, only placement allocations are updated during a live migration. The proposed resource tracker claims mechanism will become obsolete once NUMA resource providers are implemented [3]. Therefore, as a stopgap error handling method, the live migration can be failed if the resource claim does not succeed on the destination compute host. Once NUMA is handled by placement, the compute host will not have to do any resource claims.

It would also be possible for another instance to steal NUMA resources from a live migrated instance before the latter’s destination compute host has a chance to claim them. Until NUMA resource providers are implemented [3] and allow for an essentially atomic schedule+claim operation, scheduling and claiming will keep being done at different times on different nodes. Thus, the potential for races will continue to exist.

Alternatives

It would be possible to reuse the result of numa_fit_instance_to_host as called from the scheduler before the live migration reaches the conductor. select_destinations in the scheduler returns a list of Selection objects to the conductor’s live migrate task. The Selection object could be modified to include InstanceNUMATopology. The NUMA topology filter could add an InstanceNUMATopology for every host that passes. That topology would eventually reach the conductor, which would put it in migrate_data. The destination compute host would then claim the resources as previously described.

Data model impact

InstanceNUMATopology is added to LiveMigrateData.

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

Hypothetically, how NUMA-aware live migration could be supported between version-mismatched compute hosts would depend on which of the two compute hosts is older.

If the destination is older than the source, the source does not get an InstanceNUMATopology in migrate_data and can therefore choose to run an old-style live migration.

If the source is older than the destination, the new field in LiveMigrateData is ignored and the source’s old live migration runs without issues. However, the destination has already claimed NUMA resources that the source does not generate instance XML for. The destination could conceivably check the source’s compute service version and fail the migration before claiming resources if the source doesn’t support NUMA live migration.

However, given the current broken state of NUMA live migration, a simpler solution is to refuse to perform a NUMA live migration unless both source and destination compute hosts have been upgraded to a version that supports it. To achieve this, the conductor can check the source and destination computes’ service versions and fail the migration if either one is too old.
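A minimal sketch of such a conductor-side check is shown below. The constant MIN_NUMA_LM_VERSION, its value, the exception name and the function signature are all assumptions made for illustration. The spec does not define the actual service version number or where in the conductor the check would live.

```python
# Hypothetical minimum compute service version supporting NUMA live migration.
MIN_NUMA_LM_VERSION = 40


class NUMALiveMigrationUnsupported(Exception):
    """Hypothetical error raised when either compute host is too old."""


def check_numa_live_migration(source_version, dest_version, has_numa_topology):
    """Refuse a NUMA live migration unless both hosts are new enough."""
    if not has_numa_topology:
        return  # instances without NUMA characteristics are unaffected
    oldest = min(source_version, dest_version)
    if oldest < MIN_NUMA_LM_VERSION:
        raise NUMALiveMigrationUnsupported(
            "compute service version %d is older than required %d"
            % (oldest, MIN_NUMA_LM_VERSION))


check_numa_live_migration(40, 41, has_numa_topology=True)   # passes silently
check_numa_live_migration(10, 41, has_numa_topology=False)  # no NUMA, passes
```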

Implementation

Assignee(s)

Primary assignee:

notartom

Work Items

  • Add InstanceNUMATopology to LiveMigrateData.

  • Modify the libvirt driver to generate live migration instance XML based on the InstanceNUMATopology in the migrate_data it receives from the destination.

Dependencies

None.

Testing

The libvirt/qemu driver used in the gate does not currently support NUMA features (though work is in progress [4]). Therefore, testing NUMA-aware live migration in the upstream gate would require nested virt. In addition, the only assertable outcome of a NUMA live migration test (if it ever becomes possible) would be that the live migration succeeded. Examining the instance XML to assert things about its NUMA affinity or CPU pin mapping is explicitly out of tempest’s scope. For these reasons, NUMA-aware live migration is best tested in third-party CI [5] or other downstream test scenarios [6].

Documentation Impact

Current live migration documentation does not mention the NUMA limitations anywhere. Therefore, a release note explaining the new NUMA capabilities of live migration should be enough.

References

History

Revisions

Release Name    Description
Rocky           Introduced