Mediated device live migration with libvirt

https://blueprints.launchpad.net/nova/+spec/libvirt-mdev-live-migrate

Starting with libvirt 8.6.0, QEMU 8.1.0 and Linux kernel 5.18.0, guests using mediated devices can be live-migrated to a target mediated device of the same mediated device type, without having to unplug and replug the mdevs. We now need to support this in Nova, which means that Nova should provide an existing target mediated device UUID to the source compute service during the pre-live-migration call, so that the target XML created by the source uses it.

Problem description

At the moment, it is not possible to live-migrate an instance that uses a mediated device, as the target would not create one. For now you can only cold-migrate the instance or use other move operations like shelve. Fortunately, libvirt 8.6.0 now supports live-migrating a guest by using a target mediated device UUID in the target XML, so we want to support this directly in Nova.

Use Cases

As an operator, I want to move my instance using a vGPU to another host without the user being aware of it.

As an operator, I want to make sure I will only live-migrate by using the same mediated device type between the source and the target.

Proposed change

In order to successfully live-migrate a guest with libvirt, you need to modify the target guest XML to use another mediated device of the same mdev (mediated device) type. To do so, we propose the following workflow:

First, during the conductor compatibility checks, we will verify the mdev type compatibility on the destination and claim a specific list of target mediated devices (either created on the spot or just kept reserved) this way:

  • check_can_live_migrate_source() (run on the source) will, only if the instance has mediated devices, check the libvirt version of the source and fail by raising a MigrationPreCheckError if the version is below the minimum required (see Dependencies). It will also check the LibvirtLiveMigrateData version returned by the destination and will raise a MigrationPreCheckError exception if it is older than the one supporting the new fields (see both Upgrade impact and Data model impact). Finally, it will return the list of mdevs with their types back to the target in the LibvirtLiveMigrateData object.

  • the driver’s post_claim_migrate_data() (run on the destination) will first check, based on the LibvirtLiveMigrateData object, whether the destination libvirt version is below the minimum required, and then check whether those mdev types are compatible with the types the target supports, raising a MigrationPreCheckError if not. If successful, it will pick N (N being the requested number) of the available mediated resources (either by creating new mdevs or taking existing ones), based on the list that was passed through LibvirtLiveMigrateData, and will persist that list of target mediated devices in an internal dictionary field of the LibvirtDriver instance, keyed by the instance UUID. We will also pass those mdev UUIDs in the LibvirtLiveMigrateData object that we return over the wire to the source compute (we will refer to it below as the migrate data object). A rough sketch of this check-and-claim step follows the note below.

Note

The current spec proposal is to use the existing NUMA-live-migration related method called post_claim_migrate_data(), but we could create a specific new virt driver API method for this usage. This will be discussed at the implementation stage.
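
To make that check-and-claim step more concrete, here is a minimal sketch of a destination-side LibvirtDriver helper that post_claim_migrate_data() could call. The helper names (_get_supported_mdev_types(), _allocate_mdev_for_migration()) and the internal dictionary name (_claimed_mdevs_by_instance) are placeholders and not the final implementation:

  # Sketch only: hypothetical destination-side helper; names are placeholders.
  from nova import exception

  MIN_MDEV_LIVE_MIGRATION_LIBVIRT_VERSION = (8, 6, 0)  # see Dependencies

  def _check_and_claim_target_mdevs(self, instance, migrate_data):
      if not migrate_data.obj_attr_is_set('source_mdev_types'):
          # The instance has no mediated devices, nothing to check or claim.
          return migrate_data

      if not self._host.has_min_version(
              lv_ver=MIN_MDEV_LIVE_MIGRATION_LIBVIRT_VERSION):
          raise exception.MigrationPreCheckError(
              reason='Destination libvirt is too old for mdev live migration')

      supported_types = self._get_supported_mdev_types()  # hypothetical helper
      target_mdevs = {}
      for src_uuid, mdev_type in migrate_data.source_mdev_types.items():
          if mdev_type not in supported_types:
              raise exception.MigrationPreCheckError(
                  reason='mdev type %s is not supported here' % mdev_type)
          # Create a new mdev or reserve a free existing one (hypothetical helper).
          target_mdevs[src_uuid] = self._allocate_mdev_for_migration(mdev_type)

      # Track the claim so those mdevs are not handed to another request, and
      # tell the source which target mdevs to use in the updated guest XML.
      self._claimed_mdevs_by_instance[instance.uuid] = list(target_mdevs.values())
      migrate_data.target_mdevs = target_mdevs
      return migrate_data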

Later, once the source host starts the live-migration, we will update the guest XML with those mediated device UUIDs this way:

  • in the source driver’s _live_migration_operation(), we look up the migrate data object we received and update the target guest XML in get_updated_guest_xml() with the mediated device UUIDs taken from that object (see the sketch after this list).

  • in the destination driver’s post_live_migration_at_destination(), we remove the mdevs tracked for that migration from the internal dictionary field of the LibvirtDriver instance, looking them up by the instance UUID key.
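
For illustration, here is a standalone sketch (not the actual get_updated_guest_xml() helper) of the XML rewrite itself: it swaps the mdev UUIDs in the <hostdev type='mdev'> elements of the guest XML according to the source-to-target mapping carried by the migrate data object:

  # Standalone sketch: rewrite the mdev UUIDs of a libvirt guest XML so the
  # definition sent to the target points at the claimed target mdevs.
  import xml.etree.ElementTree as ET

  def update_mdev_uuids(guest_xml, target_mdevs):
      """Replace each source mdev UUID with its claimed target mdev UUID.

      :param guest_xml: the source guest XML, as a string
      :param target_mdevs: dict mapping source mdev UUID -> target mdev UUID
      """
      root = ET.fromstring(guest_xml)
      for hostdev in root.findall("./devices/hostdev[@type='mdev']"):
          address = hostdev.find('source/address')
          src_uuid = address.get('uuid')
          if src_uuid in target_mdevs:
              address.set('uuid', target_mdevs[src_uuid])
      return ET.tostring(root, encoding='unicode')

  example_xml = """
  <domain type='kvm'>
    <devices>
      <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
        <source>
          <address uuid='11111111-1111-1111-1111-111111111111'/>
        </source>
      </hostdev>
    </devices>
  </domain>
  """

  print(update_mdev_uuids(
      example_xml,
      {'11111111-1111-1111-1111-111111111111':
       '22222222-2222-2222-2222-222222222222'}))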

In case of any live migration abort or exception, the only residue we need to clean up is the list of mediated devices claimed for the migration that are set in the dictionary field of the LibvirtDriver instance. Accordingly, we propose to delete those records this way:

  • if the exception occurred during pre-live-migration, rollback_live_migration_at_destination() is eventually called on the destination, depending on the _live_migration_cleanup_flags() result. We will modify that verification method to check whether we have mediated device UUIDs in the migrate data object (see the sketch after this list). Then, rollback_live_migration_at_destination() will look at the internal dictionary in the LibvirtDriver to know which mediated devices to remove from it.

  • if the exception happened during the live-migration (or if the operator asked to abort it), _rollback_live_migration() is eventually called, which in turn calls rollback_live_migration_at_destination() as above, so it would also remove the mdevs from the LibvirtDriver dictionary field.
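
A rough sketch, reusing the hypothetical _claimed_mdevs_by_instance dictionary from the earlier sketch, of the additional cleanup condition and the matching destination-side removal:

  # Sketch only: attribute and helper names are placeholders.
  def _migration_claimed_mdevs(migrate_data):
      """Return True if the destination claimed target mdevs for this migration."""
      return 'target_mdevs' in migrate_data and bool(migrate_data.target_mdevs)

  # In _live_migration_cleanup_flags(), request the destination rollback call
  # whenever mdevs were claimed there, on top of the existing conditions:
  #     do_cleanup = do_cleanup or _migration_claimed_mdevs(migrate_data)

  # In the destination driver, called from rollback_live_migration_at_destination():
  def _forget_claimed_mdevs(self, instance):
      # Drop the claim so the mdevs become available again (or can be removed).
      self._claimed_mdevs_by_instance.pop(instance.uuid, None)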

As a side note, the current method we have for knowing which mediated devices are used by instances will be modified to also take into account the list of mediated devices currently set in the internal dictionary field of the LibvirtDriver that we will use for tracking which mdevs are claimed for migrations (as sketched below).
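
For example, that accounting change could look roughly like this, assuming the existing _get_all_assigned_mediated_devices() lookup and the hypothetical names used in the sketches above:

  # Sketch only: extend the "mdevs in use" lookup so that mdevs claimed for
  # in-flight migrations cannot be handed out twice; _get_mdevs_assigned_to_guests
  # is a placeholder for the existing guest-XML-based lookup.
  def _get_all_assigned_mediated_devices(self, instance=None):
      # mdevs currently attached to guests (existing behaviour).
      in_use = self._get_mdevs_assigned_to_guests(instance)
      # mdevs claimed for migrations that are still in flight.
      for instance_uuid, mdev_uuids in self._claimed_mdevs_by_instance.items():
          for mdev_uuid in mdev_uuids:
              in_use[mdev_uuid] = instance_uuid
      return in_use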

Alternatives

Operators could continue to only do cold migrations or we could try to unplug and then plug mediated devices during live-migration like we do at the moment for SR-IOV VFs.

Data model impact

While we won’t describe the internal dictionary we would use in the LibvirtDriver class instance, as this is just an implementation question, we still need to explain which objects will be passed between compute RPC services. As said earlier, we need to augment the LibvirtLiveMigrateData object.

New fields will be added to that object (we can create a nested object if people prefer; a sketch of the declarations follows the list below):

  • source_mdev_types: fields.DictOfStringsField(): a dictionary where the key is a source mediated device UUID and the value is its mdev type.

  • target_mdevs: fields.DictOfStringsField(): a dictionary where the key is a mediated device UUID of the source and the value is an mdev UUID of the target, implicitly matching the relationship between both for the live-migration.
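
For illustration, the declarations could look like this (a sketch; the exact object version bump is omitted and the fields could live in a nested object instead):

  # Sketch only: the two entries that would be added to
  # LibvirtLiveMigrateData.fields in nova/objects/migrate_data.py, together
  # with the usual object version bump (not shown here).
  from nova.objects import fields

  new_fields = {
      # source mdev UUID -> its mdev type, filled in by the source compute
      'source_mdev_types': fields.DictOfStringsField(),
      # source mdev UUID -> claimed target mdev UUID, filled in by the
      # destination compute
      'target_mdevs': fields.DictOfStringsField(),
  }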

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

Operators wanting to use vGPU live-migration will need a recent libvirt release, so they probably need to upgrade their OS. They will also need to upgrade all their compute services; see Upgrade impact for more details.

Developer impact

None.

Upgrade impact

Operators will need to make sure that the target computes are upgraded. That said, if the destination is not upgraded (and therefore doesn’t support mdev live migration), it would return a LibvirtLiveMigrateData object with a previous version. The source will know that the target doesn’t support it and will accordingly raise a MigrationPreCheckError (as detailed above in Proposed change).

Implementation

Assignee(s)

Primary assignee:

sylvain-bauza

Other contributors:

None

Feature Liaison

N/A

Work Items

  • add the LibvirtDriver internal dictionary

  • augment the LibvirtLiveMigrateData object

  • add the conductor checks

  • add the live-migration changes

Dependencies

As said above, this feature requires:

  • libvirt 8.6.0 and newer

  • QEMU 8.1.0 and newer

  • Linux kernel 5.18.0 and newer

Testing

Unit and functional tests are the bare minimum, but we are actively pursuing the idea of using the mtty kernel sample driver as a way to add Tempest testing that is not written yet. We may need to build a custom kernel in order to get the latest version of mtty that includes live-migration support.

Documentation Impact

We’ll augment the usual virtual GPU documentation with a section on how to live-migrate and its requirements.

As a note, the specific proprietary NVIDIA vfio-mdev driver that provides mediated device types and live-migration support currently has limitations and doesn’t support pausing a VM or the auto-converge feature. Besides, live-migration downtime is very dependent on the hardware, so we somehow need to document those hardware-specific knobs in an abstract manner in our upstream docs, pointing to the vendor documentation where it exists as much as we can.

References

History

Revisions

Release Name     Description

2024.1 Caracal   Introduced