Neutron SR-IOV Port Live Migration
https://blueprints.launchpad.net/nova/+spec/libvirt-neutron-sriov-livemigration
When nova was extended to support SR-IOV by [0], live migration was not directly addressed as part of the proposed changes. As a result, while live migration with SR-IOV is technically feasible, it remains unsupported by the libvirt driver. This spec seeks to address this gap in live migration support for the libvirt virt driver.
Problem description
Live migration with SR-IOV devices has several complicating factors:
- NUMA affinity
- hardware state
- SR-IOV mode
- resource claims
NUMA affinity is out of the scope of this spec and will be addressed separately by [1].
The SR-IOV mode of a neutron port directly impacts how a live migration can be done. This spec focuses primarily on the two categories of SR-IOV: direct passthrough (vnic_type=direct|direct-physical) and indirect passthrough (vnic_type=macvtap|virtio-forwarder). For simplicity, direct mode and indirect mode are used instead of vnic_type in the rest of the spec.
When a device is exposed to a guest via direct mode SR-IOV, maximum performance is achieved at the cost of exposing the guest to the hardware state. Since there is no standard mechanism to copy the hardware state, direct mode SR-IOV cannot be conventionally live migrated. This spec will provide a workaround to enable this configuration.
Hardware state transfer is a property of SR-IOV live migration that cannot be addressed by OpenStack; as such, this spec does not intend to copy hardware state. Copying hardware state requires explicit support at the hardware, driver and hypervisor level, which does not exist for SR-IOV devices.
Note
Hardware state in this context refers to any NIC state, such as offload state or Tx/Rx queues, that is implemented in hardware and is not software programmable via the hypervisor. For example, the MAC address is not considered hardware state in this context as libvirt/qemu can set the MAC address of an SR-IOV device via a standard host level interface.
For SR-IOV indirect mode, the SR-IOV device is exposed via a software mediation layer such as macvtap + kernel vhost, vhost-user or vhost-vfio. From a guest perspective, the SR-IOV interfaces are exposed as virtual NICs and no hardware state is observed. Indirect mode SR-IOV therefore allows migration of guests without any workarounds.
The main gap in SR-IOV live migration support today is resource claims. As mentioned in the introduction, it is technically possible to live migrate a guest with an indirect mode SR-IOV device between two hosts today; however, when you do, resources are not correctly claimed. Because the SR-IOV device is not claimed, an exception is raised in the post migration cleanup after the VM has been successfully migrated to the destination.
When live migrating today, migration will also fail if the PCI mapping is required to change. Said another way, migration will only succeed if there is a free PCI device on the destination node with the same PCI address as the source node that is connected to the same physnet and is on the same NUMA node.
This is because of two issues: first, nova does not correctly claim the SR-IOV device on the destination node, and second, nova does not modify the guest XML to reflect the host PCI address on the destination.
As a result of the above issues, SR-IOV live migration in the libvirt driver is currently incomplete and incorrect even when the VM is successfully moved.
Use Cases
As a telecom operator with stateful VNF such as a vPE Router that has a long peering time, I would like to be able to utilise direct mode SR-IOV to meet my performance SLAs but desire the flexibility of live migration for maintenance. To that end, as an operator, I am willing to use a bond in the guest to a vSwitch or indirect SR-IOV interface to facilitate migration and retain peering relationships while understanding performance SLAs will not be met during the migration.
As the provider of a cloud with a hardware offloaded vSwitch that leverages indirect mode SR-IOV, I want to offer the performance it enables to my customers but also desire the flexibility to be able to transparently migrate guests without disrupting traffic to enable maintenance.
Proposed change
This spec proposes addressing the problem statement in several steps.
Resource claims
Building on top of the recently added multiple port binding feature, this spec proposes to extend the existing check_can_live_migrate_destination function to claim SR-IOV devices on the destination node via the PCI resource tracker. If claiming fails, the partially claimed resources will be released and check_can_live_migrate_destination will fail. If the claiming succeeds, the VIFMigrateData objects in the LiveMigrateData object corresponding to the SR-IOV devices will be updated with the destination host PCI address. If the migration fails after the destination resources have been claimed, they must be released in the rollback_live_migration_at_destination call. If the migration succeeds, the source host SR-IOV device will be freed in post_live_migration (clean up source) and the state of the claimed devices on the destination will be updated to allocated. By proactively updating the resource tracker in both the success and failure cases, we do not need to rely on the update_available_resource periodic task to heal the allocations/claims.
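The claim/rollback flow described above could look roughly like the sketch below. It is illustrative only: the helpers _claim_sriov_device and _release_sriov_devices are placeholders for the real PCI resource tracker calls, and plain dicts stand in for the VIFMigrateData objects.

```python
# Illustrative sketch only; the helpers are placeholders, not existing nova
# APIs. A real implementation would call into the PCI resource tracker and
# operate on VIFMigrateData objects rather than plain dicts.

SRIOV_VNIC_TYPES = ("direct", "direct-physical", "macvtap", "virtio-forwarder")

def _claim_sriov_device(instance, mig_vif):
    # Placeholder for claiming a free VF on the destination via the PCI
    # resource tracker; raises if no suitable device is available.
    raise NotImplementedError

def _release_sriov_devices(instance, claimed):
    # Placeholder for releasing partially claimed devices.
    pass

def claim_destination_devices(instance, migrate_data_vifs):
    """Claim a destination VF for every SR-IOV VIF, rolling back on failure."""
    claimed = []
    try:
        for mig_vif in migrate_data_vifs:
            if mig_vif.get("vnic_type") not in SRIOV_VNIC_TYPES:
                continue
            dev = _claim_sriov_device(instance, mig_vif)
            claimed.append(dev)
            # Record the destination host PCI address so the later migration
            # steps (XML generation, plugging, cleanup) know which device
            # was claimed.
            mig_vif["profile"] = {"pci_slot": dev.address}
        return migrate_data_vifs
    except Exception:
        # Undo any partial claims and fail the pre-check.
        _release_sriov_devices(instance, claimed)
        raise
```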
SR-IOV Mode
Indirect Mode
No other nova modifications are required for indirect mode SR-IOV beyond those already covered in the Resource claims section.
Direct Mode
For direct mode SR-IOV, to enable live migration the SR-IOV devices must first be detached on the source after pre_live_migrate and then reattached in post_live_migration_at_destination.
This mimics the existing suspend [2] and resume [3] workflow, whereby we work around QEMU's inability to save device state during a suspend to disk operation.
Note
If you want to maintain network connectivity during the migration, as the direct mode SR-IOV device will be detached, a bond is required in the guest to a transparently live migratable interface such as a vSwitch interface or an indirect mode SR-IOV device. The recently added net_failover kernel driver is out of scope of this spec but could also be used.
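A rough sketch of the detach/reattach sequence described above is given below. The helper names are hypothetical stand-ins for the libvirt driver's existing interface hot plug/unplug support, not new APIs defined by this spec.

```python
# Illustrative sketch of the direct mode workaround; _detach_interface and
# _attach_interface are stand-ins for the driver's existing hot plug/unplug
# paths.

DIRECT_VNIC_TYPES = ("direct", "direct-physical")

def _detach_interface(guest, vif):
    # Placeholder for the existing hot-unplug path.
    pass

def _attach_interface(guest, vif):
    # Placeholder for the existing hot-plug path.
    pass

def detach_direct_vifs_on_source(guest, network_info):
    # Detach direct mode SR-IOV interfaces before the migration starts so
    # that no hardware state needs to be transferred.
    for vif in network_info:
        if vif.get("vnic_type") in DIRECT_VNIC_TYPES:
            _detach_interface(guest, vif)

def reattach_direct_vifs_on_destination(guest, network_info):
    # Re-attach the direct mode SR-IOV interfaces using the PCI devices
    # that were claimed on the destination during the pre-checks.
    for vif in network_info:
        if vif.get("vnic_type") in DIRECT_VNIC_TYPES:
            _attach_interface(guest, vif)
```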
XML Generation
Indirect mode SR-IOV does not encode the PCI address in the libvirt XML. The XML update logic that was introduced in the multiple port bindings feature is sufficient to enable the indirect use case.
Direct mode SR-IOV does encode the PCI address in the libvirt XML; however, as the SR-IOV devices will be detached before migration and reattached after migration, no XML updates will be required.
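To illustrate the difference, the sketch below checks whether a libvirt interface element references a host PCI device. The sample elements are trimmed to the relevant attributes; note that libvirt's type="direct" (macvtap) corresponds to what this spec calls indirect mode.

```python
# Illustration of why only direct mode encodes a host PCI address in the
# guest XML. The sample elements are trimmed libvirt <interface> definitions.
import xml.etree.ElementTree as ET

DIRECT_MODE_IFACE = """
<interface type="hostdev" managed="yes">
  <source>
    <address type="pci" domain="0x0000" bus="0x05" slot="0x10" function="0x2"/>
  </source>
</interface>
"""

# libvirt calls macvtap interfaces type="direct"; in this spec's terminology
# this is indirect mode, and no host PCI address appears in the XML.
INDIRECT_MODE_IFACE = """
<interface type="direct">
  <source dev="enp5s0f1" mode="passthrough"/>
</interface>
"""

def encodes_host_pci_address(iface_xml):
    """Return True if the interface element references a host PCI device."""
    iface = ET.fromstring(iface_xml)
    return (iface.get("type") == "hostdev" and
            iface.find("./source/address[@type='pci']") is not None)

print(encodes_host_pci_address(DIRECT_MODE_IFACE))    # True
print(encodes_host_pci_address(INDIRECT_MODE_IFACE))  # False
```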
Alternatives
As always, we could do nothing and continue to not support live migration with SR-IOV devices. In this case, operators would have to continue to fall back on cold migration. As this alternative would not fix the problem of incomplete live migration support, additional documentation or optionally a driver level check to reject live migration would be warranted to protect operators that may not be aware of this limitation.
We could add a new API check to determine if an instance has an SR-IOV port and explicitly fail to migrate in this case with a new error.
Data model impact
It is expected that no data model changes should be required as the existing VIF object in the migration_data object should be able to store the associated PCI address info. If this is not the case, a small extension to those objects will be required for this info.
REST API impact
None
Security impact
None
Notifications impact
None
Other end user impact
Users of direct mode SR-IOV should be aware that auto hotplugging is not transparent to the guest in exactly the same way that suspend is not transparent today. This will be recorded in the release notes and live migration documentation.
Performance Impact
This should not significantly impact the performance of a live migration. A minor overhead will be incurred in claiming the resource and updating the XML.
Other deployer impact
SR-IOV live migration will be enabled if both the source and dest node support it. If either compute node does not support this feature the migration will be aborted by the conductor.
Developer impact
None
Upgrade impact
This feature may aid the upgrade of hosts with SR-IOV enabled guests in the future by allowing live migration to be used; however, that will require both the source and dest node to support SR-IOV live migration first. As a result, this feature will have no impact for this release.
To ensure cross version compatibility, the conductor will validate that the source and destination nodes support this feature, following the same pattern that is used to detect if multiple port binding is supported.
When upgrading from Stein to Train, the conductor check will allow this feature to be used with no operator intervention required.
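The conductor check could follow a pattern similar to the sketch below. MIN_SRIOV_LIVE_MIGRATION_VERSION is a hypothetical placeholder; the real value would be the compute service version bumped when this feature lands.

```python
# Sketch of a conductor-side support check, mirroring the service-version
# pattern used to detect multiple port binding support. The minimum version
# constant is a placeholder, not a real nova value.
from nova import objects

MIN_SRIOV_LIVE_MIGRATION_VERSION = 39  # placeholder value

def _supports_sriov_live_migration(context, host):
    service = objects.Service.get_by_compute_host(context, host)
    return service.version >= MIN_SRIOV_LIVE_MIGRATION_VERSION

def check_hosts(context, source_host, dest_host):
    # The conductor aborts the migration unless both hosts support the
    # feature.
    return (_supports_sriov_live_migration(context, source_host) and
            _supports_sriov_live_migration(context, dest_host))
```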
Implementation
Assignee(s)
- Primary assignee: sean-k-mooney
- Other contributors: adrian.chiris
Work Items
- Spec: Sean-K-Mooney
- PCI resource allocation and indirect live-migration support: Adrianc
- Direct live-migration support: Sean-K-Mooney
Dependencies
This spec has no dependencies but intends to collaborate with the implementation of NUMA aware live migration [1].
Note that modifications to the sriovnicswitch ml2 driver may be required to support multiple port bindings. This work, if needed, is out of scope of this spec and will be tracked using Neutron RFE bugs and/or specs as required.
Testing
This feature will be tested primarily via unit and functional tests; as SR-IOV testing is not available in the gate, tempest tests will not be possible. Third party CI could be implemented but that is not part of the scope of this spec. The use of the netdevsim kernel module to allow testing of SR-IOV without SR-IOV hardware was evaluated. While the netdevsim kernel module does allow the creation of an SR-IOV PF netdev and the allocation of SR-IOV VF netdevs, it does not simulate PCIe devices. As a result, in its current form, the netdevsim kernel module cannot be used to enable SR-IOV testing in the gate.
Documentation Impact
Operator docs will need to be updated to describe the new feature and specify that direct mode auto-attach is not transparent to the guest.
References
History
Release Name | Description
---|---
Stein | Proposed
Train | Reproposed