libvirt driver launching AMD SEV-encrypted instances¶
This spec proposes work required in order for nova’s libvirt driver to support launching of KVM instances which are encrypted using AMD’s SEV (Secure Encrypted Virtualization) technology.
While data is typically encrypted today when stored on disk, it is stored in DRAM in the clear. This can leave the data vulnerable to snooping by unauthorized administrators or software, or by hardware probing. New non-volatile memory technology (NVDIMM) exacerbates this problem since an NVDIMM chip can be physically removed from a system with the data intact, similar to a hard drive. Without encryption any stored information such as sensitive data, passwords, or secret keys can be easily compromised.
AMD’s SEV offers a VM protection technology which transparently encrypts the memory of each VM with a unique key. It can also calculate a signature of the memory contents, which can be sent to the VM’s owner as an attestation that the memory was encrypted correctly by the firmware. SEV is particularly applicable to cloud computing since it can reduce the amount of trust VMs need to place in the hypervisor and administrator of their host system.
As a cloud administrator, in order that my users can have greater confidence in the security of their running instances, I want to provide a flavor containing an SEV-specific required trait extra spec which will allow users booting instances with that flavor to ensure that their instances run on an SEV-capable compute host with SEV encryption enabled.
As a cloud user, in order to not have to trust my cloud operator with my secrets, I want to be able to boot VM instances with SEV functionality enabled.
For Stein, the goal is a minimal but functional implementation which would satisfy the above use cases. It is proposed that initial development and testing would include the following deliverables:
Add detection of host SEV capabilities. Logic is required to check that the various layers of the hardware and software hypervisor stack are SEV-capable:
The presence of the following XML in the response from a libvirt virConnectGetDomainCapabilities() API call indicates that both QEMU and the AMD Secure Processor (AMD-SP) support SEV functionality:
<domainCapabilities> ... <features> ... <sev supported='yes'/> ... </sev> </features> </domainCapabilities>
This functionality-oriented check should preempt the need for any version checking in the driver.
/sys/module/kvm_amd/parameters/sevshould have the value
1to indicate that the kernel has SEV capabilities enabled. This should be readable by any user (i.e. even non-root).
Note that both checks are required, since the presence of the first does not imply the second.
Support a standard trait which would be automatically detected per compute host based on the above logic. This would most likely be called
HW_CPU_AMD_SEVor similar, as an extension of the existing CPU traits mapping.
When present in the flavor, this standard trait would indicate that the libvirt driver should include extra XML in the guest’s domain definition, in order to ensure the following:
SEV security is enabled.
The boot disk cannot be
virtio-blk(due to a resource constraint w.r.t. bounce buffers).
The VM uses machine type
q35and UEFI via OVMF. (
q35is required in order to bind all the virtio devices to the PCIe bridge so that they use virtio 1.0 and not virtio 0.9, since QEMU’s
iommu_platformfeature is added in virtio 1.0 only.)
onfor all virtio devices. Despite the name, this does not require the guest or host to have an IOMMU device, but merely enables the virtio flag which indicates that virtualized DMA should be used. This ties into the SEV code to handle memory encryption/decryption, and prevents IO buffers being shared between host and guest.
The DMA will go through bounce buffers, so some overhead is expected compared to non-SEV guests.
(Note: virtio-net device queues are not encrypted.)
All the memory regions allocated by QEMU must be pinned, so that they cannot be swapped to disk. This can be achieved by setting a hard memory limit via
<memtune>section of the domain’s XML. This does not reflect a requirement for additional memory; it is only required in order to achieve the memory pinning.
Another method for pinning the memory is to enable hugepages by booting with the
hw:mem_page_size=largeproperty set either on the flavor or the image, although these have the minor disadvantages of requiring the operator or user to remember one extra detail, and also may require undesirable duplication of flavors or images.
Note that this memory pinning is expected to be a temporary requirement; the latest firmwares already support page copying (as documented by the
COPYAPI in the AMD SEV-KM API Specification), so when OS starts supporting the page-move or page-migration commmand then it will no longer be needed.
Based on instrumentation of QEMU, the limit per VM should be calculated and accounted for as follows:
Memory region type
set by flavor
set by flavor/image
UEFI var store (pflash)
It is also recommended to include an additional padding of at least 256KB for safety, since ROM sizes can occasionally change. For example the total of 10832KB required here for ROMs / ACPI tables should be rounded up to 16MB.
The first two values are expected to commonly vary per VM, and are already accounted for dynamically by the placement service.
The remainder have traditionally (i.e. for non-SEV instances) been accounted for alongside the overhead for the host OS via nova’s memory pool defined by the reserved_host_memory_mb config option, and this does not need to change. However, whilst the overhead incurred is no different to that required for non-SEV instances, it is much more important to get the hard limit right when pinning memory; if it’s too low, the VM will get killed, and if it’s too high, there’s another risk of the host’s OOM killer being invoked, or failing that, the host crashing because it cannot reclaim the memory used by the guest.
Therefore it may be prudent to implement an extra check which multiplies this reservation requirement by the number of instances and ensures that it does not cause the host’s memory usage to exceed what’s available. This can probably be initially implemented as a check in the driver, but regardless, the way to avoid this over-commitment must be documented so that operators can correctly plan memory usage and configure their cloud accordingly.
So for example assuming a 4GB VM:<domain type='kvm'> <os> <type arch='x86_64' machine='pc-q35-2.11'>hvm</type> <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x86_64-ms-4m-code.bin</loader> <nvram>/var/lib/libvirt/qemu/nvram/sles15-sev-guest_VARS.fd</nvram> <boot dev='hd'/> </os> <launchSecurity type='sev'> <cbitpos>47</cbitpos> <reducedPhysBits>1</reducedPhysBits> <policy>0x0037</policy> </launchSecurity> <memtune> <hard_limit unit='KiB'>4718592</hard_limit> ... </memtune> <devices> <rng model='virtio'> <driver iommu='on'/> ... </rng> <memballoon model='virtio'> <driver iommu='on' /> ... </memballoon> ... <video> <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/> </video> ... </devices> ... </domain>
If SEV’s requirement of a Q35 machine type cannot be satisfied by
hw_machine_type specified by the image (if present), or the value
nova.conf (which is
not set by default),
then an exception should be raised so that the build fails.
reducedPhysBits are dependent on the processor
family, and can be obtained through the
sev element from the
policy allows a particular SEV policy, as documented in the AMD
SEV-KM API Specification. Initially the policy will be hardcoded and
not modifiable by cloud tenants or cloud operators. The policy will
#define SEV_POLICY_NORM \ ((SEV_POLICY)(SEV_POLICY_NODBG|SEV_POLICY_NOKS| \ SEV_POLICY_ES|SEV_POLICY_DOMAIN|SEV_POLICY_SEV))
which equates to
0x0037. In the future, when support is added to
QEMU and libvirt, this will permit live migration to other machines in
the same cluster 1 (i.e. with the same OCA cert) and uses SEV-ES,
but doesn’t permit other guests or the hypervisor to directly inspect
memory. If the upstream support for SEV-ES does not arrive in time
then SEV-ES will be not be included in the policy.
A future spec could be submitted to make this configurable via an extra spec or image property.
The sum of the work described above could also mean that images with
trait:HW_CPU_AMD_SEV=required would similarly affect
the process of launching instances.
Even though live migration is not currently supported by the hypervisor software stack, it will be in the future.
The following limitations will be removed in the future as the hardware, firmware, and various layer of software receive new features:
SEV-encrypted VMs cannot yet be live-migrated, or suspended, consequently nor resumed. As already mentioned, support is coming in the future. However this does mean that in the short term, usage of SEV will have an impact on compute node maintenance, since SEV-encrypted instances will need to be fully shut down before migrating off an SEV host.
SEV-encrypted VMs cannot contain directly accessible host devices (PCI passthrough). So for example mdev vGPU support will not currently work. However technologies based on vhost-user should work fine.
The boot disk of SEV-encrypted VMs cannot be
virtio-scsior SATA for the boot disk works as expected, as does
virtio-blkfor non-boot disks.
The following limitations are expected long-term:
The operating system running in an encrypted virtual machine must contain SEV support.
q35machine type does not provide an IDE controller, therefore IDE devices are not supported. In particular this means that nova’s libvirt driver’s current default behaviour on the x86_64 architecture of attaching the config drive as an
iso9660IDE CD-ROM device will not work. There are two potential workarounds:
nova.conffrom its default value
vfat. This will result in
virtiobeing used instead. However this per-host setting could potentially break images with legacy OS’s which expect the config drive to be an IDE CD-ROM. It would also not deal with other CD-ROM devices.
Set the (largely undocumented)
hw_cdrom_busimage property to
virtio, which is recommended as a replacement for
Some potentially cleaner long-term solutions which require code changes are suggested as a stretch goal in the Work Items section below.
For the sake of eliminating any doubt, the following actions are not expected to be limited when SEV encryption is used:
Cold migration or shelve, since they power off the VM before the operation at which point there is no encrypted memory (although this could change since there is work underway to add support for PMEM)
Snapshot, since it only snapshots the disk
Evacuate, since this is only initiated when the VM is assumed to be dead or there is a good reason to kill it
Attaching any volumes, as long as they do not require attaching via an IDE bus
Use of spice / VNC / serial / RDP consoles
Rather than immediately implementing automatic detection of SEV-capable hosts and providing access to these via a new standard trait (
create a custom trait specifically for the purpose of marking flavors as SEV-capable, and then
manually assign that trait to each compute node which is SEV-capable.
This would have the minor advantages of slightly decreasing the amount of effort required in order to reach a functional prototype, and giving operators the flexibility to choose on which compute hosts SEV should be allowed. But conversely it has the disadvantages of requiring merging of hardcoded references to a custom trait into nova’s
masterbranch, requiring extra work for operators, and incurring the risk of a compute node which isn’t capable of SEV (either due to missing hardware or software support) being marked as SEV-capable, which would most likely result in VM launch failures.
Rather than using a single trait to both facilitate the matching of instances requiring SEV with SEV-capable compute hosts and indicate to nova’s libvirt driver that SEV should be used when booting, the trait could be used solely for scheduling of the instance on SEV hosts, and an additional extra spec property such as
hw:sev_policycould be used to ensure that the VM is defined and booted with the necessary extra SEV-specific domain XML.
However this would create extra friction for the administrators defining SEV-enabled flavors, and it is also hard to imagine why anyone would want a flavor which requires instances to run on SEV-capable hosts without simultaneously taking advantage of those hosts’ SEV capability. Additionally, whilst this remains a simple Boolean toggle, using a single trait remains consistent with a pre-existing upstream agreement on how to specify options that impact scheduling and configuration.
Rather than using a standard trait, a normal flavor extra spec could be used to require the SEV feature; however it is understood that this approach is less preferable because traits provide consistent naming for CPU features in some virt drivers, and querying traits is efficient.
Data model impact¶
A new trait will be used to denote SEV-capable compute hosts.
No new data objects or database schema changes will be required.
REST API impact¶
None, although future work may require extending the REST API so that users can verify the hardware’s attestation that the memory was encrypted correctly by the firmware. However if such an extension would not be useful in other virt drivers across multiple CPU vendors, it may be preferable to deliver this functionality via an independent AMD-specific service.
This change does not add or handle any secret information other than of course data within the guest VM’s encrypted memory. The secrets used to implement SEV are locked inside the AMD hardware. The hardware random number generator uses the CTR_DRBG construct from NIST SP 800-90A which has not been found to be susceptible to any back doors. It uses AES counter mode to generate the random numbers.
SEV protects data of a VM from attacks originating from outside the VM, including the hypervisor and other VMs. Attacks which trick the hypervisor into reading pages from another VM will not work because the data obtained will be encrypted with a key which is inaccessible to the attacker and the hypervisor. SEV protects data in caches by tagging each cacheline with the owner of that data which prevents the hypervisor and other VMs from reading the cached data.
SEV does not protect against side-channel attacks against the VM itself or attacks on software running in the VM. It is important to keep the VM up to date with patches and properly configure the software running on the VM.
This first proposed implementation provides some protection but is
notably missing the ability for the cloud user to verify the
attestation which SEV can provide using the
firmware call. Adding such attestation ability in the future would
mean that much less trust would need to be placed in the cloud
administrator because the VM would be encrypted and integrity
protected using keys the cloud user provides to the SEV firmware over
a protected channel. The cloud user would then know with certainty
that they are running the proper image, that the memory is indeed
encrypted, and that they are running on an authentic AMD platform with
SEV hardware and not an impostor platform setup to steal their data.
The cloud user can verify all of this before providing additional
secrets to the VM, for example storage decryption keys. This spec is
a proposed first step in the process of obtaining the full value that
SEV can offer to prevent the cloud administrator from being able to
access the data of the cloud users.
It is strongly recommended that the OpenStack Security Group is kept in the loop and given the opportunity to review each stage of work, to help ensure that security is implemented appropriately.
It may be desirable to access the information that the instance is running encrypted, e.g. a billing cloud provider might want to impose a security surcharge, whereby encrypted instances are billed differently to unencrypted ones. However this should require no immediate impact on notifications, since the instance payload in the versioned notification has the flavor along with its extra specs, where the SEV enablement trait would be defined.
In the case where the SEV trait is specified on the image backing the
server rather than on the flavor, the notification would just have the
image UUID in it. The consumer could look up the image by UUID to
check for the presence of the SEV trait, although this does open up a
potential race window where image properties could change after the
instance was created. This could be remedied by future work which
would include image properties in the instance launch notification, or
storing the image metadata in
instance_extra as is currently done
for the flavor.
Other end user impact¶
The end user will harness SEV through the existing mechanisms of traits in flavor extra specs and image properties. Later on it may make sense to add support for scheduler hints (see the Future Work section below).
No performance impact on nova is anticipated.
Preliminary testing indicates that the expected performance impact on a VM of enabling SEV is moderate; a degradation of 1% to 6% has been observed depending on the particular workload and test. More details can be seen in slides 4–6 of AMD’s presentation on SEV-ES at the 2017 Linux Security Summit.
If compression is being used on swap disks then more storage may be required because the memory of encrypted VMs will not compress to a smaller size.
Memory deduplication mechanisms such as KSM (kernel samepage merging) would be rendered ineffective.
Other deployer impact¶
In order for users to be able to use SEV, the operator will need to perform the following steps:
Deploy SEV-capable hardware as nova compute hosts.
Ensure that they have an appropriately configured software stack, so that the various layers are all SEV ready:
kernel >= 4.16
QEMU >= 2.12
libvirt >= 4.5
ovmf >= commit 75b7aa9528bd 2018-07-06
Finally, a cloud administrator will need to define SEV-enabled flavors as described above, unless it is sufficient for users to define SEV-enabled images.
- Primary assignee:
- Other contributors:
Various developers from SUSE and AMD
It is expected that following sequence of extensions, or similar, will need to be made to nova’s libvirt driver:
Add detection of host SEV capabilities as detailed above.
Consume the new SEV detection code in order to provide the
Add a new
nova.virt.libvirt.LibvirtDriver._set_features()to add the required XML to the VM’s domain definition if the new trait is in the flavor of the VM being launched.
Since live migration between hosts is not (yet) supported for
SEV-encrypted instances, nor
prevent nova from live-migrating any SEV-encrypted instance, or resizing onto a different compute host. Alternatively, nova could catch the error raised by QEMU, which would be propagated via libvirt, and handle it appropriately. We could build in higher-layer checks later if it becomes a major nuisance for operators.
Similarly, attempts to suspend / resume an SEV-encrypted domain are not yet supported, and therefore should either be prevented, or the error caught and handled.
(Stretch goal) Adopt one of the following suggested code changes for reducing or even eliminating usage on x86 architectures of the IDE bus for CD-ROM devices such as the config drive:
Simply change the hardcoded usage of an IDE bus for CD-ROMs on x86 to
scsito be consistent with all other CPU architectures, since it appears that the use of
ideonly remains due to legacy x86 code and the fact that support for other CPU architectures was added later. The
hw_cdrom_bus=ideimage property could override this on legacy images lacking SCSI support.
Auto-detect the cases where the VM has no IDE controller, and automatically switch to
virtio-scsiin those cases.
Introduce a new
nova.confoption for specifying the default bus to use for CD-ROMs. Then for instance the default could be
scsi(for consistency with other CPU architectures) or
hw_cdrom_busoverriding this value where needed. This is likely to be more future-proof as the use of very old machine types is gradually phased out, although the downside is a small risk of breaking legacy images.
If there exist clouds where such legacy x86 images are common, the option could then be set to
hw_cdrom_bus=virtiooverriding when newer machine types are required for SEV (or any other reason). Although this is perhaps sufficiently unlikely as to make a new config option overkill.
Additionally documentation should be written, as detailed in the Documentation Impact section below.
Looking beyond Stein, there is scope for several strands of additional work for enriching nova’s SEV support:
Extend the ComputeCapabilitiesFilter scheduler filter to support scheduler hints, so that SEV can be chosen to be enabled per instance, eliminating the need for operators to configure SEV-specific flavors or images.
If there is sufficient demand from users, make the SEV policy configurable via an extra spec or image property.
Provide some mechanism by which users can access the attestation measurement provided by SEV’s
LAUNCH_MEASUREcommand, in order to verify that the guest memory was encrypted correctly by the firmware. For example, nova’s API could be extended; however if this cannot be done in a manner which applies across virt drivers / CPU vendors, then it may fall outside the scope of nova and require an alternative approach such as a separate AMD-only endpoint.
Special hardware which supports SEV for development, testing, and CI.
Recent versions of the hypervisor software stack which all support SEV, as detailed in Other deployer impact above.
UEFI bugs will need to be addressed if not done so already:
fakelibvirt test driver will need adaptation to emulate
Corresponding unit/functional tests will need to be extended or added to cover:
detection of SEV-capable hardware and software, e.g. perhaps as an extension of
the use of a trait to include extra SEV-specific libvirt domain XML configuration, e.g. within
There will likely be issues to address due to hard-coded assumptions oriented towards Intel CPUs either in Nova code or its tests.
Tempest tests could also be included if SEV hardware is available, either in the gate or via third-party CI.
The KVM section of the Configuration Guide should be updated with details of how to set up SEV-capable hypervisors. It would be prudent to mention the current limitations here too, including the impact on config drive configuration, compute host maintenance, the need to correctly calculate reserved_host_memory_mb based on the expected maximum number of SEV guests simultaneously running on the host, and the details provided above (such as memory region sizes) which cover how to calculate it correctly.
Other non-nova documentation should be updated too:
The documentation for os-traits should be extended where appropriate.
The “Hardening the virtualization layers” section of the Security Guide would be an ideal location to describe the whole process of providing and consuming SEV functionality.