libvirt driver launching AMD SEV-encrypted instances

https://blueprints.launchpad.net/nova/+spec/amd-sev-libvirt-support

This spec proposes work required in order for nova’s libvirt driver to support launching of KVM instances which are encrypted using AMD’s SEV (Secure Encrypted Virtualization) technology.

Problem description

While data is typically encrypted today when stored on disk, it is stored in DRAM in the clear. This can leave the data vulnerable to snooping by unauthorized administrators or software, or by hardware probing. New non-volatile memory technology (NVDIMM) exacerbates this problem, since an NVDIMM chip can be physically removed from a system with the data intact, similar to a hard drive. Without encryption, any stored information such as sensitive data, passwords, or secret keys can be easily compromised.

AMD’s SEV offers a VM protection technology which transparently encrypts the memory of each VM with a unique key. It can also calculate a signature of the memory contents, which can be sent to the VM’s owner as an attestation that the memory was encrypted correctly by the firmware. SEV is particularly applicable to cloud computing since it can reduce the amount of trust VMs need to place in the hypervisor and administrator of their host system.

Use Cases

  1. As a cloud administrator, in order that my users can have greater confidence in the security of their running instances, I want to provide a flavor containing an SEV-specific required trait extra spec which will allow users booting instances with that flavor to ensure that their instances run on an SEV-capable compute host with SEV encryption enabled.

  2. As a cloud user, in order to not have to trust my cloud operator with my secrets, I want to be able to boot VM instances with SEV functionality enabled.

Proposed change

For Stein, the goal is a minimal but functional implementation which would satisfy the above use cases. It is proposed that initial development and testing would include the following deliverables:

  • Add detection of host SEV capabilities. Logic is required to check that the various layers of the hardware and software hypervisor stack are SEV-capable:

    • SEV support must be enabled in the host kernel’s kvm_amd module, as indicated by the contents of /sys/module/kvm_amd/parameters/sev on the compute host.

    • QEMU and libvirt must be new enough to expose SEV, as indicated by the presence of the sev element in libvirt’s domain capabilities XML.

    Note that both checks are required, since the presence of the first does not imply the second. (A minimal detection sketch appears later in this section, alongside the discussion of cbitpos and reducedPhysBits.)

  • Support a standard trait which would be automatically detected per compute host based on the above logic. This would most likely be called HW_CPU_AMD_SEV or similar, as an extension of the existing CPU traits mapping.

  • When present in the flavor, this standard trait would indicate that the libvirt driver should include extra XML in the guest’s domain definition, in order to ensure the following:

    • SEV security is enabled.

    • The boot disk cannot be virtio-blk (due to a resource constraint w.r.t. bounce buffers).

    • The VM uses machine type q35 and UEFI via OVMF. (q35 is required in order to bind all the virtio devices to the PCIe bridge so that they use virtio 1.0 and not virtio 0.9, since QEMU’s iommu_platform feature is added in virtio 1.0 only.)

    • The iommu attribute is on for all virtio devices. Despite the name, this does not require the guest or host to have an IOMMU device, but merely enables the virtio flag which indicates that virtualized DMA should be used. This ties into the SEV code to handle memory encryption/decryption, and prevents IO buffers being shared between host and guest.

      The DMA will go through bounce buffers, so some overhead is expected compared to non-SEV guests.

      (Note: virtio-net device queues are not encrypted.)

    • All the memory regions allocated by QEMU must be pinned, so that they cannot be swapped to disk. This can be achieved by setting a hard memory limit via <hard_limit> in the <memtune> section of the domain’s XML. This does not reflect a requirement for additional memory; it is only required in order to achieve the memory pinning.

      Another method for pinning the memory is to enable hugepages by booting with the hw:mem_page_size=large property set either on the flavor or the image, although this has the minor disadvantages of requiring the operator or user to remember one extra detail, and of potentially requiring undesirable duplication of flavors or images.

      Note that this memory pinning is expected to be a temporary requirement; the latest firmware versions already support page copying (as documented by the COPY API in the AMD SEV-KM API Specification), so once the OS starts supporting the page-move or page-migration command, the pinning will no longer be needed.

      Based on instrumentation of QEMU, the limit per VM should be calculated and accounted for as follows:

      Memory region type       Size                 Accounting mechanism
      -----------------------  -------------------  -----------------------
      VM RAM                   set by flavor        placement service
      video memory             set by flavor/image  placement service
      UEFI ROM                 4096KB               reserved_host_memory_mb
      UEFI var store (pflash)  4096KB               reserved_host_memory_mb
      pc.rom                   128KB                reserved_host_memory_mb
      isa-bios                 128KB                reserved_host_memory_mb
      ACPI tables              2384KB               reserved_host_memory_mb

It is also recommended to include an additional padding of at least 256KB for safety, since ROM sizes can occasionally change. For example the total of 10832KB required here for ROMs / ACPI tables should be rounded up to 16MB.

The first two values are expected to commonly vary per VM, and are already accounted for dynamically by the placement service.

The remainder have traditionally (i.e. for non-SEV instances) been accounted for alongside the overhead for the host OS via nova’s memory pool defined by the reserved_host_memory_mb config option, and this does not need to change. However, whilst the overhead incurred is no different from that of non-SEV instances, it is much more important to get the hard limit right when pinning memory: if it is too low, the VM will be killed, and if it is too high, there is a risk of the host’s OOM killer being invoked or, failing that, of the host crashing because it cannot reclaim the memory pinned by the guest.

Therefore it may be prudent to implement an extra check which multiplies this reservation requirement by the number of instances and ensures that it does not cause the host’s memory usage to exceed what’s available. This can probably be initially implemented as a check in the driver, but regardless, the way to avoid this over-commitment must be documented so that operators can correctly plan memory usage and configure their cloud accordingly.
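To make this accounting concrete, the following is a minimal sketch of how a per-VM hard limit could be derived; the helper name and the fixed-overhead constant are illustrative only and are not existing nova code. (For comparison, the example domain XML below uses a more generous round figure of 4718592 KiB, i.e. the flavor’s 4GB plus 512MB of headroom.)

ROM_AND_ACPI_OVERHEAD_MB = 16  # 10832KB of ROM/ACPI regions plus padding, rounded up


def sev_hard_limit_kb(flavor_ram_mb, video_ram_mb):
    # Illustrative sketch only -- not existing nova code.
    total_mb = flavor_ram_mb + video_ram_mb + ROM_AND_ACPI_OVERHEAD_MB
    return total_mb * 1024

# e.g. a 4096MB flavor with 64MB of video RAM:
#     sev_hard_limit_kb(4096, 64) == 4276224  (KiB)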

So for example assuming a 4GB VM:

<domain type='kvm'>
  <os>
    <type arch='x86_64' machine='pc-q35-2.11'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x86_64-ms-4m-code.bin</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/sles15-sev-guest_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>
  <launchSecurity type='sev'>
    <cbitpos>47</cbitpos>
    <reducedPhysBits>1</reducedPhysBits>
    <policy>0x0037</policy>
  </launchSecurity>
  <memtune>
    <hard_limit unit='KiB'>4718592</hard_limit>
    ...
  </memtune>
  <devices>
    <rng model='virtio'>
      <driver iommu='on'/>
      ...
    </rng>
    <memballoon model='virtio'>
      <driver iommu='on' />
      ...
    </memballoon>
    ...
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1'  primary='yes'/>
    </video>
    ...
  </devices>
  ...
</domain>

If SEV’s requirement of a Q35 machine type cannot be satisfied either by the hw_machine_type image property (if present) or by the value of libvirt.hw_machine_type in nova.conf (which is not set by default), then an exception should be raised so that the build fails.

cbitpos and reducedPhysBits are dependent on the processor family, and can be obtained through the sev element from the domain capabilities.
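For illustration, here is a minimal sketch of both the host capability checks described at the start of this section and the retrieval of these two values, using the libvirt python bindings; this is not actual nova driver code, and error handling, caching and the exact integration point are omitted.

import xml.etree.ElementTree as ET

import libvirt


def host_sev_support():
    """Return (supported, cbitpos, reduced_phys_bits) for this host."""
    # Check that SEV is enabled in the host kernel's kvm_amd module.
    try:
        with open('/sys/module/kvm_amd/parameters/sev') as f:
            kernel_sev = f.read().strip() in ('1', 'Y')
    except IOError:
        kernel_sev = False

    # Check that QEMU/libvirt report SEV support in the domain capabilities.
    conn = libvirt.open('qemu:///system')
    caps_xml = conn.getDomainCapabilities(None, 'x86_64', None, 'kvm')
    sev = ET.fromstring(caps_xml).find('./features/sev')

    if not kernel_sev or sev is None or sev.get('supported') != 'yes':
        return (False, None, None)
    return (True,
            int(sev.findtext('cbitpos')),
            int(sev.findtext('reducedPhysBits')))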

policy allows a particular SEV policy, as documented in the AMD SEV-KM API Specification. Initially the policy will be hardcoded and not modifiable by cloud tenants or cloud operators. The policy will be:

#define SEV_POLICY_NORM \
    ((SEV_POLICY)(SEV_POLICY_NODBG|SEV_POLICY_NOKS| \
      SEV_POLICY_ES|SEV_POLICY_DOMAIN|SEV_POLICY_SEV))

which equates to 0x0037. In the future, once support is added to QEMU and libvirt, this policy will permit live migration to other machines in the same cluster [1] (i.e. with the same OCA certificate), requires SEV-ES, and does not permit other guests or the hypervisor to directly inspect memory. If the upstream support for SEV-ES does not arrive in time, then SEV-ES will not be included in the policy.
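For clarity, the individual policy bits behind this value, with bit positions as documented in the AMD SEV API specification, combine as follows:

# SEV guest policy bits, per the AMD SEV API specification:
SEV_POLICY_NODBG  = 0x01  # debugging of the guest by the hypervisor is disallowed
SEV_POLICY_NOKS   = 0x02  # sharing keys with other guests is disallowed
SEV_POLICY_ES     = 0x04  # SEV-ES is required
SEV_POLICY_NOSEND = 0x08  # (not set here) sending the guest elsewhere remains allowed
SEV_POLICY_DOMAIN = 0x10  # the guest must not leave the domain
SEV_POLICY_SEV    = 0x20  # the guest must only be sent to SEV-capable platforms

assert (SEV_POLICY_NODBG | SEV_POLICY_NOKS | SEV_POLICY_ES |
        SEV_POLICY_DOMAIN | SEV_POLICY_SEV) == 0x0037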

A future spec could be submitted to make this configurable via an extra spec or image property.

For reference, the AMDSEV GitHub repository provides a complete example of a domain’s XML definition with libvirt’s SEV options enabled.

As a consequence of the work described above, images with the property trait:HW_CPU_AMD_SEV=required would affect the launching of instances in the same way as flavors containing the equivalent extra spec.

Limitations

The following limitations will be removed in the future as the hardware, firmware, and various layers of software receive new features:

  • SEV-encrypted VMs cannot yet be live-migrated or suspended, and consequently cannot be resumed. As already mentioned, support is coming in the future. However this does mean that in the short term, usage of SEV will have an impact on compute node maintenance, since SEV-encrypted instances will need to be fully shut down before migrating off an SEV host.

  • SEV-encrypted VMs cannot contain directly accessible host devices (PCI passthrough). So for example mdev vGPU support will not currently work. However technologies based on vhost-user should work fine.

  • The boot disk of SEV-encrypted VMs cannot be virtio-blk. Using virtio-scsi or SATA for the boot disk works as expected, as does virtio-blk for non-boot disks.

The following limitations are expected long-term:

  • The operating system running in an encrypted virtual machine must contain SEV support.

  • The q35 machine type does not provide an IDE controller, therefore IDE devices are not supported. In particular this means that nova’s libvirt driver’s current default behaviour on the x86_64 architecture of attaching the config drive as an iso9660 IDE CD-ROM device will not work. There are two potential workarounds:

    1. Change CONF.config_drive_format in nova.conf from its default value iso9660 to vfat. This will result in virtio being used instead. However this per-host setting could potentially break images with legacy OS’s which expect the config drive to be an IDE CD-ROM. It would also not deal with other CD-ROM devices.

    2. Set the (largely undocumented) hw_cdrom_bus image property to virtio, which is recommended as a replacement for ide, and hw_scsi_model to virtio-scsi.

    Some potentially cleaner long-term solutions which require code changes are suggested as a stretch goal in the Work Items section below.

For the sake of eliminating any doubt, the following actions are not expected to be limited when SEV encryption is used:

  • Cold migration or shelve, since they power off the VM before the operation at which point there is no encrypted memory (although this could change since there is work underway to add support for PMEM)

  • Snapshot, since it only snapshots the disk

  • Evacuate, since this is only initiated when the VM is assumed to be dead or there is a good reason to kill it

  • Attaching any volumes, as long as they do not require attaching via an IDE bus

  • Use of spice / VNC / serial / RDP consoles

  • VM guest virtual NUMA (a.k.a. vNUMA)

Alternatives

  1. Rather than immediately implementing automatic detection of SEV-capable hosts and providing access to these via a new standard trait (HW_CPU_AMD_SEV or similar), operators could initially be required to mark SEV-capable compute hosts manually with a custom trait.

    This would have the minor advantages of slightly decreasing the amount of effort required in order to reach a functional prototype, and giving operators the flexibility to choose on which compute hosts SEV should be allowed. But conversely it has the disadvantages of requiring merging of hardcoded references to a custom trait into nova’s master branch, requiring extra work for operators, and incurring the risk of a compute node which isn’t capable of SEV (either due to missing hardware or software support) being marked as SEV-capable, which would most likely result in VM launch failures.

  2. Rather than using a single trait to both facilitate the matching of instances requiring SEV with SEV-capable compute hosts and indicate to nova’s libvirt driver that SEV should be used when booting, the trait could be used solely for scheduling of the instance on SEV hosts, and an additional extra spec property such as hw:sev_policy could be used to ensure that the VM is defined and booted with the necessary extra SEV-specific domain XML.

    However this would create extra friction for the administrators defining SEV-enabled flavors, and it is also hard to imagine why anyone would want a flavor which requires instances to run on SEV-capable hosts without simultaneously taking advantage of those hosts’ SEV capability. Additionally, whilst this remains a simple Boolean toggle, using a single trait remains consistent with a pre-existing upstream agreement on how to specify options that impact scheduling and configuration.

  3. Rather than using a standard trait, a normal flavor extra spec could be used to require the SEV feature; however it is understood that this approach is less preferable because traits provide consistent naming for CPU features in some virt drivers, and querying traits is efficient.

Data model impact

A new trait will be used to denote SEV-capable compute hosts.

No new data objects or database schema changes will be required.

REST API impact

None, although future work may require extending the REST API so that users can verify the hardware’s attestation that the memory was encrypted correctly by the firmware. However if such an extension would not be useful in other virt drivers across multiple CPU vendors, it may be preferable to deliver this functionality via an independent AMD-specific service.

Security impact

This change does not add or handle any secret information other than of course data within the guest VM’s encrypted memory. The secrets used to implement SEV are locked inside the AMD hardware. The hardware random number generator uses the CTR_DRBG construct from NIST SP 800-90A which has not been found to be susceptible to any back doors. It uses AES counter mode to generate the random numbers.

SEV protects data of a VM from attacks originating from outside the VM, including the hypervisor and other VMs. Attacks which trick the hypervisor into reading pages from another VM will not work because the data obtained will be encrypted with a key which is inaccessible to the attacker and the hypervisor. SEV protects data in caches by tagging each cacheline with the owner of that data which prevents the hypervisor and other VMs from reading the cached data.

SEV does not protect against side-channel attacks against the VM itself or attacks on software running in the VM. It is important to keep the VM up to date with patches and properly configure the software running on the VM.

This first proposed implementation provides some protection but is notably missing the ability for the cloud user to verify the attestation which SEV can provide using the LAUNCH_MEASURE firmware call. Adding such attestation ability in the future would mean that much less trust would need to be placed in the cloud administrator because the VM would be encrypted and integrity protected using keys the cloud user provides to the SEV firmware over a protected channel. The cloud user would then know with certainty that they are running the proper image, that the memory is indeed encrypted, and that they are running on an authentic AMD platform with SEV hardware and not an impostor platform setup to steal their data. The cloud user can verify all of this before providing additional secrets to the VM, for example storage decryption keys. This spec is a proposed first step in the process of obtaining the full value that SEV can offer to prevent the cloud administrator from being able to access the data of the cloud users.

It is strongly recommended that the OpenStack Security Group is kept in the loop and given the opportunity to review each stage of work, to help ensure that security is implemented appropriately.

Notifications impact

It may be desirable for notification consumers to be able to tell that an instance is running encrypted, e.g. a cloud provider might want to impose a security surcharge whereby encrypted instances are billed differently to unencrypted ones. However this should have no immediate impact on notifications, since the instance payload in the versioned notification includes the flavor along with its extra specs, where the SEV enablement trait would be defined.

In the case where the SEV trait is specified on the image backing the server rather than on the flavor, the notification would just have the image UUID in it. The consumer could look up the image by UUID to check for the presence of the SEV trait, although this does open up a potential race window where image properties could change after the instance was created. This could be remedied by future work which would include image properties in the instance launch notification, or storing the image metadata in instance_extra as is currently done for the flavor.

Other end user impact

The end user will harness SEV through the existing mechanisms of traits in flavor extra specs and image properties. Later on it may make sense to add support for scheduler hints (see the Future Work section below).

Performance Impact

No performance impact on nova is anticipated.

Preliminary testing indicates that the expected performance impact on a VM of enabling SEV is moderate; a degradation of 1% to 6% has been observed depending on the particular workload and test. More details can be seen in slides 4–6 of AMD’s presentation on SEV-ES at the 2017 Linux Security Summit.

If compression is being used on swap disks then more storage may be required because the memory of encrypted VMs will not compress to a smaller size.

Memory deduplication mechanisms such as KSM (kernel samepage merging) would be rendered ineffective.

Other deployer impact

In order for users to be able to use SEV, the operator will need to perform the following steps:

  • Deploy SEV-capable hardware as nova compute hosts.

  • Ensure that they have an appropriately configured software stack, so that the various layers are all SEV ready:

    • kernel >= 4.16

    • QEMU >= 2.12

    • libvirt >= 4.5

    • ovmf >= commit 75b7aa9528bd 2018-07-06

Finally, a cloud administrator will need to define SEV-enabled flavors as described above, unless it is sufficient for users to define SEV-enabled images.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:

adam.spiers

Other contributors:

Various developers from SUSE and AMD

Work Items

It is expected that the following sequence of extensions, or similar, will need to be made to nova’s libvirt driver:

  1. Add detection of host SEV capabilities as detailed above.

  2. Consume the new SEV detection code in order to provide the HW_CPU_AMD_SEV trait.

  3. Add a new nova.virt.libvirt.LibvirtConfigGuestFeatureSEV class (a rough sketch of the XML element this would need to generate appears after this list).

  4. Extend nova.virt.libvirt.LibvirtDriver._set_features() to add the required XML to the VM’s domain definition if the new trait is in the flavor of the VM being launched.

  5. Since live migration between hosts is not (yet) supported for SEV-encrypted instances, prevent nova from live-migrating any SEV-encrypted instance, or from resizing it onto a different compute host. Alternatively, nova could catch the error raised by QEMU, which would be propagated via libvirt, and handle it appropriately. We could build in higher-layer checks later if it becomes a major nuisance for operators.

  6. Similarly, attempts to suspend / resume an SEV-encrypted domain are not yet supported, and therefore should either be prevented, or the error caught and handled.

  7. (Stretch goal) Adopt one of the following suggested code changes for reducing or even eliminating usage on x86 architectures of the IDE bus for CD-ROM devices such as the config drive:

    1. Simply change the hardcoded usage of an IDE bus for CD-ROMs on x86 to scsi to be consistent with all other CPU architectures, since it appears that the use of ide only remains due to legacy x86 code and the fact that support for other CPU architectures was added later. The hw_cdrom_bus=ide image property could override this on legacy images lacking SCSI support.

    2. Auto-detect the cases where the VM has no IDE controller, and automatically switch to scsi or virtio-scsi in those cases.

    3. Introduce a new nova.conf option for specifying the default bus to use for CD-ROMs. Then for instance the default could be scsi (for consistency with other CPU architectures) or virtio, with hw_cdrom_bus overriding this value where needed. This is likely to be more future-proof as the use of very old machine types is gradually phased out, although the downside is a small risk of breaking legacy images.

      If there exist clouds where such legacy x86 images are common, the option could then be set to ide, with hw_cdrom_bus=virtio overriding it when newer machine types are required for SEV (or any other reason), although this is perhaps sufficiently unlikely as to make a new config option overkill.
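As an indication of what work items 3 and 4 amount to, here is a minimal standalone sketch of building the launchSecurity element; it deliberately does not mirror nova’s actual LibvirtConfig* class hierarchy, and the function name is illustrative only.

import xml.etree.ElementTree as ET


def build_launch_security(cbitpos, reduced_phys_bits, policy=0x0037):
    # Standalone sketch -- nova's real implementation would extend the
    # existing LibvirtConfig* classes rather than build elements directly.
    sec = ET.Element('launchSecurity', type='sev')
    ET.SubElement(sec, 'cbitpos').text = str(cbitpos)
    ET.SubElement(sec, 'reducedPhysBits').text = str(reduced_phys_bits)
    ET.SubElement(sec, 'policy').text = '0x%04x' % policy
    return sec

# ET.tostring(build_launch_security(47, 1)) yields (wrapped for readability):
#   <launchSecurity type="sev"><cbitpos>47</cbitpos>
#   <reducedPhysBits>1</reducedPhysBits><policy>0x0037</policy></launchSecurity>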

Additionally documentation should be written, as detailed in the Documentation Impact section below.

Future work

Looking beyond Stein, there is scope for several strands of additional work for enriching nova’s SEV support:

  • Extend the ComputeCapabilitiesFilter scheduler filter to support scheduler hints, so that SEV can be chosen to be enabled per instance, eliminating the need for operators to configure SEV-specific flavors or images.

  • If there is sufficient demand from users, make the SEV policy configurable via an extra spec or image property.

  • Provide some mechanism by which users can access the attestation measurement provided by SEV’s LAUNCH_MEASURE command, in order to verify that the guest memory was encrypted correctly by the firmware. For example, nova’s API could be extended; however if this cannot be done in a manner which applies across virt drivers / CPU vendors, then it may fall outside the scope of nova and require an alternative approach such as a separate AMD-only endpoint.

Dependencies

Testing

The fakelibvirt test driver will need adaptation to emulate SEV-capable hardware.

Corresponding unit/functional tests will need to be extended or added to cover:

  • detection of SEV-capable hardware and software, e.g. perhaps as an extension of nova.tests.functional.libvirt.test_report_cpu_traits.LibvirtReportTraitsTests

  • the use of a trait to include extra SEV-specific libvirt domain XML configuration, e.g. within nova.tests.unit.virt.libvirt.test_config

There will likely be issues to address due to hard-coded assumptions oriented towards Intel CPUs either in Nova code or its tests.

Tempest tests could also be included if SEV hardware is available, either in the gate or via third-party CI.
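As a very rough indication of the style of unit-level assertion envisaged (hypothetical test name and a hand-written XML snippet; real tests would exercise the driver’s actual XML generation):

import unittest
import xml.etree.ElementTree as ET


class SEVDomainXMLTest(unittest.TestCase):
    def test_launch_security_element(self):
        # Hand-written snippet standing in for driver-generated domain XML.
        xml = ("<domain type='kvm'>"
               "<launchSecurity type='sev'>"
               "<cbitpos>47</cbitpos>"
               "<reducedPhysBits>1</reducedPhysBits>"
               "<policy>0x0037</policy>"
               "</launchSecurity>"
               "</domain>")
        sec = ET.fromstring(xml).find('launchSecurity')
        self.assertEqual('sev', sec.get('type'))
        self.assertEqual('0x0037', sec.findtext('policy'))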

Documentation Impact

  • A new entry should be added in the Feature Support Matrix, which refers to the new trait and shows the current limitations.

  • The KVM section of the Configuration Guide should be updated with details of how to set up SEV-capable hypervisors. It would be prudent to mention the current limitations here too, including the impact on config drive configuration and on compute host maintenance, and the need to correctly calculate reserved_host_memory_mb based on the expected maximum number of SEV guests simultaneously running on the host, referring to the details provided above (such as memory region sizes) which explain how to calculate it correctly.

Other non-nova documentation should be updated too where relevant.

References

History

Revisions

Release Name    Description
------------    -----------
Stein           Introduced