support virtual persistent memory

https://blueprints.launchpad.net/nova/+spec/virtual-persistent-memory

Virtual persistent memory is now supported in both QEMU and libvirt. This spec seeks to enable this support in OpenStack Nova.

Problem description

For many years computer applications organized their data between two tiers: memory and storage. Emerging persistent memory technologies introduce a third tier. Persistent memory (or pmem for short) is accessed like volatile memory, using processor load and store instructions, but it retains its contents across power loss like storage.

Virtualization layer has already supported virtual persistent memory which means virtual machines now can have physical persistent memory as the backend of virtual persistent memory. As far as Nova is concerned, several problems need to be addressed:

  • How is the physical persistent memory managed and presented as virtual persistent memory

  • The discovery and resource tracking of persistent memory

  • How does the user specify the desired amount of virtual persistent memory

  • What is the life cycle of virtual persistent memory

Use Cases

Provide applications with the ability to load large contiguous segments of memory that retain their data across power cycles.

Besides data persistence, persistent memory is less expensive than DRAM and comes with much larger capacities. This is an appealing feature for scenarios that request huge amounts of memory such as high performance computing (HPC).

There has been some exploration by applications which heavily use memory devices such as in memory databases. To name a few: redis, rocksdb, oracle, SAP HANA and Aerospike.

Note

This spec only intends to enable virtual persistent memory for the libvirt KVM driver.

Proposed change

Background

The most efficient way for an applications to use persistent memory is to memory map (mmap()) a portion of persistent memory into the address space of the application. Once the mapping is done, the application accesses the persistent memory directly (also called direct access), meaning without going through kernel or whatever other software in the middle. Persistent memory has two types of hardware interfaces – “PMEM” and “BLK”. Since “BLK” adopts an aperture model to access persistent memory, it does not support direct access. For the sake of efficiency, this spec only proposes to use persistent memory accessed by “PMEM” interface as the backend for QEMU virtualized persistent memory.

Persistent memory must be partitioned into pmem namespaces for applications to use. There are several modes of pmem namespaces for different use scenarios. Mode devdax and mode fsdax both support direct access. Mode devdax gives out a character device for a namespace, thus applications can mmap() the entire namespace into their address spaces. Whereas mode fsdax gives out a block device. It is recommended to use mode devdax to assign persistent memory to virtual machines. Please refer to virtual NVDIMM backends and NVDIMM Linux kernel document for details.

Important

So this spec only proposes to use persistent memory namespaces in devdax mode as QEMU virtual persistent memory backends.

The devdax persistent memory namespaces require contiguous physical space and are not managed in pages as ordinary system memory. This introduces a fragmentation issue with regard to multiple namespaces are created and used by multiple applications. As shown in below diagram, four applications are using four namespaces each of size 100GB:

  +-----+   +-----+   +-----+    +-----+
  |app1 |   |app2 |   |app3 |    |app4 |
  +--+--+   +--+--+   +--+--+    +--+--+
     |         |         |          |
     |         |         |          |
+----v----+----v----+----v----+-----v---+
|         |         |         |         |
|  100GB  |  100GB  |  100GB  |  100GB  |
|         |         |         |         |
+---------+---------+---------+---------+

After the termination of app2 and app4, it turns out to be:

  +-----+             +-----+
  |app1 |             |app3 |
  +--+--+             +--+--+
     |                   |
     |                   |
+----v----+---------+----v----+---------+
|         |         |         |         |
|  100GB  |  100GB  |  100GB  |  100GB  |
|         |         |         |         |
+---------+---------+---------+---------+

The total size of free space is 200GB. However a devdax mode namespace of 200GB size can not be created.

Persistent memory namespace management and resource tracking

Due to the aforementioned fragmentation issue, persistent memory can not be managed in the similar way as system memory. In other words, dynamically creating and deleting persistent memory namespaces upon VM creation and deletion will result in fragmentation and also a challenge to track persistent memory resource. The proposed approach is to use pre-created fix sized namespaces. In other words, the cloud admin creates persistent memory of the desired sizes before Nova is deployed on a certain host. And the cloud admin puts the namespace information into nova config file (details below). Nova compute agent discovers the namespaces by parsing the config file to determine what namespaces it can allocate to a guest. The discovered persistent memory namespaces will be reported to the placement service as inventories of a custom resource class associated with the ROOT resource provider.

Custom Resource Classes are used to represent persistent memory namespace resource. The naming convention of the custom resource classes being used is:

CUSTOM_PMEM_NAMESPACE_$LABEL

$LABEL is variable part of the resource class name defined by the admin to be associated with a certain number of persistent memory namespaces. It normally is the size of namespaces in any desired units. It can also be a string describing the capacities – such as ‘SMALL’, ‘MEDIUM’ or ‘LARGE’. Admin shall properly define the value of ‘$LABEL’ for each namespace.

The association between $LABEL and persistent memory namespaces is defined by a new configuration option ‘CONF.libvirt.pmem_namespaces’. This config option is of string type in below format:

"$LABEL:$NSNAME[|$NSNAME][,$LABEL:$NSNAME[|$NSNAME]]"

$NSNAME is the name of the persistent memory namespace that falls into the resource class named CUSTOM_PMEM_NAMESPACE_$LABEL. A name can be given to a persitent memory namespace upon creation by the “-n/–name” option to the ndctl command.

To give an example, on a certain host, there might be a below configuration:

"128G:ns0|ns1|ns2|ns3,262144MB:ns4|ns5,MEDIUM:ns6|ns7"

The interpretation of the above configuration is that this host has 4 persistent memory namespaces (ns0, ns1, ns2, ns3) of resource class CUSTOM_PMEM_NAMESPACE_128G, 2 namespaces (ns4, ns5) of resource class CUSTOM_PMEM_NAMESPACE_262144MB, and 2 namespaces (ns6, ns7) of resource class CUSTOM_PMEM_NAMESPACE_MEDIUM.

The ‘total’ value of the inventory is the number of the persistent memory namespaces belong to this resource class.

The ‘max_unit’ is set to the same value as ‘total’ since it is possible to attach all of the persistent memory namespaces in a certain resource class to one instance.

The values of ‘min_unit’ and ‘step_size’ are 1.

The value of ‘allocation_ratio’ is 1.0.

In case of the above example, the response to a GET request to this resource provider inventories is:

"inventories": {
        ...
        "CUSTOM_PMEM_NAMESPACE_128GB": {
            "allocation_ratio": 1.0,
            "max_unit": 4,
            "min_unit": 1,
            "reserved": 0,
            "step_size":1,
            "total": 4
        },
        "CUSTOM_PMEM_NAMESPACE_262144MB": {
            "allocation_ratio": 1.0,
            "max_unit": 2,
            "min_unit": 1,
            "reserved": 0,
            "step_size": 1,
            "total": 2
        },
        "CUSTOM_PMEM_NAMESPACE_MEDIUM": {
            "allocation_ratio": 1.0,
            "max_unit": 2,
            "min_unit": 1,
            "reserved": 0,
            "step_size": 1,
            "total":2
        },
        ...
}

Please note, this is just an example to show different ways to configure persistent memory namespaces and how they are tracked. There are certainly some flexibility in the naming of the resource class name. It is up to the admin to configure the namespaces properly.

Note

Resource class names are opaque. For example, a request for CUSTOM_PMEM_NAMESPACE_128GB cannot be fulfilled by a CUSTOM_PMEM_NAMESPACE_131072MB resource even though they are (presumably) the same size.

Different units do not convert freely from one to another while embeded in custom resource class names. Meaning a request for a 128GB persistent memory namespace can be fulfilled by a CUSTOM_PMEM_NAMESPACE_128GB resource, but can not be fulfilled by a CUSTOM_PMEM_NAMESPACE_131072MB resource even though they are of the same quantity.

Persistent memory is by nature NUMA sensitive. However for the initial iteration, the resource inventories are put directly under ROOT resource provider of the compute host. Persistent memory NUMA affinity will be adddressed by a seperate follow-on spec.

A change in the configuration will stop the nova compute agent from (re)starting if that change removes any namespaces in use by guests from the configuration.

Virtual persistent memory specification

Virtual persistent memory information is added to guest hardware flavor extra specs in the form of:

hw:pmem=$LABEL[,$LABEL]

$LABEL is the variable part of a resource class name as defined in the Persistent memory namespace management and resource tracking section. Each appearence of a ‘$LABEL’ means a requirement to one persistent memory namespace of CUSTOM_PMEM_NAMESPACE_$LABEL resource class. So there can be multiple appearences of the same $LABEL in one specification. To give an example:

hw:pmem=128GB,128GB

It means a resource requirement of two 128GB persisent memory namespaces.

Libvirt domain specification requires each virtual persistent memory to be associated with one guest NUMA node. If guest NUMA topology is specified in the flavor, the guest virtual persistent memory devices are put under guest NUMA node 0. If guest NUMA topology is not specified in the flavor, a guest NUMA node 0 is constructed implicitly and all guest virutal persistent memory devices are put under it. Please note, under the second circumstance (implicitly constructing a guest NUMA node 0), the construction of guest NUMA node 0 happens at the libvirt driver while the guest libvirt domain specification is being built up. The NUMA topology logic in the scheduler is not applied. And from the perspective of any other parts of Nova, this guest is still a non-NUMA guest.

Examples:

One NUMA node, one 512GB virtual persistent memory:
    hw:numa_nodes=1
    hw:pmem=512GB

One NUMA node, two 512GB virtual persistent memory:
    hw:numa_nodes=1
    hw:pmem=512GB,512GB

Two NUMA nodes, two 512GB virtual persistent memory:
    hw:numa_nodes=2
    hw:pmem=512GB,512GB

    Both of the two virtual persistent memory devices
    are put under NUMA node 0.

No NUMA node, two 512GB virtual persistent memory:
    hw:pmem = 512GB,512GB

    A guest NUMA node 0 is constructed implicitly.
    Both virtual persistent memory devices are put under it.

Important

Qemu does not support backing one virtual persistent memory device by multiple physical persistent memory namespaces, no matter whether they are contiguous or not. So any virtual persistent memory device requested by guests is backed by one physical persistent memory namespace of the exact same resource class.

The extra specs are translated to placement API requests accordingly.

Virtual persistent memory disposal

Due to the persistent nature of host PMEM namespaces, the content of virtual persistent memory in guests shall be zeroed out immediately once the virtual persisent memory is no longer associated with any VM instance (cases like VM deletion, cold/live migration, shelve, evacuate and etc.). Otherwise there will be security concerns. Since persistent memory devices are typically of large size, this may introduce a performance penalty to guest deletion or any other actions involving erasing PMEM namespaces. The standard I/O APIs (read/write) cannot be used with DAX (direct access) devices. The nova compute libvirt driver uses daxio utility (wrapped by privsep library functions) for this purpose.

VM rebuild

The persisent memory namespaces are zeroed out during VM rebuild to get to the initial state of the VM.

VM resize

Adding new virtual persistent memory devices to an instance is allowed. As for a concrete virtual persistent memory device, changing its backing namespace’s resource class is not allowed. This is because it is not always possible to compare the sizes of two namespaces in different resource classes (e.g. CUSTOM_PMEM_NAMESPACE_128G and CUSTOM_PMEM_NAMESPACE_MEDIUM).

By default the content of the original virtual persistent memory is copied to the new virtual persistent memory (if there is). This could be time consuming, so a flavor extra spec is introduced as a flag:

hw:allow_pmem_copy=true|false (default false)

If either the source or target has this flag set to true, the data in virtual persistent memory is copied. If both the source and target have this flag set to false, the data in virtual persistent memory is not copied. This (not copying data) is useful in scenarios such as virtual persistent memory is used as cache. For a graceful shutdown (which resize does), the data in the cache is flushed, so no need to copy the data.

Nova compute libvirt driver uses daxio utility to read out the data from the source persistent memory namespace and write in to the target persistent memory namespace. If the source and target persistent memory namespaces are not on the same host, ssh tunnel is used to channel the data transfer. This ssh tunnel uses the same ssh key as of moving VM disks during resize. The copying via network won’t work in case of cross-cell resize since there is no direct connectivity between the source and target host.

Live migration

Live migration with virtual persistent memory is supported by QEMU. Qemu treats virtual persistent memory as volatile memory in case of live migration. It just takes longer time due to the typical large capacity of virtual persistent memory.

Virtual persistent memory hotplug

This spec does not address the hot plugging of virtual persistent memory.

VM snapshot

The current VM snapshots do not include memory images. For the current phase the virtual persistent images are not included in the VM snapshots. In future, virtual persistent images could be stored in Glance as a separate image format. And flavor extra specs can be used to specify whether to save virtual persistent memory image during VM snapshot.

VM shelve/unshelve

Shelving a VM is to upload the VM snapshot to Glance service. Since the virtual persistent memory image is not included in the VM snapshot, VM shelve/unshelve does not automatically save/restore the virtual persistent memory for the current iteration. As snapshot, saving/restoring virtual persistent memory images could be supported after the persistent memory images can be stored in Glance. The persistent memory namespaces belong to a shelved VM are zeored out after VM being shelve-offloaded.

Alternatives

Persisent memory namespaces can be created/destroyed on the fly as VM creation/deletion. This ways is more flexible than the fix sized approach, however it will result in fragmentation as detailed in the Background section.

Another model of fix sized appoach other than the proposed one could be evenly partitioning the entire persistent memory space into namespaces of the same size and setting the step_size of the persistent memory resource provider to the size of each namespace. However this model assumes a larger namespace can be assembled from multiple smaller namespaces (a 256GB persistent memory requirement may land on 2x128GB namespaces) which is not the case.

Persistent memory demonstrates certain similarity with block devices in its non-volatile nature and life cycle management. It is possible to stick it into block device mapping (BDM) interface. However, NUMA affinity support is in the future of persistent memory and BDM is not the ideal interface to decribe NUMA.

Data model impact

A new VirtualPMEM object is introduced to track the virtual PMEM information of an instance, it stands for a virtual persistent memory device backed by a physical persistent memory namespace:

class VirtualPMEM(base.NovaObject):
    # Version 1.0: Initial version
    VERSION = "1.0"

    fields = {
        'rc_name': fields.StringField(),
        'ns_name': fields.StringField(nullable=True),
    }

In addition a VirtualPMEMList object is introduced to represent a list of VirtualPMEM objects:

class VirtualPMEMList(base.NovaObject):
    # Version 1.0: Initial version
    VERSION = "1.0"

    fields = {
        'vpmems': fields.ListOfObjectsField('VirtualPMEM'),
    }

A ‘vpmems’ deferred-load column is added to class InstanceExtra, which stores a serialized VirtualPMEMList object for a given instance:

class InstanceExtra(BASE, NovaBase, models.SoftDeleteMixin):
     ...
     migration_context = orm.deferred(Column(Text))
     keypairs = orm.deferred(Column(Text))
     trusted_certs = orm.deferred(Column(Text))
     vpmems = orm.deferred(Column(Text))
     instance = orm.relationship(Instance,
     ...

Two new fields are introduced to MigrationContext to hold the old and new virtual persistent memory devices during migration:

class MigrationContext(base.NovaPersistentObject, base.NovaObject):
     ...
     fields = {
         'old_pci_requests': fields.ObjectField('InstancePCIRequests',
                                                 nullable=True),
         'new_vpmems': fields.ObjectField('VirtualPMEMList',
                                          nullable=True),
         'old_vpmems': fields.ObjectField('VirtualPMEMList',
                                          nullable=True),
    }

REST API impact

Flavor extra specs already accept arbitrary data. No new micro version introduced.

Security impact

Host persistent memory namespaces needs to be erased (zeroed) to be reused.

Notifications impact

None.

Other end user impact

End users choose flavors with desired virtual persistent memory sizes.

Performance Impact

PMEM namespaces tend to be large. Zeroing out a persistent memory namespace requires a considerable amount of time. This may introduce a negative performance impact when deleting a guest with large virtual persistent memories.

Other deployer impact

The deployer needs to create persistent memory namespaces of the desired sizes before nova is deployed on a certain host.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:

xuhj

Other contributors:

luyaozhong rui-zang

Work Items

  • Object: add DB model and Nova object.

  • Compute: virtual persistent memory life cycle management.

  • Scheduler: translate virtual persistent memory request to

    placement requests.

  • API: parse virtual persistent memory flavor extra specs.

Dependencies

  • Kernel version >= 4.2

  • QEMU version >= 2.9.0

  • Libvirt version >= 5.0.0

  • ndctl version >= 4.7

  • daxio version >= 1.6

Testing

Unit tests. Third party CI is required for testing on real hardware. Persistent memory nested virtualization works for QEMU/KVM. For the third party CI, tempest tests are executed in a VM with virtual persisent memory backed by physical persistent memory.

Documentation Impact

The cloud administrator docs need to describe how to create and configure persistent memory namespaces. Add a persitent memory section into the Nova “advanced configuration” document.

The end user needs to be make aware of this feature. Add the flavor extra spec details into the Nova flavors document.

References

History

Revisions

Release Name

Description

Train

Introduced