Virt driver large page allocation for guest RAM

https://blueprints.launchpad.net/nova/+spec/virt-driver-large-pages

This feature aims to improve the libvirt driver so that it can use large pages for backing the guest RAM allocation. This will improve the performance of guest workloads by increasing TLB cache efficiency. It will ensure that the guest has 100% dedicated RAM that will never be swapped out.

Problem description

Most modern virtualization hosts support a variety of memory page sizes. On x86 the smallest, used by the kernel by default, is 4 KB, while large sizes include 2 MB and 1 GB. The CPU TLB cache has a limited size, so when a very large amount of RAM is present and utilized, the cache efficiency can be fairly low, which in turn increases memory access latency. With larger page sizes, fewer entries are needed in the TLB and thus its efficiency goes up.

The use of huge pages for backing guests implies that the guest is running with a dedicated resource allocation, i.e. memory overcommit can no longer be provided. This is a tradeoff that cloud administrators may be willing to make to support workloads that require predictable memory access times, such as NFV.

While large pages are better than small pages, it cannot be assumed that the benefit keeps increasing with the page size. In some workloads a 2 MB page size can be better overall than a 1 GB page size. The choice of page size also affects the granularity of guest RAM size, i.e. a 1.5 GB guest would not be able to use 1 GB pages since its RAM is not a multiple of the page size.

Although it is theoretically possible to reserve large pages on the fly, after a host has been booted for a period of time, physical memory will have become very fragmented. This means that even if the host has lots of free memory, it may be unable to find contiguous chunks required to provide large pages. This is a particular problem for 1 GB sized pages. To deal with this problem, it is usual practice to reserve all required large pages upfront at host boot time, by specifying a reservation count on the kernel command line of the host. This would be a one-time setup task done when deploying new compute node hosts.
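
For example, a host intended to back guests with both 2 MB and 1 GB pages might reserve them with kernel command line parameters along the following lines; the exact counts are illustrative only and depend on the host RAM and the expected mix of guests.

  default_hugepagesz=1G hugepagesz=1G hugepages=8 hugepagesz=2M hugepages=2048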

Proposed change

The flavor extra specs will be enhanced to support a new parameter

  • hw:mem_page_size=large|small|any|<custom page size in KiB>

In the absence of any page size setting in the flavor, the current behaviour of using the small, default, page size will continue. A setting of ‘large’ says to only use larger page sizes for guest RAM, e.g. either 2 MB or 1 GB on x86; ‘small’ says to only use the small page size, e.g. 4 KB on x86, and is the default; ‘any’ means to leave the policy up to the compute driver implementation. When seeing ‘any’ the libvirt driver might try to find large pages, but fall back to small pages; other drivers may choose alternate policies for ‘any’. Finally, an explicit page size can be set if the workload has very precise requirements for a specific large page size. It is expected that the common case would be to use mem_page_size=large or mem_page_size=any. The specification of explicit page sizes would be something that NFV workloads would require.
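
The interpretation of these values by a compute driver could be sketched as follows; the helper name and the assumption that host page sizes are expressed in KiB are illustrative only, not the final implementation.

  def candidate_page_sizes(requested, host_page_sizes_kb):
      """Return the host page sizes (KiB) that satisfy the flavor setting.

      'requested' is the hw:mem_page_size value: 'small', 'large', 'any'
      or an explicit size in KiB. 'host_page_sizes_kb' is the list of
      page sizes the host supports, e.g. [4, 2048, 1048576] on x86.
      """
      smallest = min(host_page_sizes_kb)
      if requested == 'small':
          return [smallest]
      if requested == 'large':
          return [sz for sz in host_page_sizes_kb if sz != smallest]
      if requested == 'any':
          # Prefer large pages, but allow falling back to small ones.
          return sorted(host_page_sizes_kb, reverse=True)
      # An explicit size is only usable if the host actually offers it.
      return [int(requested)] if int(requested) in host_page_sizes_kb else []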

The property defined for the flavor can also be set against the image, but the use of large pages would only be honoured if the flavor already had a policy of ‘large’ or ‘any’, i.e. if the flavor said ‘small’, or a specific numeric page size, the image would not be permitted to override this to access other large page sizes. Such an invalid override in the image would result in an exception being raised and the attempt to boot the instance failing with an error. While ultimate validation is done in the virt driver, this can also be caught and reported at the API layer.

If the flavor memory size is not a multiple of the specified huge page size, this would be considered an error which would cause the instance to fail to boot. If the page size is ‘large’ or ‘any’, then the compute driver would instead attempt to pick a page size that divides the RAM size evenly, rather than erroring. This is only likely to be a significant problem when using 1 GB page sizes, which implies that the RAM size must be in 1 GB increments.
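
The validation rules in the preceding two paragraphs could be expressed roughly as below; the function name, exception type and error messages are placeholders for illustration.

  def validate_page_size_request(flavor_setting, image_setting,
                                 mem_mb, chosen_page_size_kb):
      """Apply the override and granularity rules described above.

      'flavor_setting' and 'image_setting' are the hw:mem_page_size
      values from the flavor extra specs and the image properties;
      'chosen_page_size_kb' is the size the driver decided to use.
      """
      # An image may only pick a page size when the flavor allows it.
      if image_setting and flavor_setting not in ('large', 'any'):
          if image_setting != flavor_setting:
              raise ValueError("image page size forbidden by flavor")

      # The guest RAM must be an exact multiple of the chosen page size.
      if (mem_mb * 1024) % chosen_page_size_kb != 0:
          raise ValueError("flavor memory is not a multiple of page size")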

The libvirt driver will be enhanced to honour this parameter when configuring the guest RAM allocation policy. This will effectively introduce the concept of a “dedicated memory” guest, since large pages must be associated 1-to-1 with guests - there is no facility to overcommit by allowing one large page to be used by multiple guests, or to swap large pages out.
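
As an illustration of the general libvirt mechanism involved (per-NUMA-node page placement is covered under Dependencies), the generated domain XML would gain a memory backing element roughly like the fragment built below; this is a hedged sketch, not the actual driver code.

  from xml.etree import ElementTree as ET

  def memory_backing_xml(use_huge_pages):
      """Build the <memoryBacking> fragment of the guest domain XML.

      With huge pages requested, libvirt backs the entire guest RAM
      from the host's pre-reserved huge page pool.
      """
      backing = ET.Element('memoryBacking')
      if use_huge_pages:
          ET.SubElement(backing, 'hugepages')
      return ET.tostring(backing).decode()

  # Prints: <memoryBacking><hugepages /></memoryBacking>
  print(memory_backing_xml(True))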

The libvirt driver will be enhanced to report on large page availability per NUMA node, building on previously added NUMA topology reporting.
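
One way the per-node page counts could be gathered on a Linux host is from sysfs, as in the sketch below; whether the final implementation uses sysfs or libvirt's own capability reporting is a detail left to the driver.

  import glob
  import os
  import re

  def host_mempages(node):
      """Read huge page totals and free counts for one host NUMA node.

      Uses the standard Linux sysfs layout, e.g.
      /sys/devices/system/node/node0/hugepages/hugepages-2048kB/
      """
      pages = []
      pattern = '/sys/devices/system/node/node%d/hugepages/hugepages-*kB' % node
      for path in glob.glob(pattern):
          size_kb = int(re.search(r'hugepages-(\d+)kB', path).group(1))
          with open(os.path.join(path, 'nr_hugepages')) as f:
              total = int(f.read())
          with open(os.path.join(path, 'free_hugepages')) as f:
              free = int(f.read())
          pages.append({'size_kb': size_kb,
                        'total': total,
                        'used': total - free})
      return pages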

The scheduler will be enhanced to take account of the page size setting on the flavor and pick hosts which have sufficient large pages available when scheduling the instance. Conversely if large pages are not requested, then the scheduler needs to avoid placing the instance on a host which has pre-reserved large pages. The enhancements for the scheduler will be done as part of the new filter that is implemented as part of the NUMA topology blueprint. This involves altering the logic done in that blueprint, so that instead of just looking at free memory in each NUMA node, it instead looks at the free page count for the desired page size.

As illustrated later in this document, each host will report on all the page sizes available, and this information will be available to the scheduler. So when it interprets ‘small’, it will consider the smallest page size reported by the compute node; conversely, when interpreting ‘large’ it will consider any page size except the smallest one. This obviously implies that there is potential for ‘large’ and ‘small’ to have different meanings depending on the host being considered. For the use cases where this would be a problem, an explicit page size would be requested instead of using these symbolic named sizes. The scheduler will also have to consider whether the flavor memory size is a multiple of the page size. If the instance is using multiple NUMA nodes, it will have to consider whether the RAM in each guest node is a multiple of the page size, rather than the total memory size.
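
A simplified sketch of the per-NUMA-node check this filter would perform is given below, using the size_kb / total / used fields shown later in the data model section; the function name and calling convention are illustrative.

  def cell_can_fit(cell_mempages, guest_node_mem_mb, wanted_sizes_kb):
      """Check whether one host NUMA cell can back one guest NUMA node.

      'cell_mempages' is the mempages list reported for the host cell;
      'wanted_sizes_kb' lists the acceptable page sizes, most preferred
      first (see the flavor interpretation sketch earlier). Returns the
      page size chosen, or None if the cell cannot fit the guest node.
      """
      guest_mem_kb = guest_node_mem_mb * 1024
      for size_kb in wanted_sizes_kb:
          for pages in cell_mempages:
              if pages['size_kb'] != size_kb:
                  continue
              # The guest node RAM must be a whole number of pages and
              # the cell must still have that many pages unused.
              if guest_mem_kb % size_kb != 0:
                  continue
              needed = guest_mem_kb // size_kb
              if needed <= pages['total'] - pages['used']:
                  return size_kb
      return None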

Alternatives

Recent Linux hosts have a concept of “transparent huge pages” where the kernel will opportunistically allocate large pages for guest VMs. The problem with this is that over time, the kernel’s memory allocations become very fragmented, making it increasingly hard to find the contiguous blocks of RAM needed for large pages. This makes transparent huge pages impractical for use with 1 GB page sizes. The opportunistic approach also means that users do not have any hard guarantee that their instance will be backed by large pages. This makes it an unusable approach for NFV workloads, which require hard guarantees.

Data model impact

The previously added data in the host state structure for reporting NUMA topology would be enhanced to further include information on page size availability per node. It would then look like:

hw_numa = {
   nodes = [
       {
          id = 0
          cpus = 0, 2, 4, 6
          mem = {
             total = 10737418240
             free = 3221225472
          },
          mempages = [{
               size_kb = 4,
               total = 262144,
               used = 262144,
             }, {
               size_kb = 2048,
               total = 1024,
               used = 1024,
             }, {
               size_kb = 1048576,
               total = 7,
               used = 0,
             }
          ]
          distances = [ 10, 20],
       },
       {
          id = 1
          cpus = 1, 3, 5, 7
          mem = {
             total = 10737418240
             free = 5368709120
          },
          mempages = [{
               size_kb = 4,
               total = 262144,
               used = 512,
             }, {
               size_kb = 2048,
               total = 1024,
               used = 128,
             }, {
               size_kb = 1048576,
               total = 7,
               used = 4,
             }
          ]
          distances = [ 20, 10],
       }
   ],
}

REST API impact

No impact.

The existing APIs already support arbitrary data in the flavor extra specs.

Security impact

No impact.

Notifications impact

No impact.

The notifications system is not used by this change.

Other end user impact

There are no changes that directly impact the end user, other than the fact that their guest should have more predictable memory access latency.

Performance Impact

The scheduler will have more logic added to take into account large page availability per NUMA node when placing guests. Most of this impact will have already been incurred when initial NUMA support was added to the scheduler. This change is merely altering the NUMA support such that it considers the free large pages instead of overall RAM size.

Other deployer impact

The cloud administrator will gain the ability to set a large page policy on the flavors they configure. The administrator will also have to configure their compute hosts to reserve large pages at boot time, and place those hosts into a group using host aggregates.

It is possible that there might be a need to expose information on the page counts to host administrators via the Nova API. Such a need can be considered for follow-up work once the work referenced in this spec is completed.

Developer impact

If other hypervisors allow control over large page usage, they could be enhanced to support the same flavor extra specs settings. If the hypervisor has self-determined control over large page usage, then it is valid to simply ignore this new flavor setting, i.e. do nothing.

Implementation

Assignee(s)

Primary assignee:

berrange

Other contributors:

ndipanov, sahid-ferdjaoui

Work Items

  • Enhance libvirt driver to report available large pages per NUMA node in the host state data

  • Enhance libvirt driver to configure guests based on the flavor parameter for page sizes

  • Add support to scheduler to place instances on hosts according to the availability of required large pages

Dependencies

  • Virt driver guest NUMA node placement & topology. This blueprint is going to be an extension of the work done in the compute driver and scheduler for NUMA placement, since large pages must be allocated from the matching guest & host NUMA nodes to avoid cross-node memory access.

  • Libvirt / KVM need to be enhanced to allow Nova to indicate that large pages should be allocated from specific NUMA nodes on the host. This is not a blocker to supporting large pages in Nova, since it can use the more general large page support in libvirt, however, the performance benefits won’t be fully realized until per-NUMA node large page allocation can be done.

Testing

Testing this in the gate would be difficult since the hosts which run the gate tests would have to be pre-configured with large pages allocated at initial OS boot time. This in turn would preclude running gate tests with guests that do not want to use large pages.

Documentation Impact

The new flavor parameter available to the cloud administrator needs to be documented along with recommendations about effective usage. The docs will also need to mention the compute host deployment pre-requisites such as the need to pre-allocate large pages at boot time and setup aggregates.

References

Current “big picture” research and design for the topic of CPU and memory resource utilization and placement. vCPU topology is a subset of this work.