Memory Fragmentation Tuning

Under high memory pressure, high-order pages become hard to allocate, and page allocations frequently fall into synchronous (direct) reclaim, because the default gaps between the min, low, and high watermarks are too small to wake up kswapd (asynchronous reclaim) early enough.

Problem Description

On OpenStack compute nodes, especially hyperconverged machines where the Ceph OSDs consume a lot of page cache, memory allocation stalls are easy to hit. The stalls lead to several symptoms: new instances cannot be brought up (KVM needs to allocate order-6 pages), running VMs get stuck, and so on. The reasons are:

1). Compaction for high-order pages. If THP (Transparent Huge Pages) is used with the VM, the problem is more severe than with persistent huge pages reserved for the VM's dedicated usage, because THP allocates its 2MB (x86) huge pages at run time. A 2MB page is order 9 (2^9 * 4K = 2MB), and on a running system it is hard to find 512 (2^9) contiguous 4K pages, as /proc/pagetypeinfo shows.

2). Synchronous reclaim. There are three watermark levels in the system: 1). min 2). low 3). high. When the number of free pages drops below the low watermark, kswapd is woken up to perform asynchronous reclaim, and it does not stop until the number of free pages reaches the high watermark. However, when memory allocation pressure is strong enough, free pages keep dropping toward the min watermark. At that point the min pages are reserved for emergency usage, and allocations fall into direct-reclaim (synchronous) mode, which stalls the process. Both conditions can be observed from procfs, as shown below.
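
For example, per-order free block counts and the current per-zone watermarks can be inspected from procfs (the exact output layout varies slightly across kernel versions):

# Free blocks per order and per zone; low counts in the rightmost
# (high-order) columns indicate fragmentation.
cat /proc/buddyinfo

# Per-zone min/low/high watermarks, in pages.
grep -E 'zone|^ +(min|low|high) ' /proc/zoneinfo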

Proposed Change

Past experience shows that a 1GB gap between the min, low, and high watermarks is good practice in server environments. The bigger gap wakes up kswapd earlier, avoids synchronous reclaim, and thereby alleviates allocation latency. The sysctl parameters related to the watermark gap calculation are:

vm.min_free_kbytes
vm.watermark_scale_factor
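
The current values can be read with sysctl (vm.watermark_scale_factor only exists on the newer kernels discussed below):

sysctl vm.min_free_kbytes vm.watermark_scale_factor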

For Ubuntu kernels before 4.15 (Bionic), the only way to tune the watermarks is to modify vm.min_free_kbytes; the gap is 1/4 of vm.min_free_kbytes. However, increasing min_free_kbytes also increases the min watermark reservation, which decreases the memory that the running system can actually use.
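
For example, following the 1/4 ratio above, a 1GB (1048576 kB) gap on such a kernel would require:

# gap = min_free_kbytes / 4, so a 1GB gap needs
# min_free_kbytes = 4 * 1048576 = 4194304 kB, i.e. 4GB held back
# as the min reservation, which illustrates the drawback.
sysctl -w vm.min_free_kbytes=4194304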

For Ubuntu kernels at 4.15 or later, vm.watermark_scale_factor can be used to increase the gap without increasing the min watermark reservation. The gap is calculated as "watermark_scale_factor / 10000 * managed_pages".
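
A minimal sketch of that calculation, approximating managed_pages with MemTotal from /proc/meminfo (the kernel actually tracks managed_pages per zone, so this is an estimate):

# watermark_scale_factor = gap / managed_pages * 10000
# Target gap: 1GB (1048576 kB); e.g. 256GB of RAM yields ~39.
awk '/MemTotal/ {
    factor = int(1048576 / $2 * 10000)
    if (factor < 10) factor = 10   # stay at or above the default of 10
    print "vm.watermark_scale_factor =", factor
}' /proc/meminfo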

The proposed solution is to set the 1GB watermark gap using the above two parameters when the compute node is rebooted.
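
For instance, the computed value could be persisted through a sysctl drop-in so that it is applied on every boot (the file path and value here are illustrative only):

# Illustrative: persist the computed value and reload.
echo 'vm.watermark_scale_factor = 39' > /etc/sysctl.d/90-watermark-gap.conf
sysctl --system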

The feature will be designed in flexible ways: 1). There will be a switch to turn the feature on/off. By default, it is turned off, because on small-memory compute nodes (<32GB), keeping a 1GB gap of free memory is too costly.

2). Manual configuration takes priority and overrides the default calculation. A hypothetical illustration of both options is shown below.
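
For illustration only (these option names are hypothetical, not an existing charm interface), the two behaviours could look like:

# Hypothetical switch, off by default:
juju config sysconfig enable-watermark-tuning=true
# Hypothetical manual override, taking priority over the calculation:
juju config sysconfig watermark-scale-factor=50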

Alternatives

The config can be set up at run time with the following commands:

juju deploy cs:sysconfig-2
juju add-relation sysconfig nova-compute
juju config sysconfig sysctl="{vm.extfrag_threshold: 200, vm.watermark_scale_factor: 50}"

However, each system may have a different memory capacity, so the appropriate watermark_scale_factor needs to be calculated manually for each one.

Implementation

Assignee(s)

Primary assignee:
- Gavin Guo <gavin.guo@canonical.com>

Gerrit Topic

Use Gerrit topic “memory-fragmentation-tuning” for all patches related to this spec.

git-review -t memory-fragmentation-tuning

Work Items

Implement the watermark_scale_factor value calculation that sets the gap to 1GB. A sketch of the intended logic follows.
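
A minimal shell sketch of this work item, assuming the 1GB target and the kernel-version split described in the proposal (variable names are illustrative):

#!/bin/sh
# Pick the tuning knob by kernel version: vm.min_free_kbytes on
# older kernels, vm.watermark_scale_factor on 4.15 and later.
GAP_KB=1048576   # 1GB target gap
KVER=$(uname -r | cut -d. -f1-2)
if dpkg --compare-versions "$KVER" ge 4.15; then
    MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    FACTOR=$(( GAP_KB * 10000 / MEM_KB ))
    [ "$FACTOR" -lt 10 ] && FACTOR=10   # keep at least the default
    sysctl -w vm.watermark_scale_factor="$FACTOR"
else
    # gap = min_free_kbytes / 4, so reserve four times the gap.
    sysctl -w vm.min_free_kbytes=$(( GAP_KB * 4 ))
fi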

Repositories

No new git repository is required.

Documentation

The documentation needs to cover the switch that turns the feature on/off.

Security

The use of this feature exposes no new security attack surface.

Testing

Verify that the calculated watermark values are correct. Also verify that the right parameter is used for each kernel version (min_free_kbytes vs. watermark_scale_factor).
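
For example, one way to spot-check the result is to sum the per-zone distance between the low and high watermarks, which should come out near the target (a sketch, not a full test plan):

# With a 1GB gap, the summed (high - low) distance should be
# roughly 262144 pages (1GB / 4K).
awk '$1 == "low" {low = $2} $1 == "high" {gap += $2 - low}
     END {print gap, "pages"}' /proc/zoneinfo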

Dependencies

None