Image Pre-caching support

https://blueprints.launchpad.net/nova/+spec/image-precache-support

Nova supports caching images on demand at the compute node level for performance reasons, but provides no ability to schedule that activity ahead of a rollout or maintenance window. This long-requested feature becomes even more important when considering edge computing environments, limited bandwidth, and high-scale rapid application deployment.

Problem description

Several of the virt drivers in Nova support the caching of base images for improved boot performance. The first time an instance is booted from a given image, that base image is downloaded from Glance, cached, and either copied or CoW’d to create the actual instance root disk. Subsequent instance boots from the same image can re-use the cached copy of the base image instead of downloading it again from Glance. This behavior provides the following benefits:

  • Decreased load on the Glance server(s)

  • Decreased network utilization

  • Decreased time-to-boot latency for the second and subsequent instances

The latter is particularly important for situations where new application rollouts must be performed within a specific time window, or where scale-up operations are expected to happen quickly in response to changing load conditions. Specifically, it can be important to ensure that a new image is cached on all of the relevant compute nodes prior to the upgrade window opening, or before load unexpectedly spikes.

Further, in situations where compute nodes may be remotely located in environments where network bandwidth is limited (such as many edge computing environments), it may be very important to push a new base image to those nodes during times of low utilization or a maintenance window, such that the image download process does not consume a massive amount of bandwidth during normal operation.

Because Nova does not provide a way to seed this process from the outside, operators are currently forced to hack around the problem. Some of the workarounds we know are being used include:

  • Pre-booting throwaway instances on each compute node by hand to seed the cache before deploying the real ones

  • Copying the images directly into the cache directories on the compute nodes out of band

  • Carrying local modifications to the Nova code to provide this functionality

  • Using a shared storage volume for the image cache (which is known to be broken)

  • Using a totally different ephemeral backend, such as Ceph, which side-steps the problem entirely (but requires a substantially larger investment)

Use Cases

  • As an operator of a cloud with remote compute nodes at the network edge, I want to be able to pre-cache images during maintenance windows in order to avoid the network spike involved with spinning up a new instance and pulling the base image on demand.

  • As a user of a cloud which supports an application that is frequently re-deployed en masse, I want to be able to pre-cache new images on the relevant compute nodes before my rollout window, to limit my application downtime to purely the time needed to respawn or rebuild instances.

Proposed change

This functionality has been proposed and requested multiple times, but has failed to gain traction with the team for various reasons. Thus, this spec proposes a minimally viable initial implementation which addresses the need for pre-caching, but does not provide detailed visibility, reporting, scheduling, or other advanced features.

Initially, we will add a mechanism to Nova by which a (sufficiently privileged) user can request that a set of images be cached on the set of compute nodes contained within a host aggregate. This activity will be delegated to a (super-)conductor worker, which will do the following (a rough sketch follows this list):

  • Validate the images provided (for existence and accessibility, so that we do not ask a large number of computes to attempt something impossible)

  • Look up the list of hosts in the given aggregate

  • Collate the hosts by cell

  • Iterate through those hosts, making an RPC request to each to start the operation
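
For illustration, the validation and cell-collation steps might look like the following self-contained sketch; the Host class, validate_image callable, and function names are stand-ins, not Nova's real objects or its Glance client:

  from collections import defaultdict


  class Host(object):
      """Illustrative stand-in for a compute host record."""

      def __init__(self, name, cell_uuid):
          self.name = name
          self.cell_uuid = cell_uuid


  def collate_hosts_by_cell(hosts):
      """Group an aggregate's hosts by the cell they live in."""
      by_cell = defaultdict(list)
      for host in hosts:
          by_cell[host.cell_uuid].append(host)
      return by_cell


  def precache_request(image_ids, hosts, validate_image):
      """Fail fast on bad images, then return the per-cell work lists.

      validate_image is a stand-in for a Glance existence/visibility
      check; it should raise if an image is missing or inaccessible
      so that we never fan the request out to the computes.
      """
      for image_id in image_ids:
          validate_image(image_id)
      return collate_hosts_by_cell(hosts)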

If we were to fire off all of those requests as asynchronous RPC casts, we would surely DDoS the image service. Throttling that appropriately could be done in many ways, and is easily the subject of a dedicated follow-up spec. In this initial revision, we will introduce a configurable parallelism limit, which will cause the conductor to contact that many computes in parallel to trigger their downloads, using the long-running RPC call functionality to wait for completion.
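
A minimal sketch of one way to bound the fan-out, with a standard-library thread pool standing in for the long-running RPC calls; precache_on_host and the default parallelism value are placeholders rather than the real RPC method or configuration option:

  from concurrent.futures import ThreadPoolExecutor


  def fan_out_precache(hosts, image_ids, precache_on_host, parallelism=2):
      """Contact at most ``parallelism`` computes at any one time.

      precache_on_host stands in for the long-running RPC call that
      blocks until the target compute has finished (or failed) caching
      the images; bounding the worker pool provides the throttling
      described above.
      """
      results = {}
      with ThreadPoolExecutor(max_workers=parallelism) as pool:
          futures = {pool.submit(precache_on_host, host, image_ids): host
                     for host in hosts}
          for future, host in futures.items():
              try:
                  results[host] = future.result()
              except Exception as exc:
                  # Record per-host failures rather than aborting the run.
                  results[host] = exc
      return results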

Images will be cached on each compute using a new method on the virt driver which, when implemented, will re-use the image download routines already employed during instance boot. Images cached via this mechanism will be subject to the same expiry and purge rules as those downloaded on demand as a result of instance boots. A subsequent call to cache an image that is already resident should reset its expiry timer (if applicable) so that it is not purged from the cache. For the existing drivers that use the imagecache module, this just means touching the cached file to update its mtime.
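
Conceptually, the driver-side behaviour could be as simple as the following sketch; the flat cache layout and the download_base_image callable are illustrative only, since the real drivers re-use their existing imagecache code:

  import os


  def cache_image(cache_dir, image_id, download_base_image):
      """Ensure image_id is resident in this host's image cache.

      Returns True if a download was performed, False if the image was
      already cached. download_base_image is a placeholder for the
      driver's existing image-fetch routine.
      """
      cached_path = os.path.join(cache_dir, image_id)
      if os.path.exists(cached_path):
          # Already resident: refresh the mtime so the periodic cache
          # manager treats the image as recently used and does not
          # purge it.
          os.utime(cached_path, None)
          return False
      download_base_image(image_id, cached_path)
      return True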

Alternatives

One alternative, as always, is to do nothing. This has been requested and proposed many times in the past, and people are currently living without it and/or working around the limitation on their own.

Another option would be to take a similar approach, but dispense with the incremental nature. We could implement a larger API, with task and progress reporting, scheduling (image X should be cached for Y hours, etc.), and other features that have been part of previous requests. The reason not to do this is to avoid the risk of never completing the work because of the multitude of rabbit holes that open up with a larger scope. See the references section for a partial list of previous attempts that were never completed.

Data model impact

None in this initial iteration. In the future, it may be desirable to track images and status per-compute, which would require some accounting in the database.

REST API impact

It may technically make more sense to put this functionality under the images API in Nova. However, that API is currently marked as deprecated. Since this feature is primarily based on the aggregate model, this spec proposes adding it as an action on an aggregate.

os-aggregates

A new route under the aggregate for images will be added for cache management.

  • POST /os-aggregates/{aggregate_id}/images (returns 202 on success)

    {
      "cache": [
        {"id": "a26887c6-c47b-4654-abb5-dfadf7d3f803"},
        {"id": "4d8c3732-a248-40ed-bebc-539a6ffd25c0"}
      ]
    }
    

Because we are attempting to provide a minimally viable initial implementation, the structure of the request is defined so that additional information can be added in future versions. This may include additional per-image information (such as priority, TTL, etc.) or per-request information, such as parallelism, download rate, etc.
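
For illustration, a client could submit the request shown above as follows; the endpoint, token, aggregate id, and microversion value are all placeholders:

  import requests

  NOVA_ENDPOINT = 'http://nova.example.com/v2.1'  # placeholder endpoint
  AGGREGATE_ID = 1                                # placeholder aggregate
  TOKEN = 'gAAAA-example'                         # placeholder keystone token

  body = {
      'cache': [
          {'id': 'a26887c6-c47b-4654-abb5-dfadf7d3f803'},
          {'id': '4d8c3732-a248-40ed-bebc-539a6ffd25c0'},
      ]
  }

  resp = requests.post(
      '%s/os-aggregates/%s/images' % (NOVA_ENDPOINT, AGGREGATE_ID),
      json=body,
      headers={
          'X-Auth-Token': TOKEN,
          # Placeholder: whichever microversion this change lands in.
          'OpenStack-API-Version': 'compute 2.81',
      },
  )
  assert resp.status_code == 202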

Security impact

Obviously, allowing any user to initiate a wide-scale movement of data brings some inherent risk. As this is proposed to be aggregate-based, the user would likely already need at least the ability to list host aggregates in order to provide one to the caching API. A policy rule defining which users have this ability will default to the same roles that can currently manage host aggregates.
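
A sketch of what such a policy rule could look like using oslo.policy; the rule name and check string here are illustrative and not necessarily the final values:

  from oslo_policy import policy

  # Illustrative only: the final rule name and default check string will
  # be decided at implementation time; the intent is to default to the
  # same admin check used for other host aggregate actions.
  aggregate_image_cache_policy = policy.DocumentedRuleDefault(
      name='os_compute_api:os-aggregates:images',
      check_str='rule:admin_api',
      description='Request image pre-caching on the hosts in an aggregate',
      operations=[{
          'method': 'POST',
          'path': '/os-aggregates/{aggregate_id}/images',
      }],
  )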

Notifications impact

Without any API-based reporting of progress per-compute, emitting notifications about the start and completion of image downloads could be helpful. This would allow operators to monitor the process.

Other end user impact

The clients (python-novaclient and openstackclient) will obviously need to gain the ability to call this API. Regular users should be entirely unaffected, other than potentially noticing improved boot performance.

Performance Impact

The primary goal of this change is to improve the performance of instance boots after the images are pre-cached. During the pre-caching operation itself, there will be some additional load on the image service, on the conductor workers coordinating the task, and on the computes doing the work. The actual image download operation on the computes will use the same code paths that are currently used during instance boot.

Other deployer impact

Deployers will need to determine which users should be allowed to access this caching API, if any, and modify the policy accordingly.

Developer impact

This will require new RPC methods on the conductor and compute services, and a corresponding call on the virt driver. Currently, the libvirt, hyperv, and vmwareapi drivers use the imagecache module. Initial support will be provided for the libvirt driver, but support should be relatively easy to add for the other two, given that they re-use the imagecache module.

Upgrade impact

As this initial revision of the feature is best-effort, with no real reporting and no guarantee that the images are cached by any particular deadline, the upgrade impact is minimal. If the compute RPC API is pinned to a version lower than the one required to make this call, then no computes will be contacted to pre-cache the images.
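
Conceptually, the conductor-side guard might look like this sketch, where the client object, version string, and RPC call name are all placeholders for whatever the actual compute RPC API bump ends up being:

  def maybe_precache(client, host, image_ids, required_version='5.4'):
      """Skip computes whose RPC API is pinned below the new call.

      client stands in for an oslo.messaging RPC client for the compute
      service; can_send_version() is how such clients report whether a
      pinned version supports a given method. The version string and
      the cache_images call name here are placeholders.
      """
      if not client.can_send_version(required_version):
          # Pinned/old compute: quietly skip it, as described above.
          return False
      client.cache_images(host, image_ids)  # placeholder RPC call
      return True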

If the caching call is made against computes running virt drivers that are not yet (or ever) able to participate, a warning log message will be emitted by the base virt driver.

Implementation

Assignee(s)

Primary assignee:

danms

Feature Liaison

Feature liaison:

danms

Work Items

  • Extend the base virt driver to contain a cache_image() method which takes an image id. The default behavior is to raise NotImplementedError (see the sketch after this list).

  • Implement the cache_image() method in the libvirt driver

  • Add the new RPC call to the compute manager which delegates to the virt driver. If NotImplementedError is raised, a warning message is logged about the lack of support.

  • Add the new RPC call to the conductor manager to look up the hosts in the aggregate, collate them per cell, and call the relevant computes.

  • Add a new REST API call allowing the user to make this request.

  • Add a client implementation for making this call.
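
A rough sketch of how the first three work items could fit together; the class names, logger, and method signatures are illustrative rather than Nova's actual module layout:

  import logging

  LOG = logging.getLogger(__name__)


  class ComputeDriver(object):
      """Stand-in for the base virt driver interface."""

      def cache_image(self, context, image_id):
          """Cache image_id locally; drivers that support this override it."""
          raise NotImplementedError()


  class ComputeManager(object):
      """Stand-in for the compute manager RPC endpoint."""

      def __init__(self, driver):
          self.driver = driver

      def cache_images(self, context, image_ids):
          cached = []
          for image_id in image_ids:
              try:
                  self.driver.cache_image(context, image_id)
                  cached.append(image_id)
              except NotImplementedError:
                  LOG.warning('This virt driver does not support image '
                              'pre-caching; ignoring the request.')
                  break
          return cached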

Dependencies

Patches against openstackclient, novaclient, and nova will be inter-dependent.

Testing

As this initial phase of implementation provides no externally-visible changes to a running deployment, testing with tempest would have to rely on something indirect, like time-to-boot latency, to determine success. Thus, functional tests will be added to ensure that the image cache is populated by the new call, and that subsequent instance boots do not contact the image service to perform the download.
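
The functional assertion could follow a pattern like this sketch, with fake driver and image-service stand-ins rather than Nova's actual test fixtures:

  import unittest


  class FakeImageService(object):
      def __init__(self):
          self.downloads = 0

      def download(self, image_id):
          self.downloads += 1
          return b'image-bits'


  class FakeComputeCache(object):
      def __init__(self, image_service):
          self.image_service = image_service
          self.cached = set()

      def cache_image(self, image_id):
          # The new pre-cache path: download only on a cache miss.
          if image_id not in self.cached:
              self.image_service.download(image_id)
              self.cached.add(image_id)

      def boot_from_image(self, image_id):
          # The existing boot path re-uses the cache the same way.
          self.cache_image(image_id)


  class TestImagePrecache(unittest.TestCase):
      def test_boot_after_precache_skips_download(self):
          glance = FakeImageService()
          compute = FakeComputeCache(glance)
          compute.cache_image('image-1')       # the new pre-cache call
          compute.boot_from_image('image-1')   # a later instance boot
          self.assertEqual(1, glance.downloads)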

Documentation Impact

This feature needs documentation for the operators in the admin guide, and of course api-ref changes.

References

History

Revisions

  Release Name    Description
  ------------    -----------
  Ussuri          Introduced