Placement Allocation Requests

https://blueprints.launchpad.net/nova/+spec/placement-allocation-requests

We propose that the placement API return to the scheduler a set of alternative allocation choices that the scheduler can use both to make a fitness decision and to attempt a claim of resources on multiple complex resource providers.

Problem description

Nova’s scheduler will soon be claiming resources by sending a POST /allocations/{consumer_uuid} request to the Placement API after selecting a target compute host. At the moment, the Nova scheduler constructs the claim request for only a single resource provider: the provider representing the target compute host that it selected. Claiming against only a single resource provider is problematic: as we move to representing more complex resource provider relationships (nested providers and providers of shared resources), we want the Nova scheduler to be able to claim resources against these nested or sharing resource providers.

To make this possible, we propose creating a new REST API endpoint in the Placement API called GET /allocation_candidates. It will return two things: a collection of HTTP request bodies, opaque to the Nova compute node and conductor, that can be provided to a POST /allocations/{consumer_uuid} request; and a set of information the Nova scheduler can use to make fitness choices for the launch requests.

Use Cases

This is an internal blueprint/spec. It is not intended to implement any particular use case, but rather to simplify and structure the communication between the Nova scheduler and the Placement API.

Proposed change

We propose adding a new GET /allocation_candidates REST API endpoint that will return both a collection of opaque request bodies that can be sent to the POST /allocations/{consumer_uuid} endpoint and a collection of information that the scheduler can use to determine best fit for an instance launch request.

Note

At this time, we make no suggestion as to how the scheduler will use the information returned from the placement API in its fitness decision. It may choose to replace the information that it currently uses from the cell databases with information from the placement API, or it could choose to merge the two somehow. That piece is left for future discussion.

The scheduler shall then proceed to choose an appropriate destination host for a build request (or more than one destination host if the RequestSpec.num_instances is greater than 1). However, instead of immediately returning this destination host, the scheduler will now work with the placement API to claim resources on the chosen host before sending its decision back to the conductor.

The scheduler will claim resources against the destination host by choosing an allocation request that contains the UUID of the destination host and calling the placement API’s POST /allocations/{consumer_uuid} endpoint, passing in the allocation request as the body of the HTTP request along with the user and project ID of the instance.
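The selection step can be sketched as follows. This is a minimal sketch, assuming the response shape shown under REST API impact. The helper names and the "project_id" and "user_id" body keys are hypothetical and chosen only for illustration; the spec does not define how the user and project ID are passed.

```python
def find_allocation_request(allocation_requests, host_uuid):
    """Return the first allocation request that touches host_uuid, or None."""
    for request in allocation_requests:
        for allocation in request["allocations"]:
            if allocation["resource_provider"]["uuid"] == host_uuid:
                return request
    return None


def build_claim_body(allocation_request, project_id, user_id):
    """Copy the opaque allocation request and attach the consumer's owner.

    The "project_id" and "user_id" key names are assumptions.
    """
    body = dict(allocation_request)  # shallow copy, original stays untouched
    body["project_id"] = project_id
    body["user_id"] = user_id
    return body
```

The scheduler treats the allocation request as opaque: it is copied and forwarded, never rewritten.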

If the attempt to claim resources fails due to a concurrent update (a condition that is normal and expected in environments with heavy load), the scheduler will retry the claim request several times and then, if still unable to claim resources against the initially-selected destination host, will move to the next host in its list of weighed hosts for the request.
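The retry-then-fallback behaviour described above might look like the sketch below. It is illustrative only: ClaimConflict, try_claim, and max_attempts are hypothetical names, and the spec does not fix the number of retries. try_claim stands in for whatever issues the claim request and is assumed to raise ClaimConflict on a concurrent update.

```python
class ClaimConflict(Exception):
    """Hypothetical error: the claim lost a race with a concurrent update."""


def claim_with_fallback(weighed_hosts, requests_by_host, try_claim,
                        max_attempts=3):
    """Try each host in weighed order; return the claimed host UUID or None.

    weighed_hosts: host UUIDs, best candidate first.
    requests_by_host: maps a host UUID to its allocation request.
    try_claim: callable that raises ClaimConflict on a concurrent update.
    """
    for host_uuid in weighed_hosts:
        request = requests_by_host.get(host_uuid)
        if request is None:
            continue  # no allocation candidate for this host
        for _attempt in range(max_attempts):
            try:
                try_claim(request)
                return host_uuid
            except ClaimConflict:
                continue  # retry the same host
        # retries exhausted: fall through to the next weighed host
    return None
```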

Alternatives

There were a number of alternative approaches considered by the team.

Alternative 1 was to have the Placement API transparently claim resources on more than one provider. The scheduler would pick the primary resource provider (compute node), attempt to POST /allocations/{consumer_uuid} to claim resources against that compute node, and the placement API would write allocation records for resources against that compute node resource provider as well as sharing resource providers (e.g. in the case of a shared storage pool) and child providers (e.g. consuming SRIOV_NET_VF resources from a particular SRIOV physical function child provider). While this alternative would shield the Nova scheduler from implementation details about sharing providers and nested provider hierarchies, the Placement API is not well-suited to make decisions about things like packing/spreading strategies or picking a particular SRIOV PF for a target network function workload. Instead, the Nova scheduler is responsible for sorting the list of providers it receives from the Placement API that meet resource and trait requirements and choosing which providers to allocate against.

Alternative 2 was to modify the existing GET /resource_providers Placement REST API endpoint to return information about sharing providers and child providers and have the scheduler reporting client contain the necessary logic to build provider hierarchies, determine which sharing provider is associated with which providers, and essentially re-build a representation of usage and inventory records in memory. This alternative kept the Placement API free of much complex logic but came at the cost of dramatically changing the returned response from an established REST API endpoint and making the usage of that REST API endpoint inconsistent depending on the caller.

Data model impact

None.

REST API impact

The new GET /allocation_candidates Placement REST API endpoint shall accept requests with the following query parameters:

  • resources: A comma-delimited string of RESOURCE_CLASS:AMOUNT pairs, one for each class of resource requested. Example: ?resources=VCPU:1,MEMORY_MB:1024,DISK_GB:100
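As an illustration of the query format, the following sketch builds the request path from a mapping of resource class to amount. The function name is hypothetical. The ":" and "," characters are left unescaped so that the output matches the example above.

```python
from urllib.parse import urlencode


def build_allocation_candidates_path(resources):
    """Build the GET path from a {resource_class: amount} mapping."""
    pairs = ",".join(
        f"{resource_class}:{amount}"
        for resource_class, amount in resources.items()
    )
    # safe=":," keeps the delimiters readable instead of percent-encoding them
    return "/allocation_candidates?" + urlencode({"resources": pairs}, safe=":,")


path = build_allocation_candidates_path(
    {"VCPU": 1, "MEMORY_MB": 1024, "DISK_GB": 100}
)
# path == "/allocation_candidates?resources=VCPU:1,MEMORY_MB:1024,DISK_GB:100"
```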

Given an HTTP request of:

GET /allocation_candidates?resources=$RESOURCES

where $RESOURCES = “VCPU:4,MEMORY_MB:16384,DISK_GB:100” and given two empty compute nodes each attached via an aggregate to a resource provider sharing DISK_GB resources, the following would be the HTTP response returned by the placement API:

{
    "allocation_requests": [
        {
            "allocations": [
                {
                    "resource_provider": {
                        "uuid": $COMPUTE_NODE1_UUID
                    },
                    "resources": {
                        "VCPU": $AMOUNT_REQUESTED_VCPU,
                        "MEMORY_MB": $AMOUNT_REQUESTED_MEMORY_MB
                    }
                },
                {
                    "resource_provider": {
                        "uuid": $SHARED_STORAGE_UUID
                    },
                    "resources": {
                        "DISK_GB": $AMOUNT_REQUESTED_DISK_GB
                    }
                }
            ]
        },
        {
            "allocations": [
                {
                    "resource_provider": {
                        "uuid": $COMPUTE_NODE2_UUID
                    },
                    "resources": {
                        "VCPU": $AMOUNT_REQUESTED_VCPU,
                        "MEMORY_MB": $AMOUNT_REQUESTED_MEMORY_MB
                    }
                },
                {
                    "resource_provider": {
                        "uuid": $SHARED_STORAGE_UUID
                    },
                    "resources": {
                        "DISK_GB": $AMOUNT_REQUESTED_DISK_GB
                    }
                }
            ]
        }
    ],
    "provider_summaries": {
        $COMPUTE_NODE1_UUID: {
            "resources": {
                "VCPU": {
                    "capacity": 120,   # NOTE: capacity is (total - reserved) * allocation_ratio
                    "used": 4
                },
                "MEMORY_MB": {
                    "capacity": 1024,
                    "used": 48
                }
            }
        },
        $COMPUTE_NODE2_UUID: {
            "resources": {
                "VCPU": {
                    "capacity": 120,
                    "used": 4
                },
                "MEMORY_MB": {
                    "capacity": 1024,
                    "used": 48
                }
            }
        },
        $SHARED_STORAGE_UUID: {
            "resources": {
                "DISK_GB": {
                    "capacity": 2000,
                    "used": 100
                }
            }
        }
    }
}

Note that we are not dealing with either nested resource providers or traits in the above. Those concepts will be added to the response in future patches.
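One way the scheduler could use the provider_summaries section is to derive the remaining headroom per provider, as in the sketch below. The helper names are hypothetical, and the spec deliberately leaves open how the scheduler uses this data. The capacity helper assumes capacity is computed as (total - reserved) * allocation_ratio, truncated to an integer; the inventory numbers in the usage lines are made up for illustration.

```python
def capacity(total, reserved, allocation_ratio):
    """Assumed capacity formula: (total - reserved) * allocation_ratio."""
    return int((total - reserved) * allocation_ratio)


def free_resources(provider_summaries):
    """Map each provider UUID to {resource_class: capacity - used}."""
    return {
        uuid: {
            resource_class: usage["capacity"] - usage["used"]
            for resource_class, usage in summary["resources"].items()
        }
        for uuid, summary in provider_summaries.items()
    }


# Values taken from the example response above; UUIDs are placeholders.
summaries = {
    "compute-node-1": {"resources": {
        "VCPU": {"capacity": 120, "used": 4},
        "MEMORY_MB": {"capacity": 1024, "used": 48},
    }},
    "shared-storage": {"resources": {
        "DISK_GB": {"capacity": 2000, "used": 100},
    }},
}
free = free_resources(summaries)
# free["compute-node-1"] == {"VCPU": 116, "MEMORY_MB": 976}
# free["shared-storage"] == {"DISK_GB": 1900}

# Hypothetical inventory: 16 VCPUs, 1 reserved, allocation ratio 8.0
vcpu_capacity = capacity(16, 1, 8.0)  # 120
```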

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

Returning a list of allocation requests that all meet the Nova scheduler’s request for resources/traits and allowing the Nova scheduler to iterate over these allocation requests, retrying them if a concurrent claim happens, should actually increase the throughput of the Nova scheduler by reducing the amount of time between resource constraint retries.

Other deployer impact

The Placement service will need to be upgraded before the nova-scheduler service.

Developer impact

None.

Implementation

Assignee(s)

jaypipes

Work Items

  1. Implement the API logic in the Placement service with a new microversion.

  2. Update the FilterScheduler driver to use the new Placement API.

Testing

Unit and in-tree functional tests. Integration testing will be covered by existing Tempest testing.

Documentation Impact

There should be good devref documentation that describes in more explicit detail what the placement service is responsible for, what the Nova scheduler is responsible for, and how this new API call will be used to share information between placement and the Nova scheduler.

References

History

Revisions

Release Name    Description
Pike            Introduced