Pre-filter disabled computes¶
https://blueprints.launchpad.net/nova/+spec/pre-filter-disabled-computes
This blueprint proposes to have nova report a trait to placement when a compute service is disabled, and to add a request filter in the scheduler which will use that trait, as a forbidden trait, to filter out allocation candidates on disabled computes.
Problem description¶
In a large deployment with several thousand compute nodes, the
[scheduler]/max_placement_results
configuration option may be set low enough
that placement returns allocation candidates which are mostly (or all) for
disabled compute nodes, which can lead to a NoValidHost error during
scheduling.
Use Cases¶
As an operator, I want to limit max_placement_results
to improve scheduler
throughput but not suffer NoValidHost errors because placement only gives
back disabled computes.
As a developer, I want to pre-filter disabled computes in placement which
should be faster (in SQL) than the legacy ComputeFilter
running over the
results in python. In other words, I want to ask placement better questions
to get back more targeted results.
As a user, I want to be able to create and resize servers without hitting NoValidHost errors because the cloud is performing a rolling upgrade and has disabled computes.
Proposed change¶
Summary¶
Nova will start reporting a COMPUTE_STATUS_DISABLED
trait to placement
for any compute node resource provider managed by a disabled compute service
host. When the service is enabled, the trait will be removed.
A scheduler request filter will be added which will modify the RequestSpec to filter out providers with the new trait using forbidden trait filtering syntax.
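For illustration only, the end result is that the scheduler's allocation candidate query to placement carries the trait as a forbidden requirement using the existing forbidden-trait syntax (available since placement microversion 1.22); the resource amounts below are made-up example values:

    GET /allocation_candidates?resources=VCPU:1,MEMORY_MB:2048,DISK_GB:20&required=!COMPUTE_STATUS_DISABLED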
Compute changes¶
For the compute service there are two changes.
set_host_enabled¶
The compute service already has a set_host_enabled method which is a synchronous RPC call. Historically this was only implemented by the xenapi driver for use with the (now deprecated) Update Host Status API.
This blueprint proposes to use that compute method to generically add/remove
the COMPUTE_STATUS_DISABLED
trait on the compute nodes managed by that
service (note that for ironic a compute service host can manage multiple
nodes). The trait will be managed on only the root compute node resource
provider in placement, not any nested providers.
The actual implementation will be part of the ComputeVirtAPI so that the libvirt driver has access to it when it automatically disables or enables the compute node based on events from the hypervisor. [1]
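The trait update itself boils down to a read-modify-write of the root provider's traits in placement. The following is a minimal standalone sketch against placement's traits API (GET/PUT /resource_providers/{uuid}/traits); the endpoint URL, token handling and error handling are simplified assumptions, and the real implementation would go through nova's placement report client instead:

    # Minimal sketch, not nova's actual code: toggle the COMPUTE_STATUS_DISABLED
    # trait on a root compute node resource provider via placement's traits API.
    import requests

    PLACEMENT_URL = 'http://placement.example.com'    # assumed placement endpoint
    HEADERS = {
        'X-Auth-Token': 'ADMIN_TOKEN',                # assumed admin token
        'OpenStack-API-Version': 'placement 1.6',     # traits API requires >= 1.6
    }
    TRAIT = 'COMPUTE_STATUS_DISABLED'

    def set_disabled_trait(rp_uuid, disabled):
        url = '%s/resource_providers/%s/traits' % (PLACEMENT_URL, rp_uuid)
        # Read the current traits along with the provider generation.
        current = requests.get(url, headers=HEADERS)
        current.raise_for_status()
        body = current.json()
        traits = set(body['traits'])
        if disabled:
            traits.add(TRAIT)
        else:
            traits.discard(TRAIT)
        # Write back with the generation so a concurrent update shows up as a
        # 409 conflict and the caller can re-read and retry.
        payload = {
            'resource_provider_generation': body['resource_provider_generation'],
            'traits': sorted(traits),
        }
        requests.put(url, json=payload, headers=HEADERS).raise_for_status()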
update_provider_tree¶
During the update_available_resource
operation which is called during
service start and periodically, the update_provider_tree flow will sync
the COMPUTE_STATUS_DISABLED
trait based on the current disabled status
of the service. This is useful to:
- Sync the trait on older disabled computes during the upgrade.
- Sync the trait in case the API<>compute interaction fails for some reason, like a dropped RPC call.
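As a rough sketch of that sync step (a tiny stand-in object is used here in place of nova's ProviderTree interface, which is an assumption for illustration; nova's real interface is richer):

    # Illustrative stand-in only: the periodic update_provider_tree pass
    # reconciles the trait with the service's disabled flag, regardless of how
    # the provider got into its current state.
    class FakeProviderTree:
        def __init__(self, traits=()):
            self.traits = set(traits)

    def update_provider_tree(provider_tree, service_is_disabled):
        trait = 'COMPUTE_STATUS_DISABLED'
        if service_is_disabled:
            provider_tree.traits.add(trait)
        else:
            provider_tree.traits.discard(trait)

    # Example: an old compute that was disabled before the upgrade picks up the
    # trait the first time the periodic task runs after it restarts.
    tree = FakeProviderTree(traits={'HW_CPU_X86_AVX'})
    update_provider_tree(tree, service_is_disabled=True)
    assert 'COMPUTE_STATUS_DISABLED' in tree.traits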
API changes¶
When the os-services API(s) are used to enable or disable a compute service,
the API will synchronously call the compute service via the
set_host_enabled
RPC method to reflect the trait on the
related compute node resource providers in placement appropriately. For
example, if compute service A is disabled, the trait will be added. When
compute service A is enabled, the trait will be removed.
See the Upgrade impact section for dealing with old computes during a rolling upgrade.
Down computes¶
It is possible to disable a down compute service since currently that disable
operation is just updating the services.disabled
value in the cell
database. With this change, the API will have to check if the compute service
is up using the service group API. If the service is down, the API will not
call the set_host_enabled
compute method and instead just update the
services.disabled
value in the DB as today and return. When the compute
service is restarted, the update_provider_tree flow will sync the trait.
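Putting the API-side behavior together, the control flow is roughly the sketch below; the servicegroup and compute RPC handles are passed in as parameters and their exact method names are assumptions for illustration rather than taken from nova's code:

    # Rough sketch of the os-services disable/enable handling described above.
    def set_service_disabled(context, service, disabled,
                             servicegroup_api, compute_rpcapi):
        # Always persist the status change in the cell database, as today.
        service.disabled = disabled
        service.save()
        # Only reach out to the compute service if it is actually up; a down
        # compute will sync the trait itself via update_provider_tree when it
        # is restarted.
        if servicegroup_api.service_is_up(service):
            compute_rpcapi.set_host_enabled(
                context, host=service.host, enabled=not disabled)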
Scheduler changes¶
A request filter will be added which will modify the RequestSpec to forbid
providers with the COMPUTE_STATUS_DISABLED
trait. The changes to the
RequestSpec will not be persisted.
There will not be a new configuration option for the request filter, meaning it will always be enabled.
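A minimal sketch of such a request filter is shown below; the (ctxt, request_spec) signature mirrors nova's existing request filters, and expressing the forbidden trait through the flavor's extra specs is just one possible realization of the forbidden trait filtering syntax, not necessarily the final implementation detail. The trait string literal is used because the trait has not been added to os-traits yet (see the Dependencies section):

    # Illustrative sketch, not the actual nova code.
    COMPUTE_STATUS_DISABLED = 'COMPUTE_STATUS_DISABLED'

    def compute_status_filter(ctxt, request_spec):
        """Exclude providers that have the COMPUTE_STATUS_DISABLED trait."""
        # Mutate only the in-memory RequestSpec; the change is never persisted.
        extra_spec = 'trait:%s' % COMPUTE_STATUS_DISABLED
        request_spec.flavor.extra_specs[extra_spec] = 'forbidden'
        return True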
Note
In addition to filtering based on the disabled status of a node,
the ComputeFilter
also performs an is_up check using the
service group API. The result of the “is up” check depends on whether
the service was forced down or has not “reported in” within
some configurable interval, meaning the service might be down. This
blueprint is not going to try to report the up/down status of a
compute service using the new trait since that gets fairly complicated
and is more of an edge case for unexpected outages.
Alternatives¶
Rather than using a forbidden trait, we could hard-code a resource provider aggregate UUID in nova and add/remove compute node resource providers to/from that aggregate in placement as the service is disabled/enabled.
Pros: Aggregates may be more natural since they are a grouping of providers.
Cons: Using an aggregate would be harder to debug from an operational perspective since provider aggregates do not have a name or any metadata, so an operator might wonder why a certain provider is not a candidate for scheduling yet is in an aggregate they did not create (or do not see in the nova host aggregates API). Using a trait per provider with a clear name like COMPUTE_STATUS_DISABLED should make it obvious to a human that the provider is not a scheduling candidate because it is disabled.
Rather than using a forbidden trait or aggregate, nova could set the reserved inventory on each provider equal to the total inventory for each resource class on that provider, like what the ironic driver does when a node is undergoing maintenance and should be taken out of scheduling consideration. [2]
Pros: No new traits, can just follow the ironic driver pattern.
Cons: Ironic node resource providers are expected to have a single resource class in inventory so it is easier to manage changing the reserved value on just that one class, but for non-baremetal providers they are reporting at least three resource classes (VCPU, MEMORY_MB and DISK_GB) so it would be more complicated to set reserved = total on all of those classes. Furthermore, changing the inventory is not configurable like a request filter is.
Long-term, we could consider changing the ironic driver node maintenance code to just set/unset the COMPUTE_STATUS_DISABLED trait.
Rather than the os-services API synchronously calling the set_host_enabled method on the compute service, the API could just toggle the trait on the affected providers directly.
Pros: No blocking calls from the API to the compute service when changing the disabled status of the service - although one could argue the blocking nature proposed in the spec is advantageous so the admin gets confirmation that the service is disabled and will be pre-filtered properly during scheduling.
Cons: Potential duplication of the code that manages the trait which could violate the principle of single responsibility.
Do nothing and instead focus efforts on optimizing the performance of the nova scheduler, which is likely the root reason that large deployments need to severely limit max_placement_results [3]. However, regardless of optimizing the scheduler (which is something we should do anyway), part of making scheduling faster in nova depends on nova asking placement more informed questions and placement providing a smaller set of allocation candidates, i.e. filtering in SQL (placement) rather than in python (nova).
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None. Operators can use the osc-placement CLI to view and manage provider traits directly.
Performance Impact¶
In one respect this should improve scheduler performance during an upgrade or maintenance of a large cloud which has many disabled compute services since placement would be returning fewer allocation candidates for the nova scheduler to filter.
On the other hand, this would add overhead to the os-services
API when
changing the disabled status on a compute service.
Other deployer impact¶
None
Developer impact¶
None
Upgrade impact¶
There are a few upgrade considerations for this change.
- The API will check the RPC API version of the target compute service and, if it is old, the set_host_enabled method will not be called. When the compute service is upgraded and restarted, the update_provider_tree call will sync the trait.
- Existing disabled computes need to have the trait reported on upgrade, which will happen via the update_available_resource flow (update_provider_tree) called on start of the compute after it is upgraded.
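As a purely hypothetical illustration of the first point (the version constant and threshold below are placeholders, not nova's actual values):

    # Hypothetical sketch: only make the synchronous RPC call when the target
    # compute service is new enough; otherwise rely on update_provider_tree to
    # sync the trait once the compute is upgraded and restarted.
    MIN_SET_HOST_ENABLED_SERVICE_VERSION = 99  # placeholder, not a real value

    def should_call_set_host_enabled(service):
        return service.version >= MIN_SET_HOST_ENABLED_SERVICE_VERSION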
Implementation¶
Assignee(s)¶
- Primary assignee:
Matt Riedemann (mriedem) <mriedem.os@gmail.com>
- Other contributors:
None
Work Items¶
- Make the changes to the compute service:
  - The set_host_enabled method
  - The update_provider_tree flow
  - The libvirt driver to call back to add/remove the trait when it is notified of the hypervisor going down or up
- Plumb the os-services API to call the set_host_enabled compute service method when the disabled status changes on a compute service
- Add a request filter which will add a forbidden trait to the RequestSpec to filter out disabled compute node resource providers during the GET /allocation_candidates call to placement.
Dependencies¶
The COMPUTE_STATUS_DISABLED
trait would need to be added to the
os-traits library.
Testing¶
Unit and functional tests should be sufficient for this feature.
Documentation Impact¶
The new scheduler request filter will be documented in the admin docs. [4]
References¶
Footnotes¶
Other¶
The original bug reported by CERN: https://bugs.launchpad.net/nova/+bug/1805984
Initial proof of concept: https://review.opendev.org/654596/
Train PTG mailing list mention: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005908.html
History¶
Release Name | Description
---|---
Train | Introduced