Pre-filter disabled computes¶
This blueprint proposes to make nova report a trait to placement when a compute service is disabled and a request filter in the scheduler which will use that trait to filter out allocation candidates with that forbidden trait.
In a large deployment with several thousand compute nodes, the
[scheduler]/max_placement_results configuration option may be limited
such that placement returns allocation candidates which are mostly (or all)
disabled compute nodes, which can lead to a NoValidHost error during
As an operator, I want to limit
max_placement_results to improve scheduler
throughput but not suffer NoValidHost errors because placement only gives
back disabled computes.
As a developer, I want to pre-filter disabled computes in placement which
should be faster (in SQL) than the legacy
ComputeFilter running over the
results in python. In other words, I want to ask placement better questions
to get back more targeted results.
As a user, I want to be able to create and resize servers without hitting NoValidHost errors because the cloud is performing a rolling upgrade and has disabled computes.
Nova will start reporting a
COMPUTE_STATUS_DISABLED trait to placement
for any compute node resource provider managed by a disabled compute service
host. When the service is enabled, the trait will be removed.
For the compute service there are two changes.
The compute service already has a set_host_enabled method which is a synchronous RPC call. Historically this was only implemented by the xenapi driver for use with the (now deprecated) Update Host Status API.
This blueprint proposes to use that compute method to generically add/remove
COMPUTE_STATUS_DISABLED trait on the compute nodes managed by that
service (note that for ironic a compute service host can manage multiple
nodes). The trait will be managed on only the root compute node resource
provider in placement, not any nested providers.
The actual implementation will be part of the ComputeVirtAPI so that the libvirt driver has access to it when it automatically disables or enables the compute node based on events from the hypervisor. 
update_available_resource operation which is called during
service start and periodically, the update_provider_tree flow will sync
COMPUTE_STATUS_DISABLED trait based on the current disabled status
of the service. This is useful to:
Sync the trait on older disabled computes during the upgrade.
Sync the trait in case the API<>compute interaction fails for some reason, like a dropped RPC call.
When the os-services API(s) are used to enable or disable a compute service,
the API will synchronously call the compute service via the
set_host_enabled RPC method to reflect the trait on the
related compute node resource providers in placement appropriately. For
example, if compute service A is disabled, the trait will be added. When
compute service A is enabled, the trait will be removed.
See the Upgrade impact section for dealing with old computes during a rolling upgrade.
It is possible to disable a down compute service since currently that disable
operation is just updating the
services.disabled value in the cell
database. With this change, the API will have to check if the compute service
is up using the service group API. If the service is down, the API will not
set_host_enabled compute method and instead just update the
services.disabled value in the DB as today and return. When the compute
service is restarted, the update_provider_tree flow will sync the trait.
A request filter will be added which will modify the RequestSpec to forbid
providers with the
COMPUTE_STATUS_DISABLED trait. The changes to the
RequestSpec will not be persisted.
There will not be a new configuration option for the request filter meaning it will always be enabled.
In addition to filtering based on the disabled status of a node,
ComputeFilter also performs an is_up check using the
service group API. The result of the “is up” check depends on whether
or not the service was forced down or has not “reported in” within
some configurable interval meaning the service might be down. This
blueprint is not going to try and report the up/down status of a
compute service using the new trait since it gets fairly complicated
and is more of an edge case for unexpected outages.
Rather than using a forbidden trait, we could hard-code a resource provider aggregate UUID in nova and add/remove compute node resource providers to/from that aggregate in placement as the service is disabled/enabled.
Pros: Aggregates may be more natural since they are a grouping of providers.
Cons: Using an aggregate would be harder to debug from an operational perspective since provider aggregates do not have any name or metadata so an operator might wonder why a certain provider is not a candidate for scheduling but is in an aggregate they did not create (or do not see in the nova host aggregates API). Using a trait per provider with a clear name like
COMPUTE_STATUS_DISABLEDshould make it obvious to a human that the provider is not a scheduling candidate because it is disabled.
Rather than using a forbidden trait or aggregate, nova could set the reserved inventory on each provider equal to the total inventory for each resource class on that provider, like what the ironic driver does when a node is undergoing maintenance and should be taken out of scheduling consideration. 
Pros: No new traits, can just follow the ironic driver pattern.
Cons: Ironic node resource providers are expected to have a single resource class in inventory so it is easier to manage changing the reserved value on just that one class, but for non-baremetal providers they are reporting at least three resource classes (VCPU, MEMORY_MB and DISK_GB) so it would be more complicated to set reserved = total on all of those classes. Furthermore, changing the inventory is not configurable like a request filter is.
Long-term, we could consider changing the ironic driver node maintenance code to just set/unset the
Rather than the
os-servicesAPI synchronously calling the
set_host_enabledmethod on the compute service, the API could just toggle the trait on the affected providers directly.
Pros: No blocking calls from the API to the compute service when changing the disabled status of the service - although one could argue the blocking nature proposed in the spec is advantageous so the admin gets confirmation that the service is disabled and will be pre-filtered properly during scheduling.
Cons: Potential duplication of the code that manages the trait which could violate the principle of single responsibility.
Do nothing and instead focus efforts on optimizing the performance of the nova scheduler which is likely the root cause that large deployments need to severely limit
max_placement_results. However, regardless of optimizing the scheduler (which is something we should do anyway), part of making scheduling faster in nova is dependent on nova asking placement more informed questions and placement providing a smaller set of allocation candidates, i.e. filter in SQL (placement) rather than in python (nova).
Data model impact¶
REST API impact¶
Other end user impact¶
None. Operators can use the osc-placement CLI to view and manage provider traits directly.
In one respect this should improve scheduler performance during an upgrade or maintenance of a large cloud which has many disabled compute services since placement would be returning fewer allocation candidates for the nova scheduler to filter.
On the other hand, this would add overhead to the
os-services API when
changing the disabled status on a compute service.
Other deployer impact¶
There are a few upgrade considerations for this change.
The API will check the RPC API version of the target compute service and if it is old the
set_host_enabledmethod will not be called. When the compute service is upgraded and restarted, the
update_provider_treecall will sync the trait.
Existing disabled computes need to have the trait reported on upgrade which will happen via the
update_available_resourceflow (update_provider_tree) called on start of the compute after it is upgraded.
- Primary assignee:
Matt Riedemann (mriedem) <firstname.lastname@example.org>
- Other contributors:
Make the changes to the compute service:
The libvirt driver to callback to add/remove the trait when it is notified of the hypervisor going down or up
os-servicesAPI to call the
set_host_enabledcompute service method when the disabled status changes on a compute service
Add a request filter which will add a forbidden trait to the RequestSpec to filter out disabled compute node resource providers during the GET /allocation_candidates call to placement.
COMPUTE_STATUS_DISABLED trait would need to be added to the
Unit and functional tests should be sufficient for this feature.
The new scheduler request filter will be documented in the admin docs. 
The original bug reported by CERN: https://bugs.launchpad.net/nova/+bug/1805984
Initial proof of concept: https://review.opendev.org/654596/
Train PTG mailing list mention: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005908.html