Resource Providers - Scheduler Filters in DB
This blueprint aims to have the scheduler call the placement API to get the list of resource providers, allowing compute nodes to be pre-filtered out of evaluation during select_destinations().
Currently, on each call to the scheduler’s select_destinations() RPC method, the scheduler retrieves a list of ComputeNode objects, one object for every compute node in the entire deployment. The scheduler constructs a set of nova.scheduler.host_manager.HostState objects, one for each compute node. Once the host state objects are constructed, the scheduler loops through them, passing the host state object to the collection of nova.scheduler.filters.Filter objects that are enabled for the deployment.
Many of these scheduler filters do nothing more than calculate the amount of a particular resource that a compute node has available to it and return False if the amount requested is greater than the available amount of that type of resource.
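The check these filters perform can be sketched in a few lines. The class and field names below are illustrative stand-ins, not Nova's actual filter classes; the point is that the entire "filter" is an arithmetic comparison that a database query could do instead:

```python
# Hypothetical, simplified version of a resource-based scheduler filter
# such as the RAM filter described above. Names are illustrative only.

class SimpleRamFilter:
    """Reject hosts whose free RAM (after applying the allocation
    ratio) is smaller than the amount the request asks for."""

    def host_passes(self, host_state, spec):
        requested_mb = spec['memory_mb']
        # Oversubscription: usable capacity is physical RAM * ratio.
        limit_mb = host_state['total_mb'] * host_state['ram_allocation_ratio']
        free_mb = limit_mb - host_state['used_mb']
        return free_mb >= requested_mb
```

Running this comparison in Python requires every compute node record to be loaded first, which is exactly the inefficiency described below.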
Having to return all compute node records in the entire deployment is extremely wasteful and this inefficiency gets worse the larger the deployment is. The filter loop is essentially implementing a SQL WHERE clause, but in Python instead of a more efficient database query.
As a CERN user, I don’t want to wait for the nova-scheduler to process 10K+ compute nodes to find a host on which to build my server.
We propose to winnow the set of compute nodes the FilterScheduler evaluates by only returning the compute node resource providers that meet the requested resource constraints. This will dramatically reduce the number of compute node records that need to be pulled from the database on every call to select_destinations(). Instead of making that database call, we would make an HTTP call to a specific REST resource of the placement API, with a request that returns the list of resource provider UUIDs matching the resource and trait criteria derived from the original RequestSpec object.
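The resource amounts from the RequestSpec would be encoded into the qualified-resources query syntax used by the placement API (GET /resource_providers?resources=RESOURCE_CLASS:AMOUNT,...). A small sketch of building that query value, with hypothetical helper naming:

```python
# Sketch of turning flavor-style resource amounts into the 'resources'
# query value the scheduler could send to the placement API, e.g.
#   GET /resource_providers?resources=DISK_GB:20,MEMORY_MB:4096,VCPU:2
# The helper name is hypothetical; the query syntax follows the
# placement API's RESOURCE_CLASS:AMOUNT convention.

def build_resources_query(vcpus, memory_mb, disk_gb):
    """Return the comma-separated 'resources' query value."""
    amounts = {'VCPU': vcpus, 'MEMORY_MB': memory_mb, 'DISK_GB': disk_gb}
    # Sort for a deterministic query string.
    return ','.join('%s:%d' % (rc, amt)
                    for rc, amt in sorted(amounts.items()))
```

The placement service would answer with only the providers whose inventories can satisfy these amounts, which is the winnowing step described above.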
This blueprint doesn’t aim to change the CachingScheduler driver, which overrides the method that fetches the list of hosts. That means the CachingScheduler will not call the placement API.
We could create an entirely new scheduler driver instead of modifying the FilterScheduler. Jay is not really in favor of this approach because it introduces more complexity to the system than directly using the placement API for that purpose.
Data model impact

REST API impact

Other end user impact
Jay built a benchmarking harness demonstrating that the more compute nodes there are in the deployment, the greater the gains from filtering on the database side versus filtering on the Python side and returning a record for each compute node in the system. The benchmark reads the database directly, but we assume the extra overhead of the HTTP call is not significant.
Other deployer impact
In Pike, the CoreFilter, RAMFilter and DiskFilter scheduler filters will be removed from the list of default scheduler filters. Existing deployments will, of course, continue to have those filters in their list of enabled filters. We will log a warning saying those filters are now redundant and can safely be removed from the nova.conf file.
Deployers who disabled the RAMFilter, DiskFilter or CoreFilter may want to manually set the allocation ratio for the appropriate inventory records to a very large value to simulate not accounting for that particular resource class in scheduling decisions. For instance, if a deployer disabled the DiskFilter in their deployment because they don't care about disk usage, they would set the allocation_ratio to 10000.0 for each inventory record of the DISK_GB resource class for all compute nodes in their deployment via the new placement REST API.
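As a sketch, the inventory record a deployer would PUT to the placement API (PUT /resource_providers/{uuid}/inventories/DISK_GB) might look like the following. The field names mirror the placement inventory schema; the helper name and the total/generation values are illustrative:

```python
# Hypothetical helper building the inventory body that effectively
# disables DISK_GB accounting by inflating the allocation ratio.
# Intended for PUT /resource_providers/{uuid}/inventories/DISK_GB.

def disk_noop_inventory(total_gb, generation):
    return {
        # Must match the provider's current generation or the PUT
        # is rejected as a concurrent-update conflict.
        'resource_provider_generation': generation,
        'total': total_gb,
        'reserved': 0,
        'min_unit': 1,
        'max_unit': total_gb,
        'step_size': 1,
        # A very large ratio ~= "never run out of disk" as far as
        # scheduling decisions are concerned.
        'allocation_ratio': 10000.0,
    }
```

This would need to be repeated for each compute node resource provider in the deployment.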
These changes are designed to be introduced into Nova in a way that “self-heals”. In Newton, the placement REST API was introduced and the nova-computes would begin writing inventory and allocation records to the placement API for their VCPU, MEMORY_MB, and DISK_GB resources. If the placement service was not set up, the nova-compute logged a warning about the placement service needing to be started and a new service endpoint created in Keystone so that the nova-computes could find the placement API.
In Ocata, the placement service is required, however we will build a sort of self-healing process into the new behaviour of the scheduler calling to the placement API to winnow the set of compute hosts that are acted upon. If the placement service has been set up but all nova-computes have yet to be upgraded to Ocata, the scheduler will continue to use its existing behaviour of querying the Nova cell database compute_nodes table. Once all nova-compute workers have been upgraded to Ocata, the new Ocata scheduler will attempt to contact the placement service to get a list of resource providers (compute hosts) that meet a set of requested resource amounts.
In the scenario of a freshly-upgraded Ocata deployment that previously had not had the placement service established (and thus no nova-computes had successfully written records to the placement database), the scheduler may momentarily return a NoValidHosts while the placement database is populated.
As restarts (or upgrades+restarts) of the nova-computes are rolled out, the placement database will begin to fill up with allocation and inventory information. Please note that the scheduler will not use the placement API for decisions until all nova-compute workers have been upgraded to Ocata. There is a check for service version in the scheduler that requires all nova-computes in the deployment to be upgraded to Ocata before the scheduler will begin using the placement API for scheduling decisions.
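The upgrade gate described above reduces to a single comparison. In Nova the minimum version comes from the service version machinery (nova.objects.Service.get_minimum_version); the constant and helper below are illustrative stand-ins:

```python
# Sketch of the "all computes upgraded?" gate. The version constant is
# a hypothetical stand-in for the real Ocata minimum service version.

OCATA_SERVICE_VERSION = 16  # illustrative value

def use_placement(min_compute_service_version):
    """Use the placement API for scheduling only once every
    nova-compute in the deployment reports a service version at or
    above the Ocata minimum; otherwise fall back to querying the
    cell database's compute_nodes table."""
    return min_compute_service_version >= OCATA_SERVICE_VERSION
```

Because the check is against the *minimum* version across all nova-compute services, a single straggler compute keeps the scheduler on the legacy database path.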
- Primary assignee:
- Other contributors:
Add a new method that accepts a nova.objects.RequestSpec object and transforms it into a list of resource and trait criteria.
Provide a method to call the placement API for getting the list of resource providers that match those criteria.
Translate that list of resource providers into a list of hosts and replace the existing DB call with the HTTP call, for the FilterScheduler driver only.
Leave NUMA and PCI device filters on the Python side of the scheduler for now until the nested-resource-providers blueprint is completed. We can have separate blueprints for handling NUMA and PCI resources via filters on the DB side.
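The final work item above, mapping placement's answer back onto the scheduler's host set, can be sketched as a simple set intersection. The function and field names are hypothetical, not Nova's actual HostManager interface:

```python
# Minimal sketch of the last step of the pipeline: given the provider
# UUIDs returned by placement, keep only the matching HostState
# entries so the Python-side filters run over a much smaller set.

def filter_hosts_by_providers(host_states, provider_uuids):
    """Keep only hosts whose compute node UUID was returned by
    the placement API as a matching resource provider."""
    allowed = set(provider_uuids)
    return [h for h in host_states if h['uuid'] in allowed]
```

The remaining enabled filters (NUMA, PCI, etc.) would then iterate over this reduced list exactly as they do today.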
The following blueprints are dependencies for this work:
Existing functional tests should adequately validate that swapping out Python-side filtering of RAM, vCPU and local disk for DB-side filtering produces no difference in the scheduling results of select_destinations() calls.
Make sure we document the redundant-filter log warnings and how to remedy them, as well as how to use the allocation_ratio to simulate disabled filters.