..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Resource Providers - Scheduler Filters in DB
============================================

https://blueprints.launchpad.net/nova/+spec/resource-providers-scheduler-db-filters

This blueprint aims to have the scheduler calling the placement API for getting
the list of resource providers that could allow to pre-filter compute nodes
from evaluation during `select_destinations()`.

Problem description
===================

Currently, on each call to the scheduler's `select_destinations()` RPC method,
the scheduler retrieves a list of `ComputeNode` objects, one object for *every*
compute node in the entire deployment. The scheduler constructs a set of
`nova.scheduler.host_manager.HostState` objects, one for each compute node.
Once the host state objects are constructed, the scheduler loops through them,
passing the host state object to the collection of
`nova.scheduler.filters.Filter` objects that are enabled for the deployment.

Many of these scheduler filters do nothing more than calculate the amount of a
particular resource that a compute node has available to it and return `False`
if the amount requested is greater than the available amount of that type of
resource.

Having to return all compute node records in the entire deployment is
extremely wasteful and this inefficiency gets worse the larger the deployment
is. The filter loop is essentially implementing a `SQL` `WHERE` clause, but in
Python instead of a more efficient database query.

Use Cases
----------

As a CERN user, I don't want to wait for the nova-scheduler to process 10K+
compute nodes to find a host on which to build my server.

Proposed change
===============

We propose to winnow the set of compute nodes the FilterScheduler evaluates by
only returning the compute node resource providers that meet requested resource
constraints.  This will dramatically reduce the amount of compute node records
that need to be pulled from the database on every call to
`select_destinations()`.  Instead of doing that database call, we would rather
make a HTTP call to the placement API on a specific REST resource with a
request that would return the list of resource providers' UUIDs that would
match requested resources and traits criterias based on the original
RequestSpec object.

This blueprint doesn't aim to change the CachingScheduler driver, which
overrides the method that fetches the list of hosts. That means the
CachingScheduler will *not* call the placement API.

Alternatives
------------

We could create an entirely new scheduler driver instead of modifying the
`FilterScheduler`. Jay is not really in favor of this approach because it
introduces more complexity to the system than directly using the placement API
for that purpose.

Data model impact
-----------------

None.

REST API impact
---------------

None.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

None.

Performance Impact
------------------

Jay built a benchmarking harness_ that demonstrates that the more compute nodes
in the deployment, the greater the gains are from doing filtering on the
database side versus doing the filtering on the Python side and returning a
record for each compute node in the system. That is directly reading the DB but
we assume the extra HTTP penalty as something not really impactful.

.. _harness: http://github.com/jaypipes/placement-bench

Other deployer impact
---------------------

In Pike, the CoreFilter, RAMFilter and DiskFilter scheduler filters will be
removed from the list of default scheduler filters. Of course, for existing
deployments they will continue to have those filters in their list of enabled
filters. We will log a warning saying those filters are now redundant and can
safely be removed from the nova.conf file.

For deployers who disabled the RAMFilter, DiskFilter or CoreFilter, they may
manually want to set the allocation ratio for the appropriate inventory records
to a very large value to simulate not accounting for that particular resource
class in scheduling decisions. For instance, if a deployer disabled the
DiskFilter in their deployment because they don't care about disk usage, they
would set the `allocation_ratio` to 10000.0 for each inventory record of
`DISK_GB` resource class for all compute nodes in their deployment via the new
placement REST API.

These changes are designed to be introduced into Nova in a way that
"self-heals". In Newton, the placement REST API was introduced and the
nova-computes would begin writing inventory and allocation records to the
placement API for their VCPU, MEMORY_MB, and DISK_GB resources. If the
placement service was not set up, the nova-compute logged a warning about the
placement service needing to be started and a new service endpoint created in
Keystone so that the nova-computes could find the placement API.

In Ocata, the placement service is required, however we will build a sort of
self-healing process into the new behaviour of the scheduler calling to the
placement API to winnow the set of compute hosts that are acted upon. If the
placement service has been set up but all nova-computes have yet to be upgraded
to Ocata, the scheduler will continue to use its existing behaviour of querying
the Nova cell database compute_nodes table. Once all nova-compute workers have
been upgraded to Ocata, the new Ocata scheduler will attempt to contact the
placement service to get a list of resource providers (compute hosts) that meet
a set of requested resource amounts.

In the scenario of a freshly-upgraded Ocata deployment that previously had not
had the placement service established (and thus no nova-computes had
successfully written records to the placement database), the scheduler may
momentarily return a NoValidHosts while the placement database is populated.

As restarts (or upgrades+restarts) of the nova-computes are rolled out, the
placement database will begin to fill up with allocation and inventory
information. Please note that the scheduler will not use the placement API for
decisions until **all** nova-compute workers have been upgraded to Ocata. There
is a check for service version in the scheduler that requires all nova-computes
in the deployment to be upgraded to Ocata before the scheduler will begin using
the placement API for scheduling decisions.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bauzas

Other contributors:
  cdent
  jaypipes

Work Items
----------

* Add a new method that accepts a `nova.objects.RequestSpec` object and
  transform that object into a list of resource and traits criteria
* Provide a method to call the placement API for getting the list of
  resource providers that match those criteria.
* Translate that list of resource providers into a list of hosts and replace
  the existing DB call by the HTTP call for the FilterScheduler driver only.
* Leave NUMA and PCI device filters on the Python side of the scheduler for now
  until the `nested-resource-providers` blueprint is completed. We can have
  separate blueprints for handling NUMA and PCI resources via filters on the
  DB side.


Dependencies
============

The following blueprints are dependencies for this work:

* `resource-providers-get-by-request`

Testing
=======

Existing functional tests should adequately validate that swapping out DB-side
filtering for Python-side filtering of RAM, vCPU and local disk produces no
different scheduling results from `select_destinations()` calls.

Documentation Impact
====================

Make sure we document the redundant filter log warnings and how to remedy as
well as document how to use the `allocation_ratio` to simulate disabled
filters.

References
==========

None.

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Newton
     - Introduced
   * - Ocata
     - Re-proposed