.. This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode ================================ Handling Reshaped Provider Trees ================================ https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree Virt drivers need to be able to change the structure of the provider trees they expose. When moving existing resources, existing allocations need to be moved along with the inventories. And this must be done in such a way as to avoid races where a second entity can create or remove allocations against the moving inventories. Problem description =================== Use Cases --------- * The libvirt driver currently inventories VGPU resources on the compute node provider. In order to exploit provider trees, libvirt needs to create one child provider per physical GPU and move the VGPU inventory from the compute node provider to these GPU child providers. In a live deployment where VGPU resources are already allocated to instances, the allocations need to be moved along with the inventories. * Drivers wishing to model NUMA must similarly create child providers and move inventory and allocations of several classes (processor, memory, VFs on NUMA-affined NICs, etc.) to those providers. * A driver is using a custom resource class. That resource class is added to the standard set (under a new, non-``CUSTOM_`` name). In order to use the standard name, the driver must move inventory and allocations from the old name to the new. These are just example cases that may exist now or in the future. We're describing a generic pivot system here. Proposed change =============== The overall flow is as follows. The parts in red only happen when a reshape is needed. This represents the happy path on compute startup only. .. image:: /_media/rocky/reshape-provider-tree.svg Note that, for Fast-Forward Upgrades, the ``Resource Tracker`` lane is actually the `Offline Upgrade Script`_. .. _`get_allocations_for_provider_tree()`: SchedulerReportClient.get_allocations_for_provider_tree() --------------------------------------------------------- A new SchedulerReportClient method shall be implemented:: def get_allocations_for_provider_tree(self): """Retrieve allocation records associated with all providers in the provider tree. :returns: A dict, keyed by consumer UUID, of allocation records. """ A consumer isn't always an instance (it may be a "migration" - or other things not created by Nova, in the future), so we can't just use the instance list as the consumer list. We can't get *all* allocations for associated sharing providers because some of those will belong to consumers on other hosts. So we have to discover all the consumers associated with the providers in the local tree:: for each "local" provider: GET /resource_providers/{provider.uuid}/allocations We can't use *just* those allocations because we would miss allocations for sharing providers. So we have to get all the allocations for just the consumers discovered above:: for each consumer in ^: GET /allocations/{consumer.uuid} .. note:: We will still miss data if **all** of a consumer's allocations live on sharing providers. I don't have a good way to close that hole. But that scenario won't happen in the near future, so it'll be noted as a limitation via a code comment. Return a dict, keyed by the ``{consumer.uuid}``, of the resulting allocation records. 
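For illustration, a minimal sketch of this two-pass discovery follows. The
real method will live on ``SchedulerReportClient`` and reuse its existing
session plumbing; the standalone ``session`` argument (a requests-style
client already pointed at placement) and ``tree_provider_uuids`` are
assumptions for the sketch only::

    def get_allocations_for_provider_tree(session, tree_provider_uuids):
        # Pass 1: discover every consumer holding allocations against a
        # provider in the local tree. Sharing providers are deliberately
        # excluded; their allocations may belong to consumers on other
        # hosts.
        consumers = set()
        for rp_uuid in tree_provider_uuids:
            resp = session.get(
                '/resource_providers/%s/allocations' % rp_uuid)
            # The response's "allocations" dict is keyed by consumer UUID.
            consumers.update(resp.json()['allocations'])

        # Pass 2: fetch the *complete* allocation record for each such
        # consumer, which also picks up its allocations against sharing
        # providers.
        return {uuid: session.get('/allocations/%s' % uuid).json()
                for uuid in consumers}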
The returned dict is the form of the new `Allocations Parameter`_ expected
by `update_provider_tree()`_ and `update_from_provider_tree()`_.

ReshapeNeeded exception
-----------------------

A new exception, ``ReshapeNeeded``, will be introduced. It is used as a
signal from `update_provider_tree()`_ to indicate that a reshape must be
performed. This is for performance reasons, so that we don't call
`get_allocations_for_provider_tree()`_ unless it's necessary.

.. _`update_provider_tree()`:

Changes to update_provider_tree()
---------------------------------

Allocations Parameter
~~~~~~~~~~~~~~~~~~~~~

A new ``allocations`` keyword argument will be added to
``update_provider_tree()``::

    def update_provider_tree(self, provider_tree, nodename, allocations=None):

If ``None``, the ``update_provider_tree()`` method must not perform a
reshape. If it decides a reshape is necessary, it must raise the new
``ReshapeNeeded`` exception.

When not ``None``, the ``allocations`` argument is a dict, keyed by consumer
UUID, of allocation records of the form::

    {
        $CONSUMER_UUID: {
            # NOTE: The shape of each "allocations" dict below is identical
            # to the return from GET /allocations/{consumer_uuid}...
            "allocations": {
                $RP_UUID: {
                    "generation": $RP_GEN,
                    "resources": {
                        $RESOURCE_CLASS: $AMOUNT,
                        ...
                    },
                },
                ...
            },
            "project_id": $PROJ_ID,
            "user_id": $USER_ID,
            # ...except for this, which is coming in
            # bp/add-consumer-generation
            "consumer_generation": $CONSUMER_GEN,
        },
        ...
    }

If ``update_provider_tree()`` is moving allocations, it must edit the
``allocations`` dict in place.

.. note:: I don't love the idea of the method editing the dict in place
          rather than returning a copy, but it's consistent with how we're
          handling the ``provider_tree`` arg.

Virt Drivers
~~~~~~~~~~~~

Virt drivers currently overriding ``update_provider_tree()`` will need to
change the signature to accommodate the new parameter. That work will be
done within the scope of this blueprint.

As virt drivers begin to model resources in nested providers, their
implementations will need to:

* determine whether a reshape is necessary and raise ``ReshapeNeeded`` as
  appropriate;
* perform the reshape by processing provider inventories and the specified
  allocations.

That work is outside the scope of this blueprint, though a hedged sketch
follows for illustration.
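The sketch below illustrates the libvirt VGPU case from the use cases
above. It is not a committed implementation: it assumes the existing
``ProviderTree`` helpers (``data()``, ``new_child()``,
``update_inventory()``), a single physical GPU, and purely illustrative
naming and detection logic::

    def update_provider_tree(self, provider_tree, nodename, allocations=None):
        root = provider_tree.data(nodename)
        if 'VGPU' not in root.inventory:
            return  # nothing to reshape in this sketch

        if allocations is None:
            # We can't move allocations we haven't been given: signal the
            # resource tracker to gather them and call us again.
            raise exception.ReshapeNeeded()

        # Move the VGPU inventory from the compute node provider to a new
        # child provider representing the physical GPU.
        inventory = dict(root.inventory)
        vgpu_inv = inventory.pop('VGPU')
        provider_tree.update_inventory(nodename, inventory)
        gpu_name = '%s_gpu0' % nodename  # illustrative name only
        provider_tree.new_child(gpu_name, nodename)
        provider_tree.update_inventory(gpu_name, {'VGPU': vgpu_inv})
        gpu_uuid = provider_tree.data(gpu_name).uuid

        # Edit the allocations dict in place so existing VGPU allocations
        # point at the new child provider. (A fuller version would also
        # drop any now-empty resource dicts.)
        for record in allocations.values():
            resources = record['allocations'].get(
                root.uuid, {}).get('resources', {})
            if 'VGPU' in resources:
                amount = resources.pop('VGPU')
                record['allocations'].setdefault(
                    gpu_uuid, {'resources': {}})['resources']['VGPU'] = amount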
.. _`update_from_provider_tree()`:

Changes to update_from_provider_tree()
--------------------------------------

The ``SchedulerReportClient.update_from_provider_tree()`` method is changed
to accept a new parameter ``allocations``::

    def update_from_provider_tree(self, context, new_tree, allocations):
        """Flush changes from a specified ProviderTree back to placement.

        ...
        :param allocations: A dict, keyed by consumer UUID, of allocation
                records of the form returned by
                GET /allocations/{consumer_uuid}. The dict must represent
                the comprehensive final picture of the allocations for each
                consumer therein. A value of None indicates that no reshape
                is being performed.
        ...
        """

When ``allocations`` is ``None``, the behavior of
``update_from_provider_tree()`` is as it was previously (in Queens).

.. _`Resource Tracker _update()`:

Changes to Resource Tracker _update()
-------------------------------------

The ``_update()`` method will get a new parameter, ``startup``, which is
percolated down from ``update_available_resource()``. Where
`update_provider_tree()`_ and `update_from_provider_tree()`_ are currently
invoked, the code flow will be changed to approximately::

    try:
        self.driver.update_provider_tree(prov_tree, nodename)
    except exception.ReshapeNeeded:
        if not startup:
            # Treat this like a regular exception during periodic
            raise
        LOG.info("Performing resource provider inventory and "
                 "allocation data migration during compute service "
                 "startup or FFU.")
        allocs = reportclient.get_allocations_for_provider_tree()
        self.driver.update_provider_tree(prov_tree, nodename,
                                         allocations=allocs)
    ...
    reportclient.update_from_provider_tree(context, prov_tree, allocs)

Changes to _update_available_resource_for_node()
------------------------------------------------

This is currently where all exceptions for the `Resource Tracker _update()`_
periodic task are caught, logged, and otherwise ignored. We will add a new
parameter, ``startup``, percolated down from
``update_available_resource()``, and a new ``except`` clause of the form::

    except exception.ResourceProviderUpdateFailed:
        if startup:
            # Kill the compute service.
            raise
        # Else log a useful exception reporting what happened and maybe
        # even how to fix it; and then carry on.

The purpose of this is to make exceptions in `update_from_provider_tree()`_
fatal on startup only.

Placement POST /reshaper
------------------------

In a new placement microversion, a new ``POST /reshaper`` operation will be
introduced. The payload is of the form::

    {
        "inventories": {
            $RP_UUID: {
                # This is the exact payload format for
                # PUT /resource_providers/$RP_UUID/inventories.
                # It should represent the final state of the entire set of
                # resources for this provider. In particular, omitting a
                # $RC dict will cause the inventory for that resource class
                # to be deleted if previously present.
                "inventories": {
                    $RC: { ... },
                    ...
                },
                "resource_provider_generation": $RP_GEN,
            },
            $RP_UUID: { ... },
            ...
        },
        "allocations": {
            # This is the exact payload format for POST /allocations
            $CONSUMER_UUID: {
                "project_id": $PROJ_ID,
                "user_id": $USER_ID,
                # This field is part of the consumer generation series
                # under review, not yet in the published POST /allocations
                # payload.
                "consumer_generation": $CONSUMER_GEN,
                "allocations": {
                    $RP_UUID: {
                        "resources": {
                            $RC: $AMOUNT,
                            ...
                        }
                    },
                    $RP_UUID: { ... },
                    ...
                }
            },
            $CONSUMER_UUID: { ... },
            ...
        }
    }

In a single atomic transaction, placement replaces the inventories for each
``$RP_UUID`` in the ``inventories`` dict; and replaces the allocations for
each ``$CONSUMER_UUID`` in the ``allocations`` dict.

Return values:

* ``204 No Content`` on success.
* ``409 Conflict`` on any provider or consumer generation conflict; or if a
  concurrent transaction is detected. Appropriate error codes should be
  used for at least the former so the caller can tell whether a fresh
  ``GET`` is necessary before recalculating the necessary reshapes and
  retrying the operation.
* ``400 Bad Request`` on any other failure.

Direct Interface to Placement
-----------------------------

To make the `Offline Upgrade Script`_ possible, we need to make placement
accessible by importing Python code rather than as a standalone service.
The quickest path forward is to use `wsgi-intercept`_ to allow HTTP
interactions, using the `requests`_ library, to work with only database
traffic going over the network. This allows client code to make changes to
the placement data store using the same API, but without running a
placement service.

An implementation of this, as a context manager called `PlacementDirect`_,
is merged. The context manager accepts an `oslo config`_, populated by the
caller. This allows the calling code to control how it wishes to discover
configuration settings, most importantly the database being used by
placement. This implementation provides a quick solution to the immediate
needs of offline use of `Placement POST /reshaper`_ while allowing options
for prettier solutions in the future.
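For illustration, calling code might use the context manager roughly as
follows. This sketch is based on the review linked above; the import path
and the exact client surface are assumptions and may shift as the code
settles::

    from oslo_config import cfg

    from nova.api.openstack.placement import direct  # path per the review

    # The caller builds and populates the config; the placement database
    # connection is the setting that matters most here.
    conf = cfg.ConfigOpts()

    with direct.PlacementDirect(conf) as client:
        # ``client`` behaves like a normal HTTP client for the placement
        # API, but requests are intercepted in-process: only database
        # traffic actually crosses the network.
        resp = client.get('/resource_providers')
        resp.raise_for_status()  # assuming a requests-style response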
Offline Upgrade Script
----------------------

To facilitate Fast-Forward Upgrades, we will provide a script that can
perform this reshaping while all services (except databases) are offline.
It will look like::

    nova-manage placement migrate_compute_inventory

...and operate as follows, for each nodename (one, except for ironic) on
the host:

* Spin up a SchedulerReportClient with a `Direct Interface to Placement`_.
* Retrieve a ProviderTree via
  ``SchedulerReportClient.get_provider_tree_and_ensure_root()``.
* Instantiate the appropriate virt driver.
* Perform the algorithm noted in `Resource Tracker _update()`_, as if
  ``startup`` is ``True``.

We may refer to https://review.openstack.org/#/c/501025/ for an example of
an upgrade script that requires a virt driver; a hedged sketch of the core
loop follows.
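In this sketch, the helper names (``_load_virt_driver()``, ``_node_uuid()``)
and the report client wiring are assumptions rather than settled
interfaces::

    def migrate_compute_inventory(conf, context, nodenames):
        # Mirrors the Resource Tracker _update() algorithm with
        # startup=True, but runs against PlacementDirect.
        with direct.PlacementDirect(conf) as client:
            reportclient = report.SchedulerReportClient(
                adapter=client)  # assumed wiring
            driver = _load_virt_driver(conf)  # hypothetical helper
            for nodename in nodenames:  # one nodename, except for ironic
                tree = reportclient.get_provider_tree_and_ensure_root(
                    context, _node_uuid(nodename))  # hypothetical lookup
                allocs = None
                try:
                    driver.update_provider_tree(tree, nodename)
                except exception.ReshapeNeeded:
                    allocs = (
                        reportclient.get_allocations_for_provider_tree())
                    driver.update_provider_tree(tree, nodename,
                                                allocations=allocs)
                reportclient.update_from_provider_tree(
                    context, tree, allocs)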
Alternatives
------------

Reshaper API
~~~~~~~~~~~~

Alternatives to `Placement POST /reshaper`_ were discussed in the
`mailing list thread`_, the `etherpad`_, IRC, hangout, etc. They included:

* Don't have an atomic placement operation - do the necessary operations
  one at a time from the resource tracker. Rejected due to race conditions:
  the scheduler can schedule against the moving inventories, based on
  incorrect capacity information due to the moving allocations.
* "Lock" the moving inventories - either by providing a locking API or by
  setting ``reserved = total`` - while the resource tracker does the
  reshape. Rejected because it's a hack; and because recovery from partial
  failures would be difficult.
* "Merge" forms of the new placement operation:

  * ``PATCH`` (or ``POST``) with `RFC 6902`_-style ``"operation", "path"[,
    "from", "value"]`` instructions.
  * ``PATCH`` (or ``POST``) with `RFC 7396`_ semantics. The JSON payload
    would look like a sparse version of that described in
    `Placement POST /reshaper`_, but with only changes included.

* Other payload formats for the placement operation (see the `etherpad`_).
  We chose the one we did because it reuses existing payload syntax (and
  may therefore be able to reuse code) and it provides a full specification
  of the expected end state, which is RESTy.

Direct Placement
~~~~~~~~~~~~~~~~

Alternatives to the ``wsgi-intercept`` model for the
`Direct Interface to Placement`_:

* Directly access the object methods (with some refactoring/cleanup).
  Rejected because we lose things like schema validation and microversion
  logic.
* Create cleaner, pythonic wrappers around those object methods. Rejected
  (in the short term) for the sake of expediency. We might take this
  approach longer-term as/when the demand for direct placement expands
  beyond FFU scripting.
* Use ``wsgi-intercept`` but create the pythonic wrappers outside of the
  REST layer. This is also a long-term option.

Reshaping Via update_provider_tree()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* We considered passing allocations to `update_provider_tree()`_ every
  time, but gathering the allocations will be expensive, so we needed a way
  to do it only when necessary. Enter `ReshapeNeeded exception`_.
* We considered running the check-and-reshape-if-needed algorithm on every
  periodic interval, but decided we should never need to do a reshape
  except on startup.

Data model impact
-----------------

None.

REST API impact
---------------

See `Placement POST /reshaper`_.

Security impact
---------------

None.

Notifications impact
--------------------

None.

Other end user impact
---------------------

See `Upgrade Impact`_.

Performance Impact
------------------

The new `Placement POST /reshaper`_ operation has the potential to be slow,
and to lock several tables. Its use should be restricted to reshaping
provider trees. Initially we may use the reshaper from
`update_from_provider_tree()`_ even if no reshape is being performed; but
if this is found to be problematic for performance, we can restrict it to
only reshape scenarios, which will be very rare.

Gathering allocations, particularly in large deployments, has the potential
to be heavy and slow, so we only do this at compute startup, and then only
if `update_provider_tree()`_ indicates that a reshape is necessary.

Other deployer impact
---------------------

See `Upgrade Impact`_.

Developer impact
----------------

See `Virt Drivers`_.

Upgrade impact
--------------

Live upgrades are covered. The `Resource Tracker _update()`_ flow will run
on compute start and perform the reshape as necessary. Since we do not
support skipping releases on live upgrades, any virt driver-specific
changes can be removed from one release to the next.

The `Offline Upgrade Script`_ is provided for Fast-Forward Upgrade. Since
code is run with each release's codebase for each step in the FFU, any virt
driver-specific changes can be removed from one release to the next. Note,
however, that the script must **always be run**, since only the virt
driver, running on a specific compute, can determine whether a reshape is
required for that compute. (If no reshape is necessary, the script is a
no-op.)

Implementation
==============

Assignee(s)
-----------

* `Placement POST /reshaper`_: jaypipes (SQL-fu), cdent (API plumbing)
* `Direct Interface to Placement`_: cdent
* Report client, resource tracker, virt driver parity: efried
* `Offline Upgrade Script`_: dansmith
* Reviews and general heckling: mriedem, bauzas, gibi, edleafe, alex_xu

Work Items
----------

See `Proposed change`_.

Dependencies
============

* `Consumer Generations`_
* `Nested Resource Providers - Allocation Candidates`_

Testing
=======

Functional test enhancements for everyone, including gabbi tests for
`Placement POST /reshaper`_.

Live testing in Xen (naichuans) and libvirt (bauzas) via their VGPU work.

Documentation Impact
====================

* `Placement POST /reshaper`_ (placement API reference)
* `Offline Upgrade Script`_ (`nova-manage db`_)

References
==========

* `Consumer Generations`_ spec
* `Nested Resource Providers - Allocation Candidates`_
* Placement reshaper API discussion `etherpad`_
* Upgrade concerns... `mailing list thread`_
* `RFC 6902`_ (``PATCH`` with ``json-patch+json``)
* `RFC 7396`_ (``PATCH`` with ``merge-patch+json``)
* `nova-manage db`_ migration helper docs
* `wsgi-intercept`_
* Python `requests`_
* `PlacementDirect`_ implementation
* `oslo config`_ library

.. _`Consumer Generations`: http://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/add-consumer-generation.html
.. _`Nested Resource Providers - Allocation Candidates`: http://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/nested-resource-providers-allocation-candidates.html
.. _`etherpad`: https://etherpad.openstack.org/p/placement-migrate-operations
.. _`mailing list thread`: http://lists.openstack.org/pipermail/openstack-dev/2018-May/130783.html
.. _`RFC 6902`: https://tools.ietf.org/html/rfc6902
.. _`RFC 7396`: https://tools.ietf.org/html/rfc7396
.. _`nova-manage db`: https://docs.openstack.org/nova/latest/cli/nova-manage.html#nova-database
.. _wsgi-intercept: https://pypi.org/project/wsgi_intercept/
.. _requests: http://docs.python-requests.org/
.. _PlacementDirect: https://review.openstack.org/#/c/572576/
.. _oslo config: https://docs.openstack.org/oslo.config/latest/

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Rocky
     - Introduced