Cross-cell resize¶
https://blueprints.launchpad.net/nova/+spec/cross-cell-resize
Expand resize (cold migration) support across multiple cells.
Problem description¶
Multi-cell support was added to the controller services (API, conductor, scheduler) in the Pike release. However, server move operations, like resize, are restricted to the cell in which the instance currently lives. Since it is common for deployments to shard cells by hardware types, and therefore flavors isolated to that hardware, the inability to resize across cells is problematic when a deployment wants to move workloads off the old hardware (flavors) and onto new hardware.
Use Cases¶
As a large multi-cell deployment which shards cells by hardware generation, I want to decommission old hardware in older cells and have new and existing servers move to newer cells running newer hardware using newer flavors without users having to destroy and recreate their workloads.
As a user, I want my servers to retain their IPs, volumes and UUID while being migrated to another cell.
Proposed change¶
This is a complicated change, which the proof of concept patch [1] shows. As such, I will break down this section into sub-sections to cover the various aspects in what needs to be implemented and why.
Keep in mind that at a high level, this is mostly a large data migration from one cell to another.
This spec attempts to provide a high level design based on prototypes using both a shelve-based approach and, after initial spec review [2], an approach modeled closer to the traditional resize flow. This version of the spec focuses on the latter approach (without directly calling shelve methods). There will be unforeseen issues that will arise during implementation so the spec tries to not get into too low a level of implementation details and instead focuses on the general steps needed and known issues. Open questions are called out as necessary.
Why resize?¶
We are doing resize because cells can be sharded by flavors and resize is the only non-admin way (by default) for users to opt into migrating from one cell with an old flavor (gen1) to a new flavor (gen2) in a new cell.
Users, by naturally resizing their servers every once in a while, will help with the migration. If not, the admins can also give them “new” flavors and tell them they need to resize their servers by a certain date. By having this be doable by a user-controlled process, the users can help with the migration without knowing anything about cells underneath.
Terms¶
Common terms used throughout the spec.
Source cell: this is the cell in which the instance “lives” when the resize is initiated.
Target cell: this is the cell in which the instance moves during a cross-cell resize.
Resized instance: an instance with status
VERIFY_RESIZE
.Super conductor: in a split-MQ deployment, the super conductor is running at the “top” and has access to the API database and thus can communicate with the cells over RPC and directly to the cell databases.
Hard delete: most tables in the cell database schema, like
instances
, use the SoftDeleteMixin which means when the corresponding object is destroyed, the record’sdeleted
column is updated to a non-0 value which takes it out of most queries by default, like when listing servers in the API. This is commonly referred to as a “soft delete” of the record since it is still in the table and will not be actually deleted until thenova-manage db archive_deleted_rows
command is run. Note that this concept of a “soft deleted” record is not the same as a server with statusSOFT_DELETED
in the API, which is a server that is marked for deletion but has not yet been reaped by the nova-compute service. Hard deleting cell database records is necessary in a cross-cell resize to avoid DBDuplicateEntry failures due to unique constraint violations because of soft deleted records in a cell database to which an instance is being resized.
Assumptions¶
There is a lack of connectivity, e.g. SSH access, between compute hosts in different cells on which regular resize relies.
The image service (glance), persistent volume storage (cinder) and tenant networks (neutron) span cells.
Goals¶
Minimal changes to the overall resize flow as seen from both an external (API user, notification consumer) and internal (nova developer) perspective.
Maintain the ability to easily rollback to the source cell in case the resize fails.
Policy rule¶
A new policy rule compute:servers:resize:cross_cell
will be added. It will
default to !
which means no users are allowed. This is both backward
compatible and flexible so that operators can determine which users in their
cloud are allowed to perform a cross-cell resize. For example, it probably
makes sense for operators to allow only system-level admins or test engineers
to perform a cross-cell resize initially.
Resize flow¶
This describes the flow of a resize up until the point that the server
goes to VERIFY_RESIZE
status.
API¶
The API will check if the user is allowed, by the new policy rule, to perform
a cross-cell resize and if the nova-compute
service on the source host
is new enough to support the cross-cell resize flow. If so, the API will
modify the RequestSpec to tell the scheduler to not restrict hosts to the
source cell, but the source cell will be “preferred” by default.
The 2.56 microversion allows users with the admin role to specify a target host during a cold migration. Currently, the API validates that the target host exists which will only work for hosts in the same cell in which the instance lives (because the request context is targeted to that cell). If the request is allowed to perform a cross-cell resize then we will adjust the target host check to allow for other cells as well.
Scheduler¶
A new CrossCellWeigher
will be introduced which will prefer hosts from the
source cell by default. A configurable multiplier will be added to control the
weight in case an operator wants to prefer cross cell migrations. This weigher
will be a noop for all non-cross-cell move operations.
Note that once the scheduler picks a primary selected host, all alternate hosts come from the same cell.
(Super)Conductor¶
The role of conductor will be to synchronously orchestrate the resize between the two cells. Given the assumption that computes in different cells do not have SSH access to each other, the traditional resize flow of transferring disks over SSH will not work.
The MigrationTask
will check the selected destinations from the scheduler
to see if they are in another cell and if so, call off to a new set of
conductor tasks to orchestrate the cross-cell resize. Conductor will set
Migration.cross_cell_move=True
which will be used in the API to control
confirm/revert logic.
A new CrossCellMigrationTask
will orchestrate the following sub-tasks which
are meant to mimic the traditional resize flow and will leverage new compute
service methods.
Target DB Setup
Before we can perform any checks in the destination host, we have to first populate the target cell database with the instance and its related data, e.g. block device mappings, network info cache, instance actions, etc.
Note
After this point, if anything fails the conductor task will hard delete the instance and its related records from the target cell DB so the resize can be attempted again once the issue is resolved in the target cell.
In order to hide the target cell instance from the API when listing servers,
the instance in the target cell will be created with a hidden=True
field
which will be used to filter out these types of instances from the API.
Remember that at this point, the instance mapping in the API points at the
source cell, so GET /servers/{server_id}
would still only show details
about the instance in the source cell. We use the new hidden
field to
prevent leaking out the wrong instance to GET /servers/detail
. We may also
do this for the related migrations
table record to avoid returning multiple
instances of the same migration record to GET /os-migrations
(coincidentally the migrations
table already has an unused hidden
column).
Prep Resize at Dest
Conductor will make a synchronous RPC call (using long_rpc_timeout
) to a
new method prep_snapshot_based_resize_at_dest
on the dest compute service
which will:
Call
ResourceTracker.resize_claim()
on the potential dest host in the target cell to claim resources prior to starting the resize. Note that VCPU, MEMORY_MB and DISK_GB resources will actually be claimed (allocated) via placement during scheduling, but we need to make theresize_claim()
for NUMA/PCI resources which are not yet modeled in placement, and in order to create theMigrationContext
record.Verify the selected target host to ensure ports and volumes will work. This validation will include creating port bindings on the target host and ensuring volume attachments can be connected to the host.
If either of these steps fail, the target host will be rejected. At that point,
the conductor task will loop through alternate hosts looking for one that
works. If the migration fails at this point (runs out of hosts), then the
migration status changes to error
and the instance status goes back to
its previous state (either ACTIVE
or ERROR
).
Copy the instance.migration_context
from the target DB to the source DB.
This is necessary for the API to route network-vif-plugged
events later
when spawning the guest in the target cell.
Prep Resize at Source
Conductor will make a synchronous RPC call (using long_rpc_timeout
) to a
new method prep_snapshot_based_resize_at_source
on the source compute
service which will behave very similar to how shelve works, but also coincides
with how the resize_instance
method works during a traditional resize:
Power off the instance.
For non-volume-backed instances, create and upload a snapshot image of the root disk. Like shelve, this snapshot image will be used temporarily during the resize and upon successful completion will be deleted. The old/new image_ref will be stored in the migration_context.
Destroy the guest on the hypervisor but retain disks, i.e. call
self.driver.destroy(..., destroy_disks=False)
. This is necessary to disconnect volumes and unplug VIFs from the source host, and is actually very similar to themigrate_disk_and_power_off
method called on the source host during a normal resize. Note that we do not free up tracked resources on the source host at this point nor change the instance host/node values in the database in case we revert or need to recover from a failed migration.Delete old volume attachments and update the BlockDeviceMapping records with new placeholder volume attachments which will be used on the dest host.
Open question: at this point we may want to activate port bindings for the dest host, but that may not be necessary (that is not done as part of
resize_instance
on the source host during traditional resize today). If the ports are bound to the dest host and the migration fails, trying to recover the instance in the source cell via rebuild may not work (see bug 1659062) so maybe port binding should be delayed, or we have to be careful about rolling those back to the source host.
If the migration fails at this point, any snapshot image created should be
deleted. Recovering the guest on the source host should be as simple as
hard rebooting the server (which is allowed with servers in ERROR
status).
Finish Resize at Dest
At this point we are going to switch over to the dest host in the target cell so we need to make sure any DB updates required from the source cell to the target cell are made, for example, task_state, power_state, availability_zone values, instance action events, etc
Conductor will make a synchronous RPC call (using long_rpc_timeout
) to a
new method finish_snapshot_based_resize_at_dest
on the dest compute service
which will behave very similar to how unshelve works, but also coincides with
how the finish_resize
method works during a traditional resize:
Apply the migration context and update the instance record for the new flavor and host/node information.
Update port bindings / PCI mappings for the dest host.
Prepare block devices (attach volumes).
Spawn the guest on the hypervisor which will connect volumes and plug VIFs. The new flavor will be used and if a snapshot image was previously created for a non-volume-backed instance, that image will be used for the root disk. At this point, the virt driver should wait for the
network-vif-plugged
event to be routed from the API before continuing.Delete the temporary snapshot image (if one was created). This is similar to how unshelve works where the shelved snapshot image is deleted. At this point deleting the snapshot image is OK since the guest is spawned on the dest host and in the event of a revert or recovery needed on the source, the source disk is still on the source host.
Mark the instance as resized.
Back in conductor, we need to:
Mark the target cell instance record as
hidden=False
so it will show up when listing servers. Note that because of how the API filters duplicate instance records, even if the user is listing servers at this exact moment only one copy of the instance will be returned.Update the instance mapping to point at the target cell. This is so that the confirm/revert actions will be performed on the resized instance in the target cell rather than the destroyed guest in the source cell. Note that we could do this before finishing the resize on the dest host, but it makes sense to defer this until the instance is successfully resized in the dest host because if that fails, we want to be able to rebuild in the source cell to recover the instance.
Mark the source cell instance record as
hidden=True
to hide it from the user when listing servers.
Confirm flow¶
When confirming a resized server, if the Migration.cross_cell_move
value
is True, the API will:
RPC call to the source compute to cleanup disks similar to the
driver.confirm_migration
method and drop the move claim (free up tracked resource usage for the source node).Delete migration-based resource allocations against the source compute node resource provider (this can happen in the source compute or the API).
Hard delete the instance and its related records from the source cell database.
Update the
Migration.status
toconfirmed
in the target cell DB.Drop the migration context on the instance in the target cell DB.
Change the instance vm_state to
ACTIVE
orSTOPPED
based on its current power_state in the target cell DB (the user may have manually powered on the guest to verify it before confirming the resize).
Revert flow¶
Similar to the confirm flow, a cross-cell revert resize will be identified
via the Migration.cross_cell_move
field in the API. If True, the API will
RPC cast to a new conductor method revert_cross_cell_resize
which will
execute a new CrossCellRevertResizeTask
. That task will:
Update the instance and its related records in the source cell database based on the contents of the target cell database. This is especially important for things like:
BDMs because you can attach/detach volumes to/from a resized server.
The
REVERT_RESIZE
instance action record created by the API in the target cell. That is needed to track events during the revert in the source cell compute.
Thankfully the API does not allow attaching/detaching ports or changing server tags on a resized server so we do not need to copy those back across to the source cell database.
Mark the source cell DB instance as
hidden=False
to show it from the API while listing servers as we revert.Update the instance mapping to point at the source cell. This needs to happen before spawning in the source cell so that the
network-vif-plugged
event from neutron is routed properly.Mark the target cell DB instance as
hidden=True
to hide it from the API while listing servers as we revert.RPC call the dest compute to terminate the instance (destroy the guest, disconnect volumes and ports, free up tracked resources).
Hard delete the instance and its related records from the target cell database.
Update the
Migration.status
toreverted
in the source cell DB.RPC call the source compute to revert the migration context, apply the old flavor and original image, attach volumes and update port bindings, power on the guest (like in
driver.finish_revert_migration
) and swap source node allocations held by the migration record in placement to the instance record.Note that an alternative to keeping the source disk during resize is to use the snapshot image during revert and just spawn from that (rather than power on from the retained disk). However, that means needing to potentially download the snapshot image back to the source host and ensure the snapshot image is cleaned up for both confirm and revert rather than just at the end of the resize. It would also complicate the ability to recover the guest on the source host by simply hard rebooting it in case the resize fails.
Limitations¶
The _poll_unconfirmed_resizes periodic task, which can be configured to automatically confirm pending resizes on the target host, will not support cross-cell resizes because doing so would require an up-call to the API to confirm the resize and cleanup the source cell database. Orchestrating automatic cross-cell resize confirm could be a new periodic task written in the conductor service as a future enhancement.
Known issues¶
Rather than conductor making synchronous RPC calls during the resize with the
long_rpc_timeout
configuration option, a new option could be added specifically for cross-cell (snapshot-based) resize operations. Given a snapshot of a large disk could take a long time to upload (or download) it might be better to add new options for controlling those timeouts. For the initial version of this feature we will re-uselong_rpc_timeout
and we can add more granular options in the future if necessary.One semantic difference in the API will be different events under the instance actions records during a resize, since the events are created via the
wrap_instance_event
decorator on the compute methods, and when using new methods with new names there will be new events compared to a normal resize. This could maybe be countered by passing a specific name to the decorator rather than just use the function name as it does today. Given there are no API guarantees about the events that show up under an action record, and this has always been internal details that leak out of the API, we will not try to overwrite the new function/event names, e.g. recording acompute_prep_resize
event when calling theprep_snapshot_based_resize_at_dest
method.
Servers created with personality files, commonly known as file injection, that are resized across cells will lose the personality files since they are not persisted in the database. There are two ways to view this. First is that a traditional resize will preserve a config drive with the personality files in it, so this would be a regression from that behavior since the config drive is going to get rebuilt on the destination host during a cross cell resize. On the other hand, servers with personality files that are resized today but do not have a config drive already lose their personality files during the migration because the files are not persisted and therefore even if they get metadata in the guest from the metadata API, they will not get the personality files used during server create (or the last rebuild). Similarly, servers that are evacuated, even if they had a config drive, will lose the personality files during the evacuation since the config drive is rebuilt on the destination host. It is also worth noting that the use of personality files is deprecated.
Edge cases¶
If the user deletes a server in
VERIFY_RESIZE
status, the API confirms the resize to clean up the source host before deleting the server from the dest host [3]. This code will need to take into account a cross-cell resize and cleanup appropriately (cleanup the source host and delete records from the source cell).When routing network events in the API, if the instance has a migration context it will lookup the migration record based on id rather than uuid which may be wrong if the migration context was created in a different cell database where the id primary key on the migration record is different. It is not clear if this will be a problem but it can be dealt with in a few ways:
Store the migration.uuid on the migration context and lookup the migration record using the uuid rather than the id.
When copying the migration context from the target cell DB to the source cell DB, update the
MigrationContext.migration_id
to match theMigration.id
of the source cell migration record.
It is possible to attach/detach volumes to/from a resized server. Because of this, mirroring those block device mapping changes from the target cell DB to the source cell DB during revert adds complication but it is manageable [4]. The ability to do this to resized servers is not well known and arguably may not be officially supported to preserve any volumes attached during the revert, but because that is what works today we should try and support it for cross-cell resize.
Alternatives¶
Lift and shift¶
Users (or cloud operators) could force existing servers to be snapshot, destroyed and then re-created from snapshot with a new flavor in a new cell. It is assumed that deployments already have some kind of tooling like this for moving resources across sites or regions. While normal resize is already disruptive to running workloads, this alternative is especially problematic if specific volumes and ports are attached, i.e. the IP(s) and server UUID would change. In addition, it would require all multi-cell deployments to orchestrate their own cross-cell migration tooling.
Shelve orchestration¶
An alternative design to this spec is found in the PoC [1] and initial version of this spec [2]. That approach opted to try and re-use the existing shelve and unshelve functions to:
Snapshot and shelve offload out of the source cell.
Unshelve from snapshot in the target cell.
On revert, shelve offload from the target cell and then unshelve in the source cell.
The API, scheduler and database manipulation logic was similar except since shelve was used, the instance was offloaded from the source cell which could complicate getting the server back to the original source on revert and require rescheduling to a different host in the source cell.
In addition, that approach resulted in new task states and notifications related to shelve which would not be found in a normal resize, which could be confusing, and complicated the logic in the shelve/unshelve code since it had to deal with resize conditions.
Comparing what is proposed in this spec versus the shelve approach:
Pros:
Arguably cleaner with new methods to control task states and notificiations; no complicated dual-purpose logic to shelve handling a resize, i.e. do not repeat the evacuate/rebuild debt.
The source instance is mostly untouched which should make revert and recover simpler.
Cons:
Lots of new code, some of which is heavily duplicated with shelve/unshelve.
Long-term it should be better to try for a hybrid approach (what is in this spec) to have new compute methods to control notifications and task states to closer match a traditional resize flow, but mix in shelve/unshelve style operations, e.g. snapshot, guest destroy/spawn.
Data model impact¶
A
cross_cell_move
boolean column, which defaults to False, will be added to themigrations
cell DB table and related versioned object.A
hidden
boolean column, which defaults to False, will be added to theinstances
cell DB table and related versioned object.
REST API impact¶
There will be no explicit request/response schema changes to the REST API.
Normal resize semantics like maintaining the same task state transition and
keeping the instance either ACTIVE
or SHUTDOWN
at the end will remain
intact.
While the instance is resized and contains records in both cells, the API will
have to take care to filter out duplicate instance and migration records while
listing those across cells (using the hidden
field).
Security impact¶
As described in the Policy rule section, a new policy rule will be added to control which users can perform a cross-cell resize.
Notifications impact¶
Similar to task state transitions in the API, notifications should remain
the same as much as possible. For example, the Prep Resize at Dest phase
should emit the existing instance.resize_prep.start/end
notifications.
The Prep Resize at Source phase should emit the existing
instance.resize.start/end/error
notifications.
The bigger impact will be to deployments that have a notification queue per cell because the notifications will stop from one cell and start in another, or be intermixed during the resize itself (prep at dest is in target cell while prep at source is in source cell). It is not clear what impact this could have on notification consumers like ceilometer though.
If desired, new versioned notifications (or fields to existing notifications) could be added to denote a cross-cell resize is being performed, either as part of this blueprint or as a future enhancement.
Other end user impact¶
As mentioned above, instance action events and versioned notification behavior may be different.
Performance Impact¶
Clearly a cross-cell resize will perform less well than a normal resize given the database coordination involved and the need to snapshot an image-backed instance out of the source cell and download the snapshot image in the target cell.
Also, deployments which enable this feature may need to scale out their
conductor workers which will be doing a lot of the orchestration work
rather than inter-compute coordination like a normal resize. Similarly, the
rpc_conn_pool_size
may need to be increased because of the synchronous
RPC calls involved.
Other deployer impact¶
Deployers will be able to control who can perform a cross-cell resize in their cloud and also be able to tune parameters used during the resize, like the RPC timeout.
Developer impact¶
A new can_connect_volume
compute driver interface will be added with
the following signature:
def can_connect_volume(self, context, connection_info, instance):
That will be used during the validation step to ensure volumes attached to the instance can connect to the destination host in the target cell. The code itself will be relatively minor and just involve parts of an existing volume attach/detach operation for the driver.
Upgrade impact¶
There are three major upgrade considerations to support this feature.
RPC: given the RPC interface changes to the compute and conductor services, those services will naturally need to be upgraded before a cross-cell resize can be performed.
Cinder: because of the validation relying on volume attachments, cinder will need to be running at least Queens level code with the 3.44 microversion available.
Neutron: because of the validation relying on port bindings, neutron will need to be running at least Rocky level code with the
Port Bindings Extended
API extension enabled.
Implementation¶
Assignee(s)¶
- Primary assignee:
Matt Riedemann <mriedem.os@gmail.com> (irc: mriedem)
- Other contributors:
None
Feature Liaison¶
- Feature liaison:
mriedem
Work Items¶
At a high level this is the proposed series of changes that need to be made in order, although realistically some of the control plane changes could be made in any order as long as the cold migrate task change comes at the end.
DB model changes (
migrations.cross_cell_move
,instances.hidden
).Various versioned objects changes for tracking a cross-cell move in the RequestSpec, looking up a Migration by UUID, creating InstanceAction and InstanceActionEvent records from existing data, etc.
Scheduler changes to select destination hosts from multiple cells during a cross-cell move and weighing them so the “source” cell is preferred by default.
Possible changes to the
MigrationContext
object for new fields likeold_image_ref
,new_image_ref
,old_flavor
,new_flavor
,old_vm_state
(this will depend on implementation).nova-compute RPC interface changes for the prep/validate at dest, prep at source, and finish resize at source operations.
Adding new conductor tasks for orchestrating a cross-cell resize including reverting a resize.
API plumbing changes to handle confirming/reverting a cross-cell resize.
Add the new policy rule and make the existing resize flow use it to tell the scheduler whether or not target hosts can come from another cell, and if the target host is from another cell, to run the new cross-cell resize conductor task to orchestrate the resize rather than the traditional compute-orchestrated flow (where the source and target nova-compute services SSH and RPC between each other).
Dependencies¶
None
Testing¶
The existing functional tests in the PoC change should give a good idea of the types of wrinkles that need to be tested. Several obvious tests include:
Resize both image-backed and volume-backed servers.
Ensure allocations in the placement service, and resource reporting from the
os-hypervisors
API, are accurate at all points of the resize, i.e. while the server is inVERIFY_RESIZE
status, after it is confirmed and reverted.Ensure volume attachments and port bindings are managed properly, i.e. no resources are leaked.
Tags, both on the server and associated with virtual devices (volumes and ports) survive across the resize to the target cell.
Volumes attached/detached to/from a server in
VERIFY_RESIZE
status are managed properly in the case of resize confirm/revert.During a resize, resources which span cells, like the server and its related migration, are not listed with duplicates out of the API.
Perform a resize with at-capacity computes, meaning that when we revert we can only fit the instance with the old flavor back onto the source host in the source cell.
Ensure start/end events/notifications are aligned with a normal same-cell resize.
Resize from both an active and stopped server and assert the original status is retained after confirming and reverting the resize.
Delete a resized server and assert resources and DB records are properly cleaned up from both the source and target cell.
Test a failure scenario where the server is recovered via rebuild in the source cell.
Unit tests will be added for the various units of changes leading up to the end of the series where the functional tests cover the integrated flows. Negative/error/rollback scenarios will also be covered with unit tests and functional tests as appropriate.
Since there are no direct API changes, Tempest testing does not really fit this change. However, something we should really have, and arguably should have had since Pike, is a multi-cell CI job. Details on how a multi-cell CI job can be created though is unclear given the need for it to either integrate with legacy devstack-gate tooling or, if possible, new zuul v3 tooling.
Documentation Impact¶
The compute admin resize guide will be updated to document cross-cell resize in detail from an operations perspective, including troubleshooting and fault recovery details.
The compute configuration guide will be updated for the new policy rule and any configuration options added.
The compute server concepts guide may also need to be updated for any user-facing changes to note, like the state transitions of a server during a cross-cell resize.
References¶
Stein PTG discussions:
Mailing list discussions:
http://lists.openstack.org/pipermail/openstack-dev/2018-August/thread.html#133693
http://lists.openstack.org/pipermail/openstack-operators/2018-August/thread.html#15729
Code:
https://review.openstack.org/#/q/topic:bp/cross-cell-resize+(status:open+OR+status:merged)
History¶
Release Name |
Description |
---|---|
Stein |
Introduced |
Train |
Re-proposed and added the known issue for personality files and details hard deleting the instance and its related records from a cell DB. |
Ussuri |
Re-proposed |