In Pike, we realized that there is a gap in the way we were planning to handle allocations for instances during move operations. In order to avoid releasing resources on the source node in order to claim against the target node, we need a placeholder or temporary owner. Currently, we solve this by allocating against the source and destination nodes for the instance, so it looks like the instance is using resources on both at the same time. This works, but it makes it harder to determine which node owns the process of releasing the other half of this “double” allocation during success and failure scenarios. Things are further complicated by the potential for same-host migrations.
The problem we have is that currently the scheduler must know that we’re performing a move operation, and must add an allocation for the instance against the target node, leaving the existing allocation against the source node. After a successful or failed migration, one of the compute nodes must clean up the correct half of the doubled allocation to avoid the instance continuing to consume more than one spot.
As an operator, I want proper resource accounting during a move operation to make sure compute nodes don’t become overcommitted while instances are moving.
As a nova developer, I want a clear assignment of responsibilities for each allocation in placement so that the code to allocate and deallocate for an instance is simple and straightforward.
The overall proposed change is to let the migration operation itself, as identified by a migration record in the database, “own” one of the two allocations necessary to reserve resources. Instead of trying to have the instance be the consumer of two sets of allocations against the two nodes involved (source and destination), we can let the instance own one and the migration own the other.
In order to do this, the migration record must have a uuid identifier, which is the first change that needs to be made.
Once we have that, we will change the existing migration code to replace the allocation that the instance has against the source node with an identical one with the migration as the owner. Next, we allocate for the instance (new flavor if resizing) against the destination node. If the migration is successful, we can simply delete the migration-owned allocation against the source node when the operation is complete (or confirmed). Upon failure, we do the opposite, deleting the target allocation and replacing the source allocation with one owned by the instance.
The benefit here is that instead of trying to double-allocate for an instance, and then having to subtract the source node/flavor from that allocation on success, we can simply operate on allocations atomically (i.e. create/delete) as the design intends. This makes the math and mechanics the same for single and multi-host move operations, and avoids one compute node manipulating allocations against another.
There is another major issue with the code as it stands today, in the case of a single-host resize. Placement has a notion of a max_unit for a given resource, which defines the upper limit of that resource that any one consumer may allocate. If our allocation for, say, VCPU is more than half of the max_unit, then a single-host resize will end up attempting to allocate more than the max_unit for the summed-during-resize allocation and will fail. The proposed change will end up with the migration owning the original allocation and the instance owning the new one, which will work because they are separate allocations.
When we discovered this issue late in Pike, we implemented the primary alternative approach because it was less disruptive. That option is to replace the allocation for the instance against the source node with one against both the source and destination nodes during the operation. This is still an option, but in practice the math (especially for single-host moves) is vague and imprecise, and the ownership responsibility for each compute node is obscure.
Data model impact¶
The primary impact here is adding a uuid to the migration object which can be used as the consumer for the source allocation in placement.
REST API impact¶
No direct API impact
Other end user impact¶
If there are custom tools developed to read usage information out of placement, then there could be some user-visible change between pike and queens. This would be in the realm of a pike deployment showing a large intermediate allocation/usage by the instance, which won’t happen after this change unless the tool takes migration-owned allocations into account.
Other deployer impact¶
As mentioned above, deployers could see some impact here if they have written custom tools against placement for data gathering. However, it seems unlikely at this point.
Developer impact of this will be overwhelmingly positive once the initial complexity of handling the migration of the “pike way” to the “queens way”. Ownership of the allocations will be clear and concise, and the code required for cleanup after a migration will be significantly simpler.
- Primary assignee:
Add a uuid to the migration record
Migrate existing/outstanding migration records to give them a uuid
Make the compute node code able to handle either the doubled allocation strategy (pike) or the split allocation strategy (queens)
Make the scheduler create allocations using either strategy, determined by whether or not there are pike nodes in the deployment
To optimize our behavior, we need an additional API in placement to allow us to submit multiple allocations for different consumers in a single atomic operation. See https://blueprints.launchpad.net/nova/+spec/post-allocations for the related work.
Ideally we would expose the migration uuid from the os-migrations API, in case admins need to be able to correlate the instance with its migration allocation for external tooling or auditing.
As part of the fire drill at the end of pike, we now have a fairly comprehensive set of functional tests that verify the allocation behavior during a migration. These should outline the coverage we need, although the expected behavior at each point will be different. These tests could easily be duplicated and preserved for testing pike behavior, and then the original tests can be modified to verify queens behavior. Once we’re past queens we can delete the pike behavior tests when we drop that code.
Since this should ideally be invisible from the outside, no documentation impact is expected.
See the mess in pike.