Allow abort live migrations in queued status¶
https://blueprints.launchpad.net/nova/+spec/abort-live-migration-in-queued-status
This blueprint adds support for aborting live migrations that are in queued
status.
Problem description¶
The ability to abort a live migration was added in microversion 2.24 [1].
Currently, only migrations in running status can be aborted.
The config option max_concurrent_live_migrations controls the maximum number
of concurrent live migrations; the default value is 1. When the number of
live migration requests exceeds this limit, the excess migrations wait in a
queue. Depending on the queue length and the processing speed, a migration
can remain in queued status for a very long time.
Admins may want to abort queued migrations, for example because they are
taking too long. It is unreasonable to make admins wait until the status
turns to running before a migration can be aborted.
Use Cases¶
Migrations can be stuck in queued status for a very long time because of the
migration queue length and processing speed. Admins may want to abort such
queued migrations, for example because they are taking too long.
Proposed change¶
The whole change will be divided into two steps:
Step 1 - Fix the lack of a queue¶
In the current implementation, the code that serializes live migrations on a
compute node uses a Python semaphore whose initial value is
CONF.max_concurrent_live_migrations. Each incoming migration tries to acquire
this semaphore. If the acquire succeeds, the value of the semaphore decreases
by one and the migration moves to a status other than queued. Once the value
has dropped to 0, new incoming migrations are blocked (their status stays
queued) until one of the earlier migrations finishes (succeeds, fails, or is
aborted) and releases the semaphore.
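The behaviour described above can be illustrated with a minimal sketch. It is not the Nova code: it uses the standard library's threading module as a stand-in for eventlet, and the names `live_migration`, `status` and `run_demo` are hypothetical. It only shows that a second migration stays queued while the single semaphore slot is held.

```python
import threading
import time

# Stand-in for CONF.max_concurrent_live_migrations (default 1).
MAX_CONCURRENT_LIVE_MIGRATIONS = 1

# One semaphore shared by every incoming live migration on this node.
_semaphore = threading.Semaphore(MAX_CONCURRENT_LIVE_MIGRATIONS)
status = {}  # hypothetical store: migration_uuid -> status string


def live_migration(migration_uuid, finish):
    status[migration_uuid] = "queued"
    # Blocks here while the semaphore value is 0. The waiting call is parked
    # inside acquire(); no other code holds a handle it could use to cancel it.
    with _semaphore:
        status[migration_uuid] = "running"
        finish.wait()  # placeholder for the actual migration work
        status[migration_uuid] = "completed"


def run_demo():
    finish = threading.Event()
    first = threading.Thread(target=live_migration, args=("m-1", finish))
    second = threading.Thread(target=live_migration, args=("m-2", finish))
    first.start()
    while status.get("m-1") != "running":
        time.sleep(0.001)
    second.start()
    while "m-2" not in status:
        time.sleep(0.001)
    snapshot = dict(status)  # taken while m-1 still holds the semaphore
    finish.set()
    first.join()
    second.join()
    return snapshot, dict(status)
```

In this sketch the snapshot shows m-1 as running and m-2 as queued, and nothing in the program can reach the blocked m-2 call to cancel it.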
With this implementation, it is not possible to abort a migration in queued
status because there is no actual queue: we have no handle on the migrations
that are blocked by the semaphore.
This spec proposes a design that achieves the above-mentioned goal:

- Use ThreadPoolExecutor from the concurrent.futures library instead of the
  current eventlet.spawn_n() + Python Semaphore implementation. The size of
  the thread pool will be limited by CONF.max_concurrent_live_migrations.
  When a live migration request comes in, we submit the _do_live_migration
  call to the pool, and it returns a Future object, which we will use later.
  If the pool is full, the new request will be blocked and kept in queued
  status.
- Add a new _waiting_live_migration variable to the ComputeManager class of
  the compute node. This will be a dict, initialized as an empty dict. We
  will:

  - Record the connection between migration_uuid and the Future object when
    the thread is created in the previous step, using migration_uuid as the
    key and the Future object as the value in our dict.
  - Remove the corresponding key/value as the first thing once the thread has
    successfully acquired the executor and entered _do_live_migration() [2].
    In this way, we will have a queue-like structure to store Futures and
    make it possible to get them by migration_uuid.
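The design above can be sketched as follows. This is an illustration under assumptions, not the proposed Nova code: `ComputeManagerSketch`, its `status` dict, the `work` callable and the lock around the dict are all hypothetical. Note that `ThreadPoolExecutor.submit()` itself returns immediately; a call submitted to a full pool waits in the executor's internal work queue, which is what keeps the migration in queued status here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ComputeManagerSketch:
    """Hypothetical stand-in for ComputeManager; not Nova code."""

    def __init__(self, max_concurrent_live_migrations=1):
        # Pool size plays the role of CONF.max_concurrent_live_migrations.
        self._executor = ThreadPoolExecutor(
            max_workers=max_concurrent_live_migrations)
        self._waiting_live_migration = {}  # migration_uuid -> Future
        self._lock = threading.Lock()      # assumption: guards the dict
        self.status = {}                   # hypothetical status store

    def live_migration(self, migration_uuid, work):
        self.status[migration_uuid] = "queued"
        # Submit and record under the lock so that the worker cannot remove
        # the entry before it has been added.
        with self._lock:
            future = self._executor.submit(
                self._do_live_migration, migration_uuid, work)
            self._waiting_live_migration[migration_uuid] = future
        return future

    def _do_live_migration(self, migration_uuid, work):
        # First thing once a worker picks the call up: it is no longer waiting.
        with self._lock:
            self._waiting_live_migration.pop(migration_uuid, None)
        self.status[migration_uuid] = "running"
        work()  # placeholder for the actual migration
        self.status[migration_uuid] = "completed"
```

With a pool of size 1, a second submitted migration stays in `_waiting_live_migration` until the first one finishes, so its Future can be looked up by migration_uuid in the meantime.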
Step 2 - Allow abort live migrations in queued status¶
After the modification proposed in step 1, we will be able to look up the
blocked threads by migration_uuid and then abort them:

- First, check whether the provided migration_uuid is in the
  _waiting_live_migration dict. If it is not, the migration is in a status
  other than queued, and we can switch to the workflow as it is today.
- If the provided migration_uuid is in the _waiting_live_migration dict, get
  the corresponding Future object and call its cancel() method.
- If the cancel call succeeds, we perform the rollback and cleanup for the
  migration in queued status. The cancel call returns False if the provided
  Future is currently executing, which means the thread is no longer blocked,
  so we can switch to the existing workflow for aborting a migration in
  running status.
- Add an API microversion to the
  DELETE /servers/{id}/migrations/{migration_id} API to allow aborting a live
  migration in queued status. If the microversion of the request is equal to
  or later than the newly added microversion, the API will check the
  nova-compute service version of instance.host before making the RPC call to
  make sure it is new enough to support this; if it is not, the API will
  still return 400 as it does today.
- The rpcapi interface will be modified to take a migration object as a
  parameter, so that we can decide whether to send the RPC call based on the
  target compute version and the migration status. We will still send
  migration.id in the RPC call.
- We will also add cleanup of the pool when the compute manager is shutting
  down. This part will involve some trial and error during implementation, as
  there are still some details to be figured out. The principle is that we
  don’t want to block the shutdown of the service on queued migrations, so we
  set those migrations to cancelled status and cancel() the queued Future so
  that the pool shutdown does not block on it. The steps during cleanup_host
  are:

  1. Shut down the pool so we don’t get new requests.
  2. For any queued migrations, set the migration status to cancelled.
  3. Cancel the future using Future.cancel().

  Steps 2 and 3 might be interchangeable; we will find out the best order
  during implementation.
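The abort and shutdown steps above can be sketched in the same style. Again, this is a sketch under assumptions rather than the Nova implementation: the class, the `status` dict, the lock, the return strings and `_abort_running` are hypothetical placeholders. In `cleanup_host` the sketch cancels each Future before setting the status, which is one of the two orderings the spec leaves open.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ComputeManagerSketch:
    """Hypothetical stand-in for ComputeManager; not Nova code."""

    def __init__(self, max_concurrent_live_migrations=1):
        self._executor = ThreadPoolExecutor(
            max_workers=max_concurrent_live_migrations)
        self._waiting_live_migration = {}  # migration_uuid -> Future
        self._lock = threading.Lock()      # assumption: guards the dict
        self.status = {}                   # hypothetical status store

    def live_migration(self, migration_uuid, work):
        self.status[migration_uuid] = "queued"
        with self._lock:
            future = self._executor.submit(
                self._do_live_migration, migration_uuid, work)
            self._waiting_live_migration[migration_uuid] = future
        return future

    def _do_live_migration(self, migration_uuid, work):
        with self._lock:
            self._waiting_live_migration.pop(migration_uuid, None)
        self.status[migration_uuid] = "running"
        work()  # placeholder for the actual migration
        self.status[migration_uuid] = "completed"

    def live_migration_abort(self, migration_uuid):
        with self._lock:
            future = self._waiting_live_migration.get(migration_uuid)
            # Future.cancel() returns True only if the call has not started.
            if future is not None and future.cancel():
                del self._waiting_live_migration[migration_uuid]
                # Placeholder for rollback and cleanup of a queued migration.
                self.status[migration_uuid] = "cancelled"
                return "aborted-queued"
        # Not waiting, or already executing: use the existing abort workflow.
        return self._abort_running(migration_uuid)

    def _abort_running(self, migration_uuid):
        # Placeholder for today's abort path for running migrations.
        return "abort-running-workflow"

    def cleanup_host(self):
        # 1. Stop accepting new requests without waiting for queued ones.
        self._executor.shutdown(wait=False)
        with self._lock:
            for uuid, future in list(self._waiting_live_migration.items()):
                # 3. then 2.: cancel first, mark cancelled only on success.
                if future.cancel():
                    self.status[uuid] = "cancelled"
                    del self._waiting_live_migration[uuid]
```

Checking the return value of `cancel()` in both paths covers the race the spec mentions, where a migration leaves the queue between the dict lookup and the cancel call.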
Alternatives¶
None
Data model impact¶
None
REST API impact¶
The proposal would add an API microversion to the
DELETE /servers/{id}/migrations/{migration_id} API to allow aborting a live
migration in queued status. For requests with an API microversion equal to or
later than the newly added microversion, the response will change from
HTTP 400 BadRequest to HTTP 202 Accepted if the requested live migration is
in queued status.
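The resulting response codes can be summarised in a small decision function. This is only a sketch: the spec does not state the new microversion number or the required nova-compute service version, so `NEW_MICROVERSION` and `MIN_COMPUTE_SERVICE_VERSION` are made-up placeholders, and `abort_response_code` is a hypothetical helper. It also assumes that any status other than running or queued keeps returning 400.

```python
# Placeholder only: the spec does not give the new microversion number.
NEW_MICROVERSION = (2, 999)
# Placeholder only: minimum nova-compute service version with queued-abort
# support; the spec does not give this value either.
MIN_COMPUTE_SERVICE_VERSION = 1000


def abort_response_code(request_microversion, migration_status,
                        compute_service_version):
    """Return the HTTP status the abort API would answer with (sketch)."""
    if migration_status == "running":
        # Existing behaviour since microversion 2.24.
        return 202
    if migration_status == "queued":
        new_enough_request = request_microversion >= NEW_MICROVERSION
        new_enough_compute = (
            compute_service_version >= MIN_COMPUTE_SERVICE_VERSION)
        if new_enough_request and new_enough_compute:
            return 202
        # Old microversion, or the instance's compute service is too old.
        return 400
    # Assumption: all other statuses are still rejected.
    return 400
```

Microversions are compared as (major, minor) tuples here, so a request at (2, 24) against a queued migration still gets 400 in this sketch.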
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
python-novaclient will be modified to handle the new microversion so that
live migrations in queued status can be aborted.
Performance Impact¶
None
Other deployer impact¶
None
Developer impact¶
None
Upgrade impact¶
The compute API will still return 400 when trying to abort a migration in queued status if the compute service that the instance is running on is too old.
Implementation¶
Assignee(s)¶
- Primary assignee: Zhenyu Zheng
Work Items¶
- Convert the compute manager to queue migrations with threads/futures.
- Create a new API microversion to allow aborting live migrations in queued
  status.
- Modify the rpcapi interface to take a migration object as a parameter, so
  that we can decide whether to send the RPC call based on the target compute
  version and the migration status.
- Modify python-novaclient to handle the new microversion.
Dependencies¶
None
Testing¶
Would need new in-tree functional and unit tests.
Documentation Impact¶
Docs needed for new API microversion and usage.
References¶
History¶
Release Name | Description
---|---
Rocky | Proposed