Allow abort live migrations in queued status

https://blueprints.launchpad.net/nova/+spec/abort-live-migration-in-queued-status

This blueprint adds support to allow abort live migrations in queued status.

Problem description

The functionality of abort live migration was added in microversion 2.24 [1], and currently only migrations in running status are allowed to be aborted.

There is a config option max_concurrent_live_migrations that can be used to control the max number of concurrent live migrations, the default value is 1. When the number of live migration requests could be greater than the max concurrent live migration configuration, there will be migrations wait in queue. The migrations could remain in queued status for a very long time depend on the queue length and the processing speed.

Admins may want to abort migrations in queue due to time consumption considerations etc. It will be unreasonable to make admins wait until the status turn to running before they can be aborted.

Use Cases

Migrations could be stuck in queued status for a very long time because of the migration queue length and processing speed. Admins may want to abort migrations in queue due to time consumption considerations etc.

Proposed change

The whole change will be divided into two steps:

Step1 - Fix the problem of lack of queue

In the current implementation, the code that serializes the live migrations on compute node uses a python semaphore, the value of the semaphore is set to be CONF.max_concurrent_live_migrations, each incoming migration will try to acquire this semaphore, if the acquire succeed, the value of the semaphore will decrease by one, and the status of the migration will turn to status other than queued. When the value decreased to 0, new incoming migrations will be blocked(migration status will be queued) until some of the previous migration was finished(succeed, failed or aborted) and releases the semaphore.

According to the above mentioned implementation, it is unable to abort a migration in queued status as there is actually no QUEUE, so we are not able to control the migrations blocked by the semaphore.

This spec will propose a design that can achieve the above mentioned goal:

  • Using ThreadPoolExecutor from concurrent.futures lib instead of the current eventlet.spawn_n() + python Semaphore implementation. The size of the Thread Pool will be limited by CONF.max_concurrent_live_migrations. When a live migration request comes in, we submit the _do_live_migration calls to the pool, and it will return a Future object, we will use that later. If the pool is full, the new request will be blocked and kept in queued status.

  • Add a new _waiting_live_migration variable to the ComputeManager class of the compute node, this will be a dict, and will be initialized as an empty dict. We will:

    1. Record the connection between migration_uuid and the Future object when the thread is created in previous step, we will use migration_uuid as key and Future object as value in our dict.

    2. Remove the corresponding key/value the first thing if the thread successfully acquired the executor and enter _do_live_migration() [2]. In this way, we will have a queue-like thing to store Futures and make it possible to get them by migration_uuid.

Step2 - Allow abort live migrations in queued status

After the modification proposed in step 1, we will be able to get threads blocked by migration_uuid and then we can abort them:

  • First check whether the provided migration_uuid is in the _waiting_live_migration dict or not, if it is not in, then it will be in status other than queued, we can switch to the workflow as is today.

  • If the provided migration_uuid is in _waiting_live_migration dict then get the corresponding Future object and call cancel() method of the ThreadPoolExecutor.

  • If the cancel call succeed, we perform roll back and clean ups for the migration in queued status. The cancel call will return False if the provided Future object is currently executing, which means the provided thread is no longer blocked, so we can switch to the workflow of abort migration in running status as is today.

  • Add an API microversion to DELETE /servers/{id}/migrations/{migration_id} API to allow abort live migration in queued status. If the microversion of the request is equal or beyond the newly added microversion, API will check the instance.host's nova-compute service version before making RPC call and make sure it is new enough for the new support, if not, API will still return 400 as today.

  • The rpcapi interface will be modified to take migration object as parameter thus we can make decision whether we can send rpc calls depend on target compute version and migration status, we will still send migration.id in the rpc call.

  • We will also add a cleanup to the pool when the compute manager is shutting down. This part will be a trial-and-error during the implementation as there are still some details to be figure out. The principle here is that we don’t want to block the shutdown of the service on queued migrations, so we want to set those migrations to cancelled status, cancel() the queue Future so the pool shutdown does not block on it. The steps during cleanup_host are:

    1. Shutdown the pool so we don’t get new requests

    2. For any queued migrations, set the migration status to cancelled

    3. Cancel the future using Future.cancel()

    Step 2 and 3 might be interchangeable, we will find out the best order in implementation.

Alternatives

None

Data model impact

None

REST API impact

The proposal would add API microversion to DELETE /servers/{id}/migrations/{migration_id} API to allow abort live migration in queued status. When request with API microversion larger than the newly added microversion, the response will change from HTTP 400 BadRequest to HTTP 202 Accepted if the status of requested live migration is in queued status.

Security impact

None

Notifications impact

None

Other end user impact

Python-novaclient will be modified to handle the new microversion to allow abort live migrations in queued status.

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

Compute API will still return 400 for trying abort a migration in queued state if the compute service that the instance is running on is too old.

Implementation

Assignee(s)

Primary assignee:

Zhenyu Zheng

Work Items

  • Convert compute manager to queue migrations with threads/futures

  • Create a new API microversion to allow abort live migrations in queued status.

  • Modify the rpcapi interface to take migration object as parameter thus we can make decision whether we can send rpc calls depend on target compute version and migration status.

  • Modify the python-novaclient to handle the new microversion.

Dependencies

None

Testing

Would need new in-tree functional and unit tests.

Documentation Impact

Docs needed for new API microversion and usage.

References

History

Revisions

Release Name

Description

Rocky

Proposed