Cinder Volume Active/Active support - Cleanup¶
Right now the cinder-volume service can only run in an Active/Passive HA fashion.
One of the reasons for this is that we have no concept of a cluster of nodes that handle the same storage back-end, and we assume only one volume service can access a specific storage back-end.
Given this premise, the current code handles the cleanup for failed volume services as if no other service were working with resources from its back-end, which is problematic when other volume services are working with those resources, as is the case in an Active/Active configuration.
This spec introduces a new cleanup mechanism and modifies the current one so that proper cleanup is done regardless of the Cinder configuration, Active/Passive or Active/Active.
Current Cinder code only supports Active/Passive configurations, so the cleanup takes that into account and cleans up resources from ongoing operations accordingly, but that is incompatible with an Active/Active deployment.
The incompatibility comes from the fact that, on startup, volume services look in the DB for resources that are in the middle of an operation and belong to their own storage back-end -detected by the host field- and proceed to clean them up depending on the state they are in. For example, a volume whose download was interrupted will be changed to error, since we cannot recover from it.
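As a rough illustration, the current Active/Passive style cleanup on service start amounts to something like the following simplified sketch; the DB helper names and the downloading status check are illustrative, not the actual Cinder code:

    # Simplified sketch of the current host-based cleanup done on startup.
    # Helper names are illustrative; this is not the real Cinder code.
    def init_host_cleanup(context, db, my_host):
        # Only resources whose 'host' field matches this service are looked at,
        # which is exactly what breaks in an Active/Active deployment.
        for volume in db.volume_get_all_by_host(context, my_host):
            if volume['status'] == 'downloading':
                # The download was interrupted by the crash and cannot resume,
                # so the volume is moved to 'error'.
                db.volume_update(context, volume['id'], {'status': 'error'})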
With the new job distribution mechanism the host field will contain the host configuration of the volume service that created the resource, but that resource may now be in use by another volume service from the same cluster, so we cannot rely on the host field alone for cleanup, as that may lead to cleaning the wrong resources or skipping ones we should be cleaning.
When we are working with an Active/Active system we cannot just clean all resources from our storage back-end that are in an ongoing state, since they may be legitimate ongoing jobs being handled by other volume services.
We are going to forget for a moment how we do the cleanup right now and focus on the different cleanup scenarios we have to cover. One is when a volume service “dies” -by that we mean that it really stops working, or it is fenced- and failover boots another volume service to replace it as if it were the same service -having the same host and cluster configurations-. The other scenario is when the service dies and no other service takes its place, or the service that takes its place shares the cluster configuration but has a different host configuration.
Those are the cases we have to solve to be able to support Active/Active and Active/Passive configurations with proper cleanups.
Operators that have hard requirements -SLAs or other reasons- to keep their cloud operational at all times, or that have higher throughput requirements, will want the possibility of configuring their deployments in an Active/Active configuration and having proper cleanup of resources when services die.
Since checking the status and the host field of the resource is no longer enough to know if it needs cleanup -because the host field will refer to the host configuration of the volume service that created the resource and not to the current owner of the resource, as explained in the Job Distribution specs- we will create a new table to track which service from the cluster is working on each resource.
We’ll call this new table workers, and it will include all resources that are being processed by cleanable operations and would therefore require cleanup if the service doing the operation crashed.
When a cleanable job is requested by the API or any of the services -for example, a volume deletion can be requested by the API or by the c-vol service during a migration- we will create a new row in the workers table recording the resource we are working on and who is working on it. Once the operation has been completed -successfully or unsuccessfully- this row will be deleted to indicate that processing has concluded and that a cleanup will no longer be needed if the service dies.
We will not be adding rows for non-cleanable operations, nor for resources that are used in cleanable operations but will not require cleanup, as this would significantly increase DB operations and end up affecting the performance of all operations.
The workers rows serve as flags for the cleanup mechanism to know that it must check that resource in case of a crash and see if it needs cleanup. Only one cleanable operation can exist at a time for a given resource.
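A minimal sketch of how a workers row could bracket a cleanable operation, assuming hypothetical worker_create/worker_destroy DB helpers (the real interface may differ):

    # Illustrative only: flag a resource in the workers table while a cleanable
    # operation runs, and remove the flag once processing has concluded.
    def run_cleanable_operation(context, db, resource, service_id, operation):
        worker = db.worker_create(context,
                                  resource_type=type(resource).__name__,
                                  resource_id=resource.id,
                                  status=resource.status,
                                  service_id=service_id)
        try:
            return operation(context, resource)
        finally:
            # Success or failure, the operation is over, so no cleanup will be
            # needed for this resource if the service dies later on.
            db.worker_destroy(context, worker.id)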
To ensure that both scenarios mentioned above are taken care of, we will have cleanup code in both the cinder-volume and scheduler services.
Cinder-volume service cleanups will be similar to the ones we currently have in the init_host method, but with small modifications to use the workers table so services can tell which resources require cleanup because they were left in the middle of an operation. With this we take care of one of the scenarios, but we still have to consider the case where no replacement volume service comes up with the same host configuration; for that we will add a mechanism in the scheduler that will request another volume service from the cluster, one that manages the same back-end, to do the cleaning for the fallen service.
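A rough sketch of what the modified startup cleanup could look like when driven by the workers table instead of the host field; all helper names (worker_get_all, resource_get, the cleanup callback) are assumptions:

    # Sketch of a workers-driven startup cleanup for cinder-volume.
    def init_host_workers_cleanup(context, db, service_id, cleanup_resource):
        # Only resources flagged in the workers table for this service are
        # candidates, instead of everything sharing our 'host' configuration.
        for worker in db.worker_get_all(context, service_id=service_id):
            resource = db.resource_get(context, worker.resource_type,
                                       worker.resource_id)
            # Clean only if the resource still has the status recorded when the
            # operation started; otherwise someone else already acted on it.
            if resource and resource.status == worker.status:
                cleanup_resource(context, resource)
            db.worker_destroy(context, worker.id)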
The cleanup mechanism implemented in the scheduler will have manual and automatic options. The manual option will require the caller to specify, using filters, which services should be cleaned up, while the automatic option will let the scheduler decide which services should be cleaned up based on their status and how long they have been down.
The automatic cleanup mechanism will consist of a periodic task that samples services that are down, with a frequency of service_down_time seconds, and will proceed to clean up resources that were left by those services that are down, once service_down_time seconds have passed since the service went down.
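The automatic option in the scheduler could then be a periodic check along these lines; is_down, down_seconds, and request_cleanup are placeholders for whatever the final implementation uses:

    # Hypothetical sketch of the scheduler's automatic cleanup periodic task.
    def automatic_cleanup(context, services, service_down_time, request_cleanup):
        for service in services:
            # Only act on services that have been down for long enough, so a
            # service that is simply restarting is not cleaned up under itself.
            if service.is_down and service.down_seconds >= service_down_time:
                # Ask another volume service from the same cluster to clean the
                # resources the fallen service left mid-operation.
                request_cleanup(context, service)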
Since we can have multiple scheduler services and the cinder-volume service all trying to do the cleanup simultaneously, the code needs to be able to handle these situations.
On one hand, to prevent multiple schedulers from cleaning the same service's resources, they will report to the other scheduler services every automatic cleanup operation requested of the cinder-volume services, and on start they will ask the other schedulers which services have already been cleaned.
On the other hand, to prevent cleanup concurrency issues if a cleanup is requested on a service that is already being cleaned up, we will issue all cleanup operations with a timestamp indicating that only entries updated before that time should be cleaned up; when a service starts doing the cleanup for a resource it updates the entry, preventing additional cleanup operations on that resource.
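For the concurrency guard, the claim could be implemented as a conditional update on the workers row, so only one service's claim wins; a sketch using SQLAlchemy Core, where the timestamp comparison and rowcount check are the essential part (names are assumptions):

    # Illustrative claim of a workers row: the conditional UPDATE only succeeds
    # if nobody has touched the entry since the cleanup request was issued.
    from datetime import datetime, timezone

    def claim_worker(session, workers_table, worker_id, service_id, issued_at):
        result = session.execute(
            workers_table.update()
            .where(workers_table.c.id == worker_id)
            .where(workers_table.c.updated_at < issued_at)
            .values(service_id=service_id,
                    updated_at=datetime.now(timezone.utc)))
        # rowcount == 1 means we won the claim; 0 means another service (or a
        # newer operation on the resource) got there first.
        return result.rowcount == 1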
Row deletion operations in the workers table will be real deletions in the DB, not soft deletes like we use for other tables, because the number of operations, and therefore of rows, will be quite high, and because we will be setting constraints on the rows that would not hold true if we had the same resource present multiple times (there are workarounds, but they don't seem to be worth it).
Since these will be big, complex changes, we will not be enabling any kind of automatic cleanup by default; it will need to be either enabled via the auto_cleanup_enabled option or triggered using the manual cleanup API -with filters- or the automatic cleanup API.
It will be possible to trigger the automatic cleanup mechanism via the API even when it is disabled, as the disabling only prevents it from being automatically triggered.
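A sketch of how such an option could be registered with oslo.config; the name follows the text above and the default reflects "disabled by default", but both are subject to change:

    # Illustrative option definition; the final name, type and default may differ.
    from oslo_config import cfg

    cleanup_opts = [
        cfg.BoolOpt('auto_cleanup_enabled',
                    default=False,
                    help='Automatically clean up resources left behind by '
                         'volume services that are detected as down.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(cleanup_opts)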
It is important to mention that using the “reset-state” operation on any resource will remove any existing workers table entry for it in the DB.
When proceeding with a cleanup we will ensure that no other service is working on that resource (by claiming the workers entry) and that the data in the workers entry is still valid for the given resource (its status matches), since a user may have forcefully issued another action on the resource in the meantime.
There are multiple alternatives to the proposed change; the most appealing ones are:
Use Tooz with a DLM that allows Leader Election to prevent more than one scheduler from doing cleanup of down services. Downsides to this solution are considerable:
Increased dependency on a DLM.
Limiting DLM choices since now it needs to have Leader Election functionality.
We will still need to let other schedulers know when the leader does cleanups, because a newly elected leader will need this information to determine whether down services have already been cleaned.
Create workers DB entries for every operation on a resource. Disadvantages of this alternative are:
Considerable performance impact.
Greatly increased cleanup mechanism complexity, as we would need to mark all entries as being processed by the service we are going to clean (which has its own complexity, because multiple schedulers could be requesting it, or a scheduler and the service itself), then see which of those resources would require cleanup according to the workers table, check that no other service is already working on the resource because a user decided to do a cleanup on their own (for example, a force delete on a deleting resource), and only then, if the resource has a cleanable status, do the cleanup. Doing all this without races is quite complicated.
Data model impact¶
Create a new workers table with the following fields:
id: To uniquely identify each entry and speed up some operations
created_at: To mark when the job was started at the API
updated_at: To mark when the job was last touched (API, SCH, VOL)
deleted_at: Will not be used
resource_type: Resource type (Volume, Backup, Snapshot…)
resource_id: UUID of the resource
status: The status that should be cleaned on service failure
service_id: service working on the resource
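A minimal SQLAlchemy sketch of the table described above; column types and the unique constraint are reasonable guesses based on existing Cinder tables, not the final migration:

    # Sketch of the proposed workers table; real deletions (no soft delete) make
    # it possible to keep each resource listed at most once via a constraint.
    from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String,
                            UniqueConstraint)
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()


    class Worker(Base):
        __tablename__ = 'workers'
        __table_args__ = (UniqueConstraint('resource_type', 'resource_id'),)

        id = Column(Integer, primary_key=True)  # unique id, speeds up lookups
        created_at = Column(DateTime)           # when the job started at the API
        updated_at = Column(DateTime)           # last touched (API, SCH, VOL)
        deleted_at = Column(DateTime)           # present but not used
        resource_type = Column(String(36))      # Volume, Backup, Snapshot...
        resource_id = Column(String(36))        # UUID of the resource
        status = Column(String(255))            # status to clean on failure
        service_id = Column(Integer, ForeignKey('services.id'))  # worker service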
REST API impact¶
Two new admin-only API endpoints will be created, one for the manual cleanup and one for the automatic cleanup. In the /workers/cleanup endpoint we will be able to supply filtering parameters; if no arguments are provided, cleanup will issue a clean message for all services that are down, but we can restrict which services we want to be cleaned using parameters such as service_id, cluster_name, host, and binary.
Cleaning specific resources is also possible using resource_type and resource_id parameters.
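As an illustration (the values and exact request format are made up), a manual cleanup request could restrict its scope with a filter set like this:

    # Hypothetical filter set for a manual cleanup request; values are made up.
    cleanup_filters = {
        'cluster_name': 'mycluster@lvmdriver-1',
        'binary': 'cinder-volume',
        'resource_type': 'Volume',
        'resource_id': '6f8f0e84-0a1d-4f79-9d31-7b0f7a8e2f31',
    }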
Cleanup cannot be triggered during a cloud upgrade, but a restarted service will still clean up its own resources during an upgrade.
Both API endpoints will return a dictionary with 2 lists, one with services that have been issued a cleanup request (cleaning) and another list with services that cannot be cleaned right now because there is no alternative service to do the cleanup in that cluster (unavailable), that way the caller can know which services will be cleaned up.
The data returned for each service in the lists consists of the id, name, and state fields.
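To illustrate the response shape described above (ids and states are made up), the returned dictionary could look like:

    # Hypothetical response from a cleanup endpoint; values are illustrative.
    cleanup_response = {
        'cleaning': [      # services that have been issued a cleanup request
            {'id': 3, 'name': 'cinder-volume', 'state': 'down'},
        ],
        'unavailable': [   # no alternative service in the cluster to clean them
            {'id': 7, 'name': 'cinder-volume', 'state': 'down'},
        ],
    }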
Other end user impact¶
Small impact on cleanable operations, since we have to use the workers table to flag that we are working on the resource.
Other deployer impact¶
Any developer that wants to add new resources requiring cleanup, or wants to add cleanup for a status -new or existing- of an existing resource, will have to use the new mechanism to mark the resource as cleanable, add which states are cleanable, and add the cleanup code.
- Primary assignee:
Gorka Eguileor (geguileo)
- Other contributors:
Michal Dulko (dulek). Anyone is welcome to help.
Make DB changes to add the new workers table.
Implement adding rows to the workers table.
Modify init_host to use an RPC call for the cleanup.
Modify Scheduler code to do cleanups.
Create devref explaining requirements to add cleanup resources/statuses.
- Job Distribution:
This depends on the job distribution mechanism so the cleanup can be done by any available service from the same cluster.
Unittests for new cleanup behavior.
Document the new configuration option auto_cleanup_checks, as well as the cleanup mechanism.
Document behavior of reset-state on Active-Active deployment.
General Description for HA A/A: https://review.openstack.org/232599
Job Distribution for HA A/A: https://review.openstack.org/327283