..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================================
Cinder Volume Active/Active support - Cleanup
=============================================================

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now the cinder-volume service can run only in Active/Passive HA fashion.

One of the reasons for this is that we have no concept of a cluster of nodes
that handle the same storage back-end, and we assume only one volume service
can access a specific storage back-end.

Given this premise, the current code handles the cleanup for failed volume
services as if no other service were working with resources from its back-end,
and that is problematic when there are other volume services working with those
resources, as is the case in an Active/Active configuration.

This spec introduces a new cleanup mechanism and modifies the current cleanup
mechanism so proper cleanup is done regardless of the Cinder configuration,
Active/Passive or Active/Active.

Problem description
===================

Current Cinder code only supports Active/Passive configurations, so the cleanup
takes that into account and cleans up resources from ongoing operations
accordingly, but that is incompatible with an Active/Active deployment.

The incompatibility comes from the fact that volume services on startup look
in the DB for resources that are in the middle of an operation and are from
their own storage back-end - detected by the ``host`` field - and proceed to
clean them up depending on the state they are in. For example, a
``downloading`` volume will be changed to ``error`` since the download was
interrupted and we cannot recover from it.

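The startup cleanup described above can be illustrated with a minimal sketch.
It is not the actual Cinder code: the ``Volume`` class, the ``CLEANUP_STATUS``
mapping and the ``cleanup_on_startup`` function are hypothetical names, and
only the ``downloading`` to ``error`` transition is taken from the text.

```python
from dataclasses import dataclass

# Hypothetical mapping of in-progress statuses to the status set on cleanup.
# Only ``downloading`` -> ``error`` comes from the spec text.
CLEANUP_STATUS = {"downloading": "error"}


@dataclass
class Volume:
    id: str
    host: str
    status: str


def cleanup_on_startup(volumes, my_host):
    """Reset every in-progress volume whose ``host`` matches this service."""
    cleaned = []
    for volume in volumes:
        if volume.host == my_host and volume.status in CLEANUP_STATUS:
            volume.status = CLEANUP_STATUS[volume.status]
            cleaned.append(volume.id)
    return cleaned


volumes = [
    Volume("vol-1", "node1@lvm", "downloading"),
    Volume("vol-2", "node1@lvm", "available"),
    Volume("vol-3", "node2@lvm", "downloading"),
]
cleaned_ids = cleanup_on_startup(volumes, "node1@lvm")
print(cleaned_ids)
```

The check relies only on ``host`` and ``status``, so it assumes that nobody
else can be working on a matching resource.
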
With the new job distribution mechanism the ``host`` field will contain the
host configuration of the volume service that created the resource, but that
resource may now be in use by another volume service from the same cluster, so
we cannot just rely on this ``host`` field for cleanup, as it may lead to
cleaning the wrong resources or skipping the ones we should be cleaning.

When we are working with an Active/Active system we cannot just clean all
resources from our storage back-end that are in an ongoing state, since they
may be legitimate ongoing jobs being handled by other volume services.

We are going to set aside for a moment how we are doing the cleanup right now
and focus on the different cleanup scenarios we have to cover. One is when a
volume service "dies" -by that we mean that it really stops working, or it is
fenced- and failover boots another volume service to replace it as if it were
the same service -having the same ``host`` and ``cluster`` configurations-, and
the other scenario is when the service dies and no other service takes its
place, or the service that takes its place shares the ``cluster`` configuration
but has a different ``host``.

Those are the cases we have to solve to be able to support Active/Active and
Active/Passive configurations with proper cleanups.

Use Cases
=========

Operators that have hard requirements, SLA or other reasons, to have their
cloud operational at all times or have higher throughput requirements will want
to have the possibility to configure their deployments with an Active/Active
configuration and have proper cleanup of resources when services die.

Proposed change
===============

Since checking the status and the ``host`` field of the resource is no longer
enough to know if it needs cleanup -because the ``host`` field will be
referring to the ``host`` configuration of the volume service that created the
resource and not the owner of the resource as explained in the
`Job Distribution`_ specs- we will create a new table to track which service
from the cluster is working on each resource.

We'll call this new table ``workers`` and it will include all resources that
are being processed with cleanable operations, and therefore would require
cleanup if the service that is doing the operation crashed.

When a cleanable job is requested by the API or any of the services -for
example a volume deletion can be requested by the API or by the c-vol service
during a migration- we will create a new row in the ``workers`` table with the
resource we are working on and who is working on it. Once the operation has
been completed -successfully or unsuccessfully- this row will be deleted to
indicate processing has concluded and a cleanup will no longer be needed if the
service dies.

We will not add rows for non-cleanable operations, or for resources that are
used in cleanable operations but won't require cleanup, as this would
significantly increase the number of DB operations and end up affecting the
performance of all operations.

These ``workers`` rows serve as *flags* for the cleanup mechanism to know it
must check that resource in case of a crash and see if it needs cleanup. Only
one cleanable operation can exist at a time for a given resource.

To ensure that both scenarios mentioned above are taken care of, we will have
cleanup code on the cinder-volume and Scheduler services.

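The lifecycle of a ``workers`` row can be sketched as follows. This is only an
illustration, not the proposed implementation: it uses an in-memory SQLite
database as a stand-in for the Cinder DB, the ``cleanable_operation`` helper
is a hypothetical name, and the uniqueness constraint on the resource is an
assumption made here to model the rule of one cleanable operation per
resource.

```python
import sqlite3
from contextlib import contextmanager

# In-memory SQLite stands in for the Cinder DB (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE workers (
        id INTEGER PRIMARY KEY,
        resource_type TEXT NOT NULL,
        resource_id TEXT NOT NULL,
        status TEXT NOT NULL,
        service_id INTEGER,
        UNIQUE (resource_type, resource_id)
    )
    """
)


@contextmanager
def cleanable_operation(resource_type, resource_id, status, service_id):
    """Flag a resource for the duration of a cleanable operation."""
    # The UNIQUE constraint rejects a second concurrent cleanable operation.
    conn.execute(
        "INSERT INTO workers (resource_type, resource_id, status, service_id)"
        " VALUES (?, ?, ?, ?)",
        (resource_type, resource_id, status, service_id),
    )
    try:
        yield
    finally:
        # The row is removed whether the operation succeeded or failed.
        conn.execute(
            "DELETE FROM workers WHERE resource_type = ? AND resource_id = ?",
            (resource_type, resource_id),
        )


def count_rows():
    return conn.execute("SELECT COUNT(*) FROM workers").fetchone()[0]


second_rejected = False
with cleanable_operation("Volume", "vol-1", "deleting", service_id=1):
    rows_during = count_rows()
    try:
        with cleanable_operation("Volume", "vol-1", "deleting", service_id=2):
            pass
    except sqlite3.IntegrityError:
        second_rejected = True
rows_after = count_rows()
```

If the service crashed inside the ``with`` block, the row would stay behind,
and that leftover row is what tells the cleanup mechanism to inspect the
resource.
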
Cinder-volume service cleanups will be similar to the ones we currently have on
startup -``init_host`` method- but with small modifications to use the
``workers`` table so services can tell which resources require cleanup because
they were left in the middle of an operation.

With this we take care of one of the scenarios, but we still have to consider
the case where no replacement volume service comes up with the same ``host``
configuration. For that we will add a mechanism on the scheduler that will take
care of requesting another volume service from the cluster, one that manages
the same back-end, to do the cleaning for the fallen service.

The cleanup mechanism implemented on the scheduler will have manual and
automatic options. The manual option will require the caller to specify which
services should be cleaned up using filters, and the automatic option will let
the scheduler decide which services should be cleaned up based on their status
and how long they have been down.

The automatic cleanup mechanism will consist of a periodic task that will
sample services that are down, with a frequency of ``service_down_time``
seconds, and will proceed to clean up resources that were left by those
services that are down after ``auto_cleanup_checks`` x ``service_down_time``
seconds have passed since the service went down.

Since we can have multiple Scheduler services and the cinder-volume service all
trying to do the cleanup simultaneously, the code needs to be able to handle
these situations.

To prevent multiple Schedulers from cleaning the same service's resources, they
will report all automatic cleanup operations requested from the cinder-volume
services to the other Scheduler services, and on service start they will ask
the other Scheduler services which services have already been cleaned.

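The timing rule of the automatic cleanup can be sketched as below. The
numeric values, the ``needs_auto_cleanup`` function and the ``down_since``
argument are assumptions for illustration; the spec only names the
``service_down_time`` and ``auto_cleanup_checks`` options, and treating the
threshold as inclusive is one possible reading of the rule.

```python
from datetime import datetime, timedelta

# Illustrative values only; they are not defaults taken from the spec.
SERVICE_DOWN_TIME = 60  # seconds
AUTO_CLEANUP_CHECKS = 3


def needs_auto_cleanup(down_since, now):
    """Return True once the service has been down long enough to clean."""
    threshold = timedelta(seconds=AUTO_CLEANUP_CHECKS * SERVICE_DOWN_TIME)
    return now - down_since >= threshold


down_since = datetime(2016, 1, 1, 12, 0, 0)
too_early = needs_auto_cleanup(down_since, down_since + timedelta(seconds=179))
due = needs_auto_cleanup(down_since, down_since + timedelta(seconds=180))
```

With these values a service is left alone for the first three sampling
periods after it goes down, which gives it time to come back before another
service cleans up its resources.
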
Additionally, to prevent cleanup concurrency issues if a cleanup is requested
on a service that is already being cleaned up, we will issue all cleanup
operations with a timestamp indicating that only ``workers`` entries from
before that time should be cleaned up. When a service starts the cleanup for a
resource it updates the entry, which prevents additional cleanup operations on
that resource.

Row deletion operations in the ``workers`` table will be real deletions in the
DB, not soft deletes like we do for other tables, because the number of
operations, and therefore of rows, will be quite high and because we will be
setting constraints on the rows that would not hold true if we had the same
resource multiple times (there are workarounds, but they don't seem to be worth
it).

Since these will be big, complex changes, we will not enable any kind of
automatic cleanup by default. It will need to be either enabled in the
configuration using the ``auto_cleanup_enabled`` option or triggered using the
manual cleanup API -using filters- or the automatic cleanup API.

It will be possible to trigger the automatic cleanup mechanism via the API even
when it is disabled, as the disabling only prevents it from being automatically
triggered.

It is important to mention that using the "reset-state" operation on any
resource will remove any existing ``workers`` table entry for that resource
from the DB.

When proceeding with a cleanup we will ensure that no other service is working
on that resource (claiming the ``workers`` entry) and that the data on the
``workers`` entry is still valid for the given resource (status matches), since
a user may have forcefully issued another action on the resource in the
meantime.

Alternatives
------------

There are multiple alternatives to the proposed change; the most appealing ones
are:

- Use Tooz with a DLM that allows Leader Election to prevent more than one
  scheduler from doing cleanup of down services.

  Downsides to this solution are considerable:

  - Increased dependency on a DLM.

  - Limiting DLM choices since now it needs to have Leader Election
    functionality.

  - We will still need to let other schedulers know when the leader does
    cleanups, because a newly elected leader will need this information to
    determine if down services have already been cleaned.

- Create ``workers`` DB entries for every operation on a resource.

  Disadvantages of this alternative are:

  - Considerable performance impact.

  - Greatly increased cleanup mechanism complexity, as we would need to mark
    all entries as being processed by the service we are going to clean (this
    has its own complexity because multiple schedulers could be requesting it,
    or a scheduler and the service itself), then see which of those resources
    would require cleanup according to the ``workers`` table, and check that no
    other service is already working on that resource because a user decided to
    do a cleanup on their own (for example a force delete on a deleting
    resource). Only if there's no other service working on the resource and the
    resource has a status that is cleanable would we do the cleanup. Doing all
    this without races is quite complicated.

Data model impact
-----------------

Create a new ``workers`` table with the following fields:

- ``id``: To uniquely identify each entry and speed up some operations
- ``created_at``: To mark when the job was started at the API
- ``updated_at``: To mark when the job was last touched (API, SCH, VOL)
- ``deleted_at``: Will not be used
- ``resource_type``: Resource type (Volume, Backup, Snapshot...)
- ``resource_id``: UUID of the resource
- ``status``: The status that should be cleaned on service failure
- ``service_id``: Service working on the resource

REST API impact
---------------

Two new admin-only API endpoints will be created, ``/workers/cleanup`` and
``/workers/auto_cleanup``.

For the ``/workers/cleanup`` endpoint we will be able to supply filtering
parameters; if no arguments are provided, cleanup will issue a clean message
for all services that are down. We can restrict which services are cleaned
using the parameters ``service_id``, ``cluster_name``, ``host``, ``binary``,
and ``disabled``.

Cleaning specific resources is also possible using the ``resource_type`` and
``resource_id`` parameters.

Cleanup cannot be triggered during a cloud upgrade, but a restarted service
will still clean up its own resources during an upgrade.

Both API endpoints will return a dictionary with 2 lists, one with services
that have been issued a cleanup request (``cleaning``) and another with
services that cannot be cleaned right now because there is no alternative
service to do the cleanup in that cluster (``unavailable``). That way the
caller can know which services will be cleaned up.

The data returned for each service in the lists are the ``id``, ``name``, and
``state`` fields.

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

Small impact on cleanable operations since we have to use the ``workers`` table
to *flag* that we are working on the resource.

Other deployer impact
---------------------

None

Developer impact
----------------

Any developer who wants to add new resources requiring cleanup, or wants to add
cleanup for a status -new or existing- of an existing resource, will have to
use the new mechanism to mark the resource as cleanable, add which states are
cleanable, and add the cleanup code.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Gorka Eguileor (geguileo)

Other contributors:
  Michal Dulko (dulek)
  Anyone is welcome to help

Work Items
----------

- Make DB changes to add the new ``workers`` table.
- Implement adding rows to ``workers`` table.
- Change ``init_host`` to use an RPC call for the cleanup.
- Modify Scheduler code to do cleanups.
- Create devref explaining requirements to add cleanup resources/statuses.

Dependencies
============

`Job Distribution`_:

- This depends on the job distribution mechanism so the cleanup can be done by
  any available service from the same cluster.

Testing
=======

Unit tests for new cleanup behavior.

Documentation Impact
====================

Document the new configuration options ``auto_cleanup_enabled`` and
``auto_cleanup_checks`` as well as the cleanup mechanism.

Document the behavior of reset-state on Active/Active deployments.

References
==========

General Description for HA A/A: https://review.openstack.org/232599

Job Distribution for HA A/A: https://review.openstack.org/327283

.. _Job Distribution: https://review.openstack.org/327283