Cinder Volume Active/Active support - Replication¶
As stands to reason, replication v2.1 only works in the deployment configurations that were available and supported in Cinder at the time of its design and implementation.
Now that Active-Active configurations are also supported, replication does not work properly on this newly supported configuration.
This spec extends replication v2.1 functionality to support Active-Active configurations while preserving backward compatibility for non-clustered configurations.
In replication v2.1, failover is requested on a per-backend basis, so when the API receives a failover request it redirects it to a specific volume service via an asynchronous RPC call using that service’s topic message queue. The same happens for freeze and thaw operations.
This works when there is a one-to-one relationship between volume services and storage backends, but not when the relationship is many-to-one: the failover RPC call is received by only one of the services that form the cluster for the storage backend, while the others remain oblivious to the change and keep using the replication site they were using before. As a result, some operations succeed (those handled by the service that performed the failover) and others fail (those handled by services still pointing at the site that is no longer available).
While that’s the primary issue, it’s not the only one, since we also have to track the replication status at the cluster level.
Users want to have highly available cinder services with disaster recovery using replication.
It is not enough for each feature to be available on its own, since users will want both at the same time; being able to use either Active-Active without replication, or replication without Active-Active, is insufficient.
They could probably make it work by stopping all but one of the volume services in the cluster, issuing the failover request, and bringing the other services back up once it has completed, but this would not be a clean approach to the problem.
The proposed change, at its core, is to divide the driver’s failover operation into two separate operations: one that handles the storage backend side of things, for example force-promoting volumes to primary on the secondary site, and another that makes the driver perform all subsequent operations against the secondary storage device.
As mentioned before, only one volume service will receive the failover request. By splitting the operation, the manager can ask the local driver to do the first part of the failover and, once that is done, signal all volume nodes in the cluster handling that backend that the failover has been completed and that they should start pointing to the failed-over secondary site. This solves the problem of some services not knowing that a new site should be used.
This will also require two homonymous RPC calls in the volume manager for the driver’s new methods, failover and failover_completed.
We will also add the replication information to the clusters table to track replication at the cluster level for clustered services.
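The two-step failover described above can be illustrated with a minimal, self-contained sketch. It is not Cinder code: `ToyDriver`, `ToyVolumeService`, the in-memory `cluster` list standing in for an RPC fanout, and the shape of the returned updates are all assumptions made for illustration. Only the method names `failover` and `failover_completed` come from this spec.

```python
class ToyDriver:
    """Stand-in for a volume driver; hypothetical, not a real Cinder driver."""

    def __init__(self):
        self.active_backend_id = None  # None means "using the primary site"

    def failover(self, volumes, secondary_id):
        # Step 1: backend-side work, e.g. promoting replicas on the
        # secondary site. The returned update format is an assumption.
        return [{"volume_id": v, "replication_status": "failed-over"}
                for v in volumes]

    def failover_completed(self, secondary_id):
        # Step 2: local work, pointing this driver instance at the new site.
        self.active_backend_id = secondary_id


class ToyVolumeService:
    """One volume service (manager plus driver) belonging to a cluster."""

    def __init__(self, host, cluster):
        self.host = host
        self.driver = ToyDriver()
        self.cluster = cluster
        cluster.append(self)

    def failover(self, volumes, secondary_id):
        # Only the service that receives the request runs step 1.
        updates = self.driver.failover(volumes, secondary_id)
        # Stand-in for notifying every service in the cluster over RPC.
        for service in self.cluster:
            service.failover_completed(secondary_id)
        return updates

    def failover_completed(self, secondary_id):
        self.driver.failover_completed(secondary_id)


cluster = []
services = [ToyVolumeService(f"node{i}", cluster) for i in range(3)]
updates = services[0].failover(["vol-1", "vol-2"], "site-b")
active_sites = {s.host: s.driver.active_backend_id for s in services}
```

After the call, every service in the toy cluster points at `site-b`, even though only `node0` handled the request and performed the backend-side work.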
Given the current use of the freeze and thaw operations, there doesn’t seem to be a reason to split them in the same way, so these operations will be left as they are and will only be performed by one volume service when requested.
This change will require vendors to update their drivers to support replication on Active-Active configurations, so to avoid surprises we will be preventing the volume service from starting in Active-Active configurations with replication enabled on drivers that don’t support the Active-Active mechanism.
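One way to picture the startup check described above is the sketch below. How Cinder actually detects driver support is not stated in this spec; here it is assumed, purely for illustration, that a driver counts as supporting the new mechanism if it overrides `failover_completed`. The class names, `check_can_start`, and the use of `RuntimeError` are hypothetical.

```python
class BaseDriver:
    """Hypothetical base class whose default signals 'not supported'."""

    def failover_completed(self, secondary_id):
        raise NotImplementedError()


class LegacyDriver(BaseDriver):
    """Replication-capable driver that was never updated for Active-Active."""


class ActiveActiveDriver(BaseDriver):
    def failover_completed(self, secondary_id):
        self.active_backend_id = secondary_id


def supports_aa_replication(driver):
    # Assumption: overriding failover_completed means the driver was updated.
    return type(driver).failover_completed is not BaseDriver.failover_completed


def check_can_start(driver, clustered, replication_enabled):
    """Refuse to start a clustered, replicated service on an old driver."""
    if clustered and replication_enabled and not supports_aa_replication(driver):
        raise RuntimeError(
            "%s does not support replication in Active-Active configurations"
            % type(driver).__name__)
```

Non-clustered deployments are unaffected by the check, which matches the backward compatibility goal of the spec.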
The splitting mechanism for the failover_host method is pretty straightforward; the only alternative to the proposed change would be to split the thaw and freeze operations as well.
Data model impact¶
Three new fields related to replication will be added to the clusters table. These will be the same fields we currently have in the services table and will hold the same meaning:
replication_status: String storing the replication status for the whole cluster.
active_backend_id: String storing which one of the replication sites is currently active.
frozen: Boolean reflecting whether the cluster is frozen or not.
These fields will be kept in sync between the clusters table and the services table for consistency.
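As a rough illustration of the three fields and of keeping them in sync, the sketch below models both tables with plain dataclasses. The class names, the `sync_cluster_to_services` helper, the direction of the sync, and the example values are assumptions; only the three field names come from this spec.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ReplicationFields:
    """The three fields shared by the services and clusters tables."""
    replication_status: Optional[str] = None
    active_backend_id: Optional[str] = None
    frozen: bool = False


@dataclass
class Service(ReplicationFields):
    host: str = ""


@dataclass
class Cluster(ReplicationFields):
    name: str = ""
    services: List[Service] = field(default_factory=list)


def sync_cluster_to_services(cluster):
    """Copy the cluster-level replication state onto each member service."""
    for service in cluster.services:
        for f in fields(ReplicationFields):
            setattr(service, f.name, getattr(cluster, f.name))


cluster = Cluster(name="mycluster@backend",
                  services=[Service(host="node1"), Service(host="node2")])
cluster.replication_status = "failed-over"  # illustrative value
cluster.active_backend_id = "site-b"
sync_cluster_to_services(cluster)
```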
REST API impact¶
A new action called failover, equivalent to the existing failover_host, will be added, and it will support a new cluster parameter in addition to the existing host field.
Cluster listing will accept the new replication fields as filters.
Cluster listing will return the additional replication fields.
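A request body for the new failover action might be assembled as in the sketch below. The helper name is hypothetical, the `backend_id` key is assumed to carry over from the existing failover_host action, and the rule that exactly one of `host` and `cluster` must be given is an assumption rather than something this spec states.

```python
def build_failover_body(backend_id=None, host=None, cluster=None):
    """Build a hypothetical request body for the failover action."""
    # Assumption: the request targets either one host or one cluster.
    if (host is None) == (cluster is None):
        raise ValueError("exactly one of host or cluster must be given")
    body = {"host": host} if host is not None else {"cluster": cluster}
    if backend_id is not None:
        body["backend_id"] = backend_id
    return body


clustered_body = build_failover_body(backend_id="site-b",
                                     cluster="mycluster@backend")
legacy_body = build_failover_body(host="node1@backend")
```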
Other end user impact¶
The client will return the new fields when listing clusters using the new microversion and new filters will also be available.
Failover for this microversion will accept the cluster parameter.
The new code should have no performance impact on existing deployments since it will only affect new Active-Active deployments.
Other deployer impact¶
Drivers that wish to support replication on Active-Active deployments will have to implement the failover and failover_completed methods as well as the failover_host method, since the latter is used for backward compatibility with the base replication v2.1.
The easiest way to support this with minimum code would be to implement failover and failover_completed and then create failover_host as a thin wrapper around them:

    def failover_host(self, volumes, secondary_id):
        self.failover(volumes, secondary_id)
        self.failover_completed(secondary_id)
- Primary assignee:
Gorka Eguileor (geguileo)
- Other contributors:
Change service start to use active_backend_id from the cluster or the service.
Update list REST API method to accept new filtering fields and update the view to return new information.
Update the DB model and create the migration.
Make modifications to the manager to support the new RPC calls.
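The first work item, choosing active_backend_id at service start, could look roughly like the sketch below. The function name and the use of plain dictionaries for the DB records are hypothetical, and the precedence (the cluster record wins for clustered services) is an assumption about what "from the cluster or the service" means.

```python
def startup_active_backend_id(service, cluster=None):
    """Return the backend id a starting volume service should point at."""
    # Assumption: clustered services trust the cluster record, since another
    # node may have performed a failover while this service was down.
    source = cluster if cluster is not None else service
    return source.get("active_backend_id")


stale_service = {"host": "node2", "active_backend_id": None}
cluster_record = {"name": "mycluster@backend", "active_backend_id": "site-b"}

clustered_choice = startup_active_backend_id(stale_service, cluster_record)
standalone_choice = startup_active_backend_id(stale_service)
```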
This work has no additional dependency besides the basic Active-Active mechanism being in place, which it already is.
Only unit tests will be implemented, since there is no reference driver that implements replication and can be used at the gate.
We also lack a mechanism to verify that replication is actually working.
From a documentation perspective there won’t be much to document besides the changes related to the API changes.