Right now the cinder-volume service can run only in Active/Passive HA fashion.
One of the reasons for this is that we have no concept of a cluster of nodes that handle the same storage back-end.
This spec introduces the concept of a cluster to Cinder and aims to provide a way for Cinder’s API and Scheduler nodes to distribute jobs to Volume nodes in a High Availability deployment with an Active/Active configuration.
Right now the cinder-volume service only accepts Active/Passive High Availability configurations, and the current job distribution mechanism does not allow grouping multiple services in a cluster where jobs can be queued to be processed by any of those nodes.
Jobs are currently distributed using a topic based message queue that is identified by the volume_topic, scheduler_topic, or backup_topic prefix joined with the host name and, on a multibackend node, the backend name, as in cinder-volume.localhost@lvm. This is the mechanism used to send jobs to the Volume nodes regardless of the physical address of the node that is going to handle the job, which allows an easier transition on failover.
The chosen solution must be backward compatible and must also allow the new Active/Active configuration to distribute jobs effectively.
In the Active/Active configuration there can be multiple Volume services, with different host configuration values, that interchangeably accept jobs for the same storage backend. Having multiple services is not mandatory at all times, as failures may leave us with only 1 active service.
Operators who have hard requirements, because of SLAs or for other reasons, to keep their cloud operational at all times, or who have higher throughput requirements, will want to be able to configure their deployments with an Active/Active configuration.
To provide a mechanism that will allow us to distribute jobs to a group of nodes we’ll add a new cluster configuration option that will uniquely identify a group of Volume nodes that share the same storage backends and therefore can accept jobs for the same volumes interchangeably.
This new configuration option, unlike the host option, will be allowed to have the same value on multiple volume nodes. The only requirement is that all nodes that share the same value must also share the same storage backends and the same configuration.
By default the cluster configuration option will be undefined. When a string value is given, a new topic queue will be created on the message broker to distribute jobs meant for that cluster, in the form cinder-volume.cluster@backend, similar to the already existing host topic queue cinder-volume.host@backend.
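The naming scheme for the two queues can be sketched as follows. This is a minimal illustration, not Cinder's implementation: the helper names (`host_topic`, `cluster_topic`, `topics_to_listen_on`) are hypothetical, and the `cinder-volume` prefix is taken from the examples in this spec.

```python
from typing import List, Optional

# Prefix used in this spec's examples; assumed to be the value of volume_topic.
VOLUME_TOPIC = "cinder-volume"


def _join(name: str, backend: Optional[str]) -> str:
    """Append '@backend' when a backend name is given (multibackend node)."""
    return f"{name}@{backend}" if backend else name


def host_topic(host: str, backend: Optional[str] = None) -> str:
    """Per-service topic, e.g. cinder-volume.localhost@lvm."""
    return f"{VOLUME_TOPIC}.{_join(host, backend)}"


def cluster_topic(cluster: Optional[str],
                  backend: Optional[str] = None) -> Optional[str]:
    """Shared cluster topic, or None when the cluster option is undefined."""
    if not cluster:
        return None
    return f"{VOLUME_TOPIC}.{_join(cluster, backend)}"


def topics_to_listen_on(host: str, cluster: Optional[str],
                        backend: Optional[str] = None) -> List[str]:
    """A service always keeps its host topic and adds the cluster topic
    only when it has been configured as part of a cluster."""
    topics = [host_topic(host, backend)]
    shared = cluster_topic(cluster, backend)
    if shared:
        topics.append(shared)
    return topics
```

A service with no cluster value would listen only on its host topic, which is what keeps the scheme backward compatible.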
It is important to note that the cluster configuration option is not a replacement for the host option: both will coexist within the service, and both must be set for Active/Active configurations.
To determine the topic queue where an RPC caller has to send operations, we’ll add a cluster_name field to every resource DB table that currently has the host field we are using for non Active/Active configurations. This way we don’t need to check the DB, or keep a cache in memory, to figure out which cluster a service belongs to, if it is in a cluster at all.
Once the basic mechanism of receiving RPC calls on the cluster topic queue is in place, operations will be incrementally moved to support Active-Active if the resource is in a cluster, as indicated by the presence of a value in the cluster_name resource field.
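The routing decision described above can be illustrated with a short sketch. The `Resource` class, the `rpc_topic` helper, and the `supports_active_active` flag are hypothetical names used only for illustration. The flag stands in for whether a given operation has already been moved to support Active/Active, and the `@backend` suffix stored in the fields is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """Stand-in for a DB resource (e.g. a volume) with routing fields."""
    host: str                            # e.g. "node1@lvm"
    cluster_name: Optional[str] = None   # e.g. "cluster1@lvm"; None if not clustered


def rpc_topic(resource: Resource,
              supports_active_active: bool = True,
              prefix: str = "cinder-volume") -> str:
    """Pick the topic queue for an operation on this resource.

    The cluster queue is used only when the resource belongs to a cluster
    and the operation has been migrated; otherwise the host queue is used.
    """
    if resource.cluster_name and supports_active_active:
        return f"{prefix}.{resource.cluster_name}"
    return f"{prefix}.{resource.host}"
```

Because the decision only reads fields already present on the resource, the caller needs no extra DB lookup or in-memory cache, and operations that have not been migrated yet keep their current behavior.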
The reason behind this progressive approach, instead of an all or nothing approach, is to reduce the possibility of adding new bugs and to facilitate quick fixes by just reverting a specific patch.
This solution makes a clear distinction between independent services and those that belong to a cluster, and the same can be said about resources belonging to a cluster.
To facilitate the inclusion of a service in a cluster, the volume manager will detect when the cluster value has changed from being undefined to having a value and proceed to include all existing resources in the cluster by filling the cluster_name fields.
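As a rough sketch of that inclusion step, assume resources are plain dictionaries and that a resource belongs to the service when its host field matches exactly. Both are simplifications, and `include_resources_in_cluster` is a hypothetical name, not an existing Cinder function.

```python
from typing import Dict, List, Optional


def include_resources_in_cluster(resources: List[Dict[str, Optional[str]]],
                                 host: str,
                                 previous_cluster: Optional[str],
                                 cluster: Optional[str]) -> int:
    """Fill cluster_name on this service's resources when the cluster
    option goes from undefined to a value. Returns the number updated."""
    # Only act on the undefined -> defined transition.
    if previous_cluster is not None or not cluster:
        return 0
    updated = 0
    for resource in resources:
        if resource["host"] == host and resource.get("cluster_name") is None:
            resource["cluster_name"] = cluster
            updated += 1
    return updated
```

In a real deployment this would more likely be a single bulk UPDATE on each affected table rather than a Python loop.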
Having both message queues, one for the cluster and one for the service, could prove useful in the future if we want to add operations that can target specific services within a cluster.
With Active/Passive configurations a storage backend service is down whenever we don’t have a valid heartbeat from the service and is up if we do. These heartbeats are reported in the services table in the DB.
On Active/Active configurations a service is down if there is no valid heartbeat from any of the services that constitute the cluster, and it is up if there is at least one valid heartbeat.
Services will keep reporting their heartbeats in the same way that they are doing it now, and it will be Scheduler’s job to separate between individual and clustered services and aggregate the latter by cluster name.
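The aggregation rule can be sketched in a few lines. This is only an illustration under stated assumptions: services are represented as dictionaries with `host`, `cluster_name`, and `updated_at` keys, a heartbeat is considered valid for 60 seconds, and `aggregate_status` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumed validity window for a heartbeat; the real value is configurable.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def aggregate_status(services, now, down_time=SERVICE_DOWN_TIME):
    """Return {(kind, name): "up" | "down"} for clusters and lone services.

    A cluster is up if at least one of its services has a valid heartbeat;
    an individual service is up if its own heartbeat is valid.
    """
    is_up = {}
    for svc in services:
        if svc["cluster_name"]:
            key = ("cluster", svc["cluster_name"])
        else:
            key = ("service", svc["host"])
        alive = (now - svc["updated_at"]) <= down_time
        # One valid heartbeat is enough to mark the whole group as up.
        is_up[key] = is_up.get(key, False) or alive
    return {key: ("up" if alive else "down") for key, alive in is_up.items()}
```

Keys are tagged with `"cluster"` or `"service"` so that a cluster and a standalone host sharing a name cannot collide.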
As explained in REST API impact, the API will be able to show cluster information with the status (up or down) of each cluster, based on the services that belong to it.
This new mechanism will change the unit on which disabling operates from the service to the cluster for services that are in a cluster. This means that once all operations that go through the scheduler have been moved to support Active/Active configurations, we won’t be able to disable an individual service belonging to a cluster and we’ll have to disable the cluster itself. For non clustered services, disabling will work as usual.
Disabling a cluster will prevent schedulers from taking that cluster, and therefore all its services, into consideration during filtering and weighting, but its services will still be reachable by all operations that don’t go through the scheduler.
It stands to reason that sometimes we’ll need to drain nodes to remove them from a cluster, but this spec and its implementation will not be adding any new mechanism for that. The existing mechanism, using SIGTERM, should be used to perform a graceful shutdown of cinder volume services.
The current graceful shutdown mechanism will make sure that no new operations are received from the messaging queue while it waits for ongoing operations to complete before stopping.
It is important to remember that graceful shutdown has a timeout that will forcefully stop operations if they take longer than the configured value. The configuration option is called graceful_shutdown_timeout, goes in the [DEFAULT] section, and has a default value of 60 seconds, so it should be raised in our deployments if we think this is not long enough for our use cases.
All Volume services periodically report their capabilities to the schedulers to keep them updated with their stats, that way they can make informed decisions on where to perform operations.
As with the Service state reporting, we need to prevent concurrent access to the data structure when updating this information. Fortunately, we are storing this information in a Python dictionary on the schedulers, and since we are using an eventlet executor for the RPC server we don’t have to worry about using locks: the inherent behavior of the executor will prevent concurrent access to the dictionary. So no changes are needed there to have exclusive access to the data structure.
Although rare, with multiple volume services reporting we could have a consistency problem where different schedulers would not have the same information for a given backend.
When we had only 1 volume service reporting for each given backend this was not a situation that could happen, since received capabilities report was always the latest and all scheduler services were in sync. But now that we have multiple volume services reporting on the same backend we could receive two reports from different volume services on the same backend and they could be processed in different order on different schedulers, thus making us have different data on each scheduler.
The reason why we can’t ensure that all schedulers will have the same capabilities stored in their internal structures is that capabilities reports can be processed in a different order on different services. Order is preserved in almost all stages: volume services report in a specific order, the message broker preserves this order, and the reports are even delivered in the same order. But when each service processes them, greenthreads can execute in a different order on different scheduler services, ending up with different data on each service.
This case could probably be ignored since it’s very rare and differences would be small, but in the interest of consistency of the backend capabilities on Scheduler services, we will timestamp the capabilities on the volume services before they are sent to the scheduler, instead of doing it on the scheduler as we are doing now. And then we’ll have schedulers drop any capabilities that are older than the one in the data structure.
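To illustrate why timestamping at the source makes the outcome independent of processing order, here is a minimal sketch. The `CapabilityStore` class and its method names are hypothetical, and timestamps are plain numbers for simplicity.

```python
class CapabilityStore:
    """Scheduler-side store that keeps only the newest report per backend."""

    def __init__(self):
        # backend name -> (timestamp set by the volume service, capabilities)
        self._caps = {}

    def update(self, backend, capabilities, timestamp):
        """Store the report unless a newer one is already held.

        Returns True if the report was stored, False if it was dropped.
        """
        current = self._caps.get(backend)
        if current is not None and timestamp < current[0]:
            return False  # older than what we have: drop it
        self._caps[backend] = (timestamp, dict(capabilities))
        return True

    def get(self, backend):
        """Return the stored capabilities for a backend, or None."""
        entry = self._caps.get(backend)
        return entry[1] if entry else None
```

Two schedulers that receive the same pair of reports in opposite order end up holding the same data, because the report with the older timestamp is dropped wherever it arrives second.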
By making this change we facilitate new features related to capability reporting, like capability caching. Since capability gathering is usually an expensive operation and in Active/Active configurations we’ll have multiple volume services requesting the same capabilities with the same frequency for the same back-end, we could consider capability caching as a solution to decrease the cost of the gathering on the backend.
One alternative to the proposed job distribution would be to leave the topic queues as they are and move the job distribution logic to the scheduler.
The scheduler would receive a job and then send it to one of the volume services that belong to the same cluster and is not down.
This method has one problem, and that is that we could be sending a job to a node that is down but whose heartbeat hasn’t expired yet, or one that has gone down before getting the job from the queue. In these cases we would end up with a job that is not being processed by anyone and we would need to either wait for the node to go back up or the scheduler would need to retrieve that message from the queue and send it to another active node.
An alternative to the proposed heartbeats is that all services report using cluster@backend instead of host@backend like they are doing now; as long as we have a valid heartbeat we know that the service is up.
There are 2 reasons why I believe that sending independent heartbeats is a superior solution, even if we need to modify the DB tables:
Another alternative for the job distribution, which was the proposed solution in previous versions of this specification, was to use the host configuration option as the equivalent of the cluster grouping, together with a newly added node configuration option that would serve to identify individual nodes.
Such a solution could lead to misunderstandings by treating hosts as clusters, a problem that does not arise when the cluster concept is used directly. It also wouldn’t allow a progressive solution, since it was a one shot change, and we couldn’t send messages to individual volume services, since we only had the host message topic queue.
There is a series of patches showing the implementation of the node alternative mechanism that can serve as a more detailed explanation.
Another possibility would be to allow disabling individual services within a cluster instead of having to disable the whole cluster, and this is something we can take up after everything else is done. To do this we would use the normal host message queue on the cinder-volume service to receive the enable/disable of the cluster on the manager and that would trigger a start/stop of the cluster topic queue. But this is not trivial, as it requires us to be able to stop and start the client for the cluster topic from the cinder volume manager (it is managed at the service level) and be able to wait for a full stop before we can accept a new enable request to start the message client again.
A new clusters table will be added with the following fields:
A cluster_name field will be added to existing services, volumes, and consistencygroups tables.
Service listing will return cluster_name field when requested with the appropriate microversion.
A new clusters endpoint will be added, supporting list (detailed and summarized), show, and update operations with their respective policies.
Negligible if we implement the aggregation of the heartbeats in a SQL query using EXISTS instead of retrieving all heartbeats and doing the aggregation on the scheduler.
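A query of this kind could look like the sketch below, which uses SQLite so it is self-contained. The schema is an assumption made for illustration (a clusters table with only a name, and a services table with host, cluster_name, and an integer updated_at heartbeat); the real tables have more fields, and `cluster_status` is a hypothetical helper name.

```python
import sqlite3

# Simplified, assumed schema; not the actual Cinder tables.
SCHEMA = """
CREATE TABLE clusters (name TEXT PRIMARY KEY);
CREATE TABLE services (
    host TEXT PRIMARY KEY,
    cluster_name TEXT,
    updated_at INTEGER  -- heartbeat time, seconds since the epoch
);
"""

# The database decides, per cluster, whether any valid heartbeat exists,
# so no individual heartbeat rows are sent back to the caller.
STATUS_QUERY = """
SELECT c.name,
       EXISTS (
           SELECT 1 FROM services AS s
           WHERE s.cluster_name = c.name
             AND s.updated_at >= :cutoff
       ) AS is_up
FROM clusters AS c
ORDER BY c.name
"""


def cluster_status(conn, now, down_time=60):
    """Return {cluster name: True if at least one heartbeat is valid}."""
    rows = conn.execute(STATUS_QUERY, {"cutoff": now - down_time})
    return {name: bool(is_up) for name, is_up in rows}
```

EXISTS lets the database stop at the first valid heartbeat for each cluster, which is why the cost stays small compared with fetching every service row and aggregating in the scheduler.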
Unit tests for new API behavior.
This spec has changes to the API as well as a new configuration option that will need to be documented.