Volume state enforcer¶
Concurrent resource access in cinder is a problem that has caused resource corruption when the same resource is mutated simultaneously by multiple cinder entrypoints (the api and the manager, for example). In Icehouse, locks were added around some manager functions to queue up requests when a resource is being worked on simultaneously by multiple functions (this stops one of those operations from concurrently mutating the underlying resource). Sadly this is more of a sledgehammer approach: it hides the symptoms of the problem and makes it non-obvious when debugging which other requests are queued up behind the lock (or why deadlocking is occurring, if and when it does).
To help alleviate and hopefully solve this problem we will try to attack some of these issues in a different manner: integrating an allowed state transition table into the create_volume workflow, doing strategic state transitions, and aborting/erroring out when these state transitions are not allowed. In the future this will help create a concrete set of well-defined states and transitions for other workflows as well (and will make it clear while looking at code and during debugging which transitions are allowed at the same time and what transitions are actively occurring).
A high-level description of the problem:
Concurrent resource mutation, bad (EOM).
More detailed description:
Locks in cinder are being added to protect against simultaneous resource modification. For example, in delete_volume, detach_volume... an external lock is acquired in the manager with the name volume_id, f.__name__. This has helped make the manager safer under concurrent resource access, but the initial goal was for it to be only a temporary solution to a wider problem. One of the issues with this mechanism is that it does not use a DLM (distributed lock manager) but only a local filesystem lock. This means that a cinder-api service can mutate the resource (or initiate a request to do so) while a second mutation is actively in flight. When a single manager is active this works out (since one of the in-flight requests will back up behind the external lock). This solves the problem only when a single master manager is running, yet this is an atypical deployment pattern and should not be recommended as the way to deploy and run cinder (it should be horizontally scalable so that there can be X active managers, where X is > 1). We need some other type of solution that scales horizontally but also achieves the same end goal (disallowing simultaneous resource mutation by X entities at the same time).
Since the scope of this problem is bigger (it applies to all/most operations that act on resources) we have to start somewhere, so we will start by working through how this will look for the create_volume workflow. This does raise the larger question of how this change can be done in a piecemeal fashion, since the other operations will still be lock dependent, and mixing state transition and lock acquisition techniques will likely not end in a correct solution. We will have to explore how to do this in a way that is piecemeal but also does not destabilize cinder further.
Instead of acquiring local filesystem locks in the manager processes, refactor the concept of a lock to instead be a set of allowed and disallowed state transitions (which is similar in concept to the internal mechanism that a lock uses anyway).
Let's take an abbreviated example of how this could work:
When a volume is requested to be created, a database record is created for this volume. In this database record there exists a field called status (in fact there exist multiple status fields; in the future these should maybe be removed?) that is used to report back to the user data about the status of the create volume request as it moves through the various components in cinder (api, scheduler, and manager).
This status itself has an expected transition diagram and is a starting point in determining the larger state transitions that a cinder volume create request goes through (and is allowed to go through). Instead of overriding the status field, this specification proposes to augment the data storage layer in cinder with a new resource_states table. It may be represented by something other than a table depending on where this data is stored (if zookeeper were used it would be represented as a resource tree); the only constraint that we must enforce is that we can fetch and update the given state of a resource in a single atomic operation.
A potential schema could look like the following (one row per recorded state):
7c92ee46-7a2e-4183-99c5-909f3d46a90e  CREATING_DB   2014-05-22T15
7c92ee46-7a2e-4183-99c5-909f3d46a90e  SCHEDULING    2014-05-23T15
7c92ee46-7a2e-4183-99c5-909f3d46a90e  CREATING_VOL  2014-05-23T15
7c92ee46-7a2e-4183-99c5-909f3d46a90e  NULL/None     2014-05-24T15
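One way the atomic fetch-and-update constraint could be met with a table like this is a conditional insert that only appends a new state when the latest recorded state is the one the caller expects. The sketch below uses Python's sqlite3 module purely for illustration; the column names (resource_id, state, created_at) and the record_state helper are assumptions, not part of the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column names; a NULL state marks a finished operation.
conn.execute(
    "CREATE TABLE resource_states ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " resource_id TEXT NOT NULL,"
    " state TEXT,"
    " created_at TEXT NOT NULL)"
)


def record_state(conn, resource_id, expected, new_state, when):
    """Append new_state only if the latest recorded state is `expected`.

    The check and the insert happen in a single SQL statement, so no
    other writer can slip in between them. Returns True if a row was
    written.
    """
    cur = conn.execute(
        "INSERT INTO resource_states (resource_id, state, created_at) "
        "SELECT ?, ?, ? WHERE ("
        "  SELECT state FROM resource_states WHERE resource_id = ? "
        "  ORDER BY id DESC LIMIT 1) IS ?",
        (resource_id, new_state, when, resource_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


vol = "7c92ee46-7a2e-4183-99c5-909f3d46a90e"
created = record_state(conn, vol, None, "CREATING_DB", "2014-05-22T15")
scheduled = record_state(conn, vol, "CREATING_DB", "SCHEDULING", "2014-05-23T15")
# A competing request that believes the volume is idle is refused.
rejected = record_state(conn, vol, None, "DETACHING", "2014-05-23T15")
```

A resource with no rows is treated the same as one whose latest state is NULL, which is a simplifying choice made for this sketch.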
This table structure will then be used (with NULL states delimiting when a request has completed its set of allowed state transitions) to determine at the API level (before a request has been accepted) what a resource is currently being used for. The API server can then attempt to initiate a transition to a desired state (for example, DETACHING), and depending on whether this transition is allowed (by looking at the last known state) the attempt may fail or succeed. If it succeeds, the server continues with the rest of the workflow for the desired operation (subsequent transitions will also be made in the rest of the workflow, as needed, with the final transition being a transition to NULL/None, to denote that the operation has completed). If the transition is disallowed/fails, the API request will be denied and the operation will not be allowed to make forward progress (in the future this model can be relaxed to allow for simultaneous state transitions for operations where this makes sense).
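The allow/deny decision described above can be sketched as a lookup in a transition table. The state names come from this document, but the specific edges in ALLOWED, the exception class, and the check_transition helper are illustrative assumptions; the real set of states and transitions is still to be decided.

```python
# Hypothetical transition table. None stands for "no operation in
# progress" (the NULL/None state in the table above).
ALLOWED = {
    None: {"CREATING_DB", "DETACHING"},
    "CREATING_DB": {"SCHEDULING"},
    "SCHEDULING": {"CREATING_VOL"},
    "CREATING_VOL": {None},
    "DETACHING": {None},
}


class TransitionNotAllowed(Exception):
    """Raised when a requested state change is not in the table."""


def check_transition(current, desired):
    """Return desired if current -> desired is allowed, else raise."""
    if desired not in ALLOWED.get(current, set()):
        raise TransitionNotAllowed(
            "cannot move from %r to %r" % (current, desired))
    return desired


# An idle volume may start detaching...
state = check_transition(None, "DETACHING")

# ...but a volume that is still being scheduled may not.
try:
    check_transition("SCHEDULING", "DETACHING")
    denied = False
except TransitionNotAllowed:
    denied = True
```

In a real deployment this check would have to be combined with the atomic update of the stored state, otherwise two requests could both pass the check before either writes.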
To accomplish this, the create_volume operation already uses taskflow, which has helped decompose the workflows that volume creation goes through (it also makes it possible to resume from a prior state if the process crashes). This decomposition makes it obvious (or more obvious) where the transitions should occur and what the transitions are. The proposed path is to add new nodes into the workflow that will perform & validate these state transitions (attempting to mutate the above resource state table) at a granular-enough level to be useful & meaningful (the transition table can also be useful for operators and developers attempting to determine what is happening inside cinder). When this is combined with notifications from taskflow about its own internal states, the ability to decipher what is going on internally to cinder becomes very easy & provides invaluable information to users, developers and operators using & operating cinder.
One possibility for avoiding the above resource_states table is to use a DLM, with an approach similar to the one used with file locks in cinder. Its usage would be similar to the usage of file locks, although there are scenarios at RPC boundaries where it would still require state transition validation. For example, when a lock is released and an async RPC call is made, it becomes possible for other async RPC calls to also be active at the same time, so both a state transition and a lock system would be required when the receivers of those RPC calls accept and perform the requested RPC operation.
Another possible solution that does not require state transitions is to not use async RPC calls but instead use sync RPC calls, where the sender would only release the DLM lock it owns after it has received confirmation that the receiver has started to process (or has accepted) the request. The receiver would then acquire the lock during this period when it accepts the request, ensuring that a correct lock hand-off happens between the sender and receiver. This would require a sensitive and hard-to-get-correct lock hand-off code path & process (this path would need to be tested heavily to ensure correctness).
IMHO both of these alternative methods are too fragile and do not make the state transition process and diagram obvious to developers, operators, and users. This lack of information impedes cinder adoption, and makes it more difficult to recover from (and understand) inevitable failures and operational issues.
What this does not solve¶
I would also like to include a note on what the scope of this specification does not encompass.
It does not encompass cross-project resource usage and inconsistencies related to state transitions being done by a project using cinder (for example the initiation of a detach of a volume by nova will not be aborted early in the nova API flow, but instead will be aborted later in the workflow if cinder is performing other state transitions on that resource).
It also does not stop cinder from deleting a volume underneath nova (aka a VM can be using a volume while cinder is deleting it).
These are larger cross-project consistency issues and will need to be solved at a higher level across the projects. It should be noted that once a project itself has a consistent set of states and transitions it becomes much easier to make cross-project consistency possible (without internal consistency cross-project resource usage might as well be discouraged/avoided).
Data model impact¶
See the above proposed table.
We must be careful to retain the existing API so that nova, which is dependent on cinder's currently visible states, continues to work. This just means that we need to have an exposed mapping that nova is compatible with, while we have an internal mapping which is much more detailed and consistent.
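A minimal sketch of such a two-level mapping might look like the following. The internal state names are taken from this document; which exposed status each one maps to, and the exposed_status helper itself, are assumptions made for illustration only.

```python
# Hypothetical mapping from detailed internal states to the coarser
# statuses that API consumers such as nova already understand.
EXPOSED_STATUS = {
    "CREATING_DB": "creating",
    "SCHEDULING": "creating",
    "CREATING_VOL": "creating",
    "DETACHING": "detaching",
}


def exposed_status(internal_state, idle_status="available"):
    """Translate an internal state into an externally visible status.

    When no operation is in progress (internal_state is None) the
    volume's resting status is reported unchanged.
    """
    if internal_state is None:
        return idle_status
    return EXPOSED_STATUS[internal_state]


visible = [exposed_status(s) for s in
           ("CREATING_DB", "SCHEDULING", "CREATING_VOL", None)]
```

Several internal states collapse onto one exposed status, so the internal table can grow more detailed without changing what nova observes.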
REST API impact¶
Maybe in the future.
None currently. The state transition information could also be sent out to the notification system if this is desirable in the future.
Other end user impact¶
End users should now expect more error (or try-again-later) responses when performing operations concurrently on the same set of resources. Previously some of these operations may or may not have succeeded.
A new table will be created in sqlalchemy and a new model will be created for this new schema. This table will see high read and write traffic (since all operations that occur in cinder will write data to it), so it might be advisable to alter the table type to a format that performs better for this table's limited usage. Since this table is relatively simple it should also be possible in the future (when correctness is achieved) to switch this table to some other backend that can optimize itself for small reads/writes with little history (history is not as useful, except for operators and developers who wish to interrogate what has happened to a resource in the past).
Other deployer impact¶
Developers would likely get a lot of the benefit of this information to start, since it will help them understand the states a workflow goes through (at the cinder level). Combining this with the event stream that taskflow emits creates a lot of useful runtime information that can be used while running or developing cinder (where to add new state transitions becomes more obvious when the state transitions that occur are well defined and understood).
You the person reading this?
Determine the state diagram and debate what states should be used internally to cinder (the critical must-have states) and what states are more informational (DuncanT has apparently done some of this analysis).
Create database schema migration/addition for the decided upon new schema.
Create database models for new schema (and determine and discuss on how the atomic state update will be accomplished).
Identify key locations where these state transitions will occur (before or after which taskflow tasks) or at a layer outside of taskflow.
Add new tests that trigger these new state transitions and violation checks, ensuring that what is desired to occur actually occurs.
Simultaneously work on creating a model inside of taskflow that can help other projects avoid recreating chunks of the above code for their own similar needs/use-cases.
Test like crazy.
Do load-testing/concurrency-testing (using rally or tempest) to verify the improvement has helped and not hurt cinder.
J/3 into K (this is likely not a short-term specification).
Since this change affects how cinder operates at a low level, it will require a good amount of testing to verify that concurrent operations are disallowed. Currently tempest may not be the best way to test these concurrent operations since, to my knowledge, it does not run in parallel (and only when it runs in a controlled parallel process can you find these concurrency issues). So the way to test these concurrency issues needs to be determined (is rally the way to go here, using its concurrent scenarios to probe that this feature works?).
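Independently of which tool ends up driving the tests, the property to verify is that when several requests race for the same resource, exactly one transition succeeds. The sketch below checks that property against an in-memory stand-in; the StateStore class and race helper are hypothetical, and a real test would exercise cinder's API and database instead.

```python
import threading


class StateStore:
    """In-memory stand-in for an atomically updated state table."""

    def __init__(self):
        self._states = {}
        self._guard = threading.Lock()

    def compare_and_set(self, resource_id, expected, new_state):
        # The lock makes the read-check-write sequence atomic, playing
        # the role of the database's atomic update.
        with self._guard:
            if self._states.get(resource_id) != expected:
                return False
            self._states[resource_id] = new_state
            return True


def race(store, resource_id, workers=8):
    """Have `workers` threads attempt the same transition at once."""
    barrier = threading.Barrier(workers)
    results = []

    def attempt():
        barrier.wait()  # release all threads together
        results.append(
            store.compare_and_set(resource_id, None, "DETACHING"))

    threads = [threading.Thread(target=attempt) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


outcomes = race(StateStore(), "vol-1")
winners = outcomes.count(True)
```

The barrier makes the attempts start as close together as possible, which is the "controlled parallel process" condition mentioned above.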
There may be new documentation required to explain why operations that were allowed to occur concurrently are no longer allowed to occur concurrently since this new state transition will be more strict as to what can and what can not occur at the same time.
It will also become possible to start to form documents like taskflow states that show exactly what the internals of cinder are doing and what the allowed state transitions (aka the cinder reference operation states) are.