Mechanism to Prevent Race Conditions¶

https://blueprints.launchpad.net/manila/+spec/eliminate-race-conditions

This spec proposes a general solution for preventing race conditions that works across services and in deployments that run multiple copies of the same service (commonly known as Active/Active HA deployments).

The focus is on keeping all state in the database and protecting changes to that state with briefly held locks. In addition, concurrent operations that are mutually exclusive should fail as early as possible with a helpful error code, which simplifies the retry logic of upper layers.

Problem description¶

Certain operations in Manila should not be allowed to proceed in parallel, because the result of one operation would prevent the other operation from completing successfully.

For example, taking a snapshot of a share cannot happen while that share is being deleted. Either the snapshot must occur first, which prevents the delete, or the delete must occur first, which prevents the snapshot. Unfortunately, not enough state is stored in the database to prevent these operations from racing with each other. In practice, two API calls can both proceed through the API service to the share manager, where an error eventually occurs and one or both operations fail mysteriously.

There are multiple scenarios like the above where undefined behavior results. This specification does not attempt to enumerate all of them because the goal is to describe a mechanism for fixing these kinds of issues rather than explicitly fixing all such issues. Generally speaking, race conditions should be treated as bugs, but up until now Manila has lacked the tools to fix these bugs reliably.

Use cases¶

Specific cases:

Two snapshot operations should not be able to occur at the same time. One must complete before the second can begin. This ensures that snapshots occur in a known order, and prevents the useless situation of having two identical snapshots.
Taking a snapshot of a share should prevent a delete of that share.
Valid changes to access rules should always be accepted regardless of the state of the existing access rules. Although rules are applied to the backend asynchronously, it’s valid to add multiple rules faster than the system can apply them and expect Manila to catch up.

General cases:

These guarantees must be enforceable using a database running in a clustered configuration such as Galera. This prevents obvious solutions such as relying on DB row-level locking.
These guarantees must be enforceable while running multiple copies of Manila services, including multiple API, scheduler, and share manager services. This prevents obvious solutions like in-process locks.
These guarantees must be enforceable while running in a distributed configuration where cooperating services are on different nodes (physical, VMs, or containers). This prevents our existing approach of using file locks. Even though network-based file locking solutions exist, they represent a single point of failure and are unacceptable in properly distributed environments.
The Manila services should be able to recover automatically and gracefully from crashes and other unplanned downtime. This means that implicit state should be avoided. Long-held locks are a form of implicit state, so they should be replaced by short-held locks combined with explicit state.

Proposed change¶

More transitional states will be added so that operations which can conflict with other operations can be explicitly detected by looking at the share state. This spec only proposes one specific new state to address the races involving snapshots, but more generally provides a framework for resolving similar races as they are discovered.

Transitions between states will always be done while holding a distributed lock, that is, a lock implemented by a distributed lock manager (DLM). Use of distributed locks ensures that all services see the same locking state even if services run on different nodes, and even across transient failures such as node failures and network partitions. The lock will be held only for the duration of the database test-and-set operation to minimize lock contention.

No locks will be held during calls from the share manager to the share driver. Mutual exclusion between driver calls will be achieved with state checks.

No locks will be held during RPC calls or casts.
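
The locking discipline described above can be sketched as follows. This is a minimal illustration, not Manila code: the `short_lock` helper, the `SHARES` dictionary, and the `InvalidShareState` exception are hypothetical names, and an in-process `threading.Lock` stands in for the distributed lock that a real deployment would obtain from a tooz coordinator.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for a DLM: one in-process lock per name.
# A real deployment would obtain the lock from a tooz coordinator instead.
_locks = {}
_locks_guard = threading.Lock()


@contextmanager
def short_lock(name):
    """Hold the named lock only for the body of the with-block."""
    with _locks_guard:
        lock = _locks.setdefault(name, threading.Lock())
    with lock:
        yield


class InvalidShareState(Exception):
    """Raised when a share is not in the state an operation requires."""


# Hypothetical stand-in for the shares table: share id -> status.
SHARES = {"share-1": "available"}


def transition(share_id, expected, new):
    """Test-and-set the share status while holding the lock briefly."""
    with short_lock("share-" + share_id):
        current = SHARES[share_id]
        if current != expected:
            raise InvalidShareState(
                "share %s is %s, expected %s" % (share_id, current, expected))
        SHARES[share_id] = new
    # The lock is released here. RPC casts and driver calls run without it.


def create_snapshot(share_id):
    """Enter the transitional state, do the work, then leave the state."""
    transition(share_id, "available", "snapshotting")
    # ... the driver call would run here with no lock held ...
    transition(share_id, "snapshotting", "available")
```

While the share is in `snapshotting`, a concurrent delete that expects `available` fails immediately in `transition` rather than racing in the share manager.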

Alternatives¶

The approach used by Cinder which relies on elaborate SQL calls to compare-and-swap fields was considered but rejected for the following reasons:

The code in Cinder can’t be shared with Manila because it relies on OVO (Oslo Versioned Objects).
Not enough people understand how it works, so it’s likely to be hard to maintain.
Cinder’s compare-and-swap approach limits the kind of state changes you can make because updating multiple tables atomically is impossible. Locks don’t suffer from this restriction.
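
For readers unfamiliar with the rejected alternative, a compare-and-swap on a single row can be sketched as a conditional UPDATE. This is a generic illustration using SQLite and a hypothetical one-table schema; it is not Cinder's actual code, and the function name is made up for this example.

```python
import sqlite3


def compare_and_swap_status(conn, share_id, expected, new):
    """Set status to `new` only if it currently equals `expected`.

    Returns True when the row was updated, False when the status had
    already been changed by another request.
    """
    cur = conn.execute(
        "UPDATE shares SET status = ? WHERE id = ? AND status = ?",
        (new, share_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical single-table schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO shares VALUES ('share-1', 'available')")
conn.commit()

first = compare_and_swap_status(conn, "share-1", "available", "snapshotting")
second = compare_and_swap_status(conn, "share-1", "available", "deleting")
```

A conditional update like this covers one row in one table. State changes that must touch several tables together are the case where this spec argues that locks are the simpler tool.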

Data model impact¶

New states will be added:

snapshotting (a transitional share state)
States for access rules, which are covered in the access rules spec (see References)

digraph share_states {
    label="Share States"

    // Transitional States
    creating[shape=hexagon];
    manage_starting[shape=hexagon];
    deleting[shape=hexagon];
    snapshotting[shape=hexagon, color=gold4, fontcolor=gold4];
    migrating[shape=hexagon];
    shrinking[shape=hexagon];
    extending[shape=hexagon];
    unmanage_starting[shape=hexagon];
    replication_change[shape=hexagon];

    // Error states
    error[color=red4, fontcolor=red4];
    shrinking_error[color=red4, fontcolor=red4];
    shrinking_possible_data_loss_error[color=red4, fontcolor=red4];
    extending_error[color=red4, fontcolor=red4];
    unmanage_error[color=red4, fontcolor=red4];
    manage_error[color=red4, fontcolor=red4];
    error_deleting[color=red4, fontcolor=red4];

    // Other states
    new[color=blue, fontcolor=blue];
    available[color=darkgreen, fontcolor=darkgreen];
    deleted[shape=box, color=navy, fontcolor=navy];
    unmanaged[shape=box, color=navy, fontcolor=navy];

    // User requested transitions
    new -> creating[label="create"];
    new -> manage_starting[label="manage"];
    available -> deleting[label="delete"];
    available -> snapshotting[label="create snapshot", color=gold4, fontcolor=gold4];
    available -> migrating[label="migrate"];
    available -> shrinking[label="shrink"];
    available -> extending[label="extend"];
    available -> unmanage_starting[label="unmanage"];
    available -> replication_change[label="add replica"];

    // Automatic transitions
    creating -> available[label="success", color=darkgreen, fontcolor=darkgreen];
    deleting -> deleted[label="success", color=darkgreen, fontcolor=darkgreen];
    snapshotting -> available[label="success", color=darkgreen, fontcolor=darkgreen];
    manage_starting -> available[label="success", color=darkgreen, fontcolor=darkgreen];
    unmanage_starting -> unmanaged[label="success", color=darkgreen, fontcolor=darkgreen];
    extending -> available[label="success", color=darkgreen, fontcolor=darkgreen];
    shrinking -> available[label="success", color=darkgreen, fontcolor=darkgreen];
    replication_change -> available[label="success", color=darkgreen, fontcolor=darkgreen];

    // Reset transitions
    error -> available[label="reset"];
    shrinking_error -> available[label="reset"];
    extending_error -> available[label="reset"];
    unmanage_error -> available[label="reset"];
    manage_error -> available[label="reset"];
    error_deleting -> available[label="reset"];

    // Error transitions
    creating -> error[label="fail", color=red4, fontcolor=red4];
    migrating -> error[label="fail", color=red4, fontcolor=red4];
    shrinking -> shrinking_error[label="fail", color=red4, fontcolor=red4];
    shrinking -> shrinking_possible_data_loss_error[label="fail", color=red4, fontcolor=red4];
    extending -> extending_error[label="fail", color=red4, fontcolor=red4];
    unmanage_starting -> unmanage_error[label="fail", color=red4, fontcolor=red4];
    manage_starting -> manage_error[label="fail", color=red4, fontcolor=red4];
    snapshotting -> error[label="fail", color=red4, fontcolor=red4];
    deleting -> error_deleting[label="fail", color=red4, fontcolor=red4];
}
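
One way to make the graph above checkable in code is a lookup table of the user-requested transitions. The sketch below copies those edges from the graph; the names `USER_TRANSITIONS`, `InvalidTransition`, and `next_state` are hypothetical and chosen only for this illustration.

```python
# User-requested transitions copied from the share state graph:
# (current state, requested action) -> transitional state.
USER_TRANSITIONS = {
    ("new", "create"): "creating",
    ("new", "manage"): "manage_starting",
    ("available", "delete"): "deleting",
    ("available", "create snapshot"): "snapshotting",
    ("available", "migrate"): "migrating",
    ("available", "shrink"): "shrinking",
    ("available", "extend"): "extending",
    ("available", "unmanage"): "unmanage_starting",
    ("available", "add replica"): "replication_change",
}


class InvalidTransition(Exception):
    """Raised when an action is not allowed from the current state."""


def next_state(current, action):
    """Return the transitional state for an action, or raise if not allowed."""
    try:
        return USER_TRANSITIONS[(current, action)]
    except KeyError:
        raise InvalidTransition(
            "cannot %s a share in state %s" % (action, current)) from None
```

Because every user-requested action except create and manage starts from `available`, a share that is already in `snapshotting` rejects a delete, which is the mutual exclusion the new state is meant to provide.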

REST API impact¶

New states will be visible through any API that shows states. New error conditions will also become possible as we detect races earlier and report them directly.

The behavioral changes related to locking will not be microversioned, as it won’t be possible or desirable to emulate the old behavior once the changes are implemented. However, in cases where new states are added, those changes will be microversioned so that clients which depend on the new states can detect that the server supports them.
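
Since races are reported early, upper layers can retry with simple logic. The following is a sketch of such a retry loop under stated assumptions: the spec does not define the error, so `ConflictError` is a hypothetical exception standing in for whatever the API returns when a share is busy, and the backoff values are arbitrary.

```python
import time


class ConflictError(Exception):
    """Hypothetical error seen by a caller when the share is busy."""


def retry_on_conflict(operation, attempts=5, delay=0.01):
    """Call `operation`, retrying with exponential backoff on conflicts."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConflictError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the conflict to the caller.
            time.sleep(delay * (2 ** attempt))


# Usage: an operation that is rejected twice before it is accepted.
attempts_seen = []


def flaky_snapshot():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConflictError("share is busy")
    return "snapshot-created"


result = retry_on_conflict(flaky_snapshot, delay=0.001)
```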

Driver impact¶

None

Security impact¶

None

Notifications impact¶

None

Other end user impact¶

None

Performance Impact¶

Distributed locking is expected to moderately slow down state changes. Adding more state changes will also slow down the operations that go through them.

Other deployer impact¶

Deployers will need to deploy and configure a suitable tooz backend. Since Manila will depend on tooz for correctness, tooz backends that fail to meet the API contract won’t be suitable.

Developer impact¶

This will be significant. Developers will need to follow the new model for all features that involve state changes. Care will also be needed with locks to avoid deadlocks. Holding locks for a very limited time helps, but if two locks are ever held at the same time, deadlock must be ruled out by establishing a lock order that every code path follows.
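
A lock order can be enforced mechanically by always acquiring locks sorted by name. The sketch below shows the idea with in-process locks standing in for distributed ones; `get_lock` and `ordered_locks` are hypothetical helpers, not existing Manila or tooz functions.

```python
import threading
from contextlib import ExitStack, contextmanager

# Hypothetical registry of named locks, standing in for a DLM.
_registry = {}
_registry_guard = threading.Lock()


def get_lock(name):
    """Return the lock registered under `name`, creating it if needed."""
    with _registry_guard:
        return _registry.setdefault(name, threading.Lock())


@contextmanager
def ordered_locks(*names):
    """Acquire the named locks in sorted order and release them in reverse.

    Because every caller uses the same order, two callers can never each
    hold one lock while waiting for the other.
    """
    with ExitStack() as stack:
        for name in sorted(set(names)):
            stack.enter_context(get_lock(name))
        yield
```

A caller that needs both a share lock and a snapshot lock would write `with ordered_locks("share-1", "snapshot-9"):` and get the same acquisition order regardless of the argument order.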

Implementation¶

Assignee(s)¶

bswartz

Work Items¶

Add snapshotting state
Complete tooz integration
Wrap state changes with tooz locks

Dependencies¶

Tooz

Testing¶

Existing tests will help ensure no regressions, but detecting race conditions requires Rally tests or similarly high-concurrency tests.
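
The shape of such a high-concurrency test can be sketched with plain threads. This is only an illustration of the assertion a real test would make: `FakeShareStore` is a hypothetical in-memory stand-in for the database plus lock, and a real test would drive the Manila API (for example through Rally) instead.

```python
import threading


class InvalidShareState(Exception):
    """Raised when the share is not available for a new operation."""


class FakeShareStore:
    """Hypothetical in-memory share record guarded by a stand-in lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.status = "available"

    def begin_snapshot(self):
        # Test-and-set under a briefly held lock.
        with self._lock:
            if self.status != "available":
                raise InvalidShareState(self.status)
            self.status = "snapshotting"


def race(store, workers=32):
    """Start `workers` threads that all request a snapshot at once."""
    barrier = threading.Barrier(workers)
    outcomes = []
    outcomes_guard = threading.Lock()

    def attempt():
        barrier.wait()  # Release all threads together to maximize contention.
        try:
            store.begin_snapshot()
            result = "accepted"
        except InvalidShareState:
            result = "rejected"
        with outcomes_guard:
            outcomes.append(result)

    threads = [threading.Thread(target=attempt) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes


outcomes = race(FakeShareStore())
```

The property to assert is that exactly one request is accepted and every other request is rejected cleanly, however many run in parallel.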

Documentation Impact¶

Admin guide - need to document tooz requirements.

Developer reference - need to document state machines and locking protocol.

References¶

Access rules spec: https://review.openstack.org/#/c/399049/

Tooz integration: https://review.openstack.org/#/c/318336/
