Temporary Resource Tracking
Improve Cinder’s temporary resource tracking to prevent related quota issues.
Cinder doesn’t currently have a consistent way of tracking temporary resources, which leads to quota bugs.
In some cases temporary volumes are marked with the temporary key in the
volume admin metadata table, in other cases we determine a volume is
temporary based on its migration_status field, and there are even cases where
volumes are not being marked as temporary at all. This roundabout way of
marking temporary volumes, with multiple coexisting mechanisms, makes our
Cinder code error prone, as is clear from the number of bugs around it.
As for temporary snapshots, Cinder doesn’t currently have any way of reliably tracking them, so the code creating temporary resources assumes that everything will run smoothly and that the deletion code later in the method will be called after the operation completes successfully. That is not always true: the operation can fail and leave the temporary resource behind, forcing users to delete it manually, which throws the quota out of sync, since the REST API delete call doesn’t know it shouldn’t touch the quota.
When we say that we don’t have a reliable way of tracking snapshots we refer
to the fact that even though temporary snapshots are given names that help
identify them, such as [revert] volume %s backup snapshot and
backup-snap-%s, these are also valid snapshot names that a user can assign,
so we cannot rely on them to differentiate temporary snapshots.
There are several cases where this feature will be useful:
- Revert to snapshot is configured to use a temporary snapshot, but either the revert fails or the deletion of the temporary snapshot fails, so the user ends up manually deleting the snapshot, and wants the quota to be kept in sync with reality.
- Creating a backup of an in-use volume when backup_use_temp_snapshot is enabled fails, or the deletion of the temporary resource fails, forcing the user to manually delete the snapshot; the user wants the quota to be kept in sync with reality.
- A driver may have slow code that is run for performance reasons when cloning a volume or creating a snapshot, but that would not be reasonable to execute for temporary resources. An example is the flattening of cloned volumes in the RBD driver.
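To illustrate the third case, a driver could consult the proposed field to skip expensive post-processing for temporary resources. The following is a hypothetical Python sketch, not the RBD driver’s actual code; create_cloned_volume, clone_fn, and flatten_fn are made-up names:

```python
def create_cloned_volume(volume, clone_fn, flatten_fn):
    """Clone a volume, skipping slow post-processing for temporary ones."""
    clone = clone_fn(volume)
    # Flattening is only worth it for long-lived, user-facing volumes;
    # temporary volumes (use_quota=False) are deleted shortly after use.
    if volume.get("use_quota", True):
        flatten_fn(clone)
    return clone


flattened = []
make_clone = lambda v: f"clone-of-{v['id']}"

regular_vol = {"id": "v1", "use_quota": True}
temp_vol = {"id": "v2", "use_quota": False}

create_cloned_volume(regular_vol, make_clone, flattened.append)
create_cloned_volume(temp_vol, make_clone, flattened.append)
print(flattened)  # ['clone-of-v1']: only the regular volume was flattened
```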
The proposed solution is to have an explicit DB field that indicates whether a resource should be counted towards quota or not.
The field would be named use_quota and it would be added to the volumes and
snapshots DB tables. We currently don’t have temporary backups, so no such
field would be added to the backups DB table.
This would replace the temporary admin metadata entry and the use of the
migration_status field in 2 cycles, since we need to keep supporting
rolling upgrades where we could be running code that doesn’t know about the
new field.
An alternative solution would be to use the temporary key in the volumes’
admin metadata table, like we are already doing in some cases, and create one
such table for snapshots as well.
With that alternative DB queries could become more complex, unlike with the proposed solution, where they would become simpler.
Data model impact
A new use_quota DB field of type Boolean will be added to both the volumes
and snapshots tables.
There will be an online data migration to set the use_quota field for
existing volumes and snapshots, as well as an updated save method for the
Volume and Snapshot OVOs that sets this field whenever they are saved.
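As a rough sketch of the schema change, the new column defaults to True so that existing rows keep counting towards quota. This uses stdlib sqlite3 purely for illustration; Cinder’s real change would go through its DB migration framework:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-ins for the real ``volumes`` and ``snapshots`` tables.
conn.execute("CREATE TABLE volumes (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE snapshots (id TEXT PRIMARY KEY)")

# The schema migration itself: the same column on both tables, NOT NULL
# with a True default so existing rows keep consuming quota.
for table in ("volumes", "snapshots"):
    conn.execute(
        f"ALTER TABLE {table} ADD COLUMN use_quota BOOLEAN NOT NULL DEFAULT 1")

# New rows consume quota unless explicitly marked otherwise.
conn.execute("INSERT INTO volumes (id) VALUES ('v1')")
use_quota = conn.execute("SELECT use_quota FROM volumes").fetchone()[0]
print(use_quota)  # 1 (i.e. True)
```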
REST API impact
There won’t be any new REST API endpoint, since the use_quota field is an
internal field and we don’t want users or administrators modifying it.
But since this is useful information we will add this field to the volume’s
and snapshot’s JSON responses for all endpoints that return them, although
with a more user friendly name. Affected endpoints include:

- List detailed volumes
- List detailed snapshots
Active/Active HA impact
None, since this mostly just affects whether quota code is called or not when receiving REST API delete requests.
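For example, the REST API delete path only needs a single check to decide whether to touch the quota. The helper below is an illustrative sketch with made-up names, not Cinder’s actual quota code:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    id: str
    volume_size: int
    use_quota: bool = True  # default: the resource counts towards quota


def quota_delta_on_delete(snapshot):
    """Return the quota reservation deltas a delete should apply.

    Temporary resources (use_quota=False) never touched the quota, so the
    delete path simply skips the reservation step for them.
    """
    if not snapshot.use_quota:
        return {}
    return {"snapshots": -1, "gigabytes": -snapshot.volume_size}


print(quota_delta_on_delete(Snapshot("s1", 10)))         # {'snapshots': -1, 'gigabytes': -10}
print(quota_delta_on_delete(Snapshot("s2", 10, False)))  # {}
```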
Other end user impact
The change requires a patch on the python-cinderclient to show the new
returned field.
Performance impact
There should be no performance detriment with this change, since the field would be added at creation time and would not require additional DB queries.
Moreover, performance improvements should be possible in the future once we remove the compatibility code for the current temporary volume checks, for example no longer writing to the admin metadata table, making quota sync calculations directly on the DB, etc.
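As an example of the simpler quota sync calculations, with an explicit column the in-use totals become a single aggregate query. This is an illustrative stdlib sqlite3 sketch, not Cinder’s actual schema or query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the volumes table with the proposed column.
conn.execute(
    "CREATE TABLE volumes (id TEXT, size INT, use_quota BOOL, deleted BOOL)")
conn.executemany(
    "INSERT INTO volumes VALUES (?, ?, ?, ?)",
    [("v1", 10, 1, 0),         # regular volume: counts towards quota
     ("tmp-clone", 10, 0, 0)])  # temporary volume: excluded from quota

# One aggregate query replaces per-row admin metadata / migration_status
# checks when recalculating quota usage.
count, gigabytes = conn.execute(
    "SELECT COUNT(*), COALESCE(SUM(size), 0) FROM volumes "
    "WHERE use_quota AND NOT deleted").fetchone()
print(count, gigabytes)  # 1 10
```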
Other deployer impact
By default Volume and Snapshot OVOs will use quota on creation (use_quota set
to True), and when developers want to create temporary resources that don’t
consume quota on creation or release it on deletion they will need to pass
use_quota=False at creation time.
Also, code doing quota operations (adding or removing) will have to check this field in Volumes and Snapshots.
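The developer-facing contract could look like the following sketch, where SnapshotOVO and create_temp_snapshot are simplified stand-ins rather than Cinder’s real classes:

```python
from dataclasses import dataclass


@dataclass
class SnapshotOVO:
    """Simplified stand-in for Cinder's Snapshot OVO."""
    volume_id: str
    use_quota: bool = True  # default: the snapshot consumes quota


def create_temp_snapshot(volume_id):
    # Temporary resources must opt out of quota at creation time.
    return SnapshotOVO(volume_id, use_quota=False)


regular = SnapshotOVO("vol-1")
temp = create_temp_snapshot("vol-1")
print(regular.use_quota, temp.use_quota)  # True False
```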
It will no longer be necessary to add additional admin metadata or check the
migration_status field, which should make coding easier and reduce the
number of related bugs.
- Primary assignee:
Gorka Eguileor (geguileo)
- DB schema changes.
- DB online migration and OVO changes.
- Update existing operations that mark volumes as temporary to use the new field.
- Update operations that are not currently marking resources as temporary to do so with the new field.
- REST API changes to return the new field.
No new tempest tests will be added, since the cases we want to fix are mostly error situations that we cannot force in tempest.
Unit tests will be provided as with any other patch.
The API reference documentation will be updated.
Proposed Cinder code implementation:
Proposed python-cinderclient code implementation:
Proposed code to leverage this new functionality in the RBD driver to not flatten temporary resources: