Temporary Resource Tracking
Improve Cinder’s temporary resource tracking to prevent related quota issues.
Cinder doesn’t currently have a consistent way of tracking temporary resources, which leads to quota bugs.
In some cases temporary volumes are marked with the temporary key in the
volume admin metadata table, in other cases we determine a volume is
temporary based on its migration_status field, and there are even cases where
volumes are not being marked as temporary at all. This roundabout way of
marking temporary volumes, with multiple coexisting mechanisms, makes our
Cinder code error prone, as is clear from the number of bugs around it.
As for temporary snapshots, Cinder doesn’t currently have any way of reliably tracking them, so the code creating temporary resources assumes that everything will run smoothly and that the deletion code later in the method will be called after the operation completes successfully. That is not always true: the operation can fail and leave the temporary resource behind, forcing users to delete it manually, which throws the quota out of sync, since the REST API delete call doesn’t know it shouldn’t touch the quota.
When we say that we don’t have a reliable way of tracking snapshots we refer
to the fact that even though temporary snapshots are given names that help
identify them, such as [revert] volume %s backup snapshot and
backup-snap-%s, these are also valid snapshot names that a user can assign,
so we cannot rely on them to differentiate temporary snapshots.
There are several cases where this feature will be useful:
- Revert to snapshot is configured to use a temporary snapshot, but either the revert fails or the deletion of the temporary snapshot fails, so the user ends up manually deleting the snapshot, and wants the quota to be kept in sync with reality.
- Creating a backup of an in-use volume when backup_use_temp_snapshot is enabled fails, or the deletion of the temporary resource fails, forcing the user to manually delete the snapshot; the user wants the quota to be kept in sync with reality.
- A driver may have slow code that is run for performance reasons when cloning a volume or creating a snapshot, but that would not be reasonable to execute for temporary resources. An example is the flattening of cloned volumes in the RBD driver.
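To illustrate the third case, a driver could consult the proposed field to skip expensive post-processing for temporary resources. The following is a hypothetical Python sketch, not the RBD driver’s actual code; create_cloned_volume, clone_fn, and flatten_fn are made-up names:

```python
def create_cloned_volume(volume, clone_fn, flatten_fn):
    """Clone a volume, skipping slow post-processing for temporary ones."""
    clone = clone_fn(volume)
    # Flattening is only worth it for long-lived, user-facing volumes;
    # temporary volumes (use_quota=False) are deleted shortly after use.
    if volume.get("use_quota", True):
        flatten_fn(clone)
    return clone


flattened = []
make_clone = lambda v: f"clone-of-{v['id']}"

regular_vol = {"id": "v1", "use_quota": True}
temp_vol = {"id": "v2", "use_quota": False}

create_cloned_volume(regular_vol, make_clone, flattened.append)
create_cloned_volume(temp_vol, make_clone, flattened.append)
print(flattened)  # ['clone-of-v1']: only the regular volume was flattened
```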
The proposed solution is to have an explicit DB field that indicates whether a resource should be counted towards quota or not.
The field would be named use_quota and it would be added to the volumes and
snapshots DB tables. We currently don’t have temporary backups, so no such
field would be added to the backups DB table.
This would replace the temporary admin metadata entry and the use of the
migration_status field in 2 cycles, since we need to keep supporting
rolling upgrades where we could be running code that doesn’t know about the
new field.
An alternative solution would be to use the temporary key in the volumes’
admin metadata table, like we are already doing in some cases, and create one
such table for snapshots as well.
With that alternative DB queries could become more complex, unlike with the proposed solution, where they would become simpler.
Data model impact
A new use_quota DB field of type Boolean will be added to both the volumes
and snapshots tables.
There will be an online data migration to set the use_quota field for
existing volumes and snapshots, as well as an updated save method for the
Volume and Snapshot OVOs that sets this field whenever they are saved.
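As a rough sketch of the schema change, the new column defaults to True so that existing rows keep counting towards quota. This uses stdlib sqlite3 purely for illustration; Cinder’s real change would go through its DB migration framework:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-ins for the real ``volumes`` and ``snapshots`` tables.
conn.execute("CREATE TABLE volumes (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE snapshots (id TEXT PRIMARY KEY)")

# The schema migration itself: the same column on both tables, NOT NULL
# with a True default so existing rows keep consuming quota.
for table in ("volumes", "snapshots"):
    conn.execute(
        f"ALTER TABLE {table} ADD COLUMN use_quota BOOLEAN NOT NULL DEFAULT 1")

# New rows consume quota unless explicitly marked otherwise.
conn.execute("INSERT INTO volumes (id) VALUES ('v1')")
use_quota = conn.execute("SELECT use_quota FROM volumes").fetchone()[0]
print(use_quota)  # 1 (i.e. True)
```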
REST API impact
There won’t be any new REST API endpoint, since the use_quota field is an
internal field and we don’t want users or administrators modifying it.
But since this is useful information we will add this field to the volume’s
and snapshot’s JSON responses for all endpoints that return them, although
with a more user friendly name. Affected endpoints include:

- List detailed volumes
- List detailed snapshots
Active/Active HA impact
None, since this mostly just affects whether quota code is called or not when receiving REST API delete requests.
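For example, the REST API delete path only needs a single check to decide whether to touch the quota. The helper below is an illustrative sketch with made-up names, not Cinder’s actual quota code:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    id: str
    volume_size: int
    use_quota: bool = True  # default: the resource counts towards quota


def quota_delta_on_delete(snapshot):
    """Return the quota reservation deltas a delete should apply.

    Temporary resources (use_quota=False) never touched the quota, so the
    delete path simply skips the reservation step for them.
    """
    if not snapshot.use_quota:
        return {}
    return {"snapshots": -1, "gigabytes": -snapshot.volume_size}


print(quota_delta_on_delete(Snapshot("s1", 10)))         # {'snapshots': -1, 'gigabytes': -10}
print(quota_delta_on_delete(Snapshot("s2", 10, False)))  # {}
```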
Other end user impact
The change requires a patch on the python-cinderclient to show the new
returned field.
Performance impact
There should be no performance detriment with this change, since the field would be added at creation time and would not require additional DB queries.
Moreover, performance improvements should be possible in the future once we remove the compatibility code for the current temporary volume checks, for example no longer writing to the admin metadata table, making quota sync calculations directly on the DB, etc.
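As an example of the simpler quota sync calculations, with an explicit column the in-use totals become a single aggregate query. This is an illustrative stdlib sqlite3 sketch, not Cinder’s actual schema or query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the volumes table with the proposed column.
conn.execute(
    "CREATE TABLE volumes (id TEXT, size INT, use_quota BOOL, deleted BOOL)")
conn.executemany(
    "INSERT INTO volumes VALUES (?, ?, ?, ?)",
    [("v1", 10, 1, 0),         # regular volume: counts towards quota
     ("tmp-clone", 10, 0, 0)])  # temporary volume: excluded from quota

# One aggregate query replaces per-row admin metadata / migration_status
# checks when recalculating quota usage.
count, gigabytes = conn.execute(
    "SELECT COUNT(*), COALESCE(SUM(size), 0) FROM volumes "
    "WHERE use_quota AND NOT deleted").fetchone()
print(count, gigabytes)  # 1 10
```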
Other deployer impact
By default Volume and Snapshot OVOs will use quota on creation (use_quota set
to True), and when developers want to create temporary resources that don’t
consume quota on creation or release it on deletion they will need to pass
use_quota=False at creation time.
Also, code doing quota operations (adding or removing) will have to check this field in Volumes and Snapshots.
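The developer-facing contract could look like the following sketch, where SnapshotOVO and create_temp_snapshot are simplified stand-ins rather than Cinder’s real classes:

```python
from dataclasses import dataclass


@dataclass
class SnapshotOVO:
    """Simplified stand-in for Cinder's Snapshot OVO."""
    volume_id: str
    use_quota: bool = True  # default: the snapshot consumes quota


def create_temp_snapshot(volume_id):
    # Temporary resources must opt out of quota at creation time.
    return SnapshotOVO(volume_id, use_quota=False)


regular = SnapshotOVO("vol-1")
temp = create_temp_snapshot("vol-1")
print(regular.use_quota, temp.use_quota)  # True False
```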
It will no longer be necessary to add additional admin metadata or check the
migration_status field, which should make coding easier and reduce the
number of related bugs.
- Primary assignee:
Gorka Eguileor (geguileo)
- DB schema changes.
- DB online migration and OVO changes.
- Update existing operations that mark volumes as temporary to use the new field.
- Update operations that are not currently marking resources as temporary to do so with the new field.
- REST API changes to return the new field.
No new tempest tests will be added, since the cases we want to fix are mostly error situations that we cannot force in tempest.
Unit tests will be provided as with any other patch.
The API reference documentation will be updated.
Proposed Cinder code implementation:
Proposed python-cinderclient code implementation:
Proposed code to leverage this new functionality in the RBD driver to not flatten temporary resources: