.. This work is licensed under a Creative Commons Attribution 3.0 Unported
   License. http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Reliable quota enforcement
==========================================

Launchpad: https://blueprints.launchpad.net/neutron/+spec/better-quotas

Enforce network resource quotas in a reliable and efficient way, without
changing the quota API interface.

Problem Description
===================

Quota enforcement is a relatively simple task. When a request for creating
one or more resources is received, the available resource quota is checked,
and if the request cannot be satisfied an error is returned to the user.

This works perfectly under the assumption that API requests are served
serially. Unfortunately, this assumption does not hold when multiple REST
API workers are employed. The same applies if there is a single API worker
but multiple API servers operating behind a load balancer. In these
conditions, two or more concurrent requests can pass quota enforcement even
if there is not enough quota available to accept all of them.

For instance, assume there are 2 REST API workers and the current quota
usage is Nmax - 1, where Nmax is the resource quota. Two requests are then
sent and dispatched to the workers. Each worker enforces the quota before
the other worker completes processing its API request. Both requests are
accepted, leading to an overall resource usage of Nmax + 1.

Therefore, leveraging multiple workers and bulk creates, a tenant could in
theory create up to Nmax * Nworkers resources. While this is unlikely to
happen, there is still a good chance that tenants could manage to use more
resources than those they have been assigned by the cloud provider. This
situation is far from ideal both for the cloud provider and the user: the
former might end up with oversubscribed data center resources, whereas the
latter could be billed more than expected.

For RPC API workers, instead, quotas are not enforced at all. This implies
that quotas are not enforced for resources created via RPC, usually
"internal" ports. While not enforcing quotas for service ports such as DHCP
ports might be desirable, the current behavior is probably not correct:
these ports still contribute to the overall usage counted when enforcing
quotas for REST API requests.

Proposed Change
===============

Add a concept of 'resource reservation', which is already adopted in other
projects such as nova and cinder. Whenever a quota enforcement check is
successful, mark the requested amount of resources as 'reserved'
immediately. The reservation shall be removed as soon as the corresponding
API operation is completed, successfully or not; this will ideally be done
upon return from the plugin call. In case of server crashes, an expiration
time should be provided to ensure stale reservations do not prevent
allocation of resources upon server restarts.

Multiple reservations can exist at any time for the same tenant and the
same resource. The quota check should add all the existing reservations to
the current usage count.

Even with reservations, there is still a race condition which can occur if
a worker performs the quota check before the other workers record their
reservations. This can be addressed with lock-free algorithms which retry
an operation when the conditions to perform it are not met. DB integrity
constraints can be used to this aim, without resorting to locking queries
or other distributed synchronization constructs.
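As an illustration of the reservation-aware check described above, consider
the following minimal sketch. It does not represent existing Neutron code;
all names and signatures are hypothetical::

  import datetime


  def quota_check_passes(usage, reservations, requested, limit, now=None):
      """Hypothetical reservation-aware quota check.

      'usage' is the count of committed resources for a tenant and
      resource type; 'reservations' is a list of (amount, expiration)
      tuples. Expired reservations are ignored, so that stale entries
      left behind by a crashed server do not block new allocations.
      """
      now = now or datetime.datetime.utcnow()
      reserved = sum(amount for amount, expiration in reservations
                     if expiration > now)
      return usage + reserved + requested <= limit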
Serializing reservations through DB integrity constraints can be achieved
by introducing a table whose only purpose is to keep track of reservations
in progress. The resource reservation process will therefore look as
follows:

1) ACQUIRE RESERVATION FACILITY

   Write down in an appropriate DB table, say booking_locks, that a given
   worker is attempting to reserve a certain amount of a given resource
   for some tenant::

     *---------------*-----------*-----------*------------*
     | RESOURCE_TYPE | TENANT_ID | LOCKED_BY | EXPIRATION |  PK: RESOURCE_TYPE,
     *---------------*-----------*-----------*------------*      TENANT_ID
     | network       | xxx       | worker_0  | some time  |
     *---------------*-----------*-----------*------------*

   The selected primary key ensures that it won't be possible for two
   distinct workers to attempt to concurrently reserve the same resource.
   The expiration timestamp ensures stale locks are ignored in case of
   server failures.

2) COMMIT TRANSACTION

   This will trigger an integrity violation error if another worker
   attempts to make another reservation for the same tenant and resource
   type. In this case, step 1 will be retried a few times, with an
   exponential backoff between attempts.

3) DO RESERVATION

   Record the reservation in an appropriate table::

     *---------------*-----------*----------------*---------------*
     | RESOURCE_TYPE | TENANT_ID | BOOKING_AMOUNT | EXPIRATION    |
     *---------------*-----------*----------------*---------------*
     | network       | xxx       | 2              | at some point |
     *---------------*-----------*----------------*---------------*

4) RELEASE RESERVATION FACILITY

5) COMMIT TRANSACTION

Example::

         worker 0                       worker 1
         --------                       --------
  acquire_resv_facility          acquire_resv_facility
  COMMIT - SUCCESS               COMMIT - FAIL (INTEGRITY VIOLATION)
  do_reservation                 RETRY
  release_resv_facility          COMMIT - FAIL (INTEGRITY VIOLATION)
  COMMIT - SUCCESS               RETRY
  -                              do_reservation
  -                              COMMIT - SUCCESS
  -                              release_resv_facility
  -                              COMMIT - SUCCESS

The table for acquiring the reservation facility acts as a centralized
lock, leveraging only primary keys, which are correctly enforced even in
active/active clusters such as MySQL Galera. The algorithm should however
be regarded as non-blocking, since workers will always actively retry to
perform the reservation; furthermore, the backoff mechanism should prevent
starvation.

It is also worth noting that the proposed expiration timestamps will
prevent stale records from blocking acquisition of the reservation
facility or creating 'ghost' resource reservations. The expiration timeout
might be configurable, but a default of at least two minutes is advisable,
considering that the current Neutron implementation still suffers from DB
deadlocks triggered by eventlet, which with the default MySQL settings
block threads for about 50 seconds.
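Putting the pieces together, the following sketch illustrates how steps
1 to 5 might be implemented with SQLAlchemy. It is purely illustrative:
the models anticipate the data model described below, the function and
parameter names are hypothetical, and quota checking and stale-lock
cleanup are omitted for brevity::

  import datetime
  import random
  import time

  from sqlalchemy import Column, DateTime, Integer, String
  from sqlalchemy.exc import IntegrityError
  from sqlalchemy.ext.declarative import declarative_base

  Base = declarative_base()


  class BookingLock(Base):
      # The composite primary key serializes reservations: only one
      # worker at a time can insert a row for a (resource, tenant) pair.
      __tablename__ = 'booking_locks'
      resource_type = Column(String(255), primary_key=True)
      tenant_id = Column(String(255), primary_key=True)
      locked_by = Column(String(255), nullable=False)
      expiration = Column(DateTime, nullable=False)


  class Booking(Base):
      # A surrogate key is used here because multiple reservations may
      # exist at any time for the same tenant and resource.
      __tablename__ = 'bookings'
      id = Column(Integer, primary_key=True)
      resource_type = Column(String(255), nullable=False)
      tenant_id = Column(String(255), nullable=False)
      booking_amount = Column(Integer, nullable=False)
      expiration = Column(DateTime, nullable=False)


  def reserve(session, worker_id, resource_type, tenant_id, amount,
              timeout=120, max_retries=10):
      """Attempt a reservation, retrying with exponential backoff."""
      for attempt in range(max_retries):
          expiration = (datetime.datetime.utcnow() +
                        datetime.timedelta(seconds=timeout))
          try:
              # Steps 1-2: acquire the reservation facility. A
              # concurrent insert for the same (resource_type,
              # tenant_id) triggers an integrity violation on commit.
              session.add(BookingLock(resource_type=resource_type,
                                      tenant_id=tenant_id,
                                      locked_by=worker_id,
                                      expiration=expiration))
              session.commit()
          except IntegrityError:
              session.rollback()
              # Back off exponentially (with jitter) before retrying.
              time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
              continue
          # Steps 3-5: record the reservation, release the facility,
          # and commit.
          session.add(Booking(resource_type=resource_type,
                              tenant_id=tenant_id,
                              booking_amount=amount,
                              expiration=expiration))
          session.query(BookingLock).filter_by(
              resource_type=resource_type, tenant_id=tenant_id).delete()
          session.commit()
          return True
      return False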
Data Model Impact
-----------------

This document proposes two new tables:

Table name: bookings

Attributes:

* resource_type: string
* tenant_id: string
* booking_amount: integer
* expiration: datetime

Table name: booking_locks

Attributes:

* resource_type: string
* tenant_id: string
* locked_by: string
* expiration: datetime

REST API Impact
---------------

No impact on the API interface.

Security Impact
---------------

The proposed change should not open any vulnerability which could lead to
DoS, leak tenant information, or allow attackers to plug into other
tenants' networks. All the DB queries performed in the implementation of
this specification will be scoped by tenant, and built in a way that
prevents SQL injection.

Notifications Impact
--------------------

None.

Other End User Impact
---------------------

None.

Performance Impact
------------------

Some additional DB operations will be performed upon quota enforcement.
This change might also serialize some operations which were previously
processed in parallel by distinct workers. While the overall performance
impact is expected to be negligible, it will be important to evaluate it
upon code review.

There is also an interesting question pertaining to resource usage: it is
worth exploring whether it is better to count resources at every quota
check, or to maintain a usage counter updated whenever resources are
created or deleted. To this aim, the cost of SELECT queries should be
carefully compared with the cost of adding DB hooks on resource
create/delete and performing the corresponding UPDATE query. The gain or
loss in terms of performance depends on the relative frequency of GET
operations vs POST/DELETE operations: an UPDATE is definitely more
expensive, but SELECTs are far more frequent. However, it should be
possible to implement this specification leaving resource usage
calculation unchanged, and then perhaps revisit it in the future.

IPv6 Impact
-----------

None.

Other Deployer Impact
---------------------

As service ports will no longer be counted in resource usage, deployers
might in theory expect an increase in port usage. This increase should
however be contained, considering that ports are typically used with
instances, and the enforcement criteria for instances are not being
altered by this specification.

Developer Impact
----------------

The implementation for this specification will come with adequate
developer documentation.

Community Impact
----------------

None, I don't think so... but one can never know.

Alternatives
------------

One alternative worth considering is to use a single table for managing
bookings. While the tuple (resource_type, tenant_id, resource_amount)
cannot be reliably used as a primary key for serializing bookings among
workers, the "locked_by" attribute can be added to this table and the
following tuple can constitute the primary key:

(resource_type, tenant_id, locked_by)

When the booking is committed, the locked_by value would be cleared,
allowing other workers to make their bookings; a sketch of this variant
follows.
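As an illustration only, the combined table might be declared as follows.
The model is hypothetical and simply mirrors the composite primary key
described above::

  from sqlalchemy import Column, DateTime, Integer, String
  from sqlalchemy.ext.declarative import declarative_base

  Base = declarative_base()


  class CombinedBooking(Base):
      # Single-table variant: the booking and the lock that serializes
      # it live in the same row. The composite primary key plays the
      # role of the separate booking_locks table; 'locked_by' is
      # cleared once the booking is committed, so that other workers
      # can insert their own bookings.
      __tablename__ = 'bookings'
      resource_type = Column(String(255), primary_key=True)
      tenant_id = Column(String(255), primary_key=True)
      locked_by = Column(String(255), primary_key=True, default='')
      booking_amount = Column(Integer, nullable=False)
      expiration = Column(DateTime, nullable=False)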
Nevertheless, this solution does not appear to bring benefits in terms of
performance and scalability, and also makes the code less readable.

While it would be possible to use locking queries (e.g. SELECT ... FOR
UPDATE), this solution has already proven not to be ideal, and will also
fail with some DB backends when active/active replication is employed.

Moreover, an alternative non-blocking algorithm could be constructed along
the lines of the one proposed for Nova [#]_. That algorithm has the
advantage of performing on average 1 DB transaction instead of 2 for each
quota enforcement operation, and is therefore more efficient; on the other
hand, the cost of a retry operation in case of conflict would be slightly
higher. While the latter detail is not really important, this alternative
algorithm cannot be applied to Neutron as it does not have resource usage
counters; without them, the implementation of this lock-free algorithm
would be rather difficult.

Another alternative would consist in a separate quota granting authority.
This is a possibility, and it is what the 'Blazar' [#]_ resource
reservation project advocates, but it might be overkill for many OpenStack
deployments. Moreover, while in theory it is possible to use Blazar to
this aim, the project has actually been conceived to book resources which
are meant to be time-shared across tenants. On the other hand, the 'Boson'
[#]_ project might represent a more viable alternative, where quota
management and enforcement are delegated to a third-party application.
While this project is very interesting, it is not yet at a development
stage where it can be considered for adoption by Neutron.

Finally, this problem could also be solved by introducing a distributed
lock among API workers. Tools such as memcached or ZooKeeper could be used
with relative ease to implement this sort of distributed coordination.
Nevertheless, there is probably no need to resort to distributed
coordination if a lock-free algorithm can be devised just by leveraging DB
integrity.

Implementation
==============

Assignee(s)
-----------

salv-orlando

Work Items
----------

1) Preliminary yak shaving: refactor the existing quota module
2) Add resource booking logic and use it in quota enforcement
3) Remove service ports from quota enforcement

Dependencies
============

The big dependencies for this change could be the following:

1) removal of the home-grown WSGI framework and subsequent switch to pecan
2) review of the plugin interface

While the above items will likely redefine the hooks from which quota
enforcement is performed, they won't change the logic of the quota
enforcement module, which can therefore be implemented orthogonally.

Testing
=======

Tempest Tests
-------------

None. This specification neither changes the API interface nor introduces
any change which might have an impact on the integrated gate.

Functional Tests
----------------

Appropriate functional tests will be added to validate correct quota
enforcement. As proper verification will require triggering the race
condition, some sort of fault injection might be needed; the need for and
feasibility of this will be evaluated separately.

API Tests
---------

No further API test is needed.

Documentation Impact
====================

User Documentation
------------------

Document that service ports won't count anymore towards the overall
resource usage.

Developer Documentation
-----------------------

As there is currently no developer documentation for quotas, this is
rather easy: write developer documentation for the quota enforcement
module.

References
==========

.. [#] Nova lock-free quotas: https://review.openstack.org/#/c/135296
.. [#] Blazar project: https://wiki.openstack.org/wiki/Blazar
.. [#] Boson project: https://wiki.openstack.org/wiki/Boson