.. This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode ================================ A lock-free quota implementation ================================ https://blueprints.launchpad.net/nova/+spec/lock-free-quota-management Implement a lock-free quota management algorithm that removes the use of the SELECT FOR UPDATE in the database API. Problem description =================== When launching a single instance in Nova, more than 120 database queries may be made against the Nova, Neutron, Glance and/or Cinder databases. Of these queries, a significant portion of them involve quota management tasks -- checking for existing quotas, checking for existing project and user usage records, claiming the resources used in a reservation, and either committing or rolling back those reservations once the launch sequence succeeds or fails. The `SELECT FOR UPDATE` SQL construct is used in Nova in a couple places, to ensure that no two concurrent threads attempt to update the same rows in the database. When a thread selects records with `SELECT FOR UPDATE`, the thread is announcing that it intends to modify the records it is reading from the table -- this is called a write-intent lock. If another thread wants to update the same set of records, it will issue a `SELECT FOR UPDATE` call, and this call will wait until the first thread has completed the transaction and either issued a `COMMIT` or a `ROLLBACK`. In the case of traditional RDBMS systems like PostgreSQL or MySQL, calls to `SELECT FOR UPDATE` are, by nature, a detriment to scalability, since only a single thread may hold the write-intent lock on the same rows in the table at any given time. All other threads must wait while the single writer thread finishes what it is doing. In the case of Nova, the use of `SELECT FOR UPDATE` is predominantly in two areas: quota management and assignment of free IP addresses in nova-network's IPAM layer. In addition to `SELECT FOR UPDATE`'s inherent scalability issues, a popular replication variant of MySQL, called MySQL Galera, does not support the write-intent locks that `SELECT FOR UPDATE` requires. What this means, in practice, for deployers of Nova with MySQL Galera, is that occasionally the MySQL client will return a deadlock error when two threads simultaneously attempt to change the same set of rows. A deadlock in the traditional sense of the term does not actually occur, but Galera raises the error code for an InnoDB lock wait timeout (deadlock has occurred) when something called a certification failure happens. A certification failure happens when two threads writing to two different nodes in a Galera cluster attempt to `UPDATE` the same set of rows in the same table during the same time interval. Instead of causing MySQL to contain inconsistent data (two nodes having a different idea of the underlying data), Galera simply causes both threads to fail and thus issue a `ROLLBACK` of the containing transaction. This is different behavior from standard MySQL, in which a similar situation would cause just one of the threads to `ROLLBACK` after receiving a lock wait timeout error, and the other thread's `UPDATE` would succeed. The reason that this "deadlock" is not actually a deadlock in the Galera case is that the entire process happens without any actual waits or timeout loops. Each conflicting thread is simply sent an error and that thread issues a `ROLLBACK` of the current SQL transaction. Since MySQL Galera is, by far, the most popular high-availability database deployment option currently in the operator ecosystem, some changes are required in the quota management code to replace the use of `SELECT FOR UPDATE` with a lock-free implementation that suffers neither scalability problems nor the Galera-specific quasi-deadlock problems. Use Cases --------- A deployer typically uses a system like Tempest or Rally to provide an "acceptance test" of their OpenStack deployment. Both of these systems stress the Nova system by attempting to spawn and take actions against many instances, using many users and projects. Both of these tests realiably produce error scenarios where from the Nova log files, the deployer can see that a deadlock retry mechanism is being used to handle failure of the MySQL Galera cluster to certify quota usage update records. This deadlock retry mechanism is heavy-handed and clearly affects the performance and scalability of Nova. The deployer wants to see the Nova logs free of retry messages, and would like to see improved throughput of the system as a whole. Project Priority ---------------- This is not related to any of the project priorities listed for Kilo. Proposed change =============== The proposed solution is to borrow a page from obstruction-free and lock-free algorithm design and use a "compare and swap" method that allows the thread that intends to change the quota usage records for a user or project to issue a standard `SELECT` statement for those records, and when that thread goes to update those records, it first checks that the state of records is what the thread previously knew to exist. The `UPDATE` statement will include a `WHERE` condition that will ensure that the rows are *only* updated in the table IFF the current row values are what the thread thought they were when previously reading the rows with the `SELECT` statement. The thread will check the number of rows affected by the `UPDATE` statement. If the number of rows affected is 0, then a randomized exponential backoff loop will be hit and the process of reading and then `UPDATE` ing with the `WHERE` condition will repeat until a pre-defined number of tries has been attempted. This algoritm is lock-free, in that no record locks of any kind are taken at any point in the quota management transactions. Alternatives ------------ None Data model impact ----------------- None REST API impact --------------- None Security impact --------------- None Notifications impact -------------------- None Other end user impact --------------------- None Performance Impact ------------------ We will work with the Rally developer team to identify some pre and post benchmarks that should demonstrate better concurrency with this lock-free implementation under both MySQL and PostgreSQL deployments. Other deployer impact --------------------- None Developer impact ---------------- None Implementation ============== We will implement the lock-free algorithm entirely in the quota_reserve(), quota_rollback() and quota_commit() DB API methods. The use of `with_lockmode('update')` shall be removed from the query object construction in the `_get_project_user_quota_usages()` method in `nova.db.sqlalchemy.api`. Within `quota_reserve()`, `quota_commit()` and `quota_rollback()`, we will change the algorithm from this *simplified* pseudo-code for `quota_reserve()`:: start_transaction: usage_records = get_and_lock_usage_records() reservations = [] for resource, amount in requested_resource_changes: reservation = reservation_record_create(resource, amount) reservations.append(reservation) commit_transaction return reservations to this:: current_usage_records = get_usage_records() for resource, amount in requested_resource_changes: while num_attempts < max_attempts: if usage_records_update(resource, amount, current_usage_records): break num_attempts++ current_usage_records = get_usage_records() return requested_resource_changes where the `usage_records_update()` method would look like this, again, in pseudo-code:: def usage_records_update(resource, amount, current_records): sql = "UPDATE quota_usage SET used = used + amount WHERE resource = $resource AND used = $current_records.used" execute_sql() return num_affected_rows() > 0 It's important to note that the implementation above proposes to remove the transactional container around the retrieval and update of quota usage records. This is due to the need to have each `UPDATE` SQL statement occur within its own transactional container. Otherwise, the semantics of the `REPEATABLE_READ` isolation level would mean the transaction would not see any changes from other transactions. Assignee(s) ----------- Primary assignees: AlexFrolov pkholkin Work Items ---------- * Add a new quota driver `nova.quota.ConcurrentDbDriver` class that we can use to isolate testing of this new functionality and ensure that the existing quota DB driver can remain the default quota driver while this work is completed. * Add new DB API methods that perform the update of a single resource usage record that returns the number of rows affected by the UPDATE statement to the caller. The UPDATE statement shall construct a WHERE condition that includes the expected usage amount. * Add a new DB API method that returns the set of quota usage records for a project, but does not call lock_mode('update'). * Change the reserve() method of the new quota DB driver to call the new DB API method to retrieve the usage records, loop over each usage record, calling the new compare-and-swap update usage amounts DB API method for each resource reservation. The method should implement a while loop that detects when the usage record was not updated and retries the update after reading the updated usage record information. The method should keep track of the records that were updated successfully. If on any retry loop of any resource, a quota exceeded is detected, then the successfully-updated usage records should be updated to undo the reservation. * Do a similar implementation for the commit() method of the new quota DB driver. * Do a similar implementation for the rollback() method of the new quota DB driver. * Provide solid reference documentation in the developer reference docs about the concurrent quota DB driver. Dependencies ============ None Testing ======= We should add a test that checks for occurrences of retry on deadlock in the Nova logs, and verify that after this fix, we do not see any more occurrences. Documentation Impact ==================== None References ========== A relevant mailing list thread about the strategy of retries: http://lists.openstack.org/pipermail/openstack-dev/2014-November/050935.html http://lists.openstack.org/pipermail/openstack-dev/2014-November/051300.html