Scheduler: Introduce HostState level locking¶
Nova FilterScheduler implementation even though inherently multi-threaded, uses no locking for access to the shared in-memory HostState data structures, that are shared between all active threads. Even though this means that most of decisions that scheduler makes under load are not internally consistent, this is not necessarily a huge issue for the basic use case, as Nova makes sure that the set resource usage policy is maintained even due to races using the retry mechanism 1. This can however cause issues in several more complex use cases. A non exhaustive list of some examples would be: high resource utilization, high load, specific types of host and resources (e.g. Ironic nodes 2 and complex resources such as NUMA topology or PCI devices).
We propose to change the scheduler code to use a lightweight transactional approach to avoid full blown locking while still mitigating some of the race conditions.
Our scheduler service is inherently multi-threaded as it currently runs an oslo-messaging RpcServer using an EventletExecutor. This means that every incoming RPC message for select_destinations will be dispatched in it’s own green thread.
Upon receiving the message, every green thread will read all ComputeNode states from the database, and potentially 3 populate the internal global data structure that holds the host states which will be used for filtering.
Further along, after choosing a host, each thread will call the HostState.consume_from_instance() method on the chosen object, which will “consume” the resources for the instance being scheduled from the chosen HostState object. This is the equivalent of what Claims code does once the request makes it to a nova-compute service, except instead of updating the ComputeNode table, it updates the scheduler service’s in memory HostState object.
However since there is no mutual exclusion of threads between the time a filter function ran and decide that the host passes, until a single host state was chosen. A number of other concurrent threads could have already updated the same host state. A classic race condition. Once we consider this, some obvious avenues for improvement arise.
When calling consume_from_instance() we are basically doing a claim of resources on the host state, that may have changed since the filter function that decided to pass the host ran. At that point we have all the information to know early if a claim is going to fail and try to choose a different host. This is roughly equivalent to retrying a transaction.
It is worth noting here that even though we may find that host seems like it will be failing, we may still want to choose it, as we don’t ever drop the resources consumed on the HostState even after we register a retry from a already chosen compute host in this refresh cycle, so it may in fact be a false negative.
There needs to be some kind of locking that is granular enough so as not to cause too much unnecessary overhead, but also to allow for more consistent handling of HostState.
There is no specific use case that this is aimed at. It is an internal refactoring aimed at improving data consistency in the scheduler, and thus overall effectiveness of placement decisions.
Firstly, it would be very useful to use the Claim logic instead (or inside) of HostState.consume_from_instance() as there is almost complete duplication there.
Next change that would be in the scope for this blueprint would be adding synchronisation primitives around accessing and updating HostState fields. A lightweight approach would be to not use any synchronisation primitives in the filters, as access to the host state is a) read-only b) usually per resource. consume_from_instance is the place where we want to make sure access is synchronized, as once the host is chosen, it will need to have resources consumed (remember - many concurrent threads could be trying to consume resources from the same HostState) and if it fails any of the “claims”, no resources should be consumed. Updating the host state with fresh values after a DB read should also be synchronized.
Final piece of the puzzle is modifying the FilterScheduler._schedule() method to take into account the failure to claim in consume_from_instance() and try the next host that passed the filters, or choose to ignore the local in memory failure and risk a retry from the compute host.
It is worth noting that this proposal only looks at fixing data consistency among threads of a single nova-scheduler process. Running several workers still means that their internal state is going to be inconsistent between updates from the database. Fixing this is outside of the scope of this proposal.
There are a number of ways we could re-design the scheduler so that the issues discussed in this spec become irrelevant. This blueprint aims to improve some obvious issues with the current implementation of the scheduler without changing the basic design.
Data model impact¶
REST API impact¶
Other end user impact¶
Even though there will be overhead of synchronisation in every request after this change which may decrease the average response time for basic workloads, I fully expect this to massively improve the performance in conditions of a large number of requests, or low overall cloud capacity (or specific resources such as Ironic hosts), as it will significantly cut down on issued retries.
Other deployer impact¶
There may be several config options deployers would need to consider. Defaults may be chosen in such a way as to not change previous behaviour.
Developers would need to understand that there is now locking going on in the scheduler, and consider this when making changes to the code, especially in case of adding additional resources.
- Primary assignee:
Modify HostState.consume_from_instance() to use the Claim logic and acquire a HostState instance-wide lock for doing so (merged in Mitaka).
Modify HostState.update_from_compute_node() to acquire a HostState instance-wide lock for updating the host state (merged in Mitaka).
Refactor Claim classes to not be directly dependent on the resource_tracker, so that they can be used in the scheduler code and possibly move out of the compute/ subtree
Modify FilterSchedule._schedule() method to expect a claim transaction failure and take appropriate action.
As is usually the case with race problems, it is notoriously difficult to come up with deterministic tests. Testing will be limited to unit tests making sure that proper synchronisation primitives are called as expected.
There may be an additional config option to turn on the transactional nature of consume_from_instance() and possibly another one to tell the scheduler to go ahead and attempt to land an instance even though a local claim failed.
The Retry mechanism works kind of like a 2PC where the instance resource usage is consumed on the in memory view the scheduler has, but is only committed to the DB when the request makes it to the chosen compute host, and under a global resource lock.
This bug <https://bugs.launchpad.net/nova/+bug/1341420> shows that this is pretty bad in case of Ironic.
I say potentially because there is a check of a timestamp to see if the HostState has actually been updated more recently than the ComputeNode record (with in flight requests not yet claimed on their compute hosts).