This specification proposes to add distributed periodic tasks.
Currently periodic tasks are executed on each engine simultaneously, so that the amount of time between those tasks can be very small and load between engines is not balanced. Distribution of periodic tasks between engines can deal with both problems.
Also currently we have 2 periodic tasks that make clusters cleanup. Clusters terminations in these tasks are performed in cycle with direct termination calls (bypassing OPS). This approach leads to cessation of periodic tasks on particular engine during cluster termination (it harms even more if some problems during termination occurs).
Moreover distribution of periodic tasks is important for peridic health checks to prevent extra load on engines.
This change consists of two things:
Distributed periodic tasks will be based on HashRing implementation and Tooz library that provides group membership support for a set of backends . Backend will be configured with periodic_coordination_backend option.
There will be only one group called sahara-periodic-tasks. Once a group is created, any coordinator can join the group and become a member of it.
As an engine joins the group and builds HashRing, hash of its ID is being made, and that hash determines the data it’s going to be responsible for. Everything between this number and one that’s next in the ring and that belongs to a different engine, is now belong to this engine.
The only remaining problem is that some engines will have disproportionately large space before them, and this will result in greater load on them. This can be ameliorated by adding each server to the ring a number of times in different places. This is achieved by having a replica count (will be 40 by default).
HashRing will be rebuilt before execution of each periodic task to reflect actual state of coordination group.
sahara.service.coordinator.py module and two classes will be added:
Coordinator class will contain basic coordination and grouping methods:
class Coordinator(object): def __init__(self, backend_url): ... def heartbeat(self): ... def join_group(self, group_id): ... def get_members(self, group_id): ...
HashRing class will contain methods for ring building and subset extraction:
class HashRing(Coordinator): def __init__(self, backend_url, group_id, replicas=100): ... def _build_ring(self): ... def get_subset(self, objects): ...
Now we have 4 periodic tasks. For each of them will be listed what exactly will be distributed:
Also we will have a periodic task for health checks soon. Health check task will be executed on clusters and this clusters will be split among engines in the same way as job executions for update_job_statuses.
If a coordination backend is not provided during configuration, periodic tasks will be launched in an old-fashioned way and HashRing will not be built.
If a coordination backend is provided, but configured wrong or not accessible, engine will not be started with corresponded error.
If a connection to the coordinator will be lost, periodic tasks will be stopped. But once connection is established again, periodic tasks will be executed in the distributed mode.
In order to keep the connection to the coordination server active, heartbeat method will be called regularly (every heartbeat_timeout seconds) in a separate thread.
Configurable number of threads (each thread will be a separate member of a group) performing periodic tasks will be launched.
Coordination backend should be configured.
Tooz package 
Unit tests, enabling distributed periodics in intergration tests with one of the supported backends (for example, ZooKeeper) and manual testing for all available backends  supported by Tooz library.
Sahara REST API documentation in api-ref will be updated.