The Central Agent’s job is to poll resources for information, transform that information into samples, and pass the samples on to the Collector Agent.
This specification proposes an implementation of coordination between multiple Central Agents, which could then dynamically distribute the workload between them, providing scalability and high availability.
Currently, each Central Agent retrieves a set of resources and polls all of them. If we have multiple Central Agents running, they all poll the same set of resources, which prevents us from scaling out horizontally.
At the start of each polling interval, each of the pollsters retrieves a list of resources to poll from its Discovery plugin (configured in the pipeline or a default one). This makes the discovery process a great place to implement the coordination and partitioning logic, while the pollsters themselves can remain in blissful ignorance of anything going on.
The basic idea is to use the tooz library for group membership, and hashing to assign resources to the active Central Agents.
Determining the resources we’re responsible for
We have a list of resources and get a list of active agents from tooz. We then get our assigned resources as follows:
    # hash() stands for any hash that gives the same value in every agent
    our_key = sorted(agents).index(our_agent_uuid)
    our_resources = []
    for resource in resources:
        key = hash(resource) % len(agents)
        if key == our_key:
            our_resources.append(resource)

    # or, more pythonic
    our_resources = [r for r in resources
                     if hash(r) % len(agents) == our_key]

In essence, we hash the resources into as many buckets as there are active Central Agents and only poll the resources that fall into our bucket. A good hash function ensures that the resources are evenly distributed among the active Central Agents.
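The bucket assignment described above can be sketched as runnable Python. This is a minimal illustration, not the proposed implementation: the function names `stable_hash` and `assigned_resources` are hypothetical, and SHA-256 is swapped in for `hash` as an assumption, because Python's built-in `hash()` of strings is salted per process by default and separate agent processes would not agree on bucket numbers.

```python
import hashlib


def stable_hash(resource):
    """Hash a resource identifier identically in every agent process."""
    digest = hashlib.sha256(str(resource).encode("utf-8")).hexdigest()
    return int(digest, 16)


def assigned_resources(resources, agents, our_agent_uuid):
    """Return the resources whose bucket matches our position among agents."""
    ordered = sorted(agents)
    our_key = ordered.index(our_agent_uuid)
    return [r for r in resources if stable_hash(r) % len(ordered) == our_key]


if __name__ == "__main__":
    agents = ["agent-a", "agent-b", "agent-c"]          # hypothetical agent ids
    resources = ["instance-%d" % i for i in range(10)]  # hypothetical resources
    for agent in agents:
        print(agent, assigned_resources(resources, agents, agent))
```

Because every agent sorts the same member list and applies the same hash, each resource lands in exactly one agent's bucket without the agents exchanging any messages about individual resources.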
The pollster’s Discovery plugin (be it a Compute Discovery, Hardware Discovery, etc.) provides the scope its resources are a part of.
For example, if a Discovery plugin isn’t constrained to a subset of resources, as is the case for most Discovery plugins, then it should simply join the global group of unconstrained Discovery plugins.
If, on the other hand, the resources that the Discovery plugin can discover are constrained, like in the case of Compute Discovery, then the group name should reflect their scope. An example of this would be ‘compute-<hostname>-discovery’. This way only the pollsters that are polling the same host will share their workload between them.
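As a small sketch of the naming rule, a Discovery plugin could derive its group name from its scope as shown below. The helper `discovery_group` and the global group name `global-discovery` are assumptions made for illustration; the text above only fixes the `compute-<hostname>-discovery` pattern.

```python
GLOBAL_GROUP = "global-discovery"  # hypothetical name; not fixed by the spec


def discovery_group(hostname=None):
    """Return the coordination group name for a Discovery plugin.

    Unconstrained plugins share one global group; plugins whose resources
    are tied to a host get a group scoped to that host.
    """
    if hostname is None:
        return GLOBAL_GROUP
    return "compute-%s-discovery" % hostname
```

With this rule, two agents on `node1` would both join `compute-node1-discovery` and split that host's resources, while an agent on another host would never be assigned them.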
What happens when we start another agent (or stop an existing one)?
tooz allows us to register a callback that is called when a member joins or leaves the group. It keeps track of member liveness using a heartbeat mechanism.
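To illustrate how such callbacks might feed the partitioning, the sketch below keeps a local view of the group that join and leave events update. It is an in-memory stand-in and makes no tooz calls; the class `MembershipView`, its method names, and the SHA-256 bucket function are all hypothetical.

```python
import hashlib


def _bucket(resource, n):
    """Map a resource to one of n buckets, identically in every process."""
    digest = hashlib.sha256(str(resource).encode("utf-8")).hexdigest()
    return int(digest, 16) % n


class MembershipView:
    """Local view of the group, updated by join/leave notifications."""

    def __init__(self, our_id):
        self.our_id = our_id
        self.members = {our_id}

    # These two methods play the role of the join/leave callbacks.
    def on_join(self, member_id):
        self.members.add(member_id)

    def on_leave(self, member_id):
        self.members.discard(member_id)

    def my_subset(self, resources):
        """Recompute our share from the current membership."""
        ordered = sorted(self.members)
        our_key = ordered.index(self.our_id)
        return [r for r in resources if _bucket(r, len(ordered)) == our_key]
```

An agent that is alone in the group gets every resource; once a second member joins, the next call to `my_subset` returns only its share, and after that member leaves it returns the full list again.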
When a member joins or leaves the group, this is what happens to:
Generalizing the implementation for re-use
The need for coordinated assignment of “things” (resources, alarms, ...) to agents is not unique to the Central Agent. Currently, the Alarm Evaluator could make use of it as well to have multiple Alarm Evaluators running, each evaluating their share of alarms.
This functionality could be captured in a PartitionCoordinator class, which agents could use like:
    partition_coordinator = PartitionCoordinator(group='alarm')
    partition_coordinator.start()

    every evaluation_interval:
        all_alarms = get_all_alarms()
        my_alarms = partition_coordinator.get_my_subset(all_alarms)
        for a in my_alarms:
            evaluate(a)
The actual change-over of the alarm partitioning coordination to the proposed approach will be tracked in a separate blueprint.
or in the case of the central agent:
    partition_coordinator = PartitionCoordinator(group='central_agent')
    partition_coordinator.start()

    every polling_interval:
        all_resources = discover_resources()
        my_resources = partition_coordinator.get_my_subset(all_resources)
        for r in my_resources:
            poll(r)
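One possible shape for such a class is sketched below, under stated assumptions. Only the names `PartitionCoordinator`, `start`, and `get_my_subset` come from the usage above. The `list_members` hook, the `my_id` parameter, and the SHA-256 hashing are hypothetical, and group membership is supplied by a plain callable where a real implementation would query tooz.

```python
import hashlib
import uuid


class PartitionCoordinator:
    """Illustrative, in-memory sketch of the proposed interface."""

    def __init__(self, group, list_members, my_id=None):
        self.group = group
        self._list_members = list_members  # hypothetical hook: group -> member ids
        self.my_id = my_id or str(uuid.uuid4())
        self._started = False

    def start(self):
        # A real implementation would join the coordination group here.
        self._started = True

    def get_my_subset(self, iterable):
        """Return the items that hash into this member's bucket."""
        if not self._started:
            raise RuntimeError("call start() first")
        members = sorted(set(self._list_members(self.group)) | {self.my_id})
        my_key = members.index(self.my_id)
        return [x for x in iterable if self._hash(x) % len(members) == my_key]

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(str(item).encode("utf-8")).hexdigest()
        return int(digest, 16)


if __name__ == "__main__":
    group_state = {"central_agent": ["agent-1", "agent-2"]}  # stand-in for tooz
    pc = PartitionCoordinator("central_agent", group_state.get, my_id="agent-1")
    pc.start()
    print(pc.get_my_subset(["res-%d" % i for i in range(6)]))
```

Reading the membership on every call keeps the sketch short; an implementation driven by join and leave callbacks could cache the sorted member list instead and refresh it only when the group changes.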
Fabio Gianetti’s approach.
Fabio’s approach uses source<->agent assignments in the database for figuring out what to poll and a heartbeat in combination with additional agents listening for that heartbeat for failure detection.
In contrast, this proposal uses tooz for failure detection (via heartbeats as well). Additionally, the resource allocation is more dynamic since the resources are assigned to agents evenly at any point in time. It is also more lightweight since we don’t need to keep an explicit resource<->agent mapping in the database, but use hashing instead.
Another approach would be to use the distributed locking provided by tooz. Before polling a resource, a pollster would need to acquire that resource’s lock. Pollsters contend for the locks, and whoever gets the lock polls the resource.
The downside of this approach is the overhead of distributed locking. Acquiring a distributed lock incurs a cost (time, network traffic), and when locks are used to arbitrate access to resources, this cost is incurred per resource. With group membership, by contrast, the coordination cost is incurred only when a member joins or leaves the group, which happens far less often than resources are polled.
If deployers want to use multiple Central Agents, they will need to deploy one of the tooz backends (ZooKeeper, memcached, possibly just an AMQP broker soon).
The implementation should be tested with unit tests.
The operator’s manual should explain the process and properties of running multiple Central Agents.