https://blueprints.launchpad.net/fuel/+spec/neutron-agents-local-reports
Provide additional hints to local monitoring
Currently Pacemaker OCF scripts make desicion about liveleness of neutron agents based on the agents processes status. In some cases it makes sense to detect other cases like an agent’s inability to communicate with neutron server over message queue. In this case current OCF scripts can’t detect the failure and restart an agent.
So in order to provide additional hints to monitoring scripts ‘local reports’ are introduced. Local report is a piece of information written to a local file by neutron agent about status of certain kind of operations.
Proposed change affects L3 and DHCP agents. Agents will write status of certain operations to a local file. The list of operations include: * startup * state reports status * network (for DHCP agent) or router (for L3 agent) synchronization with neutron server
File will reside in the path, configured in neutron.conf as ‘state path’
The format of those report will include: * pid of the agent * time stamp of the status report * name of operation * status itself
Example (dhcp agent local report, note that formatting is for ease of spec readers):
Currently preferred format is json. The reason for it is that json simplifies updating local reports file by reading-updating-writing its contents. This is needed so contents is not flushed/rewritten after agent restart.
Main goal of local reports is to help with detecting different issues that are not “visible” to methods like pid monitoring.
For instance, ‘since’ field is used to estimate time which agent was in inconsistent state (out of sync with neutron-server) and restart the agent.
The fuel-library part of this feature consists of changes to the OCF scripts which will parse the local state reports files and make decision whether to restart Neutron agents or not according to timeout settings (if agent seems to be dead for a short period of time it isn’t considered as dead) and reported statuses.
This feature will be disabled by default, because currently in most cases, Neutron can handle the explained problems itself, whilst unnecessary restarts initiated by Pacemaker will slow down work of Neutron agents under high load.
None
None
None
None
None
None
None
None
None
The change is not going to be submitted to upstream, so it must be kept in a local repository for each release.
- state reports for L3 and DHCP agents
- additional code for ocf monitoring that analyses local reports
None
Testing should involve artificial interruption of rpc communication between agents and neutron server to test that ocf scripts can detect that through local reports file.
None