Neutron agents spawn external detached processes that run unmonitored; if anything happens to those processes, Neutron takes no action, failing to provide those services reliably.
We propose monitoring those processes and taking a configurable action when one dies, making Neutron more resilient to external failures.
When a ns-metadata-proxy dies inside an l3-agent, the subnets served by that ns-metadata-proxy will have no metadata until something changes on the router, which triggers a recheck of the metadata proxy's liveness.
The same happens with the dhcp-agent, and also in the lbaas and vpnaas agents.
This is a long-known bug, generally triggered by bugs in dnsmasq or the ns-metadata-proxy, and it is especially critical on big clouds and HA environments.
I propose to monitor the spawned processes using the neutron.agent.linux.external_process.ProcessMonitor class, which relies on the ProcessManager to check liveness periodically.
If a process that should be active is not, this will be logged, and any of the following admin-configured actions can be taken, in the order specified in the configuration.
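As a rough illustration, the periodic check could look like the following sketch. The helper names and the ProcessManager attributes used here are assumptions for illustration, not the actual Neutron API:

    import logging
    import os

    LOG = logging.getLogger(__name__)


    def pid_is_alive(pid):
        # The liveness test boils down to looking at /proc,
        # as benchmarked later in this spec.
        return pid is not None and os.path.exists('/proc/%d/cmdline' % pid)


    def check_child_processes(process_managers, action='respawn'):
        for pm in process_managers:  # one ProcessManager per spawned child
            if pid_is_alive(pm.pid):
                continue
            LOG.error('child process %s is not active', pm.uuid)
            if action == 'respawn':
                pm.enable()          # restart the dead child
            elif action == 'exit':
                raise SystemExit(1)  # let the init system restart the agent
            # a 'notify' action is planned as a follow-up (see below)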
In future follow-ups, we plan to implement a notify action to the process manager once the corresponding piece lands in Oslo.
Examples of configurations could be (a period of 0 disables the check):

    check_child_processes_period = 0

    check_child_processes_action = respawn
    check_child_processes_period = 60

    check_child_processes_action = notify
    check_child_processes_period = 60

    check_child_processes_action = exit
    check_child_processes_period = 60
The check will be disabled by default; when enabled, the suggested period is 60 seconds, and the default action will be 'respawn'.
Some extra periodic load will be added by checking the underlying children. Blocking of other green threads will be minimized by running the checks in a green thread pool. A semaphore is introduced to prevent several check cycles from starting concurrently.
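A simplified sketch of that concurrency handling (the names here are illustrative, not the actual implementation):

    from eventlet import greenpool, semaphore

    _pool = greenpool.GreenPool(size=50)
    _cycle_sem = semaphore.Semaphore(1)


    def run_check_cycle(process_managers, check_one):
        # If the previous cycle is still running, skip this one instead
        # of letting concurrent cycles pile up.
        if not _cycle_sem.acquire(blocking=False):
            return
        try:
            for pm in process_managers:
                _pool.spawn_n(check_one, pm)  # don't block the agent's main loop
            _pool.waitall()
        finally:
            _cycle_sem.release()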
As there were concerns about polling /proc/$pid/cmdline, I implemented a simplistic benchmark:
    i = 10000000
    while i > 0:
        f = open('/proc/8125/cmdline', 'r')
        f.readlines()
        i = i - 1
Please note that the cmdline file is served by kernel functions directly from memory and does not involve any I/O to a block device; that means there is no cache speeding up reads of this file that would otherwise invalidate this benchmark.
    root@ns316109:~# time python test.py

    real    0m59.836s
    user    0m23.681s
    sys     0m35.679s
That is roughly 170,000 reads/s (10,000,000 reads in ~60 s), using one core at 100% CPU on a 7400-bogomips machine.
If we had to check 1000 child processes, we would need 1000 / 170,000 ≈ 0.006 seconds, plus the overhead of the intermediate method calls and the spawning of green threads.
I believe ~6 ms of CPU time to check 1000 children is acceptable; in any case the check period is tunable, and the check is disabled by default, letting deployers balance the performance impact against the failure detection latency.
Polling isn't ideal, but neither are the alternatives, and we need a solution to this problem, especially for HA environments.
No effect on IPv6 expected here.
People implementing their own external monitoring of the subprocesses may need to migrate to the new solution, taking advantage of the exit action, or of a later notify one when that becomes available.
Developers who spawn external processes may start using ProcessMonitor instead of using ProcessManager directly.
This change has been discussed several times on the mailing list and IRC, and was previously accepted for Juno, but did not land before the deadline. It is something the community wants, as it makes Neutron agents more resilient to external failures.
Adding brian-haley, as I'm taking a few of his ideas and partly reusing his work on the l3 agent (see the references).
Notes: a notify action was planned, but it depends on a new Oslo feature (the service status notification spec listed in the references); that action can be added later via the bug process once the Oslo feature is accepted and implemented. All the other features/actions can be accomplished without it.
Tempest tests are not capable of arbitrary command execution on the network nodes (killing processes, for example), so we can't use Tempest to check this without implementing some sort of fault injection in Tempest.
Functional testing is used to verify the ProcessMonitor class, which implements the core functionality of this spec.
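As a self-contained sketch of what such a functional test can assert (this models the respawn action inline rather than importing the real ProcessMonitor):

    import subprocess
    import unittest


    def respawn_if_dead(proc, cmd):
        # The 'respawn' action: return the child if alive, else start a new one.
        if proc.poll() is None:
            return proc
        return subprocess.Popen(cmd)


    class TestRespawnAction(unittest.TestCase):
        def test_killed_child_is_respawned(self):
            cmd = ['sleep', '1000']
            proc = new_proc = subprocess.Popen(cmd)
            try:
                proc.kill()
                proc.wait()
                new_proc = respawn_if_dead(proc, cmd)
                self.assertNotEqual(proc.pid, new_proc.pid)
                self.assertIsNone(new_proc.poll())  # replacement is running
            finally:
                for p in {proc, new_proc}:
                    if p.poll() is None:
                        p.kill()


    if __name__ == '__main__':
        unittest.main()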
The new configuration options will have to be documented per agent.
These are the proposed defaults:

    check_child_processes_action = respawn
    check_child_processes_period = 0
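For illustration, a deployer enabling the check might set something like the following in an agent's configuration file (the [AGENT] section name is an assumption here; the exact option group will be defined per agent):

    [AGENT]
    check_child_processes_action = respawn
    check_child_processes_period = 60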
- DHCP agent implementation: https://review.openstack.org/#/c/115935/
- L3 agent implementation: https://review.openstack.org/#/c/114931/
- DHCP agent dying children bug: https://bugs.launchpad.net/neutron/+bug/1257524
- L3 agent dying children bug: https://bugs.launchpad.net/neutron/+bug/1257775
- Brian Haley's implementation for the l3 agent: https://review.openstack.org/#/c/59997/
- Oslo service manager status notification spec: http://docs-draft.openstack.org/48/97748/3/check/gate-oslo-specs-docs/ef96358/doc/build/html/specs/juno/service-status-interface.html
- Oslo spec review: https://review.openstack.org/#/c/97748/
- Old agent service status blueprint: https://blueprints.launchpad.net/neutron/+spec/agent-service-status