Pausing Charms with subordinate hacluster without sending false alerts¶
Overall, the goal is to raise “warning” alerts instead of “critical” ones, helping a human operator understand that not all services are completely healthy while reducing the criticality during an ongoing operation. NRPE checks will be reconfigured once the services under maintenance are set back to normal (resume).
The following logic will be applied when pausing/resuming a unit:
Pausing a principal unit pauses the subordinate hacluster;
Resuming a principal unit resumes the subordinate hacluster;
Pausing a hacluster unit pauses the principal unit;
Resuming a hacluster unit resumes the principal unit.
Problem Description¶
We need to stop sending false alerts when the hacluster subordinate of an OpenStack charm unit is paused, or when the principal unit itself is paused for maintenance. This should help operators receive more meaningful alerts.
There are several charms that use hacluster and NRPE that may benefit from this:
charm-ceilometer
charm-ceph-radosgw
charm-designate
charm-keystone
charm-neutron-api
charm-nova-cloud-controller
charm-openstack-dashboard
charm-cinder
charm-glance
charm-heat
charm-swift-proxy
Pausing Principal Unit¶
If, e.g., 3 keystone units (keystone/0, keystone/1 and keystone/2) are deployed and keystone/0 is paused:
1) the haproxy_servers check on the other units (keystone/1 and keystone/2) will alert, because the apache2 service on keystone/0 is down;
2) the haproxy, apache2.service and memcached.service checks on keystone/0 will also alert;
3) corosync and pacemaker may have placed the VIP on the paused unit, at which point the service will fail because haproxy is disabled, so the hacluster subordinate unit should also be paused.
Note: the services affected by pausing a principal unit may vary depending on the principal charm.
Pausing hacluster unit¶
Pausing hacluster puts the cluster node, e.g. keystone, in standby mode. A standby node has its resources stopped (hacluster, apache2), which fires false alerts. To solve this issue, the hacluster units should inform the keystone unit that they are paused. One way of doing this is through the ha relation.
Proposed Change¶
Pausing Principal Unit¶
The pause action on a principal unit should share the event with its peers so that their behaviour can be adjusted (until the resume action is triggered). It should also share the status (paused/resumed) with the subordinate unit so it can catch up to the same status.
File actions.py in the principal unit:
def pause(args):
    pause_unit_helper(register_configs())
    # Logic added to share the event with peers
    inform_peers_if_ready(check_api_unit_ready)
    if is_nrpe_joined():
        update_nrpe_config()
    # logic added to inform hacluster subordinate unit has been paused
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=True)


def resume(args):
    resume_unit_helper(register_configs())
    # Logic added to share the event with peers
    inform_peers_if_ready(check_api_unit_ready)
    if is_nrpe_joined():
        update_nrpe_config()
    # logic added to inform hacluster subordinate unit has been resumed
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=False)
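With these changes the operator workflow stays the same; for example (Juju 2.x CLI):

juju run-action keystone/0 pause --wait
juju run-action keystone/0 resume --wait

The pause action additionally propagates the paused flag over the ha relation and the NOTREADY state over the cluster relation, as described below.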
After pausing a principal unit, the unit-state-{unit_name} field will change to NOTREADY. E.g.:
juju show-unit keystone/0 --endpoint cluster
keystone/0:
  workload-version: 17.0.0
  machine: "1"
  opened-ports:
  - 5000/tcp
  public-address: 10.5.2.64
  charm: cs:~openstack-charmers-next/keystone-562
  leader: true
  relation-info:
  - endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        admin-address: 10.5.2.64
        egress-subnets: 10.5.2.64/32
        ingress-address: 10.5.2.64
        internal-address: 10.5.2.64
        private-address: 10.5.2.64
        public-address: 10.5.2.64
        unit-state-keystone-0: NOTREADY
Note: the unit-state-{unit_name} field is already implemented; the proposal is simply to reuse it, setting the value to NOTREADY when a unit is paused and back to READY when it is resumed.
With every unit knowing which peers are paused, the check_haproxy.sh script can be changed to accept a flag identifying the keystone units that are paused. The current Bash script cannot receive flags.
check_haproxy.sh could be rewritten from Bash to Python so that it accepts a flag marking specific hostnames as under maintenance (e.g. check_haproxy.py --warning keystone-0), as sketched below.
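The sketch below shows one possible shape for such a script. It is illustrative only: the stats URL and the absence of authentication are assumptions (the existing Bash check derives the stats address and credentials from the local haproxy configuration), and server names are assumed to match the unit hostnames (e.g. keystone-0).

#!/usr/bin/env python3
# check_haproxy.py - minimal sketch, not the final implementation.
import argparse
import csv
import sys
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATS_URL = 'http://localhost:8888/;csv'  # assumption, see note above


def main():
    parser = argparse.ArgumentParser(description='Check haproxy backends')
    parser.add_argument(
        '--warning', default='',
        help='comma-separated hostnames (e.g. keystone-0) under maintenance; '
             'their DOWN servers raise WARNING instead of CRITICAL')
    args = parser.parse_args()
    maintenance = {h for h in args.warning.split(',') if h}

    try:
        with urllib.request.urlopen(STATS_URL, timeout=10) as resp:
            lines = resp.read().decode().splitlines()
    except Exception as exc:
        print('UNKNOWN: cannot read haproxy stats: {}'.format(exc))
        return UNKNOWN
    if not lines:
        print('UNKNOWN: empty response from haproxy stats')
        return UNKNOWN

    # The first line is the CSV header, prefixed with '# ' by haproxy.
    lines[0] = lines[0].lstrip('# ')
    criticals, warnings = [], []
    for row in csv.DictReader(lines):
        svname = row.get('svname', '')
        status = row.get('status', '')
        if (svname in ('FRONTEND', 'BACKEND') or
                status.startswith('UP') or status == 'no check'):
            continue
        msg = '{}/{} is {}'.format(row.get('pxname'), svname, status)
        if svname in maintenance:
            warnings.append(msg + ' (unit under maintenance)')
        else:
            criticals.append(msg)

    if criticals:
        print('CRITICAL: {}'.format('; '.join(criticals + warnings)))
        return CRITICAL
    if warnings:
        print('WARNING: {}'.format('; '.join(warnings)))
        return WARNING
    print('OK: all haproxy servers are up')
    return OK


if __name__ == '__main__':
    sys.exit(main())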
The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to first check whether any unit in the cluster is paused and then add the warning flag if necessary:
def add_haproxy_checks(nrpe, unit_name):
    """
    Add checks for each service in list
    :param NRPE nrpe: NRPE object to add check to
    :param str unit_name: Unit name to use in check description
    """
    cmd = "check_haproxy.py"
    peers_states = get_peers_unit_state()
    units_not_ready = [
        unit.replace('/', '-')
        for unit, state in peers_states.items()
        if state == UNIT_NOTREADY
    ]
    if is_unit_paused_set():
        units_not_ready.append(local_unit().replace('/', '-'))
    if units_not_ready:
        cmd += " --warning {}".format(','.join(units_not_ready))
    nrpe.add_check(
        shortname='haproxy_servers',
        description='Check HAProxy {%s}' % unit_name,
        check_cmd=cmd)
    nrpe.add_check(
        shortname='haproxy_queue',
        description='Check HAProxy queue depth {%s}' % unit_name,
        check_cmd='check_haproxy_queue_depth.sh')
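For reference, a simplified sketch of the peer-state helper the example relies on is shown below. charmhelpers already ships equivalents (get_peers_unit_state, UNIT_NOTREADY) in charmhelpers.contrib.openstack.utils; this version only illustrates the assumption that each peer publishes its state as unit-state-<unit-name> on the cluster relation, as in the juju show-unit output above.

# Illustrative only; see charmhelpers.contrib.openstack.utils for the
# real helpers.
from charmhelpers.core.hookenv import (
    related_units,
    relation_get,
    relation_ids,
)

UNIT_READY = 'READY'
UNIT_NOTREADY = 'NOTREADY'


def get_peers_unit_state(relation='cluster'):
    """Map each peer unit name to its published state, READY if unset."""
    states = {}
    for r_id in relation_ids(relation):
        for unit in related_units(r_id):
            key = 'unit-state-{}'.format(unit.replace('/', '-'))
            state = relation_get(key, rid=r_id, unit=unit)
            states[unit] = state or UNIT_READY
    return states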
When a principal unit changes state, e.g. from READY to NOTREADY, the NRPE files on the other principal units in the cluster must be rewritten; otherwise they won't be able to warn that a unit is under maintenance.
File responsible for hooks in the classic charms:
@hooks.hook('cluster-relation-changed')
@restart_on_change(restart_map(), stopstart=True)
def cluster_changed():
    # logic added to update nrpe_config in all principal units when
    # a status is changed
    update_nrpe_config()
Note: in reactive charms this might be slightly different because handlers are used, but the main idea is to call update_nrpe_config every time the cluster configuration changes. This will prevent false alerts on the other units in the cluster. A possible shape for such a handler is sketched below.
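A minimal sketch for a reactive charm, assuming a charms.reactive Endpoint named cluster for the peer relation and an update_nrpe_config helper in the charm (both names are assumptions here):

from charms.reactive import when, clear_flag


@when('endpoint.cluster.changed')
def update_nrpe_on_cluster_change():
    # Re-render the NRPE checks whenever peer (cluster) data changes,
    # so paused/resumed peers are reflected in the local checks.
    update_nrpe_config()
    clear_flag('endpoint.cluster.changed')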
Services from Principal Unit¶
Removing the .cfg files for those services from /etc/nagios/nrpe.d when the unit is paused would stop the critical alerts. The downside of this approach is that Nagios won't show user-friendly messages saying that the specific services (apache2, memcached, etc.) are under maintenance; on the other hand, it's simpler to achieve.
File responsible for hooks in a classic charm:
@hooks.hook('nrpe-external-master-relation-joined',
            'nrpe-external-master-relation-changed')
def update_nrpe_config():
    # logic before change
    # ...
    nrpe_setup = nrpe.NRPE(hostname=hostname)
    nrpe.copy_nrpe_checks()
    # added logic to remove services
    if is_unit_paused_set():
        nrpe.remove_init_service_checks(
            nrpe_setup,
            _services,
            current_unit
        )
    else:
        nrpe.add_init_service_checks(
            nrpe_setup,
            _services,
            current_unit
        )
    # end of added logic
    nrpe.add_haproxy_checks(nrpe_setup, current_unit)
    nrpe_setup.write()
The new logic to remove those services is presented below.
File charmhelpers/contrib/charmsupport/nrpe.py:
# added logic to remove apache2, memcached, etc.
def remove_init_service_checks(nrpe, services, unit_name):
    for svc in services:
        if host.init_is_systemd(service_name=svc):
            nrpe.remove_check(
                shortname=svc,
                description='process check {%s}' % unit_name,
                check_cmd='check_systemd.py %s' % svc
            )
The status of the services will disappear from Nagios after a few minutes. When the resume action is used, the services are restored, initially as PENDING, and the checks complete after a few minutes.
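For illustration only (file names assume the check_<shortname>.cfg convention used by the charmhelpers NRPE helper, and checks managed by the nrpe charm itself are omitted), while keystone/0 is paused its nrpe.d directory would keep only the haproxy checks:

ls /etc/nagios/nrpe.d/
check_haproxy_queue.cfg  check_haproxy_servers.cfg

The per-service files (check_apache2.cfg, check_memcached.cfg, etc.) reappear once the unit is resumed.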
Pausing hacluster unit¶
File actions.py in charm-hacluster:
def pause(args):
    """Pause the hacluster services.

    @raises Exception should the service fail to stop.
    """
    pause_unit()
    # logic added to inform keystone that unit has been paused
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=True)


def resume(args):
    """Resume the hacluster services.

    @raises Exception should the service fail to start.
    """
    resume_unit()
    # logic added to inform keystone that unit has been resumed
    relid = relation_ids('ha')
    for r_id in relid:
        relation_set(relation_id=r_id, paused=False)
Pausing a hacluster unit would result in sharing a new paused variable that can be used by the principal units.
File responsible for hooks in a classic charm:
@hooks.hook('ha-relation-changed')
@restart_on_change(restart_map(), restart_functions=restart_function_map())
def ha_changed():
    # Added logic to pause keystone unit when hacluster is paused
    for rid in relation_ids('ha'):
        for unit in related_units(rid):
            paused = relation_get('paused', rid=rid, unit=unit)
            clustered = relation_get('clustered', rid=rid, unit=unit)
            if clustered and is_db_ready():
                if paused == 'True':
                    pause_unit_helper(register_configs())
                elif paused == 'False':
                    resume_unit_helper(register_configs())
                update_nrpe_config()
                inform_peers_if_ready(check_api_unit_ready)
        # inform subordinate unit that it is paused or resumed
        relation_set(relation_id=rid, paused=is_unit_paused_set())
Informing peers and updating the NRPE config is enough to trigger the logic that removes the service checks.
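As a usage example (Juju 2.x CLI), pausing the subordinate now cascades to the principal, and resuming restores both:

juju run-action hacluster/0 pause --wait
juju status keystone    # keystone/0 now reports the paused/maintenance state
juju run-action hacluster/0 resume --wait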
In a situation where the principal unit is paused, hacluster should also be paused. For this to happen, the ha-relation-changed hook of charm-hacluster can be used:
@hooks.hook('ha-relation-joined',
            'ha-relation-changed',
            'peer-availability-relation-joined',
            'peer-availability-relation-changed',
            'pacemaker-remote-relation-changed')
def ha_relation_changed():
    # Inserted logic
    # pause if the principal unit is paused
    paused = relation_get('paused')
    if paused == 'True':
        pause_unit()
    elif paused == 'False':
        resume_unit()
    # share the subordinate unit status
    for rel_id in relation_ids('ha'):
        relation_set(
            relation_id=rel_id,
            clustered="yes",
            paused=is_unit_paused_set()
        )
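Two caveats for this sketch: relation data is exchanged as strings, which is why paused is compared against string values rather than booleans; and because these hooks fire repeatedly, pause_unit()/resume_unit() should be guarded (e.g. by checking is_unit_paused_set() first) so the handler stays idempotent.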
Alternatives¶
One alternative for the principal unit's service checks is to change systemd.py in charm-nrpe to accept a -w flag, like the proposal for check_haproxy.py.
This way it would not be necessary to remove the .cfg files for the principal unit's services, but the add_init_service_checks function would need to be adapted so it can register services with the warning flag, as sketched below.
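A minimal sketch of that adaptation, assuming the systemd check grows a -w flag that downgrades CRITICAL to WARNING (the flag is part of this alternative, not an existing option; only the systemd branch of the real function is shown):

def add_init_service_checks(nrpe, services, unit_name):
    for svc in services:
        if host.init_is_systemd(service_name=svc):
            check_cmd = 'check_systemd.py %s' % svc
            if is_unit_paused_set():
                # Unit under maintenance: report WARNING instead of CRITICAL
                check_cmd += ' -w'
            nrpe.add_check(
                shortname=svc,
                description='process check {%s}' % unit_name,
                check_cmd=check_cmd
            )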
Implementation¶
Assignee(s)¶
- Primary assignee:
  gabrielcocenza
Gerrit Topic¶
Use Gerrit topic “pausing-charms-hacluster-no-false-alerts” for all patches related to this spec.
git-review -t pausing-charms-hacluster-no-false-alerts
Work Items¶
charmhelpers
  nrpe.py
  check_haproxy.py
charm-ceilometer
charm-ceph-radosgw
charm-designate
charm-keystone
charm-neutron-api
charm-nova-cloud-controller
charm-openstack-dashboard
charm-cinder
charm-glance
charm-heat
charm-swift-proxy
charm-nrpe (Alternative)
  systemd.py
charm-hacluster
  actions.py
Repositories¶
No new git repository is required.
Documentation¶
It will be necessary to document the impact of pausing/resuming a subordinate hacluster unit and the side effects on OpenStack API charms.
Security¶
No additional security concerns.
Testing¶
Code changes will be covered by unit and functional tests. Functional tests will use a bundle with keystone, hacluster, nrpe and nagios.
Dependencies¶
None