Pausing Charms with subordinate hacluster without sending false alerts

Overall, the goal is to raise “warning” alerts instead of “critical” ones, helping a human operator understand that not all services are completely healthy while reducing the criticality during an ongoing maintenance operation. NRPE checks will be reconfigured back to normal once the services under maintenance are resumed.

The following logic will be applied when pausing/resuming a unit:

  • Pausing a principal unit pauses the subordinate hacluster;

  • Resuming a principal unit resumes the subordinate hacluster;

  • Pausing a hacluster unit pauses the principal unit;

  • Resuming a hacluster unit resumes the principal unit.

Problem Description

We need to stop sending false alerts when the hacluster subordinate of an OpenStack charm unit is paused, or when the principal unit itself is paused for maintenance. This should help operators receive more meaningful alerts.

There are several charms that use hacluster and NRPE that may benefit from this:

  • charm-ceilometer

  • charm-ceph-radosgw

  • charm-designate

  • charm-keystone

  • charm-neutron-api

  • charm-nova-cloud-controller

  • charm-openstack-dashboard

  • charm-cinder

  • charm-glance

  • charm-heat

  • charm-swift-proxy

Pausing Principal Unit

If, for example, three keystone units (keystone/0, keystone/1 and keystone/2) are deployed and keystone/0 is paused:

1) the haproxy_servers check on the other units (keystone/1 and keystone/2) will alert, because the apache2 service on keystone/0 is down;

2) the haproxy, apache2.service and memcached.service checks on keystone/0 will also alert;

3) it is possible that corosync and pacemaker have placed the VIP on keystone/0, at which point the service will fail because haproxy is disabled, so the hacluster subordinate unit should also be paused.

Note: the services affected when pausing a principal unit may vary depending on the principal charm.

Pausing hacluster unit

Pausing hacluster sets the cluster node, e.g. keystone, into standby mode. A standby node will have its resources stopped (haproxy, apache2), which will fire false alerts. To solve this issue, the hacluster units should inform the keystone unit that they are paused. One way of doing this is through the ha relation.

Proposed Change

Pausing Principal Unit

The pause action on a principal unit should share the event with its peers so they can modify their behaviour (until the resume action is triggered). It should also share the status (paused/resumed) with the subordinate unit so that it can catch up to the same status.

File actions.py in the principal unit:

def pause(args):
    pause_unit_helper(register_configs())

    # Logic added to share the event with peers
    inform_peers_if_ready(check_api_unit_ready)
    if is_nrpe_joined():
        update_nrpe_config()

    # Logic added to inform the hacluster subordinate that the unit has
    # been paused
    for r_id in relation_ids('ha'):
        relation_set(relation_id=r_id, paused=True)


def resume(args):
    resume_unit_helper(register_configs())

    # Logic added to share the event with peers
    inform_peers_if_ready(check_api_unit_ready)
    if is_nrpe_joined():
        update_nrpe_config()

    # Logic added to inform the hacluster subordinate that the unit has
    # been resumed
    for r_id in relation_ids('ha'):
        relation_set(relation_id=r_id, paused=False)

After being paused, a principal unit will change its unit-state-{unit_name} field to NOTREADY, e.g.:

juju show-unit keystone/0 --endpoint cluster
keystone/0:
  workload-version: 17.0.0
  machine: "1"
  opened-ports:
  - 5000/tcp
  public-address: 10.5.2.64
  charm: cs:~openstack-charmers-next/keystone-562
  leader: true
  relation-info:
  - endpoint: cluster
    related-endpoint: cluster
    application-data: {}
    local-unit:
      in-scope: true
      data:
        admin-address: 10.5.2.64
        egress-subnets: 10.5.2.64/32
        ingress-address: 10.5.2.64
        internal-address: 10.5.2.64
        private-address: 10.5.2.64
        public-address: 10.5.2.64
        unit-state-keystone-0: NOTREADY

Note: the unit-state-{unit_name} field is already implemented; the proposal is simply to reuse it, setting the value to NOTREADY when a unit is paused and back to READY when it is resumed.

With every unit knowing which peers are paused, the check_haproxy.sh script can be changed to accept a flag identifying the keystone units that are paused. The current Bash script is not able to receive flags.

check_haproxy.sh could be rewritten from Bash to Python so it can accept a flag indicating that a specific hostname is under maintenance (e.g. check_haproxy.py --warning keystone-0).
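A minimal sketch of what such a check_haproxy.py could look like is shown below. Only the --warning flag semantics are part of this proposal; the stats endpoint URL, the lack of authentication handling and the CSV field parsing are assumptions based on the behaviour of the current check_haproxy.sh, not a final implementation.

#!/usr/bin/env python3
# Sketch only: the stats URL and CSV field names are assumptions based
# on the existing check_haproxy.sh, not a final implementation.
import argparse
import csv
import sys
import urllib.request

NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL = 0, 1, 2
STATS_URL = 'http://127.0.0.1:8888/;csv'  # assumed haproxy stats endpoint


def main():
    parser = argparse.ArgumentParser(description='Check HAProxy backends')
    parser.add_argument('--warning', default='',
                        help='comma-separated backend names (e.g. keystone-0) '
                             'that should raise WARNING instead of CRITICAL')
    args = parser.parse_args()
    maintenance = set(filter(None, args.warning.split(',')))

    # HAProxy's CSV stats start with a '# ' prefixed header line.
    raw = urllib.request.urlopen(STATS_URL).read().decode()
    rows = csv.DictReader(raw.lstrip('# ').splitlines())
    down = [row['svname'] for row in rows
            if row['svname'] not in ('FRONTEND', 'BACKEND')
            and not row['status'].startswith('UP')]

    critical = [s for s in down if s not in maintenance]
    warning = [s for s in down if s in maintenance]
    if critical:
        print('CRITICAL: servers down: {}'.format(', '.join(critical)))
        return NAGIOS_CRITICAL
    if warning:
        print('WARNING: servers under maintenance: {}'.format(
            ', '.join(warning)))
        return NAGIOS_WARNING
    print('OK: all haproxy servers are up')
    return NAGIOS_OK


if __name__ == '__main__':
    sys.exit(main())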

The file nrpe.py in charmhelpers/contrib/charmsupport should be changed to first check whether there is any paused unit in the cluster and then add the warning flag if necessary:

def add_haproxy_checks(nrpe, unit_name):
    """
    Add checks for each service in list

    :param NRPE nrpe: NRPE object to add check to
    :param str unit_name: Unit name to use in check description
    """
    cmd = "check_haproxy.py"

    peers_states = get_peers_unit_state()
    units_not_ready = [
        unit.replace('/', '-')
        for unit, state in peers_states.items()
        if state == UNIT_NOTREADY
    ]

    if is_unit_paused_set():
        units_not_ready.append(local_unit().replace('/', '-'))

    if units_not_ready:
        cmd += " --warning {}".format(','.join(units_not_ready))

    nrpe.add_check(
        shortname='haproxy_servers',
        description='Check HAProxy {%s}' % unit_name,
        check_cmd=cmd)
    nrpe.add_check(
        shortname='haproxy_queue',
        description='Check HAProxy queue depth {%s}' % unit_name,
        check_cmd='check_haproxy_queue_depth.sh')

When a principal unit changes its state, e.g. from READY to NOTREADY, it is necessary to rewrite the NRPE files on the other principal units in the cluster; otherwise, they will not be able to warn that a unit is under maintenance.

File responsible for hooks in the classic charms:

@hooks.hook('cluster-relation-changed')
@restart_on_change(restart_map(), stopstart=True)
def cluster_changed():
    # logic added to update nrpe_config in all principal units when
    # a status is changed
    update_nrpe_config()

Note: in reactive charms this will be slightly different, using handlers, but the main idea is the same: call update_nrpe_config every time the cluster relation data changes. This prevents false alerts on the other units in the cluster.
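For illustration, a handler in a reactive charm could look roughly like the sketch below. The flag names and the update_nrpe_config helper are assumptions; the exact flags depend on the interface layers used by each charm.

# Sketch for a reactive charm; the flag names are illustrative and
# update_nrpe_config is assumed to exist in the charm's own code.
from charms.reactive import when, clear_flag


@when('endpoint.cluster.changed', 'nrpe-external-master.available')
def update_nrpe_on_cluster_change():
    # Rewrite the NRPE check definitions whenever peer data (such as
    # unit-state-{unit_name}) changes, so paused peers raise warnings
    # instead of critical alerts.
    update_nrpe_config()
    clear_flag('endpoint.cluster.changed')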

Services from Principal Unit

Removing the .cfg files for those services from /etc/nagios/nrpe.d while the unit is paused would stop the critical alerts. The downside of this approach is that Nagios will not show a user-friendly message saying that the specific services (apache2, memcached, etc.) are under maintenance; on the other hand, it is simpler to achieve.

File responsible for hooks in a classic charm:

@hooks.hook('nrpe-external-master-relation-joined',
            'nrpe-external-master-relation-changed')
def update_nrpe_config():
    # logic before change
    # ...

    nrpe_setup = nrpe.NRPE(hostname=hostname)
    nrpe.copy_nrpe_checks()

    # Added logic to remove the service checks while the unit is paused
    if is_unit_paused_set():
        nrpe.remove_init_service_checks(
            nrpe_setup,
            _services,
            current_unit
        )
    else:
        nrpe.add_init_service_checks(
            nrpe_setup,
            _services,
            current_unit
        )
    # End of added logic

    nrpe.add_haproxy_checks(nrpe_setup, current_unit)
    nrpe_setup.write()

The new logic to remove those services is presented below.

File charmhelpers/contrib/charmsupport/nrpe.py:

# Added logic to remove the service checks (apache2, memcached, etc.)
def remove_init_service_checks(nrpe, services, unit_name):
    """Remove the NRPE check for each systemd service in the list."""
    for svc in services:
        if host.init_is_systemd(service_name=svc):
            nrpe.remove_check(
                shortname=svc,
                description='process check {%s}' % unit_name,
                check_cmd='check_systemd.py %s' % svc
            )

The status of those services will disappear from Nagios after a few minutes. When the resume action is run, the checks are restored, initially as PENDING, and after a few minutes they are executed again.

Pausing hacluster unit

File actions.py in charm-hacluster:

def pause(args):
    """Pause the hacluster services.

    @raises Exception should the service fail to stop.
    """
    pause_unit()
    # Logic added to inform keystone that the unit has been paused
    for r_id in relation_ids('ha'):
        relation_set(relation_id=r_id, paused=True)


def resume(args):
    """Resume the hacluster services.

    @raises Exception should the service fail to start.
    """
    resume_unit()
    # Logic added to inform keystone that the unit has been resumed
    for r_id in relation_ids('ha'):
        relation_set(relation_id=r_id, paused=False)

Pausing a hacluster unit results in sharing a new relation variable, paused, that can be used by the principal units.

File responsible for hooks in a classic charm:

@hooks.hook('ha-relation-changed')
@restart_on_change(restart_map(), restart_functions=restart_function_map())
def ha_changed():
    # Added logic to pause the keystone unit when hacluster is paused
    for rid in relation_ids('ha'):
        for unit in related_units(rid):
            paused = relation_get('paused', rid=rid, unit=unit)
            clustered = relation_get('clustered', rid=rid, unit=unit)
            if clustered and is_db_ready():
                if paused == 'True':
                    pause_unit_helper(register_configs())
                elif paused == 'False':
                    resume_unit_helper(register_configs())

                update_nrpe_config()
                inform_peers_if_ready(check_api_unit_ready)
                # Inform the subordinate unit whether this unit is
                # paused or resumed
                relation_set(relation_id=rid, paused=is_unit_paused_set())

Informing the peers and updating the NRPE config is enough to trigger the logic needed to remove the service checks.

When the principal unit is paused, hacluster should also be paused. For this to happen, the ha-relation-changed hook in charm-hacluster can be used:

@hooks.hook('ha-relation-joined',
            'ha-relation-changed',
            'peer-availability-relation-joined',
            'peer-availability-relation-changed',
            'pacemaker-remote-relation-changed')
def ha_relation_changed():
    # Inserted logic
    # pauses if the principal unit is paused
    paused = relation_get('paused')
    if paused == 'True':
        pause_unit()
    elif paused == 'False':
        resume_unit()

    # Share the subordinate unit's status with the principal
    for rel_id in relation_ids('ha'):
        relation_set(
            relation_id=rel_id,
            clustered="yes",
            paused=is_unit_paused_set()
        )

Alternatives

One alternative for the principal unit's service checks is to change systemd.py in charm-nrpe to accept a -w flag, similar to the proposal for check_haproxy.py.

This way it would not be necessary to remove the .cfg files for the principal unit's services, but it would be necessary to adapt the add_init_service_checks function so that it can register services with the warning flag, as sketched below.
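A rough sketch of how add_init_service_checks in charmhelpers/contrib/charmsupport/nrpe.py could be adapted is shown below. The -w flag on check_systemd.py does not exist yet and is part of this alternative; the warn_only parameter name is illustrative, and the sketch is simplified compared with the real helper.

# Sketch only: mirrors the existing add_init_service_checks helper; the
# warn_only parameter and the -w flag are part of this alternative and
# do not exist yet.
def add_init_service_checks(nrpe, services, unit_name, warn_only=False):
    """Add a check for each systemd service, optionally warning-only."""
    for svc in services:
        if host.init_is_systemd(service_name=svc):
            check_cmd = 'check_systemd.py %s' % svc
            if warn_only:
                # Report a stopped service as WARNING instead of
                # CRITICAL while the unit is paused for maintenance.
                check_cmd += ' -w'
            nrpe.add_check(
                shortname=svc,
                description='process check {%s}' % unit_name,
                check_cmd=check_cmd
            )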

Implementation

Assignee(s)

Primary assignee:

gabrielcocenza

Gerrit Topic

Use Gerrit topic “pausing-charms-hacluster-no-false-alerts” for all patches related to this spec.

git-review -t pausing-charms-hacluster-no-false-alerts

Work Items

  • charmhelpers

    • nrpe.py

    • check_haproxy.py

  • charm-ceilometer

  • charm-ceph-radosgw

  • charm-designate

  • charm-keystone

  • charm-neutron-api

  • charm-nova-cloud-controller

  • charm-openstack-dashboard

  • charm-cinder

  • charm-glance

  • charm-heat

  • charm-swift-proxy

  • charm-nrpe (Alternative)

    • systemd.py

  • charm-hacluster

    • actions.py

Repositories

No new git repository is required.

Documentation

It will be necessary to document the impact of pausing/resuming a subordinate hacluster and the side effects on OpenStack API charms.

Security

No additional security concerns.

Testing

Code changes will be covered by unit and functional tests. The functional tests will use a bundle with keystone, hacluster, nrpe and nagios.
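As an illustration, a functional test could pause a unit and assert that its NRPE check files were removed. The rough zaza sketch below makes several assumptions (the keystone/0 unit name, the check_apache2.cfg file name and the assertions) and is not the final test:

# Rough sketch of a zaza functional test; the check file name used in
# the assertions is an assumption.
import unittest

import zaza.model


class PausedUnitNrpeTest(unittest.TestCase):

    def test_pause_removes_service_checks(self):
        # Pause keystone/0 and verify that its NRPE service checks were
        # removed, so Nagios stops raising critical alerts for it.
        zaza.model.run_action('keystone/0', 'pause')
        result = zaza.model.run_on_unit('keystone/0',
                                        'ls /etc/nagios/nrpe.d/')
        self.assertNotIn('check_apache2.cfg', result.get('Stdout', ''))

        # Resume and verify the checks are restored.
        zaza.model.run_action('keystone/0', 'resume')
        result = zaza.model.run_on_unit('keystone/0',
                                        'ls /etc/nagios/nrpe.d/')
        self.assertIn('check_apache2.cfg', result.get('Stdout', ''))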

Dependencies

None