Event Alarm Timeout

https://blueprints.launchpad.net/aodh/+spec/event-alarm-timeout

This BP adds timeout mechanism for event-alarm. End users can specify a timeout, 0 (no timeout) by default, for each event-alarm. The alarm status becomes ‘TIMEOUT’ after timeout reached without receiving desired event.

Problem description

After event-alarm were introduced in Liberty, end users or operators could set alarm for desired event and get alarmed when it receive them. But in some circumstances, operator want to know otherwise: when desired event is not received.

For example, “compute.instance.create.end” is the final event sent to message bus to indicate success of instance creation. Not receiving it after a long time is a signal of creation failure, in which operator should be notified. Unfortunately, current event-alarm doesn’t support it.

Proposed change

When creating event-alarm, a new parameter ‘timeout’ is proposed to define a expiry time length, so that alarm gets fired when desired event is not received in expected time. Otherwise, alarm status becomes ‘TIMEOUT’.

Currently, 3 states are supported in alarm: ‘UNKNOWN’, ‘ALARM’ and ‘OK’, so a new state ‘TIMEOUT’ will be added to reflect timeout situation.

For timeout implementation, an ‘alarm.timeout.start’ notification is sent out by the AODH api process after the event-alarm is created. After receiving it, evaluator asks its timeout thread/process to handle timeout request. In this way, avoid new process in AODH api and make all alarm handling jobs inside evaluator.

Synchronization handling is critical in evaluator, as both evaluators original process and timeout process can change status for same alarm. To avoid complicated lock, timeout process just send a ‘alarm.timeout.end’ event with related alarm/project id to ‘alarm.all’ topic, where evaluator original process handle it along with desired event.

Each ‘alarm.timeout.*’ event should carry enough info in payload, including alarm_id, project_id, timeout, and desired event. These info could be used for evaluation and future partitioning infrastructure.

Each evaluator has only one timeout thread, which gets timeout requests from evaluator into a queue. Small footprint of timeout process is guaranteed, as it always sleeps unless one timeout happens. After wake up, it only does 2 things:

  • sends out ‘alarm.timeout.end’ event
  • pick up nearest timeout request and start sleeping for it

The final alarm status depends on the order of events. If ‘timeout.end’ event comes first, alarm status becomes ‘TIMEOUT’ and following desired event is ignored. Otherwise, alarm status becomes ‘ALARM’ and following ‘timeout.end’ event is ignored. In this way, synchronization is well handled with new ‘timeout.end’ event and simple logic in evaluator.

After adding timeout, event-alarm state transition changes as following:

  • EVENT - desired event arrive
  • TIMEOUT - timeout happen

Registered action is only triggered when:

  • transition between different states, like UNKNOWN => ALARM
  • receiving desired event again at ALARM state if repeat_actions is true
+-----------+            +---------+  EVENT
|           |            |         +---------+
|  UNKNOWN  +------------> TIMEOUT |         |
|           |  TIMEOUT   |         <---------+
+-----+-----+            +---------+
      |
      |
      |                  +---------+
      |                  |         +---------+
      +------------------>  ALARM  |         |
              EVENT      |         <---------+
                         +-+-----^-+  TIMEOUT
                           |     |
                           +-----+
                            EVENT

Also need changes in DB layer to get all the event-alarm with timeout. In following cases, we need restart timeout thread if the alarm has timeout, and previous timeout thread already exit:

  • reset the alarm state to ‘UNKNOWN’
  • enable the alarm via the ‘enabled’ attribute

Alternatives

There is another BP from Igor to illustrate a different high level design and usage model. Pls. check https://blueprints.launchpad.net/ceilometer/+spec/timeout-event-alarm-evaluator

It adds a new alarm type ‘event_timeout’ to track sequence of events, like: “compute.instance.create.start”, “compute.instance.create.end”. So this new alarm includes 2 events: start and end – only after receiving “start” event, timeout is created for “end” event. It is not as simple and flexible as this BP, which just adds timeout.

To handle the synchronization, an alternative is to lock critical section between evaluator timeout and original thread, so each of them need handle alarm data structure. But it’s tricky and bug prone.

Data model impact

None

REST API impact

API needs minor changes to add ‘timeout’ to ‘event_rule’. One simple event-alarm definition is as following:

{"alarm_actions": ["log://"],
 "ok_actions": ["log://"],
 "alarm_id": null,
 "enabled": true,
 "name": "alarm01",
 "repeat_actions": false,
 "state": "insufficient data",
 "event_rule": {"query": [{"field": "traits.name",
                           "type": "string",
                           "value": "cirros-0.3.4-x86_64-uec-ramdisk",
                           "op": "eq"}],
                "event_type": "image.update",
                "timeout": 10},
 "type": "event"}

Security impact

None

Pipeline impact

None

Other end user impact

End user need to know a new ‘timeout’ parameter when create event alarm.

Performance/Scalability Impacts

No obvious performance issue, because of small footprint of timeout thread. No obvious scalability issue, as timeout handling is done in evaluator who will support good partition.

Other deployer impact

None

Developer impact

None

Implementation

Assignee(s)

Primary assignee:
edwin-zhai

Work Items

  • Add new parameter ‘timeout’ for event-alarm creation in aodh-client
  • Add new alarm state ‘TIMEOUT’ for timeout expired alarms
  • Add new interface in aodh-client and DB layer to get all event-alarm with timeout
  • Modify AODH api’s event-alarm creating code to send out ‘alarm.timeout.start’ notification
  • Modify AODH event-alarm evaluator so that:
    • spawn a new timeout thread to handle all timeout requests
    • timeout thread works in a loop of sleeping for timeout seconds then sending out ‘alarm.timeout.end’ event
    • set related alarm status as ‘TIMEOUT’ when receive ‘alarm.timeout.end’ event.
  • Add extra action to restart timeout thread when reset alarm state to ‘UNKNOWN’ or enable the alarm if previous timeout thread already exit

Future lifecycle

To be maintained by edwin-zhai for bug fixing and enhancement.

In future, we need timeout thread disaster-recovery capability, that is, no loss of timeout info when evaluator crash. Need store pending timeout requests in DB, and feed evaluator when restarting.

Dependencies

None

Testing

Add new test case besides current event-alarm test to cover timeout

Documentation Impact

Administrator Guide and Installation Guide in OpenStack Manuals should be updated to describe usage of ‘timeout’ parameter.

References

Blueprint Timeout mechanism for Event Alarm Evaluator https://blueprints.launchpad.net/ceilometer/+spec/timeout-event-alarm-evaluator