Proactive RCA fault model¶
https://blueprints.launchpad.net/vitrage/+spec/proactive-rca
This blueprint proposes a solution for proactive RCA. It aims to be an umbrella for all related blueprints.
Problem description¶
Currently, Vitrage relies on monitored events for root cause analysis. It can deduce alarms on virtual resources, but a deduced alarm is always in a confirmed state. The real world can be more complicated. Suppose there are two possible causes (A or B) for fault C. When fault C is monitored, either A or B may have occurred and be the root cause, so both are suspect. We need a way to act on such a suspicion and find the root cause more proactively.
Proposed change¶
A fault model with deductive reasoning is proposed here to resolve the problem above.
Fault model¶
Given the template defined for the fault model above.
The underlying entity graph can be broken down as follows:
Deductive reasoning¶
Suppose we received a series of events in the following order:
t0: initial status, no faults
t1: fault C is monitored as active
t2: the diagnose action reports fault A active and fault B inactive
t3: fault C is monitored as inactive
t4: the diagnose action reports fault A inactive
Let’s illustrate the deductive reasoning in graphs with the following legend.
cluster: aggregate fault
nodes: raw fault
edges:
dashed: the target state is deduced from source state by Vitrage
dotted: the target state is updated by diagnose action triggered by source state
border:
dashed: suspect
colors:
lightgrey: fault in undefined state
red: fault in confirmed state
labels:
Fault A: aggregated
a_v: deduced by Vitrage
a_m: monitored
t0: initial status¶
All faults undefined, no reasoning
t1: downstream fault monitored, upstream fault deduced¶
Vitrage deduces that fault A and fault B are active in suspect state
execute diagnose action to confirm suspect state
t2: diagnose action executed, upstream fault confirmed¶
monitored states updated by diagnose action
as a side effect, Vitrage deduces that fault C is active, in parallel with the monitored fault C
t3: downstream recovery monitored, upstream recovery deduced¶
from the fault model, we can deduce fault A and fault B are inactive
execute diagnose action to resolve status inconsistency between fault A deduced (inactive) and monitored (active)
t4: upstream recovery confirmed¶
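The deduction steps in the walkthrough above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Vitrage code: the `CAUSES` mapping, the `Deduced` record and the `deduce` function are hypothetical names, and the sketch assumes that an inactive cause alone implies nothing about its effect (the walkthrough deduces nothing for fault C when fault B is reported inactive at t2).

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical fault model: each effect maps to its possible causes (A or B -> C).
CAUSES: Dict[str, List[str]] = {"C": ["A", "B"]}


@dataclass(frozen=True)
class Deduced:
    fault: str
    active: bool
    suspect: bool


def deduce(fault: str, active: bool,
           causes: Dict[str, List[str]] = CAUSES) -> List[Deduced]:
    """Return the alarms deduced from one state change of `fault`."""
    results = []
    # Downstream -> upstream: every possible cause gets the same state,
    # but only as a suspect, because the source is a downstream fault.
    for cause in causes.get(fault, []):
        results.append(Deduced(cause, active, suspect=True))
    # Upstream -> downstream: an active cause confirms its effect.
    # Assumption: an inactive cause alone says nothing about the effect.
    if active:
        for effect, possible in causes.items():
            if fault in possible:
                results.append(Deduced(effect, True, suspect=False))
    return results


if __name__ == "__main__":
    print("t1:", deduce("C", True))   # A and B become suspect active
    print("t2:", deduce("A", True))   # C is deduced active, not suspect
    print("t3:", deduce("C", False))  # A and B become suspect inactive
```

Each suspect result is what would trigger a diagnose action in the proposed model.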
Changes required¶
The main change is to allow alarms to be raised in suspect status, which can then be used to trigger a diagnostic action.
Add support for diagnose action, like:
an immediate pull from data source
force a push from data source
launch external tools to get state and post events to Vitrage API
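One possible shape for diagnose actions is sketched below. All names here (`DiagnoseAction`, `pull_action`, `push_action`, `DiagnoseDispatcher`) are hypothetical and do not exist in Vitrage. The sketch assumes that an immediate pull returns the state synchronously, while a forced push or an external tool returns nothing and delivers its result later as a regular event.

```python
from typing import Callable, Dict, Optional

# Hypothetical signature: True (active), False (inactive), or None when
# the result will arrive later as a regular monitored event.
DiagnoseAction = Callable[[str], Optional[bool]]


def pull_action(read_state: Callable[[str], bool]) -> DiagnoseAction:
    """Immediate pull: query the data source synchronously."""
    def action(fault: str) -> Optional[bool]:
        return read_state(fault)
    return action


def push_action(request: Callable[[str], object]) -> DiagnoseAction:
    """Forced push or external tool: trigger it, then wait for the event."""
    def action(fault: str) -> Optional[bool]:
        request(fault)
        return None
    return action


class DiagnoseDispatcher:
    """Maps a suspect fault to its registered diagnose action, if any."""

    def __init__(self) -> None:
        self._actions: Dict[str, DiagnoseAction] = {}

    def register(self, fault: str, action: DiagnoseAction) -> None:
        self._actions[fault] = action

    def on_suspect(self, fault: str) -> Optional[bool]:
        action = self._actions.get(fault)
        if action is None:
            return None  # no diagnose action: the fault stays suspect
        return action(fault)


if __name__ == "__main__":
    source = {"A": True}   # stand-in for a data source
    requested = []         # records forced-push requests
    dispatcher = DiagnoseDispatcher()
    dispatcher.register("A", pull_action(source.__getitem__))
    dispatcher.register("B", push_action(requested.append))
    print(dispatcher.on_suspect("A"), dispatcher.on_suspect("B"), requested)
```

Returning `None` for asynchronous actions keeps the dispatcher independent of how the result is eventually delivered.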
Suspect states for alarm
for a deduced alarm, it is suspect if it is deduced from a downstream fault
for an aggregated alarm, it is suspect if an inconsistency is detected among the underlying alarms (deduced and monitored)
As a special case, the aggregated alarm is not suspect if a newer monitored state differs from an older suspect deduced state. A suspect state means the fault could be either active or inactive, so it is reasonable to trust the latest update from the monitor.
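The suspect rules above can be written down as a small aggregation function. This is a sketch under assumptions, not the proposed implementation: the `State` and `Alarm` types and the logical `updated_at` timestamp are hypothetical, and the choice to report the most recently updated state for an inconsistent aggregate is an assumption the text does not specify.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    UNDEFINED = "undefined"
    ACTIVE = "active"
    INACTIVE = "inactive"


@dataclass(frozen=True)
class Alarm:
    state: State = State.UNDEFINED
    suspect: bool = False
    updated_at: int = -1  # hypothetical logical time of the last update


def deduced_alarm(state: State, from_downstream: bool, now: int) -> Alarm:
    """A deduced alarm is suspect when its source is a downstream fault."""
    return Alarm(state, suspect=from_downstream, updated_at=now)


def aggregate(deduced: Alarm, monitored: Alarm) -> Alarm:
    """Combine the deduced and monitored alarms of one fault."""
    # With only one defined underlying alarm, the aggregate mirrors it.
    if monitored.state is State.UNDEFINED:
        return deduced
    if deduced.state is State.UNDEFINED:
        return monitored
    latest = max(deduced.updated_at, monitored.updated_at)
    # Consistent underlying alarms: confirmed.
    if deduced.state is monitored.state:
        return Alarm(monitored.state, suspect=False, updated_at=latest)
    # Special case: a newer monitored state overrides an older suspect
    # deduced state, so the aggregate is not suspect.
    if deduced.suspect and monitored.updated_at > deduced.updated_at:
        return Alarm(monitored.state, suspect=False, updated_at=latest)
    # Any other inconsistency is suspect. Assumption: report the state
    # of whichever underlying alarm was updated last.
    newest = deduced if deduced.updated_at >= monitored.updated_at else monitored
    return Alarm(newest.state, suspect=True, updated_at=latest)
```

Applied to fault A in the walkthrough, the aggregate is suspect active at t1, confirmed active at t2, suspect inactive at t3 and confirmed inactive at t4.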
However, once proactive RCA is introduced, the number of entities in the graph may grow considerably. A deduced alarm must be created for each monitored alarm and set to suspect state under certain conditions. The number of relationships (edges) will grow as well. So some additional work is needed to improve the user experience, such as:
Aggregate underlying entity graph to simplify user view
Add new API for querying aggregated entity graph
Simplify the template definition by defining fault model
Backward compatibility¶
A fully functional proactive RCA fault model requires that every fault can be monitored and has a diagnose action. However, the deductive reasoning procedure also applies to a fault model that lacks a diagnose action or monitoring for some faults.
For example, if there is no diagnose action for confirming Fault A, the deductive reasoning still continues once the status of Fault A is updated passively by the monitor.
If there is no monitor for Fault A, it stays in suspect status. This still helps the user narrow down the scope of the root cause and makes manual investigation easier.
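The fallback behaviour described above can be summarised as a small decision function. The function name and the returned labels are hypothetical. The sketch also assumes that a diagnose action works by refreshing the monitored state, so a fault without a monitor stays suspect even if an action is defined; the text does not state this case explicitly.

```python
def resolution_path(has_monitor: bool, has_diagnose_action: bool) -> str:
    """Describe how a suspect fault gets resolved, if at all."""
    if not has_monitor:
        # Nothing can confirm the state: narrow the scope for the user.
        return "stays suspect"
    if has_diagnose_action:
        # Fully functional case: confirm proactively.
        return "confirmed by diagnose action"
    # No diagnose action: wait for the monitor to report passively.
    return "confirmed by next monitor update"
```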
Implementation¶
Assignee(s)¶
- Primary assignee:
yujunz
- Other contributors:
None
Work Items¶
See dependent blueprints.
Dependencies¶
The implementation of proactive RCA depends on several blueprints. They will be listed below once they are approved.
Testing¶
See dependent blueprints.
Documentation Impact¶
See dependent blueprints.