Vitrage-graph fast failover¶
https://blueprints.launchpad.net/vitrage/+spec/vitrage-fast-failover
- vitrage-graph high availability should meet these requirements:
Single active instance of vitrage-graph (managed by pacemaker).
Initialize quickly upon failover without requesting updates.
In case of a long downtime, vitrage-graph will request collector updates on startup.
Problem description¶
Vitrage-graph runs in active-standby mode. Currently, on failover, vitrage-graph must pull all the data again from the collector data sources. This takes a considerable amount of time, during which the data is inconsistent. Because we wish to continue working with an in-memory graph (for performance reasons), the vitrage-graph service will remain active-standby. Therefore, downtime must be minimized in failover events.
Proposed change¶
After every get_all, vitrage-graph stores a full entity graph snapshot in the database, so most events do not need to be replayed.
Vitrage-graph sends each processed event to vitrage-persistor so these are stored in the order of handling.
Upon init, vitrage-graph queries the graph_snapshots database table and fetches the latest entry. The snapshot is used only if it is not older than snapshot_interval.
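The snapshot store and the freshness check described above can be sketched as follows. This is a minimal illustration, not the Vitrage implementation: the in-memory list stands in for the graph_snapshots table, a plain dict stands in for the entity graph, and the function names `store_snapshot` and `load_fresh_snapshot` are hypothetical.

```python
import pickle
import time

# Hypothetical in-memory stand-in for the graph_snapshots table:
# a list of (timestamp, pickled_graph) rows.
graph_snapshots = []


def store_snapshot(graph, now=None):
    """Pickle the whole graph and append it as a new snapshot row."""
    ts = time.time() if now is None else now
    graph_snapshots.append((ts, pickle.dumps(graph)))


def load_fresh_snapshot(snapshot_interval, now=None):
    """Return (timestamp, graph) for the latest row.

    Returns None when there is no snapshot or when the latest one
    is older than snapshot_interval seconds.
    """
    if not graph_snapshots:
        return None
    now = time.time() if now is None else now
    ts, blob = max(graph_snapshots, key=lambda row: row[0])
    if now - ts > snapshot_interval:
        return None
    return ts, pickle.loads(blob)
```

Returning the snapshot timestamp together with the graph lets the caller select only the events that were stored after the snapshot was taken.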
- Init with a snapshot - on failover
Unpickle stored snapshot to get the graph.
Run the processor on all the events (from events table) that occurred after the snapshot.
Enable the evaluators.
Process all the events that are waiting in the message bus.
- Init without a snapshot - a fresh start (This is the current behaviour).
Start with a new empty graph.
Send an RPC to the collector to run get_all for all drivers, then process the resulting events.
Process all the events that are waiting in the message bus.
Enable the evaluators and iterate over the entire graph.
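The two init paths above can be sketched as a single function. This is an illustration under stated assumptions rather than the actual vitrage-graph code: the graph is a plain dict, `apply_event` is a toy processor, `get_all` is a callable standing in for the collector RPC, and evaluators are represented only by an entry in a step log so that the ordering of the steps is visible.

```python
import pickle


def apply_event(graph, event):
    """Toy processor: an event either deletes or upserts one vertex."""
    if event["action"] == "delete":
        graph.pop(event["id"], None)
    else:
        graph[event["id"]] = event["props"]


def init_graph(snapshot, stored_events, get_all, bus_events):
    """Build the graph on startup and record the order of the steps.

    snapshot      -- (timestamp, pickled graph), or None for a fresh start
    stored_events -- persisted (timestamp, event) pairs in handling order
    get_all       -- callable standing in for the collector RPC
    bus_events    -- events waiting on the message bus
    """
    log = []
    if snapshot is not None:
        # Init with a snapshot (failover).
        snap_ts, blob = snapshot
        graph = pickle.loads(blob)
        log.append("unpickle")
        for ts, event in stored_events:
            if ts > snap_ts:  # replay only events newer than the snapshot
                apply_event(graph, event)
        log.append("replay")
        log.append("enable_evaluators")
        for event in bus_events:
            apply_event(graph, event)
        log.append("bus")
    else:
        # Init without a snapshot (fresh start).
        graph = {}
        log.append("empty")
        for event in get_all():
            apply_event(graph, event)
        log.append("get_all")
        for event in bus_events:
            apply_event(graph, event)
        log.append("bus")
        log.append("enable_evaluators")
    return graph, log
```

Note the difference in ordering: with a snapshot, the evaluators are enabled before the message bus is drained; on a fresh start they are enabled last.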
Alternatives¶
Using a persistent graph database could improve vitrage-graph high availability, since failover would be quick when running active-active. This may be the preferred solution in terms of high availability, but overall the performance degradation compared to in-memory networkx is not reasonable.
Data model impact¶
May require minor changes, TBD.
REST API impact¶
None
Versioning impact¶
None
Other end user impact¶
None
Deployer impact¶
This will be enabled by default. The deployer may disable it by adding the following to vitrage.conf:
    [persistancy]
    enable_persistancy=false
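As a rough illustration of how this flag behaves, the sketch below reads the section and option named above with Python's standard `configparser` and defaults to enabled when the option is absent. This is not how Vitrage itself loads its configuration; the helper name `persistence_enabled` is hypothetical.

```python
import configparser


def persistence_enabled(conf_text):
    """Return the flag value, defaulting to True when it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # The fallback also covers a missing [persistancy] section.
    return parser.getboolean("persistancy", "enable_persistancy", fallback=True)
```

For example, `persistence_enabled("[persistancy]\nenable_persistancy=false\n")` evaluates to `False`, while an empty configuration yields `True`.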
Developer impact¶
None
Horizon impact¶
None
Implementation¶
Assignee(s)¶
- Primary assignee:
idan-hefetz
- Other contributors:
None
Work Items¶
None
Dependencies¶
None
Testing¶
Additional Tempest tests will be added for failover, as persistence is already covered by existing Tempest tests. Unit tests will not be effective here, as the changes are mostly in the init process and scheduler. This feature mostly reuses existing (tested) functionality.
Documentation Impact¶
None
References¶
None