Support basic high availability¶
Launchpad blueprint:
https://blueprints.launchpad.net/congress/+spec/basic-high-availability
Congress needs to support high availability (HA) for API requests so that clients can continue to make successful API requests even if a Congress server becomes unavailable. This proposal describes a basic HA solution that replicates the entire Congress server as-is. Each replica runs the policy engine, contains all the table data, and runs the datasource drivers.
Problem description¶
Today, Congress runs as a single standalone server. That single server handles all API queries and is a single point of failure: if it fails, clients that integrate with Congress experience downtime.
Proposed change¶
This spec proposes to:

- Replicate the entire Congress server.
- Use an off-the-shelf load balancer to distribute requests across the replicas and route around failed replicas.
- Have write API calls persist their changes to the shared database.
- Have each replica periodically check the database for changes to policy or datasources (see the sketch after this list).
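As a rough illustration of the last two bullets, each replica could run a polling loop like the sketch below. The database URL, the hypothetical policy_rules table with its updated_at column, and the sync_policies_from_db helper are all placeholders; a real implementation would reuse Congress's existing DB models and synchronization code:

    # Sketch of a replica's periodic check for policy changes in the shared
    # database.  Table and column names are hypothetical placeholders.
    import time

    import sqlalchemy

    # Assumed shared database; every replica points at the same URL.
    engine = sqlalchemy.create_engine(
        "mysql+pymysql://congress:secret@db-host/congress")

    POLL_INTERVAL = 10  # seconds between checks; a deployer-tunable knob


    def latest_policy_change(conn):
        # Hypothetical schema: policy_rules(id, policy_name, rule, updated_at).
        result = conn.execute(
            sqlalchemy.text("SELECT MAX(updated_at) FROM policy_rules"))
        return result.scalar()


    def sync_policies_from_db():
        # Placeholder for reloading rules into this replica's policy engine.
        pass


    last_seen = None
    while True:
        with engine.connect() as conn:
            newest = latest_policy_change(conn)
        if newest != last_seen:
            sync_policies_from_db()
            last_seen = newest
        time.sleep(POLL_INTERVAL)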
Alternatives¶
A more advanced design separates the policy engine from the datasource drivers, replicating the policy engine N times but running the datasource drivers in a master-standby configuration. This way, only the master datasource driver talks to the datasources, reducing the load on them, and it communicates incoming data changes to the replicas over a message bus. This design would require more code changes to separate the engine from the datasources and to rework the message bus.
Yet another proposal is to funnel all datasource updates to a central machine that precomputes materialized views of all tables. This has the advantage of keeping the replicated Congress API consistent, but it relies on a single machine to compute the materialized views, and materializing the views keeps all intermediate table content in memory, which can consume an unmanageable amount of memory when there are many intermediate tables. This alternative would also require a significant amount of code changes.
Policy¶
None
Policy actions¶
None
Data sources¶
We need to ensure that each data service (such as Nova or Neutron) can accept and handle requests from more than one datasource driver instance at the same time, since each replica will fetch data from each data service. In other words, if there are N replicas, each data service must serve the full data set N separate times and must be able to cope with that higher request load.
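To make the extra load concrete, the sketch below shows roughly what each replica's Nova poll cycle amounts to, assuming keystoneauth1 and python-novaclient; the credentials, endpoint, and poll interval are placeholders, and the real datasource driver does more than list servers:

    # Sketch: every replica polls Nova independently, so Nova sees this
    # request load N times.  Credentials and the poll period are placeholders.
    import time

    from keystoneauth1 import identity, session
    from novaclient import client as nova_client

    auth = identity.Password(
        auth_url="http://keystone:5000/v3",
        username="congress",
        password="secret",
        project_name="admin",
        user_domain_name="Default",
        project_domain_name="Default")
    sess = session.Session(auth=auth)
    nova = nova_client.Client("2", session=sess)

    POLL_INTERVAL = 30  # seconds; per-driver setting in each replica

    while True:
        # Each replica fetches the complete server list itself; results are
        # not shared between replicas in this basic HA design.
        servers = nova.servers.list(detailed=True)
        # ...translate 'servers' into Congress table rows here...
        time.sleep(POLL_INTERVAL)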
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
Two API calls may return different data if a different replica serves each call, because both the data pulled from the datasources and the policy rules may be out of sync between two replicas. Shortening the interval at which each replica checks the database bounds the skew for policy rules, but skew in datasource data will still affect the replicas.
Performance impact¶
This change should improve throughput for the Congress API since multiple replicas can serve requests instead of a single server. However, there may be an impact on the datasources, since each replica requests data from every datasource. The periodic database checks for updates should have minimal performance impact.
Other deployer impact¶
None
Developer impact¶
All shared state must be stored in the database and periodically re-read by every replica, as sketched below.
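A minimal sketch of the write path under this rule, reusing the hypothetical policy_rules table from the polling sketch above: the API handler persists the change to the shared database rather than only to in-memory state, and the other replicas converge via their periodic checks:

    # Sketch of the write path: an API handler persists new shared state to
    # the database so other replicas pick it up on their next periodic check.
    # The table layout is the same hypothetical one used in the polling sketch.
    import datetime

    import sqlalchemy

    engine = sqlalchemy.create_engine(
        "mysql+pymysql://congress:secret@db-host/congress")


    def create_rule(policy_name, rule_text):
        # Persist to the shared DB first; this replica's in-memory state is
        # updated afterwards, and the other replicas converge via polling.
        with engine.begin() as conn:
            conn.execute(
                sqlalchemy.text(
                    "INSERT INTO policy_rules (policy_name, rule, updated_at) "
                    "VALUES (:policy, :rule, :now)"),
                {"policy": policy_name, "rule": rule_text,
                 "now": datetime.datetime.utcnow()})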
Implementation¶
Assignee(s)¶
- Primary assignee:
Work items¶
- Add a periodic check for database updates, along with a test
- Add a test that starts two replicas and queries both
Dependencies¶
None
Testing¶
Start two replicas using the same database. Make a policy change on one replica and check that the change appears on the second replica.
Start two replicas and kill one. Make sure the surviving replica can still serve requests. Restart the first replica and make sure it serves requests again.
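A rough sketch of the first scenario as an automated check, driving both replicas over HTTP with the requests library. The ports, policy name, rule text, and the assumption that the rule listing returns a JSON object with a "results" key are all placeholders; a real test would use python-congressclient or the project's existing test harness:

    # Sketch of the replication test: create a rule through replica A and
    # wait for it to appear through replica B.  Ports, policy name, and the
    # response format are assumptions for illustration only.
    import time

    import requests

    REPLICA_A = "http://127.0.0.1:1789"
    REPLICA_B = "http://127.0.0.1:2789"
    POLICY = "classification"
    RULE = "error(x) :- nova:servers(id=x, status='ERROR')"


    def rules_on(replica):
        resp = requests.get("%s/v1/policies/%s/rules" % (replica, POLICY))
        resp.raise_for_status()
        return [r["rule"] for r in resp.json()["results"]]


    # Write the rule through replica A.
    requests.post("%s/v1/policies/%s/rules" % (REPLICA_A, POLICY),
                  json={"rule": RULE}).raise_for_status()

    # Wait for replica B's periodic DB check to pick the rule up.
    deadline = time.time() + 60
    while time.time() < deadline:
        if RULE in rules_on(REPLICA_B):
            break
        time.sleep(2)
    else:
        raise AssertionError("rule did not propagate to the second replica")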
Documentation impact¶
We should add a description of how to configure Congress in HA mode, with a load balancer and a shared database.
References¶
None