Rolling upgrades¶
https://blueprints.launchpad.net/glance/+spec/rolling-upgrades
This spec provides a gap analysis of what is required for the Glance project to assert the various rolling upgrade tags and specifies the actions necessary to close the gaps.
The goal for Ocata is to assert the assert:supports-zero-downtime-upgrade tag, with a stretch goal of asserting the assert:supports-zero-impact-upgrade tag. Given that asserting these tags depends on completion of another feature (see spec proposal [GSP1]), the backup goal is to assert the assert:supports-rolling-upgrade tag. In any case, Glance will make progress on the upgrade front in this cycle.
Problem description¶
There are currently four upgrade tags:
assert:supports-upgrade [GOV3]
assert:supports-rolling-upgrade [GOV1]
assert:supports-zero-downtime-upgrade [GOV4]
assert:supports-zero-impact-upgrade [GOV5]
supports-upgrade¶
The assert:supports-upgrade tag [GOV3] has been asserted for Glance [GOV0].
supports-rolling-upgrade¶
The requirements for asserting the assert:supports-rolling-upgrade tag are listed in [GOV1]. They are:
The project is already tagged as type:service [GOV2].
Glance status: done
The project has already successfully asserted the assert:supports-upgrade tag [GOV3].
Glance status: done
The project has a defined plan that allows operators to roll out new code to subsets of services, eliminating the need to restart all services on new code simultaneously.
Glance status: need to define a plan.
More detail about the required plan, as described in [GOV1]:
This plan should clearly call out the supported configuration(s) that are expected to work, unless there are no such caveats. This does not require complete elimination of downtime during upgrades, but rather reducing the scope from “all services” to “some services at a time.” In other words, “restarting all API services together” is a reasonable restriction.
The Glance services consist of the Glance API server and the optional Glance Registry API server. The Glance Registry API server is not intended to be exposed to end users or other OpenStack services; it is expressly designed for internal Glance use only.
(Note that it’s OK for there to be specific configurations under which a rolling upgrade is expected to work. In particular, it’s likely that we will require deployments using the optional Glance registry to run it on dedicated nodes.)
Glance has had healthcheck middleware that can be used to signal to a load balancer that an API node is out of service since the Liberty release [GLA2]. This can be leveraged to take Glance nodes running “old” code out of rotation while nodes running “new” code are brought in.
Note that while this proposal will allow a mixed deployment of API versions to run simultaneously, it does not envision that this will include multiple versions of the API running on the same node simultaneously. In other words, we do not intend to support a scenario in which “new” API code is deployed to a node while “old” Glance processes are running on that node. Instead, we expect operators to allow the “old” nodes to drain completely and all processes running the “old” code to be stopped before the “new” code is deployed to that node. (If the Glance nodes are VMs, a completely drained node could simply be deleted and be replaced by a fresh VM containing the “new” code.)
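To illustrate how a healthcheck of this kind can take a node out of rotation, here is a minimal sketch of a "disable by file" check. It is not the oslo middleware implementation: the function name, the marker file path, and the status codes and response bodies are assumptions chosen for illustration.

```python
import os
import tempfile


def healthcheck(disable_file_path):
    """Return (status_code, body) for a load balancer health probe.

    Illustrative only: the node reports itself unavailable while the
    marker file exists, so the load balancer stops sending it new requests.
    """
    if os.path.exists(disable_file_path):
        return 503, "DISABLED BY FILE"
    return 200, "OK"


def demo():
    """Show a node going out of rotation and coming back."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical marker file name, not a Glance default.
        marker = os.path.join(tmp, "glance_healthcheck_disable")
        before = healthcheck(marker)
        # The operator creates the marker file on a node running "old" code.
        open(marker, "w").close()
        during = healthcheck(marker)
        # Removing the file would put the node back in rotation.
        os.remove(marker)
        after = healthcheck(marker)
    return before, during, after
```

In a real deployment the operator would create the marker file on each "old" node, wait for in-flight requests to finish, and only then stop the Glance processes on that node.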
Full stack integration testing with services arranged in a mid-upgrade manner is performed on every proposed commit to validate that mixed-version services work together properly.
Glance status: needs to be implemented (but note that the tests required for the next tag would more than satisfy this requirement).
supports-zero-downtime-upgrade¶
The assert:supports-zero-downtime-upgrade tag indicates that a project supports minimal rolling upgrade capabilities in such a way that no disruptions to API availability occur during the upgrade.
The requirements for asserting this tag are listed in [GOV4]. They are:
The project is already tagged as type:service [GOV2].
Glance status: done
The project has already successfully asserted both the assert:supports-upgrade and assert:supports-rolling-upgrade tags.
Glance status: the assert:supports-rolling-upgrade tag has not yet been asserted. See the previous section for what’s required.
Services must completely eliminate API downtime of the control plane during the upgrade.
Glance status: The key issue is how to handle database changes required for release N while release N-1 code is still running. This is addressed by another spec, “Database strategy for rolling upgrades” [GSP1].
Services must be capable of receiving and handling requests throughout the upgrade process with a normal success rate. Services must prevent regression by implementing a zero-downtime gate job wherein both a new version of the service and an old version of the service are run concurrently.
Glance status: needs to be implemented.
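One way a gate job could check the "normal success rate" requirement is to compare the share of successful responses seen during the upgrade against a baseline taken during normal operation. The sketch below is only an assumption about how such a check might look. The function names, the definition of success (any status code below 500), and the tolerance value are hypothetical and are not taken from the governance tag.

```python
def success_rate(status_codes):
    """Fraction of responses that are not server errors (status < 500)."""
    if not status_codes:
        raise ValueError("no responses recorded")
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def upgrade_keeps_success_rate(baseline_codes, upgrade_codes, tolerance=0.01):
    """True if the success rate during the upgrade stays within
    `tolerance` of the baseline success rate.

    The default tolerance is an arbitrary illustrative value.
    """
    baseline = success_rate(baseline_codes)
    during_upgrade = success_rate(upgrade_codes)
    return during_upgrade >= baseline - tolerance
```

A gate job built along these lines would issue API requests continuously while old and new services run side by side, record the status codes, and fail if the check returns False.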
supports-zero-impact-upgrade¶
The assert:supports-zero-impact-upgrade tag indicates that a project supports both rolling upgrade capabilities and a zero-downtime upgrade (as described above) in such a way that no perceivable API performance penalty occurs during the upgrade.
The requirements for asserting this tag are listed in [GOV5]. They are:
The project is already tagged as type:service.
Glance status: done
The project has already successfully asserted the assert:supports-upgrade, assert:supports-rolling-upgrade, and assert:supports-zero-downtime-upgrade tags.
Glance status: see the previous sections.
Services must completely eliminate any perceivable performance penalty during the upgrade process. Operators should not expect any portion of the upgrade or migration process to place abnormally high load on any part of the cloud, or to cause delays in the handling of API requests, even intermittently.
Glance status: Given that we’re talking about Glance services only (not the DBMS and not the storage backend), this should be achieved when the zero-downtime upgrade is implemented.
Services must prevent regression by implementing a zero-impact gate job wherein both a new version of the service and an old version of the service are run concurrently under load. A measurement of API response times must show that there are no statistically significant outliers during the upgrade process when compared to normal operations.
Glance status: needs to be implemented.
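The response-time requirement could be approximated by flagging samples taken during the upgrade that fall far outside the spread seen in normal operation. The following is a rough sketch under stated assumptions: the function name, the use of a mean-plus-k-standard-deviations threshold, and the default k are illustrative choices, and the governance tag does not prescribe a particular statistical test.

```python
import statistics


def find_outliers(baseline_ms, upgrade_ms, k=3.0):
    """Return upgrade response times (in ms) that are more than k
    standard deviations above the baseline mean.

    baseline_ms needs at least two samples so that the sample
    standard deviation is defined.
    """
    mean = statistics.mean(baseline_ms)
    stdev = statistics.stdev(baseline_ms)
    threshold = mean + k * stdev
    return [t for t in upgrade_ms if t > threshold]
```

A zero-impact gate job could fail whenever this list is non-empty, although a production check would likely want a more robust test and a larger sample.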
Proposed change¶
There are two major changes:
Process Documentation
What we need to establish is that the Glance project has “a defined plan that allows operators to roll out new code to subsets of services, eliminating the need to restart all services on new code simultaneously.”
The “Gaps” section of the Product Working Group’s “Rolling Updates and Upgrades” user story [PWG1] provides a useful list of the phases an operator would go through in performing a rolling upgrade of an OpenStack cloud. We propose to document the relevant phases clearly for Glance so that operators can understand the Glance rolling upgrade story.
The phases identified by the Product Working Group are:
Maintenance Mode
Live Migration
Upgrade Orchestration - Deploy
Multi-version Interoperability
Online Schema Migration
Graceful Shutdown
Upgrade Orchestration - Remove
Upgrade Orchestration - Tooling
Upgrade Gating
Project Tagging
For Glance, upgrading from release N-1 to release N, we can compress these into:
Upgrade Orchestration - Deploy
- stage the code for release N to new Glance nodes
Online Schema Migration - Part 1
Multi-version interoperability
- start the release N nodes
- take the release N-1 nodes out of rotation, allowing them to drain
Upgrade Orchestration - Remove
- take each release N-1 node offline once it has completed processing its current requests
Online Schema Migration - Part 2
- final database schema migration (the “contract” phase as described in [GSP1])
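The compressed phases above can be sketched as an ordered list of operator steps. This is an illustration of the ordering only, not an orchestration tool: the function name and node names are hypothetical, and each step is a plain description rather than a real command.

```python
def rolling_upgrade_plan(old_nodes, new_nodes):
    """Return the ordered operator steps for upgrading from N-1 to N."""
    steps = []
    # Upgrade Orchestration - Deploy
    for node in new_nodes:
        steps.append(f"stage release N code on {node}")
    # Online Schema Migration - Part 1
    steps.append("run database schema migration, part 1")
    # Multi-version interoperability
    for node in new_nodes:
        steps.append(f"start {node}")
    for node in old_nodes:
        steps.append(f"take {node} out of rotation and let it drain")
    # Upgrade Orchestration - Remove
    for node in old_nodes:
        steps.append(f"take {node} offline once drained")
    # Online Schema Migration - Part 2
    steps.append("run final database schema migration (contract phase)")
    return steps
```

For example, `rolling_upgrade_plan(["old-1", "old-2"], ["new-1", "new-2"])` lists the staging steps first and the final schema migration last, which matches the requirement that no release N-1 code is still running when the contract phase begins.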
Testing
Full stack integration testing with services arranged in a mid-upgrade manner is performed on every proposed commit to validate that mixed-version services work together properly.
This testing must be performed on configurations that the project considers to be its reference implementations.
The arrangement(s) tested will depend on the project (i.e. should be representative of a meaningful-to-operators rolling upgrade scenario) and available testing resources.
At least one representative arrangement must be tested full-stack in the gate.
We propose using Grenade [GRN1] for the full stack integration tests.
Alternatives¶
One alternative would be to choose not to support rolling upgrades in Glance. Such a choice, however, would impact other services that depend upon Glance (for example, Nova). Such services would experience disruptions during the Glance upgrade. So this doesn’t seem to be a serious alternative.
The proposal in this spec is to use the “disable by file” feature of the oslo healthcheck middleware to take the Glance nodes running “old” code out of rotation and allow them to drain. Stuart McLaren has suggested an alternative, namely to piggyback on the zero downtime configuration reload feature of Glance (available since the Kilo release [GLA1]) and create a “graceful stop” function that would accept a signal to shut down child processes as they complete. (See [GSP2] for details.)
Since we’ve got the “disable by file” functionality available, this alternative isn’t necessary to achieve the upgrade tags. It would, however, be an operator-friendly enhancement that we could pick up at some point.
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None
Performance Impact¶
None
Other deployer impact¶
It is anticipated that a rolling upgrade will require operator intervention.
Developer impact¶
Developers will need to be aware of Glance features that enable rolling upgrades and make sure they aren’t removed. (Developers will also need to work within the constraints of the database strategy for rolling upgrades, but that developer impact is covered by another spec.)
Implementation¶
Assignee(s)¶
- Primary assignee:
rosmaita hemanthm
- Other contributors:
nikhil
Work Items¶
Verify the accuracy of current Glance upgrade documentation.
Write documentation for rolling upgrade (developer docs).
Write documentation for rolling upgrade (operator docs).
Implement Grenade tests.
Assert the tag and notify the OpenStack Technical Committee.
Dependencies¶
To achieve the assert:supports-zero-downtime-upgrade tag, this spec depends upon implementation of the spec “Database strategy for rolling upgrades” [GSP1].
Testing¶
We’ll need to implement gate tests (see above).
Documentation Impact¶
Documentation of general information for Glance rolling upgrades, in particular:
The supported configuration(s) for rolling upgrades
The operator workflow for performing a rolling upgrade
References¶
[GOV4] https://governance.openstack.org/reference/tags/assert_supports-zero-downtime-upgrade.html
[GOV5] https://governance.openstack.org/reference/tags/assert_supports-zero-impact-upgrade.html