Add a service version number to the database
We have previously identified online data migrations as critical for supporting live upgrades. We do well with that today, as long as control services are upgraded together (conductor, api, etc). In order to extend this further, we need a way to determine that some of the control services have been upgraded, but not all. This information will allow us to avoid upgrading data online until all of the services are upgraded to the point at which they can handle the new schema.
If control services are only partially upgraded, newer conductors will begin converting data into the new schema before older conductors and other control services are ready for it. Upgrading all control services together avoids this, but a staged upgrade (the more realistic case) has the potential to break some of the older control services.
For example, in Kilo we can apply the database schema before starting to upgrade any of the code. However, once we upgrade any one component that talks to the database directly, online migrations will begin to happen, and any other nodes that read from the database directly may break as the data starts to move underneath them. Since services like nova-api, nova-scheduler, nova-conductor, etc. all talk directly to the database, we’re currently unable to upgrade these services independently. If we had this service version number available, we could avoid doing any online migrations until all the affected services are upgraded to a new-enough point.
As a deployer, I want to be able to stage my upgrades of control services, without having to take down all of my conductor, api, scheduler, etc nodes and restart them together spanning a live data migration.
This is an upgrades enhancement.
The proposed change is adding a version number column to the services table. Each service will report its version number when it updates its service record. We will be able to determine if all services are on the same level of code by checking to see if there is more than one version in the table (optionally grouped by service). We can use this information to conditionally enable online data migrations. For example, we can check to see if all the conductors are upgraded to the same level by doing something like this:
SELECT DISTINCT version FROM services WHERE binary='conductor';
If we do that at conductor startup, we can set a flag to not enable migrations that require conductor services to be newer than a specific version. We could also refresh this on SIGHUP like we currently do for config reload.
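A minimal sketch of that startup check, using an in-memory SQLite table as a stand-in for the real services table. The names `MIGRATION_MIN_VERSION`, `conductor_versions` and `migrations_enabled` are hypothetical and only illustrate the idea; they are not part of the existing code.

```python
import sqlite3

# Hypothetical threshold: the service version every conductor must report
# before a given online migration is allowed to run.
MIGRATION_MIN_VERSION = 2


def conductor_versions(conn):
    """Return the distinct conductor versions, treating NULL as version 1."""
    rows = conn.execute(
        # "binary" is quoted because it is a reserved word in some databases.
        'SELECT DISTINCT version FROM services WHERE "binary" = ?',
        ("conductor",),
    ).fetchall()
    return {1 if version is None else version for (version,) in rows}


def migrations_enabled(conn):
    """True only if at least one conductor exists and all are new enough."""
    versions = conductor_versions(conn)
    return bool(versions) and min(versions) >= MIGRATION_MIN_VERSION


# Demonstration against a stand-in table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE services (host TEXT, "binary" TEXT, version INTEGER)')
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?)",
    [("a", "conductor", 2), ("b", "conductor", None), ("c", "scheduler", 2)],
)
mixed = migrations_enabled(conn)     # host b still reports NULL (version 1)
conn.execute("UPDATE services SET version = 2 WHERE host = 'b'")
upgraded = migrations_enabled(conn)  # every conductor now reports 2
```

The result would be computed once at startup and kept as a flag; a SIGHUP handler could simply call `migrations_enabled` again to refresh it.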
Initially, the column will be created with a default of NULL, and the Service object will treat NULL as “version 1”. Once this change lands, services will write version 2 whenever they save their service record. Any time we do an online data migration, we will need to bump the service version number.
Longer-term we could try to tie RPC versions to this information to make pinning easier. Because that has potentially hard-to-quantify implications on backports (which occasionally do need to modify RPC interfaces), I propose we leave that out of the scope of this work for now. Even so, tying RPC versions to this service version would be work for the next cycle, once this is in place and we can depend on it.
The other thing that this enables is the ability to have a service start up and know that it is massively out of date. Presumably we should be able to have a conductor start up and say “wow, I’m much older than the other conductors, I should log an error and exit.” Determining what the minimum version is and should be is something we would do in M when we have this for the N and N-1 releases such that we can depend on it.
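As a rough illustration of that startup check, something like the following could work. The policy itself is an open question in this spec, so `MAX_VERSION_LAG`, `is_too_old` and `check_or_exit` are purely hypothetical names and the threshold is an arbitrary assumption.

```python
import logging
import sys

LOG = logging.getLogger(__name__)

# Hypothetical policy: how many versions behind the newest peer a service
# may be before it refuses to start. The real value is left for later work.
MAX_VERSION_LAG = 1


def is_too_old(my_version, peer_versions, max_lag=MAX_VERSION_LAG):
    """Return True if this service lags the newest peer by more than max_lag."""
    newest = max(peer_versions, default=my_version)
    return newest - my_version > max_lag


def check_or_exit(my_version, peer_versions):
    """Log an error and exit if this service is massively out of date."""
    if is_too_old(my_version, peer_versions):
        LOG.error(
            "Service version %d is too far behind peers (newest is %d); exiting",
            my_version,
            max(peer_versions),
        )
        sys.exit(1)
```

With no peers recorded yet, `is_too_old` returns False, so the first service to start is never rejected.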
We need to do this in L so that we can leverage it in M. If we delay this until M, we won’t be able to rely on it until N.
Why not use semver?
In things like our RPC API versions, we use semver-like version numbers. This allows us to make decisions about incoming calls and whether they’re compatible with a newer node, and generally define rules for what is a breaking (i.e. major bump) change.
This service version number doesn’t imply any semantics itself, but rather just provides a vector with which we can orient ourselves in time to make other decisions. As described elsewhere in this spec, that may mean that we can decide what RPC version to use, or whether it is safe to start doing online data migrations. Those decisions extract semantic meaning out of the service version vector, and they may have significantly different rules (as would certainly be the case with the aforementioned RPC and DB decisions).
We can continue to do what we do today, which is start converting data from one schema to the other as soon as we roll new code. We keep the restriction of all control services being required to update together.
We could also codify this in config somehow, but that will require much more operator intervention, and increases the likelihood of error.
Data model impact
A schema migration will be written to add a version column to the table. The version will be an integer, nullable, default to NULL.
The Service object will be extended to support this version, and will treat a NULL version as version “1”, which will avoid us having to do any data migrations on existing service records. New saves will initially write version “2” for all services.
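The NULL handling could look roughly like this. It is a simplified stand-in written as a plain dataclass, not the real Service object; `SERVICE_VERSION`, `LEGACY_VERSION`, `from_db_row` and the dict returned by `save` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SERVICE_VERSION = 2  # version written by code that includes this change
LEGACY_VERSION = 1   # meaning assigned to rows whose version column is NULL


@dataclass
class Service:
    host: str
    binary: str
    version: int = LEGACY_VERSION

    @classmethod
    def from_db_row(cls, host: str, binary: str, version: Optional[int]):
        """Build a Service from a row, mapping a NULL version to version 1."""
        return cls(host, binary, LEGACY_VERSION if version is None else version)

    def save(self) -> dict:
        """Stamp the current code's version and return the values to write."""
        self.version = SERVICE_VERSION
        return {"host": self.host, "binary": self.binary, "version": self.version}


# An existing record that predates the new column reads back as version 1,
# and is upgraded to version 2 the next time the service saves it.
old = Service.from_db_row("host-a", "conductor", None)
version_before_save = old.version
written = old.save()
```

Because the mapping happens when the row is read, existing service records never need to be rewritten by a data migration.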
REST API impact
Other end user impact
Other deployer impact
This will make upgrading nova services easier and more flexible/forgiving for deployers.
Developers (and reviewers) will need to ensure that the service version number is bumped across any online data migration that we do (like the recent flavor restructuring).
- Primary assignee:
- Other contributors:
Write the schema migration
Update the sqlalchemy models and service object
Write some object methods to help with querying service version numbers in ways that will be friendly for determining upgrade feasibility.
Extend the service startup code to check version spread and persist so that we can use that as a static flag for enabling migrations.
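The query helpers in the work items above might take a shape like the following sketch. It works on plain `(binary, version)` pairs rather than real Service objects, and `version_spread`, `mixed_binaries` and `minimum_version` are hypothetical names chosen for the example.

```python
from collections import defaultdict

LEGACY_VERSION = 1  # NULL in the database is read as version 1


def version_spread(records):
    """Group (binary, version) pairs into {binary: set of versions seen}."""
    spread = defaultdict(set)
    for binary, version in records:
        spread[binary].add(LEGACY_VERSION if version is None else version)
    return dict(spread)


def mixed_binaries(spread):
    """Return the binaries that are currently running more than one version."""
    return sorted(b for b, versions in spread.items() if len(versions) > 1)


def minimum_version(spread):
    """Return the oldest version across all services, or None if there are none."""
    return min((v for versions in spread.values() for v in versions), default=None)


records = [("conductor", 2), ("conductor", None), ("scheduler", 2)]
spread = version_spread(records)
```

Startup code could compute the spread once, store the result as the static flag, and refuse to enable a migration while `mixed_binaries` is non-empty or `minimum_version` is below what the migration needs.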
None. This is mostly early setup for being able to do more interesting upgrade things in M.
This should be fully testable with unit tests.
Ideally, this should make upgrades require less documentation.