Online Schema Changes

https://blueprints.launchpad.net/nova/+spec/online-schema-changes

Make schema changes execute online (ie while services are running) when safely and semantically possible. This will allow operators to reduce the amount of downtime currently required during deploys.

Problem description

  • All database migrations are currently required to be run offline.

  • Database migrations have historically been a source of lengthy downtime during deployments.

  • Schema changes are required to be repeated in two places: database model defined in Nova and writing a migration script.

Use Cases

  • Any deployer that would like to reduce the amount of downtime during deploys.

  • Developers that would like to spend less time writing migration scripts.

  • Deployers that would like to maintain a set of local schema changes.

Project Priority

This fits under the ‘Live Upgrades’ kilo priorities.

Proposed change

A new alternative workflow for applying schema changes will be added that expands the schema, then contracts the schema (expand/contract workflow).

The new expand/contract workflow will not utilize any migration scripts, instead it will dynamically compare the running schema against the database model defined in Nova. DDL statements will be generated, and optionally executed, to make the running schema match the model.

The existing schema management workflow is the ‘db sync’ command to nova-manage. This is managed by sqlalchemy-migrate and uses individual migration scripts. This workflow will remain for now, but is expected to be removed at some future time, leaving the expand/contract workflow as the way to manage the database schema.

Until the sqlalchemy-migrate workflow is removed, all schema changes will still need sqlalchemy-migrate migration scripts to be written.

Three new nova-manage commands will be added:

  1. expand. This would apply changes that are compatible with old running code.

  2. migrate. This would apply changes that are necessary to be run offline.

  3. contract. This would apply changes that are compatible with new running code.

Those schema changes that can be safely and semantically applied while running online will be applied during the expand and contract phases. Also, only those schema changes that will not acquire long running locks in the database will be considered for the online phases (expand, contract). All other schema changes will be applied during the migrate phase.

The three new commands would be built by dynamically executing alembic’s autogenerate and DDL generating features. A list of differences would be generated by alembic and then DDL statements would be generated using a separate feature of alembic.

The set of DDL statements that can be run in each phase would be dictated by the database software used (eg MySQL, PostgreSQL, etc), the version of the database software (eg MySQL 5.5, 5.6, etc) and the storage engine used (eg InnoDB, TokuDB, etc).

As an example, index additions can be executed online in MySQL 5.5, but not 5.1. An index addition would be run during the expand phase for MySQL 5.5 or higher, but during the migrate phase for MySQL 5.1.

It is intended that the initial set that will run online will be conservative at first and a subset of what is possible to run safely. This can be safely expanded at any time in the future.

Schema changes that will be potentially performed during expand: - Table creates - Column additions - Non-Unique Index additions

Schema changes that will be potentially performed during migrate: - Unique Index additions/drops - Foreign Key additions/drops

Schema changes that will be potentially performed during contract: - Table drops - Column drops - Non-Unique Index drops

Some schema changes that aren’t currently used or are difficult to automate will not be allowed initially. For instance, column type changes will not be allowed initially. This is because not all column type changes can be automated because of complexity and database restrictions. A subset of column type changes may be implemented in the future if they can be automated on all databases.

The migrate and contract phases would verify that the previous phases (expand in the case of migrate, expand and migrate in the case of contract) no longer need to be executed before continuing.

This would be performed by generating the list of needed changes for the previous phases and verifying the list is empty. This indicates the previous phases were either run or unnecessary.

A new ‘–dryrun’ argument would print, instead of execute, each generated DDL statement. This could be used by database administrators to see what would be executed for a particular phase. These can be optionally executed manually if desired. The schema synchronizer will not generate that DDL statement since the running schema does not have that difference anymore.

When ‘db contract’ is finally run and the running schema has been verified to match the models, the version in the migrate_version table would be updated to the latest shipped sqlalchemy-migrate migration version. This would maintain compatibility with the existing sqlalchemy-migrate workflow.

The fundamental difference between the two workflows is the expand/contract workflow is declarative (by using the model) and the sqlalchemy-migrate workflow is imperative (by using migration scripts).

By being declarative, it limits changes to one place (database model) and allows for more intelligent decisions (by factoring in the database software, engine, version, etc)

Alternatives

Splitting the existing single stream of migrations into three separate streams of migrations. This would allow some schema changes to be executed online.

This limits the schema changes that can be safely executed online to that of the lowest common denominator of databases supported by Nova.

This would also require changes to sqlalchemy-migrate to be able to manage separate streams of migrations.

Another option would be remove the use of sqlalchemy-migrate for schema changes altogether. The ‘db sync’ command to nova-manage would be implemented by effectively calling ‘db expand’, ‘db migrate’ and ‘db contract’.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

Running online DDL changes can affect the performance of a running system. This is optional and is only done when the deployer explicitly requests it.

This can mitigated by the deployer by scheduling the expand and contract phases to be run during periods of low activity. The expand phase can be run an arbitrary amount of time before the migrate phase. Likewise, the contract phase does not need to be run immediately after the migrate phase is run.

Other deployer impact

Using the new expand/contract workflow is optional. If the deployer does not want to perform database schema changes online, then they can continue using the ‘db sync’ command with nova-manage.

Those deployers that want to take advantage of the online schema changes will need to run the ‘db expand’, ‘db migrate’ and ‘db contract’ commands at the appropriate steps in their deployment process.

Switching from the sqlalchemy-migrate workflow to the expand/contract workflow can happen at any time. The reverse can only happen after a final ‘db contract’ is run (to ensure all schema changes are applied and the migrate_version table is updated).

If the expand/contract workflow is used, then ‘db contract’ is required to be execute once for each formal release of Nova. This is to ensure that SQL namespaces (table, column, etc) can be reused in the future.

Deployers that have made local schema changes (extra indexes, columns, tables, etc) will need to update the model to ensure those additions aren’t dropped during the contract phase.

If using the expand/contract workflow, then deployers can run ‘db expand’ before stopping or restarting any services. ‘db migrate’ might acquire locks in the database and may affect running services. ‘db contract’ can be run after all Nova services are running the new code.

Developer impact

Eventually no more sqlalchemy-migrate migrations would need to be written leading to less work for developers.

No more migration compaction. The initial creation of tables for a database is handled completely by the schema synchronizer.

Some schema changes will no longer be allowed. This is generally restricted to schema changes that cannot be reasonably automated but those schema changes are generally the ones with the most downtime anyway.

Namespaces (table, column, index, etc) are not reusable in a formal release cycle. The contract phase is only required to be executed once per formal release, pinning old names until the next formal release.

Implementation

Assignee(s)

Primary assignee:

johannes.erdfelt

Other contributors:

None

Work Items

  • Implement schema synchronizer using alembic.autogenerate

  • Implement new ‘expand’, ‘migrate’ and ‘contract’ commands to ‘nova-manage db’

  • Ensure grenade and turbo-hipster tests are update

Dependencies

This builds on top of the validate-migrations-and-model spec. The existing use of alembic.autogenerate will now also be used to generate the list of needed changes to make the schema match the model.

This also depends on dropping the use of sqlalchemy-migrate for data migrations.

Testing

No tempest tests will be added since tempest does not do any upgrade testing.

A Nova unit test will be added to test starting from an empty database.

Grenade currently tests upgrades from older versions of Nova. A new test to use the new ‘db expand’, ‘db migrate’ and ‘db contract’ commands are necessary. This will be compared with the result of ‘db sync’ to ensure that upgrades from past commits end up semantically identical.

turbo-hipster tests upgrades using production database snapshots. It currently uses the ‘db sync’ command to nova-manage. The new expand/contract workflow will be tested as well to ensure that both workflows function correctly.

Documentation Impact

Documentation will need to be updated to include the new ‘expand’, ‘migrate’ and ‘contract’ commands to ‘nova-manage db’.

Release Notes will need to be updated to warn that the model will need to be updated with local schema changes.

Instance Types would need to be manually created as the 216 migration would not necessarily run anymore.

References

https://etherpad.openstack.org/p/kilo-nova-zero-downtime-upgrades