Nova Specs

Example Spec - The title of your blueprint

Wed, 14 Jan 2026 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Introduction paragraph – why are we doing anything? A single paragraph of prose that operators can understand. The title and this first paragraph should be used as the subject line and body of the commit message respectively.

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second to agree on the overall approach to solving that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it's possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in reStructuredText, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ASCII diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ASCII diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which cannot be viewed in Gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z
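
The filename and line-wrapping conventions in the notes above can be checked mechanically before posting a spec for review. The following is only a minimal sketch: the helper names spec_filename and overlong_lines are hypothetical and are not part of any nova-specs tooling.

```python
from urllib.parse import urlparse


def spec_filename(blueprint_url: str) -> str:
    """Derive the expected spec filename from a Launchpad blueprint URL."""
    # The blueprint name is the last path component, after "+spec/".
    name = urlparse(blueprint_url).path.rstrip("/").rsplit("/", 1)[-1]
    return f"{name}.rst"


def overlong_lines(text: str, limit: int = 79) -> list[int]:
    """Return the 1-based numbers of lines longer than ``limit`` columns."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

For the example URL given above, spec_filename returns awesome-thing.rst. Long URLs legitimately exceed 79 columns, so treat the output of overlong_lines as a hint rather than a hard failure.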

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

At this point, if you would like to just get feedback on if the problem and proposed change fit in nova, you can stop here and post this for review to get preliminary feedback. If so please say: Posting to get preliminary feedback on the scope of this spec.

Alternatives

What other ways could we do this thing? Why aren’t we using those? This doesn’t have to be a full literature review, but it should demonstrate that thought has been put into why the proposed solution is an appropriate one.

Data model impact

Changes which require modifications to the data model often have a wider impact on the system. The community often has strong opinions on how the data model should be evolved, from both a functional and performance perspective. It is therefore important to capture and gain agreement as early as possible on any proposed changes to the data model.

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated? For example, if you need to take into account existing instances, or modify other existing data, describe how that will work.

REST API impact

Each API method which is either added or changed should have the following:

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal HTTP response code(s)
- Expected error HTTP response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - The URL should not include underscores; use hyphens instead.
- Parameters which can be passed via the URL
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Note that the schema should be defined as restrictively as possible. Parameters which are required should be marked as such, and only under exceptional circumstances should additional parameters which are not defined in the schema be permitted (e.g. additionalProperties should be False).

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.
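
As an illustration of a restrictive schema, here is a minimal sketch of a request body definition written as a Python dictionary. The tag_server action and its tag_name field are hypothetical and do not exist in Nova. The check_object helper is a tiny hand-rolled stand-in for a real JSON Schema validator, included only so the example runs with the standard library; it understands nothing beyond nested objects, required and additionalProperties.

```python
# Hypothetical request body schema for an imaginary "tag_server" action.
tag_server_request = {
    "type": "object",
    "properties": {
        "tag_server": {
            "type": "object",
            "properties": {
                # snake_case field name, bounded length.
                "tag_name": {"type": "string", "minLength": 1, "maxLength": 60},
            },
            "required": ["tag_name"],
            "additionalProperties": False,
        },
    },
    "required": ["tag_server"],
    "additionalProperties": False,
}


def check_object(schema: dict, body: dict) -> list[str]:
    """Report missing required keys and keys the schema does not define."""
    errors = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in body:
            errors.append(f"missing required property: {key}")
    if schema.get("additionalProperties") is False:
        for key in body:
            if key not in properties:
                errors.append(f"unexpected property: {key}")
    # Recurse into nested objects that are present in the body.
    for key, sub_schema in properties.items():
        value = body.get(key)
        if sub_schema.get("type") == "object" and isinstance(value, dict):
            errors.extend(check_object(sub_schema, value))
    return errors
```

With additionalProperties set to False at every level, a misspelled or CamelCase key such as tagName is rejected instead of being silently ignored.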

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to login?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

For more detailed guidance, please see the OpenStack Security Guidelines as a reference (https://wiki.openstack.org/wiki/Security/Guidelines). These guidelines are a work in progress and are designed to help you identify security best practices. For further information, feel free to reach out to the OpenStack Security Group at openstack-security@lists.openstack.org.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds fields to or removes fields from the API responses. For example, when the feature adds a new field to the GET /servers API response, consider adding similar information to the payload of the instance action notifications.
When the feature adds a new action to the existing API entities. For example, adding a new action to the server might mean you want to emit a corresponding new instance action notification.
When the feature adds a new resource (noun) to the REST API, consider adding new notifications about the creation and deletion of such a resource.

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?
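
The scaling arguments in the first two items above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is purely illustrative: the function names and the numbers fed to them are assumptions, not measurements from any deployment.

```python
def added_filter_latency_ms(hosts: int, per_host_latency_ms: int) -> int:
    """Extra time per scheduling request for a filter that runs once per host."""
    return hosts * per_host_latency_ms


def periodic_calls_per_minute(nodes: int, interval_seconds: int) -> float:
    """Calls per minute reaching a central service when every node runs the task."""
    return nodes * 60 / interval_seconds
```

Under these assumptions, a filter that adds 2 ms per host costs 2000 ms per request on a 1000-host deployment, and a task that runs every 60 seconds on 500 nodes produces 500 calls per minute against the service it contacts.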

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it's merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others, for example extending the size of an attached volume. Nova does not yet have a sophisticated capabilities API that tells a user which actions can be performed on a given instance. Consider adding a new policy rule so that operators who cannot support a virt-specific feature can at least disable it in their cloud; the user then gets an understandable 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.
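
The minimum compute service version check described in the first item above can be sketched as follows. This is a simplified illustration only: the constant, its value and the function name are hypothetical, and the sketch takes a plain list of version numbers rather than using Nova's own service objects.

```python
# Hypothetical: the service version at which compute hosts gained the feature.
MIN_COMPUTE_VERSION_FOR_FEATURE = 42


def feature_usable(compute_service_versions: list[int]) -> bool:
    """Return True only when every compute service is new enough."""
    if not compute_service_versions:
        # No compute services reported; do not assume the feature works.
        return False
    return min(compute_service_versions) >= MIN_COMPUTE_VERSION_FOR_FEATURE
```

During a rolling upgrade, a single compute host still on the older version keeps the feature disabled, which is the conservative behaviour a controller service would usually want.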

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However, we suggest finding a liaison for your feature, as it will help get your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Work items or tasks – break the feature up into the things that need to be done to implement it. Those parts might end up being done by different people, but we’re mostly trying to understand the timeline for implementation.

Consider creating an ordering of patches that allows them to be merged gradually instead of all at once. For example, if you are introducing a feature that requires implementation changes in multiple VM lifecycle operations, first add a step that rejects all the not yet supported actions with an HTTP 400 Bad Request. The error should explain that the <operation> is not supported with <feature> at this time. Then gradually remove the limitation as you progress with the implementation. This way we can merge your changes gradually and, regardless of when the feature freeze hits, we can be sure that the system is consistent.
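
The "reject first, then relax" ordering described above could look roughly like the sketch below. Everything in it is hypothetical: the BadRequest class stands in for whatever HTTP 400 exception the API layer uses, and the operation names and the feature name are placeholders.

```python
class BadRequest(Exception):
    """Stand-in for the API layer's HTTP 400 Bad Request exception."""

    status_code = 400


# Hypothetical set of operations not yet supported with the new feature.
# Each follow-up patch removes an entry once the operation is implemented.
UNSUPPORTED_WITH_FEATURE = {"resize", "shelve", "evacuate"}


def reject_if_unsupported(operation: str, uses_feature: bool,
                          feature: str = "example-feature") -> None:
    """Raise BadRequest for operations the feature does not support yet."""
    if uses_feature and operation in UNSUPPORTED_WITH_FEATURE:
        raise BadRequest(
            f"{operation} is not supported with {feature} at this time")
```

Keeping the unsupported operations in one place makes each later patch small: it implements one operation and deletes one entry, and the system stays consistent wherever the series stops.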

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of a library?

Testing

Please discuss the important scenarios that need to be tested here, as well as specific edge cases that we should ensure work correctly. For each scenario, please specify whether it requires specialized hardware, a full OpenStack environment, or can be simulated inside the Nova tree.

Please discuss how the change will be tested. We especially want to know what tempest tests will be added. It is assumed that unit test coverage will be added so that doesn’t need to be mentioned explicitly, but discussion of why you think unit tests are sufficient and we don’t need to add more tempest tests would need to be included.

Is this untestable in the gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc.)?

If this is adding a new API microversion which alters a response schema, we will also need a corresponding change in Tempest to add the new schema under tempest/lib/api_schema/. This is required in order for the new microversion to be used in Tempest tests. Otherwise, new microversion requests will fail response schema validation in Tempest.

Documentation Impact

Which audiences are affected most by this change, and which documentation titles on docs.openstack.org should be updated because of this change? Don’t repeat details discussed above, but reference them here in the context of documentation for multiple audiences. For example, the Operations Guide targets cloud operators, and the End User Guide would need to be updated if the change offers a new feature available through the CLI or dashboard. If a config option changes or is deprecated, note here that the documentation needs to be updated to reflect this specification’s change.

References

Please add any useful references here. You are not required to have any reference. Moreover, this specification should still make sense when your references are unavailable. Examples of what you could include are:

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section intended to be used each time the spec is updated, to describe a new design or any updates to the API or database schema. Useful to let the reader understand what has happened over time.

Revisions
Release Name	Description
2026.2 Hibiscus	Introduced

Graceful Shutdown of Nova Services: Part1

Thu, 04 Dec 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/nova-services-graceful-shutdown-part1

This spec proposes part 1 of the graceful shutdown backlog spec for the 2026.1 cycle.

Nova services do not shut down gracefully. Stopping a service also stops all of its in-progress operations, which not only interrupts them but can leave instances in an unwanted or unrecoverable state. The idea is to let services stop accepting new requests but complete the in-progress operations before the service is terminated.

Problem description

Nova services have no way to shut down gracefully, meaning they do not wait for in-progress operations to complete. When shutdown is initiated, services stop the RPC server and wait on it so that they can consume the existing request messages (RPC call/cast) from the queue, but they do not wait for the resulting operations to complete.

Each Nova compute service runs a single worker that listens on a single RPC server (topic: compute.<host>). The same RPC server is used for new requests as well as for in-progress operations in which other compute or conductor services communicate with it. When shutdown is initiated, the RPC server is stopped, which means it stops handling new requests. That is fine, but at the same time it stops the communication needed for in-progress operations. For example, while a live migration is in progress, the source and destination computes communicate with each other multiple times (both synchronously and asynchronously). Once the RPC server on a compute service is stopped, that compute can no longer communicate with the other compute, and the live migration fails. This leaves the system as well as the instance in an unwanted or unrecoverable state.

Use Cases

As an operator, I want to be able to gracefully shut down (SIGTERM) the Nova services so that the shutdown does not impact users’ in-progress operations and keeps resources in a usable state.

As an operator, I want to be able to keep instances and other resources in a usable state even if a service is gracefully terminated (SIGTERM).

As an operator, I want to get the actual benefits of k8s pod graceful shutdown when Nova services are running in k8s pods.

As a user, I want in-progress operations to be completed before the service is gracefully terminated (SIGTERM).

Proposed change

For detailed context, refer to the graceful shutdown backlog spec.

Split the new and in-progress requests via RPC:

RPC communication is essential for services to finish a particular operation. During shutdown, we need to make sure we keep the required RPC servers/buses up. If we stop the RPC communication, the shutdown is no different from service termination.

Nova uses RPC server start, stop, and wait, and this spec refers to them a lot, so let’s cover them briefly from the point of view of oslo.messaging/RPC resources to make this proposal easier to understand. If you already know this, you can skip this section.

RPC server:
- creation and start():
  - It will create the required resources on oslo.messaging side, for example, dispatcher, consumer, listener, and queues.
  - It will handle the binding to the required exchanges.
- stop():
  - It will disable the listener’s ability to pick up any new message from the queue, but will dispatch the already picked messages to the dispatcher.
  - It will delete the consumer.
  - It will not delete the queues and exchange on the message broker side.
  - It will not stop RPC clients from sending new messages to the queue; however, those messages will not be picked up because the consumer and listener are stopped.
- wait():
  - It will wait for the thread pool to finish dispatching all the already picked messages. Basically, this will make sure methods are called on the manager.
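
The stop() and wait() behaviour described above can be sketched with a toy model. This is only an illustration built on the Python standard library: ToyRPCServer, broker_queue, and the message names are hypothetical and are not the oslo.messaging API. It shows that a message picked up before stop() is still handled, while a message sent after stop() stays in the queue.

```python
import queue
import threading
import time


class ToyRPCServer:
    """Toy stand-in for an RPC server; NOT the oslo.messaging API."""

    def __init__(self, broker_queue, endpoint):
        self.broker_queue = broker_queue  # outlives stop(), like a broker queue
        self.endpoint = endpoint          # plays the role of the manager method
        self._stopped = threading.Event()
        self._workers = []
        self._listener = None

    def start(self):
        self._listener = threading.Thread(target=self._listen)
        self._listener.start()

    def _listen(self):
        while not self._stopped.is_set():
            try:
                message = self.broker_queue.get(timeout=0.01)
            except queue.Empty:
                continue
            # A message that was already picked up is always dispatched.
            worker = threading.Thread(target=self.endpoint, args=(message,))
            worker.start()
            self._workers.append(worker)

    def stop(self):
        # Stop picking up new messages; the queue itself is left untouched.
        self._stopped.set()
        self._listener.join()

    def wait(self):
        # Block until every dispatched message has been handled.
        for worker in self._workers:
            worker.join()


handled = []


def endpoint(message):
    time.sleep(0.05)  # simulate a slow manager method
    handled.append(message)


broker_queue = queue.Queue()
server = ToyRPCServer(broker_queue, endpoint)
server.start()

broker_queue.put("build_instance")   # hypothetical message name
while broker_queue.qsize() > 0:      # wait until the listener picked it up
    time.sleep(0.01)

server.stop()
broker_queue.put("late_request")     # clients can still send after stop()
server.wait()
```

After wait() returns, handled contains only the first message and the late message is still sitting in broker_queue, which mirrors the point that stop() does not delete the queue or block senders.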

Per-service analysis and the proposed RPC design changes:

The services listed below communicate with other Nova services’ RPC servers. Since they do not have their own RPC server, no change is needed:
- Nova API
- Nova metadata API
- nova-novncproxy
- nova-serialproxy
- nova-spicehtml5proxy
Nova scheduler: No RPC change needed.
- Requests handling: The Nova scheduler service runs as multiple workers, each having its own RPC server, but all the Nova scheduler workers listen to the same RPC topic and queue (scheduler) in a fanout way.
  
  Currently, nova.service.py->stop() calls stop() and wait() on the RPC server. Once the RPC server is stopped, it stops listening for new messages. This does not affect the other scheduler workers, which continue listening to the same queue and processing requests. If any scheduler worker is stopped, the other workers will process the requests.
- Response handling: Whenever there is an RPC call, oslo.messaging creates a separate reply queue tied to the unique message id. This reply queue is used to send the RPC call response to the caller. Even if the RPC server is stopped on this worker, the reply queue is not affected.
  
  We still need to keep the worker up until all the responses are sent via the reply queue, and for that, we need to implement the in-progress task tracking in scheduler services, but that will be handled in step 2.
This way, stopping a Nova scheduler worker will not impact the RPC communication on the scheduler service.
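
As a rough illustration of why stopping one worker does not interrupt request handling, the sketch below models several workers consuming from one shared queue. It uses only the Python standard library; the worker names and request strings are hypothetical, and a real deployment goes through oslo.messaging and a message broker rather than an in-process queue.

```python
import queue
import threading

shared_queue = queue.Queue()  # stands in for the shared scheduler queue
results = []
results_lock = threading.Lock()


def worker(name, stop_event):
    # Each worker keeps pulling requests until it is told to stop.
    while not stop_event.is_set():
        try:
            request = shared_queue.get(timeout=0.01)
        except queue.Empty:
            continue
        with results_lock:
            results.append((name, request))
        shared_queue.task_done()


stops = {name: threading.Event() for name in ("worker-1", "worker-2")}
threads = {
    name: threading.Thread(target=worker, args=(name, event))
    for name, event in stops.items()
}
for thread in threads.values():
    thread.start()

# Stop only worker-1, as if that worker had received SIGTERM.
stops["worker-1"].set()
threads["worker-1"].join()

# Requests sent afterwards are still processed, now by worker-2 alone.
for index in range(3):
    shared_queue.put(f"request-{index}")
shared_queue.join()

stops["worker-2"].set()
threads["worker-2"].join()
```

All three requests end up in results with worker-2 as the handler, which is the property the scheduler analysis relies on.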
Nova conductor: No RPC change needed.

The Nova conductor binary is a stateless service that can spawn multiple worker threads. Each instance of the Nova conductor has its own RPC server, but all the Nova conductor instances listen to the same RPC topic and queue (conductor). This allows the conductor instances to act as a distributed worker pool, so that stopping an individual conductor instance does not impact the RPC communication for the pool and other available workers can process the request. Each cell has its own pool of conductors, meaning that as long as one conductor is up for any given cell, the RPC communication will continue to function even when one or more conductors are stopped.

The request and response handling is done in the same way as mentioned for the scheduler.

Note

This spec does not cover the conductor single worker case. That case might also require an RPC redesign for the conductor, but it needs more investigation.

Nova compute: RPC design change needed
- Request handling: Nova compute runs as a single worker per host, and each compute has its own RPC server, listener, and separate queues. It handles new requests as well as the communication needed for in-progress operations on the same RPC server. To achieve graceful shutdown, we need to separate the communication for new requests from the communication for in-progress operations. This will be done by adding a new RPC server in the compute service.
  
  For easy readability, we will be using a different term for each RPC server:
  - ‘ops RPC server’: This will be used for the new RPC server, which will be used to finish the in-progress requests and will stay up during shutdown.
  - ‘new request RPC server’: This will be used for the current RPC server, which is used for the new requests and will be stopped during shutdown.
- ‘new request RPC server’ per compute: No change in this RPC server, but it will be used for all the new requests, so that we can stop it during shutdown and stop the new requests on the compute.
- ‘ops RPC server’ per compute:
  - Each compute will have a new ‘ops RPC server’ which will listen to a new topic compute-ops.<host>. compute-ops name is used because it is mainly for compute operations, but a better name can be used if needed.
  - It will use the same transport layer/bus and exchange that the ‘new request RPC server’ uses.
  - It will create its own dispatcher, listener, and queue.
  - Both RPC servers will be bound to the same endpoints (same compute manager), so that requests coming from either server are handled by the same compute manager.
  - This server will be mainly used for the compute-to-compute operations and server external events. The idea is to keep this RPC server up during shutdown so that the in-progress operations can be finished.
  - During shutdown, nova.service will wait for the compute manager to report that it has finished all its tasks, so that it can stop the ‘ops RPC server’ and finish the shutdown.
- Response handling: Regardless of which RPC server a request arrives on, whenever there is an RPC call, oslo.messaging creates a separate reply queue tied to the unique message id. This reply queue is used to send the RPC call response to the caller. Even if the RPC server is stopped on this worker, the reply queue is not affected.
- Compute service workflow:
  - The SIGTERM signal is handled by oslo.service, which calls stop on nova.service.
  - nova.service will stop the ‘new request RPC server’ so that no new requests are picked up by the compute. The ‘ops RPC server’ stays up and running.
  - nova.service will wait for the manager to signal once all in-progress operations are finished.
  - Once the compute manager signals nova.service, it will stop the ‘ops RPC server’ and proceed with the service shutdown.
- RPC client:
  - The RPC client stays a singleton class, created with the topic compute.<host>, meaning that by default messages are sent via the ‘new request RPC server’.
  - Any RPC cast/call that needs to send a message via the ‘ops RPC server’ must override the topic to compute-ops.<host> in the client.prepare() call.
  - If the RPC client detects an old compute (based on version_cap), it will fall back to sending the message to the ‘new request RPC server’ topic compute.<host>.
  - Which RPC casts/calls use the ‘ops RPC server’ will be decided during implementation, when we can better judge which methods are used by the operations we want to finish during shutdown. A draft list of where we can use the ‘ops RPC server’:
    
    Note
    
    This is a draft list and can be changed during implementation.
    - Migrations:
      - Live migration:
        
        Note
        
        We will use the ‘new request RPC server’ for the check_can_live_migrate_destination and check_can_live_migrate_source methods, as this is the very initial phase, where the compute service has not yet started the live migration. If shutdown is initiated before the live migration request arrives, the migration should be rejected.
        
        pre_live_migration()
        
        live_migration()
        
        prep_snapshot_based_resize_at_dest()
        
        remove_volume_connection()
        
        post_live_migration_at_destination()
        
        rollback_live_migration_at_destination()
        
        drop_move_claim_at_destination()
      - resize methods
      - cold migration methods
    - Server external event
    - Rebuild instance
    - validate_console_port(): When the console has already been requested and port validation is in progress, the compute should finish it before shutdown so that users get their requested console.
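
The topic choice described for the RPC client can be sketched as below. This is a stdlib-only illustration, not Nova or oslo.messaging code: the Target class, select_target() and the OPS_MIN_VERSION value are hypothetical names and numbers chosen for the example. In a real client the selected topic would be passed to client.prepare().

```python
from dataclasses import dataclass
from typing import Optional

RPC_TOPIC = "compute"          # topic of the 'new request RPC server'
RPC_TOPIC_OPS = "compute-ops"  # topic of the 'ops RPC server'
# Hypothetical: first compute RPC version that has the 'ops RPC server'.
OPS_MIN_VERSION = (6, 5)


@dataclass(frozen=True)
class Target:
    """Toy stand-in for a messaging target; not the oslo.messaging class."""
    topic: str
    server: str

    def queue_name(self) -> str:
        # Mirrors the <topic>.<host> naming used in this spec.
        return f"{self.topic}.{self.server}"


def _parse(version: str) -> tuple:
    # "6.10" -> (6, 10), so versions compare numerically, not as strings.
    return tuple(int(part) for part in version.split("."))


def select_target(host: str, use_ops_server: bool,
                  version_cap: Optional[str] = None) -> Target:
    """Pick the topic for a single cast/call.

    Falls back to the default topic when the version cap indicates
    that an old compute (without the 'ops RPC server') may be present.
    """
    if use_ops_server:
        if version_cap is None or _parse(version_cap) >= OPS_MIN_VERSION:
            return Target(RPC_TOPIC_OPS, host)
    return Target(RPC_TOPIC, host)
```

With these assumptions, select_target("host1", True) resolves to compute-ops.host1, while the same call with version_cap="6.4" falls back to compute.host1.
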
Time based waiting for services to finish the in-progress operations:

Note

The time-based waiting is a temporary solution. Later, it will be replaced by proper tracking of in-progress tasks.
- To make the graceful shutdown less complicated, this spec proposes a configurable time-based waiting for services to complete their operations.
- The wait time should be less than the global graceful shutdown timeout, so that neither an external system nor oslo.service shuts down the service before the service wait time is over.
Some specific examples of the shutdown issues which will be solved by this proposal:
- Migrations:
  - Migration operations will use the ‘ops RPC server’.
    - If a migration is in progress, the service shutdown will not terminate it; instead, the shutdown will wait for the migration to complete.
    - Later, we will make long-running migrations abort, but that is out of scope for this spec.
  - Instance boot:
    - Instance boot operations will continue to use the ‘new request RPC server’. Otherwise, we will not be able to stop the new requests.
    - If instance boot requests are in progress by compute services, then shutdown will wait for compute to boot them successfully.
    - The instance external event will be received during graceful shutdown; therefore, an instance boot request will not be blocked for the external event.
    - If a new instance boot request arrives after the shutdown is initiated, then it will stay in the queue, and the compute will handle it once it is started again.
  - Any operation that has reached the compute will be completed before the service is shut down.

Note

Testing so far (eventlet mode) shows that no change is required in oslo.messaging, but we still need to test it by running compute in native thread mode (with the oslo.service threading backend).

Graceful Shutdown Timeouts:

Nova service timeout:

We need two configurable timeouts in Nova:
1. Overall Shutdown Timeouts:
  - oslo.service already has a timeout (graceful_shutdown_timeout), which is configurable per service and used to time out the SIGTERM signal handler.
  - oslo.service will terminate the Nova service based on graceful_shutdown_timeout, even if the Nova service graceful shutdown is not finished.
  - Its default value is 60 seconds, which is too short for Nova services. The proposal is to override the default to 180 seconds for all Nova services.
  - The operator can override this value per Nova service.
2. Timeout for Nova service to finish the in-progress tasks:
  - When shutdown is initiated, each service needs to finish its in-progress tasks, which can take time, and we have to time that out before the oslo.service graceful_shutdown_timeout is reached.
  - We need this timeout because, after finishing the in-progress tasks, Nova services need to call cleanup_host() on the manager, which also needs some time to finish. Without this timeout, if a service takes too long to finish its in-progress tasks, the oslo.service graceful_shutdown_timeout will not leave time for cleanup_host() to be executed.
  - We need to add this configurable timeout option per Nova service, and its default value should be lower than graceful_shutdown_timeout.
External system timeout:

Depending on how Nova services are deployed, there might be an external system timeout for graceful shutdown (for example, when Nova runs in k8s pods). That can impact the Nova graceful shutdown, so we need to document clearly that if there is an external system timeout, the Nova service graceful_shutdown_timeout should be set accordingly. The external system timeout should be higher than graceful_shutdown_timeout; otherwise, the external system will time out and interrupt the Nova graceful shutdown.
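
The ordering between the three timeouts can be checked with a small helper like the one below. This is only a sketch: check_shutdown_timeouts() and the task_wait parameter name are hypothetical, since the option for the in-progress task timeout is not named in this spec, and the 150 and 200 second values are illustrative.

```python
from typing import List, Optional


def check_shutdown_timeouts(task_wait: float,
                            graceful_shutdown_timeout: float,
                            external_timeout: Optional[float] = None
                            ) -> List[str]:
    """Return a list of problems with the configured timeouts (seconds).

    task_wait: time a service may spend finishing in-progress tasks
        (hypothetical name for the new per-service option).
    graceful_shutdown_timeout: the existing oslo.service option.
    external_timeout: timeout of an external system, if any.
    """
    problems = []
    if task_wait >= graceful_shutdown_timeout:
        problems.append(
            "task wait must be lower than graceful_shutdown_timeout, "
            "otherwise cleanup_host() may not get time to run")
    if (external_timeout is not None
            and external_timeout <= graceful_shutdown_timeout):
        problems.append(
            "external system timeout must be higher than "
            "graceful_shutdown_timeout")
    return problems


# 180 is the proposed graceful_shutdown_timeout default; the other two
# values are illustrative only.
print(check_shutdown_timeouts(150, 180, external_timeout=200))
```
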

Alternatives

One alternative for the RPC redesign is to handle two topics per RPC server. This needs a good amount of change in the oslo.messaging framework as well as in the driver implementations. The idea is to allow the oslo.messaging Target to take more than one topic (topic as a list) and ask the driver to create separate consumers, listeners, dispatchers, and queues for each topic, and to bind each topic to the exchange. This also requires oslo.messaging to provide a new way to let the RPC server unsubscribe from a particular topic and continue listening on the other topics. We would also need to redesign how the RPC server stop() and wait() currently work. This is too complicated and amounts to redesigning the oslo.messaging RPC concepts.

One more alternative is to track and stop sending requests from the Nova API or the scheduler service, but that would not stop all new requests (compute-to-compute tasks), nor would it let in-progress operations complete.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

This should have a positive impact on end users, because a service shutdown will no longer stop their in-progress operations.

Performance Impact

No impact on normal operations, but the service shutdown will take more time. There is a configurable timeout to control the service shutdown wait time.

Other deployer impact

None other than a longer shutdown process, but they can configure an appropriate timeout for service shutdown.

Developer impact

None

Upgrade impact

Adding a new RPC server will impact the upgrade. The old compute will not have the new ‘ops RPC server’ listening on topic RPC_TOPIC_OPS, so we need to handle it with RPC versioning. If the RPC client detects an old compute (based on version_cap), it will fall back to sending the message to the original RPC server (listening on compute.<host>). Therefore, graceful shutdown will not work on new compute nodes until all the computes are upgraded and the RPC version_cap is removed.

Implementation

Assignee(s)

Primary assignee:

gmaan

Other contributors:

None

Feature Liaison

None

Work Items

Implement the ‘ops RPC server’ on the compute service
Use the ‘ops RPC server’ for the operations we need to finish during shutdown, for example, compute-to-compute tasks and server external events.
RPC versioning due to upgrade impact.

Dependencies

Eventlet removal for all Nova services: We need to make sure that graceful shutdown works in native threading mode, so we need to wait until all compute services are moved to native threading mode. That will exercise oslo.service with the threading backend.
oslo.service threading backend needs to consider the configurable graceful_shutdown_timeout.

Testing

We cannot write tempest tests for this because tempest will not be able to stop the services.
We can try some testing in the ‘post-run’ phase (with a heavy live migration that takes time), as is done for the evacuate tests.
Unit and functional tests will be added.

Documentation Impact

How graceful shutdown works will be documented, along with other considerations, for example, the timeout or wait time used for the graceful shutdown.

References

PoC:
- Code change: https://review.opendev.org/c/openstack/nova/+/967261
- PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
PTG discussions:

History

Revisions
Release Name	Description
2026.1 Gazpacho	Introduced

Graceful Shutdown of Nova Services

Wed, 03 Dec 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/nova-services-graceful-shutdown

This is a backlog spec proposing the design of graceful shutdown.

Problem description

Each Nova compute service has a single worker running and listening on a single RPC server (topic: compute.<host>). The same RPC server is used for new requests as well as for in-progress operations where other compute or conductor services communicate. When shutdown is initiated, the RPC server is stopped, which means it stops handling new requests. That is ok, but at the same time it stops the communication needed for the in-progress operations. For example, if a live migration is in progress, the source and destination computes communicate with each other multiple times (both synchronously and asynchronously). Once the RPC server on the compute service is stopped, it cannot communicate with the other compute, and the live migration fails. This can leave the system as well as the instance in an unwanted or unrecoverable state.

Use Cases

As an operator, I want to be able to gracefully shut down (SIGTERM) the Nova services so that the shutdown does not impact users’ in-progress operations and keeps resources in a usable state.

As an operator, I want to be able to keep instances and other resources in a usable state even if a service is gracefully terminated (SIGTERM).

As an operator, I want to be able to take full advantage of the k8s pod graceful shutdown when Nova services are running in k8s pods.

As a user, I want in-progress operations to be completed before the service is gracefully terminated (SIGTERM).

Proposed change

Scope: The proposed solution is to gracefully shut down the services on the SIGTERM signal.

The graceful shutdown is based on the following design principles:

When service shutdown is initiated by SIGTERM:
- Do not process any new requests
- New requests should not be lost. Once service is restarted, it should process the requests.
- Allow in-progress operations to reach their quickest safe termination point, either completion or abort.
- Proper logging of the state of in-progress operations
- Keep instances or other resources in a usable state
When service shutdown is completed:
- Proper logging of unfinished operations. Ideally, all the in-progress operations should be completed before the service is terminated, but if the graceful shutdown times out (due to a configured timeout; the timeout details are in a later section), then all the unfinished operations should be properly logged. This will help to recover the system or instances.
When service is started again:
- Start processing the new requests in the normal way.
- If requests were not processed because the shutdown was initiated, they stay in the message broker queue, and there are multiple possibilities:
  - Requests might have been picked up by another worker of that service. For example, you can run more than one Nova scheduler (or conductor) worker. If one of the workers is shutting down, another worker will process the request. This is not the case for Nova compute, which always has a single worker per compute service on a specific host.
  - If a service has a single worker running, the request can be picked up once the service is up again.
  - There is an opportunity for the compute service to clean up or recover interrupted operations on instances during init_host(). The action taken will depend on the task and its status.
  - If the service is in the stopped state for a long time, then based on the RPC and message queue timeouts, there is a chance that:
    - The RPC client or server will timeout the call.
    - The message broker queue may drop messages due to timeout.
    - The order of requests and messages can be stale.

As a graceful shutdown goal, we need to do two things:

A way to stop new requests without interrupting in-progress operations. This is proposed to be done via RPC.
Give services enough time to finish their operations. As a first step, this is proposed to be done via a time-based wait, and later with a proper tracking mechanism.

This backlog spec proposes achieving the above goals in multiple steps. Each step will be proposed as a separate spec for a specific release.

The Nova services which already shut down gracefully:

For the services below, graceful shutdown is handled by their deployment servers or the libraries they use.

Nova API & Nova metadata API:

Those services are deployed using a server with WSGI support. That server ensures that the Nova API services shut down gracefully, meaning they finish the in-progress requests and reject new requests.

I investigated this with uWSGI/mod_proxy_uwsgi (devstack env). On service start, the uWSGI server pre-spawns a number of workers for the API service, which handle the API requests in a distributed way. When shutdown is initiated by SIGTERM, the uWSGI server SIGTERM handler checks whether there are any in-progress requests on any worker. It waits for all the workers to finish their requests and then terminates each worker. Once all workers are terminated, it terminates the Nova API service.

If any new request arrives after the shutdown is initiated, it will be rejected with a “503 Service Unavailable” error.

Testing:

I tested two types of requests:
1. Sync request: ‘openstack server list’:
  - To observe the graceful shutdown, I added 10 seconds of sleep in the server list API code.
  - Start an API request ‘request1’: openstack server list
  - Wait till the server list request reaches the Nova API (you can see the log from the controller)
  - Because of sleep(10), the server list takes time to finish.
  - Initiate the Nova API service shutdown.
  - Start a new API request ‘request2’: openstack server list. This new request arrives after shutdown is initiated, so it should be denied.
  - Nova API service will wait because ‘request1’ is not finished.
  - ‘request1’ will get the response of the server list before the service is terminated.
  - ‘request2’ is denied and will receive the error “503 Service Unavailable”
2. Async request: openstack server pause <server>:
  - To observe the graceful shutdown, I added 10 seconds of sleep in the server pause API code.
  - Start an API request ‘request1’: openstack server pause server1
  - Wait till the pause server request reaches the Nova API (you can see the log from the controller)
  - Because of sleep(10), the pause server takes time to finish.
  - Initiate the Nova API service shutdown.
  - Service will wait because ‘request1’ is not finished.
  - Nova API will make an RPC cast to the Nova compute service and return.
  - ‘request1’ is completed, and the response is returned to the user.
  - Nova API service is terminated now.
  - Nova compute service is operating the pause server request.
  - Check whether the server is paused: openstack server list
  - You can see the server is paused.
Nova console proxy services: nova-novncproxy, nova-serialproxy, and nova-spicehtml5proxy:

All the console proxy services run as websockify.websocketproxy service. The websockify library handles the SIGTERM signal and the graceful shutdown, which is enough for the Nova services.

When a user accesses the console, the websockify library starts a new process in start_service and calls Nova’s new_websocket_client. Nova authorizes the token and creates a socket on the host and port, which is used to send the data/frames. After that, the user can access the console.

If a shutdown request is initiated, websockify handles the signal. First, it terminates all the child processes and then raises the terminate exception, which ends up calling the Nova close_connection method. The Nova close_connection method calls shutdown() on the socket first and then close(), which makes sure the remaining data/frames are sent before the socket is closed.

This way, user console sessions are terminated gracefully, and users get a “Disconnected” message. Once the service is up, the user can refresh the browser, and the console will be up again (if the token has not expired).

Spec 1: Split the new and in-progress requests via RPC:

RPC server:
- creation and start():
  - It will create the required resources on oslo.messaging side, for example, dispatcher, consumer, listener, and queues.
  - It will handle the binding to the required exchanges.
- stop():
  - It will disable the listener’s ability to pick up any new message from the queue, but will dispatch the already picked messages to the dispatcher.
  - It will delete the consumer.
  - It will not delete the queues and exchange on the message broker side.
  - It will not stop RPC clients from sending new messages to the queue; however, they will not be picked up because the consumer and listener are stopped.
- wait():
  - It will wait for the thread pool to finish dispatching all the already picked messages. Basically, this will make sure methods are called on the manager.
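
The stop() and wait() behaviour described above can be illustrated with a toy server built only on the Python standard library. This is a simulation of the described semantics, not oslo.messaging code: ToyRPCServer and demo() are hypothetical names, and a queue.Queue stands in for the message broker queue.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyRPCServer:
    """Stdlib-only stand-in for an RPC server (illustration only)."""

    def __init__(self, broker_queue, endpoint):
        self._queue = broker_queue  # "broker side": outlives the server
        self._endpoint = endpoint
        self._lock = threading.Lock()
        self._stopped = False
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._listener = threading.Thread(target=self._listen)

    def start(self):
        self._listener.start()

    def _listen(self):
        while True:
            with self._lock:
                if self._stopped:
                    return
                try:
                    message = self._queue.get_nowait()
                except queue.Empty:
                    message = None
            if message is None:
                time.sleep(0.01)
                continue
            # Already picked messages are always handed to the dispatcher.
            self._pool.submit(self._endpoint, message)

    def stop(self):
        # Stop picking new messages; the queue itself is left untouched.
        with self._lock:
            self._stopped = True

    def wait(self):
        # Wait until every already picked message has been handled.
        self._listener.join()
        self._pool.shutdown(wait=True)


def demo():
    broker = queue.Queue()
    handled = []
    picked = threading.Event()

    def endpoint(message):
        picked.set()
        time.sleep(0.1)  # simulate a slow manager method
        handled.append(message)

    server = ToyRPCServer(broker, endpoint)
    broker.put("m1")
    server.start()
    picked.wait(timeout=2)  # m1 is now in progress
    server.stop()
    broker.put("m2")        # sent after stop(): stays in the queue
    server.wait()
    return handled, broker.qsize()
```

In this sketch, demo() returns only m1 as handled and leaves one message (m2) in the queue, which matches the behaviour listed above: the in-flight message is finished, and a message sent after stop() stays queued until a server is started again.
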

Analysis per service and the proposed RPC design change:

The services listed below communicate with other Nova services’ RPC servers. Since they do not have their own RPC server, no change is needed:
- Nova API
- Nova metadata API
- nova-novncproxy
- nova-serialproxy
- nova-spicehtml5proxy
Nova scheduler: No RPC change needed.
- Request handling: The Nova scheduler service runs as multiple workers, each having its own RPC server, but all the Nova scheduler workers listen to the same RPC topic and queue scheduler in a fanout way.
  
  Currently, nova.service.py->stop() calls stop() and wait() on the RPC server. Once the RPC server is stopped, it stops listening for new messages. This does not affect the other scheduler workers, which continue listening to the same queue and processing requests. If any scheduler worker is stopped, the other workers will process the requests.
- Response handling: Whenever there is an RPC call, oslo.messaging creates a separate reply queue tied to the unique message id. This reply queue is used to send the RPC call response to the caller. Even if the RPC server is stopped on this worker, the reply queue is not affected.
  
  We still need to keep the worker up until all the responses are sent via the reply queue, and for that, we need to implement the in-progress task tracking in scheduler services, but that will be handled in step 2.
This way, stopping a Nova scheduler worker will not impact the RPC communication on the scheduler service.
Nova conductor: No RPC change needed.

The Nova conductor binary is a stateless service that can spawn multiple worker threads. Each instance of the Nova conductor has its own RPC server, but all the Nova conductor instances listen to the same RPC topic and queue conductor. This allows the conductor instances to act as a distributed worker pool, such that stopping an individual conductor instance will not impact the RPC communication for the pool of conductor instances, allowing other available workers to process the request. Each cell has its own pool of conductors, meaning that as long as one conductor is up for any given cell, the RPC communication will continue to function even when one or more conductors are stopped.

The request and response handling is done in the same way as mentioned for the scheduler.

Note

This spec does not cover the conductor single worker case. That might require an RPC redesign for the conductor as well, but it needs more investigation.
Nova compute: RPC design change needed
- Request handling: Nova compute runs as a single worker per host, and each compute has its own RPC server, listener, and separate queues. It handles new requests as well as the communication needed for in-progress operations on the same RPC server. To achieve graceful shutdown, we need to separate the communication for new requests from the communication for in-progress operations. This will be done by adding a new RPC server in the compute service.
  
  For easy readability, we will be using a different term for each RPC server:
  - ‘ops RPC server’: This will be used for the new RPC server, which will be used to finish the in-progress requests and will stay up during shutdown.
  - ‘new request RPC server’: This will be used for the current RPC server, which is used for the new requests and will be stopped during shutdown.
- ‘new request RPC server’ per compute: No change in this RPC server, but it will be used for all the new requests, so that we can stop it during shutdown and stop the new requests on the compute.
- ‘ops RPC server’ per compute:
  - Each compute will have a new ‘ops RPC server’ which will listen to a new topic compute-ops.<host>. compute-ops name is used because it is mainly for compute operations, but a better name can be used if needed.
  - It will use the same transport layer/bus and exchange that the ‘new request RPC server’ uses.
  - It will create its own dispatcher, listener, and queue.
  - Both RPC servers will be bound to the same endpoints (same compute manager), so that requests coming from either server are handled by the same compute manager.
  - This server will be mainly used for the compute-to-compute operations and server external events. The idea is to keep this RPC server up during shutdown so that the in-progress operations can be finished.
  - During shutdown, nova.service will wait for the compute manager to report that it has finished all its tasks, so that it can stop the ‘ops RPC server’ and finish the shutdown.
- Response handling: Regardless of which RPC server a request arrives on, whenever there is an RPC call, oslo.messaging creates a separate reply queue tied to the unique message id. This reply queue is used to send the RPC call response to the caller. Even if the RPC server is stopped on this worker, the reply queue is not affected.
- Compute service workflow:
  - The SIGTERM signal is handled by oslo.service, which calls stop on nova.service.
  - nova.service will stop the ‘new request RPC server’ so that no new requests are picked up by the compute. The ‘ops RPC server’ stays up and running.
  - nova.service will wait for the manager to signal once all in-progress operations are finished.
  - Once the compute manager signals nova.service, it will stop the ‘ops RPC server’ and proceed with the service shutdown.
- Timeout:
  - There is an existing graceful_shutdown_timeout config option in oslo.service, which can be set per service.
  - It is honoured when timing out the service stop, and the service will be stopped regardless of whether the compute has finished its work.
- RPC client:
  - The RPC client stays a singleton class, created with the topic compute.<host>, meaning that by default messages are sent via the ‘new request RPC server’.
  - Any RPC cast/call that needs to send a message via the ‘ops RPC server’ must override the topic to compute-ops.<host> in the client.prepare() call.
  - Which RPC casts/calls use the ‘ops RPC server’ will be decided during implementation, when we can better judge which methods are used by the operations we want to finish during shutdown. A draft list of where we can use the ‘ops RPC server’:
    
    Note
    
    This is a draft list and can be changed during implementation.
    - Migrations:
      - Live migration:
        
        Note
        
        We will use the ‘new request RPC server’ for the check_can_live_migrate_destination and check_can_live_migrate_source methods, as this is the very initial phase, where the compute service has not yet started the live migration. If shutdown is initiated before the live migration request arrives, the migration should be rejected.
        
        pre_live_migration()
        
        live_migration()
        
        prep_snapshot_based_resize_at_dest()
        
        remove_volume_connection()
        
        post_live_migration_at_destination()
        
        rollback_live_migration_at_destination()
        
        drop_move_claim_at_destination()
      - resize methods
      - cold migration methods
    - Server external event
    - Rebuild instance
    - validate_console_port(): When the console has already been requested and port validation is in progress, the compute should finish it before shutdown so that users get their requested console.
Time based waiting for services to finish the in-progress operations:

Note

The time-based waiting is a temporary solution in spec 1. In spec 2, it will be replaced by proper tracking of in-progress tasks.
- To make the graceful shutdown less complicated, spec 1 proposes a configurable time-based wait for services to complete their operations.
- The wait time should be less than the global graceful shutdown timeout, so that neither an external system nor oslo.service shuts down the service before the service wait time is over.
- It will be configurable per service.
- Proposal for the default value:
  - compute service: 150 sec, considering long-running operations on compute.
  - conductor service: 60 sec should be enough.
  - scheduler service: 60 sec should be enough.
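
The spec 1 shutdown order for the compute service, combined with the time-based wait, can be sketched as below. This is an illustration under assumptions, not the Nova implementation: RecordingServer and graceful_stop() are hypothetical names, and a threading.Event stands in for the signal the manager sends once its in-progress operations are finished.

```python
import threading

# Default wait times (seconds) proposed above, per service.
DEFAULT_WAIT = {"compute": 150, "conductor": 60, "scheduler": 60}


class RecordingServer:
    """Toy RPC server that only records the calls made on it."""

    def __init__(self, name, calls):
        self.name = name
        self.calls = calls

    def stop(self):
        self.calls.append(f"{self.name}.stop")

    def wait(self):
        self.calls.append(f"{self.name}.wait")


def graceful_stop(new_request_server, ops_server, tasks_done, wait_time):
    """Stop servers in the proposed order.

    tasks_done is a threading.Event set by the manager when all
    in-progress operations are finished. Returns True if that happened
    within wait_time seconds, False if the wait timed out.
    """
    # 1. Stop taking new requests; the ops server stays up.
    new_request_server.stop()
    new_request_server.wait()
    # 2. Give in-progress operations up to wait_time seconds.
    finished = tasks_done.wait(timeout=wait_time)
    # 3. Stop the ops server and let the shutdown proceed.
    ops_server.stop()
    ops_server.wait()
    return finished
```

A caller would pass the configured wait time for its service, for example DEFAULT_WAIT["compute"], and could log the unfinished operations when the function returns False.
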
PoC: This PoC shows the working of the spec 1 proposal.
- Code change: https://review.opendev.org/c/openstack/nova/+/967261
- PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
Some specific examples of the shutdown issues which will be solved by this proposal:
- Migrations:
  - Migration operations will use the ‘ops RPC server’.
    - If a migration is in progress, the service shutdown will not terminate it; instead, the shutdown will wait for the migration to complete.
  - Instance boot:
    - Instance boot operations will continue to use the ‘new request RPC server’. Otherwise, we will not be able to stop the new requests.
    - If instance boot requests are in progress by compute services, then shutdown will wait for compute to boot them successfully.
    - If a new instance boot request arrives after the shutdown is initiated, then it will stay in the queue, and the compute will handle it once it is started again.
  - Any operation that has reached the compute will be completed before the service is shut down.

Note

As per my PoC and manual testing so far, this does not require any change on the oslo.messaging side.

Spec 2: Smartly track and wait for the in-progress operations:

Graceful shutdown of the services below is handled by their deployment server or library, so no work is needed for Spec 2:
- Nova API
- Nova metadata API
- nova-novncproxy
- nova-serialproxy
- nova-spicehtml5proxy
The below services need to implement the tracking system:
- Nova compute
- Nova conductor
- Nova scheduler

This proposal is to base the service wait time on tracking the in-progress tasks. Once a service finishes its tasks, it can signal nova.service to proceed with shutting down the service. Basically, this replaces the wait time approach mentioned above with a tracker-based approach.

There will be a task tracker introduced to track the in-progress tasks.
It will be a singleton object.
It maintains a list of ‘method names’ and request-ids. If a task is related to an instance, we can also add the instance UUID, which helps to filter or to know which operations are in progress on a specific instance. The unique request-id helps to track multiple calls to the same method.
Whenever a new request reaches the compute, the tracker adds it to the task list and removes it once the task is completed. Modifications to the tracker are done under a lock.
Once shutdown is initiated:
- The task tracker will either add new tasks to the tracker list or reject them. The decision will be made case by case, for example, rejecting tasks that are not critical to handle during shutdown.
- During shutdown, any new periodic tasks will be denied, but in-progress periodic tasks will be finished.
- An exact list of tasks which will be rejected and accepted will be decided during implementation.
- The task tracker will start logging the tasks which are in progress, and log when they are completed. Basically, it logs a detailed view of the in-progress work during shutdown.
nova.service will wait for the task tracker to finish the in-progress tasks until timeout.
An example of the flow of RPC server stop, wait, and task tracker wait will be something like:
- We can signal the task tracker to start logging the in-progress tasks.
- RPCserver1.stop()
- RPCserver1.wait()
- manager.finish_tasks(): wait for manager to finish the in-progress tasks.
- RPCserver2.stop()
- RPCserver2.wait()
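
The tracker described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TaskTracker`, its method names, and the `critical` flag are hypothetical and not taken from the PoC. It shows the lock-protected task list, rejection of non-critical tasks once shutdown begins, and a wait bounded by a timeout.

```python
import threading


class TaskTracker:
    """Hypothetical sketch of an in-progress task tracker."""

    def __init__(self):
        self._lock = threading.Lock()
        # The condition shares the lock so waiters see a consistent task list.
        self._idle = threading.Condition(self._lock)
        self._tasks = {}  # request_id -> (method name, instance UUID or None)
        self._shutting_down = False

    def start_task(self, request_id, method, instance_uuid=None, critical=True):
        """Register a task; return False if it is rejected during shutdown."""
        with self._lock:
            if self._shutting_down and not critical:
                return False
            self._tasks[request_id] = (method, instance_uuid)
            return True

    def finish_task(self, request_id):
        """Remove a completed task and wake up waiters when nothing is left."""
        with self._idle:
            self._tasks.pop(request_id, None)
            if not self._tasks:
                self._idle.notify_all()

    def begin_shutdown(self):
        """Mark shutdown as started and return a snapshot suitable for logging."""
        with self._lock:
            self._shutting_down = True
            return dict(self._tasks)

    def wait(self, timeout):
        """Block until no task is in progress; return False on timeout."""
        with self._idle:
            return self._idle.wait_for(lambda: not self._tasks, timeout)


# A module-level instance is one simple way to get singleton behavior.
TRACKER = TaskTracker()
```

In the flow above, a call such as `TRACKER.wait(timeout)` would sit where `manager.finish_tasks()` appears, between stopping the first RPC server and stopping the second one.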

Spec 3: Safe termination point for Nova Operations:

In graceful shutdown, all the in-progress operations should reach their safe termination point, either completion or abort.

This needs to be done based on the operation type and the stage the operation is in. Some operations, for example pre-copy live migration, cold migration, resize, snapshot, or shelve_offload, are OK to abort with proper logging and exception type. The user can request them again once the service is up.

Some operations, for example post-copy live migration, are difficult to abort if the VM has already moved to the destination compute. This might need some way to revert the VM to the source or let it complete.

The scope of this spec is to investigate and audit all the operations and categorize them as ‘Ok to abort’ or ‘Wait to complete’. Accordingly, graceful shutdown needs to implement the logic to abort or to continue waiting for the operation to complete.
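
One possible shape for the result of that audit is a simple lookup table. The sketch below is an assumption, not the audited list: it only contains the examples named in this section, the operation keys are hypothetical identifiers, and treating post-copy live migration and unknown operations as "wait to complete" is a conservative guess.

```python
import enum


class ShutdownAction(enum.Enum):
    ABORT = "Ok to abort"
    WAIT = "Wait to complete"


# Only the examples mentioned in the text; the real list comes from the audit.
EXAMPLE_POLICY = {
    "pre_copy_live_migration": ShutdownAction.ABORT,
    "cold_migration": ShutdownAction.ABORT,
    "resize": ShutdownAction.ABORT,
    "snapshot": ShutdownAction.ABORT,
    "shelve_offload": ShutdownAction.ABORT,
    # Assumption: letting it complete is chosen over reverting to the source.
    "post_copy_live_migration": ShutdownAction.WAIT,
}


def action_for(operation):
    """Return the shutdown action; unknown operations default to waiting."""
    return EXAMPLE_POLICY.get(operation, ShutdownAction.WAIT)
```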

Graceful Shutdown Timeouts:

Nova service timeout:
- oslo.service already has the timeout (graceful_shutdown_timeout) which is configurable per service and used to timeout the SIGTERM signal handler.
- oslo.service will terminate the Nova service based on graceful_shutdown_timeout, even if the Nova service graceful shutdown is not finished.
- No new configurable timeout will be added for Nova; instead, it will use the existing graceful_shutdown_timeout.
- Its default value is 60 sec, which is too short for Nova services. The proposal is to override its default value per Nova service:
  - compute service: 180 sec (Considering the long running tasks).
  - conductor service: 80 sec
  - scheduler service: 80 sec
External system timeout:

Depending on how Nova services are deployed, there might be an external system timeout for graceful shutdown (for example, when Nova runs in k8s pods). That can impact the Nova graceful shutdown, so we need to document clearly that if there is an external system timeout, the Nova service timeout graceful_shutdown_timeout should be set accordingly. The external system timeout should be higher than graceful_shutdown_timeout; otherwise the external system will time out and interrupt the Nova graceful shutdown.
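
The relationship between the two timeouts can be expressed as a small check. This is a sketch only: the function name and the idea of validating the values in code are assumptions, and the per-service numbers are the defaults proposed above.

```python
# Proposed per-service defaults for graceful_shutdown_timeout, in seconds.
PROPOSED_DEFAULTS = {"compute": 180, "conductor": 80, "scheduler": 80}


def external_timeout_is_sufficient(service, external_timeout,
                                   graceful_shutdown_timeout=None):
    """Return True if the external timeout exceeds the Nova service timeout.

    `external_timeout` stands for whatever grace period the deployment
    tooling applies before it force-kills the service (hypothetical input).
    """
    if graceful_shutdown_timeout is None:
        graceful_shutdown_timeout = PROPOSED_DEFAULTS[service]
    return external_timeout > graceful_shutdown_timeout
```

For example, an external grace period of 30 seconds would not be sufficient for a compute service using the proposed 180 second default, while 200 seconds would be.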

Alternatives

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

This should have a positive impact on end users because a service shutdown will no longer stop their in-progress operations.

Performance Impact

No impact on normal operations, but the service shutdown will take more time. There is a configurable timeout to control the service shutdown wait time.

Other deployer impact

None other than a longer shutdown process, but deployers can configure an appropriate timeout for service shutdown.

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:

gmaan

Other contributors:

None

Feature Liaison

gmaan

Work Items

Implement the ‘ops RPC server’ on the compute service
Use the ‘ops RPC server’ for the operations we need to finish during shutdown, for example, compute-to-compute tasks and server external events.
RPC versioning due to upgrade impact.
Implement a task tracker for services to track and report the in-progress tasks during shutdown.

Dependencies

No dependency as of now, but we will see during implementation if any change is needed in oslo.messaging.

Testing

We cannot write tempest tests for this because tempest will not be able to stop the services.
We can try some testing in the ‘post-run’ phase (with a heavy live migration which takes time), as is done for the evacuate tests.
Unit and functional tests will be added.

Documentation Impact

How graceful shutdown works will be documented along with other considerations, for example, the timeout or wait time used for the graceful shutdown.

References

PoC:
- Code change: https://review.opendev.org/c/openstack/nova/+/967261
- PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
PTG discussions:

History

Revisions
Release Name	Description
2026.1 Gazpacho	Introduced

libvirt - Use built-in firmware auto-selection for UEFI firmware

Tue, 11 Nov 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-firmware-auto-selection

Libvirt introduced built-in firmware auto-selection for UEFI firmware, which automatically fills in the paths for the CODE file and VAR file of UEFI firmware according to the requested features. This feature is more sophisticated than nova's own logic and is capable of detecting a few recently introduced flags, such as AMD SEV or stateless firmware.

This spec proposes replacing nova's existing logic with the built-in one, so that we don’t have to maintain our own logic and can leverage the improved mechanism in the underlying libvirt.

Problem description

Recent libvirt is capable of selecting the appropriate firmware files for domains using UEFI boot, according to the requested features such as:

secure boot
amd-sev/amd-sev-es/amd-sev-snp
stateless firmware

This feature is called auto-selection in libvirt and it reads the flags maintained in firmware descriptor files provided by qemu packages in distros.

Nova introduced its own logic when secure boot support was introduced. Because nova explicitly defines the firmware files used for every instance with UEFI boot, libvirt skips its auto-selection feature and uses the specified files. However, the existing logic in nova only considers the secure-boot flag, so it is not able to select appropriate firmware for the other features. As a result, an instance with additional features may be launched with a wrong firmware file. One example is stateless firmware, for which a firmware file with the “stateless” flag should be used, but the current nova may not consider this flag and may launch instances with a CODE file which has a non-zero VAR file associated.

In addition to the new feature flags, recent QEMU packages introduced new ROM type firmware. Libvirt can recognize it, but nova is not able to handle these new types because different keys are used to define the firmware file path.

Use Cases

As a cloud administrator, I want nova to select the appropriate firmware file according to the features the user requested, without additional configuration.
As a cloud user, I want my instance to be booted with appropriate firmware, according to the features requested.

Proposed change

We propose the following changes in the way guest XML is generated by libvirt driver, so that firmware files are selected by libvirt according to the requested features.

Stop explicitly passing paths for code file and var file when defining a domain.
- Current nova fills the loader element and the nvram element when generating a guest XML. The example below describes the os element of a guest XML with secure-boot.
```
<os>
  <type machine='q35'>hvm</type>
  <loader type='pflash' readonly='yes' secure='yes'>/usr/share/OVMF/OVMF_CODE.secboot.fd</loader>
  <nvram template='/usr/share/OVMF/OVMF_VARS.secboot.fd'/>
  <boot dev='hd'/>
  <smbios mode='sysinfo'/>
</os>
```
- Once the proposed change is implemented, nova no longer fills these file paths but adds the firmware feature element for secure-boot. The example below describes the os element of a guest XML with secure-boot.
```
<os firmware='efi'>
  <type machine='q35'>hvm</type>
  <loader secure='yes'/>
  <firmware>
    <feature enabled='yes' name='secure-boot'/>
  </firmware>
  <boot dev='hd'/>
  <smbios mode='sysinfo'/>
</os>
```
  - Note that firmware='efi' is the key that tells libvirt to detect the firmware file paths.
  - To keep the existing behavior for guests without secure-boot, the feature is explicitly disabled in the guest XML if the secure-boot feature is not requested.
        <os firmware='efi'>
          <type machine='q35'>hvm</type>
          <loader secure='no'/>
          <firmware>
            <feature enabled='no' name='secure-boot'/>
          </firmware>
          <boot dev='hd'/>
          <smbios mode='sysinfo'/>
        </os>
- However, libvirt does not require these firmware feature elements for stateless firmware (libvirt reads the stateless property in the loader element) and AMD SEV/SEV-ES (libvirt reads the launchSecurity element). For example, the os element of a guest XML should look like the example below when stateless firmware is requested.
```
<os firmware='efi'>
  <type machine='q35'>hvm</type>
  <loader secure='no' stateless='yes'/>
  <firmware>
    <feature enabled='no' name='secure-boot'/>
  </firmware>
  <boot dev='hd'/>
  <smbios mode='sysinfo'/>
</os>
```
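
The XML shapes shown above could be produced by a helper along the following lines. This is a minimal sketch using only the Python standard library; `build_os_element` and its parameters are hypothetical and do not mirror nova's actual guest config classes.

```python
import xml.etree.ElementTree as ET


def build_os_element(secure_boot=False, stateless=False, machine="q35"):
    """Build an <os> element that relies on libvirt firmware auto-selection."""
    # firmware='efi' asks libvirt to pick the firmware files itself.
    os_el = ET.Element("os", firmware="efi")
    type_el = ET.SubElement(os_el, "type", machine=machine)
    type_el.text = "hvm"

    # No loader path and no <nvram> element are emitted on purpose.
    loader_attrs = {"secure": "yes" if secure_boot else "no"}
    if stateless:
        loader_attrs["stateless"] = "yes"
    ET.SubElement(os_el, "loader", loader_attrs)

    # secure-boot is always stated explicitly, enabled or disabled.
    firmware = ET.SubElement(os_el, "firmware")
    ET.SubElement(firmware, "feature",
                  enabled="yes" if secure_boot else "no",
                  name="secure-boot")

    ET.SubElement(os_el, "boot", dev="hd")
    ET.SubElement(os_el, "smbios", mode="sysinfo")
    return os_el
```

Calling `ET.tostring(build_os_element(stateless=True), encoding="unicode")` gives a serialized element with the same content as the stateless example, apart from whitespace and attribute order.
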
During the following operations, check the loader element and the nvram element in the existing guest XML, and then explicitly pass these elements and disable auto detection to generate the new guest XML, so that firmware files are not re-selected during these operations.
- hard-reboot (and start)
- live migration
Note

It’s possible that the domain XML does not exist when an instance is started (for example if the instance is booted from a volume and its host is reinstalled). In that case the XML is generated from scratch and the firmware file paths may change after the operation.
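
Reading the firmware files back from an existing guest XML could look like the sketch below. It is an illustration under assumptions: `get_firmware_paths` is a hypothetical helper, and it only reads the loader path and the nvram template attribute shown in the earlier example. When no XML is available, it returns empty values, which corresponds to the case in the note where the XML is regenerated from scratch.

```python
import xml.etree.ElementTree as ET


def get_firmware_paths(domain_xml):
    """Return (loader_path, nvram_template) from a domain XML string.

    Both values are None when the XML is missing or has no such elements.
    """
    if not domain_xml:
        return None, None
    os_el = ET.fromstring(domain_xml).find("os")
    if os_el is None:
        return None, None

    loader = os_el.find("loader")
    nvram = os_el.find("nvram")
    loader_path = None
    if loader is not None and loader.text:
        loader_path = loader.text.strip()
    nvram_template = nvram.get("template") if nvram is not None else None
    return loader_path, nvram_template
```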

Alternatives

An alternative approach is to implement the same auto selection logic in nova, but this requires effort to keep the implementation consistent with libvirt. This causes concern with future code maintenance for no large benefit.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

Instances may be launched with a different (but correct) firmware after the operations which generate domain XML from scratch, such as

Start, when the domain definition does not exist on the hypervisor

Rebuild

Shelve

Resize or cold migration

Evacuate

Performance Impact

None

Other deployer impact

Deployers have to ensure the firmware descriptor files (which are typically located in /usr/share/qemu/firmware/) are updated to contain the flags for the expected features.

Developer impact

None

Upgrade impact

As described in the end-user impact section, an instance may be launched with a different firmware file when it is launched by the new libvirt driver.
After upgrade, instance creation might fail if none of the firmware descriptor files contain the required flags (SEV flags and stateless flag), which were not checked earlier.

Implementation

Assignee(s)

Primary assignee: kajinamit (irc: tkajinam)
Other contributors: None

Work Items

Add detection of used firmware files from existing libvirt XML
Update generation of XML file by libvirt driver to fill firmware file paths only when these are explicitly given.
Update live migration and hard reboot to pass the firmware files currently used, when generating a new XML file.

Unit tests and functional tests should be added according to new logic.

Dependencies

Libvirt >= 5.2.0 is required to use the auto-selection feature. This is already enforced by the minimum libvirt version check.
QEMU firmware files and their descriptor files are updated to contain flags for the features requested by nova. The firmware descriptor files installed by supported distributions mostly contain the required flags, but Ubuntu 24.04 is known to require an update for AMD SEV-ES support. See Bug 2122286 for details.

Testing

Corresponding unit/functional tests will need to be extended or added to cover:

Simplified XML file passed during instance XML generation, which does not contain explicit firmware file paths
XML file generated by hard-reboot or live migration should contain explicit firmware file paths, according to the ones in the existing domain XML.

For secure boot and stateless firmware, new scenario tests may be added to tempest. However, cirros, which is currently used in CI as the guest OS, does not support UEFI boot, so we need a different (and likely heavier) guest OS for these features. If it is determined that we can’t use such a guest in CI, these features may be tested locally.

Also, AMD SEV and AMD SEV-ES have no real firmware available in CI so these will be manually tested.

Documentation Impact

None

References

Configuration guide of UEFI boot

History

Revisions
Release Name	Description
2026.1 Gazpacho	Approved

Support for tracking traits removed from provider.yaml

Wed, 05 Nov 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/copy-applied-provider-yaml

This specification proposes a feature to ensure that traits removed from the provider.yaml are also properly deleted from the resource provider.

Problem description

Nova-compute has a feature to register custom traits with the resource provider using config files (provider.yaml). https://docs.openstack.org/nova/latest/admin/managing-resource-providers.html

In this configuration file, even if the values of custom traits are modified or a trait is deleted, the original trait is not removed from the target resource provider. In scenarios where the custom trait registered with the resource provider is replaced and old custom traits affect scheduling, this behavior can be a problem.

Use Cases

As a cloud operator, I would like to ensure that only one trait is registered with the resource provider for custom traits of the same type.
As a cloud operator, I would like to complete the registration of custom traits in the config file of nova-compute without additional implementation (calling the Placement API using API/CLI in another system).

Proposed change

We propose adding a process for nova-compute to copy the contents of the provider.yaml file to /var/lib/nova/applied_provider.yaml after they have been applied to the placement.

Then, when updating placement based on the provider.yaml file, nova-compute performs a diff between /var/lib/nova/applied_provider.yaml and /etc/nova/provider.yaml to detect whether any traits have been removed from the provider.yaml file.

For now, the diff is limited to traits, but later this logic can be extended to allow the use of the diff for any part of the provider.yaml.
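
The trait diff could be computed roughly as follows. This sketch works on already-parsed configuration dictionaries so that it stays free of third-party imports; the helper names are hypothetical, and it assumes custom traits live under `traits.additional` of each provider entry and that providers are matched by UUID or name.

```python
def _traits_by_provider(config):
    """Map each provider (by uuid or name) to its set of additional traits."""
    result = {}
    for provider in config.get("providers", []):
        ident = provider.get("identification", {})
        key = ident.get("uuid") or ident.get("name")
        traits = provider.get("traits", {}).get("additional", [])
        result[key] = set(traits)
    return result


def removed_traits(applied, current):
    """Return {provider: traits} present in `applied` but gone from `current`."""
    applied_traits = _traits_by_provider(applied)
    current_traits = _traits_by_provider(current)
    removed = {}
    for key, old in applied_traits.items():
        gone = old - current_traits.get(key, set())
        if gone:
            removed[key] = gone
    return removed
```

Here `applied` would be loaded from /var/lib/nova/applied_provider.yaml and `current` from /etc/nova/provider.yaml; the traits returned are the ones to delete from the resource provider.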

Alternatives

Register only the custom traits defined in the file with the resource provider, treating provider.yaml as declarative data. However, this is a destructive change and there are concerns about the impact on the existing environment.
Add a definition like declarative_prefix to provider.yaml to handle only traits with a declarative_prefix declaratively. In this case, the extensibility to non-trait elements in provider.yaml is limited, and both the definition in provider.yaml and the code of the resource tracker become complex.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

No performance impact on nova is anticipated. If there are frequent updates to custom traits, requests for deleting and creating traits will be frequently sent to the Placement API.

Other deployer impact

None

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee: mkuroha
Other contributors: None

Feature Liaison

Feature liaison: Liaison Needed

Work Items

Implement the copying of provider.yaml and extraction of trait diffs with applied_provider.yaml in the _merge_provider_configs method.

Dependencies

None

Testing

Add unit/functional tests

Documentation Impact

Update the existing Managing Resource Providers Using Config Files guide to explain the behavior with applied_provider.yaml.

References

None

History

Revisions
Release Name	Description
2025.2 Flamingo	Approved
2026.1 Gazpacho	Reproposed

vTPM live migration

Wed, 17 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/vtpm-live-migration

When Nova first added vTPM support, all non-spawn operations were rejected at the API level. Extra work was necessary to manage the vTPM state when moving an instance. This work was eventually completed for resize and cold migration, and those operations were unblocked. The blocks on live migration, evacuation, shelving and rescue are still in place.

A TPM device is required for certain features of Windows Server 2022 and 2025, notably BitLocker Drive Encryption. It’s also required to run Windows 11 at all. The inability to live migrate instances with vTPM is a major roadblock for anyone operating Windows guests in an OpenStack cloud.

Libvirt support for vTPM live migration now exists (more details in Problem description), but Nova changes are necessary before being able to remove the API block. This spec describes those changes.

Problem description

There are three aspects to vTPM live migration: shared vs non-shared vTPM state storage, Libvirt support, and secret management. There is also an adjacent problem that, while not related to live migration, can be resolved by the changes necessary to support live migration: vTPM instances cannot be started back up by Nova after a compute host reboot.

vTPM state storage

vTPM state storage is not the same as instance state storage and Libvirt supports the use of local storage and shared storage such as NFS, for both.

Libvirt can be told where to store the vTPM state via the source XML element, which Nova does not support. Nova deployments use the Libvirt default vTPM state path. On both Ubuntu and Red Hat operating systems, this path is /var/lib/libvirt/swtpm/<instance UUID>. This path is distinct from the instance state path.

Testing will generally focus on local storage and could be expanded to shared storage like NFS in the future. Currently the Nova CI gate does not have any jobs that are configured with NFS.

Libvirt support

Though it was impossible to find Libvirt artifacts explicitly demonstrating vTPM live migration support for non-shared vTPM state storage, as of version 8.10, vTPM live migration with shared vTPM storage is supported, and this comment suggests that for non-shared storage, vTPM live migration has been supported since version 7.1.0.

Therefore, this spec requires Libvirt 7.1.0. Our current minimum Libvirt version is 8.0.0 as of 2025.1 (Epoxy), so we will not need to do any minimum version checks while implementing this feature.

Secret management

When creating an instance with vTPM, Nova asks a key manager - normally Barbican - to generate a secret. Crucially, this is done with the user’s token, and the created secret is owned by the user, with no one else - not even admin or the Nova service user - being able to read it. Nova then defines the secret in Libvirt, and in the instance XML references the secret by its UUID. This tells Libvirt to encrypt the instance’s vTPM state using the contents of that secret as the symmetric key. Nova undefines the secret once the Libvirt domain spawns successfully.

For vTPM live migration to work, a Libvirt secret with the same UUID and contents needs to be defined on the destination host so that the destination Libvirt can decrypt the vTPM state. Currently, Nova has no way of doing this. Live migration is an admin operation, and neither admin nor the Nova service user has access to the Barbican secret (unless the admin happens to be the owner of the instance, but that’s an edge case). The Libvirt secret cannot be read back on the source host either, because it’s defined as private and is undefined once the domain spawns.

Compute host reboot

For the exact same reasons (lack of Barbican secret access and inability to read the Libvirt secret back from Libvirt), Nova cannot start vTPM instances back up after a compute host reboot.

Use Cases

As a cloud operator, I want to be able to live migrate instances with vTPM devices, in particular Windows instances.

As a cloud user, I want to keep the contents of my instance’s vTPM private. The cloud system should only be able to decrypt it when I request it via my user token and the system should only keep the decryption secret around for a limited time. I as a user am willing to accept that such privacy requirements limit some of the admin initiated lifecycle operations on my instance.

As a cloud operator, I want vTPM instances on a compute host to start back up again after a host reboot.

Proposed change

Because the security of the vTPM secret (either in Barbican or in Libvirt) affects what operations can be performed on an instance, users should be able to specify what level of security they require, and operators need to specify what level of security they’re willing to support. There also needs to be a default level applied to an instance if nothing is explicitly specified.

Three possible security levels are proposed. They are presented in the table below.

`tpm_secret_security` values
Value	Mechanism	Security implications	Instance mobility
`user`	Only the instance owner has access to the Barbican secret. This is existing behavior and will be the default behavior.	This is the most secure option, as even the Nova service user and root on the compute host cannot read the secret.	The instance is immovable and cannot be restarted by Nova in the event of a compute host crash or reboot.
`host`	The Libvirt secret is persistent and retrievable.	This is “medium” security. API-level admins and the Nova service user do not have access to the secret, but it can be accessed by users with sufficient privileges on the compute host.	The instance can be live migrated because Nova can read the secret back from Libvirt on the source host and send it to the destination over RPC. Security over the wire is left as the operator’s responsibility, but TLS or similar is assumed. The instance can also be restarted by Nova in the event of a compute host crash or reboot for the exact same reason.
`deployment`	The Nova service user owns the Barbican secret.	This is the least secure but most flexible option.	The instance can be live migrated because Nova can download the secret from Barbican and define it in Libvirt on the destination host. The instance can also be restarted by Nova in the event of a compute host crash or reboot for the exact same reason.

Users are able to choose what level they require on their instance by selecting a flavor that sets the new hw:tpm_secret_security flavor extra spec. For simplicity, if hw:tpm_secret_security is not set in the flavor extra specs, an instance with vTPM will default to the user TPM secret security policy, which is the same as legacy behavior.

A new image property is intentionally not provided because server rebuild is blocked in the API. If a user were to create a server with a given TPM secret security policy via an image property, that policy would become locked-in and unable to be changed. The user would not be able to change the image property because they would not be able to rebuild, and they would not be able to resize to a different TPM secret security policy because the image property and flavor extra spec would conflict and fail with HTTP 409.

Operators are able to specify what level they support by using the new [libvirt]supported_tpm_secret_security config option. This is a per compute host list option that can take the value of one or more of the security levels from the previous table. Its default value is all three levels. These values are exposed as driver capability traits. The hw:tpm_secret_security flavor extra spec is translated to a required trait to match the driver capabilities.

The behavior of an instance during live migration is defined by its persisted embedded flavor hw:tpm_secret_security extra spec. Instances with user cannot be live migrated. For instances with host, the source compute host reads the secret from Libvirt and sends it over RPC to the destination. For instances with deployment, the destination host downloads the secret from Barbican and defines it in Libvirt. Because the instance’s hw:tpm_secret_security value translates to a required trait, it’s guaranteed that the destination host chosen for live migration supports whatever behavior the instance requires.
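
The decision logic described above can be summarized in a short sketch. It is not nova code: the function names and the returned labels are hypothetical, and the trait translation is left out because the trait names are not given here.

```python
LEVELS = ("user", "host", "deployment")
DEFAULT_LEVEL = "user"  # matches legacy behavior


def secret_security(extra_specs):
    """Return the TPM secret security level from flavor extra specs."""
    level = extra_specs.get("hw:tpm_secret_security", DEFAULT_LEVEL)
    if level not in LEVELS:
        raise ValueError(f"unknown TPM secret security level: {level}")
    return level


def host_supports(level, supported=LEVELS):
    """Check a level against a host's supported_tpm_secret_security list."""
    return level in supported


def live_migration_secret_source(extra_specs):
    """Describe how the destination gets the secret; None means blocked."""
    level = secret_security(extra_specs)
    if level == "user":
        return None  # instances with the user policy cannot be live migrated
    if level == "host":
        return "source reads secret from Libvirt and sends it over RPC"
    return "destination downloads secret from Barbican"
```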

Alternatives

This is the only version of this spec that covers the essentials: users of new instances can choose the security level that they require, and operators can choose which security levels they are willing to support given the limitations imposed by higher security levels.

We could also provide an image property for selection of the TPM secret security policy but it would be problematic because of the current inability to rebuild instances with vTPM (it is blocked in the API). Without the ability to rebuild a vTPM instance, any user who chose their policy via image property would be locked in to that policy unable to change it. They would not be able to change the image property value because they cannot rebuild and they would also not be able to change the policy via flavor extra spec because that would fail due to conflicting values between image property vs flavor extra spec.

If we would like to support image property in the future, we could possibly do it if we could add the ability to rebuild vTPM instances at the same time. It is not yet known if there are any technical limitations that prevent the possibility of implementing rebuild, but we could certainly investigate.

Data model impact

None.

REST API impact

No new microversion. The flavor extra spec validation code is updated to allow hw:tpm_secret_security.

Security impact

The main security consequences of this spec are the implications of the host and deployment values of hw:tpm_secret_security.

In the host case, anyone with sufficient access to the compute host can read vTPM secrets. While this is not great, it’s also something the user opts in to, and compute hosts are assumed to be secured by the cloud operator.

In the deployment case, a compromise of the Nova service user leads to an exposure of all vTPM secrets. Once again, this is something the user opts in to, and the Nova service user is assumed to be secure.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

A compute service version bump is necessary.

Live migration of instances with vTPM will be blocked until the minimum service version of the deployment is the upgraded version. The cloud must be fully upgraded.

Deployers must create flavor(s) with the hw:tpm_secret_security extra spec set to host or deployment in order to enable creation of instances with the respective TPM secret security policies.

Any instances without this set are pre-existing instances and for simplicity, they will not be migrated. If a user would like to opt-in to live migration, they can resize their pre-existing instance to a flavor that has the hw:tpm_secret_security extra spec set to host or deployment.

Automatic migration of pre-existing instances into TPM secret security policies could be discussed and considered as future work.

Implementation

Assignee(s)

Primary assignee: notartom, melwitt

Feature Liaison

Feature liaison: melwitt, dansmith

Work Items

Introduce the hw:tpm_secret_security flavor extra spec, and [libvirt]supported_tpm_secret_security config option
Add vtpm_secret_uuid and vtpm_secret_value fields to the LibvirtLiveMigrateData object to carry the data over RPC from the source host to the destination host in the case of the host TPM secret security policy
Modify the pre live migration and rollback code to handle secret definition and cleanup
Modify the resize code to handle TPM secret security policy conversions including absence of TPM secret security policy for pre-existing instances
Bump the service version
Modify the existing API block to only allow live migration of host or deployment instances once the minimum service version has reached the bumped version
Add a whitebox/integration test
Add regular Tempest tests if possible
Update the documentation

Dependencies

Libvirt version 7.1.0. This can be enforced dynamically in code.

Testing

Nova’s functional tests are extended to test the Nova logic using the Libvirt fixture. This is particularly useful for cases that cannot be easily tested in a real environment, like rollback.

The existing whitebox-tempest-plugin vTPM tests are extended to test live migration in a real environment with an actual Libvirt.

Documentation Impact

Nova’s vTPM documentation is updated to remove the live migration limitation and explain the usage of the supported_tpm_secret_security configuration option, as well as the implications of all possible values. The expectation that vTPM state storage is not shared and that shared vTPM state storage live migration is untested is made explicit.

References

Empty.

History

Revisions
Release Name	Description
2026.1 Gazpacho	Re-proposed
2025.2 Flamingo	Re-proposed
2025.1 Epoxy	Introduced

Remove `/os-volumes_boot` API

Tue, 02 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/remove-os-volumes-boot-api

Remove the undocumented, unused /os-volumes_boot API.

Problem description

The /os-volumes_boot API is an undocumented, likely unknown alias for the /servers API. It serves no purpose other than to confuse users and clients, particularly in an era of auto-generated documentation and client tooling. We should remove it.

Use Cases

As a developer of client tooling, I do not wish to have to either support or special-case ignore an API that is not documented and duplicates existing APIs.

Proposed change

The /os-volumes_boot API and its child APIs will be modified so that they return HTTP 410 (Gone) for all resources starting from a new API microversion. While the APIs will continue to work for older microversions, we will mark the methods with the nova.api.openstack.wsgi.removed decorator to indicate that automatic client and documentation generation tooling should ignore them.
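
The microversion-dependent behavior can be illustrated with a small standalone check. This is not nova's API code: the helper names are hypothetical, and `REMOVED_IN` is a placeholder because the actual microversion is not decided here.

```python
from http import HTTPStatus

# Placeholder only; the real microversion is assigned when the change merges.
REMOVED_IN = (2, 999)


def parse_microversion(value):
    """Turn a string like '2.95' into a comparable (major, minor) tuple."""
    major, minor = value.split(".")
    return int(major), int(minor)


def removed_status(requested_microversion):
    """Return 410 Gone at or after the removal microversion, else None."""
    if parse_microversion(requested_microversion) >= REMOVED_IN:
        return HTTPStatus.GONE
    return None  # older microversions keep the existing behavior
```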

Alternatives

We could return HTTP 410 (Gone) for all microversions. This would be even easier for client tooling, but historically we have only done this out of necessity (typically because an underlying feature has been removed).

Data model impact

None.

REST API impact

The /os-volumes_boot API and all child APIs will return HTTP 410 (Gone) starting in the new API microversion.

Security impact

None.

Notifications impact

None.

Other end user impact

None. None of openstackclient, openstacksdk, python-novaclient, or Gophercloud currently support or use this API.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee: stephen.finucane
Other contributors: None

Feature Liaison

Feature liaison: stephen.finucane

Work Items

Remove the API

Dependencies

None.

Testing

None.

Documentation Impact

We need a release note. The API is not currently documented in the api-ref so no changes will be needed there.

References

None.

libvirt driver launching instances with memory encryption by AMD SEV-ES

Tue, 02 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/amd-sev-es-libvirt-support

This spec proposes the work required to extend the existing libvirt driver feature for launching AMD SEV-encrypted instances so that it also supports AMD SEV-ES, the extended version of AMD SEV, as a memory encryption mechanism.

Problem description

The current libvirt driver supports launching instances with memory encryption by AMD’s SEV (Secure Encrypted Virtualization) technology. However, the current implementation supports only AMD SEV and does not support newer versions. For example, SEV-ES also encrypts all CPU register contents when a VM stops running, to achieve more complete protection of VM data, but users can’t leverage these features because of this limitation.

Note

At the time of writing, AMD has already released CPUs which support SEV-SNP, but the hypervisor features required to use SEV-SNP are not yet merged into the underlying components (kernel, QEMU, libvirt and ovmf). So in this spec we focus on SEV-ES. We attempt to keep the proposal as compatible with SEV-SNP as possible, based on the implementations published by AMD.

Use Cases

As a cloud administrator, in order that my users can have more confidence in the security of their running instances, I want to provide an image with the specific properties or a flavor with the specific extra specs which will allow users to boot instances to ensure that their instances run on an SEV-ES-capable compute host with SEV-ES encryption, instead of SEV encryption, enabled.
As a cloud user, in order to reduce data leakage risks further, I want to be able to boot VM instances with SEV-ES functionality, instead of SEV functionality, enabled.

Proposed change

We propose extending the existing implementation to support launching instances with SEV-ES functionality.

Add detection of host SEV-ES capabilities, which checks the following items.
- The presence of the following XML in the response from a libvirt virConnectGetDomainCapabilities() API call indicates that both QEMU and the AMD Secure Processor (AMD-SP) support SEV functionality:
```
<domainCapabilities>
  ...
  <features>
    ...
    <sev supported='yes'>
      ...
    </sev>
  </features>
</domainCapabilities>
```
  Also the maxESGuests field should be present and its value should be positive (non-zero).
- /sys/module/kvm_amd/parameters/sev_es should have the value Y to indicate that the kernel has SEV-ES capabilities enabled. This should be readable by any user (i.e. even non-root).
- Check QEMU version to determine whether the available QEMU binary supports SEV-ES.
Add the new HW_CPU_AMD_SEV_ES trait to os-traits.
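The first two detection checks above can be sketched roughly as follows. This is an illustrative sketch only: the function names are hypothetical, the domain capabilities XML is passed in as a string rather than fetched through libvirt, and the QEMU version check is omitted.

```python
import xml.etree.ElementTree as ET

SEV_ES_PARAM = "/sys/module/kvm_amd/parameters/sev_es"


def kernel_sev_es_enabled(path=SEV_ES_PARAM):
    """Return True if the kvm_amd module reports SEV-ES as enabled."""
    try:
        with open(path) as f:
            return f.read().strip() == "Y"
    except OSError:
        # Missing file: kvm_amd not loaded or kernel too old.
        return False


def max_es_guests(domain_caps_xml):
    """Return the maxESGuests value from domain capabilities XML, or 0."""
    root = ET.fromstring(domain_caps_xml)
    sev = root.find("./features/sev")
    if sev is None or sev.get("supported") != "yes":
        return 0
    text = sev.findtext("maxESGuests")
    if text is None or not text.strip().isdigit():
        return 0
    return int(text)


def host_supports_sev_es(domain_caps_xml, param_path=SEV_ES_PARAM):
    """Combine the libvirt and kernel checks (QEMU version check omitted)."""
    return (max_es_guests(domain_caps_xml) > 0
            and kernel_sev_es_enabled(param_path))
```

Returning 0 when maxESGuests is absent means that hosts with a libvirt too old to report the field are treated as not SEV-ES capable, which matches the requirement that the field be present and positive.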

Make the libvirt driver update the ProviderTree object with the correct inventory for the MEM_ENCRYPTION_CONTEXT resource class for both SEV and SEV-ES. To represent the slots dedicated for SEV and SEV-ES, nested resource providers are created per-model:

+------------+     +----------------------------+
| compute RP +--+--+ SEV RP                     |
+------------+  |  | trait:HW_CPU_AMD_SEV       |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +----------------------------+
                +--+ SEV-ES RP                  |
                   | trait:HW_CPU_AMD_SEV_ES    |
                   +------------------------+---+
                   | MEM_ENCRYPTION_CONTEXT | N |
                   +------------------------+---+

The SEV RP is named <nodename>_amd_sev and the SEV-ES RP is named <nodename>_amd_sev_es, so that the RP names are unique in the cluster.
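As a rough illustration of the layout above, the nested providers for one compute node could be described like this. The function name and the returned dictionary shape are hypothetical and are not the ProviderTree API. Skipping a provider with no slots is also an assumption.

```python
def memory_encryption_providers(nodename, max_sev_guests, max_es_guests):
    """Describe the nested memory encryption providers for one node."""
    models = [
        ("amd_sev", "HW_CPU_AMD_SEV", max_sev_guests),
        ("amd_sev_es", "HW_CPU_AMD_SEV_ES", max_es_guests),
    ]
    providers = {}
    for suffix, trait, slots in models:
        if slots <= 0:
            continue  # assumption: no provider when there are no slots
        providers[f"{nodename}_{suffix}"] = {
            "traits": {trait},
            "inventory": {"MEM_ENCRYPTION_CONTEXT": {"total": slots}},
        }
    return providers
```

For a node named `compute1` with 100 SEV slots and 15 SEV-ES slots, this yields two providers named `compute1_amd_sev` and `compute1_amd_sev_es`, each with its own MEM_ENCRYPTION_CONTEXT inventory.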

Note

SEV and SEV-ES have separate limits on the number of guests, because ASIDs are allocated to ES guests and non-ES guests exclusively, from the total ASIDs available. The minimum ASID for SEV (non-ES) guests, which is effectively the same as the maximum ASID for ES guests, should be configured in BIOS (or UEFI) to use SEV-ES. A new validation to detect insufficient ASIDs may be implemented.

Note

SEV-SNP uses the same ASID pool as SEV-ES by default when ciphertext hiding is not requested. When SEV-SNP support is added, a new trait (such as HW_CPU_AMD_SEV_SNP) may be added to the existing SEV-ES RP, along with a separate SEV-SNP RP carrying the trait corresponding to the ciphertext hiding feature:

+------------+     +----------------------------+
| compute RP +--+--+ SEV RP                     |
+------------+  |  | trait:HW_CPU_AMD_SEV       |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +----------------------------+
                +--+ SEV-ES RP                  |
                |  | trait:HW_CPU_AMD_SEV_ES    |
                |  | trait:HW_CPU_AMD_SEV_SNP   |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +-----------------------------+
                +--+ SEV-SNP RP                  |
                   | trait:HW_CPU_AMD_SEV_SNP_CH |
                   +------------------------+----+
                   | MEM_ENCRYPTION_CONTEXT | N  |
                   +------------------------+----+

Note that SEV-SNP support is out of the current scope and this design needs further discussion when the support is actually implemented. It is described here to explain the potential plan to extend the RP structure in the future.

Add support for a new hw:mem_encryption_model parameter in flavor extra specs, and a new hw_mem_encryption_model image property. When either of these is set to amd-sev-es along with the parameter/property to enable memory encryption, it would be internally translated to resources:MEM_ENCRYPTION_CONTEXT=1 and trait:HW_CPU_AMD_SEV_ES=required, which would be added to the flavor extra specs in the RequestSpec object. If the new model parameter/property is absent or set to amd-sev, then it would be translated to resources:MEM_ENCRYPTION_CONTEXT=1 and trait:HW_CPU_AMD_SEV=required. If conflicting models are requested by the instance flavor and the instance image (for example the flavor has hw:mem_encryption_model=amd-sev but the image has hw_mem_encryption_model=amd-sev-es), then the request is rejected. The request should also be rejected when a memory encryption model is requested but memory encryption itself is not.
Change the libvirt driver to include extra XML in the guest’s domain definition when the hw:mem_encryption_model parameter in flavor extra spec or the hw_mem_encryption_model image property is present and is set to amd-sev-es. The extra XML is mostly similar to the one used in SEV, but its guest policy field needs the SEV-ES bit (bit 2) enabled.
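The translation and validation rules above, together with the policy bit change, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the `InvalidRequest` exception are hypothetical, boolean parsing of the existing hw:mem_encryption / hw_mem_encryption settings is reduced to a string comparison, and the base policy value used in the usage note is only an example.

```python
SEV_ES_POLICY_BIT = 1 << 2  # bit 2 of the guest policy, per the text above

MODEL_TRAITS = {
    "amd-sev": "HW_CPU_AMD_SEV",
    "amd-sev-es": "HW_CPU_AMD_SEV_ES",
}
DEFAULT_MODEL = "amd-sev"


class InvalidRequest(Exception):
    """Hypothetical error for a rejected request."""


def _is_true(value):
    # Simplified boolean parsing, for illustration only.
    return str(value).lower() == "true"


def translate_mem_encryption(extra_specs, image_props):
    """Return the resource/trait requests implied by flavor and image."""
    enabled = (_is_true(extra_specs.get("hw:mem_encryption"))
               or _is_true(image_props.get("hw_mem_encryption")))
    flavor_model = extra_specs.get("hw:mem_encryption_model")
    image_model = image_props.get("hw_mem_encryption_model")
    if flavor_model and image_model and flavor_model != image_model:
        raise InvalidRequest("flavor and image request conflicting models")
    model = flavor_model or image_model
    if model and not enabled:
        raise InvalidRequest("model requested without memory encryption")
    if not enabled:
        return {}
    model = model or DEFAULT_MODEL
    if model not in MODEL_TRAITS:
        raise InvalidRequest("unknown memory encryption model: %s" % model)
    return {
        "resources:MEM_ENCRYPTION_CONTEXT": "1",
        "trait:%s" % MODEL_TRAITS[model]: "required",
    }


def guest_policy(base_policy, model):
    """Set the SEV-ES bit on top of an existing SEV guest policy."""
    if model == "amd-sev-es":
        return base_policy | SEV_ES_POLICY_BIT
    return base_policy
```

For example, with an illustrative base policy of 0x33, `guest_policy(0x33, "amd-sev-es")` returns 0x37, while `guest_policy(0x33, "amd-sev")` leaves the value unchanged.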

Note

Guest attestation is currently out of our scope, because the existing feature for guest attestation heavily depends on hypervisor features and is not suitable for the confidential computing use case, where users do not trust hypervisors. We aim to implement the guest attestation feature once SEV-SNP is generally available, because SEV-SNP provides a better mechanism for guest attestation, using the special device presented to guest machines to obtain attestation reports.

Alternatives

None

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

The end user will harness SEV-ES through the existing mechanisms of resources in flavor extra specs and image properties.

The limitations of AMD SEV-encrypted guests also apply when SEV-ES is used.

Performance Impact

No performance impact on nova is anticipated.

The performance impact on other components is the same as for the existing SEV support feature.

Other deployer impact

In order for users to be able to use SEV-ES, the operator will need to perform the following steps:

Deploy SEV-ES-capable hardware as nova compute hosts.
- AMD EPYC 7xx2 (Rome) or later
Set minimum ASID for SEV (non-ES) guests in BIOS (or UEFI) to a value greater than 0.

Note

If SEV-enabled instances are already launched on the compute node, enough ASIDs should be reserved for SEV.
Ensure that they have an appropriately configured software stack, so that the various layers are all SEV-ES ready:
- kernel >= 4.16
- QEMU >= 6.0.0
- libvirt >= 8.0.0
- ovmf >= commit 7f0b28415cb4 2020-08-12
Note

SEV-ES enabled guests can be launched by libvirt >= 4.5, but detection of the maximum number of SEV-ES guests via the domain capabilities API requires libvirt >= 8.0.0.

A cloud administrator will need to define SEV-ES-enabled flavors as described above, unless it is sufficient for users to define SEV-ES-enabled images.

The [libvirt] num_memory_encrypted_guests option is effective only for SEV, but a new option for SEV-ES is NOT added. Instead, the detection capability in libvirt is required to use SEV-ES. The num_memory_encrypted_guests option will be deprecated to reduce complexity.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: kajinamit (irc: tkajinam)
Other contributors:: None

Work Items

Add the new HW_CPU_AMD_SEV_ES trait for os-traits
Add detection of host SEV-ES capabilities as detailed above and reshaping of the existing MEM_ENCRYPTION_CONTEXT resource.
Add mem_encryption_model property to ImageMeta object
Update scheduler util to request MEM_ENCRYPTION_CONTEXT resource and HW_CPU_AMD_SEV_ES trait when the mem_encryption_model property or the equivalent flavor extra spec is set to amd-sev-es
Update libvirt driver to set the SEV-ES policy bit when the property is present.
Update image property schema in glance to validate the new mem_encryption_model property.
Update documentation.

Unit tests and functional tests should be added according to new logic.

Future work

None

Dependencies

Special hardware which supports SEV-ES for development, testing, and CI.
Recent versions of the hypervisor software stack which all support SEV-ES, as detailed in Other deployer impact above.

Testing

The fakelibvirt test driver will need adaptation to emulate SEV-ES-capable hardware.

Corresponding unit/functional tests will need to be extended or added to cover:

detection of SEV-ES-capable hardware and software, e.g. perhaps as an extension of nova.tests.functional.libvirt.test_report_cpu_traits.LibvirtReportTraitsTests
the use of a trait to include extra SEV-specific libvirt domain XML configuration, e.g. within nova.tests.unit.virt.libvirt.test_config

Documentation Impact

Update the entry in the Feature Support Matrix to explain that AMD SEV-ES is now supported in addition to AMD SEV.
Update the existing AMD SEV guide to include information about SEV-ES.

Other non-nova documentation should be updated too:

The documentation for os-traits should be extended where appropriate.

References

History

Revisions
Release Name	Description
2024.2 Dalmatian	Approved
2025.1 Epoxy	Re-proposed
2025.2 Flamingo	Re-proposed

Remove legacy v2.0 API

Tue, 02 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/remove-v20-api

Remove the legacy v2.0 API.

Problem description

Nova introduced the v2.1 API over a decade ago. Since that time, we have continued to support the legacy v2.0 API, which was reimplemented as a shim around the v2.1 API. A decade is a long time, and the Compute API has grown and changed significantly over this time, hitting the 100th microversion in the Epoxy (2025.1) release. Deploying and maintaining the legacy API has a cost and there is no good reason why anyone would continue to use this over even the base microversion. It is also mostly undocumented from an end-user perspective. We should deprecate it so that we can remove it.

Use Cases

As a developer, I no longer want to concern myself with potential differences between v2.0 and v2.1.

As a developer of deployment tooling, I would like to be able to stop deploying an additional, unused endpoint.

As a library developer, I would like to be able to ignore the v2 API without feeling bad for doing so.

Proposed change

Change the API status to DEPRECATED. This will cause keystoneauth1 and recent versions of Gophercloud to ignore the API unless the user opts into it. This is a strong signal to users that the API is not long for the world, and will allow us to remove it in the H release.
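The effect of the status change on a client can be sketched as follows. This is a simplified model and not the keystoneauth1 or Gophercloud discovery code: the `usable_versions` helper and its `allow_deprecated` flag are hypothetical, and the version entries are trimmed to the two fields that matter here.

```python
# Trimmed version entries as they would look after the proposed change.
VERSIONS = [
    {"id": "v2.0", "status": "DEPRECATED"},
    {"id": "v2.1", "status": "CURRENT"},
]


def usable_versions(versions, allow_deprecated=False):
    """Return the version IDs a client would consider for use."""
    return [
        v["id"] for v in versions
        if allow_deprecated or v["status"] != "DEPRECATED"
    ]
```

A client with default settings would only see `v2.1`, while a user who explicitly opts in would still see both entries until the API is removed.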

Update all tests to remove confusing references to the /v2 path. In most cases, these are irrelevant since we call controllers directly and the path part of the URL is ignored, but updating things will make things clearer.

A “do not merge (DNM)” patch will be proposed removing the v2 API. This will serve to highlight any places we have missed things in the unit or functional tests.

Alternatives

Continue to pretend we support this in a meaningful way.

Data model impact

None.

REST API impact

The root version document will now report status DEPRECATED for the v2 API.

Security impact

None.

Notifications impact

None.

Other end user impact

None. All known clients use and rely on the microversioned endpoint.

Performance Impact

Negligible.

Other deployer impact

Deployment tooling will be encouraged to stop creating a legacy v2 endpoint.

Developer impact

The v2 API will no longer need to be considered when undertaking work on the API. Future changes to the API frameworks used will become somewhat easier.

Upgrade impact

The v2 legacy API will be deprecated. As such, applications that rely on this and use libraries that ignore deprecated APIs (like keystoneauth and recent Gophercloud) will need to be reworked to use the v2.1 API or to opt-in to the v2 API. It is expected that there are few to none of these applications in the wild nowadays.

Implementation

Assignee(s)

Primary assignee:: stephen.finucane

Feature Liaison

Feature liaison:: N/A

Work Items

Mark the API as deprecated
Update tests to use the v2.1 API or remove paths where irrelevant
Update docs to reflect deprecation and future plans for removal

Dependencies

None.

Testing

Unit tests should cover this.

Documentation Impact

References to the v2 API will be updated to highlight the deprecation and future plans for removal.

References

None.

Support for “one time use” devices

Tue, 02 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/one-time-use-devices

As the use of direct-passthrough accelerator devices increases, so does the need for some sort of post-use cleaning workflow between tenants. A NIC that is passed directly through to a guest may need to have known-good firmware re-written to it to make sure the previous user hadn’t violated it in some way. A GPU might have sensitive residue in memory that needs to be zeroed. An NVMe device is a storage medium that needs to be wiped or discarded.

Problem description

Currently there is no good way for operators to define and execute a device-cleaning workflow outside of Nova. Further, Nova does not intend to take on such tasks itself, in support of the long-term “no more orchestration” goal.

Use Cases

As an operator, I want to provide passthrough devices to instances with known-safe firmware and device state.

As a special-purpose cloud operator, I may have specialized hardware that requires special handling after use (power or bus resets, config initialization, etc).

As a cloud operator, I want to provide fast direct-passthrough storage support, but without risking information leakage between tenants.

As a cloud operator, I want to check the write-wear indicator on my passthrough NVMe devices after each user to avoid returning devices over the safety threshold to be allocated.

Proposed change

Nova will support “one time use” devices. That is a device where we will allocate it for a new instance only once. When that instance is deleted, the device will not be returned to an allocatable state automatically (as would normally happen) and instead remain in a reserved state until some action is taken by the operator’s own workflow to mark it as usable again. Making sure such a device is not re-allocatable (until cleaned) is a potentially very security-sensitive step that can not be missed, and it makes sense for Nova to do this itself, even though it will not take on the actual task of doing any device cleaning.

The annotation mechanism here will utilize the reserved inventory count, on top of PCI-in-placement. Basically, when Nova goes to allocate the device for the instance, it will follow up with a bump of the reserved count. When we go to de-allocate the device, we will not touch the reserved count, thus leaving the resource provider for the device fully-reserved (and thus not allocatable).

Note

This is expected to be used for PCI-in-placement and PF devices only due to the one-to-one resource provider accounting. A future change could enable this for VFs through another mechanism if we determine a need.

Through whatever workflow the operator decides, they can clean the device, and decrement the reserved count once they are ready for the device to rejoin the pool of allocatable devices again. This would likely be listening to notifications for deleted instances and scheduling such cleaning.

We will also introduce a new trait (tentatively called HW_ONE_TIME_USE) that nova will add to resource providers that it is managing as one-time-use. This will make it easier for operators to survey all the device providers that are potentially in need of cleaning. This will not convey whether or not cleaning is required (which is signaled by total=1,reserved=1,used=0) but rather that this device may need cleaning if the conditions are correct.
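The reserved-count lifecycle described above can be modelled with a small sketch. This is a toy model of a single-device inventory, not Nova or Placement code. The class and method names are hypothetical, and the capacity check ignores allocation ratios.

```python
from dataclasses import dataclass


@dataclass
class DeviceInventory:
    """Toy model of one one-time-use device's placement inventory."""
    total: int = 1
    reserved: int = 0
    used: int = 0

    def can_allocate(self):
        # Simplified capacity check: capacity is total minus reserved.
        return self.used + 1 <= self.total - self.reserved

    def allocate(self):
        if not self.can_allocate():
            raise RuntimeError("device is not allocatable")
        self.used += 1
        self.reserved = self.total  # bump reserved right after allocating

    def deallocate(self):
        self.used -= 1  # reserved is deliberately left untouched

    def needs_cleaning(self):
        return self.reserved == self.total and self.used == 0

    def mark_clean(self):
        """Operator workflow: return the device to the allocatable pool."""
        if self.used:
            raise RuntimeError("device is still in use")
        self.reserved = 0
```

After `allocate()` followed by `deallocate()`, the inventory is in the total=1, reserved=1, used=0 state that signals cleaning is required, and the device only becomes allocatable again once the operator's workflow calls something equivalent to `mark_clean()`.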

Implementation

The reservation of a device (i.e. “burning” its one-time-use) will happen in the compute node, (temporally) near where we do the claim and accounting in the PCI tracker. This will minimize the window for failure after which the device will be “burned” but not actually used by the instance. At the end of instance_claim() in the resource tracker, we currently call _update() which calls _update_to_placement(). There, we do some inventory and allocation healing, including of placement-tracked PCI devices. Within this inventory-healing routine, we will reserve PCI devices that are allocated since we are already auditing (and healing/updating) inventory as needed.

Note

From this point on, we will use the term “burned” to refer to a device that has been reserved such that it will not be re-allocated. This happens before the point at which the instance is able to run with it (in all situations) and remains in that state until an external action drops the reserved count back to zero. In other words, “burned” means reserved=total.

By doing this in the above described way we will get synchronous reservation of the devices (i.e. it will happen before the instance starts running) as late as is reasonable. We will also get the ability to “heal” already-allocated devices into reserved state if they happen to be marked as one-time-use by the administrator at a later time.

Move operations will function similarly, as the _move_claim() method also calls _update() synchronously after the local claims are completed. It should be noted that a move of an instance with a one-time-use device will “burn” the device on the destination as soon as it starts running there (i.e. when it reaches the verify state) and a revert will not “un-burn” it.

Lifecycle Operations

Technically one-time-use devices should be able to fully participate in all of the instance lifecycle operations. There are some caveats however, so a few cases are discussed below:

Rebuild: The device can be re-used in place without any other action
Evacuate: The original device will have been “burned” when the instance was booted and will remain as such after the original host is recovered and it removes the allocation for the original instance. The process of evacuation will allocate and burn a new device on the new host during the boot process.
Cold migrate: A new device will be allocated on the destination when the instance is being started there. Once the instance reaches the verify state, the destination’s new device will be burned. On confirm, the source device will remain burned, and on revert, the destination device will have been burned. Note that state (i.e. data) on a stateful device will not be copied by Nova.
Live migrate: If the device is already live migratable, then it will be allowed, with the source device remaining “burned” after the operation completes and of course the new device on the destination will be burned in the process of the migration.

Note

We will need a change in placement to allow over-subscribed resource providers to progress “towards safety”, meaning “become less over-subscribed”. For one-time-use devices we must be allowed to swap the instance’s allocation for the migration UUID on the source node, even though the provider is already technically over-subscribed due to the device being reserved. Note that this is already a problem in Nova/Placement and we have multiple bugs reported against this, where a change in allocation ratio resulting in over-subscription will prevent operators from migrating instances away. We need to fix this anyway, and that fix will also apply to one-time-use devices. Until then, migrate operations (cold, resize, and live) will be (implicitly) blocked for one-time-use devices. Fixing this will be slightly outside the scope of this spec, but is expected to be completed in parallel or just afterwards.

Note

Evacuation without consulting the scheduler may result in us sending an instance to a host requesting a PCI device for which there was no prior check for whether it is allocatable (i.e. already burned). We need to make sure that whatever happens on the compute node in this case will fail before assigning the device to an instance (which should happen during ResourceTracker._update() as part of the allocation healing).

Alternatives

One alternative is to do nothing and continue to operate as we do today. Nova intentionally does not provide any device cleaning ability, nor any real hooks or integration for operators desiring it.

Another alternative is to say that this is in the scope of Cyborg, and it is. Nova officially recognizes Cyborg as the solution for external, stateful device prep, cleaning, and lifecycle management and this does not change that. The one-time-use-devices idea sits somewhere in the middle of “do nothing” and “do it in Cyborg” in that it’s a _very_ small change to nova to allow an external integration for which we have existing APIs for people to do what they need in a simpler case. Certainly from the perspective of an operator whose device is not supported in Cyborg, a simpler workflow would make it easier to craft a homegrown solution. For an operator with bespoke (maybe scientific) hardware, requiring them to write a full Cyborg driver in order to call a shell script after each use is a big ask.

Data model impact

There should be no data model impact if we use the existing PCI dev_spec to flag a device as one_time_use=(yes|no). This is a similar approach to the recent migrate-vfio-devices-using-kernel-variant-drivers spec which allows operators to flag them as live_migratable=(yes|no).
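Reading such a flag from a single device spec entry might look like the sketch below. The helper name and the vendor and product IDs in the usage note are hypothetical, and the real parsing would live in Nova's existing PCI device spec handling rather than in a standalone function.

```python
import json


def is_one_time_use(dev_spec_json):
    """Return True if a JSON device spec entry sets one_time_use=yes."""
    spec = json.loads(dev_spec_json)
    value = str(spec.get("one_time_use", "no")).lower()
    if value not in ("yes", "no"):
        raise ValueError("one_time_use must be 'yes' or 'no'")
    return value == "yes"
```

For instance, an entry such as `{"vendor_id": "1234", "product_id": "5678", "one_time_use": "yes"}` would be treated as one-time-use, while an entry without the key keeps today's behaviour.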

REST API impact

None.

Security impact

No direct security impact, although it will theoretically allow operators to improve security of device-passthrough workloads by sanitizing or re-initializing their devices between uses.

Notifications impact

None.

Other end user impact

None (invisible to users).

Performance Impact

This will involve a single additional call to placement to update the inventory after we allocate the device. This should be negligible in terms of performance impact, and the error handling will be identical to that of the case where we fail to do the allocation itself.

Other deployer impact

Deployers who do not wish to use this feature will not be impacted. Those that do will be able to enable this via config for their PCI devices and write their own external integrations based on the assumption that devices will remain reserved after allocation.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: danms

Feature Liaison

Feature liaison:: N/A

Work Items

Parse one_time_use from [pci]dev_spec config
Add code to bump reserved count when we update allocations and inventories for the PCI device in placement in the instance_claim() path
Add documentation and a sample cleanup listener script
Work on squashing placement bug_1943191 (probably in parallel)

Dependencies

This has a soft dependency on a fix to Placement that allows swapping an allocation while over-subscribed. While not strictly required, fixing this long-standing issue will enable cold migration of one-time-use devices.

Testing

This will be tested fully in unit/functional tests since it requires a real device to test with tempest.

One-off testing with real devices will be performed locally during review and submission.

Documentation Impact

Operator documentation will be added explaining the meaning of the flag, and the guarantees it makes that the operators can rely on. A sample script for processing device cleanup will be provided as a starting point, but extensive documentation on how to do that robustly will be left to the consumer.

References

The mechanism for tagging devices is nearly identical to this recent effort:

https://specs.openstack.org/openstack/nova-specs/specs/2025.1/approved/migrate-vfio-devices-using-kernel-variant-drivers.html

History

Revisions
Release Name	Description
2025.2 Flamingo	Introduced

Policy Manager Role Default

Tue, 02 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/policy-manager-role-default

This is SRBAC goal phase-3

Problem description

Currently, the compute API policy defaults use the admin (admin across all projects), project member, and project reader roles. But there are many project-level APIs which should default to a user who is more privileged than a normal user (member or reader role). Instead of restricting such APIs to the global admin, we should have a more privileged user within the project.

Use Cases

Restrict project-level management APIs to someone who is less privileged than admin and more privileged than the project member role.

Proposed change

Keystone introduced a new ‘manager’ role at the project level.

A project-manager can use project-level management APIs and is denoted by someone with the manager role on a project. It is intended to perform more privileged operations than project-member on its project resources. A project-manager can also perform any operations allowed to a project-member or project-reader (this is handled by the keystone role implication so that the admin role implies manager, the manager role implies member, the member role implies reader). One good example for Nova to use manager role is in locking and unlocking an instance.
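The role implication chain described above can be illustrated with a small sketch. It is a stand-alone model of the implied-role expansion, not keystone code, and the names `IMPLIES` and `effective_roles` are hypothetical.

```python
# Each role implies the next, less privileged role.
IMPLIES = {"admin": "manager", "manager": "member", "member": "reader"}


def effective_roles(assigned):
    """Expand directly assigned roles with everything they imply."""
    roles = set()
    for role in assigned:
        while role is not None and role not in roles:
            roles.add(role)
            role = IMPLIES.get(role)
    return roles
```

A user assigned only the manager role on a project therefore also passes any check written against member or reader, which is why the manager-only defaults do not need to list those roles explicitly.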

project-manager persona in the policy check string:

policy.RuleDefault(
    name="project_manager",
    check_str="role:manager and project_id:%(project_id)s",
    description="Default rule for project-level management APIs."
)

Using it in a policy rule (with admin + manager access). Because we want to keep the legacy admin behavior the same, we need to continue giving the admin role access to project-level management APIs too:

policy.DocumentedRuleDefault(
    name='os_compute_api:os-migrate-server:migrate',
    check_str='role:admin or (role:manager and project_id:%(project_id)s)',
    description="Cold migrate a server without specifying a host",
    operations=[
        {
            'method': 'POST',
            'path': '/servers/{server_id}/action (migrate)'
        }
    ],
)

The policies for the APIs below will default to PROJECT_MANAGER_OR_ADMIN.

Current default: ADMIN -> New default: PROJECT_MANAGER_OR_ADMIN:

‘os_compute_api:os-migrate-server:migrate’ (“Cold migrate a server without specifying a host”)
‘os_compute_api:servers:migrations:force_complete’ (“Force an in-progress live migration for a given server”)
‘os_compute_api:servers:migrations:delete’ (“Delete(Abort) an in-progress live migration”)

Current default: PROJECT_MEMBER_OR_ADMIN -> New default: PROJECT_MANAGER_OR_ADMIN:

Note

This makes the APIs below more restrictive. Currently they are allowed for member and admin users, but after this change they will be allowed only for ‘manager’ and ‘admin’ users (disallowed for ‘member’ users).

‘os_compute_api:os-deferred-delete:restore’ (“Restore a soft deleted server”)
‘os_compute_api:os-deferred-delete:force’ (“Force delete a server before deferred cleanup”)

Introducing new policies to allow more operations for ‘manager’ users:

There are some APIs (listed below) which should be allowed for the manager user, but today a single policy covers both the general operation (for example, migrating a server) and the host-specific variant (migrating to a specific host, or returning host info in the API response). To keep host-specific operations and info restricted to admin while opening the rest to admin-or-manager, we need to introduce separate new policies for the host-specific parts, which will default to admin (meaning no change for host-specific behavior). The existing policies will be used for the non-host parts and will default to admin-or-manager.

Live migrate:
- Existing policy:
  - os_compute_api:os-migrate-server:migrate_live (live migrate server)
    - Default changing from ADMIN -> PROJECT_MANAGER_OR_ADMIN
- New policy:
  - os_compute_api:os-migrate-server:migrate_live:host (live migrate server to specific host)
    - Default: ADMIN
List server (in-progress live) migration:
- Existing policy:
  - os_compute_api:servers:migrations:index (Lists in-progress live migrations for a given server)
    - Default changing from: ADMIN -> PROJECT_MANAGER_OR_ADMIN
- New policy:
  - os_compute_api:servers:migrations:index:host (Lists in-progress live migrations for a given server with host info)
    - Default: ADMIN
List migrations:
- Existing policy:
  - os_compute_api:os-migrations:index (List migrations without host info)
    - Default changing from: ADMIN -> PROJECT_MANAGER_OR_ADMIN
- New policy:
  - os_compute_api:os-migrations:index:host (List migrations with host info)
    - Default: ADMIN
  - os_compute_api:os-migrations:index:all_projects (List migrations cross projects)
    - Default: ADMIN
    - This API allows listing migrations for all projects or across projects. Because we are opening the existing index policy to the project manager user, we need a separate new policy to ensure that only admin can access all-project or cross-project migrations and that a project manager can only access their own project's migrations.
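The split between a base policy and a host-specific policy can be sketched as follows for live migration. This is a stand-alone illustration that does not use oslo.policy: the helper functions, the `RULES` table, and the credential and target dictionaries are hypothetical, and only the policy names and defaults come from the list above.

```python
def is_admin(creds):
    return "admin" in creds["roles"]


def is_project_manager(creds, target):
    return ("manager" in creds["roles"]
            and creds["project_id"] == target["project_id"])


BASE = "os_compute_api:os-migrate-server:migrate_live"
HOST = "os_compute_api:os-migrate-server:migrate_live:host"

RULES = {
    # Existing policy: default changes to PROJECT_MANAGER_OR_ADMIN.
    BASE: lambda creds, target: (is_admin(creds)
                                 or is_project_manager(creds, target)),
    # New policy: host-specific requests stay ADMIN only.
    HOST: lambda creds, target: is_admin(creds),
}


def can_live_migrate(creds, target, host=None):
    """Check the base policy, then the host policy if a host is given."""
    if not RULES[BASE](creds, target):
        return False
    if host is not None and not RULES[HOST](creds, target):
        return False
    return True
```

With these defaults a project manager can live migrate a server in their own project when the scheduler picks the destination, but a request that names a specific host still requires admin.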

Note

Currently, a project member can perform the server actions below. It might not be a good idea to add stricter access control to them, so we will continue to allow project member users to perform these actions. With keystone implied roles, a project manager can also perform the actions below on servers in their project.

‘os_compute_api:os-lock-server:lock’ (“Lock a server”)
‘os_compute_api:os-lock-server:unlock’ (“Unlock a server”)
‘os_compute_api:os-pause-server:pause’ (“Pause a server”)
‘os_compute_api:os-pause-server:unpause’ (“Unpause a paused server”)
‘os_compute_api:os-rescue’ (“Rescue a server”)
‘os_compute_api:os-unrescue’ (“Unrescue a server”)
‘os_compute_api:os-suspend-server:resume’ (“Resume suspended server”)
‘os_compute_api:os-suspend-server:suspend’ (“Suspend server”)
‘os_compute_api:servers:resize’ (“Resize a server”)
‘os_compute_api:servers:confirm_resize’ (“Confirm a server resize”)
‘os_compute_api:servers:revert_resize’ (“Revert a server resize”)
‘os_compute_api:servers:reboot’ (“Reboot a server”)
‘os_compute_api:servers:rebuild’ (“Rebuild a server”)
‘os_compute_api:servers:rebuild:trusted_certs’ (“Rebuild a server with trusted image certificate IDs”)

Alternatives

Keep having admin or member perform all project-level management operations.

Data model impact

None

REST API impact

The policy defaults for the APIs below will be changed:

Current default: ADMIN -> New default: PROJECT_MANAGER_OR_ADMIN:

‘os_compute_api:os-migrate-server:migrate’
‘os_compute_api:servers:migrations:force_complete’
‘os_compute_api:servers:migrations:delete’
‘os_compute_api:os-migrate-server:migrate_live’
‘os_compute_api:servers:migrations:index’
‘os_compute_api:os-migrations:index’

Current default: PROJECT_MEMBER_OR_ADMIN -> New default: PROJECT_MANAGER_OR_ADMIN:

‘os_compute_api:os-deferred-delete:restore’
‘os_compute_api:os-deferred-delete:force’

Introducing the new policies below, which default to ADMIN:

‘os_compute_api:os-migrate-server:migrate_live:host’
‘os_compute_api:servers:migrations:index:host’
‘os_compute_api:os-migrations:index:host’
‘os_compute_api:os-migrations:index:all_projects’

Security impact

Provides more secure RBAC by adding a project manager role to handle project resource management activities.

Notifications impact

None

Other end user impact

By default, the API policies below will no longer be allowed for users with the ‘member’ role; they need the ‘manager’ role in their project to continue performing these operations.

‘os_compute_api:os-deferred-delete:restore’
‘os_compute_api:os-deferred-delete:force’

Performance Impact

None

Other deployer impact

The policy defaults of the APIs below are changed from the member role to the manager role. Make sure to override the required permission in policy.yaml, or move the deployment to the new defaults.

‘os_compute_api:os-deferred-delete:restore’
‘os_compute_api:os-deferred-delete:force’

New policies are introduced to control host-specific operations and information. The defaults of the policies below are changed to also allow the project ‘manager’ role.

‘os_compute_api:os-migrate-server:migrate_live’
‘os_compute_api:servers:migrations:index’

If you have overridden the above policies with another permission, then apply the same override to the new policies as well:

‘os_compute_api:os-migrate-server:migrate_live:host’
‘os_compute_api:servers:migrations:index:host’
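
A minimal sketch of how an operator might check for this situation. It assumes the overrides from policy.yaml have already been loaded into a dictionary; the helper name and the example check string ‘role:operator’ are hypothetical, while the policy names are the ones listed above.

```python
# Existing policy -> new host-specific policy introduced alongside it.
HOST_POLICY_FOR = {
    "os_compute_api:os-migrate-server:migrate_live":
        "os_compute_api:os-migrate-server:migrate_live:host",
    "os_compute_api:servers:migrations:index":
        "os_compute_api:servers:migrations:index:host",
}


def missing_host_overrides(overrides):
    """Return {new_policy: check_str} for overrides that were not mirrored.

    `overrides` maps policy names to the check strings set by the operator.
    """
    return {
        host_policy: overrides[base_policy]
        for base_policy, host_policy in HOST_POLICY_FOR.items()
        if base_policy in overrides and host_policy not in overrides
    }
```

The returned mapping is only a suggestion to review; whether the host-specific policy should really carry the same permission is a deployment decision.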

Developer impact

New APIs must add policies that follow the new pattern.

Upgrade impact

New policies are introduced to control host-specific operations and information. The defaults of the policies below are changed to also allow the project ‘manager’ role.

‘os_compute_api:os-migrate-server:migrate_live’
‘os_compute_api:servers:migrations:index’

If you have overridden the above policies with another permission, then apply the same override to the new policies as well:

‘os_compute_api:os-migrate-server:migrate_live:host’
‘os_compute_api:servers:migrations:index:host’

Implementation

Assignee(s)

Primary assignee:: gmann

Feature Liaison

Feature liaison:: gmann

Work Items

Change the defaults of the project-level management APIs to the manager role
Modify policy rule unit tests to use service and manager role tokens
Move Tempest tests of changed policies to the new defaults

Dependencies

None

Testing

Modify or add the policy unit tests. Move Tempest tests of changed policies to new defaults.

Documentation Impact

The manager role API defaults will be updated in the policy rule documentation as well as in the policy sample file.

References

History

Revisions
Release Name	Description
2025.2 Flamingo	Introduced

Policy Service Role Default

Tue, 02 Sep 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/policy-service-role-default

Ideally, internal service-to-service APIs should not be accessible by admin or end users by default. The policy defaults should make clear which APIs are meant to be used by admin or end users and which are meant for internal service-to-service communication.

Problem description

Currently, internal service-to-service communication APIs default to either admin or project roles, which means operators need to assign the admin or project roles to their service users. Giving a service user admin or project role access is poor security practice, as it can then perform admin or project-level operations.

Another problem is that APIs which are meant to be used only by internal services can be called by regular users and human admins. Requiring (and allowing only) a service role for these APIs helps avoid intentional and accidental abuse.

Use Cases

As an operator, I want service role users to access service-to-service APIs with least privilege.

Proposed change

We need to make sure all the policy rules for internal service-to-service APIs default to the service role only. Example:

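from oslo_policy import policy

policy.DocumentedRuleDefault(
    name='os_compute_api:os-server-external-events:create',
    check_str='role:service',
    description='Create one or more external events',
    operations=[{'method': 'POST', 'path': '/os-server-external-events'}],
    scope_types=['project'],
)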

Keystone’s service role is kept outside of the existing role hierarchy that includes admin, member, and reader. Keeping the service role outside the current hierarchy ensures we’re following the principle of least privilege for service accounts.

We need to make all APIs that are suitable only for services default to the service role only. However, some APIs may be intended for both service usage and admin (or any other user role) usage. Such policy rules need to default to the service role as well as the admin (or other user) role, for example ‘role:admin or role:service’.
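
The difference between the two kinds of default can be sketched with a deliberately simplified evaluator. It is not oslo.policy: the hypothetical allows helper only understands ‘role:<name>’ terms joined by ‘or’, and it takes the caller's roles as a plain set.

```python
def allows(check_str, roles):
    """Evaluate a check string made only of 'role:<name>' terms joined by 'or'."""
    prefix = "role:"
    terms = [term.strip() for term in check_str.split(" or ")]
    return any(
        term.startswith(prefix) and term[len(prefix):] in roles
        for term in terms
    )
```

Because the service role sits outside the admin, member and reader hierarchy, a token carrying only admin does not satisfy ‘role:service’ in this model, but it does satisfy ‘role:admin or role:service’.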

As Nova has dropped the system scope implementation, service-to-service communication with the service role will be done with a project-scoped token (as is currently done in the devstack setup).

The policies of the APIs below will default to the service role:

os_compute_api:os-assisted-volume-snapshots:create
os_compute_api:os-assisted-volume-snapshots:delete
os_compute_api:os-volumes-attachments:swap
os_compute_api:os-server-external-events:create

Alternatives

Keep the service-to-service API defaults as they are and expect operators to manage the access permissions of service role users by overriding them in policy.yaml.

Data model impact

None

REST API impact

The policies of the APIs below will default to the service role:

os_compute_api:os-assisted-volume-snapshots:create
os_compute_api:os-assisted-volume-snapshots:delete
os_compute_api:os-volumes-attachments:swap
os_compute_api:os-server-external-events:create

Security impact

Service-to-service API policies become easier to understand and are restricted to least privilege.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Developer impact

New APIs must add policies that follow the new pattern.

Upgrade impact

If service-to-service APIs are used by admin or end users, make sure to override the required permission in policy.yaml, because by default they will be accessible only to service role users. Deployments that override these policies need to start considering the new default policy rules.

Implementation

Assignee(s)

Primary assignee:: gmann

Feature Liaison

Feature liaison:: dansmith

Work Items

Modify the service-to-service APIs defaults
Modify policy rule unit tests

Dependencies

None

Testing

Modify or add the policy unit tests.

Add a job that enables the new defaults and runs the Tempest tests to make sure existing service-to-service API communication works. If needed, modify the token used by services as per the new defaults.

Documentation Impact

The API Reference should be updated to list all service-to-service APIs in a separate section and to mention the service role as their default.

References

History

Revisions
Release Name	Description
2023.1	Introduced
2023.2	Re-proposed
2025.2	Re-proposed

OpenAPI Schemas

Fri, 29 Aug 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/openapi-3

We would like to start documenting our APIs in an industry-standard, machine-readable manner. Doing so opens up many opportunities for OpenStack developers and OpenStack users alike, notably the ability to auto-generate and auto-validate both client tooling and documentation. Of the many API description languages available, OpenAPI (fka “Swagger”) appears to be the one with the largest developer mindshare and the best fit for OpenStack due to the existing tooling used in many OpenStack services, so we will use this format.

Note

This is a continuation of a spec that was previously approved in Dalmatian (2024.2) and Epoxy (2025.1). We merged all of the groundwork for this in Dalmatian and worked on the response body schemas in Epoxy but did not complete them.

Problem description

The history of API description languages has been mainly a history of half-baked ideas, unnecessary complication, and in general lots of failure. This history has been reflected in OpenStack’s own history of attempting to document APIs, starting with our early use of WADL through to our experiments with Swagger 2.0 and RAML, leading to today’s use of our custom os_api_ref project, built on reStructuredText and Sphinx.

It is only in recent years that things have started to stabilise somewhat, with the development of widely used API description languages like OpenAPI, RAML and API Blueprint, as well as supporting SaaS tools such as Postman and Apigee. OpenAPI in particular has seen broad adoption across multiple sectors, with sites as varied as CloudFlare and GitHub providing OpenAPI schemas for their APIs. OpenAPI has evolved significantly in recent years and now supports a wide variety of API patterns including things like webhooks. Even more beneficial for OpenStack, OpenAPI 3.1 is a full superset of JSON Schema meaning we have the ability to re-use much of the validation we already have.

Use Cases

As an end user, I would like to have access to machine-readable, fully validated documentation for the APIs I will be interacting with.

As an end user, I want statically viewable documentation hosted as part of the existing docs site without requiring a running instance of Nova.

As an SDK/client developer, I would like to be able to auto-generate bindings and clients, promoting consistency and minimising the amount of manual work needed to develop and maintain these.

As a Nova developer, I would like to have a verified API specification that I can use should I need to replace the web framework/libraries we use in the event they are no longer maintained.

Proposed change

This effort can be broken into a number of distinct steps:

Add a new decorator for removed APIs and actions

We have a number of APIs and actions that no longer have backing code and return HTTP 410 (Gone) or HTTP 400 (Bad Request), respectively. We will not add schemas for these in the initial attempt at this so we need some mechanism to indicate this. We will add a new removed decorator that will highlight these removed APIs and indicate the version they were removed in and the reason for their removal. We can later use this as a heuristic in our tests to skip schema checks for these methods.
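
A minimal sketch of what such a decorator and test heuristic could look like. The attribute names, the example controller, and the version and reason strings are hypothetical; Nova's actual decorator may differ.

```python
def removed(version, reason):
    """Mark a handler as removed, recording when and why."""
    def decorator(func):
        func.removed = True
        func.removed_version = version
        func.removed_reason = reason
        return func
    return decorator


class ExampleController:
    @removed("18.0.0", "This API only worked with a removed backend.")
    def index(self, req):
        # A real handler would return HTTP 410 (Gone) here.
        raise NotImplementedError("removed API")

    def show(self, req, id):
        return {"id": id}


def needs_schema(method):
    """Heuristic for tests: only non-removed handlers need schemas."""
    return not getattr(method, "removed", False)
```

A schema-coverage test could then iterate over controller methods and skip those for which needs_schema returns False.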

Note

This was completed in Dalmatian (2024.2).

Add missing request body and query string schemas

There is already good coverage of both request bodies and query string parameters but it is not complete. A list of incomplete schemas is given at the end of this section. The additional schemas will merely validate what is already allowed, which will mean extensive use of "additionalProperties": true or empty schemas. Put another way, an API that currently ignores unexpected request body fields or query string parameters will continue to ignore them. We may wish to make these stricter, as we did for most APIs in microversion 2.75, but that is a separate issue that should be addressed separately.
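
A minimal sketch of what such a permissive schema might look like for an action that takes no meaningful body. The schema content and the accepts helper are illustrative assumptions; the helper only understands the required and additionalProperties keywords and merely stands in for a full JSON Schema validator.

```python
# Hypothetical permissive request body schema for a "pause"-style action.
pause = {
    "type": "object",
    "properties": {
        "pause": {},  # empty schema: any value is accepted
    },
    "required": ["pause"],
    "additionalProperties": True,  # unexpected fields are still ignored
}


def accepts(schema, body):
    """Tiny stand-in validator covering only the keywords used above."""
    if not isinstance(body, dict):
        return False
    if any(key not in body for key in schema.get("required", [])):
        return False
    if schema.get("additionalProperties", True) is False:
        return all(key in schema.get("properties", {}) for key in body)
    return True
```

With "additionalProperties": True, a body carrying an unexpected extra field is still accepted, which matches the stated goal of validating only what is already allowed.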

Once these specs are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate schemas.

Note

This was completed in Dalmatian (2024.2).

Add response body schemas

These will be sourced from existing OpenAPI schemas, currently published at github.com/gtema/openstack-openapi, from Tempest’s API schemas, and where necessary from new schemas auto-generated from JSON response bodies generated in tests and manually modified to handle things like enum values.

Once these are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate response body schemas. In addition, we will add a new configuration option that will control how we do verification at the API layer, [api] response_validation. This will be an enum value with three options:

error
Raise an HTTP 500 (Server Error) in the event that an API returns an “invalid” response.

This will be the default in CI, i.e. for our unit, functional and integration tests. This should not be used in production. The help text of the option will indicate this and we will mark the option as advanced.

warn
Log a warning about an “invalid” response, prompting operators to file a bug report against Nova.

This will be the initial (and likely permanent) default in production.

ignore
Disable API response body validation entirely. This is an escape hatch in case we mess up.
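
The three modes can be sketched as follows. This is an illustration under stated assumptions rather than the Nova implementation: the function name, the InvalidResponse exception (standing in for an HTTP 500) and the is_valid callable are all hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


class InvalidResponse(Exception):
    """Stand-in for an HTTP 500 (Server Error) response."""


def validate_response(body, is_valid, mode="warn"):
    """Apply the three proposed modes to a response body.

    `is_valid` is any callable returning True when the body matches its schema.
    """
    if mode not in ("error", "warn", "ignore"):
        raise ValueError(f"unknown response_validation mode: {mode}")
    if mode == "ignore" or is_valid(body):
        return body
    if mode == "error":
        raise InvalidResponse("response body does not match its schema")
    # mode == "warn": log and return the response unchanged.
    LOG.warning("Response body does not match its schema; please file a bug")
    return body
```

Note that in the ignore mode the validator is never invoked, so the escape hatch also avoids the validation cost.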

Note

The tooling required to gather these JSON Schema schemas and generate an OpenAPI schema will not be developed inside Nova and is therefore not covered by this spec. Nova will merely consume the resulting tooling for use in documentation. It is intended that the same tool will be usable across any OpenStack project that uses the same web frameworks (in Nova’s case, WebOb + Routes).

Note

The impact of middleware that modifies either the request or response will not be accounted for in this change. This is because these are configurable and they cannot be guaranteed to exist in a given deployment. Examples include the sizelimit middleware from oslo.middleware and the auth_token middleware from keystonemiddleware.

Alternatives

Use a different tool

We could use a different tool than OpenAPI to publish our specs. In a manner of speaking we already do this - albeit not in a machine-readable manner - through our use of os-api-ref.

This idea has been rejected because OpenAPI is clearly the best tool for the job. It is the most widely used API description language available today and aligns well with our existing use of JSON Schema for API validation. While it does not support OpenStack’s microversion API design pattern out-of-the-box, previous experiments have demonstrated that it is extensible enough to add this.

Maintain these specs out-of-tree

We could use a separate repo to store and maintain specs for Nova and the other OpenStack services.

This idea has been rejected because it prevents us testing the specs on each commit to Nova and means work that could be spread across multiple teams is instead focused on one small team. It will result in more bugs and a lag between changes to the Nova API and changes to the out-of-tree specs. It will result in duplication of effort across Nova, Tempest, and the specs projects.

Publish the spec via an API resource rather than in our docs

We could publish the spec via a new, unversioned API endpoint such as /spec. A GET request to this would return the full spec, either statically generated at deployment time or dynamically generated (and then cached) at runtime.

This is rejected because it brings limited advantages and multiple disadvantages. Nova’s API is designed to be backwards-compatible and non-extensible. As such, a user with the latest version of the spec should be able to use it to communicate with any OpenStack deployment running a version of Nova that supports microversions. It is also expected that the “master” version of the spec will continuously improve as things are tightened up, documentation is improved, and bugs or mistakes are corrected. We want consumers of the spec to see these changes immediately rather than wait for their deployment to be updated. Finally, OpenStack’s previous forays into discoverable APIs, such as Keystone’s use of JSONHome or Glance’s attempts to publish resource schemas, have seen limited take-up outside of the projects themselves. Taken together, this all suggests there is no reason or advantage to publishing deployment-specific specs and users would be better served by fetching the latest version of the spec from the api-ref documentation published on docs.openstack.org (which, one should note, is itself intentionally unversioned).

Data model impact

None.

REST API impact

There will be no direct REST API impact. Users will see an HTTP 500 error if they set [api] response_validation = error and encounter an invalid response. However, we will not encourage use of this option in production and will instead focus on validating this ourselves in CI.

We may wish to address issues that are uncovered as we add schemas, but this work is considered secondary to this effort and can be tackled separately.

Security impact

None.

Notifications impact

None.

Other end user impact

This should be very beneficial for users who are interested in developing clients and bindings for OpenStack. In particular, this should (after an initial effort in code generation) reduce the workload of the SDK team as well as teams outside of OpenStack that work on client tooling such as the Gophercloud team.

Performance Impact

There will be a minimal impact on API performance when validation is enabled as we will now verify both requests and responses for all API resources. Given our existing extensive use of JSON Schema for API validation, it is expected that this should not be a significant issue.

Other deployer impact

As noted previously, there will be one new config option, [api] response_validation. Operators may see increased warnings in their logs due to incomplete schemas, but most if not all of these issues should be ironed out by our CI coverage.

Developer impact

Developers working on the API microversions will now be encouraged to provide JSON Schema schemas for both requests and responses.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: stephenfinucane
Other contributors:: gtema

Feature Liaison

None.

Work Items

Add missing request body schemas
Add tests to validate existence of request body schemas
Add missing query string schemas
Add tests to validate existence of query string schemas
Add response body schemas
Add decorator to validate response body schemas against response
Add tests to validate existence of response body schemas

Dependencies

The actual generation of OpenAPI documentation will be achieved via a separate tool. It is not yet determined if this tool will live inside an existing project, such as os_api_ref or openstacksdk, or inside a wholly new project. In any case, it is envisaged that this tool will handle OpenStack-specific nuances like microversions that don’t map 1:1 to OpenAPI concepts in a consistent and documented fashion.

Testing

Unit tests will ensure that schemas eventually exist for request bodies, query strings, and response bodies.

Unit, functional and integration tests will all work together to ensure that response body schemas match real responses by setting [api] response_validation to error.

Documentation Impact

Initially there should be no impact as we will continue to use os_api_ref as-is for our api-ref docs. Eventually we will replace or extend this extension to generate documentation from our OpenAPI schema.

References

APIs missing schemas

These are the APIs that are currently (as of 2024-04-11, commit 1bca24aeb) missing API request body schemas and query string schemas.

Missing request body schemas

AdminActionsController._inject_network_info
AdminActionsController._reset_network
AgentController.create
AgentController.update
BareMetalNodeController._add_interface
BareMetalNodeController._remove_interface
BareMetalNodeController.create
CellsController.create
CellsController.sync_instances
CellsController.update
CertificatesController.create
CloudpipeController.create
CloudpipeController.update
ConsolesController.create
DeferredDeleteController._force_delete
DeferredDeleteController._restore
FixedIPController.reserve
FixedIPController.unreserve
FloatingIPBulkController.create
FloatingIPBulkController.update
FloatingIPController.create
FloatingIPDNSDomainController.update
FloatingIPDNSEntryController.update
LockServerController._unlock
NetworkAssociateActionController._associate_host
NetworkAssociateActionController._disassociate_host_only
NetworkAssociateActionController._disassociate_project_only
NetworkController._disassociate_host_and_project
NetworkController.add
NetworkController.create
PauseServerController._pause
PauseServerController._unpause
RemoteConsolesController.get_rdp_console
RescueController._unrescue
SecurityGroupActionController._addSecurityGroup
SecurityGroupActionController._removeSecurityGroup
SecurityGroupController.create
SecurityGroupController.update
SecurityGroupDefaultRulesController.create
SecurityGroupRulesController.create
ServersController._action_confirm_resize
ServersController._action_revert_resize
ServersController._start_server
ServersController._stop_server
ShelveController._shelve
ShelveController._shelve_offload
SuspendServerController._resume
SuspendServerController._suspend
TenantNetworkController.create

Missing request query string schemas

AgentController.index
AggregateController.index
AggregateController.show
AvailabilityZoneController.detail
AvailabilityZoneController.index
BareMetalNodeController.index
BareMetalNodeController.show
CellsController.capacities
CellsController.detail
CellsController.index
CellsController.info
CellsController.show
CertificatesController.show
CloudpipeController.index
ConsoleAuthTokensController.show
ConsolesController.index
ConsolesController.show
ExtensionInfoController.index
ExtensionInfoController.show
FixedIPController.show
FlavorAccessController.index
FlavorExtraSpecsController.index
FlavorExtraSpecsController.show
FlavorsController.show
FloatingIPBulkController.index
FloatingIPBulkController.show
FloatingIPController.index
FloatingIPController.show
FloatingIPDNSDomainController.index
FloatingIPDNSEntryController.show
FloatingIPPoolsController.index
FpingController.index
FpingController.show
HostController.reboot
HostController.show
HostController.shutdown
HostController.startup
HypervisorsController.detail
HypervisorsController.index
HypervisorsController.search
HypervisorsController.servers
HypervisorsController.show
HypervisorsController.statistics
HypervisorsController.uptime
IPsController.index
IPsController.show
ImageMetadataController.index
ImageMetadataController.show
ImagesController.detail
ImagesController.index
ImagesController.show
InstanceActionsController.index
InstanceActionsController.show
InstanceUsageAuditLogController.index
InstanceUsageAuditLogController.show
InterfaceAttachmentController.index
InterfaceAttachmentController.show
NetworkController.index
NetworkController.show
QuotaClassSetsController.show
QuotaSetsController.defaults
QuotaSetsController.detail
QuotaSetsController.show
SecurityGroupController.show
SecurityGroupDefaultRulesController.index
SecurityGroupDefaultRulesController.show
ServerDiagnosticsController.index
ServerGroupController.show
ServerMetadataController.index
ServerMetadataController.show
ServerMigrationsController.index
ServerMigrationsController.show
ServerPasswordController.index
ServerSecurityGroupController.index
ServerTagsController.index
ServerTagsController.show
ServerTopologyController.index
ServerVirtualInterfaceController.index
ServersController.show
SnapshotController.show
TenantNetworkController.index
TenantNetworkController.show
VersionsController.show
VolumeAttachmentController.show
VolumeController.show

Note

We should emphasise that many - but not all - of the aforementioned APIs are either deprecated or removed. We may wish not to add schemas for these, though by doing so we will lose the ability to generate documentation or clients for these APIs from the OpenAPI spec.

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced. Missing query schema and request body schemas added.
2025.1 Epoxy	Re-proposed to finish response body schemas.
2025.2 Flamingo	Re-proposed to finish response body schemas.

Asynchronous Volume Attachments

Fri, 29 Aug 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/async-volume-attachments

Nova currently provides an attach-volume API call that blocks on multiple RPC calls to the compute service. The reason for these blocking calls is mostly historical, relating to hypervisors we used to support that involve more direct interaction with the guest and thus can predict/reserve/identify the block device name that will be used.

Problem description

In eventlet (or any greenthreading scheme), blocking requests are not as expensive because the request handler is able to service other connections while waiting. However, in an environment like WSGI where each request is handled in a real thread or process, blocking requests are much more expensive.

The attach volume API is one of those blocking APIs: it currently involves the API waiting for the reserve-block-device RPC call to complete. This round-trip to the compute service can be slow if the compute service is busy in general or if there’s a running action on the instance the volume is being attached to, as reserving the block device name takes an instance-wide lock.

This is unfortunately somewhat pointless for the two main hypervisor drivers we currently support (libvirt and vmware) as they are unable to predict or report the block device that will be used in the guest anyway. As such, we are waiting in the API, consuming a thread and connection, for information that isn’t useful anyway.

In WSGI mode (which we are trying to get users to move to in order to deprecate and remove eventlet mode), this has been reported to be quite problematic as slow (or multiple parallel) volume attach requests can consume all the available request workers, thus causing a DoS type situation. Further, a malicious user could presumably leverage this behavior to deny or degrade service intentionally.

Use Cases

As an operator, I want to be able to deploy nova-api in WSGI mode without slow volume operations causing resource consumption issues.

As an operator, I want issues with the backend storage to not cause the nova-api service to exhaust request resources.

Proposed change

This spec proposes to introduce a new microversion in which the attach-volume call will be asynchronous and return 202 instead of 200. We will delegate the current attach-volume workflow to the conductor task API and cast or call based on the microversion used. In the async case, the user can retrieve the block device name that the synchronous API call would have returned from the instance’s volume-attachments. As before, the user needs to poll for completion of the attachment by waiting for the volume’s state in Cinder to change to in-use.

The _attach_volume() method in the current compute/api.py will be moved to the conductor task API, reachable over RPC. The API will make this delegating call to conductor for the older microversion and cast for the new one, returning the appropriate content and response code in each case.
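
The call-versus-cast split can be sketched as below. Everything here is an assumption made for illustration: the microversion value, the FakeConductor class, the method name passed over "RPC", the returned device name, and the choice to return no device in the asynchronous case.

```python
# Hypothetical microversion at which attach becomes asynchronous.
ASYNC_ATTACH_VERSION = (2, 999)


class FakeConductor:
    """Stand-in for the conductor task API; records how it was invoked."""

    def __init__(self):
        self.invocations = []

    def call(self, method, **kwargs):
        # Blocking RPC: waits for the result and returns it.
        self.invocations.append(("call", method))
        return "/dev/vdb"

    def cast(self, method, **kwargs):
        # Fire-and-forget RPC: returns immediately with no result.
        self.invocations.append(("cast", method))


def attach_volume(conductor, requested_version, server_id, volume_id):
    """Return (status_code, device) for an attach request."""
    kwargs = {"server_id": server_id, "volume_id": volume_id}
    if requested_version >= ASYNC_ATTACH_VERSION:
        conductor.cast("attach_volume", **kwargs)
        return 202, None
    return 200, conductor.call("attach_volume", **kwargs)
```

In this model the API worker is released as soon as the cast is sent, which is the property the new microversion is meant to provide.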

Note

There is also an attach workflow for shelved_offloaded state which must be considered. It talks to cinder, so it may be a candidate for moving along with the main workflow, but it happens all in the API today, so it may also make sense to just leave it.

Alternatives

There was an alternative approach (see previous spec) proposed in the past which redesigns more of the attach workflow and uses traits advertised in placement to control which behavior is used. This seems overly complicated to me, while also requiring a new microversion and RPC behavior.

Data model impact

None.

REST API impact

This will introduce a new microversion, making the attach-volume API asynchronous.

Security impact

The current behavior offers somewhat of a DoS opportunity, especially when the API is running in WSGI mode. This will eliminate that possibility.

Notifications impact

None, other than the notification is currently emitted before the API call returns, and it will happen afterwards as part of this change. Since the user making the attach call is normally not a consumer of notifications, this is not likely to be noticed or cause any problems.

Other end user impact

End users who currently rely on the attach-volume API to return the expected device name in the guest will have to retrieve that information separately from the os-volume_attachments API. This might require polling or waiting for the volume-attachment to complete, because the information will not be immediately available after the attach-volume API call finishes.
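
A client-side polling loop might look like the sketch below. The helper is hypothetical and takes the status lookup as a callable, so it makes no assumption about which client library fetches the volume's status from Cinder; the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_volume_status(get_status, target="in-use",
                           timeout=60.0, interval=2.0):
    """Poll `get_status()` until it returns `target` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"volume still {status!r} after {timeout}s")
        time.sleep(interval)
```

Once the volume reports in-use, the device name can be read from the server's volume attachments.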

Performance Impact

This work is being done to address a performance impact of exhausting resources when in WSGI mode. This work will address that, but also generally improve performance as async operations require fewer resources for the duration.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

One benefit of this approach is that the compute service and RPC API need not change. Thus, the conductor being upgraded alongside the API which uses the new task API (already required in lockstep) means that older computes will not perceive any change if the new API is used before the upgrade is complete.

Implementation

Assignee(s)

Primary assignee:: jkulik
Other contributors:: danms

Feature Liaison

This work is related to the eventlet deprecation effort, and thus should be considered a parallel effort to address issues that are being created by changing the only available deployment model we allow.

Work Items

Move the _attach_volume() method to the task API where it can be called
Add a new cast/call RPC interface for the conductor task API to perform the attach workflow
Add a new microversion to the API which controls whether the attach workflow is asynchronous or not
Add tempest coverage for the new microversion

Dependencies

No direct dependencies for this work, although it may have some impact or relation to the eventlet deprecation effort.

Testing

Typical functional and unit tests should be sufficient for this work. Existing tempest tests for volume attachment should be trivially updatable to call the new microversion, validate the return code, and poll for completion.

Documentation Impact

The typical api-ref documentation should be sufficient for this work, as well as a release note as this is likely of interest to operators currently suffering from resource exhaustion.

References

History

Revisions
Release Name	Description
2026.1 G	Introduced

Search flavors by name

Fri, 29 Aug 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/flavor-search-by-name

Allow users to search for flavors by name server-side.

Problem description

Currently, there is no mechanism to filter flavors by flavor name using the API. Instead, you must retrieve all flavors and filter manually. This can be expensive, particularly when “flavor explosion” is taken into account. We would like to resolve this by adding support for a name filter.

Use Cases

As a developer of client tooling, I would like to do as much filtering server-side as possible, in order to improve performance and reduce unnecessary network traffic.

Proposed change

Modify the GET /flavors API to add support for a new name query string filter parameter. This will support regex-style syntax, similar to many other existing APIs such as GET /servers. As with those APIs, this will default to partial matches and a regular expression must be used to get exact matches. For example:

>>> import openstack
>>> conn = openstack.connect('devstack')
>>> conn.compute.get('/flavors')
>>>
>>> [f['name'] for f in conn.compute.get(r'/flavors').json()['flavors']]
['m1.small', 'ci.m1.small', 'm1.medium', 'ci.m1.medium', 'm2.small', 'ds512M', 'ds1G']
>>>
>>> [f['name'] for f in conn.compute.get(r'/flavors?name=m1').json()['flavors']]
['m1.small', 'ci.m1.small', 'm1.medium', 'ci.m1.medium']
>>>
>>> [f['name'] for f in conn.compute.get(r'/flavors?name=^m1').json()['flavors']]
['m1.small', 'm1.medium']

This will be implemented by reusing the logic currently used for instances in the _regex_instance_filter function.
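
The matching semantics shown above can be reproduced client-side with a minimal sketch. It uses Python's re module purely for illustration; the real filter is evaluated by the database backend, whose regex dialect may differ, and the helper name is hypothetical.

```python
import re

FLAVOR_NAMES = ["m1.small", "ci.m1.small", "m1.medium", "ci.m1.medium",
                "m2.small", "ds512M", "ds1G"]


def filter_by_name(names, pattern):
    """Keep names where the pattern matches anywhere (partial match)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]
```

An unanchored pattern such as m1 matches anywhere in the name, ^m1 restricts matches to names that start with m1, and an exact match needs both anchors, for example ^m1\.small$.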

While we are introducing a new microversion, we will also take the opportunity to address some other tech debt with the schema:

We will set additionalProperties to False for the flavor show (GET /flavors/{flavor_id}) API
We will remove the rxtx_factor field from the flavor create (POST /flavors), flavor list with details (GET /flavors/detail) and flavor show (GET /flavors/{flavor_id}) APIs. We will also remove rxtx_factor from the list of valid sort keys for the flavor list (GET /flavors) and flavor list with details (GET /flavors/detail) APIs. This field was only supported by the long since removed XenAPI driver and is a no-op in modern Nova.
We will remove the OS-FLV-DISABLED:disabled field from the flavor list with details (GET /flavors/detail) and flavor show (GET /flavors/{flavor_id}) APIs. There has never been a way to set this field, making it a no-op.

Finally, we will build on one of the above items and address some tech debt with other schemas:

We will set additionalProperties to False for all query string schemas.
We will restrict all action bodies to null values except those where a value is actually expected.
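
As a rough illustration of the two schema changes above, the following sketch shows what a strict query string schema and a null-only action body schema might look like. The schema contents and both helper functions are hypothetical; the helpers implement only the two checks discussed here and are not a full JSON Schema validator.

```python
# Hypothetical, simplified schemas written in JSON Schema style.
flavor_list_query = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string'},
        'limit': {'type': 'string'},
        'marker': {'type': 'string'},
    },
    # Unknown query string parameters are rejected rather than ignored.
    'additionalProperties': False,
}

pause_action_body = {
    'type': 'object',
    # The action takes no arguments, so only null is accepted as its value.
    'properties': {'pause': {'type': 'null'}},
    'required': ['pause'],
    'additionalProperties': False,
}


def unknown_keys(schema, data):
    """Return keys in data that a strict schema does not declare."""
    if schema.get('additionalProperties', True):
        return []
    return sorted(set(data) - set(schema['properties']))


def null_only_violations(schema, body):
    """Return keys declared as type 'null' whose value is not None."""
    return sorted(
        key for key, spec in schema['properties'].items()
        if spec.get('type') == 'null'
        and key in body
        and body[key] is not None
    )
```

With these definitions, a request such as GET /flavors?name=m1&bogus=1 would be flagged for the bogus parameter, and an action body of {"pause": {}} would be flagged because the value is not null.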

Alternatives

We currently have to filter flavors client-side, which is less performant. We could continue to do so.

Rather than supporting a regex syntax, we could opt for a simple partial match filter, implemented using the SQL LIKE operator. This is currently used for the hypervisor_hostname_pattern filter of the GET /os-hypervisors API (ultimately by the compute_node_search_by_hypervisor DB API). This would be slightly more performant, but it would be less expressive and would result in a potentially surprising difference in behavior compared to most other APIs.
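
To make the trade-off concrete, here is a small sketch of a LIKE-based partial match. It uses an in-memory SQLite database purely for illustration (Nova's supported backends are MySQL/MariaDB and PostgreSQL), and the table layout and search_like helper are assumptions, not Nova's actual DB API.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE flavors (id INTEGER PRIMARY KEY, name TEXT UNIQUE)'
)
names = ['m1.small', 'ci.m1.small', 'm1.medium', 'ci.m1.medium',
         'm2.small', 'ds512M', 'ds1G']
conn.executemany(
    'INSERT INTO flavors (name) VALUES (?)', [(n,) for n in names]
)


def search_like(conn, fragment):
    """Return flavor names containing fragment, in insertion order."""
    # Wrapping the fragment in % wildcards gives a substring match.
    rows = conn.execute(
        'SELECT name FROM flavors WHERE name LIKE ? ORDER BY id',
        (f'%{fragment}%',),
    )
    return [row[0] for row in rows]


substring_matches = search_like(conn, 'm1')
# LIKE has no notion of regex anchors: the caret is matched literally.
anchored_matches = search_like(conn, '^m1')
```

The substring search returns the same four flavors as the regex example, but the anchored pattern returns nothing, which is the loss of expressiveness noted above.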

Regex support varies between our officially supported database backends, MySQL/MariaDB and PostgreSQL, resulting in potential API behavioral differences across deployments. We could investigate a subset of regex support that is common across these backends and opt to support only this subset of patterns. However, this is likely to be an involved, potentially complicated task that would yield minimal benefit, given the long-standing bias towards MySQL in production deployments and absence of perceived issues with other APIs that already suffer from this issue. Deferring to the backend’s regex support is “good enough”.

Data model impact

None. The name field of the Flavors model already has a unique constraint and is therefore indexed. In addition, we do not plan to remove the rxtx_factor field from the Flavor o.v.o. We may wish to remove the field from the Flavors model but that should likely be done in a future release.

REST API impact

The GET /flavors API will be modified to add support for a new name query string filter parameter in requests.
The POST /flavors API will be modified to remove support for the rxtx_factor parameter in requests.
All flavors APIs will be modified to remove the rxtx_factor and OS-FLV-DISABLED:disabled fields from responses.
All APIs that currently accept an unrestricted set of query string parameters will be modified to restrict these.
All action APIs that currently accept an unrestricted value in request bodies will be modified to only accept null.

Security impact

None.

Notifications impact

None.

Other end user impact

openstackclient and third-party clients can take advantage of this when filtering flavors.

Performance Impact

None. Clients will be faster since they can take advantage of server-side filtering, but there should be no impact on the server itself since the field is indexed.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: stephen.finucane
Other contributors:: None

Feature Liaison

Feature liaison:: stephen.finucane

Work Items

Extend API and rework schemas as described above

Dependencies

None.

Testing

We will provide new unit and functional tests, including API sample tests.

We will extend the Compute API schemas used in Tempest to reflect these changes.

Documentation Impact

Update API ref.

References

None.

OpenAPI Schemas

Fri, 29 Aug 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/openapi-4

Note

This is a continuation of a spec that was previously approved in Dalmatian (2024.2), Epoxy (2025.1) and Flamingo (2025.2). We merged all of the groundwork for this in Dalmatian and worked on the response body schemas in Epoxy and Flamingo but did not complete them.

Problem description

Use Cases

As an end user, I would like to have access to machine-readable, fully validated documentation for the APIs I will be interacting with.

As an end user, I want statically viewable documentation hosted as part of the existing docs site without requiring a running instance of Nova.

As an SDK/client developer, I would like to be able to auto-generate bindings and clients, promoting consistency and minimising the amount of manual work needed to develop and maintain these.

As a Nova developer, I would like to have a verified API specification that I can use should I need to replace the web framework/libraries we use in the event they are no longer maintained.

Proposed change

This effort can be broken into a number of distinct steps:

Add a new decorator for removed APIs and actions

We have a number of APIs and actions that no longer have backing code and return HTTP 410 (Gone) or HTTP 400 (Bad Request), respectively. We will not add schemas for these in the initial attempt at this so we need some mechanism to indicate this. We will add a new removed decorator that will highlight these removed APIs and indicate the version they were removed in and the reason for their removal. We can later use this as a heuristic in our tests to skip schema checks for these methods.
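
A decorator along these lines could look like the following minimal sketch. Everything here is an assumption made for illustration: the HTTPGone class stands in for a web framework's HTTP 410 exception, the attribute names are invented, the version string is a placeholder, and only the removed-API case (HTTP 410) is modelled, not the removed-action case (HTTP 400).

```python
import functools


class HTTPGone(Exception):
    """Hypothetical stand-in for a web framework's HTTP 410 exception."""
    status_code = 410


def removed(version, reason):
    """Mark an API method as removed in ``version`` for ``reason``."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The backing code is gone, so every call fails the same way.
            raise HTTPGone(reason)
        # Metadata that tests can inspect to skip schema checks.
        wrapper.removed = True
        wrapper.removed_version = version
        wrapper.removed_reason = reason
        return wrapper
    return decorator


class ExampleController:
    @removed('18.0.0', 'This API was removed along with its backend.')
    def index(self, req):
        return {}

    def show(self, req, id):
        return {'id': id}


def needs_schema(method):
    """Heuristic for tests: removed methods do not need schemas."""
    return not getattr(method, 'removed', False)
```

Attaching the metadata to the wrapper keeps the heuristic cheap: a test only has to look for the removed attribute rather than call the method.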

Note

This was completed in Dalmatian (2024.2).

Add missing request body and query string schemas

There is already good coverage of both request bodies and query string parameters but it is not complete. A list of incomplete schemas is given at the end of this section. The additional schemas will merely validate what is already allowed, which will mean extensive use of "additionalProperties": true or empty schemas. Put another way, an API that currently ignores unexpected request body fields or query string parameters will continue to ignore them. We may wish to make these stricter, as we did for most APIs in microversion 2.75, but that is a separate issue that should be addressed separately.

Once these specs are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate schemas.
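
The kind of coverage test described here might be sketched as follows. The query_schema decorator, the query_schema and removed attributes, and the example controller are all hypothetical names chosen for this illustration; they are not Nova's actual decorators.

```python
import inspect

# A permissive schema: it validates what is already allowed and nothing more.
PERMISSIVE_QUERY = {
    'type': 'object',
    'properties': {},
    'additionalProperties': True,
}


def query_schema(schema):
    """Hypothetical decorator that attaches a query string schema."""
    def decorator(func):
        func.query_schema = schema
        return func
    return decorator


class ExampleController:
    @query_schema(PERMISSIVE_QUERY)
    def index(self, req):
        return {'items': []}

    def show(self, req, id):
        return {'id': id}


def methods_missing_query_schema(controller_cls):
    """Return names of non-removed methods that lack a query schema."""
    missing = []
    for name, func in inspect.getmembers(controller_cls, inspect.isfunction):
        if getattr(func, 'removed', False):
            continue  # removed APIs are exempt from schema checks
        if not hasattr(func, 'query_schema'):
            missing.append(name)
    return missing


missing = methods_missing_query_schema(ExampleController)
```

A unit test could then assert that the returned list is empty for every controller.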

Note

This was completed in Dalmatian (2024.2).

Add response body schemas

These will be sourced from existing OpenAPI schemas, currently published at github.com/gtema/openstack-openapi, from Tempest’s API schemas, and where necessary from new schemas auto-generated from JSON response bodies generated in tests and manually modified to handle things like enum values.

Once these are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate response body schemas. In addition, we will add a new configuration option that will control how we do verification at the API layer, [api] response_validation. This will be an enum value with three options:

error
Raise an HTTP 500 (Server Error) in the event that an API returns an “invalid” response.

This will be the default in CI, i.e. for our unit, functional and integration tests. This should not be used in production. The help text of the option will indicate this and we will mark the option as advanced.

warn
Log a warning about an “invalid” response, prompting operators to file a bug report against Nova.

This will be the initial (and likely permanent) default in production.

ignore
Disable API response body validation entirely. This is an escape hatch in case we mess up.
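
The behaviour of the three modes can be sketched roughly as below. The function names, the ResponseValidationError class, and the toy validator are assumptions made for illustration; the real option would be read from configuration and the check would be driven by the response body schemas.

```python
import logging

LOG = logging.getLogger(__name__)
MODES = ('error', 'warn', 'ignore')


class ResponseValidationError(Exception):
    """Hypothetical error that the API layer would map to HTTP 500."""


def check_response(mode, body, find_problems):
    """Validate body according to mode and return it if it may be sent."""
    if mode not in MODES:
        raise ValueError(f'unknown response_validation mode: {mode}')
    if mode == 'ignore':
        return body  # validation is skipped entirely
    problems = find_problems(body)
    if not problems:
        return body
    if mode == 'warn':
        LOG.warning('Invalid API response, please file a bug: %s', problems)
        return body
    raise ResponseValidationError('; '.join(problems))


def missing_flavors_key(body):
    """Toy validator: report a problem if 'flavors' is absent."""
    return [] if 'flavors' in body else ["missing key 'flavors'"]
```

Only the error mode changes what the client sees; warn and ignore both return the response unchanged.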

Alternatives

Use a different tool

We could use a different tool than OpenAPI to publish our specs. In a manner of speaking we already do this - albeit not in a machine-readable manner - through our use of os-api-ref.

This idea has been rejected because OpenAPI is clearly the best tool for the job. It is the most widely used API description language available today and aligns well with our existing use of JSON Schema for API validation. While it does not support OpenStack’s microversion API design pattern out-of-the-box, previous experiments have demonstrated that it is extensible enough to add this.

Maintain these specs out-of-tree

We could use a separate repo to store and maintain specs for Nova and the other OpenStack services.

This idea has been rejected because it prevents us testing the specs on each commit to Nova and means work that could be spread across multiple teams is instead focused on one small team. It will result in more bugs and a lag between changes to the Nova API and changes to the out-of-tree specs. It will result in duplication of effort across Nova, Tempest, and the specs projects.

Publish the spec via an API resource rather than in our docs

We could publish the spec via a new, unversioned API endpoint such as /spec. A GET request to this would return the full spec, either statically generated at deployment time or dynamically generated (and then cached) at runtime.

This is rejected because it brings limited advantages and multiple disadvantages. Nova’s API is designed to be backwards-compatible and non-extensible. As such, a user with the latest version of the spec should be able to use it to communicate with any OpenStack deployment running a version of Nova that supports microversions. It is also expected that the “master” version of the spec will continuously improve as things are tightened up, documentation is improved, and bugs or mistakes are corrected. We want consumers of the spec to see these changes immediately rather than wait for their deployment to be updated. Finally, OpenStack’s previous forays into discoverable APIs, such as Keystone’s use of JSONHome or Glance’s attempts to publish resource schemas, have seen limited take-up outside of the projects themselves. Taken together, this all suggests there is no reason or advantage to publishing deployment-specific specs and users would be better served by fetching the latest version of the spec from the api-ref documentation published on docs.openstack.org (which, one should note, is itself intentionally unversioned).

Data model impact

None.

REST API impact

We may wish to address issues that are uncovered as we add schemas, but this work is considered secondary to this effort and can be tackled separately.

Security impact

None.

Notifications impact

None.

Other end user impact

Performance Impact

Other deployer impact

Developer impact

Developers working on the API microversions will now be encouraged to provide JSON Schema schemas for both requests and responses.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: stephenfinucane
Other contributors:: gtema

Feature Liaison

None.

Work Items

Add missing request body schemas
Add tests to validate existence of request body schemas
Add missing query string schemas
Add tests to validate existence of query string schemas
Add response body schemas
Add decorator to validate response body schemas against response
Add tests to validate existence of response body schemas

Dependencies

Testing

Unit tests will ensure that schemas eventually exist for request bodies, query strings, and response bodies.

Unit, functional and integration tests will all work together to ensure that response body schemas match real responses by setting [api] response_validation to error.

Documentation Impact

References

APIs missing schemas

These are the APIs that are currently (as of 2024-04-11, commit 1bca24aeb) missing API request body schemas and query string schemas.

Missing request body schemas

AdminActionsController._inject_network_info
AdminActionsController._reset_network
AgentController.create
AgentController.update
BareMetalNodeController._add_interface
BareMetalNodeController._remove_interface
BareMetalNodeController.create
CellsController.create
CellsController.sync_instances
CellsController.update
CertificatesController.create
CloudpipeController.create
CloudpipeController.update
ConsolesController.create
DeferredDeleteController._force_delete
DeferredDeleteController._restore
FixedIPController.reserve
FixedIPController.unreserve
FloatingIPBulkController.create
FloatingIPBulkController.update
FloatingIPController.create
FloatingIPDNSDomainController.update
FloatingIPDNSEntryController.update
LockServerController._unlock
NetworkAssociateActionController._associate_host
NetworkAssociateActionController._disassociate_host_only
NetworkAssociateActionController._disassociate_project_only
NetworkController._disassociate_host_and_project
NetworkController.add
NetworkController.create
PauseServerController._pause
PauseServerController._unpause
RemoteConsolesController.get_rdp_console
RescueController._unrescue
SecurityGroupActionController._addSecurityGroup
SecurityGroupActionController._removeSecurityGroup
SecurityGroupController.create
SecurityGroupController.update
SecurityGroupDefaultRulesController.create
SecurityGroupRulesController.create
ServersController._action_confirm_resize
ServersController._action_revert_resize
ServersController._start_server
ServersController._stop_server
ShelveController._shelve
ShelveController._shelve_offload
SuspendServerController._resume
SuspendServerController._suspend
TenantNetworkController.create

Missing request query string schemas

AgentController.index
AggregateController.index
AggregateController.show
AvailabilityZoneController.detail
AvailabilityZoneController.index
BareMetalNodeController.index
BareMetalNodeController.show
CellsController.capacities
CellsController.detail
CellsController.index
CellsController.info
CellsController.show
CertificatesController.show
CloudpipeController.index
ConsoleAuthTokensController.show
ConsolesController.index
ConsolesController.show
ExtensionInfoController.index
ExtensionInfoController.show
FixedIPController.show
FlavorAccessController.index
FlavorExtraSpecsController.index
FlavorExtraSpecsController.show
FlavorsController.show
FloatingIPBulkController.index
FloatingIPBulkController.show
FloatingIPController.index
FloatingIPController.show
FloatingIPDNSDomainController.index
FloatingIPDNSEntryController.show
FloatingIPPoolsController.index
FpingController.index
FpingController.show
HostController.reboot
HostController.show
HostController.shutdown
HostController.startup
HypervisorsController.detail
HypervisorsController.index
HypervisorsController.search
HypervisorsController.servers
HypervisorsController.show
HypervisorsController.statistics
HypervisorsController.uptime
IPsController.index
IPsController.show
ImageMetadataController.index
ImageMetadataController.show
ImagesController.detail
ImagesController.index
ImagesController.show
InstanceActionsController.index
InstanceActionsController.show
InstanceUsageAuditLogController.index
InstanceUsageAuditLogController.show
InterfaceAttachmentController.index
InterfaceAttachmentController.show
NetworkController.index
NetworkController.show
QuotaClassSetsController.show
QuotaSetsController.defaults
QuotaSetsController.detail
QuotaSetsController.show
SecurityGroupController.show
SecurityGroupDefaultRulesController.index
SecurityGroupDefaultRulesController.show
ServerDiagnosticsController.index
ServerGroupController.show
ServerMetadataController.index
ServerMetadataController.show
ServerMigrationsController.index
ServerMigrationsController.show
ServerPasswordController.index
ServerSecurityGroupController.index
ServerTagsController.index
ServerTagsController.show
ServerTopologyController.index
ServerVirtualInterfaceController.index
ServersController.show
SnapshotController.show
TenantNetworkController.index
TenantNetworkController.show
VersionsController.show
VolumeAttachmentController.show
VolumeController.show

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced. Missing query schema and request body schemas added.
2025.1 Epoxy	Re-proposed to finish response body schemas.
2025.2 Flamingo	Re-proposed to finish response body schemas.
2026.1 Gazpacho	Re-proposed to finish response body schemas.

Remove `/os-volumes_boot` API

Fri, 29 Aug 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/remove-os-volumes-boot-api

Remove the undocumented, unused /os-volumes_boot API.

Problem description

Use Cases

As a developer of client tooling, I do not wish to have to either support or special-case ignore an API that is not documented and duplicates existing APIs.

Proposed change

Alternatives

Data model impact

None.

REST API impact

The /os-volumes_boot API and all child APIs will return HTTP 410 (Gone) starting in the new API microversion.

Security impact

None.

Notifications impact

None.

Other end user impact

None. None of openstackclient, openstacksdk, python-novaclient, or Gophercloud currently support or use this API.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: stephen.finucane
Other contributors:: None

Feature Liaison

Feature liaison:: stephen.finucane

Work Items

Remove the API

Dependencies

None.

Testing

None.

Documentation Impact

We need a release note. The API is not currently documented in the api-ref so no changes will be needed there.

References

None.

Example Spec - The title of your blueprint

Wed, 30 Jul 2025 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it's possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in reStructuredText, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ASCII diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ASCII diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which cannot be viewed in Gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated? For example, if you need to take into account existing instances, or modify other existing data, describe how that will work.

REST API impact

Each API method which is either added or changed should have the following:

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal http response code(s)
- Expected error http response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - URL should not include underscores, and use hyphens instead.
- Parameters which can be passed via the url
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to login?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds or removes fields to the API responses. For example when the feature adds a new field to the GET /servers API response consider adding similar information to the payload of the instance action notifications
When the feature adds a new action to the existing API entities. For example adding a new action to the server might mean you want to emit a corresponding new instance action notification
When the feature adds a new resource (noun) to the REST API consider adding new notifications about the creation and deletion of such resource

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it's merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others, for example extending the size of an attached volume. Since Nova does not yet have any type of sophisticated capabilities API that lets a user know what actions can be performed on a given instance, consider adding a new policy rule so that operators who cannot support a virt-specific feature can disable it in their cloud. The user is then at least presented with an understandable 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However, we suggest finding a liaison for your feature, as it will help get your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of a library?

Testing

Is this untestable in the gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc.)?

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section, intended to be updated each time the spec changes, describing any new design, API, or database schema updates. It helps the reader understand what has changed over time.

Revisions
Release Name	Description
2026.1 Gazpacho	Introduced

Example Spec - The title of your blueprint

Wed, 30 Jul 2025 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it's possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in reStructuredText, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ASCII diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ASCII diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which cannot be viewed in Gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z
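
The filename rule in the notes above can be checked mechanically. The following is a minimal sketch, assuming the blueprint name is simply the last path segment of the Launchpad URL. The helper name spec_filename is hypothetical and is not part of any nova-specs tooling.

```python
from urllib.parse import urlparse


def spec_filename(blueprint_url: str) -> str:
    """Return the expected .rst filename for a Launchpad blueprint URL."""
    path = urlparse(blueprint_url).path
    # Assumption: the blueprint name is the last path segment ("+spec/<name>").
    name = path.rstrip("/").rsplit("/", 1)[-1]
    if not name:
        raise ValueError(f"no blueprint name in URL: {blueprint_url!r}")
    return f"{name}.rst"
```

For the example URL given in the notes, https://blueprints.launchpad.net/nova/+spec/awesome-thing, this yields awesome-thing.rst.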

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated? For example, if you need to take into account existing instances, or modify other existing data, describe how that will work.

REST API impact

Each API method which is either added or changed should have the following:

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal HTTP response code(s)
- Expected error HTTP response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - The URL should not include underscores; use hyphens instead.
- Parameters which can be passed via the URL
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.
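
As an illustration of the level of detail expected, here is a minimal sketch of a request body schema together with checks for the naming rules above. The "attach_tag" action, its fields, and both helper functions are hypothetical and exist only for this example. The schema is written as a plain Python dict in JSON Schema form.

```python
import re

# Hypothetical request-body schema for an illustrative "attach_tag" action.
attach_tag_request = {
    "type": "object",
    "properties": {
        "attach_tag": {
            "type": "object",
            "properties": {
                "tag_name": {"type": "string", "minLength": 1, "maxLength": 60},
                "apply_now": {"type": "boolean"},
            },
            "required": ["tag_name"],
            "additionalProperties": False,
        },
    },
    "required": ["attach_tag"],
    "additionalProperties": False,
}

# Lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_snake_case_fields(schema: dict, prefix: str = "") -> list[str]:
    """Recursively collect property names that are not snake_case."""
    bad = []
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if not SNAKE_CASE.match(name):
            bad.append(path)
        bad.extend(non_snake_case_fields(sub, path))
    return bad


def url_uses_hyphens(url_path: str) -> bool:
    """Return True if the resource URL contains no underscores."""
    return "_" not in url_path
```

A spec does not need to ship such checks. The point is that the schema in the spec should be concrete enough that reviewers can verify field names, required keys, and value constraints at a glance.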

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to log in?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds fields to, or removes fields from, the API responses. For example, when the feature adds a new field to the GET /servers API response, consider adding similar information to the payload of the instance action notifications.
When the feature adds a new action to the existing API entities. For example, adding a new action to the server might mean you want to emit a corresponding new instance action notification.
When the feature adds a new resource (noun) to the REST API, consider adding new notifications about the creation and deletion of such resources.

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?
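
The scaling arguments above can be made concrete with back-of-the-envelope arithmetic. The following sketch assumes a simple multiplicative cost model. The function names and the numbers used below are illustrative, not measurements.

```python
def scheduler_filter_overhead(hosts: int, instances: int, latency_s: float) -> float:
    """Total added time if one filter adds latency_s per host per instance."""
    return hosts * instances * latency_s


def periodic_task_calls_per_hour(nodes: int, interval_s: float) -> float:
    """Remote calls per hour when every node runs the task every interval_s seconds."""
    return nodes * (3600 / interval_s)
```

For example, a filter that adds 2 ms per host adds roughly 100 seconds of total scheduler time when 50 instances are created across 1000 hosts, and a periodic task that calls conductor once a minute on 500 nodes generates 30000 calls per hour.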

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it's merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?
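
One possible answer to the directory question above is a special case in the code that falls back to the old location. The following is a minimal sketch of that option only, assuming hypothetical directory names ("instances" and "instances-v2") and a hypothetical helper resolve_instance_dir. It is not how Nova lays out instance directories.

```python
from pathlib import Path


def resolve_instance_dir(base: Path, instance_uuid: str,
                         new_name: str = "instances-v2",
                         old_name: str = "instances") -> Path:
    """Prefer the new layout, but fall back to a directory created before the change."""
    new_path = base / new_name / instance_uuid
    old_path = base / old_name / instance_uuid
    # Use the old path only when it exists and the new one does not.
    if new_path.exists() or not old_path.exists():
        return new_path
    return old_path
```

Whichever option is chosen, the spec should state it explicitly, since a fallback like this has to be kept until the old layout can no longer exist in any supported deployment.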

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others. For example, extending the size of an attached volume. Since Nova does not yet have any type of sophisticated capabilities API that lets a user know what actions can be performed on a given instance, consider adding a new policy rule so that operators who cannot support a virt-specific feature can at least disable it in their cloud. The user then sees the restriction in an understandable way, as a 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.
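
The minimum compute service version check described above can be sketched as follows. This is an illustration only: the version number, the exception, and both functions are hypothetical, and a real implementation would look up the service versions recorded for the deployment rather than take a list.

```python
# Hypothetical service version at which computes gained the new feature.
MIN_COMPUTE_VERSION_FOR_FEATURE = 62


class FeatureNotSupported(Exception):
    """Raised when some compute service is too old for the new feature."""


def minimum_service_version(compute_versions: list[int]) -> int:
    """Return the oldest service version among the given computes."""
    if not compute_versions:
        raise ValueError("no compute services reported")
    return min(compute_versions)


def ensure_feature_supported(compute_versions: list[int]) -> None:
    """Refuse to use the feature until every compute is new enough."""
    oldest = minimum_service_version(compute_versions)
    if oldest < MIN_COMPUTE_VERSION_FOR_FEATURE:
        raise FeatureNotSupported(
            f"oldest compute is at version {oldest}, "
            f"need {MIN_COMPUTE_VERSION_FOR_FEATURE}"
        )
```

Gating on the oldest version, rather than on the version of the host handling a given request, is what keeps the feature off while a rolling upgrade is still in progress.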

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However, we suggest finding a liaison for your feature, as it will help get your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of a library?

Testing

Is this untestable in the gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc.)?

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section, intended to be used each time the spec is updated, to describe any new design, API, or database schema changes. Useful for letting the reader understand what has happened over time.

Revisions
Release Name	Description
2026.1 Gazpacho	Introduced

Example Spec - The title of your blueprint

Wed, 07 May 2025 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it's possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in reStructuredText, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ASCII diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ASCII diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which cannot be viewed in Gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated? For example, if you need to take into account existing instances, or modify other existing data, describe how that will work.

REST API impact

Each API method which is either added or changed should have the following:

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal HTTP response code(s)
- Expected error HTTP response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - The URL should not include underscores; use hyphens instead.
- Parameters which can be passed via the URL
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to log in?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds fields to, or removes fields from, the API responses. For example, when the feature adds a new field to the GET /servers API response, consider adding similar information to the payload of the instance action notifications.
When the feature adds a new action to the existing API entities. For example, adding a new action to the server might mean you want to emit a corresponding new instance action notification.
When the feature adds a new resource (noun) to the REST API, consider adding new notifications about the creation and deletion of such resources.

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it is merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others, for example extending the size of an attached volume. Nova does not yet have any type of sophisticated capabilities API that lets a user know what actions can be performed on a given instance. Consider adding a new policy rule so that operators who cannot support a virt-specific feature can at least disable it in their cloud; the user is then told in an understandable way, by getting a 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However, we suggest finding a liaison for your feature, as it will help get your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of a library?

Testing

Is this untestable in gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc.)?

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section intended to be used each time the spec is updated, to describe any new design, API, or database schema change. It helps the reader understand how the spec has evolved over time.

Revisions
Release Name	Description
2025.2 Flamingo	Introduced

vTPM live migration

Wed, 16 Apr 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/vtpm-live-migration

Libvirt support for vTPM live migration now exists (more details in Problem description), but Nova changes are necessary before being able to remove the API block. This spec describes those changes.

Problem description

vTPM state storage

vTPM state storage is not the same as instance state storage. The latter can be configured to be shared, for example on NFS. The former is always non-shared. Libvirt can be told where to store the vTPM state via the source XML element, which Nova does not support. Nova deployments use the Libvirt default vTPM state path. On both Ubuntu and Red Hat operating systems, this path is /var/lib/libvirt/swtpm/<instance UUID>. This path is distinct from the instance state path and can be expected to never be on shared storage.

Thus, this spec requires vTPM state storage to be not shared, and declares live migration with shared vTPM state storage to be untested. This will be documented.

Libvirt support

Therefore, this spec requires Libvirt 7.1.0.

Secret management

Compute host reboot

For the exact same reasons (lack of Barbican secret access and inability to read the Libvirt secret back from Libvirt), Nova cannot start vTPM instances back up after a compute host reboot.

Use Cases

As a cloud operator, I want to be able to live migrate instances with vTPM devices, in particular Windows instances.

As a cloud operator, I want vTPM instances on a compute host to start back up again after a host reboot.

Proposed change

Three possible security levels are proposed. They are presented in the table below.

`tpm_secret_security` values
Value	Mechanism	Security implications	Instance mobility
`user`	Only the instance owner has access to the Barbican secret. This is existing behavior.	This is the most secure option, as even the Nova service user and root on the compute host cannot read the secret.	The instance is immovable and cannot be restarted by Nova in the event of a compute host crash or reboot.
`host`	The Libvirt secret is persistent and retrievable.	This is “medium” security. API-level admins and the Nova service user do not have access to the secret, but it can be accessed by users with sufficient privileges on the compute host.	The instance can be live migrated because Nova can read the secret back from Libvirt on the source host and send it to the destination over RPC. Security over the wire is left as the operator’s responsibility, but TLS or similar is assumed. The instance can also be restarted by Nova in the event of a compute host crash or reboot for the exact same reason.
`deployment`	The Nova service user owns the Barbican secret.	This is the least secure but most flexible option.	The instance can be live migrated because Nova can download the secret from Barbican and define it in Libvirt on the destination host. The instance can also be restarted by Nova in the event of a compute host crash or reboot for the exact same reason.

Users are able to choose what level they require on their instance by setting the new hw_tpm_secret_security image property. If this property is not set, a default can be obtained from the new hw:tpm_secret_security flavor extra spec. For operators that do not want to deal with flavor explosion as a consequence of this new extra spec, a new host configuration option is added as a fallback: [libvirt]default_tpm_secret_security, with a default value of user (which is existing behavior). An instance with no image property or flavor extra spec will have its host’s tpm_secret_security policy persisted in its system_metadata upon booting on that host.
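The lookup order described above can be sketched as follows. This is a minimal illustration rather than Nova's implementation: the function name resolve_tpm_secret_security and the plain-dict inputs are assumptions made for the example; only the image property, extra spec, option, and level names come from this spec.

```python
# Hypothetical sketch of the precedence described above; not Nova code.
VALID_LEVELS = ("user", "host", "deployment")


def resolve_tpm_secret_security(image_props, flavor_extra_specs,
                                host_default="user"):
    """Return the vTPM secret security level for a new instance.

    Precedence: image property, then flavor extra spec, then the
    host-level [libvirt]default_tpm_secret_security fallback.
    """
    level = image_props.get("hw_tpm_secret_security")
    if level is None:
        level = flavor_extra_specs.get("hw:tpm_secret_security")
    if level is None:
        level = host_default
    if level not in VALID_LEVELS:
        raise ValueError("unknown tpm_secret_security level: %r" % (level,))
    return level
```

With both an image property and a flavor extra spec set, the image property wins; with neither set, the host default applies.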

Operators are able to specify what level they support by using the new [libvirt]supported_tpm_secret_security config option. This is a per compute host list option that can take the value of one or more of the security levels from the previous table. Its default value is all three levels. These values are exposed as driver capability traits. The hw_tpm_secret_security image property and flavor extra spec are translated to required traits to match the driver capabilities.

The behavior of an instance during live migration is defined by its persisted hw_tpm_secret_security (either explicitly set by the user, or added by default by Nova from the host’s config option). Instances with user cannot be live migrated. For instances with host, the source compute host reads the secret from Libvirt and sends it over RPC to the destination. For instances with deployment, the destination host downloads the secret from Barbican and defines it in Libvirt. Because the instance’s hw_tpm_secret_security value translates to a required trait, it’s guaranteed that the destination host chosen for live migration supports whatever behavior the instance requires.
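The per-level behavior can be summarized in a small dispatch function. This is only a sketch: plan_secret_transfer is a hypothetical name, and the returned strings stand in for the Libvirt, RPC, and Barbican interactions, which this spec does not define in code. That the destination defines the secret it receives over RPC is an assumption, inferred from the pre live migration work item.

```python
# Hypothetical sketch; the action strings are placeholders, not Nova APIs.
def plan_secret_transfer(level):
    """Return (migratable, source_action, destination_action) for a level."""
    if level == "user":
        # Only the instance owner can read the secret: no live migration.
        return (False, None, None)
    if level == "host":
        return (True,
                "read secret from Libvirt and send it over RPC",
                "define the received secret in Libvirt")
    if level == "deployment":
        # The source host has nothing to do; the destination fetches it.
        return (True,
                None,
                "download secret from Barbican and define it in Libvirt")
    raise ValueError("unknown tpm_secret_security level: %r" % (level,))
```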

Alternatives

This is the only version of this spec that covers the essentials: users with existing instances are informed of the vTPM secret security level set on their instances by the operator, users of new instances can choose the security level that they require, and operators can choose which security levels they are willing to support given the limitations imposed by higher security levels.

Data model impact

The ImageMetaProps Nova object is updated to support the new hw_tpm_secret_security image property. The database schema is unaffected.

REST API impact

No new microversion. The flavor extra spec validation code is updated to allow hw:tpm_secret_security.

Security impact

The main security consequences of this spec are the implications of the host and deployment values of tpm_secret_security.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

A compute service version bump is necessary. When nova-compute starts up with the new service version, it checks all instances currently on the host. Any instances created after the service version bump have a value for hw_tpm_secret_security set in their system_metadata, either explicitly by the user or implicitly by Nova as a fallback default, as described in the Proposed change section. Any instances without this set are pre-existing instances, and need to be upgraded. They are upgraded to the value of the [libvirt]default_tpm_secret_security config option.

Just persisting this in their system_metadata is not enough. Their owner also needs to perform an operation with their token on the instance so that Nova can either convert the Libvirt secret to non-private and persistent in the case of host, or create a new Barbican secret with the same contents, but owned by the Nova service user, in the case of deployment. Operators have no choice but to communicate this to their users, at which point users have a choice to either opt in to the new security level, or refuse by not touching their instances or deleting them outright.

So that users can see what secret security level has been set on their instances by the operators, this spec depends on the Image properties in server show spec, which will allow users to see the embedded image properties set on their instance, and determine the vTPM secret security level that way.

User confirmation mechanism

For existing instances, because a user token is needed to activate the host or deployment vTPM secret security policies, the presence of the embedded image property set on the instances alone will not convey whether the policy shown is currently active.

In order to track whether instances’ vTPM secret security policies are currently active, a new flag tpm_secret_security_confirmed will be set in the instance system_metadata with a value of True or False.

Existing instances will get tpm_secret_security_confirmed = False and will be switched to True if and when the user reboots their instance. If the user never touches their instance, it will remain False.

New instances will get tpm_secret_security_confirmed = True.

The value of tpm_secret_security_confirmed will be used internally by Nova to determine whether to reject an API request for live migration or not. If the vTPM secret security policy has not been confirmed, Nova API will reject a request for live migration, preserving legacy behavior for existing instances in that case.
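The API-side check described above might look like the sketch below. The function name is hypothetical, the security level is passed in separately to avoid guessing how it is keyed in system_metadata, and two assumptions are made: flag values are compared as the strings "True"/"False", and a missing flag is treated as unconfirmed.

```python
# Hypothetical sketch of the live migration gate; not Nova code.
def may_live_migrate(level, system_metadata):
    """Return True if a vTPM instance may be live migrated."""
    if level == "user":
        # The user level never allows live migration.
        return False
    # Assumption: an absent flag means the policy was never confirmed.
    confirmed = system_metadata.get("tpm_secret_security_confirmed", "False")
    return str(confirmed) == "True"
```

An existing instance upgraded to host but never rebooted by its owner would therefore still be rejected, which preserves the legacy behavior.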

Implementation

Assignee(s)

Primary assignee:: notartom, melwitt

Feature Liaison

Feature liaison:: melwitt, dansmith

Work Items

Introduce the hw_tpm_secret_security, hw:tpm_secret_security, [libvirt]supported_tpm_secret_security, and [libvirt]default_tpm_secret_security image properties, flavor extra specs, and config options.
Modify the pre live migration and rollback code to handle secret definition and cleanup.
Introduce the tpm_secret_security_confirmed flag in instance system_metadata.
Bump the service version.
Modify the existing API block to only allow live migration of host or deployment instances once the minimum service version has reached the bumped version.
Add a whitebox/integration test.
Update the documentation.

Dependencies

Libvirt version 7.1.0. This can be enforced dynamically in code.

Testing

Nova’s functional tests are extended to test the Nova logic using the Libvirt fixture. This is particularly useful for cases that cannot be easily tested in a real environment, like rollback.

The existing whitebox-tempest-plugin vTPM tests are extended to test live migration in a real environment with an actual Libvirt.

Documentation Impact

Nova’s vTPM documentation is updated to remove the live migration limitation and explain the usage of the supported_tpm_secret_security and default_tpm_secret_security configuration options, as well as the implications of all possible values. The expectation that vTPM state storage is not shared and that shared vTPM state storage live migration is untested is made explicit.

References

Empty.

History

Revisions
Release Name	Description
2025.2 Flamingo	Re-proposed
2025.1 Epoxy	Introduced

Enable VFIO devices with kernel variant drivers

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/enable-vfio-devices-with-kernel-variant-drivers

This spec outlines the necessary steps to enable support for SR-IOV devices using the new kernel VFIO SR-IOV variant driver interface.

Problem description

Starting with kernel 5.16 and continuing in subsequent kernels, including those in Ubuntu 24.04 (Noble Numbat) and future RHEL 10 releases, the SR-IOV mechanism for sharing Virtual Functions (VFs) with a guest has evolved.

While the older interfaces are still supported, a new interface using variant drivers has been introduced. Several devices already leverage this newer variant driver interface.

As a result, Nova should update its VFIO device support to accommodate this advancement.

Use Cases

As an operator, I want to use SR-IOV devices on Linux distributions that require variant drivers.
As an operator, I want “legacy” SR-IOV devices support to remain compatible.

Proposed change

Description:

SR-IOV devices using the variant driver interface can likely be integrated with Nova by building upon the existing PCI passthrough and SR-IOV support, combined with several modifications proposed in this specification.

According to the device documentation, users should configure the devices to be accessible as PCI Virtual Functions (VFs) identified by their PCI addresses.

Subsequently, by following the Nova documentation on attaching physical PCI devices to guests, users should arrive at a main configuration PCI section that specifies device attributes and aliases.

Configuring managed mode:

Users must specify whether the PCI device is managed by libvirt to allow detachment from the host and assignment to the guest, or vice versa. The managed mode of a device depends on the specific device and the support provided by its driver.

The proposed solution is to add a managed tag to the device specification.

managed='yes' means that nova will let libvirt detach the device from the host before attaching it to the guest and re-attach it to the host after the guest is deleted.
managed='no' means that nova will not request libvirt to detach / attach the device from / to the host. In this case nova assumes that the operator configured the host in a way that these VFs are not attached to the host.

Note

If not set, the default value is managed='yes' to preserve the existing behavior, primarily for upgrade purposes.

The behavior, specifically for Nova, assumes that the devices are already bound to vfio-pci or the relevant variant driver and are directly usable without any additional operations to enable passthrough to QEMU.

Warning

Incorrect configuration of this parameter may result in host OS crashes.

When this tag is encountered by the PCI resource tracker, the corresponding information will be stored in the respective PciDevice object under the extra_info field. This allows the code responsible for generating the XML definition to configure the libvirt-managed mode with the appropriate value.

Note

The PciDevice object version remains unchanged.
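As a rough illustration of how the stored tag could drive XML generation, the sketch below builds a hostdev element from a PCI address and an extra_info dict. The function name hostdev_xml and the dict-based input are assumptions for the example; Nova's libvirt driver uses its own config classes, and the element produced here is reduced to the managed attribute and source address.

```python
import xml.etree.ElementTree as ET


# Hypothetical sketch; Nova generates this XML through its own classes.
def hostdev_xml(pci_address, extra_info):
    """Build a <hostdev> element for a PCI address like '0000:25:00.4'."""
    # Fall back to 'yes' when the tag is absent, matching the default.
    managed = extra_info.get("managed", "yes")
    domain, bus, slot_function = pci_address.split(":")
    slot, function = slot_function.split(".")
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci",
                         managed=managed)
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", domain="0x" + domain, bus="0x" + bus,
                  slot="0x" + slot, function="0x" + function)
    return ET.tostring(hostdev, encoding="unicode")
```

Calling hostdev_xml("0000:25:00.4", {"managed": "no"}) yields an element with managed="no" and a source address of domain 0x0000, bus 0x25, slot 0x00, function 0x4.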

Sanitize device specification:

As part of the initialization process, checks are performed to validate the correctness of the device specifications. Currently, if duplicates are present in the specifications, only the first entry is retained. While this behavior is acceptable, we may consider extending it in the future to log a warning and notify the user.
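A minimal sketch of this first-entry-wins behavior, including the possible future warning, is shown below. It assumes that two specifications count as duplicates when their parsed dictionaries are identical; the spec does not define the comparison, and sanitize_device_specs is a hypothetical name.

```python
import json
import logging

LOG = logging.getLogger(__name__)


# Hypothetical sketch; "duplicate" here means an identical parsed dict.
def sanitize_device_specs(specs):
    """Return specs with later duplicates dropped, keeping the first."""
    seen = set()
    unique = []
    for spec in specs:
        # Serialize with sorted keys so key order does not matter.
        key = json.dumps(spec, sort_keys=True)
        if key in seen:
            LOG.warning("Ignoring duplicate device_spec entry: %s", key)
            continue
        seen.add(key)
        unique.append(spec)
    return unique
```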

Display management:

From libvirt documentation:

An optional display attribute may be used to enable using a vgpu device as a display device for the guest. Supported values are either on or off (default). There is also an optional ramfb attribute with values of either on or off (default). When enabled, the ramfb attribute provides a memory framebuffer device to the guest. This framebuffer allows the vgpu to be used as a boot display before the gpu driver is loaded within the guest. ramfb requires the display attribute to be set to on.

These settings can be activated for only one VGPU, even if multiple VGPUs are attached to a VM.

Note

In this initial implementation, display management is out of scope, consistent with the existing mdev implementation.

Examples:

Note

The following example demonstrates device specifications and alias configurations.

[pci]
device_spec = { "vendor_id": "10de", "product_id": "25b6", "address": "0000:25:00.4", "managed": "no" }

alias = { "vendor_id": "10de", "product_id": "25b6", "device_type": "type-VF", "name": "MYVF" }

Creating a VM based on the configuration above will include the following snippet in the XML definition:

<hostdev mode='subsystem' type='pci' managed='no'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x25' slot='0x00' function='0x4'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</hostdev>

The above example does not apply if users need to support multiple kinds of VFs.

Support for multiple kinds of VFs:

SR-IOV devices, such as GPUs, can be configured to provide VFs with various characteristics under the same vendor ID and product ID.

To enable Nova to model this, if you configure the VFs with different resource allocations, you will need to use separate resource_classes for each.

This can be achieved by following the steps below:

Enable PCI in Placement: This is necessary to track PCI devices with custom resource classes in the placement service.
Define Device Specifications: Use a custom resource class to represent a specific VF type and ensure that the VFs existing on the hypervisor are matched via the VF’s PCI address.
Specify Type-Specific Flavors: Define flavors with an alias that matches the vendor, product, and resource class to ensure proper allocation.

Device specification resource class:

This is necessary for users who want to support multiple kinds of VFs, requiring the “PCI in placement” feature to be enabled.

The resource class can be user-defined, provided it conforms to the placement validation requirements. While nova will normalize the resource class string to produce a valid resource class, relying on this is considered bad practice.

Normalisation is done by making the string upper case, replacing each run of consecutive characters outside of [A-Z0-9_] with a single '_', and prefixing the name with CUSTOM_ if not yet prefixed.

For example, CUSTOM_<TYPE_OF_VF>, e.g. CUSTOM_GOLD_GPU, would be a valid resource class.
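The normalisation rule described above can be sketched in a few lines. This is an illustration of the stated rule only; normalize_resource_class is a hypothetical name and not necessarily how Nova implements it.

```python
import re


# Hypothetical sketch of the normalisation rule described above.
def normalize_resource_class(name):
    """Upper-case, collapse invalid character runs, add CUSTOM_ prefix."""
    # Replace each run of characters outside [A-Z0-9_] with one underscore.
    normalized = re.sub(r"[^A-Z0-9_]+", "_", name.upper())
    if not normalized.startswith("CUSTOM_"):
        normalized = "CUSTOM_" + normalized
    return normalized
```

Under this rule "gold gpu" becomes CUSTOM_GOLD_GPU, while an already valid name such as CUSTOM_A16_16A is returned unchanged.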

Examples:

Note

The following example demonstrates device specifications and alias configurations, utilizing resource classes as part of the “PCI in placement” feature.

[pci]
device_spec = { "vendor_id": "10de", "product_id": "25b6", "address": "0000:25:00.4", "resource_class": "CUSTOM_A16_16A", "managed": "no" }

alias = { "device_type": "type-VF", "resource_class": "CUSTOM_A16_16A", "name": "A16_16A" }

Alternatives

REST API impact

Data model impact

Only the existing extra_info free dict will be extended.

Security impact

Notifications impact

Other end user impact

Performance Impact

If PCI in placement is enabled, this bug should be taken into account as it may impact performance.

Mitigation measures are currently being developed to minimize this impact.

Other deployer impact

The user is fully responsible for configuring the following:

Host device: Define the kinds of virtual VFs required.
Compute Node: Configure device specifications, including whether the device/driver supports managed='yes', along with the necessary aliases.
Flavors: If multiple kinds of VFs are needed, users must create and use different flavors for each VF type.

Developer impact

None

Upgrade impact

Users with Nvidia virtual GPUs must review their configuration.

Implementation

Assignee(s)

Primary assignee:: Uggla (René Ribaud)
Main contributors:: Bauzas (Sylvain Bauza)

Feature Liaison

Feature liaison:: N/A

Work Items

Parse managed parameter from PCI device specification.
Sanitize device specification.
Change XML generation to deal with managed parameter.
Documentation updates.
Unit tests + functional tests.

Dependencies

Performance impact bug.
PCI in placement features for multiple kinds of VFs.

Testing

Unit tests and functional tests.
Tempest and/or whitebox tests cannot be executed in CI due to hardware limitations. They can, however, be developed in parallel with this implementation and deferred for later inclusion in CI.

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Epoxy	Introduced

Image properties in server show

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/image-properties-in-server-show

This spec proposes to show an instance’s embedded image properties in the server show API. This has lots of uses, but is particularly required for vTPM live migration to show users the vTPM secret security level that is set on their instances.

Problem description

Nova copies the properties of the image into the instance system metadata at instance create and rebuild to keep this information available even if the image is changed or deleted later in glance. However the nova API does not return this authoritative information to the user. As image properties can affect how the instance is scheduled and what features are enabled for it in the hypervisor this information is very useful for the user.

Use Cases

I as the owner of the VM would like to know the image properties used by nova when scheduling and building my VM even after the image is changed or deleted in glance.
In particular, I as the owner of an existing VM want to see the hw_vtpm_secret_security in the embedded image properties so that I can observe the default vTPM security mode applied to my VM before I consent to such a security change. See vTPM live migration.
I as the owner of the VM want to detect if the admin needed to change any image properties on my behalf via nova-manage image_property set.

Proposed change

In a new API microversion, return the embedded image properties in the GET /servers/detail, GET /servers/{server_id} and the rebuild case of POST /servers/{server_id}/action responses.

The implementation needs to populate this part of the api response from our cache of the image details in instance.system_metadata.
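One way to picture this step is the sketch below, which pulls the cached image keys out of a system_metadata dict. It assumes the cached keys carry an image_ prefix; that prefix, the function name, and the dict input are assumptions for illustration, and a real implementation may also need to filter out cached keys that are not image properties.

```python
# Hypothetical sketch; the "image_" prefix is an assumption, not spec text.
def embedded_image_properties(system_metadata, prefix="image_"):
    """Return the cached image properties with the prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in system_metadata.items()
        if key.startswith(prefix)
    }
```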

Alternatives

Implement separate top-level fields for each feature that depends on an image property.

Data model impact

No impact as the image properties are already modelled and persisted today.

REST API impact

In a new microversion the following API responses are extended:

GET /servers/detail
GET /servers/{server_id}
POST /servers/{server_id}/action where the action is rebuild

A new properties subkey will be added to the struct under the existing image key. It is a dict in which both the keys and the values follow the schema ^[a-zA-Z0-9-_:. ]{1,255}$.
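To make the schema concrete, the sketch below checks a properties dict against the pattern. It is an illustration only: valid_image_properties is a hypothetical helper, and the hyphen in the character class is escaped for clarity, which matches the same set of characters as the pattern written above.

```python
import re

# Same character set as the schema; the hyphen is escaped for clarity.
PROPERTY_PATTERN = re.compile(r"^[a-zA-Z0-9\-_:. ]{1,255}$")


# Hypothetical helper; the real API would rely on JSON schema validation.
def valid_image_properties(properties):
    """Return True if every key and value matches the schema pattern."""
    return all(
        isinstance(key, str) and isinstance(value, str)
        and PROPERTY_PATTERN.match(key) and PROPERTY_PATTERN.match(value)
        for key, value in properties.items()
    )
```

Empty strings, strings longer than 255 characters, and characters such as "/" are rejected.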

The new subkey will be included in the response with the current default policy of these APIs, which is PROJECT_READER_OR_ADMIN.

Response example:

{
  "servers": [
    {
      "id": "65fc9d2f-1d02-4bb0-8602-b505252b17f8",
      "name": "vm1",
      "status": "ACTIVE",
...
      "image": {
        "id": "197c0527-f0f8-4f94-9ccc-82759bf0dc21",
        "links": [
...
        ],
        "properties": {
          "hw_machine_type": "pc-q35-8.2",
          "hw_vtpm_secret_security": "host",
          "hw_tpm_version": "2.0",
          "hw_tpm_model": "tpm-crb"
...
        }
      },
      "locked": false,
...
    }
  ]
}

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None, the system_metadata is already loaded from the DB when the API response is generated since microversion 2.73

Other deployer impact

None

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: ?

Feature Liaison

Feature liaison:: balazs-gibizer

Work Items

In a new API microversion extend the API response

Dependencies

None

Testing

Unit test
API sample functional test

Documentation Impact

API ref

References

None

History

Revisions
Release Name	Description
2025.1 Epoxy	Introduced

libvirt SPICE direct consoles

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-spice-direct-consoles

This specification proposes modifications to Nova’s libvirt driver to support “direct” SPICE VDI consoles. These consoles are “direct” in that they are not intended to be accessed through an HTML5 transcoding proxy; instead, the user would use a native SPICE client like remote-viewer. Such a facility enables a much richer virtual desktop experience than Nova currently supports, in return for relatively minor changes to Nova. A new Nova API microversion is also required to add this new console type.

While exposing the SPICE TCP ports on the hypervisor to the internet is not advisable, this facility allows a SPICE protocol native proxy to channel traffic from users to the correct hypervisor ports. In order to ensure that the hypervisor port information is protected, it is only exposed in the API to callers with admin permissions.

Problem description

The SPICE protocol was added to Nova a long time ago, and still represents the richest and most performant option for remote desktops using Nova. However, at the moment, Nova’s HTML5 transcoding proxy is the only way to access these SPICE consoles, and the HTML5 interface does not support many of the more novel features of the SPICE protocol, nor does it support high resolution desktops well.

Use Cases

As a developer, I don’t want these changes to make the Nova codebase even more complicated. The changes proposed are relatively contained – a single new API microversion, two additional extra specs (for sound and USB passthrough) with associated domain XML generation code, and associated tests.

As a deployer, I want to be able to use OpenStack to provide rich virtual desktops to my users. This change facilitates such functionality, but does require additional deployment steps such as setting up TLS certificates for your hypervisors and managing a SPICE native proxy. There is a sample implementation using Kolla-Ansible available, but other deployment systems would need to integrate this functionality for it to be generally available.

As a deployer who doesn’t want rich desktop consoles, I don’t want this functionality to complicate my deployment. When disabled, the changes to deployments are minor – for example the extra USB passthrough devices and sound devices in the domain XML are all disabled unless requested by the relevant extra specs.

As an end user, I would like access to a richer desktop experience than is currently available. Once these changes are integrated and a SPICE native proxy deployed, a further change to either Horizon or Skyline will be required to orchestrate console access. It is expected the complete end to end functionality will take several releases to land before a fully seamless experience is available. Once fully implemented, Horizon and Skyline will be capable of delivering a .vv configuration file for a specific console to a client, who will then have seamless access to their virtual desktop. However, a user will be able to use the openstack console url show command immediately to create a console session outside of our web clients.

Proposed change

The proposed solution is relatively simple – add an API microversion which makes it possible to create a “spice-direct” console, and to look up connection details for that console from the API. The new console type and microversion are required because we need to be able to specify the new console type, which is an API schema change.

The response from a get_spice_console or create call which requests a “spice-direct” console will return a URL derived from CONF.spice.spice_direct_proxy_base_url, and will include a console access token. The user would then request this URL, and the SPICE native proxy would look up console connection details from nova via the /os-console-auth-tokens/ API. These details would be used to generate a virt-viewer .vv configuration file, which the user can then use to access a proxied SPICE console.

Because the response from /os-console-auth-tokens/ includes the host and port on the hypervisor that the SPICE console is running on, it is agreed that these API methods should have restricted accessibility. However, this is a pre-existing API and this should already be true. This protects sensitive network configuration information from being provided to less trusted users.

This specification also covers tweaks to the libvirt domain XML to enrich the desktop experience provided by such a direct console, such as:

USB device passthrough from client to guest via extra spec configurable usbredir support (WIP implementation at I791b16c5bf0e860a188783c863e95dc423998b0a)
sound support via an extra spec to specify a sound device (WIP implementation at I2faeda0fd0fb9c8894d69558a1ccaab8da9f6a1b)

Note that allowing concurrent console access from more than one user is technically feasible, but forbidden by Nova’s policy of not manipulating the qemu command line directly. See I65f94771abdc1a6ef54637ea81f25ce1daaf4963 for discussion on that issue.

The proposed changes allow direct connection to a SPICE console from a SPICE native client like remote-viewer. Without additional software, this implies that such a client would have network connectivity to relatively arbitrary TCP ports on the hypervisor hosting the instance. However, a SPICE protocol native proxy now exists, and a parallel proposal to this one proposes adding support for it to Kolla-Ansible. This proxy is called Kerbside, and more details are available at https://github.com/shakenfist/kerbside. That is, with the proxy deployed there is effectively no change to the network exposure of Nova hypervisors.

When implemented, a user can fetch a Kerbside connection URL like this:

The user then fetches that URL, and Kerbside delivers a .vv file with the connection information for a SPICE client. Kerbside uses a call to /os-console-auth-tokens/bf2e6883-… to determine the validity of the console authentication token, and the connection information for the console.
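As a rough illustration of the last step, the sketch below renders a virt-viewer connection file from a few connection details. It is not Kerbside's implementation. The function name, argument names and example values are hypothetical, and the key names (type, host, port, tls-port, password) are taken from the virt-viewer connection file format as commonly documented, so treat them as assumptions to verify.

```python
import configparser
import io


def render_vv_file(host, port=None, tls_port=None, password=None):
    """Render a virt-viewer connection file for a SPICE console.

    Hypothetical helper: a real proxy would normally advertise its own
    address here rather than the hypervisor's.
    """
    # Disable interpolation so a '%' in a token is written verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep key names exactly as given instead of lower-casing them.
    parser.optionxform = str
    parser["virt-viewer"] = {"type": "spice", "host": host}
    if port is not None:
        parser["virt-viewer"]["port"] = str(port)
    if tls_port is not None:
        parser["virt-viewer"]["tls-port"] = str(tls_port)
    if password is not None:
        parser["virt-viewer"]["password"] = password
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


# Example values are placeholders, not output from a real deployment.
vv_text = render_vv_file(
    "kerbside.example.com", tls_port=5900, password="one-time-token")
```

The resulting text can be saved with a .vv extension and opened with a SPICE client such as remote-viewer.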

Alternatives

Unfortunately the SPICE HTML5 proxy does not meet the needs of many remote desktop users. Realistically OpenStack does not currently have a way of providing these rich desktop consoles to users. Instead, other systems such as Citrix are used for this functionality.

Data model impact

The console auth token table needs to have an extra column added so that TLS ports can be tracked alongside unencrypted ports. This change is minor and should not be difficult for deployers to support as this table should not be particularly large given authentication tokens already expire.

REST API impact

This specification adds a new console type, “spice-direct”, which provides the connection information required to talk the native SPICE protocol directly to qemu on the hypervisor. This is intended to be fronted by a proxy which will handle authentication separately.

A new microversion is introduced which adds the type “spice-direct” to the existing “spice” protocol.

This implies that the JSON schema for the create console call would change to something like this:

create_v297 = {
    'type': 'object',
    'properties': {
        'remote_console': {
            'type': 'object',
            'properties': {
                'protocol': {
                    'type': 'string',
                    'enum': ['vnc', 'spice', 'rdp', 'serial', 'mks'],
                },
                'type': {
                    'type': 'string',
                    'enum': ['novnc', 'xvpvnc', 'spice-html5',
                             'spice-direct', 'serial', 'webmks'],
                },
            },
            'required': ['protocol', 'type'],
            'additionalProperties': False,
        },
    },
    'required': ['remote_console'],
    'additionalProperties': False,
}

And that the JSON schema for the get_spice_console would change to something like this:

get_spice_console_v297 = {
    'type': 'object',
    'properties': {
        'os-getSPICEConsole': {
            'type': 'object',
            'properties': {
                'type': {
                    'type': 'string',
                    'enum': ['spice-html5', 'spice-direct'],
                },
            },
            'required': ['type'],
            'additionalProperties': False,
        },
    },
    'required': ['os-getSPICEConsole'],
    'additionalProperties': False,
}
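To make the effect of the schema change concrete, the following sketch checks two request bodies against a copy of the create schema. The matches() helper is a hypothetical, deliberately tiny validator that only understands the keywords used in this spec (type, enum, properties, required, additionalProperties). It is not how Nova validates requests.

```python
def matches(schema, value):
    """Check a value against the small schema subset used in this spec."""
    if schema["type"] == "string":
        return (isinstance(value, str)
                and value in schema.get("enum", [value]))
    if schema["type"] == "object" and isinstance(value, dict):
        props = schema.get("properties", {})
        if any(key not in value for key in schema.get("required", [])):
            return False
        extra_ok = schema.get("additionalProperties", True)
        if not extra_ok and any(key not in props for key in value):
            return False
        return all(
            matches(props[key], item)
            for key, item in value.items() if key in props
        )
    return False


# Copy of the proposed create schema, reformatted for brevity.
create_schema_sketch = {
    'type': 'object',
    'properties': {
        'remote_console': {
            'type': 'object',
            'properties': {
                'protocol': {
                    'type': 'string',
                    'enum': ['vnc', 'spice', 'rdp', 'serial', 'mks'],
                },
                'type': {
                    'type': 'string',
                    'enum': ['novnc', 'xvpvnc', 'spice-html5',
                             'spice-direct', 'serial', 'webmks'],
                },
            },
            'required': ['protocol', 'type'],
            'additionalProperties': False,
        },
    },
    'required': ['remote_console'],
    'additionalProperties': False,
}

direct_request = {
    'remote_console': {'protocol': 'spice', 'type': 'spice-direct'},
}
unknown_type_request = {
    'remote_console': {'protocol': 'spice', 'type': 'spice-native'},
}

accepted = matches(create_schema_sketch, direct_request)
rejected = not matches(create_schema_sketch, unknown_type_request)
```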

The response from /os-console-auth-tokens/ also needs to be tweaked to return a TLS port if one is configured for the console, which will require a response schema change.

Security impact

This proposal has a medium security impact. While hypervisor host / port details will only be exposed to requestors that have the service role or admin permissions, Kerbside does need to have network connectivity to the SPICE TCP ports on the hypervisors in the cloud. However, Kerbside provides a protective layer to these TCP ports, and it is not intended to expose this information to less privileged requestors.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

As discussed, a complete implementation requires deployment systems to integrate the Kerbside SPICE proxy, as well as modifications to front ends such as Horizon and Skyline to orchestrate consoles via Kerbside. However, those are outside the scope of a Nova specification.

The following configuration options are added by the proposed changes:

spice.spice_direct_proxy_base_url: defaults to an example URL which wouldn’t actually work for a non-trivial installation (just as the HTML5 transcoding proxy does). This is the base URL for the Kerbside URLs handed out by Nova.
spice.require_secure: defaults to False, the current hard coded default. Whether to require secure TLS connections to SPICE consoles. If you’re providing direct access to SPICE consoles instead of using the HTML5 proxy, you may wish those connections to be encrypted. If so, set this value to True. Note that use of secure consoles requires that you set up TLS certificates on each hypervisor.

The following additional image properties will be added:

hw_audio_model: defaults to None, the current hard coded default. Whether to include a sound device for the instance when SPICE consoles are enabled, and if so what type.
hw_usb_model: defaults to None, the current hard coded default. This is required if hw_redirected_usb_ports is to be configured.
hw_redirected_usb_ports: defaults to None, the current hard coded default. If configured, this specifies the number of usbredir devices created within the instance domain XML.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: mikal
Other contributors:: None

Feature Liaison

Liaison needed.

Work Items

All code is currently proposed for review in Gerrit.

Dependencies

None.

Testing

Testing graphical user interfaces in the gate is hard. However, a test for the API microversion will be added, and manual testing of the console functionality has occurred on the prototype and will be redone as the patches land.

Documentation Impact

The Operators Guide will need to be updated to cover the new functionality and configuration options. The End User’s guide will need to be updated to explain usage once the functionality is fully integrated.

References

None.

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced
2025.1 Epoxy	Updated and reproposed

Allow Manila shares to be directly attached to an instance when using libvirt

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-virtiofs-attach-manila-shares

Manila is the OpenStack Shared Filesystems service. This spec will outline API, database, compute and libvirt driver changes required in Nova to allow the shares provided by Manila to be associated with and attached to instances.

Problem description

At present users must manually connect to and mount shares provided by Manila within their instances. As a result operators need to ensure that Manila backend storage is routable from the guest subnets.

Use Cases

As an operator I want the Manila datapath to be separate to any tenant accessible networks.
As a user I want to attach Manila shares directly to my instance and have a simple interface with which to mount them within the instance.
As a user I want to detach a directly attached Manila share from my instance.
As a user I want to track the Manila shares attached to my instance.

Proposed change

This initial implementation will only provide support for attaching a share to and later detaching a share from an existing SHUTOFF instance. The ability to express attachments during the initial creation of an instance will not be covered by this spec.

Support for move operations once a share is attached will also not be covered by this spec. Any request to cold migrate, evacuate, live migrate, rebuild, resize, shelve, suspend, or volume snapshot an instance with a share attached will be rejected with an HTTP 409 response for the time being.

A new server shares API will be introduced under a new microversion. This will list current shares, show their details and allow a share to be attached or detached.

A new share_mapping database table and associated ShareMapping versioned objects will be introduced to capture details of the share attachment. A base ShareMapping versioned object will be provided from which virt driver and backend share specific objects can be derived providing specific share attach and detach implementations.

Note

One thing to note here is that no Manila state will be stored within Nova aside from the export details used to initially attach the share. These details are later used when detaching the share. If the share is then reattached, Nova will request fresh export details from Manila and store these in a new share attachment within Nova.

The libvirt driver will be extended to support the above with initial support for cold attach and detach. Future work will aim to add live attach and detach as it is now supported by libvirt.

This initial libvirt support will target the basic NFS and slightly more complex CephFS backends within Manila. Shares will be mapped through to the underlying libvirt domains using virtio-fs. This will require QEMU >= 5.0 and libvirt >= 6.2 on the compute host and a kernel version >= 5.4 within the instance guest OS.

Additionally this initial implementation will require that the associated instances use file backed memory or huge pages. This is a requirement of virtio-fs as the virtiofsd service uses the vhost-user protocol to communicate directly with the underlying guest. (ref: vhost-user documentation)

Two new compute capability traits and filters will be introduced to model an individual compute’s support for virtio-fs and file backed memory. While associating a share with an instance, a check will ensure that the host running the instance supports the COMPUTE_STORAGE_VIRTIO_FS trait, and that either the host supports the COMPUTE_MEM_BACKING_FILE trait or the instance is configured with the hw:mem_page_size extra spec.
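The check described above can be sketched as a small predicate. This is only an illustration of the rule as read from this section. The function name and the plain set and dict arguments are hypothetical, and the real check would work on Nova's own host and flavor objects.

```python
def host_can_attach_share(host_traits, flavor_extra_specs):
    """Return True when virtio-fs and a shared memory backing are usable.

    host_traits: set of trait names reported for the compute host.
    flavor_extra_specs: dict of the instance flavor's extra specs.
    """
    has_virtiofs = "COMPUTE_STORAGE_VIRTIO_FS" in host_traits
    has_shared_memory = (
        "COMPUTE_MEM_BACKING_FILE" in host_traits
        or "hw:mem_page_size" in flavor_extra_specs
    )
    return has_virtiofs and has_shared_memory


# A host with file backed memory, and an instance using huge pages.
file_backed_ok = host_can_attach_share(
    {"COMPUTE_STORAGE_VIRTIO_FS", "COMPUTE_MEM_BACKING_FILE"}, {})
huge_pages_ok = host_can_attach_share(
    {"COMPUTE_STORAGE_VIRTIO_FS"}, {"hw:mem_page_size": "large"})
```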

From an operator’s point of view, COMPUTE_STORAGE_VIRTIO_FS support requires upgrading all compute nodes to a version that supports shares using virtiofs.

COMPUTE_MEM_BACKING_FILE support requires that operators configure one or more hosts with file backed memory. Ensuring the instance will land on one of these hosts can be achieved by creating an AZ that contains these hosts and then instructing users to deploy their instances in this AZ. Alternatively, operators can guide the scheduler to choose a suitable host by adding trait:COMPUTE_MEM_BACKING_FILE=required as an extra spec or image property.

Users will be able to mount the attached shares using a mount tag. This is either the share UUID from Manila or a string provided by the user with the request to attach the share.

user@instance $ mount -t virtiofs $tag /mnt/mount/path

A previously discussed os-share library will not be created with this initial implementation but could be in the future if the logic required to mount and track shares on the underlying host is also required by other projects. For the time being existing code within the libvirt driver used to track filesystem host mounts used by volumes hosted on remoteFS based storage (such as NFS, SMB etc) will be reused as much as possible.

Share mapping status:

                     +----------------------------------------------------+   Reboot VM
    Start VM         |                                                    | --------------+
    Share mounted    |                       active                       |               |
+------------------> |                                                    | <-------------+
|                    +----------------------------------------------------+
|                      |                   |             |
|                      | Stop VM           |             |
|                      | Fail to umount    |             |
|                      v                   |             |
|                    +------------------+  |             |
|                    |      error       | <+-------------+-------------------+
|                    +------------------+  |             |                   |
|                      |                   |             |                   |
|                      | Detach share or   |             |                   |
|                      | delete VM         | Delete VM   |                   |
|                      v                   |             |                   |
|                    +------------------+  |             |                   |
|    +-------------> | detaching --> φ  | <+             |                   | Start VM
|    |               +------------------+                |                   | Fail to mount
|    |                 |                                 |                   |
|    | Detach share    |                                 | Stop VM           |
|    | or delete VM    | Attach share                    | Share unmounted   |
|    |                 v                                 v                   |
|    |               +----------------------------------------------------+  |
|    +-------------- |          attaching --> inactive                    | -+
|                    +----------------------------------------------------+
|                      |
+----------------------+

φ means no entry in the database. No association between a share and a server.

Attach share: means POST /servers/{server_id}/shares
Detach share: means DELETE /servers/{server_id}/shares/{share_id}

This chart describes the share mapping status within Nova, which is independent of the status of the Manila share.
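One way to read the chart is as a transition table. The sketch below encodes that reading. The event names are hypothetical labels chosen for this example, None stands for φ (no database entry), and the transient attaching and detaching statuses are folded into the transition that passes through them.

```python
# (current status, event) -> next settled status. None stands for φ.
TRANSITIONS = {
    (None, "attach_share"): "inactive",        # via "attaching"
    ("inactive", "start_vm_mounted"): "active",
    ("inactive", "start_vm_mount_failed"): "error",
    ("inactive", "detach_share"): None,        # via "detaching"
    ("inactive", "delete_vm"): None,           # via "detaching"
    ("active", "reboot_vm"): "active",
    ("active", "stop_vm_unmounted"): "inactive",
    ("active", "stop_vm_umount_failed"): "error",
    ("active", "delete_vm"): None,             # via "detaching"
    ("error", "detach_share"): None,           # via "detaching"
    ("error", "delete_vm"): None,              # via "detaching"
}


def next_status(current, event):
    """Return the status reached from `current` when `event` happens."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"{event!r} is not allowed from {current!r}") from None


# Attach a share, start the server, then stop it again.
status = next_status(None, "attach_share")
status = next_status(status, "start_vm_mounted")
status = next_status(status, "stop_vm_unmounted")
```

Note that in this reading a share cannot be detached while its mapping is active, because the server has to be stopped first.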

Share attachment/detachment can only be done if the VM state is STOPPED or ERROR.

The operation to start a VM might fail if the attachment of an underlying share fails or if the share is not in an inactive state.

Note

In such scenarios, the instance will be marked as ERROR. Subsequent attempts to start the VM will require a hard reboot by the user, in line with standard procedures for such situations. This error handling will be centralized and managed by the compute host.

The mount operation is performed when the share is not yet mounted on the compute host. If the share was already mounted on the compute host for another server, Nova will still attempt to mount it and will log a warning that the share was already mounted.

The umount operation is only actually performed when the share is mounted and is no longer used by any other server.

For the mount and umount operations above, the state is stored in memory and does not require a lookup in the database.
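A minimal sketch of such in-memory bookkeeping is shown below. It is not the libvirt driver's existing mount tracking code. The class and method names are hypothetical, and it only answers two questions for the caller: whether a server is the first user of a share on this host, and whether it was the last one.

```python
class ShareMountTracker:
    """Track, per compute host, which servers use each mounted share."""

    def __init__(self):
        # share_id -> set of instance UUIDs currently using the share
        self._users = {}

    def add_user(self, share_id, instance_uuid):
        """Record a user; return True if it is the first on this host."""
        users = self._users.setdefault(share_id, set())
        first_user = not users
        users.add(instance_uuid)
        return first_user

    def remove_user(self, share_id, instance_uuid):
        """Drop a user; return True if the share is now unused here."""
        users = self._users.get(share_id)
        if not users or instance_uuid not in users:
            # Nothing was recorded, so there is nothing to unmount.
            return False
        users.remove(instance_uuid)
        if users:
            return False
        del self._users[share_id]
        return True


tracker = ShareMountTracker()
first = tracker.add_user("share-1", "server-a")       # first user
second = tracker.add_user("share-1", "server-b")      # already in use
early = tracker.remove_user("share-1", "server-a")    # still in use
last = tracker.remove_user("share-1", "server-b")     # safe to unmount
```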

The share will be mounted on the compute host in read/write mode. Read-only mode will not be supported, as a share cannot be mounted read-only and read/write at the same time. Users who want the share to be read-only will have to configure that in the fstab within the VM.

Instance Deletion Processes:

Standard Deletion:

During a normal deletion process on the compute side, both the unmount and Manila policy removal are attempted.
- If both operations succeed, the corresponding share mapping is also removed.
- If either the unmount or policy removal fails, the instance itself is deleted, but a share mapping record may remain in the database. A future enhancement will include a periodic task designed to unmount, remove the policy, and clean up any leaked share mappings.

Local Deletion:

When the VM is marked as DELETED in the database due to unavailable compute during the delete request, no unmounting or Manila policy removal occurs via the API.
- Once the compute is operational again, it identifies instances marked as DELETED that have not yet been cleaned up. During the initialization of the instance, the compute attempts to complete the deletion process, which includes unmounting the share and removing the access policy.
  - If these actions are successful, the share mapping will be removed.
  - If either action fails, the deletion remains incomplete; however, the compute’s startup process continues unaffected, and the error is merely logged. For security reasons, it’s crucial not to retain the mount, necessitating a retry mechanism for cleanup. This situation parallels the standard deletion scenario and requires a similar periodic task for resolution.

Manila share removal issue:

An issue was identified in the Zed cycle: a share being used by instances could be removed by the user. As a result, the instances would lose access to the data, and it might be difficult to remove the missing share and fix the instance.

A solution was identified with the Manila team: attach metadata to the share access policy that will lock the share and prevent its deletion until the lock is removed.

This solution was implemented in the Antelope cycle. The proposal here will use the lock mechanism in Nova.

Instance metadata:

Add instance shares in the instance metadata. Extend DeviceMetadata with ShareMetadata object containing share_id and tag used to mount the virtiofs on an instance by the user. See Other end user impact.

Alternatives

The only alternative is to continue with the current situation where users must mount the shares within their instances manually. The downside being that these instances must have access to the storage network used by the Manila backends.

REST API impact

A new server level shares API will be introduced under a new microversion with the following methods:

GET /servers/{server_id}/shares

List all shares attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "shares": [
        {
            "share_id": "48c16a1a-183f-4052-9dac-0e4fc1e498ad",
            "status": "active",
            "tag": "foo"
        },
        {
            "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar"
        }
    ]
}

GET /servers/{server_id}/shares/{share_id}

Show details of a specific share attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "share": {
        "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar"
    }
}

PROJECT_ADMIN will be able to see details of the attachment id and export location stored within Nova:

{
    "share": {
        "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

POST /servers/{server_id}/shares

Attach a share to an instance.

Prerequisite(s):

Instance must be in the SHUTOFF state.
Instance should have the required capabilities to enable virtiofs (see above).

This API operates asynchronously. Consequently, the share_mapping is defined and its status is marked as “attaching” in the database.

In the background, the compute node will request Manila to grant access to the share and lock it for nova usage. Once this process is complete, the share status is changed to inactive. It’s important to note that locking the share also restricts visibility to users to prevent any inadvertent exposure of internal data.

Following that, when the VM is powered on, the share will be mounted onto the compute node and designated as active provided there are no errors. Conversely, when the VM is powered off, the share will be unmounted from the compute node and marked as inactive, again, if there are no errors encountered.

Return Code(s): 202,400,401,403,404,409

Request body:

Note

tag will be an optional parameter in the request body. When it is not provided, it will default to the share_id (UUID), which is always provided in the request.

tag, if provided by the user, must be an ASCII string with a maximum length of 64 bytes.

{
    "share": {
        "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}

Response body:

{
    "share": {
        "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}
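The tag rules from the request note above can be sketched as follows. The function name is hypothetical, and the way errors are reported here (a ValueError) is only a stand-in for whatever the API schema validation would return.

```python
def resolve_tag(share_id, tag=None):
    """Return the mount tag for a share attachment request.

    Falls back to the share UUID when no tag is supplied, and rejects
    tags that are not ASCII or that exceed 64 bytes.
    """
    if tag is None:
        return share_id
    if not tag.isascii():
        raise ValueError("tag must contain only ASCII characters")
    if len(tag.encode("ascii")) > 64:
        raise ValueError("tag must be at most 64 bytes long")
    return tag


share_id = "e8debdc0-447a-4376-a10a-4cd9122d7986"
default_tag = resolve_tag(share_id)
custom_tag = resolve_tag(share_id, "bar")
```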

DELETE /servers/{server_id}/shares/{share_id}

Detach a share from an instance.

Prerequisite(s): Instance must be in the SHUTOFF or ERROR state.

This API functions asynchronously, leading to the share_mapping status being marked as detaching.

Concurrently, the compute system conducts a verification to see if the share is no longer being utilized by another instance. If found unused, it requests Manila to unlock the share and deny access.

To maintain consistent logic for both NFS and CephFS, we currently remove the access policy only after the last user has unmounted the share across all compute systems. While NFS could potentially implement an access policy based on per-compute IP, CephFS currently employs an access token specific to each Nova user. In the future, we may explore utilizing a CephFS user/token that is specific to each Nova instance on each compute system.

Two checks are necessary:

To unmount, it’s important to verify whether any other virtual machines are using the share on the same compute system. This mechanism is already implemented by the driver.
For removing the access policy, we need to ensure that no compute system is currently using the share. Once this process is finalized, the association of the share is eliminated from the database.

Return Code(s): 202,400,401,403,404,409

Data model impact

A new share_mapping database table will be introduced.

id - Primary key autoincrement
uuid - Unique UUID to identify the particular share attachment
instance_uuid - The UUID of the instance the share will be attached to
share_id - The UUID of the share in Manila
status - The status of the share attachment within Nova (attaching, detaching, active, inactive, error)
tag - The device tag to be used by users to mount the share within the instance.
export_location - The export location used to attach the share to the underlying host
share_proto - The Shared File Systems protocol (NFS, CEPHFS)

A new base ShareMapping versioned object will be introduced to encapsulate the above database entries and to be used as the parent class of specific virt driver implementations.

The values of the status and share_proto database fields will not be enforced using enums, which allows future changes and avoids database migrations. However, to make the code more robust, enums will be defined on the object fields.

Fields containing text will use String and not Text type in the database schema to limit the column width and be stored inline in the database.

This base ShareMapping object will provide stub attach and detach methods that will need to be implemented by any child objects.

New ShareMappingLibvirt, ShareMappingLibvirtNFS and ShareMappingLibvirtCephFS objects will be introduced as part of the libvirt implementation.
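The shape of such a base object can be sketched with plain Python. This is not written against oslo.versionedobjects, and every class name below is a hypothetical stand-in. It only illustrates two points from this section: the database stores plain strings while the object layer validates them with enums, and the base class leaves attach and detach to its children.

```python
import enum
from dataclasses import dataclass


class ShareMappingStatusSketch(str, enum.Enum):
    ATTACHING = "attaching"
    DETACHING = "detaching"
    ACTIVE = "active"
    INACTIVE = "inactive"
    ERROR = "error"


class ShareProtoSketch(str, enum.Enum):
    NFS = "NFS"
    CEPHFS = "CEPHFS"


@dataclass
class ShareMappingSketch:
    # The autoincrement id column is omitted for brevity.
    uuid: str
    instance_uuid: str
    share_id: str
    status: ShareMappingStatusSketch
    tag: str
    export_location: str
    share_proto: ShareProtoSketch

    @classmethod
    def from_row(cls, row):
        """Build a mapping from a dict of plain database values."""
        values = dict(row)
        # Unknown strings raise ValueError here, not in the database.
        values["status"] = ShareMappingStatusSketch(values["status"])
        values["share_proto"] = ShareProtoSketch(values["share_proto"])
        return cls(**values)

    def attach(self):
        raise NotImplementedError("implemented by driver specific children")

    def detach(self):
        raise NotImplementedError("implemented by driver specific children")


row = {
    "uuid": "33a8e0cb-5f82-409a-b310-89c41f8bf023",
    "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
    "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
    "status": "inactive",
    "tag": "bar",
    "export_location": "server.com/nfs_mount,foo=bar",
    "share_proto": "NFS",
}
mapping = ShareMappingSketch.from_row(row)
```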

Security impact

The export_location JSON blob returned by Manila and used to mount the share to the host and the host filesystem location should not be logged by Nova and only accessible by default through the API by admins.

This export_location field will also be excluded from notifications by choice.

The Nova abstraction over the OpenStack SDK needs to be updated so that, when a user requests Nova to attach a Manila share to their instance, Nova uses the user’s Keystone token when communicating with Manila. This ensures that Manila can properly verify the user’s access to the requested share.

Notifications impact

New notifications will be added:

One to add new notifications for share attach and share detach.
One to extend the instance update notification with the share mapping information.

Share mapping in the instance payload will be optional and controlled via the include_share_mapping notification configuration parameter. It will be disabled by default.

Proposed payload for attached and detached notification will be the same as the one returned by the show command with admin rights.

{
    "share": {
        "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
        "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar"
    }
}

The proposed instance payload for instance update will be the list of shares attached to this instance.

{
    "shares":
    [
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar"
        },
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "share_id": "e8debdc0-447a-4376-a10a-4cd9122d7987",
            "status": "active",
            "tag": "baz"
        }
    ]
}

Other end user impact

Users will need to mount the shares within their guestOS using the returned tag.

Users could use the instance metadata to discover and auto mount the share.

Performance Impact

Through the use of vhost-user, virtio-fs should have near local (mounted) file system performance within the guest OS. While performance between the VM and the host will be near local, the actual performance will be limited by the performance of the network file share protocol and the underlying hardware.

Other deployer impact

None

Developer impact

None

Upgrade impact

A new compute service version and capability traits will be introduced to ensure both the compute service and underlying virt stack are new enough to support attaching a share via virtio-fs before the request is accepted.

A new DB migration constraint will be introduced to prevent a share from being attached more than once. Because the share_mapping table has never been usable in production, it is proposed that the table be dropped and then recreated with the updated constraint. This approach will help standardize the process across all database systems, as sqlite does not allow altering table constraints and requires the table to be recreated.

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)
Other contributors:: lyarwood (initial contributor)

Feature Liaison

Feature liaison:: uggla

Work Items

Add new capability traits within os-traits
Add support within the libvirt driver for cold attach and detach
Add new shares API and microversion

Dependencies

None

Testing

Functional libvirt driver and API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Yoga	Introduced
Zed	Reproposed
Antelope	Reproposed
Bobcat	Reproposed
Caracal	Reproposed
Dalmatian	Updated and reproposed
Epoxy	Updated and reproposed

Live migrate VFIO devices using kernel variant drivers

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/migrate-vfio-devices-using-kernel-variant-drivers

This spec outlines the necessary steps to live migrate SR-IOV devices using the new kernel VFIO SR-IOV variant driver interface.

Problem description

Support for devices using the variant driver interface is detailed in this specification.

However, the migration process is not covered there. This is addressed in the following section, which describes the Nova updates required for SR-IOV devices using a VFIO SR-IOV variant driver to be live migrated to other hosts supporting the same devices.

Use Cases

As an operator, I want to live migrate VMs with SR-IOV devices if such operation is supported by the variant driver.
As an operator, I want to declare whether a device is live migratable or non-live migratable.
As an operator, I want to define flavors that use live migratable or non-live migratable devices.

Proposed change

Description:

Configuring PCI device specification:

Administrator must specify whether the device is eligible for live migration to a similar device on another compute node.

The proposed solution is to add a live_migratable tag to the device specification in [pci]dev_spec config.

live_migratable='yes' means that the device can be live migrated.
live_migratable='no' means that the device cannot be live migrated.

When this tag is encountered by the PCI resource tracker, the corresponding information will be stored in the respective PciDevice object under the extra_info field.

If not specified, the default behavior will be equivalent to live_migratable='no'. However, this value will not be persisted in the PciDevice object.

Note

The PciDevice object version remains unchanged.

Additionally, if pci in placement is enabled and live_migratable='yes', it will record a new standard trait, HW_PCI_LIVE_MIGRATABLE, in the resource provider representing the physical device. While this trait will not be utilized by the migration flow, it can serve as a reference for inventory and later the PCI in Placement code path can be extended to automatically request this trait if the PCI alias requests live_migratable=yes device(s).

Note

Since this is not mandatory for the migration, it will be included in separate commits.

Configuring PCI aliases:

Users must specify whether the PCI request, and consequently the flavor, requires a live migratable device.

The proposed solution is to add a new live_migratable key to the PCI alias definition in the [pci]alias config.

live_migratable='yes' means that the user wants a device(s) allowing live migration to a similar device(s) on another host.
live_migratable='no' This explicitly indicates that the user requires a non-live migratable device, making migration impossible.
If not specified, the default is live_migratable=None, meaning that either a live migratable or non-live migratable device will be picked automatically. However, in such cases, migration will not be possible.
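The interaction between the alias key and the device specification can be sketched as two small predicates. These are hypothetical helpers written for this example, not functions from Nova. They assume the device side has already been reduced to a boolean, with an unset dev_spec tag treated as not live migratable, as described earlier.

```python
def device_matches_alias(alias_live_migratable, device_live_migratable):
    """Decide whether a device satisfies an alias request.

    alias_live_migratable: "yes", "no", or None when the key is unset.
    device_live_migratable: bool derived from the device specification.
    """
    if alias_live_migratable is None:
        # Either kind of device may be picked.
        return True
    return device_live_migratable == (alias_live_migratable == "yes")


def alias_allows_live_migration(alias_live_migratable):
    """Only an explicit "yes" makes live migration possible."""
    return alias_live_migratable == "yes"


# An unset key accepts any device but still rules out live migration.
any_device = device_matches_alias(None, False)
migratable = alias_allows_live_migration(None)
```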

Live migration modifications:

Verify in _check_can_migrate_pci() whether the source instance contains live migratable devices. If no live migratable devices are found, raise an exception indicating that the migration is not possible.

Note

The VM on the source host might have PCI devices attached that are not related to any PCI alias, but are there because of neutron direct or direct-physical ports. In this case nova should do what it does today: detach these ports at the start of the migration and re-attach them on the destination after the migration. Such PCI devices, which have no live_migratable=yes key in their extra_info, should not prevent the live migration from being accepted.

Modify stats.py in the filter_pools() function to handle PCI requests for live_migratable devices. Ensure it retrieves hosts with the appropriate number of live migratable devices by adding a new filter.

Since the VIF field is not used in this context, we need to claim PCI devices and retrieve the PCI addresses of the destination host.

Update the LiveMigrateData object to include the PCI device mapping between the source and destination device addresses. A new field, pci_dev_map_src_dst, defined as a DictOfStringsField will be added to the LiveMigrateData object for this purpose.

Update the _live_migration_operation() function, with a specific focus on the get_updated_guest_xml() function, to map the source PCI addresses to the destination addresses in the destination XML file using the data provided by the LiveMigrateData object.
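The address rewrite could be sketched as follows, assuming pci_dev_map_src_dst maps source address strings such as "0000:81:00.1" to destination address strings and that devices appear as libvirt hostdev elements. The helper names are hypothetical, and this is not the actual get_updated_guest_xml() code.

```python
import xml.etree.ElementTree as ET


def _addr_to_str(elem):
    # libvirt stores each address part as a hex string such as '0x0000'.
    d, b, s, f = (int(elem.get(k), 16)
                  for k in ("domain", "bus", "slot", "function"))
    return "%04x:%02x:%02x.%x" % (d, b, s, f)


def _set_addr(elem, addr):
    # addr looks like 'dddd:bb:ss.f'
    domain, bus, rest = addr.split(":")
    slot, function = rest.split(".")
    elem.set("domain", "0x" + domain)
    elem.set("bus", "0x" + bus)
    elem.set("slot", "0x" + slot)
    elem.set("function", "0x" + function)


def update_hostdev_addresses(xml, pci_dev_map_src_dst):
    """Rewrite hostdev source addresses using a {src: dst} address map."""
    root = ET.fromstring(xml)
    path = "./devices/hostdev[@type='pci']/source/address"
    for addr in root.findall(path):
        src = _addr_to_str(addr)
        if src in pci_dev_map_src_dst:
            _set_addr(addr, pci_dev_map_src_dst[src])
    return ET.tostring(root, encoding="unicode")
```

Devices whose source address is absent from the map are left untouched, which keeps the sketch compatible with devices handled by the legacy path.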

Note

If PCI in Placement is enabled, live migration will work as it does today for neutron-requested PCI devices (i.e. legacy behavior works).
If PCI in Placement is enabled, the SR-IOV live migration proposed in this spec will still work (i.e. new functionality works).
Optionally, PCI in Placement will be extended to automatically request the HW_PCI_LIVE_MIGRATABLE trait if the alias has live_migratable='yes'.
A further enhancement would be to extend the translation of the [pci]alias spec to placement RequestGroups to support forbidden traits, so that when live_migratable='no' is present in the alias, the HW_PCI_LIVE_MIGRATABLE trait is requested as forbidden.

For NICs such as the Mellanox ConnectX-7, if both live_migratable='yes' and physical_network="label" are set, the migration mechanism defined in this specification will be used instead of the legacy one.

However, this change will:

Be implemented in a separate patch to allow the base case to land first.
Ensure that such NICs are properly live migrated using the new code path.

Alternatives

REST API impact

The schema definition for PCI aliases needs to be modified to allow the specification of live migratable devices.

However, this change should not require a microversion bump.

Data model impact

The LiveMigrateData object will be extended with a new pci_devices field to supply the PCI device information of the destination host.

Security impact

Notifications impact

Other end user impact

Performance Impact

If PCI in placement is enabled, this bug should be taken into account as it may impact performance.

Mitigation measures are currently being developed to minimize this impact.

Other deployer impact

The user is fully responsible for configuring the following:

Device specifications and aliases.
Flavors: If users need to support multiple kinds of VFs, they must use different flavors for each VF type.

Developer impact

None

Upgrade impact

All VMs with devices that rely on the VFIO SR-IOV variant driver cannot be migrated until they use a new flavor that includes the correct updated aliases pointing to the revised PCI device specifications.

This can be achieved by resizing the VM and changing its flavor to the new one.

For NICs, an alternative approach could be to detach and reattach the device.

Implementation

Assignee(s)

Primary assignee:: Uggla (René Ribaud)
Main contributors:: Bauzas (Sylvain Bauza)

Feature Liaison

Feature liaison:: N/A

Work Items

Parse live_migratable from [pci]dev_spec config.
Add HW_PCI_LIVE_MIGRATABLE trait.
Check source instance for appropriate live migratable devices.
Add a new filter in filter_pools to manage live migratable devices.
Update LiveMigrateData to include PCI device information.
Update get_updated_guest_xml() function to include PCI device information.

Dependencies

Support for devices using the variant driver interface (see the corresponding specification).
Performance impact bug.

Testing

Unit tests and functional tests.
Tempest and/or whitebox tests cannot be executed in CI due to hardware limitations. They can, however, be developed in parallel with this implementation and deferred for later inclusion in CI.

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Epoxy	Introduced

OpenAPI Schemas

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/openapi-2

Note

This is a continuation of a spec that was previously approved in Dalmatian (2024.2). We merged all of the groundwork for this in Dalmatian but did not get the response bodies schemas merged.

Problem description

Use Cases

As an end user, I would like to have access to machine-readable, fully validated documentation for the APIs I will be interacting with.

As an end user, I want statically viewable documentation hosted as part of the existing docs site without requiring a running instance of Nova.

As an SDK/client developer, I would like to be able to auto-generate bindings and clients, promoting consistency and minimising the amount of manual work needed to develop and maintain these.

As a Nova developer, I would like to have a verified API specification that I can use should I need to replace the web framework/libraries we use in the event they are no longer maintained.

Proposed change

This effort can be broken into a number of distinct steps:

Add a new decorator for removed APIs and actions

We have a number of APIs and actions that no longer have backing code and return HTTP 410 (Gone) or HTTP 400 (Bad Request), respectively. We will not add schemas for these in the initial attempt at this so we need some mechanism to indicate this. We will add a new removed decorator that will highlight these removed APIs and indicate the version they were removed in and the reason for their removal. We can later use this as a heuristic in our tests to skip schema checks for these methods.
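A minimal sketch of such a decorator is shown below. The attribute names, the Gone exception, the sample controller body, and the version and reason strings are all illustrative assumptions rather than the merged implementation.

```python
import functools


class Gone(Exception):
    """Stand-in for an HTTP 410 (Gone) error; the name is illustrative."""


def removed(version, reason):
    """Mark an API method as removed in ``version`` for ``reason``."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Metadata that tests can inspect to skip schema checks.
        wrapper.removed = True
        wrapper.removed_version = version
        wrapper.removed_reason = reason
        return wrapper
    return decorator


class AgentController:
    # Version and reason below are placeholder values for this sketch.
    @removed("21.0.0", "hypothetical reason text")
    def index(self, req):
        raise Gone()


def needs_schema(method):
    # Heuristic for tests: removed methods are exempt from schema checks.
    return not getattr(method, "removed", False)
```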

Note

This was completed in Dalmatian (2024.2).

Add missing request body and query string schemas

There is already good coverage of both request bodies and query string parameters but it is not complete. A list of incomplete schemas is given at the end of this section. The additional schemas will merely validate what is already allowed, which will mean extensive use of "additionalProperties": true or empty schemas. Put another way, an API that currently ignores unexpected request body fields or query string parameters will continue to ignore them. We may wish to make these stricter, as we did for most APIs in microversion 2.75, but that is a separate issue that should be addressed separately.

Once these specs are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate schemas.

Note

This was completed in Dalmatian (2024.2).

Add response body schemas

These will be sourced from existing OpenAPI schemas, currently published at github.com/gtema/openstack-openapi, from Tempest’s API schemas, and where necessary from new schemas auto-generated from JSON response bodies generated in tests and manually modified to handle things like enum values.

Once these are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate response body schemas. In addition, we will add a new configuration option that will control how we do verification at the API layer, [api] response_validation. This will be an enum value with three options:

error
Raise an HTTP 500 (Server Error) in the event that an API returns an “invalid” response.

This will be the default in CI i.e. for our unit, functional and integration tests. This should not be used in production. The help text of the option will indicate this and we will set the advanced option.

warn
Log a warning about an “invalid” response, prompting operators to file a bug report against Nova.

This will be the initial (and likely permanent) default in production.

ignore
Disable API response body validation entirely. This is an escape hatch in case we mess up.
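The three modes could be handled roughly as in the sketch below. The function name, the InvalidResponse exception, and the is_valid callable (a placeholder for real JSON Schema validation) are assumptions made for illustration.

```python
import logging

LOG = logging.getLogger(__name__)


class InvalidResponse(Exception):
    """Stand-in for what would become an HTTP 500; the name is illustrative."""


def check_response(body, is_valid, mode="warn"):
    """Apply the [api] response_validation policy to a response body.

    is_valid: callable returning True when ``body`` matches its schema.
    mode: one of 'error', 'warn', 'ignore'.
    """
    if mode == "ignore":
        return body  # validation disabled entirely
    if mode not in ("error", "warn"):
        raise ValueError("unknown response_validation mode: %s" % mode)
    if is_valid(body):
        return body
    if mode == "error":
        raise InvalidResponse("response does not match schema")
    LOG.warning("Response does not match schema; please report a bug.")
    return body
```

In 'warn' mode the response is still returned to the caller, so a schema mistake degrades to a log message rather than a failed request.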

Alternatives

Use a different tool

We could use a different tool than OpenAPI to publish our specs. In a manner of speaking we already do this - albeit not in a machine-readable manner - through our use of os-api-ref.

This idea has been rejected because OpenAPI is clearly the best tool for the job. It is the most widely used API description language available today and aligns well with our existing use of JSON Schema for API validation. While it does not support OpenStack’s microversion API design pattern out-of-the-box, previous experiments have demonstrated that it is extensible enough to add this.

Maintain these specs out-of-tree

We could use a separate repo to store and maintain specs for Nova and the other OpenStack services.

This idea has been rejected because it prevents us testing the specs on each commit to Nova and means work that could be spread across multiple teams is instead focused on one small team. It will result in more bugs and a lag between changes to the Nova API and changes to the out-of-tree specs. It will result in duplication of effort across Nova, Tempest, and the specs projects.

Publish the spec via an API resource rather than in our docs

We could publish the spec via a new, unversioned API endpoint such as /spec. A GET request to this would return the full spec, either statically generated at deployment time or dynamically generated (and then cached) at runtime.

This is rejected because it brings limited advantages and multiple disadvantages. Nova’s API is designed to be backwards-compatible and non-extensible. As such, a user with the latest version of the spec should be able to use it to communicate with any OpenStack deployment running a version of Nova that supports microversions. It is also expected that the “master” version of the spec will continuously improve as things are tightened up, documentation is improved, and bugs or mistakes are corrected. We want consumers of the spec to see these changes immediately rather than wait for their deployment to be updated. Finally, OpenStack’s previous forays into discoverable APIs, such as Keystone’s use of JSONHome or Glance’s attempts to publish resource schemas, have seen limited take-up outside of the projects themselves. Taken together, this all suggests there is no reason or advantage to publishing deployment-specific specs and users would be better served by fetching the latest version of the spec from the api-ref documentation published on docs.openstack.org (which, one should note, is itself intentionally unversioned).

Data model impact

None.

REST API impact

We may wish to address issues that are uncovered as we add schemas, but this work is considered secondary to this effort and can be tackled separately.

Security impact

None.

Notifications impact

None.

Other end user impact

Performance Impact

Other deployer impact

Developer impact

Developers working on the API microversions will now be encouraged to provide JSON Schema schemas for both requests and responses.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: stephenfinucane
Other contributors:: gtema

Feature Liaison

None.

Work Items

Add missing request body schemas
Add tests to validate existence of request body schemas
Add missing query string schemas
Add tests to validate existence of query string schemas
Add response body schemas
Add decorator to validate response body schemas against response
Add tests to validate existence of response body schemas

Dependencies

Testing

Unit tests will ensure that schemas eventually exist for request bodies, query strings, and response bodies.

Unit, functional and integration tests will all work together to ensure that response body schemas match real responses by setting [api] response_validation to error.

Documentation Impact

References

APIs missing schemas

These are the APIs that are currently (as of 2024-04-11, commit 1bca24aeb) missing API request body schemas and query string schemas.

Missing request body schemas

AdminActionsController._inject_network_info
AdminActionsController._reset_network
AgentController.create
AgentController.update
BareMetalNodeController._add_interface
BareMetalNodeController._remove_interface
BareMetalNodeController.create
CellsController.create
CellsController.sync_instances
CellsController.update
CertificatesController.create
CloudpipeController.create
CloudpipeController.update
ConsolesController.create
DeferredDeleteController._force_delete
DeferredDeleteController._restore
FixedIPController.reserve
FixedIPController.unreserve
FloatingIPBulkController.create
FloatingIPBulkController.update
FloatingIPController.create
FloatingIPDNSDomainController.update
FloatingIPDNSEntryController.update
LockServerController._unlock
NetworkAssociateActionController._associate_host
NetworkAssociateActionController._disassociate_host_only
NetworkAssociateActionController._disassociate_project_only
NetworkController._disassociate_host_and_project
NetworkController.add
NetworkController.create
PauseServerController._pause
PauseServerController._unpause
RemoteConsolesController.get_rdp_console
RescueController._unrescue
SecurityGroupActionController._addSecurityGroup
SecurityGroupActionController._removeSecurityGroup
SecurityGroupController.create
SecurityGroupController.update
SecurityGroupDefaultRulesController.create
SecurityGroupRulesController.create
ServersController._action_confirm_resize
ServersController._action_revert_resize
ServersController._start_server
ServersController._stop_server
ShelveController._shelve
ShelveController._shelve_offload
SuspendServerController._resume
SuspendServerController._suspend
TenantNetworkController.create

Missing request query string schemas

AgentController.index
AggregateController.index
AggregateController.show
AvailabilityZoneController.detail
AvailabilityZoneController.index
BareMetalNodeController.index
BareMetalNodeController.show
CellsController.capacities
CellsController.detail
CellsController.index
CellsController.info
CellsController.show
CertificatesController.show
CloudpipeController.index
ConsoleAuthTokensController.show
ConsolesController.index
ConsolesController.show
ExtensionInfoController.index
ExtensionInfoController.show
FixedIPController.show
FlavorAccessController.index
FlavorExtraSpecsController.index
FlavorExtraSpecsController.show
FlavorsController.show
FloatingIPBulkController.index
FloatingIPBulkController.show
FloatingIPController.index
FloatingIPController.show
FloatingIPDNSDomainController.index
FloatingIPDNSEntryController.show
FloatingIPPoolsController.index
FpingController.index
FpingController.show
HostController.reboot
HostController.show
HostController.shutdown
HostController.startup
HypervisorsController.detail
HypervisorsController.index
HypervisorsController.search
HypervisorsController.servers
HypervisorsController.show
HypervisorsController.statistics
HypervisorsController.uptime
IPsController.index
IPsController.show
ImageMetadataController.index
ImageMetadataController.show
ImagesController.detail
ImagesController.index
ImagesController.show
InstanceActionsController.index
InstanceActionsController.show
InstanceUsageAuditLogController.index
InstanceUsageAuditLogController.show
InterfaceAttachmentController.index
InterfaceAttachmentController.show
NetworkController.index
NetworkController.show
QuotaClassSetsController.show
QuotaSetsController.defaults
QuotaSetsController.detail
QuotaSetsController.show
SecurityGroupController.show
SecurityGroupDefaultRulesController.index
SecurityGroupDefaultRulesController.show
ServerDiagnosticsController.index
ServerGroupController.show
ServerMetadataController.index
ServerMetadataController.show
ServerMigrationsController.index
ServerMigrationsController.show
ServerPasswordController.index
ServerSecurityGroupController.index
ServerTagsController.index
ServerTagsController.show
ServerTopologyController.index
ServerVirtualInterfaceController.index
ServersController.show
SnapshotController.show
TenantNetworkController.index
TenantNetworkController.show
VersionsController.show
VolumeAttachmentController.show
VolumeController.show

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced. Missing query schema and request body schemas added.
2025.1 Epoxy	Re-proposed to finish response body schemas.

Show Scheduler Hints in Server Details

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/show-scheduler-hints-in-server-details

Nova currently lacks a straightforward way to expose scheduler hints associated with a server. This proposal suggests extending Nova’s existing API to allow users to retrieve this information when it is available.

Problem description

Scheduler hints can be specified at server creation time and can influence placement decisions based on the user-provided configuration. These hints are stored in Nova’s database and can later be considered by the scheduler during a server migration. Without this information beforehand, an API user can choose an invalid destination host for a migration request and have difficulty understanding the real cause of the failure.

Use Cases

As an operator, I want to retrieve more details about a server creation request, which includes the associated scheduler_hints.
As a cloud admin, I want to check more information associated with all running servers, including their scheduler hints, in order to build a migration plan for a host.
An optimization service like Watcher [1] would benefit from additional placement constraints, like scheduler hints, from all instances of a host in order to build a more concrete action plan to optimize the workload balance across the cluster. Without this information, Watcher could propose a solution that contains many server migration actions that violate some constraints. For example, Watcher would not take into account that a host is an invalid destination for a server that was created with a different_host scheduler hint.

Proposed change

Code changes

Extend the API responses for GET /servers/{server_id} and GET /servers/detail to include information about the scheduler hints.

Add a new entry in the API response with the key scheduler_hints, containing all persisted scheduler hints associated with the corresponding server. The value format will follow the same JSON schema defined for the server creation request [2]. If a server has no information about scheduler hints, the value will be set to {}.
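The behavior described above, including the {} default, might be sketched as follows. The function name and the shape of the request_spec argument are assumptions for illustration and do not mirror Nova's view builder.

```python
def add_scheduler_hints(server_view, request_spec):
    """Attach persisted scheduler hints to a server view dict.

    request_spec: dict standing in for the persisted request data, or None
    when nothing was stored; the 'scheduler_hints' key is an assumption.
    """
    hints = (request_spec or {}).get("scheduler_hints") or {}
    # Copy so later changes to the view do not touch the stored data.
    server_view["server"]["scheduler_hints"] = dict(hints)
    return server_view
```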

Both openstack client and openstack sdk will be updated to support the new API and display the new field added.

Alternatives

User driven instance metadata

Users could additionally store scheduler hints information in instance metadata. This would allow them to query this information later when needed. The drawbacks are that it duplicates this information in the Nova database and requires an additional manual step on the user’s side.

Data model impact

None.

REST API impact

The following change will be introduced in a new API microversion:

GET /servers/{server_id}

Show Server Details

Return Code(s): 400, 401, 403 (no changes)

Proposed JSON response addition:

{
    "server": {
        ...
        "scheduler_hints": {
            "group": "af16eb84-88fe-4cc4-b558-1752cbe8cb15",
            "same_host": "6605bff6-86b9-4824-b35b-a6b3c4c0e717"
        },
        ...
    }
}

GET /servers/detail

List Servers Detailed

Return Code(s): 400, 401, 403 (no changes)

Proposed JSON response addition:

{
    "servers": [
        {
            ...
            "scheduler_hints": {
                "group": "dc0ca1ef-7e0b-4cb5-89aa-b2069f8b8a8a",
                "different_host": "6dffb036-d020-4630-b467-334400a050ca"
            },
            ...
        }
    ]
}

The default policy of the new field will be project_reader_or_admin to match with the existing /servers/detail policy.

Security impact

None.

Notifications impact

None.

Other end user impact

A new field with scheduler hints information will be added in the output of the commands Show Server Details and List Servers Detailed, in both openstack client and openstack sdk.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: dviroel

Feature Liaison

None.

Work Items

Add a new field to the server details response in a new microversion, and populate it with the persisted scheduler hints.
Extend existing unit and functional tests, including API sample tests.
Extend existing scheduler_hints and show server details tempest tests to validate that the new microversion contains scheduler_hints information.
Update API documentation, including API samples in API Reference.
Update openstack client and openstack sdk to support the new microversion and to show the new field.

Dependencies

None.

Testing

Existing unit, functional, API sample and tempest tests can be extended to validate that the new microversion contains scheduler_hints information. If needed, new tests can be added to properly cover other scenarios.

Documentation Impact

API Reference
REST API Version History
openstack client and openstack sdk documentation

References

Previous spec proposal for this blueprint:
https://review.opendev.org/c/openstack/nova-specs/+/440580

History

Revisions
Release Name	Description
2025.1 Epoxy	Introduced

Config option to control behavior of unset unified limits

Tue, 11 Mar 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/unified-limits-nova-unset-limits

The default behavior in the oslo.limit quota enforcement library used by Nova when [quota]driver is set to nova.quota.UnifiedLimitsDriver is to consider resources that do not have registered limits set as having a limit of zero. This behavior can be unforgiving especially in the scenario of an upgrade that enables unified limits quota (i.e. if we ever want to make unified limits the default). If we make the behavior configurable within Nova, we can help prevent situations where an admin/operator upgrades or installs Nova and suddenly all API requests begin to be rejected for being over quota.

Problem description

The problem is centered around the behavior of the oslo.limit quota enforcement library when a given resource does not have a registered limit set for it. If no registered limit is found for a resource, the enforce function will consider that resource to have a limit of 0 and all requests for the resource will fail for being over quota.

We want to be able to change the default quota driver to the UnifiedLimitsDriver, but the aforementioned behavior raises concerns about changing the default.

If we were to make unified limits quotas the default in Nova, any admin/operator who has missed auditing all of their resources and limits in Keystone before upgrading could experience complete denial of service by the Nova API immediately after the upgrade. This could happen if even one resource is missing a registered limit set in Keystone.

While ideally an admin/operator will not miss setting any registered limits in an upgrade scenario like this, the penalty for missing even one resource limit is quite harsh as the API rejects all requests for that resource leading to an immediate emergency situation.

Use Cases

As an admin/operator, I would like to be able to control which resources are required to have a limit set. I would also like to be able to control which resources do not need a limit set, by leaving them out of the required resources list.
As an admin/operator, I would like to be able to see a DEBUG log message if I have missed setting a registered limit for a resource in Keystone, rather than to have all API requests involving that resource be rejected for being over quota.

Proposed change

The proposal in this spec is to add new configuration option(s) to the [quota] group which would enable operators to:

Require limit enforcement for only specific resources, or
Require limit enforcement for all resources except specific resources

The goal with the options is to make management of unset unified limits easy and maintainable over time.
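The two modes could resolve an unset limit roughly as in the sketch below. The function, its parameters, and the UNLIMITED sentinel are hypothetical and stand in for whatever option names are finally chosen; they are not part of oslo.limit.

```python
UNLIMITED = -1  # sentinel used only in this sketch


def effective_limit(resource, registered_limits, required=None, ignored=None):
    """Return the limit to enforce for ``resource``.

    registered_limits: {resource_name: limit} as set in Keystone.
    required: resources that must have a registered limit, or None.
    ignored: resources exempt from that requirement, or None.
    Exactly one of ``required`` / ``ignored`` is expected to be given.
    """
    if resource in registered_limits:
        return registered_limits[resource]
    if required is not None:
        must_have = resource in required
    else:
        must_have = resource not in (ignored or ())
    # Unset and required keeps today's behavior (a limit of zero);
    # unset and not required is treated as unlimited.
    return 0 if must_have else UNLIMITED
```

For example, with only servers required, a missing limit for class:VCPU would no longer cause requests to be rejected, while a missing limit for servers still would.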

Alternatives

Alternatively we could make a change to the oslo.limit library to handle missing registered limits differently than it does today [1]. This would be more difficult because oslo.limit 1) has established and thus expected default behavior and 2) providing new behavior that fits all OpenStack projects may not be realistic.

A previously proposed alternative would be a boolean config option [quota]strict_unified_limits which has only two modes: consider unset limits as zero or consider unset limits as unlimited [2]. Discussion at the last PTG raised concerns that a boolean option is likely too generic and wouldn’t provide the level of control most operators would need.

Data model impact

None

REST API impact

None

Security impact

If a resource is not configured to require limit enforcement, that resource would be considered to have unlimited quota and malicious callers could attempt to exhaust that resource intentionally.

Notifications impact

None

Other end user impact

None

Performance Impact

The performance impact of using new config options to handle unset limits should be relatively small as it will add one extra Keystone API call each time 1) a quota check fails and 2) the limit for the associated resource is returned as 0 by oslo.limit.

Other deployer impact

Admin/operators will need to consider if and when they will need to adjust configuration values if new Placement resource classes are added to their deployment in the future.

Also as part of this work, the nova-manage limits migrate_to_unified_limits CLI command will be enhanced to scan the database for resources in flavors that do not have registered limits set and show them in the output. The intent is to help admins/operators catch all resources and set limits for them before unified limits quotas are enabled.
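The scan amounts to a set difference between resources used by flavors and resources with registered limits. The sketch below assumes a simplified flavor representation with a 'resources' mapping; it is not the nova-manage implementation.

```python
def resources_missing_limits(flavors, registered_limits):
    """Return resource names used by flavors that lack a registered limit.

    flavors: iterable of dicts with a hypothetical 'resources' mapping,
             e.g. {'class:VCPU': 2}.
    registered_limits: {resource_name: limit} as set in Keystone.
    """
    used = set()
    for flavor in flavors:
        # Only count resources that the flavor actually consumes.
        used.update(name for name, amount in flavor["resources"].items()
                    if amount)
    return sorted(used - set(registered_limits))
```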

Developer impact

None

Upgrade impact

There should not be upgrade impact with the new configuration options.

For a deployer not running with [quota]driver = nova.quota.UnifiedLimitsDriver, the config options have no effect.

For a deployer already running with [quota]driver = nova.quota.UnifiedLimitsDriver, they will have had to set registered limits for all resources allocated by their cloud (because the current behavior is to default all limits to zero) and should not experience any change in quota enforcement for those resources.

After upgrading however, any _new_ resource the deployer adds to the cloud will either default to unlimited quota or default to zero quota until the deployer sets a registered limit for it in Keystone, depending on how the deployer has configured the new options. If the deployer needs to update config option values, they need to update them for the nova-api and nova-conductor services. Quota “rechecks” are performed by the nova-conductor service if [quota]recheck_quota = True (the default).

For a deployer switching to the [quota]driver = nova.quota.UnifiedLimitsDriver during the upgrade, the default behavior will only require limits for the default resources in the config options (currently proposed as servers).

It is recommended for these deployers to first use the nova-manage limits migrate_to_unified_limits tool to have it read their legacy quota limits from the Nova database and [quota] config options and set them in Keystone automatically. The output of the command will also show what resources, if any, are found to be used in the deployment but do not have registered limits set in Keystone. Deployers can use this information to know what resources they need to set limits for in Keystone.

Then, deployers should add or remove resources from the list based on the resources they want to require to enforce quota. All other resources will be considered to have unlimited quota until the deployer sets registered limits for them in Keystone.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: None

Feature Liaison

Feature liaison:: melwitt

Work Items

Add configuration options to control which resources to require a registered limit set in Keystone
Augment the nova-manage limits migrate_to_unified_limits command to scan database flavors to detect resources that do not have registered limits set and show them in the output to the user to let them know which limits they need to set

Dependencies

Related to https://specs.openstack.org/openstack/nova-specs/specs/yoga/implemented/unified-limits-nova.html

Testing

The functionality of the new config options will be tested by writing new functional tests. Adding testing to the post test hook for the nova-next CI job is also a possibility.

Documentation Impact

The unified limits documentation will be updated to include information about the new config options.

References

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced
2025.1 Epoxy	Re-proposed with changes

Search flavors by name

Thu, 30 Jan 2025 00:00:00

https://blueprints.launchpad.net/nova/+spec/flavor-search-by-name

Allow users to search for flavors by name server-side.

Problem description

Use Cases

As a developer of client tooling, I would like to do as much filtering server-side as possible, in order to improve performance and reduce unnecessary network traffic.

Proposed change

>>> import openstack
>>> conn = openstack.connect('devstack')
>>> conn.compute.get('/flavors')
>>>
>>> [f['name'] for f in conn.compute.get(r'/flavors').json()['flavors']]
['m1.small', 'ci.m1.small', 'm1.medium', 'ci.m1.medium', 'm2.small', 'ds512M', 'ds1G']
>>>
>>> [f['name'] for f in conn.compute.get(r'/flavors?name=m1').json()['flavors']]
['m1.small', 'ci.m1.small', 'm1.medium', 'ci.m1.medium']
>>>
>>> [f['name'] for f in conn.compute.get(r'/flavors?name=^m1').json()['flavors']]
['m1.small', 'm1.medium']

This will be implemented by reusing the logic currently used for instances in the _regex_instance_filter, seen here.
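The matching semantics shown in the session above can be reproduced with a plain regular expression search. This is only an in-memory illustration using the flavor names from the example; the function name is hypothetical and the actual filtering would happen server-side.

```python
import re

FLAVOR_NAMES = ['m1.small', 'ci.m1.small', 'm1.medium', 'ci.m1.medium',
                'm2.small', 'ds512M', 'ds1G']


def filter_by_name(names, pattern):
    """Keep names in which ``pattern`` matches anywhere (regex search)."""
    regex = re.compile(pattern)
    return [n for n in names if regex.search(n)]
```

Because the pattern is searched rather than matched from the start, name=m1 also returns ci.m1.small, while anchoring with ^m1 restricts the result to names that begin with m1.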

While we are introducing a new microversion, we will also take the opportunity to address some other tech debt with the schema:

We will set additionalProperties to False for the flavor show (GET /flavors/{flavor_id}) API
We will remove the rxtx_factor field from the flavor create (POST /flavors), flavor list with details (GET /flavors/detail) and flavor show (GET /flavors/{flavor_id}) APIs. We will also remove rxtx_factor from the list of valid sort keys for the flavor list (GET /flavors) and flavor list with details (GET /flavors/detail) APIs. This field was only supported by the long since removed XenAPI driver and is a no-op in modern Nova.
We will remove the OS-FLV-DISABLED:disabled field from the flavor list with details (GET /flavors/detail) and flavor show (GET /flavors/{flavor_id}) APIs. There has never been a way to set this field, making it a no-op.

Finally, we will build on one of the above items and address some tech debt with other schemas:

We will set additionalProperties to False for all query string schemas.
We will restrict all action bodies to null values except those where a value is actually expected.
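The two tightened schema shapes might look like the simplified sketch below. The schema variable names and the small unexpected_keys helper are illustrative only; real validation would go through JSON Schema, and Nova's actual query string schemas are more involved.

```python
# Query string schema: unknown parameters are rejected.
flavor_list_query = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "additionalProperties": False,
}

# Action body schema: the action key must be present and its value null.
pause_action = {
    "type": "object",
    "properties": {"pause": {"type": "null"}},
    "required": ["pause"],
    "additionalProperties": False,
}


def unexpected_keys(schema, data):
    """Return keys a strict schema would reject (tiny stand-in validator)."""
    if schema.get("additionalProperties", True):
        return []  # permissive schema: extra keys are ignored
    return sorted(set(data) - set(schema["properties"]))
```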

Alternatives

We currently have to perform this filtering client-side, which is less performant. We could continue to do so.

Data model impact

REST API impact

The GET /flavors API will be modified to add support for a new name query string filter parameter in requests.
The POST /flavors API will be modified to remove support for the rxtx_factor parameter in requests.
All flavors APIs will be modified to remove the rxtx_factor and OS-FLV-DISABLED:disabled fields from responses.
All APIs that currently accept an unrestricted set of query string parameters will be modified to restrict these.
All action APIs that currently accept an unrestricted value in request bodies will be modified to only accept null.

Security impact

None.

Notifications impact

None.

Other end user impact

openstackclient and third-party clients can take advantage of this when filtering flavors.

Performance Impact

None. Clients will be faster since they can take advantage of server-side filtering, but there should be no impact on the server itself since the field is indexed.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: stephen.finucane
Other contributors:: None

Feature Liaison

Feature liaison:: stephen.finucane

Work Items

Extend API and rework schemas as described above

Dependencies

None.

Testing

We will provide new unit and functional tests, including API sample tests.

We will extend the Compute API schemas used in Tempest to reflect these changes.

Documentation Impact

Update API ref.

References

None.

Support for tracking traits removed from provider.yaml

Thu, 12 Dec 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/copy-applied-provider-yaml

This specification proposes a feature to ensure that traits removed from the provider.yaml are also properly deleted from the resource provider.

Problem description

Nova-compute has a feature to register custom traits with the resource provider using config files (provider.yaml). https://docs.openstack.org/nova/latest/admin/managing-resource-providers.html

Use Cases

As a cloud operator, I would like to ensure that only one trait is registered with the resource provider for custom traits of the same type.
As a cloud operator, I would like to complete the registration of custom traits in the config file of nova-compute without additional implementation (calling the Placement API using API/CLI in another system).

Proposed change

We propose adding a process for nova-compute to copy the contents of the provider.yaml file to /var/lib/nova/applied_provider.yaml after they have been applied to Placement.

For now, the diff is limited to traits, but later this logic can be extended to allow the use of the diff for any part of the provider.yaml.
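The trait diff could be sketched roughly as follows. This is only an illustration: it assumes both files have already been parsed into dictionaries, that each provider entry has an identification section with a uuid or name, and that custom traits live under traits.additional. The removed_traits function is a hypothetical name, not part of Nova.

```python
def removed_traits(applied, current):
    """Map each provider to traits in the applied config but not the current one."""
    def traits_by_provider(config):
        result = {}
        for provider in config.get('providers', []):
            ident = provider['identification']
            key = ident.get('uuid') or ident.get('name')
            extra = provider.get('traits', {}).get('additional', [])
            result[key] = set(extra)
        return result

    before = traits_by_provider(applied)
    after = traits_by_provider(current)
    diff = {}
    for key, traits in before.items():
        gone = traits - after.get(key, set())
        if gone:
            diff[key] = sorted(gone)
    return diff
```

The returned mapping lists, per provider, the traits that would need to be deleted from the resource provider because they no longer appear in provider.yaml.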

Alternatives

Register only the custom traits defined in the file with the resource provider, treating provider.yaml as declarative data. However, this is a destructive change and there are concerns about the impact on the existing environment.
Add a definition like declarative_prefix to provider.yaml to handle only traits with a declarative_prefix declaratively. In this case, the extensibility to non-trait elements in provider.yaml is limited, and both the definition in provider.yaml and the code of the resource tracker become complex.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

No performance impact on nova is anticipated. If there are frequent updates to custom traits, requests for deleting and creating traits will be frequently sent to the Placement API.

Other deployer impact

None

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: mkuroha
Other contributors:: None

Feature Liaison

Feature liaison:: Liaison Needed

Work Items

Implement the copying of provider.yaml and extraction of trait diffs with applied_provider.yaml in the _merge_provider_configs method.

Dependencies

None

Testing

Add unit/functional tests

Documentation Impact

Update the existing Managing Resource Providers Using Config Files guide to explain the behavior with applied_provider.yaml.

References

None

History

Revisions
Release Name	Description
2025.2 Flamingo	Introduced

vTPM live migration

Thu, 21 Nov 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/vtpm-live-migration

Libvirt support for vTPM live migration now exists (more details in Problem description), but Nova changes are necessary before being able to remove the API block. This spec describes those changes.

Problem description

vTPM state storage

vTPM state storage is not the same as instance state storage. The latter can be configured to be shared, for example on NFS. The former is always non-shared. Libvirt can be told where to store the vTPM state via the source XML element, which Nova does not support. Nova deployments use the Libvirt default vTPM state path. On both Ubuntu and Red Hat operating systems, this path is /var/lib/libvirt/swtpm/<instance UUID>. This path is distinct from the instance state path and can be expected to never be on shared storage.

Thus, this spec requires vTPM state storage to be not shared, and declares live migration with shared vTPM state storage to be untested. This will be documented.

Libvirt support

Therefore, this spec requires Libvirt 7.1.0.

Secret management

Compute host reboot

For the exact same reasons (lack of Barbican secret access and inability to read the Libvirt secret back from Libvirt), Nova cannot start back up vTPM instances after a compute host reboot.

Use Cases

As a cloud operator, I want to be able to live migrate instances with vTPM devices, in particular Windows instances.

As a cloud operator, I want vTPM instances on a compute host to start back up again after a host reboot.

Proposed change

Three possible security levels are proposed. They are presented in the table below.

`vtpm_secret_security` values
Value	Mechanism	Security implications	Instance mobility
`user`	Only the instance owner has access to the Barbican secret. This is existing behavior.	This is the most secure option, as even the Nova service user and root on the compute host cannot read the secret.	The instance is immovable and cannot be restarted by Nova in the event of a compute host crash or reboot.
`host`	The Libvirt secret is persistent and retrievable.	This is “medium” security. API-level admins and the Nova service user do not have access to the secret, but it can be accessed by users with sufficient privileges on the compute host.	The instance can be live migrated because Nova can read the secret back from Libvirt on the source host and send it to the destination over RPC. Security over the wire is left as the operator’s responsibility, but TLS or similar is assumed. The instance can also be restarted by Nova in the event of a compute host crash or reboot for the exact same reason.
`deployment`	The Nova service user owns the Barbican secret.	This is the least secure but most flexible option.	The instance can be live migrated because Nova can download the secret from Barbican and define it in Libvirt on the destination host. The instance can also be restarted by Nova in the event of a compute host crash or reboot for the exact same reason.

Users are able to choose what level they require on their instance by setting the new hw_vtpm_secret_security image property. If this property is not set, a default can be obtained from the new hw:vtpm_secret_security flavor extra spec. For operators that do not want to deal with flavor explosion as a consequence of this new extra spec, a new host configuration option is added as a fallback. This option is called [compute]vtpm_secret_security and has a default value of host. An instance with no image property or flavor extra spec will have its host’s vtpm_secret_security policy persisted in its system_metadata upon booting on that host.

Operators are able to specify what level they support by using the new [compute]supported_vtpm_secret_security config option. This is a per compute host list option that can take the value of one or more of the security levels from the previous table. Its default value is all three levels. These values are exposed as driver capability traits. The hw_vtpm_secret_security image property and flavor extra spec are translated to required traits to match the driver capabilities.

The behavior of an instance during live migration is defined by its persisted hw_vtpm_secret_security (either explicitly set by the user, or added by default by Nova from the host’s config option). Instances with user cannot be live migrated. For instances with host, the source compute host reads the secret from Libvirt and sends it over RPC to the destination. For instances with deployment, the destination host downloads the secret from Barbican and defines it in Libvirt. Because the instance’s hw_vtpm_secret_security value translates to a required trait, it’s guaranteed that the destination host chosen for live migration supports whatever behavior the instance requires.
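The precedence described above (image property, then flavor extra spec, then host configuration) and its effect on live migration can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be plain dictionaries rather than Nova objects; this is an illustration of the decision logic, not the proposed implementation.

```python
VALID_LEVELS = ('user', 'host', 'deployment')


def effective_secret_security(image_props, flavor_extra_specs,
                              host_default='host'):
    """Resolve the level: image property, then extra spec, then host default."""
    level = (image_props.get('hw_vtpm_secret_security')
             or flavor_extra_specs.get('hw:vtpm_secret_security')
             or host_default)
    if level not in VALID_LEVELS:
        raise ValueError('unsupported vTPM secret security level: %s' % level)
    return level


def can_live_migrate(level):
    """Only host and deployment instances can be live migrated."""
    return level in ('host', 'deployment')
```

For example, an instance whose image sets user keeps that level even if its flavor requests deployment, and so remains immovable.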

Alternatives

Data model impact

The ImageMetaProps Nova object is updated to support the new hw_vtpm_secret_security image property. The database schema is unaffected.

REST API impact

No new microversion. The flavor extra spec validation code is updated to allow hw:vtpm_secret_security.

Security impact

The main security consequences of this spec are the implications of the host and deployment values of vtpm_secret_security.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

None.

Developer impact

None.

Upgrade impact

A compute service version bump is necessary. When nova-compute starts up with the new service version, it checks all instances currently on the host. Any instances created after the service version bump have a value for hw_vtpm_secret_security set in their system_metadata, either explicitly by the user or implicitly by Nova as a fallback default, as described in the Proposed change section. Any instances without this set are pre-existing instances, and need to be upgraded. They are upgraded to the value of the [compute]default_vtpm_secret_security option. Just persisting this in their system_metadata is not enough - their owner also needs to perform an operation with their token on the instance so that Nova can either convert the Libvirt secret to non-private and persistent in the case of host, or create a new Barbican secret with the same contents, but owned by the Nova service user, in the case of deployment. Operators have no choice but to communicate this to their users, at which point users have a choice to either opt in to the new security level, or refuse by not touching their instances or deleting them outright. In order to see what secret security level has been set on their instances by the operators, this spec depends on the Image props in server show spec, which will allow users to see the embedded image properties set on their instance, and determine the vTPM secret security level that way.

Implementation

Assignee(s)

Primary assignee:: notartom

Feature Liaison

Feature liaison:: melwitt, dansmith

Work Items

Introduce the hw_vtpm_secret_security, hw:vtpm_secret_security, [compute]vtpm_secret_security, and [compute]default_vtpm_secret_security image properties, flavor extra specs, and config options.
Modify the pre live migration and rollback code to handle secret definition and cleanup.
Bump the service version.
Modify the existing API block to only allow live migration of host or deployment instances once the minimum service version has reached the bumped version.
Add a whitebox/integration test.
Update the documentation.

Dependencies

Libvirt version 7.1.0. This can be enforced dynamically in code.

Testing

Nova’s functional tests are extended to test the Nova logic using the Libvirt fixture. This is particularly useful for cases that cannot be easily tested in a real environment, like rollback.

The existing whitebox-tempest-plugin vTPM tests are extended to test live migration in a real environment with an actual Libvirt.

Documentation Impact

Nova’s vTPM documentation is updated to remove the live migration limitation and explain the usage of the vtpm_secret_security configuration option, as well as the implications of all possible values. The expectation that vTPM state storage is not shared and that shared vTPM state storage live migration is untested is made explicit.

References

Empty.

History

Revisions
Release Name	Description
2025.1 Epoxy	Introduced

Support Cinder Volume Multi-attach

Fri, 08 Nov 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/multi-attach-volume

Currently, Nova only allows a volume to be attached to a single instance. There are times when a user may want to be able to attach the same volume to multiple instances.

Problem description

Currently Nova is not prepared to attach a single Cinder volume to multiple VM instances even if the volume itself allows that operation. This document describes the required changes in Nova to introduce this new functionality and also lists the limitations it has.

Use Cases

Allow users to share volumes between multiple guests using read-write attachments like clustered applications with two nodes where one is active and one is passive. Both require access to the same volume although only one accesses actively. When the active one goes down, the passive one can take over quickly and has access to the data.

The above example works with an active/active scenario as well; it’s the user’s responsibility to choose the right filesystem.

Proposed change

The new ‘multi-attach’ functionality will be enabled by using the new Cinder attach/detach API which is available from the API microversion 3.44 [1].

Cinder will only allow a volume to be attached more than once if its ‘multiattach’ flag is set on the volume at create time. Nova is expected to rely on Cinder to do the check on the volume state when it’s reserving the volume on the API level by calling attachment_create.

There are problems today when multiple volume attachments share a single target to the volume backend [2]. If we do not take care, multi-attach would make these problems much worse. The simplest fix is to serialize all attach and detach operations involving a shared target. To do this, Cinder will expose a volume info property of ‘shared_targets’. When it is True, a lock will be placed around all attachment_update and attachment_delete calls, and the associated calls to os-brick:

# The lock uses the volume.backend_uuid value.
# Attach: update the attachment and connect while holding the lock.
with optional_host_local_lock(acquire=volume.shared_targets):
  connector = os_brick.get_connector()
  conn_info = attachment.update(connector).conn_info
  os_brick.connect_volume(conn_info)
  attachment.attach_complete()

# Detach: disconnect and delete the attachment while holding the lock.
with optional_host_local_lock(acquire=volume.shared_targets):
  os_brick.disconnect_volume(conn_info)
  attachment.delete()

Note

We assume the detach and attach related calls to Cinder are synchronous so there will be no races between os-brick operations on the host and cinder operations on the backend. Any driver deviation from this pattern will be considered a bug.

By default libvirt assumes all disks are exclusively used by a single guest. If you want to share disks between instances, you need to tell libvirt when configuring the guest XML for that disk via setting the ‘shareable’ flag for the disk. This means that the hypervisor will not try to take an exclusive lock on the disk, that all I/O caching is disabled, and any SELinux labeling allows use by all domains.

Nova needs to set this ‘shareable’ flag for the multi-attach volumes (where the ‘multiattach’ flag is set to True) for every single attachment. This spec will only enable this feature for libvirt; all other drivers should reject attach calls to multi-attach volumes until that driver adds support for this functionality. The information is stored among the virt driver capabilities dict in the base ComputeDriver, where support for multi-attach will be True for Libvirt and disabled for all other virt drivers. To introduce the usage of the flag we will also need to bump the minimum compute version.
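To show where the ‘shareable’ flag ends up, the sketch below builds a libvirt disk element with the standard library. The disk_xml function, the device paths, and the fixed driver attributes are assumptions for illustration only and do not reflect how the libvirt driver generates its guest configuration.

```python
import xml.etree.ElementTree as ET


def disk_xml(source_dev, target_dev, multiattach):
    """Build a block disk element, adding <shareable/> for multiattach."""
    disk = ET.Element('disk', type='block', device='disk')
    # Caching is disabled, matching the behavior described for shared disks.
    ET.SubElement(disk, 'driver', name='qemu', type='raw', cache='none')
    ET.SubElement(disk, 'source', dev=source_dev)
    ET.SubElement(disk, 'target', dev=target_dev, bus='virtio')
    if multiattach:
        ET.SubElement(disk, 'shareable')
    return ET.tostring(disk, encoding='unicode')
```

The same function called with multiattach=False yields a disk element without the shareable child, which libvirt treats as exclusively owned by one guest.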

The following policy rules will be added to Cinder:

Enable/Disable multiattach=True
Enable/Disable multiattach=True + bootable=True

Nova should reject the attach request in case the hypervisor does not support it, but with the current API it is not possible. This can be solved in part with the policy rules above. For example, if you’re running a cloud with computes that don’t support multiattach, let’s say it’s all vmware, then the operator can configure policy to disable multiattach volumes on the cinder side. If you’ve got a mixed hypervisor cloud and the user tries to attach a multiattach volume to an instance on a compute where the virt driver doesn’t support multiattach, then the attach request fails on the compute and nova-compute calls attachment_delete to delete the attachment created in nova-api’s attach_volume code. If nova-api exposed backend compute driver capabilities then we could check and fail fast in the API, but nova doesn’t have that yet so we’re just left with policy rules and checks on the backend.
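The compute-side failure path described above can be sketched as follows. Everything here is hypothetical: the exception class, the function name, the capability key, and the delete_attachment callable stand in for the real driver capability check and the Cinder attachment_delete call.

```python
class MultiattachNotSupported(Exception):
    """Raised when the virt driver cannot attach a multiattach volume."""


def attach_on_compute(volume, attachment_id, capabilities, delete_attachment):
    """Reject multiattach volumes on unsupported drivers and clean up."""
    supported = capabilities.get('supports_multiattach', False)
    if volume.get('multiattach') and not supported:
        # Remove the attachment created earlier by the API layer.
        delete_attachment(attachment_id)
        raise MultiattachNotSupported(volume['id'])
    return True
```

The important point is the ordering: the attachment record is deleted before the error propagates, so the volume is not left reserved for an instance that never attached it.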

Alternatives

For the use case described above, the failover scenario can be handled by attaching the volume to the passive/standby instance after the primary fails. This means that the standby instance is no longer a hot standby, as the volume attachment takes time: the new primary instance is without the volume while it is re-attached, and that time can vary depending on how long it takes to mark the volume free after the failure of the primary instance.

Another alternative is to clone a volume and attach the clone to the second instance. The downside to this is any changes to the original volume don’t show up in the mounted clone so this is only a viable alternative if the volume is read-only.

Data model impact

None

REST API impact

There are features of the Nova API that have to be handled with care or disabled completely for now for volumes that support multi-attach.

The create call in the ‘os-assisted-volume-snapshot’ API calls the ‘volume_snapshot_create’ where we don’t have the instance_uuid to retrieve the right BDM, therefore we need to disable this call for multi-attach. The API format for this request is not changed; it is only a protection until the API changes required to support this request with multi-attach are made.

Another feature that needs further investigation is ‘boot from volume’ (BFV). The first aspect of the feature is the ‘delete_on_termination’ flag, which may be used along with multi-attach; no changes are necessary when the volume provided has multiattach=True and the delete_on_termination=True flag is passed in for BFV. When this flag is set to True, the volume that is attached to the instance is meant to be removed when the instance is deleted. This option does not cause a problem, as Cinder takes care of not deleting a volume if it still has active attachments. Nova will receive an error from Cinder that the volume deletion failed, which will then be logged [3] and also in the API on ‘_local_delete’ [4], but this will not affect the instance termination process.

The second aspect of BFV is the boot process. In this case Nova only checks the ‘bootable’ flag. The policy check happens on the Cinder side on allowing it together with multiattach or not.

For cases where Nova creates the volume itself, i.e. source_type is blank/image/snapshot, it should not enable multi-attach for the volume, i.e. no change to the existing code for now.

When we attach a volume at boot time (BFV with source=volume,dest=volume) scheduling will fail in case of selecting computes that do not support multi-attach. Later on we can add a new scheduler filter to avoid the failure. The filter would check the compute capabilities. This step is considered to be a future improvement.

When we enable the feature we will have a ‘multiattach’ policy to enable or disable the operation entirely on the Cinder side as noted above. Read/Only policy is a future work item and out of the scope of this spec.

A new compute API microversion will be added since users will need some way to discover if they can perform volume multiattach. The semantics of the microversion will be similar to the 2.49 microversion for tagged attach.

Security impact

In the libvirt driver, the disk is given a shared SELinux label, and so that disk no longer has strong sVirt SELinux isolation.

The OpenStack volume encryption capability is expected to work out of the box with this use case as well: because the same key is used for all connections, it should not break how the encryptor works below the clustered file system. The attachment of an encrypted volume to multiple instances should be tested in Tempest to see if there are any unexpected issues with it.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Implementation

Based on the work from Walter Boring and Charlie Zhou. Agreed with Walter to start the work again.

Assignee(s)

Primary assignee:: ildiko-vancsa

Work Items

Update libvirt driver to generate proper domain XML for instances with multi-attach volumes
Provide the necessary checks in the Nova API to block the operation in the above listed cases
Add Tempest test cases and documentation

Dependencies

This requires version 3.2.0 or above of python-cinderclient. Corresponding blueprint: https://blueprints.launchpad.net/python-cinderclient/+spec/multi-attach-volume
Corresponding, implemented spec in Cinder: https://blueprints.launchpad.net/cinder/+spec/multi-attach-volume
Link needed to Cinder spec to address detach issues currently captured here: https://etherpad.openstack.org/p/cinder-nova-api-changes

Testing

We’ll have to add new Tempest tests to support the new Cinder volume multiattach flag. The new cinder multiattach flag is what allows a volume to be attached more than once. For instance the following scenarios will need to be tested:

Attach the same volume to two instances.
Boot from volume with multiattach
Encrypted volume with multiattach
Boot from multi-attachable volume with boot_index=0
Negative testing:

Trying to attach a non-multiattach volume to multiple instances

In addition to the above, Cinder migrate needs to be tested on the gate, as it triggers swap_volume in Nova.

Documentation Impact

We will have to update the documentation to discuss the new ability to attach a volume to multiple instances if the cinder multiattach flag is set on a volume. The documentation also needs to state that the volume creation for these types of volumes will not be supported by the API due to the deprecation of the volume creation Nova API. If a volume needs to allow multiple volume attachments it has to be created on the Cinder side with the needed properties specified.

It also needs to be outlined in the documentation that attaching a volume multiple times in read-write mode can cause data corruption, if not handled correctly. It is the users’ responsibility to add some type of exclusion (at the file system or network file system layer) to prevent multiple writers from corrupting the data. Examples should be provided if available to guide users on how to do this.

References

This is the cinder wiki page that discusses the approach to multi-attach https://wiki.openstack.org/wiki/Cinder/blueprints/multi-attach-volume
Queens PTG etherpad: https://etherpad.openstack.org/p/cinder-ptg-queens-thursday-notes

History

Revisions
Release Name	Description
Kilo	Introduced
Liberty	Re-approved
Mitaka-1	Re-approved
Mitaka-2	Updated with API limitations and testing scenarios
Newton	Re-approved
Queens	Re-proposed

Use OpenStack SDK in Nova

Fri, 08 Nov 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/openstacksdk-in-nova

We would like to use the OpenStack SDK to interact with other core OpenStack services.

Problem description

Nova is using both python-${service}client and keystoneauth1 adapters to interact with other core services. Currently changes or fixes to a $service that Nova depends on may require changes to python-${service}client before Nova can be brought into parity. This also requires the OpenStack SDK to be brought into parity as it is used for the CLI.

Maintenance of python-${service}client can be burdensome due to high technical debt in the clients. By consuming the OpenStack SDK directly, we can eliminate the additional dependency on the python-${service}client and streamline the process.

Use Cases

As a developer on OpenStack, I would like to reduce the number of areas where maintenance must be performed when making changes related to the use of core OpenStack services.

As a core OpenStack project, Nova should make use of other projects in reliable and maintainable ways. To this end, we should use the OpenStack SDK for service to service interaction in place of direct API or python-${service}client implementations.

Proposed change

This spec proposes to use the OpenStack SDK in place of python-${service}client and other methods across Nova in three phases for each of the target services.

Target Services:

Ironic (python-ironicclient -> Baremetal SDK)
Cinder (python-cinderclient -> Block Storage SDK)
Glance (python-glanceclient -> Image v2 SDK)
Neutron (python-neutronclient -> Network SDK)
Placement (keystoneauth1 adapter -> OpenStack SDK Proxy)

The initial phase will consist of adding plumbing to construct an openstack.connection.Connection object to Nova components that interact with other services using a python-${service}client. The service proxy from this connection can then be used in place of existing keystoneauth1 adapter to retrieve the endpoint to configure the client. This is in progress at [1].

The main phase will be to iterate through calls to the python-${service}client and replace them with calls into the OpenStack SDK until the client is no longer needed. During this phase, we will close feature deficiencies identified in the OpenStack SDK as necessary. See OpenStack SDK Changes for a list of identified deficiencies. This process is in progress for python-ironicclient at [2].

For Placement, replace the keystoneauth1 adapter returned by nova.utils.get_ksa_adapter('placement') with the SDK’s placement proxy. This is transparent other than a small number of changes to mocks in tests. This is in progress at [3]. Eventually, the SDK may implement more support for placement. With the framework being in place, we can consider integrating such changes as they become available.

The final phase will simply be to remove the now-unused clients and clean up any remaining helpers and fixtures.

OpenStack SDK Changes

The OpenStack SDK includes a more complete selection of helpers for some services than others, but at worst provides the same primitives as a keystoneauth1 adapter. Development has started with Ironic, which has robust support within the OpenStack SDK. Other services will require additional work in the OpenStack SDK, or may need to be implemented using the API primitives provided by openstack.Proxy. It is more desirable to focus on expanding OpenStack SDK support for these projects rather than implementing them in Nova. Since there is not a spec repo for OpenStack SDK, we will try to outline the missing helpers by service here.

Ironic
node.get_console
node.inject_nmi
node.list_volume_connectors
node.list_volume_targets
node.set_console_mode (available via patch_node)
volume_target.create
volume_target.delete

Cinder
attachments.complete
attachments.delete
attachments.update
volumes.attach
volumes.begin_detaching
volumes.detach
volumes.initialize_connection
volumes.migrate_volume_completion
volumes.reserve
volumes.roll_detaching
volumes.terminate_connection
volumes.unreserve

Glance
image.add_location

Neutron
None

Alternatives

One possibility that was considered was to replace calls into python-${service}client with methods that invoke the $service APIs directly through the keystoneauth1 adapter’s get/put/etc primitives. This would entail effectively porting the python-${service}client code into nova. While this would give us the opportunity to clean things up, it would involve a lot of low-level work like version discovery/negotiation, input payload construction and validation, and output processing.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

The initial phase will have minimal impact as the only change is the construction of the keystoneauth1 adapter by the OpenStack SDK rather than directly. The main phase will not likely have any difference in performance and the final phase should approximately offset any impact from the initial phase.

Other deployer impact

None

Developer impact

By using the OpenStack SDK as the single method of contact with other services, the maintenance footprint can be reduced. This also moves us towards a more stable OpenStack SDK as more consumers generally mean more chances to find and resolve bugs.

In addition, as new methods and services are supported by the OpenStack SDK, introducing them to Nova should be simpler and more reliable than the current methods.

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: dustinc <dustin.cowles@intel.com>, efried <openstack@fried.cc>
Other contributors:: mordred <mordred@inaugust.com>, dtantsur <dtantsur@protonmail.com>

Work Items

Introduce package requirements to Nova.
Introduce plumbing for the construction of an openstack.connection.Connection object for each $service.
For each target $service (excluding Placement), close deficiencies in OpenStack SDK while replacing invocations into python-${service}client, one at a time, with calls into the SDK’s $service proxy.
- For Placement, replace the keystoneauth1 adapter with the SDK’s placement proxy.
Remove the now-unused python-${service}client, test fixtures, and other helpers and utils.

Dependencies

Nova support for using keystoneauth1 config options for Cinder.
- https://review.opendev.org/#/c/655985/

Testing

Existing unit tests will need to be updated to assert calls to the SDK instead of the client. In cases where the client call was mocked, this should be a matter of swapping out that mock and its assertions. No significant additional unit testing should be required.

Existing functional test cases should be adequate. Changes may be required in fixtures and other framework.

Existing integration tests should continue to function seamlessly. This will be the litmus test of success.

Documentation Impact

None

References

http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005810.html

https://docs.openstack.org/openstacksdk/latest/user/config/configuration.html

http://eavesdrop.openstack.org/irclogs/%23openstack-sdks/%23openstack-sdks.2019-05-20.log.html#t2019-05-20T13:48:07

History

Revisions
Release Name	Description
Train	Introduced

Use OpenStack SDK in Nova

Fri, 08 Nov 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/openstacksdk-in-nova

We would like to use the OpenStack SDK to interact with other core OpenStack services. Implementation began in Train and continues in Ussuri.

Problem description

Use Cases

As a developer on OpenStack, I would like to reduce the number of areas where maintenance must be performed when making changes related to the use of core OpenStack services.

Proposed change

This spec proposes to use the OpenStack SDK in place of python-${service}client and other methods across Nova in three phases for each of the target services.

Target Services:

Ironic (python-ironicclient -> Baremetal SDK)
Cinder (python-cinderclient -> Block Storage SDK)
Glance (python-glanceclient -> Image v2 SDK)
Neutron (python-neutronclient -> Network SDK)
Placement (keystoneauth1 adapter -> OpenStack SDK Proxy)

The final phase will simply be to remove the now-unused clients and clean up any remaining helpers and fixtures.

OpenStack SDK Changes

Ironic
node.get_console
node.inject_nmi
node.list_volume_connectors
node.list_volume_targets
node.set_console_mode (available via patch_node)
volume_target.create
volume_target.delete

Cinder
attachments.complete
attachments.delete
attachments.update
volumes.attach
volumes.begin_detaching
volumes.detach
volumes.initialize_connection
volumes.migrate_volume_completion
volumes.reserve
volumes.roll_detaching
volumes.terminate_connection
volumes.unreserve

Glance
image.add_location

Neutron
None

Alternatives

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

Other deployer impact

None

Developer impact

In addition, as new methods and services are supported by the OpenStack SDK, introducing them to Nova should be simpler and more reliable than the current methods.

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: dustinc <dustin.cowles@intel.com>
Other contributors:: mordred <mordred@inaugust.com>, dtantsur <dtantsur@protonmail.com>

Feature Liaison

Feature liaison:: efried

Work Items

(Implemented in Train) Introduce package requirements to Nova.
(Partially implemented in Train) Introduce plumbing for the construction of an openstack.connection.Connection object for each $service.
(Partially implemented in Train) For each target $service (excluding Placement), close deficiencies in the OpenStack SDK while replacing invocations of python-${service}client, one at a time, with calls into the SDK’s $service proxy.
- For Placement, replace the keystoneauth1 adapter with the SDK’s placement proxy.
Remove the now-unused python-${service}client, test fixtures, and other helpers and utils.

Dependencies

Nova support for using keystoneauth1 config options for Cinder.
- https://review.opendev.org/#/c/655985/

Testing

Existing functional test cases should be adequate. Changes may be required in fixtures and other framework.

Existing integration tests should continue to function seamlessly. This will be the litmus test of success.

Documentation Impact

None

References

http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005810.html

https://docs.openstack.org/openstacksdk/latest/user/config/configuration.html

http://eavesdrop.openstack.org/irclogs/%23openstack-sdks/%23openstack-sdks.2019-05-20.log.html#t2019-05-20T13:48:07

https://review.opendev.org/#/c/662881/

Items Implemented In Train

https://review.opendev.org/#/c/676926/
https://review.opendev.org/#/c/642899/
https://review.opendev.org/#/c/656027/
https://review.opendev.org/#/c/656028/
https://review.opendev.org/#/c/659690/
https://review.opendev.org/#/c/680649/
https://review.opendev.org/#/c/676837/

History

Revisions
Release Name	Description
Train	Introduced, Partially Implemented
Ussuri	Reintroduced

Nova - Cyborg Interaction

Fri, 08 Nov 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/nova-cyborg-interaction

This specification describes the Nova - Cyborg interaction needed to create and manage instances with accelerators, and the changes needed in Nova to accomplish that.

Problem description

Scope

Nova and Cyborg need to interact in many areas for handling instances with accelerators. While this spec covers the gamut, specific areas are covered in detail in other specs. We list all the areas below, identify which specific parts are covered by other specs, and describe what is covered in this spec.

Representation: Cyborg shall represent devices as nested resource providers under the compute node (except possibly for disaggregated servers), accelerator types as resource classes and accelerators as inventory in Placement. The properties needed for scheduling are represented as traits. This is specified by [1]. This spec does not dwell on this topic.
Discovery and Updates: Among the devices discovered in a host, Cyborg intends to claim only those that are not included under the PCI Whitelisting mechanism. Cyborg shall update Placement in a way that is compatible with the virt driver’s update of Placement. These aspects are addressed in sections Coexistence with PCI whitelists and Placement update respectively.
User requests for accelerators: Users usually request compute resources via flavors. However, since the requests for devices may be highly varied, placing them in flavors may result in flavor explosion. We avoid that by expressing device requests in a device profile [2]. The relationship between device profiles and flavors is explored in Section User requests.

When an instance creation (boot) request is made, the contents of a device profile shall be translated to request groups in the request spec; the syntax in request groups is covered in Section Updating the Request Spec.
Instance scheduling: Nova shall use the Placement data populated by Cyborg to schedule instances. This spec does not dwell on this topic.
Assignment of accelerators: We introduce the concept of Accelerator Request objects in Section Accelerator Requests. The workflow to create and use them is summarized in Section Nova changes for Assignment workflow. The same section also highlights the Nova changes needed.
Instance operations: The behavior with respect to accelerators for all standard instance operations is defined in [3]. This spec does not dwell on this topic.

Use Cases

A user requests an instance with one or more accelerators of different types assigned to it.
An operator may provide users with both Device as a Service and Accelerated Function as a Service in the same cluster (see [1]).

The following use cases are not addressed in this release but are of long term interest:

A user requests to add one or more accelerators to an existing instance.
Live migration with accelerators.

Proposed change

Coexistence with PCI whitelists

The operator tells Nova which PCI devices to claim and use by configuring the PCI Whitelists mechanism. In addition, the operator installs Cyborg drivers in compute nodes and configures/enables them. Those drivers may then discover and report some PCI devices. The operator must ensure that both configurations are compatible.

Ideally, there should be a single way for the operator to identify which PCI devices should be claimed by Nova and which by Cyborg. Until that is figured out, the operator shall use Cyborg’s configuration file to specify which Cyborg drivers are enabled. Since each driver claims specific PCI IDs, the operator can and must ensure that none of these PCI IDs are included in Nova’s PCI whitelist.

Placement update

Cyborg shall call Placement API directly to represent devices and accelerators. Some of the intended use cases for the API invocation are:

Create or delete child RPs under the compute node RP.
Create or delete custom RCs and custom traits.
Associate traits with RPs or remove such association.
Update RP inventory.

Cyborg shall not modify the RPs created by any other component, such as Nova virt drivers.

User requests

The user request for accelerators is encapsulated in a device profile [2], which is created and managed by the admin via the Cyborg API.

A device profile may be viewed as a ‘flavor for devices’. Accordingly, the instance request should include both a flavor and a device profile. However, that requires a change to the Nova API for instance creation. To mitigate the impact of such changes on users and operators, we propose to do this in phases.

In the initial phase, Nova API remains as today. The device profile is folded into the flavor as an extra spec by the operator, as below:

openstack flavor set --property 'accel:device_profile=<profile_name>' flavor

Thus the standard Nova API can be used to create an instance with only the flavor (without device profiles), like this:

openstack server create --flavor f ....  # instance creation

In the future, a device profile may be used by itself to specify accelerator resources for the instance creation API.

Updating the Request Spec

When the user submits a request to create an instance, as described in Section User requests, Nova needs to call a Cyborg API, to get back the resource request groups in the device profile and merge them into the request spec. (This is along the lines of the scheme proposed for Neutron [4].)

This call, like all the others that Nova would make to Cyborg APIs, is done through a Keystone-based adapter that would locate the Cyborg service, similar to the way Nova calls Placement. A new Cyborg client module shall be added to Nova, to encapsulate such calls and to provide Cyborg-specific functionality.

VM images in Glance may be associated with image properties (other than image traits), such as bitstream/function IDs needed for that image. So, Nova should pass the VM image UUID from the request spec to Cyborg. This is TBD.

The groups in the device profile are numbered by Cyborg. The request groups that are merged into the request spec are numbered by Nova. These numberings would not be the same in general, i.e., the N-th device profile group may not correspond to the N-th request group in the request spec.

When the device profile request groups are added to other request groups in the flavor, the group_policy of the flavor shall govern the overall semantics of all request groups.

Accelerator Requests

An accelerator request (ARQ) is an object that represents the state of the request for an accelerator to be assigned to an instance. The creation and management of ARQs are handled by Cyborg, and ARQs are persisted in Cyborg database.

An ARQ, by definition, represents a request for a single accelerator. The device profile in the user request may have N request groups, each asking for M accelerators; then N * M ARQs will be created for that device profile.
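
As a worked example of the N * M rule, the sketch below counts the ARQs implied by a device profile. The group layout (a dict whose `resources:<RC>` keys hold counts) and the helper name `count_arqs` are assumptions made for illustration, not part of the Cyborg API.

```python
def count_arqs(groups):
    """Return the number of ARQs implied by a device profile's groups.

    Each group is assumed to be a dict whose ``resources:<RC>`` keys hold
    the number of accelerators requested; one ARQ is created per accelerator.
    """
    total = 0
    for group in groups:
        for key, value in group.items():
            if key.startswith("resources:"):
                total += int(value)
    return total


# Hypothetical device profile: N = 2 groups, each asking for M = 2 devices.
example_groups = [
    {"resources:FPGA": "2", "trait:CUSTOM_EXAMPLE_A": "required"},
    {"resources:FPGA": "2", "trait:CUSTOM_EXAMPLE_B": "required"},
]
print(count_arqs(example_groups))  # 2 groups * 2 accelerators = 4
```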

When an ARQ is initially created by Cyborg, it is not yet associated with a specific host name or a device resource provider. So it is said to be in an unbound state. Subsequently, Nova calls Cyborg to bind the ARQ to a host name, a device RP UUID and an instance UUID. If the instance fails to spawn, Nova would unbind the ARQ without deleting it. On instance termination, Nova would delete the ARQs after unbinding them.

Each ARQ needs to be matched to the specific RP in the allocation candidate that Nova has chosen, before the ARQ is bound. The current Nova code maps request groups to RPs, while the Cyborg client module in Nova (see Section Nova changes for Assignment workflow) matches ARQs to request groups. The matching is done using the requester_id field in the RequestGroup object as below:

The order of request groups in a device profile is not significant, but it is preserved by Cyborg. Thus, each device profile request group has a unique index.
When the device profile request groups returned by Cyborg are added to the request spec, the requester_id field is set to ‘device_profile_<N>’ for the N-th device profile request group (starting from zero). The device profile name need not be included here because there is only one device profile per request spec.
When Cyborg creates an ARQ for a device profile, it embeds the device profile request group index in the ARQ before returning it to Nova.
The matching is done in two steps:
- Each ARQ is mapped to a specific request group in the request spec using the requester_id field.
- Each request group is mapped to a specific RP using the same logic as the Neutron bandwidth provider ([5]).
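
The two matching steps above can be sketched as follows. The dict keys (`uuid`, `device_profile_group_id`), the function name, and the `group_to_rp` mapping are hypothetical; only the `device_profile_<N>` convention for requester_id comes from this spec.

```python
def match_arqs_to_rps(arqs, request_groups, group_to_rp):
    """Map each ARQ UUID to the RP UUID chosen for its request group.

    arqs: list of dicts with hypothetical keys 'uuid' and
        'device_profile_group_id' (the group index embedded by Cyborg).
    request_groups: list of dicts with a 'requester_id' key.
    group_to_rp: dict mapping requester_id -> RP UUID, assumed to come
        from Nova's existing request-group-to-RP mapping.
    """
    known_ids = {group["requester_id"] for group in request_groups}
    matches = {}
    for arq in arqs:
        # Step 1: ARQ -> request group, via the requester_id convention.
        requester_id = "device_profile_%d" % arq["device_profile_group_id"]
        if requester_id not in known_ids:
            raise LookupError("no request group for %s" % requester_id)
        # Step 2: request group -> RP.
        matches[arq["uuid"]] = group_to_rp[requester_id]
    return matches


example_arqs = [
    {"uuid": "arq-0", "device_profile_group_id": 0},
    {"uuid": "arq-1", "device_profile_group_id": 1},
]
example_request_groups = [
    {"requester_id": "device_profile_0"},
    {"requester_id": "device_profile_1"},
]
example_group_to_rp = {"device_profile_0": "rp-a", "device_profile_1": "rp-b"}
example_matches = match_arqs_to_rps(
    example_arqs, example_request_groups, example_group_to_rp)
```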

Nova changes for Assignment workflow

This section summarizes the workflow details for Phase 1. The changes needed in Nova are marked with NEW.

NEW: A Cyborg client module is added to Nova. All Cyborg API calls are routed through that module.

The Nova API server receives a POST /servers API request with a flavor that includes a device profile name.
NEW: The Nova API server calls the Cyborg API GET /v2/device_profiles?name=$device_profile_name and gets back the device profile. The request groups in that device profile are added to the request spec.
The Nova scheduler invokes Placement and gets a list of allocation candidates. It selects one of those candidates and makes claim(s) in Placement. The Nova conductor then sends an RPC message build_and_run_instances to the Nova compute manager.
NEW: Nova compute manager calls the Cyborg API POST /v2/accelerator_requests with the device profile name. Cyborg creates a set of unbound ARQs for that device profile and returns them to Nova. (The call may originate from Nova conductor or the compute manager; that will be settled in code review.)
NEW: The Cyborg client in Nova matches each ARQ to the resource provider picked for that accelerator. See Section Accelerator Requests.
NEW: The Nova compute manager calls the Cyborg API PATCH /v2/accelerator_requests to bind the ARQ with the host name, device’s RP UUID and instance UUID. This is an asynchronous call which prepares or reconfigures the device in the background.

NEW: Cyborg, on completion of the bindings (successfully or otherwise), calls Nova’s POST /os-server-external-events API with:

{
   "events": [
      { "name": "accelerator-requests-bound",
        "tag": $device_profile_name,
        "server_uuid": $instance_uuid,
        "status": "completed" # or "failed"
      },
      ...
   ]
}

NEW: The Nova compute manager waits for the notification, subject to the timeout mentioned in Section Other deployer impact. It then calls the Cyborg REST API GET /v2/accelerator_requests?instance=<uuid>&bind_state=resolved.
NEW: The Nova virt driver uses the attach handles returned from the Cyborg call to compose PCI passthrough devices into the VM’s definition.
NEW: If there is any error after binding has been initiated, Nova must unbind the relevant ARQs by calling Cyborg API. It may then retry on another host or delete the (unbound) ARQs for the instance.
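
As an illustration of the external event in the workflow above, the following sketch builds the request body that Cyborg would send to POST /os-server-external-events. The function name and the sample profile name and UUID are hypothetical; the field names and status values are the ones listed in this spec.

```python
import json


def build_bound_event(device_profile_name, instance_uuid, succeeded):
    """Build the body for POST /os-server-external-events."""
    return {
        "events": [
            {
                "name": "accelerator-requests-bound",
                "tag": device_profile_name,
                "server_uuid": instance_uuid,
                "status": "completed" if succeeded else "failed",
            }
        ]
    }


# Hypothetical profile name and instance UUID, for illustration only.
example_body = build_bound_event(
    "example-profile", "00000000-0000-0000-0000-000000000001", True)
print(json.dumps(example_body, indent=2))
```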

This flow is captured by the following sequence diagram, in which the Nova conductor and scheduler are together represented as the Nova controller.

Alternatives

It is possible to have an external agent create ARQs from device profiles by calling Cyborg, and then feed those pre-created ARQs to the Nova instance creation API, analogous to Neutron ports. We do not take that approach yet because it requires changes to Nova instance creation API.

It is possible to have the Nova virt driver poll for the Cyborg ARQ binding completion. That is not preferable, partly because that is not the pattern of interaction with other services like Neutron.

Data model impact

None

REST API impact

None. A new extra_spec key accel:device_profile is added to the flavor, but no API is modified.

Security impact

None

Notifications impact

Nova may choose to add additional notifications for Cyborg API calls.

Other end user impact

None

Performance Impact

The extra calls to the Cyborg REST API may impact Nova conductor/scheduler throughput. This has been mitigated by making some critical Cyborg operations asynchronous.

Other deployer impact

The deployer needs to set up the clouds.yaml file so that Nova can call the Cyborg REST API.

The deployer needs to configure a new tunable in nova-cpu.conf:

* arq_binding_timeout (integer): Time in seconds for Nova compute
  manager to wait for Cyborg to notify that ARQ binding is done.
  Timeout is fatal, i.e., VM startup is aborted with an exception.
  Default: 300.
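
A minimal sketch of how the timeout could be applied, assuming the external-event handler signals arrival through a `threading.Event`. Nova's real implementation uses its own event-waiting machinery; the exception and function names here are hypothetical.

```python
import threading


class ArqBindingTimeout(Exception):
    """Hypothetical exception raised when Cyborg does not notify in time."""


def wait_for_arq_binding(bound_event, arq_binding_timeout=300):
    """Block until the binding notification arrives or the timeout elapses.

    bound_event is a threading.Event that a (hypothetical) external-event
    handler sets when 'accelerator-requests-bound' is received.
    """
    if not bound_event.wait(timeout=arq_binding_timeout):
        # Timeout is fatal: the caller aborts VM startup.
        raise ArqBindingTimeout(
            "ARQ binding not reported within %s seconds" % arq_binding_timeout)


example_event = threading.Event()
# Simulate Cyborg's notification arriving shortly after the wait begins.
threading.Timer(0.05, example_event.set).start()
wait_for_arq_binding(example_event, arq_binding_timeout=2)
```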

Developer impact

The resource classes FPGA and PGPU have already been standardized. As new device types are proposed, they will initially be represented as custom RCs and may be standardized later. Such standardization requires changes to os-resource-classes.

For end-to-end testing with tempest, Cyborg shall provide a fake driver which returns attach handles of type TEST_PCI. The Nova virt driver should ignore such attach handles, and create VMs as if such ARQs did not exist.

Upgrade impact

None

Implementation

Assignee(s)

Sundar Nadathur

Feature Liaison

Feature liaison:: efried

Work Items

See the steps marked NEW in Nova changes for Assignment workflow section.

Dependencies

None

Testing

There need to be unit tests and functional tests for the Nova changes. Specifically, there needs to be a functional test fixture that mocks the Cyborg API calls.

There need to be tempest tests for the end-to-end flow, including failure modes. The tempest tests should be targeted at a fake driver (in addition to real hardware, if any) and tied to the Nova Zuul gate.

Documentation Impact

Device profile creation needs to be documented in Cyborg, as noted in [2].

The need for the operator to fold the device profile into the flavor must be documented.

References

History

Revisions
Release Name	Description
Train	Introduced
Ussuri	Re-proposed

Per Process Healthcheck endpoints

Mon, 04 Nov 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks

In many modern deployment frameworks, there is an expectation that an application can expose a health-check endpoint so that the binary status can be monitored. Nova currently does not provide a native way to inspect the health of its binaries, which hinders cloud monitoring and maintenance. While limited support exists for health checks via Oslo middleware for our WSGI based API binaries, this blueprint seeks to expose a local HTTP health-check endpoint to address this feature gap consistently for all Nova components.

Problem description

Monitoring the health of a Nova service today requires the experience to develop and implement a series of external heuristics that infer the state of the service binaries.

This can be as simple as checking the service status for those with heartbeats or can comprise monitoring log output via a watchdog and restarting the service if no output is detected after a protracted period. Processing the logs for known error messages and executing a remediation script or other methods that are easy to do incorrectly are also common.

This is also quite unfriendly to new Nova users who have not gained enough experience with operating Nova to know what warning signs they should look for such as inability to connect to the message bus. Nova developers however do know what some of the important health indicators are and can expose those as a local health-check endpoint that operators can use instead.

The existing Oslo middleware does not address this problem statement because:

It can only be used by the API and metadata binaries
The middleware does not tell you whether the service is alive if it is hosted by a WSGI server such as Apache, since the middleware is executed independently of the WSGI application; i.e., the middleware can pass while nova-api cannot connect to the DB and is otherwise broken.
The Oslo middleware in detailed mode leaks information about the host kernel, Python version, and hostname, which can be used to determine if the host is vulnerable to CVEs. This means it should never be exposed to the Internet. e.g.

platform: 'Linux-5.15.2-xanmod1-tt-x86_64-with-glibc2.2.5',
python_version: '3.8.12 (default, Aug 30 2021, 16:42:10) \n[GCC 10.3.0]'

Use Cases

As an operator, I want a simple REST endpoint I can consume to know if a Nova process is healthy.

As an operator I want this health check to not impact the performance of the service so it can be queried frequently at short intervals.

As a deployment tool implementer, I want the health check to be local with no dependencies on other hosts or services to function so I can integrate it with service managers such as systemd or a container runtime like Docker.

As a packager, I would like the use of the health check endpoints to not require special clients or packages to consume them. cURL, socat, or netcat should be all that is required to connect to the health check and retrieve the service status.

As an operator I would like to be able to use health-check of the Nova API and metadata services to manage the membership of endpoints in my load-balancer or reverse proxy automatically.

Proposed change

Definitions

TTL: The time interval for which a health check item is valid.

pass: all health indicators are passing and their TTLs have not expired.

warn: any health indicator has an expired TTL, or there is a partial or transient failure.

fail: any health indicator is reporting an error or all TTLs are expired.

Warn vs fail

In general, if any of the health check indicators are failing, the service should be reported as fail. However, if the specific error condition is recoverable or only a partial failure, the warn state can and should be used.

An example of this is a service that has lost a connection to the message bus. When the connection is lost, it should go to the warn state; if the first attempt to reconnect fails, it should go to the fail state. Transient failures should be considered warnings, but persistent errors should be escalated to failures.

In many cases external management systems will treat warn and fail as equivalent and raise an alarm or restart the service. While this spec does not specify how you should recover from a degraded state, it is important to include a human readable description of why the warn or fail state was entered.

Services in the warn state are still considered healthy in most cases but they may be about to fail soon or be partially degraded.
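
The pass, warn and fail definitions can be read as a small reduction over the individual indicators. The sketch below is one possible interpretation, not the planned implementation: the `(status, recorded_at)` tuple layout and the function name are assumptions, and an empty set of checks is treated as pass.

```python
import time


def summarize(checks, ttl, now=None):
    """Reduce individual check results to 'pass', 'warn' or 'fail'.

    checks maps a check name to a (status, recorded_at) tuple, where
    recorded_at is a UNIX timestamp. A ttl of 0 means results never expire.
    """
    now = time.time() if now is None else now
    fresh_statuses = []
    expired = 0
    for status, recorded_at in checks.values():
        if ttl and now - recorded_at > ttl:
            expired += 1
        else:
            fresh_statuses.append(status)
    # fail: any indicator reports an error, or every TTL has expired.
    if "fail" in fresh_statuses or (checks and expired == len(checks)):
        return "fail"
    # warn: some indicator is degraded, or at least one TTL has expired.
    if "warn" in fresh_statuses or expired:
        return "warn"
    return "pass"
```

For example, with a TTL of 50 seconds, one fresh passing check plus one stale check yields warn, while two stale checks yield fail.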

Code changes

A new top-level Nova health check module will be created to encapsulate the common code and data structure required to implement this feature.

A new health check manager class will be introduced which will maintain the health-check state and all functions related to retrieving, updating and summarizing that state.

The health check manager will be responsible for creating the health check endpoint when it is enabled in the nova.conf and exposing the health check over HTTP.

The initial implementation will support HTTP over TCP with optional support for UNIX domain sockets as a more secure alternative to be added later. The HTTP endpoint in both cases will be unauthenticated and the response will be in JSON format.

A new HealthcheckStatusItem data class will be introduced to store an individual health check data-point. The HealthcheckStatusItem will contain the name of the health check, its status, the time it was recorded, and an optional output string that should be populated if the status is warn or fail.
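
A minimal sketch of such a data class is shown here, assuming the four fields listed above. The validation in `__post_init__` and the `to_dict` helper are illustrative additions, not part of the spec.

```python
import dataclasses
import datetime
from typing import Optional


@dataclasses.dataclass
class HealthcheckStatusItem:
    name: str
    status: str  # "pass", "warn" or "fail"
    time: datetime.datetime
    output: Optional[str] = None

    def __post_init__(self):
        if self.status not in ("pass", "warn", "fail"):
            raise ValueError("unknown status: %s" % self.status)

    def to_dict(self):
        """Render the item in the shape used under "checks" in a response."""
        data = {"status": self.status, "time": self.time.isoformat()}
        if self.output is not None:
            data["output"] = self.output
        return data


example_item = HealthcheckStatusItem(
    name="hypervisor",
    status="fail",
    time=datetime.datetime(2021, 12, 17, 16, 5, 55,
                           tzinfo=datetime.timezone.utc),
    output="Libvirt Error: ...",
)
```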

A new decorator will be introduced that will automatically retrieve the reference to the healthcheck manager from the Nova context object and update the result based on whether the function decorated raises an exception or not. The exception list and healthcheck item name will be specifiable.

The decorator will accept the name of the health check as a positional argument and include the exception message as the output of the health check on failure. Note that the decorator will only support the pass or fail status for simplicity; where warn is appropriate a manual check should be written. If multiple functions act as indicators of the same capability the same name should be used.

e.g.

@healthcheck('database', [SQLAlchemyError])
def my_db_func(self):
    pass

@healthcheck('database', [SQLAlchemyError])
def my_other_db_func(self):
    pass

By default all exceptions will be caught and re-raised by the decorator.
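
The decorator behaviour described above might look roughly like the following. This is a sketch under stated assumptions: a module-level dict stands in for the health check manager (the spec retrieves the manager from the Nova context object instead), and `ConnectionError` stands in for a real database exception.

```python
import functools

# Stand-in for the health check manager; the spec stores a reference to the
# real manager on the Nova context object instead.
HEALTH_STATE = {}


def healthcheck(name, exceptions=None):
    """Record 'pass' or 'fail' for ``name`` based on the wrapped call."""
    # With no explicit list, every Exception subclass is caught.
    caught = tuple(exceptions) if exceptions else (Exception,)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except caught as exc:
                HEALTH_STATE[name] = {"status": "fail", "output": str(exc)}
                raise  # re-raise so callers still see the original error
            HEALTH_STATE[name] = {"status": "pass"}
            return result
        return wrapper
    return decorator


@healthcheck("database", [ConnectionError])
def query_db(fail=False):
    """Hypothetical DB call used only to exercise the decorator."""
    if fail:
        raise ConnectionError("db unreachable")
    return "rows"
```

Because the exception is re-raised, adding the decorator does not change the error handling of existing callers; it only records the outcome.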

The new REST health check endpoint exposed by this spec will initially only support one URL path /health. The /health endpoint will include a Cache-Control: max-age=<ttl> header as part of its response which can optionally be consumed by the client.

The endpoint may also implement a simple incrementing etag at a later date, once the initial implementation is complete, if required. An etag is not provided initially because the response is expected to be small and cheap to query, so etags would not provide much benefit from a performance perspective.

If implemented, the etag will be incremented whenever the service state changes and will reset to 0 when the service is restarted.

Additional URL paths may be supported in the future, for example to retrieve the running configuration or trigger the generation of Guru Meditation Reports or enable debug logging. However, any endpoint beyond /health is out of scope of this spec. / is not used for health check response to facilitate additional paths in the future.

Example output

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 200 OK
Content-Type: application/health+json
Cache-Control: max-age=3600
Connection: close

{
    "status": "pass",
    "version": "1.0",
    "serviceId": "e3c22423-cd7a-47dc-b6e9-e18d1a8b3bdf",
    "description": "nova-api",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "api_db": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"}
    }
}

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 503 Service Unavailable
Content-Type: application/health+json
Cache-Control: no-cache
Connection: close

{
    "status": "fail",
    "version": "1.0",
    "serviceId": "0a47dceb-11b1-4d94-8b9c-927d998be320",
    "description": "nova-compute",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks":{
        "message_bus":{"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "hypervisor":{
             "status": "fail", "time": "2021-12-17T16:05:55+00:00",
             "output": "Libvirt Error: ..."
        }
    }
}
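
To make the mapping between the overall status and the HTTP response concrete, here is a small sketch that assembles the status code, headers and JSON body for GET /health. The function name and signature are hypothetical, and serving warn as 200 is an assumption based on warn being treated as still healthy.

```python
import json


def build_health_response(status, description, service_id, checks,
                          cache_control=None):
    """Return (status_code, headers, body) for a GET /health request.

    status is the overall 'pass', 'warn' or 'fail' value; checks maps a
    check name to a dict such as {"status": "pass", "time": "..."}.
    """
    # Assumption: only "fail" maps to 503; "pass" and "warn" return 200.
    status_code = 503 if status == "fail" else 200
    headers = {"Content-Type": "application/health+json"}
    if cache_control is not None:
        headers["Cache-Control"] = cache_control
    body = json.dumps({
        "status": status,
        "version": "1.0",
        "serviceId": service_id,
        "description": description,
        "checks": checks,
    })
    return status_code, headers, body


example_code, example_headers, example_body = build_health_response(
    "fail",
    "nova-compute",
    "0a47dceb-11b1-4d94-8b9c-927d998be320",
    {"hypervisor": {"status": "fail", "time": "2021-12-17T16:05:55+00:00",
                    "output": "Libvirt Error: ..."}},
    cache_control="no-cache",
)
```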

Alternatives

Instead of maintaining the state of the process in a data structure and returning the cached state, we could implement the health check as a series of active probes, such as checking the DB schema version to ensure we can access it or making a ping RPC call to the cell conductor or our own service’s RPC endpoint.

While this approach has some advantages, it will have a negative performance impact if the health check is queried frequently, or in a large deployment where even infrequent queries may degrade DB and message bus performance due to the scale of the deployment.

This spec initially suggested using OK, Degraded and Faulty as the values for the status field. These were updated to pass, warn and fail to align with the draft IETF RFC for health check response format for HTTP APIs [1].

Data model impact

The Nova context object will be extended to store a reference to the health check manager.

REST API impact

None

While this change will expose a new REST API endpoint it will not be part of the existing Nova API.

In the Nova API the /health check route will not initially be used to allow those that already enable the Oslo middleware to continue to do so. In a future release Nova reserves the right to add a /health check endpoint that may or may not correspond to the response format defined in Oslo. A translation between the Oslo response format and the health check module may be provided in the future but it is out of the scope of this spec.

Security impact

The new health check endpoint will be disabled by default. When enabled, it will not provide any authentication or explicit access control. The documentation will detail that, when enabled, the TCP endpoint should be bound to localhost and that file system permissions should be used to secure the UNIX socket.

The TCP configuration option will not prevent binding it to a routable IP if the operator chooses to do so. The intent is that the data contained in the endpoint will be non-privileged; however, it may contain hostnames/FQDNs or other infrastructure information such as service UUIDs, so it should not be accessible from the Internet.

Notifications impact

None

While the health checks will use the ability to send notification as an input to determine the health of the system, this spec will not introduce any new notifications and as such it will not impact the Notification subsystem in Nova. New notifications are not added as this would incur a performance overhead.

Other end user impact

None

At present, it is not planned to extend the Nova client or the unified client to query the new endpoint. cURL, socat, or any other UNIX socket or TCP HTTP client can be used to invoke the endpoint.

Performance Impact

None

We expect there to be little or no performance impact as we will be taking a minimally invasive approach to add health indicators to key functions which will be cached in memory. While this will slightly increase memory usage there is no expected impact on system performance.

Other deployer impact

A new config section healthcheck will be added in the nova.conf

A uri config option will be introduced to enable the health check functionality. The config option will be a string opt that supports a comma-separated list of URIs with the following format

uri=<scheme>://[host:port|path],<scheme>://[host:port|path]

e.g.

[healthcheck]
uri=tcp://localhost:42424

[healthcheck]
uri=unix:///run/nova/nova-compute.sock

[healthcheck]
uri=tcp://localhost:42424,unix:///run/nova/nova-compute.sock

The URI is limited to the characters [a-zA-Z0-9_-]. A comma (,) is reserved as the separator between URIs, a period (.) may only be used in IPv4 addresses, and a colon (:) is reserved for port separation unless the address is an IPv6 address. IPv6 addresses must be enclosed in [ and ]. A slash (/) may be used with the UNIX protocol; however, relative paths are not supported. These constraints and the parsing of the URI will be enforced and provided by the rfc3986 library: https://pypi.org/project/rfc3986/
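
The parsing rules can be illustrated with the standard library. Note that this sketch uses `urllib.parse.urlsplit` as a stand-in for the rfc3986 library named above, covers only part of the stated constraints (scheme, host and port, absolute UNIX path), and uses a hypothetical function name.

```python
from urllib.parse import urlsplit


def parse_healthcheck_uris(value):
    """Split the comma-separated ``uri`` option into validated endpoints.

    Returns a list of ("tcp", host, port) or ("unix", path) tuples.
    """
    endpoints = []
    for raw in value.split(","):
        parts = urlsplit(raw.strip())
        if parts.scheme == "tcp":
            # Accessing .port raises ValueError if it is outside 0-65535.
            if not parts.hostname or parts.port is None:
                raise ValueError("tcp URI needs host and port: %r" % raw)
            endpoints.append(("tcp", parts.hostname, parts.port))
        elif parts.scheme == "unix":
            # unix:///abs/path has an empty authority and an absolute path.
            if parts.netloc or not parts.path.startswith("/"):
                raise ValueError("unix URI needs an absolute path: %r" % raw)
            endpoints.append(("unix", parts.path))
        else:
            raise ValueError("unsupported scheme in %r" % raw)
    return endpoints


example_endpoints = parse_healthcheck_uris(
    "tcp://localhost:8080,unix:///run/nova/nova-compute.sock")
```

A port above 65535 is rejected with a ValueError, as is a relative UNIX socket path such as `unix://run/nova.sock`.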

A ttl IntOpt will be added with a default value of 300 seconds. If set to 0, the time to live of a health check item will be infinite. If the TTL expires, the state will be considered unknown and the healthcheck item will be discarded.

A cache_control IntOpt will be provided to set the max-age value in the Cache-Control header. By default it will have the same value as the TTL config option. Setting this to 0 will disable the reporting of the header. Setting this to -1 will report Cache-Control: no-cache. Any other positive integer value will be used as the max-age.
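
The cache_control semantics can be summarized in a few lines. In this sketch the function name is hypothetical, and passing `None` stands in for the option being left unset, in which case the TTL value is used.

```python
def cache_control_header(cache_control, ttl):
    """Return the Cache-Control header value, or None to omit the header.

    cache_control=None stands in for "option not set" and falls back to ttl.
    """
    if cache_control is None:
        cache_control = ttl
    if cache_control < -1:
        raise ValueError("cache_control must be -1, 0 or a positive integer")
    if cache_control == 0:
        return None  # header reporting disabled
    if cache_control == -1:
        return "no-cache"
    return "max-age=%d" % cache_control


print(cache_control_header(None, 300))   # max-age=300
print(cache_control_header(-1, 300))     # no-cache
```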

Developer impact

Developers should be aware of the new decorator and consider whether it should be added to more functions, if that function is an indicator of the system’s health. Failures due to interactions with external systems such as Neutron port binding external events should be handled with caution. While failure to receive a port binding event will likely result in the failure to boot a VM, it should not be used as a health indicator for the nova-compute agent. This is because such a failure may be due to a failure in Neutron, not Nova. As such, other operations such as VM snapshot may be unaffected and the Nova compute service may be otherwise healthy. Any failure to connect to a non-OpenStack service such as the message bus, hypervisor, or database should be treated as a warn or fail health indicator if it prevents the Nova binary from functioning correctly.

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: balazs-gibizer
Other contributors:: melwitt

Feature Liaison

Feature liaison:: balazs-gibizer

Work Items

Add new module
Introduce decorator
Extend context object to store a reference to health check manager
Add config options
Expose TCP endpoint
Expose UNIX socket endpoint support
Add docs

Dependencies

None

Testing

This can be tested entirely with unit and functional tests; however, Devstack will be extended to expose the endpoint and use it to determine whether the Nova services have started.

Documentation Impact

The config options will be documented in the config reference and a release note will be added for the feature.

A new health check section will be added to the admin docs describing the current response format and how to enable the feature and its intended usage. This document should be evolved whenever the format changes or new functionality is added beyond the scope of this spec.

References

Yoga PTG topic:
https://etherpad.opendev.org/p/r.e70aa851abf8644c29c8abe4bce32b81#L415

History

Revisions
Release Name	Description
Yoga	Introduced
2023.1 Antelope	Reproposed
2024.1 Caracal	Reproposed
2024.2 Dalmatian	Reproposed
2025.1 Epoxy	Reproposed

libvirt driver launching instances with memory encryption by AMD SEV-ES

Tue, 10 Sep 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/amd-sev-es-libvirt-support

Problem description

Note

Use Cases

As a cloud administrator, in order that my users can have more confidence in the security of their running instances, I want to provide an image with specific properties or a flavor with specific extra specs, so that instances booted from them run on an SEV-ES-capable compute host with SEV-ES encryption, instead of SEV encryption, enabled.
As a cloud user, in order to reduce data leakage risks further, I want to be able to boot VM instances with SEV-ES functionality, instead of SEV functionality, enabled.

Proposed change

We propose extending the existing SEV implementation to support launching instances with SEV-ES functionality.

Add detection of host SEV-ES capabilities, which checks the following items.
- The presence of the following XML in the response from a libvirt virConnectGetDomainCapabilities() API call indicates that both QEMU and the AMD Secure Processor (AMD-SP) support SEV functionality:
```
<domainCapabilities>
  ...
  <features>
    ...
    <sev supported='yes'>
      ...
    </sev>
  </features>
</domainCapabilities>
```
  In addition, the maxESGuests field must be present with a positive (non-zero) value.
- /sys/module/kvm_amd/parameters/sev_es should have the value Y to indicate that the kernel has SEV-ES capabilities enabled. This file should be readable by any user (i.e. even non-root).
- Check QEMU version to determine whether the available QEMU binary supports SEV-ES.
Add the new HW_CPU_AMD_SEV_ES trait to os-traits.

+------------+     +----------------------------+
| compute RP +--+--+ SEV RP                     |
+------------+  |  | trait:HW_CPU_AMD_SEV       |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +----------------------------+
                +--+ SEV-ES RP                  |
                   | trait:HW_CPU_AMD_SEV_ES    |
                   +------------------------+---+
                   | MEM_ENCRYPTION_CONTEXT | N |
                   +------------------------+---+

The SEV RP is named <nodename>_amd_sev and the SEV-ES RP is named <nodename>_amd_sev_es, so that the RP names are unique in the cluster.
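
As a rough illustration of the host capability checks listed earlier in this section, the following Python sketch parses a domain capabilities document and reads the kvm_amd module parameter. It is a sketch under assumptions, not the driver's implementation: the function names and the sample XML values are hypothetical, and it assumes maxESGuests appears as a child element of <sev>.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of virConnectGetDomainCapabilities() output.
# The numeric values are made up for illustration.
SAMPLE_DOMCAPS = """
<domainCapabilities>
  <features>
    <sev supported='yes'>
      <maxGuests>100</maxGuests>
      <maxESGuests>15</maxESGuests>
    </sev>
  </features>
</domainCapabilities>
"""

# Path taken from the text above.
SEV_ES_KERNEL_PARAM = '/sys/module/kvm_amd/parameters/sev_es'


def max_sev_es_guests(domcaps_xml):
    """Return the advertised number of SEV-ES guests, or 0 if unsupported."""
    root = ET.fromstring(domcaps_xml)
    sev = root.find('./features/sev')
    if sev is None or sev.get('supported') != 'yes':
        return 0
    node = sev.find('maxESGuests')
    text = (node.text or '').strip() if node is not None else ''
    # A missing or non-numeric field is treated as "no SEV-ES slots".
    return int(text) if text.isdigit() else 0


def kernel_sev_es_enabled(path=SEV_ES_KERNEL_PARAM):
    """Return True if the kvm_amd module reports SEV-ES as enabled."""
    try:
        with open(path) as f:
            return f.read().strip() == 'Y'
    except OSError:
        # The file is absent when kvm_amd is not loaded.
        return False


def host_supports_sev_es(domcaps_xml, path=SEV_ES_KERNEL_PARAM):
    """Combine the libvirt and kernel checks (QEMU version check omitted)."""
    return max_sev_es_guests(domcaps_xml) > 0 and kernel_sev_es_enabled(path)
```

The QEMU version check is deliberately left out, since the text does not say how that version would be obtained.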

Note

+------------+     +----------------------------+
| compute RP +--+--+ SEV RP                     |
+------------+  |  | trait:HW_CPU_AMD_SEV       |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +----------------------------+
                +--+ SEV-ES RP                  |
                |  | trait:HW_CPU_AMD_SEV_ES    |
                |  | trait:HW_CPU_AMD_SEV_SNP   |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +-----------------------------+
                +--+ SEV-SNP RP                  |
                   | trait:HW_CPU_AMD_SEV_SNP_CH |
                   +------------------------+----+
                   | MEM_ENCRYPTION_CONTEXT | N  |
                   +------------------------+----+

Add support for a new hw:mem_encryption_model parameter in flavor extra specs, and a new hw_mem_encryption_model image property. When either of these is set to amd-sev-es along with the parameter/property to enable memory encryption, it would be internally translated to resources:MEM_ENCRYPTION_CONTEXT=1 and trait:HW_CPU_AMD_SEV_ES=required, which would be added to the flavor extra specs in the RequestSpec object. If the new model parameter/property is absent or set to amd-sev, then it would be translated to resources:MEM_ENCRYPTION_CONTEXT=1 and trait:HW_CPU_AMD_SEV=required. If conflicting models are requested by the instance flavor and the instance image (for example the flavor has hw:mem_encryption_model=amd-sev but the image has hw_mem_encryption_model=amd-sev-es), then the request is rejected. The request should also be rejected when a memory encryption model is requested but memory encryption itself is not.
Change the libvirt driver to include extra XML in the guest’s domain definition when the hw:mem_encryption_model parameter in flavor extra spec or the hw_mem_encryption_model image property is present and is set to amd-sev-es. The extra XML is mostly similar to the one used in SEV, but its guest policy field needs the SEV-ES bit (bit 2) enabled.
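
The translation rules above could be sketched in Python as follows. This is only an illustration under assumptions: the function and exception names are hypothetical, hw:mem_encryption and hw_mem_encryption are the enabling parameter and property from the existing SEV support (not named in this text), and rejecting unknown model values is an assumption of this sketch.

```python
MODEL_TRAITS = {
    'amd-sev': 'HW_CPU_AMD_SEV',
    'amd-sev-es': 'HW_CPU_AMD_SEV_ES',
}
DEFAULT_MODEL = 'amd-sev'


class InvalidRequest(Exception):
    """Hypothetical stand-in for the error raised on a rejected request."""


def _is_true(value):
    return str(value).lower() in ('1', 'true', 'yes', 'on')


def translate_mem_encryption(extra_specs, image_props):
    """Return the extra specs implied by the memory encryption request."""
    requested = (_is_true(extra_specs.get('hw:mem_encryption')) or
                 _is_true(image_props.get('hw_mem_encryption')))
    flavor_model = extra_specs.get('hw:mem_encryption_model')
    image_model = image_props.get('hw_mem_encryption_model')

    # Flavor and image must not disagree on the model.
    if flavor_model and image_model and flavor_model != image_model:
        raise InvalidRequest('conflicting memory encryption models')
    model = flavor_model or image_model

    # Assumption: unknown model names are rejected.
    if model is not None and model not in MODEL_TRAITS:
        raise InvalidRequest('unknown memory encryption model: %s' % model)

    if not requested:
        # A model without memory encryption itself is rejected.
        if model is not None:
            raise InvalidRequest('model requested without memory encryption')
        return {}

    trait = MODEL_TRAITS[model or DEFAULT_MODEL]
    return {
        'resources:MEM_ENCRYPTION_CONTEXT': '1',
        'trait:%s' % trait: 'required',
    }
```

Keeping the model-to-trait mapping in one dictionary means a further model would only need one new entry plus its trait.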

Note

Alternatives

None

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

The end user will harness SEV-ES through the existing mechanisms of resources in flavor extra specs and image properties.

The limitations of AMD SEV-encrypted guests also apply when SEV-ES is used.

Performance Impact

No performance impact on nova is anticipated.

The performance impact on other components is the same as for the existing SEV support feature.

Other deployer impact

In order for users to be able to use SEV-ES, the operator will need to perform the following steps:

Deploy SEV-ES-capable hardware as nova compute hosts.
- AMD EPYC 7xx2 (Rome) or later
Set minimum ASID for SEV (non-ES) guests in BIOS (or UEFI) to a value greater than 0.

Note

If SEV-enabled instances are already running on the compute node, enough ASIDs should be reserved for SEV.
Ensure that they have an appropriately configured software stack, so that the various layers are all SEV-ES ready:
- kernel >= 4.16
- QEMU >= 6.0.0
- libvirt >= 8.0.0
- ovmf >= commit 7f0b28415cb4 2020-08-12
Note

SEV-ES enabled guests can be launched by libvirt >= 4.5, but detection of the maximum number of SEV-ES guests via the domain capabilities API requires libvirt >= 8.0.0.

A cloud administrator will need to define SEV-ES-enabled flavors as described above, unless it is sufficient for users to define SEV-ES-enabled images.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee: kajinamit (irc: tkajinam)
Other contributors: None

Work Items

Add the new HW_CPU_AMD_SEV_ES trait for os-traits
Add detection of host SEV-ES capabilities as detailed above, and reshape the existing MEM_ENCRYPTION_CONTEXT resource.
Add mem_encryption_model property to ImageMeta object
Update scheduler util to request MEM_ENCRYPTION_CONTEXT resource and HW_CPU_AMD_SEV_ES trait when the mem_encryption_model property or the equivalent flavor extra spec is set to amd-sev-es
Update libvirt driver to set the SEV-ES policy bit when the property is present.
Update image property schema in glance to validate the new mem_encryption_model property.
Update documentation.

Unit tests and functional tests should be added according to new logic.

Future work

None

Dependencies

Special hardware which supports SEV-ES for development, testing, and CI.
Recent versions of the hypervisor software stack which all support SEV-ES, as detailed in Other deployer impact above.

Testing

The fakelibvirt test driver will need adaptation to emulate SEV-ES-capable hardware.

Corresponding unit/functional tests will need to be extended or added to cover:

detection of SEV-ES-capable hardware and software, e.g. perhaps as an extension of nova.tests.functional.libvirt.test_report_cpu_traits.LibvirtReportTraitsTests
the use of a trait to include extra SEV-specific libvirt domain XML configuration, e.g. within nova.tests.unit.virt.libvirt.test_config

Documentation Impact

Update the entry in the Feature Support Matrix to explain that AMD SEV-ES is now supported in addition to AMD SEV.
Update the existing AMD SEV guide to include information about SEV-ES.

Other non-nova documentation should be updated too:

The documentation for os-traits should be extended where appropriate.

References

History

Revisions
Release Name	Description
2024.2 Dalmatian	Approved
2025.1 Epoxy	Re-proposed

libvirt driver launching instances with stateless firmware

Mon, 09 Sep 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-stateless-firmware

Since v8.6.0, libvirt allows launching instances with stateless firmware, which removes a potential attack surface available to the hypervisor. This work aims to add the support needed for users to take advantage of this feature.

Problem description

Libvirt v8.6.0 introduced a new feature to launch instances with stateless firmware. When an instance is launched with this feature enabled along with UEFI, the instance uses a read-only firmware image without an NVRAM file. This feature is useful for confidential computing use cases, because it prevents the hypervisor from injecting firmware variables. It also allows more complete measurement of the elements involved in the boot chain of the instance, which is a key requirement of remote attestation. This is described in the libvirt guide.

However, this libvirt feature can’t be enabled in instances launched by current nova, because nova does not set the stateless option. Nova also always injects an nvram file into the libvirt domain XML.

Use Cases

As a cloud administrator, in order that my users can have more confidence in the security of their running instances, I want to allow my users to enforce stateless firmware for their instances.
As a user, I want to prevent the risk caused by firmware variables injected by the hypervisor, for instances which load very confidential data.

Proposed change

We propose adding a new image property to request stateless firmware, so that users can create their instances with stateless firmware.

Add the new COMPUTE_SECURITY_STATELESS_FIRMWARE trait to os-traits.
Make the libvirt driver check the current version of libvirt and report the supports_stateless_firmware capability when the version is equal to or newer than v8.6.0. This capability should be mapped to the COMPUTE_SECURITY_STATELESS_FIRMWARE trait.
Add the new hw_firmware_stateless image property, which accepts boolean values and is false by default. If the property is set to true, then nova translates it to requiring the COMPUTE_SECURITY_STATELESS_FIRMWARE trait.
Change the libvirt driver to add the stateless option to the loader element of the libvirt domain XML and skip injecting the nvram file, if the instance metadata contains the hw_firmware_stateless property set to true.
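
The version gate and the domain XML change described above might look roughly like this in Python. It is a minimal sketch rather than nova's code: the function names, the firmware paths in the usage lines, and the choice of the readonly and type attributes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Minimum libvirt version named in this spec.
MIN_LIBVIRT_STATELESS_FIRMWARE = (8, 6, 0)


def supports_stateless_firmware(libvirt_version):
    """Return True if a (major, minor, micro) version is new enough."""
    return tuple(libvirt_version) >= MIN_LIBVIRT_STATELESS_FIRMWARE


def build_os_element(loader_path, nvram_path=None, stateless=False):
    """Build a simplified <os> element for a UEFI guest."""
    os_el = ET.Element('os')
    loader = ET.SubElement(os_el, 'loader', readonly='yes', type='pflash')
    loader.text = loader_path
    if stateless:
        # Stateless firmware: mark the loader and skip the nvram file.
        loader.set('stateless', 'yes')
    elif nvram_path is not None:
        ET.SubElement(os_el, 'nvram').text = nvram_path
    return ET.tostring(os_el, encoding='unicode')


# Hypothetical paths, for illustration only.
stateful_xml = build_os_element('/fw/CODE.fd', nvram_path='/fw/guest_VARS.fd')
stateless_xml = build_os_element('/fw/CODE.fd', nvram_path='/fw/guest_VARS.fd',
                                 stateless=True)
```

Passing the nvram path in both calls shows that the stateless flag takes precedence, which matches the "skip injecting the nvram file" wording above.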

Alternatives

None

Data model impact

A new trait and a new image property will be used to report the availability of, and to request, the stateless firmware feature in libvirt.

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

The end user will be able to use stateless firmware for their instances through the existing image property mechanism.

Performance Impact

None

Other deployer impact

In order for users to be able to use this feature, the operator will need to deploy libvirt v8.6.0 or later in the deployment.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee: kajinamit (irc: tkajinam)
Other contributors: None

Work Items

Add the new COMPUTE_SECURITY_STATELESS_FIRMWARE trait to os-traits.
Make libvirt driver check libvirt version and present availability of stateless firmware in compute node capabilities, as the COMPUTE_SECURITY_STATELESS_FIRMWARE trait, based on the detected version.
Add the new hw_firmware_stateless image property to the ImageMeta object
Update scheduler util to require COMPUTE_SECURITY_STATELESS_FIRMWARE trait when the hw_firmware_stateless property in instance image properties is set to true
Make libvirt driver set stateless="yes" in the loader element when instance image properties contain the hw_firmware_stateless property set to true.
Update documentation
Update image property schema in glance to validate the new hw_firmware_stateless property.

Unit tests and functional tests should be added according to new logic.

Future work

None

Dependencies

Libvirt v8.6.0 or later.

Testing

The fakelibvirt test driver will need adaptation to emulate libvirt older than v8.6.0 and libvirt v8.6.0 or later.

Corresponding unit/functional tests will need to be extended or added to cover:

detection of the stateless firmware support by libvirt
the use of a trait to include the extra stateless loader option in the domain XML configuration.

Documentation Impact

Update the Feature Support Matrix, to include stateless firmware support.
Update the existing AMD SEV guide to include information about stateless firmware.

References

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced

OpenAPI Schemas

Mon, 09 Sep 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/openapi

Problem description

Use Cases

As an end user, I would like to have access to machine-readable, fully validated documentation for the APIs I will be interacting with.

As an end user, I want statically viewable documentation hosted as part of the existing docs site without requiring a running instance of Nova.

As an SDK/client developer, I would like to be able to auto-generate bindings and clients, promoting consistency and minimising the amount of manual work needed to develop and maintain these.

As a Nova developer, I would like to have a verified API specification that I can use should I need to replace the web framework/libraries we use in the event they are no longer maintained.

Proposed change

This effort can be broken into a number of distinct steps:

Add a new decorator for removed APIs and actions

We have a number of APIs and actions that no longer have backing code and return HTTP 410 (Gone) or HTTP 400 (Bad Request), respectively. We will not add schemas for these in the initial attempt at this so we need some mechanism to indicate this. We will add a new removed decorator that will highlight these removed APIs and indicate the version they were removed in and the reason for their removal. We can later use this as a heuristic in our tests to skip schema checks for these methods.
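
As an illustration only, such a marker decorator and the test heuristic could be sketched in Python as below. The attribute names, the helper function, the example controller, and the version and reason strings are all hypothetical.

```python
def removed(version, reason):
    """Mark an API method as removed, recording when and why."""
    def decorator(func):
        # Only metadata is attached; the method's behaviour is unchanged.
        func.removed = True
        func.removed_version = version
        func.removed_reason = reason
        return func
    return decorator


def needs_schema_check(method):
    """Heuristic for tests: skip schema checks for removed methods."""
    return not getattr(method, 'removed', False)


class ExampleController:
    """Hypothetical controller used only to exercise the decorator."""

    @removed('1.0.0', 'example reason: backing service was dropped')
    def index(self, req):
        raise NotImplementedError('would return HTTP 410 (Gone)')

    def show(self, req, id):
        return {'id': id}


# Collect the methods a schema-coverage test would still have to check.
to_check = [
    name for name in ('index', 'show')
    if needs_schema_check(getattr(ExampleController, name))
]
```

Because the decorator only attaches attributes, it can be applied without changing how the removed methods respond.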
Add missing request body and query string schemas

There is already good coverage of both request bodies and query string parameters but it is not complete. A list of incomplete schemas is given at the end of this section. The additional schemas will merely validate what is already allowed, which will mean extensive use of "additionalProperties": true or empty schemas. Put another way, an API that currently ignores unexpected request body fields or query string parameters will continue to ignore them. We may wish to make these stricter, as we did for most APIs in microversion 2.75, but that is a separate issue that should be addressed separately.

Once these specs are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate schemas.
Add response body schemas

These will be sourced from existing OpenAPI schemas, currently published at github.com/gtema/openstack-openapi, from Tempest’s API schemas, and where necessary from new schemas auto-generated from JSON response bodies generated in tests and manually modified to handle things like enum values.

Once these are added, tests will be added to ensure all non-deprecated and non-removed API resources have appropriate response body schemas. In addition, we will add a new configuration option that will control how we do verification at the API layer, [api] response_validation. This will be an enum value with three options:

error
Raise an HTTP 500 (Server Error) in the event that an API returns an “invalid” response.

This will be the default in CI, i.e. for our unit, functional and integration tests. This should not be used in production. The help text of the option will indicate this and we will mark the option as advanced.

warn
Log a warning about an “invalid” response, prompting operators to file a bug report against Nova.

This will be the initial (and likely permanent) default in production.

ignore
Disable API response body validation entirely. This is an escape hatch in case we mess up.
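
The three modes above could be handled along the following lines. This Python sketch is an assumption-laden illustration: the function name, the exception class, and the validate callable (which stands in for real JSON Schema validation and raises ValueError on a mismatch) are hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)

# The three values of [api] response_validation described above.
MODES = ('error', 'warn', 'ignore')


class InvalidResponse(Exception):
    """Hypothetical error that the API layer would map to HTTP 500."""


def check_response(body, validate, mode='warn'):
    """Validate a response body according to the configured mode."""
    if mode not in MODES:
        raise ValueError('unknown response_validation mode: %s' % mode)
    if mode == 'ignore':
        # Escape hatch: skip validation entirely.
        return body
    try:
        validate(body)
    except ValueError as exc:
        if mode == 'error':
            raise InvalidResponse(str(exc)) from exc
        # 'warn': log and return the response unchanged.
        LOG.warning('Invalid API response, please report a bug: %s', exc)
    return body
```

In warn mode the caller still receives the original body, so an incorrect schema never breaks a production API call.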

Note

Alternatives

Use a different tool

We could use a different tool than OpenAPI to publish our specs. In a manner of speaking we already do this - albeit not in a machine-readable manner - through our use of os-api-ref.

This idea has been rejected because OpenAPI is clearly the best tool for the job. It is the most widely used API description language available today and aligns well with our existing use of JSON Schema for API validation. While it does not support OpenStack’s microversion API design pattern out-of-the-box, previous experiments have demonstrated that it is extensible enough to add this.
Maintain these specs out-of-tree

We could use a separate repo to store and maintain specs for Nova and the other OpenStack services.

This idea has been rejected because it prevents us testing the specs on each commit to Nova and means work that could be spread across multiple teams is instead focused on one small team. It will result in more bugs and a lag between changes to the Nova API and changes to the out-of-tree specs. It will result in duplication of effort across Nova, Tempest, and the specs projects.
Publish the spec via an API resource rather than in our docs

We could publish the spec via a new, unversioned API endpoint such as /spec. A GET request to this would return the full spec, either statically generated at deployment time or dynamically generated (and then cached) at runtime.

This is rejected because it brings limited advantages and multiple disadvantages. Nova’s API is designed to be backwards-compatible and non-extensible. As such, a user with the latest version of the spec should be able to use it to communicate with any OpenStack deployment running a version of Nova that supports microversions. It is also expected that the “master” version of the spec will continuously improve as things are tightened up, documentation is improved, and bugs or mistakes are corrected. We want consumers of the spec to see these changes immediately rather than wait for their deployment to be updated. Finally, OpenStack’s previous forays into discoverable APIs, such as Keystone’s use of JSONHome or Glance’s attempts to publish resource schemas, have seen limited take-up outside of the projects themselves. Taken together, this all suggests there is no reason or advantage to publishing deployment-specific specs and users would be better served by fetching the latest version of the spec from the api-ref documentation published on docs.openstack.org (which, one should note, is itself intentionally unversioned).

Data model impact

None.

REST API impact

We may wish to address issues that are uncovered as we add schemas, but this work is considered secondary to this effort and can be tackled separately.

Security impact

None.

Notifications impact

None.

Other end user impact

Performance Impact

Other deployer impact

Developer impact

Developers working on the API microversions will now be encouraged to provide JSON Schema schemas for both requests and responses.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee: stephenfinucane
Other contributors: gtema

Feature Liaison

None.

Work Items

Add missing request body schemas
Add tests to validate existence of request body schemas
Add missing query string schemas
Add tests to validate existence of query string schemas
Add response body schemas
Add decorator to validate response body schemas against response
Add tests to validate existence of response body schemas

Dependencies

Testing

Unit tests will ensure that schemas eventually exist for request bodies, query strings, and response bodies.

Unit, functional and integration tests will all work together to ensure that response body schemas match real responses by setting [api] response_validation to error.

Documentation Impact

References

APIs missing schemas

These are the APIs that are currently (as of 2024-04-11, commit 1bca24aeb) missing API request body schemas and query string schemas.

Missing request body schemas

AdminActionsController._inject_network_info
AdminActionsController._reset_network
AgentController.create
AgentController.update
BareMetalNodeController._add_interface
BareMetalNodeController._remove_interface
BareMetalNodeController.create
CellsController.create
CellsController.sync_instances
CellsController.update
CertificatesController.create
CloudpipeController.create
CloudpipeController.update
ConsolesController.create
DeferredDeleteController._force_delete
DeferredDeleteController._restore
FixedIPController.reserve
FixedIPController.unreserve
FloatingIPBulkController.create
FloatingIPBulkController.update
FloatingIPController.create
FloatingIPDNSDomainController.update
FloatingIPDNSEntryController.update
LockServerController._unlock
NetworkAssociateActionController._associate_host
NetworkAssociateActionController._disassociate_host_only
NetworkAssociateActionController._disassociate_project_only
NetworkController._disassociate_host_and_project
NetworkController.add
NetworkController.create
PauseServerController._pause
PauseServerController._unpause
RemoteConsolesController.get_rdp_console
RescueController._unrescue
SecurityGroupActionController._addSecurityGroup
SecurityGroupActionController._removeSecurityGroup
SecurityGroupController.create
SecurityGroupController.update
SecurityGroupDefaultRulesController.create
SecurityGroupRulesController.create
ServersController._action_confirm_resize
ServersController._action_revert_resize
ServersController._start_server
ServersController._stop_server
ShelveController._shelve
ShelveController._shelve_offload
SuspendServerController._resume
SuspendServerController._suspend
TenantNetworkController.create

Missing request query string schemas

AgentController.index
AggregateController.index
AggregateController.show
AvailabilityZoneController.detail
AvailabilityZoneController.index
BareMetalNodeController.index
BareMetalNodeController.show
CellsController.capacities
CellsController.detail
CellsController.index
CellsController.info
CellsController.show
CertificatesController.show
CloudpipeController.index
ConsoleAuthTokensController.show
ConsolesController.index
ConsolesController.show
ExtensionInfoController.index
ExtensionInfoController.show
FixedIPController.show
FlavorAccessController.index
FlavorExtraSpecsController.index
FlavorExtraSpecsController.show
FlavorsController.show
FloatingIPBulkController.index
FloatingIPBulkController.show
FloatingIPController.index
FloatingIPController.show
FloatingIPDNSDomainController.index
FloatingIPDNSEntryController.show
FloatingIPPoolsController.index
FpingController.index
FpingController.show
HostController.reboot
HostController.show
HostController.shutdown
HostController.startup
HypervisorsController.detail
HypervisorsController.index
HypervisorsController.search
HypervisorsController.servers
HypervisorsController.show
HypervisorsController.statistics
HypervisorsController.uptime
IPsController.index
IPsController.show
ImageMetadataController.index
ImageMetadataController.show
ImagesController.detail
ImagesController.index
ImagesController.show
InstanceActionsController.index
InstanceActionsController.show
InstanceUsageAuditLogController.index
InstanceUsageAuditLogController.show
InterfaceAttachmentController.index
InterfaceAttachmentController.show
NetworkController.index
NetworkController.show
QuotaClassSetsController.show
QuotaSetsController.defaults
QuotaSetsController.detail
QuotaSetsController.show
SecurityGroupController.show
SecurityGroupDefaultRulesController.index
SecurityGroupDefaultRulesController.show
ServerDiagnosticsController.index
ServerGroupController.show
ServerMetadataController.index
ServerMetadataController.show
ServerMigrationsController.index
ServerMigrationsController.show
ServerPasswordController.index
ServerSecurityGroupController.index
ServerTagsController.index
ServerTagsController.show
ServerTopologyController.index
ServerVirtualInterfaceController.index
ServersController.show
SnapshotController.show
TenantNetworkController.index
TenantNetworkController.show
VersionsController.show
VolumeAttachmentController.show
VolumeController.show

Note

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced

libvirt driver launching instances with memory encryption by AMD SEV-ES

Sun, 21 Jul 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/amd-sev-es-libvirt-support

Problem description

Note

Use Cases

As a cloud administrator, in order that my users can have more confidence in the security of their running instances, I want to provide an image with specific properties or a flavor with specific extra specs, so that users can boot instances that are guaranteed to run on an SEV-ES-capable compute host with SEV-ES encryption, instead of SEV encryption, enabled.
As a cloud user, in order to reduce data leakage risks further, I want to be able to boot VM instances with SEV-ES functionality, instead of SEV functionality, enabled.

Proposed change

We propose extending the existing SEV implementation to support launching instances with SEV-ES functionality.

Add detection of host SEV-ES capabilities, which checks the following items.
- The presence of the following XML in the response from a libvirt virConnectGetDomainCapabilities() API call indicates that both QEMU and the AMD Secure Processor (AMD-SP) support SEV functionality:
```
<domainCapabilities>
  ...
  <features>
    ...
    <sev supported='yes'>
      ...
    </sev>
  </features>
</domainCapabilities>
```
  In addition, the maxESGuests field must be present with a positive (non-zero) value.
- /sys/module/kvm_amd/parameters/sev_es should have the value Y to indicate that the kernel has SEV-ES capabilities enabled. This file should be readable by any user (i.e. even non-root).
- Check QEMU version to determine whether the available QEMU binary supports SEV-ES.
Add the new HW_CPU_AMD_SEV_ES trait to os-traits.

+------------+     +----------------------------+
| compute RP +--+--+ SEV RP                     |
+------------+  |  | trait:HW_CPU_AMD_SEV       |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +----------------------------+
                +--+ SEV-ES RP                  |
                   | trait:HW_CPU_AMD_SEV_ES    |
                   +------------------------+---+
                   | MEM_ENCRYPTION_CONTEXT | N |
                   +------------------------+---+

The SEV RP is named <nodename>_amd_sev and the SEV-ES RP is named <nodename>_amd_sev_es, so that the RP names are unique in the cluster.
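
The provider layout in the diagram above can be expressed as a small Python sketch. The function name and the simplified dictionary shape are hypothetical, and omitting a provider when it has no available slots is an assumption of this sketch rather than something the text states.

```python
MEM_ENCRYPTION_CONTEXT = 'MEM_ENCRYPTION_CONTEXT'


def mem_encryption_providers(nodename, max_sev_guests, max_sev_es_guests):
    """Describe the child resource providers under the compute node RP."""
    candidates = (
        ('%s_amd_sev' % nodename, 'HW_CPU_AMD_SEV', max_sev_guests),
        ('%s_amd_sev_es' % nodename, 'HW_CPU_AMD_SEV_ES', max_sev_es_guests),
    )
    providers = {}
    for name, trait, total in candidates:
        # Assumption: a model with no slots gets no provider at all.
        if total > 0:
            providers[name] = {
                'traits': [trait],
                'inventory': {MEM_ENCRYPTION_CONTEXT: total},
            }
    return providers


# Hypothetical node name and slot counts.
tree = mem_encryption_providers('compute1', max_sev_guests=100,
                                max_sev_es_guests=15)
```

Prefixing each name with the node name is what keeps the provider names unique across compute nodes.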

Note

+------------+     +----------------------------+
| compute RP +--+--+ SEV RP                     |
+------------+  |  | trait:HW_CPU_AMD_SEV       |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +----------------------------+
                +--+ SEV-ES RP                  |
                |  | trait:HW_CPU_AMD_SEV_ES    |
                |  | trait:HW_CPU_AMD_SEV_SNP   |
                |  +------------------------+---+
                |  | MEM_ENCRYPTION_CONTEXT | N |
                |  +------------------------+---+
                |
                |  +-----------------------------+
                +--+ SEV-SNP RP                  |
                   | trait:HW_CPU_AMD_SEV_SNP_CH |
                   +------------------------+----+
                   | MEM_ENCRYPTION_CONTEXT | N  |
                   +------------------------+----+

Add support for a new hw:mem_encryption_model parameter in flavor extra specs, and a new hw_mem_encryption_model image property. When either of these is set to amd-sev-es along with the parameter/property to enable memory encryption, it would be internally translated to resources:MEM_ENCRYPTION_CONTEXT=1 and trait:HW_CPU_AMD_SEV_ES=required, which would be added to the flavor extra specs in the RequestSpec object. If the new model parameter/property is absent or set to amd-sev, then it would be translated to resources:MEM_ENCRYPTION_CONTEXT=1 and trait:HW_CPU_AMD_SEV=required. If conflicting models are requested by the instance flavor and the instance image (for example the flavor has hw:mem_encryption_model=amd-sev but the image has hw_mem_encryption_model=amd-sev-es), then the request is rejected. The request should also be rejected when a memory encryption model is requested but memory encryption itself is not.
Change the libvirt driver to include extra XML in the guest’s domain definition when the hw:mem_encryption_model parameter in flavor extra spec or the hw_mem_encryption_model image property is present and is set to amd-sev-es. The extra XML is mostly similar to the one used in SEV, but its guest policy field needs the SEV-ES bit (bit 2) enabled.
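
The guest policy change mentioned above amounts to setting one bit. A minimal Python sketch follows; the function names are hypothetical and the base policy value 0x0033 is an illustrative assumption, not a value taken from this spec.

```python
# Bit 2 of the guest policy enables SEV-ES, as stated above.
SEV_ES_POLICY_BIT = 1 << 2


def guest_policy(base_policy, model):
    """Return the launch security policy for the requested model."""
    if model == 'amd-sev-es':
        return base_policy | SEV_ES_POLICY_BIT
    return base_policy


def format_policy(policy):
    """Render the policy as the hex string used in domain XML."""
    return '0x%04x' % policy


# Assumed base policy for plain SEV guests (illustrative only).
BASE_POLICY = 0x0033

sev_policy = format_policy(guest_policy(BASE_POLICY, 'amd-sev'))
sev_es_policy = format_policy(guest_policy(BASE_POLICY, 'amd-sev-es'))
```

Using a bitwise OR leaves any other policy bits of the base value untouched.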

Note

Alternatives

None

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

The end user will harness SEV-ES through the existing mechanisms of resources in flavor extra specs and image properties.

The limitations of AMD SEV-encrypted guests also apply when SEV-ES is used.

Performance Impact

No performance impact on nova is anticipated.

The performance impact on other components is the same as for the existing SEV support feature.

Other deployer impact

In order for users to be able to use SEV-ES, the operator will need to perform the following steps:

Deploy SEV-ES-capable hardware as nova compute hosts.
- AMD EPYC 7xx2 (Rome) or later
Set minimum ASID for SEV (non-ES) guests in BIOS (or UEFI) to a value greater than 0.

Note

If SEV-enabled instances are already running on the compute node, enough ASIDs should be reserved for SEV.
Ensure that they have an appropriately configured software stack, so that the various layers are all SEV-ES ready:
- kernel >= 4.16
- QEMU >= 6.0.0
- libvirt >= 8.0.0
- ovmf >= commit 7f0b28415cb4 2020-08-12
Note

SEV-ES enabled guests can be launched by libvirt >= 4.5, but detection of the maximum number of SEV-ES guests via the domain capabilities API requires libvirt >= 8.0.0.

A cloud administrator will need to define SEV-ES-enabled flavors as described above, unless it is sufficient for users to define SEV-ES-enabled images.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee: kajinamit (irc: tkajinam)
Other contributors: None

Work Items

Add the new HW_CPU_AMD_SEV_ES trait for os-traits
Add detection of host SEV-ES capabilities as detailed above, and reshape the existing MEM_ENCRYPTION_CONTEXT resource.
Add mem_encryption_model property to ImageMeta object
Update scheduler util to request MEM_ENCRYPTION_CONTEXT resource and HW_CPU_AMD_SEV_ES trait when the mem_encryption_model property or the equivalent flavor extra spec is set to amd-sev-es
Update libvirt driver to set the SEV-ES policy bit when the property is present.
Update image property schema in glance to validate the new mem_encryption_model property.
Update documentation.

Unit tests and functional tests should be added according to new logic.

Future work

None

Dependencies

Special hardware which supports SEV-ES for development, testing, and CI.
Recent versions of the hypervisor software stack which all support SEV-ES, as detailed in Other deployer impact above.

Testing

The fakelibvirt test driver will need adaptation to emulate SEV-ES-capable hardware.

Corresponding unit/functional tests will need to be extended or added to cover:

detection of SEV-ES-capable hardware and software, e.g. perhaps as an extension of nova.tests.functional.libvirt.test_report_cpu_traits.LibvirtReportTraitsTests
the use of a trait to include extra SEV-specific libvirt domain XML configuration, e.g. within nova.tests.unit.virt.libvirt.test_config

Documentation Impact

Update the entry in the Feature Support Matrix to explain that AMD SEV-ES is now supported in addition to AMD SEV.
Update the existing AMD SEV guide to include information about SEV-ES.

Other non-nova documentation should be updated too:

The documentation for os-traits should be extended where appropriate.

References

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced

Example Spec - The title of your blueprint

Tue, 16 Jul 2024 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it’s possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in ReSTructured text, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ascii diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ascii diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which can not be viewed in gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated, for example if you need to take into account existing instances, or modify other existing data describe how that will work.

REST API impact

Each API method which is either added or changed should have the following

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal http response code(s)
- Expected error http response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - URL should not include underscores, and use hyphens instead.
- Parameters which can be passed via the url
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to login?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds or removes fields to the API responses. For example when the feature adds a new field to the GET /servers API response consider adding similar information to the payload of the instance action notifications
When the feature adds a new action to the existing API entities. For example adding a new action to the server might mean you want to emit a corresponding new instance action notification
When the feature adds a new resource (noun) to the REST API consider adding new notifications about the creation and deletion of such resource

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it’s merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others. For example, extending the size of an attached volume. Since Nova does not yet have any type of sophisticated capabilities API so a user can know what actions can be performed on a given instance, consider adding a new policy rule to at least let operators that cannot support a virt-specific feature disable it in their cloud which is at least presented to the user in an understandable way by getting a 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However we suggest to find a liaison for your feature as it will help getting your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of library?

Testing

Is this untestable in gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc).

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section intended to be used each time the spec is updated, to describe new design, API changes or any database schema updates. Useful to let readers understand what has happened over time.

Revisions
Release Name	Description
2025.1 Epoxy	Introduced

Config option to control behavior of unset unified limits

Wed, 10 Jul 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/unified-limits-nova-unset-limits

Problem description

We want to be able to change the default quota driver to the UnifiedLimitsDriver, but the aforementioned behavior raises concerns about changing the default.

Use Cases

As an admin/operator, I would like to be warned if I have missed setting a registered limit for a resource in Keystone, rather than having all API requests involving that resource be rejected for being over quota.

Proposed change

The proposal in this spec is to add a new configuration option [quota]strict_unified_limits which defaults to True. When set to True, the Nova API will use the native oslo.limit behavior of considering unset unified limits as zero. When set to False, the Nova API will consider unset unified limits as unlimited or “don’t care”.

The only exception to [quota]strict_unified_limits = False is if there are no registered limits set at all. Registered limits are default limits that are global to the deployment and apply in any case that a project-specific limit has not been set. If unified limits are enabled but no registered limits have been set, all quota checks will fail and log a warning message about the total absence of any limits set every time quota is enforced. The combination of unified limits enabled but no unified limits set is considered to be an error state and not something the admin/operator has intended. We could also consider failing to start the nova-api and nova-conductor services if unified limits are enabled but no limits are set.

The idea of the proposed config option is to give the admin/operator some flexibility to resolve a situation where not all resources have registered limits set without immediately rejecting API requests. Of course, there will be the risk of potentially allowing allocation of more resources than would be desired until the admin/operator either sets registered limits or disables unified limits quotas. A warning will be logged every time quota is enforced for resources without registered limits set because we don’t want or expect unset limits to be a permanent state. The admin/operator can stop the warning logs by setting registered limits for the resources listed in the warning message.
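
The rules above could be sketched as follows. This is an illustration under stated assumptions, not the Nova or oslo.limit implementation: effective_limit, over_quota, the UNLIMITED sentinel and the dict of registered limits are all hypothetical names chosen for this example.

```python
import logging

LOG = logging.getLogger(__name__)

# Hypothetical sentinel meaning "don't care"; not an oslo.limit constant.
UNLIMITED = -1


def effective_limit(resource, registered_limits, strict):
    """Return the limit to enforce for a resource, following the rules above."""
    if not registered_limits:
        # No registered limits at all is treated as an error state in
        # either mode, so every quota check fails.
        LOG.warning("Unified limits are enabled but no registered limits are set")
        return 0
    if resource in registered_limits:
        return registered_limits[resource]
    if strict:
        # Native oslo.limit behavior: an unset limit counts as zero.
        return 0
    # Relaxed mode: warn on every enforcement and treat as unlimited.
    LOG.warning("No registered limit set for %s; treating it as unlimited", resource)
    return UNLIMITED


def over_quota(usage, requested, limit):
    """Return True if the request would exceed the given limit."""
    return limit != UNLIMITED and usage + requested > limit
```

With strict set to False and a registered limit only for servers, a request for class:VCPU would be allowed and a warning logged, while a deployment with no registered limits at all would reject everything.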

Alternatives

Alternatively a change could be made to the oslo.limit library to handle missing registered limits differently [1]. This would be more difficult because oslo.limit has established default behavior and providing new behavior desirable for all projects may not be realistic.

Data model impact

None

REST API impact

None

Security impact

If [quota]strict_unified_limits is set to False, resources could be allocated beyond what the admin/operator would have intended during the window of time between the logging of the warning and the admin/operator taking action to either set registered limit(s) or disable unified limits quotas.

Notifications impact

None

Other end user impact

As part of this work, the nova-manage limits migrate_to_unified_limits CLI command will be enhanced to scan the database for resources in flavors that do not have registered limits set and show them in the output. The intent is to help admins/operators catch all resources and set limits for them before unified limits quotas are enabled.
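
The scan described above amounts to a set difference between the resources referenced by flavors and the resources that have registered limits. A minimal sketch, assuming flavors are represented as dicts with a hypothetical "resources" list (the real command would read flavors from the database and limits from Keystone):

```python
def resources_missing_limits(flavors, registered_limits):
    """Return sorted resource names used by flavors but lacking a limit."""
    used = set()
    for flavor in flavors:
        # Hypothetical flavor shape: {"name": ..., "resources": [...]}
        used.update(flavor.get("resources", ()))
    return sorted(used - set(registered_limits))


flavors = [
    {"name": "small", "resources": ["class:VCPU", "class:MEMORY_MB"]},
    {"name": "gpu", "resources": ["class:VCPU", "class:VGPU"]},
]
missing = resources_missing_limits(flavors, {"class:VCPU": 20})
```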

Performance Impact

The performance impact of having [quota]strict_unified_limits set to False should be relatively small as it adds one extra Keystone API call each time a quota check fails and the limit for the associated resource is 0.

Other deployer impact

Admin/operators will need to be prepared and set [quota]strict_unified_limits to False before upgrading Nova if they wish to relax quota checks initially when enabling unified limits quotas.

Developer impact

None

Upgrade impact

The [quota]strict_unified_limits config option would only impact an upgrade if the admin/operator sets it to True at the same time they enable unified limits quotas by using the UnifiedLimitsDriver.

If a deployer decides to switch to the UnifiedLimitsDriver during their upgrade and set [quota]strict_unified_limits to False before upgrading, there is a possibility that resources could be allocated beyond what the deployer would have intended until they take action on the logged warnings and set registered limits for resources missing limits.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: None

Feature Liaison

Feature liaison:: melwitt

Work Items

Add a configuration option to control whether unset unified limits should be considered unlimited and logged as a warning
Augment the nova-manage limits migrate_to_unified_limits command to scan database flavors to detect resources that do not have registered limits set and show them in the output to the user to let them know which limits they need to set

Dependencies

Related to https://specs.openstack.org/openstack/nova-specs/specs/yoga/implemented/unified-limits-nova.html

Testing

The functionality of the [quota]strict_unified_limits config option will be tested by writing new functional tests.

Documentation Impact

The unified limits documentation will be updated to include information about the new config option.

References

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced

Per Process Healthcheck endpoints

Fri, 10 May 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks

Problem description

Monitoring the health of a Nova service today requires experience to develop and implement a series of external heuristics that infer the state of the service binaries.

The existing Oslo middleware does not address this problem statement because:

It can only be used by the API and metadata binaries
The middleware does not tell you the service is alive if it’s hosted by a WSGI server like Apache, since the middleware is executed independently from the WSGI application, i.e. the middleware can pass while the nova-api can’t connect to the DB and is otherwise broken.
The Oslo middleware in detailed mode leaks information about the host kernel, Python version and hostname, which can be used to determine if the host is vulnerable to CVEs. This means it should never be exposed to the Internet. e.g.

platform: 'Linux-5.15.2-xanmod1-tt-x86_64-with-glibc2.2.5',
python_version: '3.8.12 (default, Aug 30 2021, 16:42:10) \n[GCC 10.3.0]'

Use Cases

As an operator, I want a simple REST endpoint I can consume to know if a Nova process is healthy.

As an operator I want this health check to not impact the performance of the service so it can be queried frequently at short intervals.

As an operator I would like to be able to use health-check of the Nova API and metadata services to manage the membership of endpoints in my load-balancer or reverse proxy automatically.

Proposed change

Definitions

TTL: The time interval for which a health check item is valid.

pass: all health indicators are passing and their TTLs have not expired.

warn: any health indicator has an expired TTL or there is a partial transient failure.

fail: any health indicator is reporting an error or all TTLs are expired.

Warn vs fail

Services in the warn state are still considered healthy in most cases but they may be about to fail soon or be partially degraded.
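
One way to read the definitions above is as a small summarizing function. The sketch below is an assumption-laden illustration, not the proposed health check manager: the Check dataclass, its field names and the summarize function are hypothetical, and the "partial transient failure" case of warn is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Check:
    ok: bool        # False when the indicator reports an error
    updated: float  # time of the last update, in seconds
    ttl: float      # seconds for which the result stays valid


def summarize(checks, now):
    """Collapse individual indicators into pass, warn or fail."""
    expired = [now - c.updated > c.ttl for c in checks]
    # fail: any indicator reports an error, or all TTLs are expired.
    if any(not c.ok for c in checks) or (checks and all(expired)):
        return "fail"
    # warn: at least one (but not every) TTL has expired.
    if any(expired):
        return "warn"
    # pass: everything is passing and fresh.
    return "pass"
```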

Code changes

A new top-level Nova health check module will be created to encapsulate the common code and data structure required to implement this feature.

A new health check manager class will be introduced which will maintain the health-check state and all functions related to retrieving, updating and summarizing that state.

The health check manager will be responsible for creating the health check endpoint when it is enabled in the nova.conf and exposing the health check over HTTP.

e.g.

@healthcheck('database', [SQLAlchemyError])
def my_db_func(self):
    pass

@healthcheck('database', [SQLAlchemyError])
def my_other_db_func(self):
    pass

By default all exceptions will be caught and re-raised by the decorator.
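
A decorator with this catch-and-re-raise behavior could look roughly like the sketch below. It is not the proposed implementation: the module-level HEALTH_STATE dict stands in for the health check manager, and the recorded fields are assumptions modeled on the example output in this spec.

```python
import functools
import time

# Hypothetical in-memory state; the health check manager would own this.
HEALTH_STATE = {}


def healthcheck(name, exceptions=(Exception,)):
    """Record pass/fail for the named check around each call."""
    exc_types = tuple(exceptions)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except exc_types as exc:
                HEALTH_STATE[name] = {
                    "status": "fail",
                    "time": time.time(),
                    "output": str(exc),
                }
                raise  # re-raise so callers still see the original error
            HEALTH_STATE[name] = {"status": "pass", "time": time.time()}
            return result
        return wrapper
    return decorator
```

Because the exception is re-raised unchanged, decorating a function only adds bookkeeping and does not alter its error handling contract.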

If implemented, the etag will be incremented whenever the service state changes and will reset to 0 when the service is restarted.

Example output

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 200 OK
Content-Type: application/health+json
Cache-Control: max-age=3600
Connection: close

{
    "status": "pass",
    "version": "1.0",
    "serviceId": "e3c22423-cd7a-47dc-b6e9-e18d1a8b3bdf",
    "description": "nova-api",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "api_db": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"}
    }
}

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 503 Service Unavailable
Content-Type: application/health+json
Cache-Control: no-cache
Connection: close

{
    "status": "fail",
    "version": "1.0",
    "serviceId": "0a47dceb-11b1-4d94-8b9c-927d998be320",
    "description": "nova-compute",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks":{
        "message_bus":{"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "hypervisor":{
             "status": "fail", "time": "2021-12-17T16:05:55+00:00",
             "output": "Libvirt Error: ..."
        }
    }
}
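
For the load-balancer use case, a consumer only needs the top-level status field of such a response. A minimal sketch, assuming a hypothetical helper keep_in_rotation and assuming that warn should keep a backend in rotation (based on the statement that services in the warn state are still considered healthy in most cases):

```python
import json


def keep_in_rotation(body):
    """Return True if a health response body says the service can take traffic."""
    status = json.loads(body).get("status")
    # Assumption: "warn" is still healthy enough to serve requests.
    return status in ("pass", "warn")


healthy = keep_in_rotation('{"status": "pass", "checks": {}}')
```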

Alternatives

Data model impact

The Nova context object will be extended to store a reference to the health check manager.

REST API impact

None

While this change will expose a new REST API endpoint it will not be part of the existing Nova API.

Security impact

Notifications impact

None

Other end user impact

None

At present, it is not planned to extend the Nova client or the unified client to query the new endpoint. cURL, socat, or any other UNIX socket or TCP HTTP client can be used to invoke the endpoint.

Performance Impact

None

Other deployer impact

A new config section healthcheck will be added in the nova.conf.

A uri config option will be introduced to enable the health check functionality. The config option will be a string opt that supports a comma-separated list of URIs with the following format:

uri=<scheme>://[host:port|path],<scheme>://[host:port|path]

e.g.

[healthcheck]
uri=tcp://localhost:42424

[healthcheck]
uri=unix:///run/nova/nova-compute.sock

[healthcheck]
uri=tcp://localhost:42424,unix:///run/nova/nova-compute.sock
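
Parsing a value in this format is straightforward with the standard library. The sketch below is illustrative only: parse_healthcheck_uris and its tuple-based return value are hypothetical, and the port used in the usage line is an arbitrary example. Note that urlsplit rejects TCP ports outside the range 0-65535 when the port attribute is read.

```python
from urllib.parse import urlsplit


def parse_healthcheck_uris(value):
    """Split a comma-separated uri option into endpoint tuples."""
    endpoints = []
    for raw in value.split(","):
        parts = urlsplit(raw.strip())
        if parts.scheme == "tcp":
            # parts.port raises ValueError for an out-of-range port.
            endpoints.append(("tcp", parts.hostname, parts.port))
        elif parts.scheme == "unix":
            endpoints.append(("unix", parts.path))
        else:
            raise ValueError(f"unsupported scheme in {raw!r}")
    return endpoints


endpoints = parse_healthcheck_uris(
    "tcp://localhost:8080,unix:///run/nova/nova-compute.sock"
)
```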

Developer impact

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: sean-k-mooney
Other contributors:: melwitt

Feature Liaison

Feature liaison:: sean-k-mooney

Work Items

Add new module
Introduce decorator
Extend context object to store a reference to health check manager
Add config options
Expose TCP endpoint
Expose UNIX socket endpoint support
Add docs

Dependencies

None

Testing

This can be tested entirely with unit and functional tests, however, Devstack will be extended to expose the endpoint and use it to determine whether the Nova services have started.

Documentation Impact

The config options will be documented in the config reference and a release note will be added for the feature.

References

Yoga PTG topic:
https://etherpad.opendev.org/p/r.e70aa851abf8644c29c8abe4bce32b81#L415

History

Revisions
Release Name	Description
Yoga	Introduced
2023.1 Antelope	Reproposed
2024.1 Caracal	Reproposed
2024.2 Dalmatian	Reproposed

Use extend volume completion action

Fri, 26 Apr 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/assisted-volume-extend

This blueprint proposes to use the os-extend_volume_completion volume action that has been proposed for Cinder in [3], to provide feedback on success or failure when handling volume-extended external server events.

Problem description

Many remotefs-based volume drivers in Cinder use the qemu-img resize command to extend volume files. However, when the volume is attached to a guest, QEMU will lock the file and qemu-img will be unable to resize it.

In this case, only the QEMU process holding the lock can resize the volume, which can be triggered through the QEMU monitor command block-resize.

There is currently no adequate way for Cinder to use this feature, so the NFS, NetApp NFS, Powerstore NFS, and Quobyte volume drivers all disable extending attached volumes.

Use Cases

As a user, I want to extend an NFS/NetApp NFS/Powerstore NFS/Quobyte volume while it is attached to an instance and I want the volume size and status to reflect the success or failure of the operation.

Proposed change

Nova’s libvirt driver uses the block-resize command when handling the volume-extended external server event, to inform QEMU that the size of an attached volume has changed. It is in principle also capable of extending a volume file, but is currently unable to provide feedback to Cinder on the success of the operation.

Currently, Cinder will send the volume-extended external server event to Nova only after it has finalized the extend operation and reset the volume status from extending back to in-use.

With [3], Cinder will allow volume drivers to hold off finalizing the extend operation and leave the volume status as extending, until after it has sent the volume-extended event and received feedback from Nova in the form of the os-extend_volume_completion volume action, with an error argument indicating whether to finalize or to roll back the operation.

This will currently affect only the volume drivers mentioned above, none of which previously supported online extend. All other drivers will continue to send the volume-extended event after finalizing the operation and resetting to in-use status, and will not expect an os-extend_volume_completion volume action.

Compute Agent

Nova’s compute agent will use the volume status to differentiate between the two behaviors when handling volume-extended events:

If the volume status is extending, then it will attempt to read extend_new_size from the volume’s metadata and use this value as the new size of the volume, instead of the volume size field.

After successfully extending the volume, it will call the extend volume completion action of the volume, with "error": false.

If anything goes wrong, including extend_new_size being missing from the metadata, or being smaller than the current size of the volume, it will log the error and call the os-extend_volume_completion action with "error": true, so Cinder can roll back the operation.
For any other volume status, including in-use, the event will be handled as before.
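
The branching described above can be summarized in a short sketch. It is not the Nova compute manager code: handle_volume_extended, the dict-shaped volume, and the extend and complete callables (standing in for the virt driver call and the Cinder os-extend_volume_completion action) are hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


def handle_volume_extended(volume, current_size, extend, complete):
    """Handle a volume-extended event for one attached volume."""
    if volume["status"] != "extending":
        # Any other status, including in-use: handled as before.
        extend(volume["size"])
        return
    try:
        # A missing key raises KeyError and is reported like any failure.
        new_size = int(volume["metadata"]["extend_new_size"])
        if new_size < current_size:
            raise ValueError("extend_new_size is smaller than the current size")
        extend(new_size)
    except Exception:
        LOG.exception("Failed to extend volume")
        complete(error=True)   # lets Cinder roll back the operation
        return
    complete(error=False)      # lets Cinder finalize the operation
```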

API

Nova’s API will introduce a new microversion, so that Cinder can make sure the new behavior is available, before leaving an extend operation unfinished.

To handle older compute agents during a rolling upgrade, the API will also check the compute service version of the target agent when receiving a volume-extended event with the new microversion. If a target compute agent is too old to support the feature, the API will discard the event and call the os-extend_volume_completion action with "error": true.
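
The rolling-upgrade guard could be sketched as below. The names route_volume_extended, dispatch and complete are hypothetical, and MIN_COMPUTE_VERSION is a placeholder rather than Nova's real service version number.

```python
# Placeholder value for illustration; not Nova's actual service version.
MIN_COMPUTE_VERSION = 100


def route_volume_extended(compute_version, dispatch, complete):
    """Forward the event only if the target compute agent is new enough."""
    if compute_version < MIN_COMPUTE_VERSION:
        # Discard the event and tell Cinder to roll back.
        complete(error=True)
        return False
    dispatch()
    return True
```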

Alternatives

A previous change tried to use the volume-extended external server event to support online extend for the NFS driver [1], but did not rely on feedback from Nova to Cinder at all. Instead, it would just set the new size of the volume, change the status back to in-use, notify Nova, and hope for the best.

If anything went wrong on Nova’s side, this would still result in a volume state indicating that the operation was successful, which is not acceptable.
A previous version of this spec proposed a new synchronous API in Nova [2], that would directly call ComputeVirtAPI.extend_volume of the nova-compute instance managing the guest that a volume was attached to. This API would provide a single mechanism to trigger the resize operation, communicate the new size to Nova, and get feedback on the success of the operation.

The problem with a synchronous API is that RPC and API timeouts limit the maximum time an extend operation can take. For QEMU, this seemed to be acceptable, because storage preallocation is hard disabled for the block-resize command, and because all currently plausible file systems support sparse file operations.

However, this may not be true for other volume or virt drivers that might require this API in the future. It would also break with the established pattern of asynchronous coordination between Nova and Cinder, which includes the assisted snapshot and volume migration features.
Following this pattern, we could make the proposed API asynchronous and use a new callback in Cinder, similar to Nova’s os-assisted-volume-snapshots API, which uses the os-update_snapshot_status snapshot action to provide feedback to Cinder.

The function of the new Nova API would then just be to trigger the operation and to communicate the new size. The question is then, whether that warrants adding a new API to Nova, since there are existing mechanisms that could be used for either.
The existing mechanism for triggering the extend operation in Nova is of course the volume-extended external server event. Using it for this purpose, as this spec proposes, requires the target size to be transferred separately, because external server events only have a single text field that is freely usable, which for volume-extended is already used for the volume ID.

Besides storing it in the admin metadata, as [3] and this spec propose, there is also the option of updating the size field of the volume, as [1] was essentially doing.

This would require the volume size field to be reset on a failure. If an error response from Nova was lost, the volume would just keep the new size. We would need to extend os-reset_status to allow a size reset, or something similar to clean up volumes like this. This would be possible, but updating the size field only after the volume was successfully extended seems like a cleaner solution.
We could also extend the external server event API to accept additional data for events, and use this to communicate the new size to Nova.

This option was judged favorably by reviewers on the previous version of this spec, [2], but it would be a more complex change to the Nova API.

However, if additional data fields become available in a future version of the external server event API, it would be a relatively minor change to use this instead of volume metadata.

Data model impact

None

REST API impact

The behavior of the external server event API will change.

If Nova receives a volume-extended event, and the referenced volume has status of extending, Nova will look for the extend_new_size key in the volume metadata, and use this instead of the volume size field as the target size to update the block device mapping and to pass to the virt driver’s extend_volume method.

Nova will also attempt to call Cinder’s new os-extend_volume_completion volume action proposed in [3] to let Cinder know if the operation was successful or not.
Otherwise, the API will behave as before.
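
The size selection described above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the function name target_size_for_extend and the plain dict standing in for the Cinder volume are assumptions; only the extending status and the extend_new_size metadata key come from this spec.

```python
def target_size_for_extend(volume):
    """Pick the size to record in the BDM and pass to the virt driver.

    `volume` is a plain dict standing in for the Cinder volume; its layout
    is an assumption made for this illustration.
    """
    if volume.get("status") == "extending":
        # New behavior: the target size travels in the volume metadata.
        new_size = volume.get("metadata", {}).get("extend_new_size")
        if new_size is None:
            # How a missing key is handled is not defined here; raising is
            # an assumption of this sketch.
            raise ValueError("volume is extending but extend_new_size is missing")
        return int(new_size)
    # Previous behavior: the volume size field already holds the new size.
    return int(volume["size"])
```

For a volume in any other status the function falls back to the size field, which matches the statement that the API otherwise behaves as before.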

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

Checking the target compute service version allows the API to handle rolling upgrades gracefully.

Implementation

Assignee(s)

Primary assignee:: kgube
Other contributors:: None

Feature Liaison

Feature liaison:: None

Work Items

Update the external server event API to check the target compute service version for volume-extended events.
Update the ComputeVirtAPI.extend_volume method to follow the behavior outlined in Compute Agent.
Add unit tests.
Adapt NFS job in the Nova gate to validate online extend.

Dependencies

The first two patches of [3] have been merged in 2024.1, so the os-extend_volume_completion action is now available in microversion 3.71 of the Block Storage API and supported in the 9.5.0 release of python-cinderclient.

Testing

We should test that os-extend_volume_completion gets called correctly in all possible error and success conditions if a volume has extending status.

We should test the case that the call to os-extend_volume_completion fails.

We also need to test that volume-extended continues to be handled correctly for volumes not in extending status.

Documentation Impact

The new behavior of the volume-extended event should be added to the documentation of the external server event API.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Accepted
2023.2 Bobcat	Accepted
2024.1 Caracal	Accepted
2024.2 Dalmatian	Reproposed

libvirt SPICE direct consoles

Sat, 06 Apr 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-spice-direct-consoles

Problem description

Use Cases

As a developer, I don’t want these changes to make the Nova codebase even more complicated. The changes proposed are relatively contained – a single new API microversion, some changes to the domain XML generation code which are optional, and associated tests.

As an end user, I would like access to a richer desktop experience than is currently available. Once these changes are integrated and Kerbside deployed, a further change to either Horizon or Skyline will be required to orchestrate console access via Kerbside. It is expected the complete end to end functionality will take several releases to land before a fully seamless experience is available. Once fully implemented, Horizon and Skyline will be capable of delivering a .vv configuration file for a specific console to a client, who will then have seamless access to their virtual desktop. However, a user will be able to use the openstack console url show command immediately to create a console session outside of our web clients.

Proposed change

The response from a get_spice_console or create call which requests a “spice-direct” console will return a URL derived from CONF.spice.kerbside_base_url, and will include a console access token. The user would then request this URL, and Kerbside would look up console connection details from nova via the /os-console-auth-tokens/ API. These details would be used to generate a virt-viewer style .vv configuration file, which the user can then use to access a proxied SPICE console.

This specification also covers tweaks to the libvirt domain XML to enrich the desktop experience provided by such a direct console, such as:

requiring an encrypted connection (WIP implementation at Ica7083b0836f8d66cad8a4b4097613103fc91560)
allowing concurrent users as supported by SPICE (WIP implementation at I65f94771abdc1a6ef54637ea81f25ce1daaf4963)
USB device passthrough from client to guest (WIP implementation at I0cbd28be272991f95c8fb9d76ee65b2b99a8bcf1)
sound support (WIP implementation at I4c98a0d6307c5e331df5caea80cb760512370058)

As part of prototyping this functionality, a series of patches to Nova were developed. These are available at https://github.com/shakenfist/kerbside-patches/tree/develop/nova as well as on gerrit at https://review.opendev.org/q/topic:%22kerbside-spice-direct-consoles%22.

They are:

Allow Nova to require secured SPICE connections, via a new require_secure configuration option in the SPICE configuration group.
Add an API microversion to expose the “spice-direct” console type.
Allowing concurrent connections to SPICE consoles for people who want to share a console session.
Supporting USB passthrough.
Optionally enabling SPICE debugging in qemu.
Adding a sound device so that the consoles can do audio. This will be done via a
Add an optional dependency in Nova to the Kerbside API client library so that Nova can fetch console connection details to proxy to a requesting user.

When implemented, a user can fetch a Kerbside connection URL like this:

The user then fetches that URL, and Kerbside delivers a .vv file with the connection information for a SPICE client. Kerbside uses a call to /os-console-auth-tokens/bf2e6883-… to determine the validity of the console authentication token, and the connection information for the console.
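
To make the last step concrete, here is a rough sketch of what rendering a .vv file could look like. It is not Kerbside's code: the function name render_vv_file and the example values are assumptions, and the exact set of keys Kerbside writes is not specified in this document. The [virt-viewer] section and the type, host, port, tls-port and title keys follow the virt-viewer connection file format.

```python
import configparser
import io


def render_vv_file(host, port, tls_port=None, title=None):
    """Render a virt-viewer connection file for a SPICE console."""
    # Interpolation is disabled so a literal "%" in a title is kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    # Preserve key spelling such as "tls-port" instead of lower-casing it.
    parser.optionxform = str
    parser["virt-viewer"] = {"type": "spice", "host": host, "port": str(port)}
    if tls_port is not None:
        parser["virt-viewer"]["tls-port"] = str(tls_port)
    if title is not None:
        parser["virt-viewer"]["title"] = title
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()
```

In a proxied deployment the host and ports would be those of the proxy rather than of the hypervisor.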

Alternatives

Data model impact

REST API impact

A new microversion is introduced which adds the type “spice-direct” to the existing “spice” protocol.

This implies that the JSON schema for the create console call would change to something like this:

create_v297 = {
    'type': 'object',
    'properties': {
        'remote_console': {
            'type': 'object',
            'properties': {
                'protocol': {
                    'type': 'string',
                    'enum': ['vnc', 'spice', 'rdp', 'serial', 'mks'],
                },
                'type': {
                    'type': 'string',
                    'enum': ['novnc', 'xvpvnc', 'spice-html5',
                             'spice-direct', 'serial', 'webmks'],
                },
            },
            'required': ['protocol', 'type'],
            'additionalProperties': False,
        },
    },
    'required': ['remote_console'],
    'additionalProperties': False,
}

And that the JSON schema for the get_spice_console would change to something like this:

get_spice_console_v297 = {
    'type': 'object',
    'properties': {
        'os-getSPICEConsole': {
            'type': 'object',
            'properties': {
                'type': {
                    'type': 'string',
                    'enum': ['spice-html5', 'spice-direct'],
                },
            },
            'required': ['type'],
            'additionalProperties': False,
        },
    },
    'required': ['os-getSPICEConsole'],
    'additionalProperties': False,
}

The response from /os-console-auth-tokens/ also needs to be tweaked to return a TLS port if one is configured for the console, which will require a response schema change.
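
As an illustration of what the create schema above accepts and rejects, the following sketch re-implements its checks in plain Python. The helper check_create_console_body is hypothetical and only mirrors the enum lists and required keys shown above; Nova itself validates requests with JSON schema, not with code like this.

```python
ALLOWED_PROTOCOLS = ("vnc", "spice", "rdp", "serial", "mks")
ALLOWED_TYPES = ("novnc", "xvpvnc", "spice-html5",
                 "spice-direct", "serial", "webmks")


def check_create_console_body(body):
    """Return a list of problems found in a create-console request body."""
    if not isinstance(body, dict) or set(body) != {"remote_console"}:
        return ["body must contain exactly the 'remote_console' key"]
    console = body["remote_console"]
    if not isinstance(console, dict) or set(console) != {"protocol", "type"}:
        return ["remote_console must contain exactly 'protocol' and 'type'"]
    problems = []
    if console["protocol"] not in ALLOWED_PROTOCOLS:
        problems.append("unknown protocol")
    if console["type"] not in ALLOWED_TYPES:
        problems.append("unknown type")
    return problems
```

Like the schema, this sketch does not check that the type is valid for the chosen protocol.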

Security impact

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

The following configuration options are added by the proposed changes:

spice.kerbside_base_url: defaults to an example URL which wouldn’t actually work for a non-trivial installation (just as the HTML5 transcoding proxy does). This is the base URL for the Kerbside URLs handed out by Nova.
spice.require_secure: defaults to False, the current hard coded default. Whether to require secure TLS connections to SPICE consoles. If you’re providing direct access to SPICE consoles instead of using the HTML5 proxy, you may wish those connections to be encrypted. If so, set this value to True. Note that use of secure consoles requires that you set up TLS certificates on each hypervisor.
spice.allow_concurrent: defaults to False, the current hard coded default. Whether to allow concurrent access to SPICE consoles. SPICE supports multiple users accessing the same console simultaneously, with some reduced functionality for the second and subsequent users. Set this option to True to enable concurrent access to SPICE consoles.
spice.debug_logging: defaults to False, the current hard coded default. Whether to emit SPICE debug logs or not to the qemu log. These debug logs are verbose, but can help diagnose some connectivity issues.

The following additional image property will be added:

hw_audio_model: defaults to None, the current hard coded default. Whether to include a sound device for the instance when SPICE consoles are enabled, and if so what type.

Additionally, if SPICE consoles are enabled, then USB passthrough devices are created in the guest. These devices are harmless if not used by a client capable of using USB passthrough.

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: mikal
Other contributors:: None

Feature Liaison

Liaison needed.

Work Items

Land the patches at https://github.com/shakenfist/kerbside-patches/tree/develop/nova in the order specified there, with any modifications requested by the Nova team during code review.

Dependencies

None.

Testing

Documentation Impact

References

None.

History

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced

Allow Manila shares to be directly attached to an instance when using libvirt

Fri, 22 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-virtiofs-attach-manila-shares

Problem description

Use Cases

As an operator I want the Manila datapath to be separate to any tenant accessible networks.
As a user I want to attach Manila shares directly to my instance and have a simple interface with which to mount them within the instance.
As a user I want to detach a directly attached Manila share from my instance.
As a user I want to track the Manila shares attached to my instance.

Proposed change

Support for move operations once a share is attached will also not be covered by this spec. Any request to shelve, evacuate, resize, suspend, cold migrate or live migrate an instance with a share attached will be rejected with an HTTP 409 response for the time being.

A new server shares API will be introduced under a new microversion. This will list current shares, show their details and allow a share to be attached or detached.

Note

The libvirt driver will be extended to support the above with initial support for cold attach and detach. Future work will aim to add live attach and detach as it is now supported by libvirt.

COMPUTE_STORAGE_VIRTIO_FS trait

and either the

COMPUTE_MEM_BACKING_FILE trait

that the instance is configured with hw:mem_page_size extra spec.

From an operator’s point of view, it means COMPUTE_STORAGE_VIRTIO_FS support requires that operators must upgrade all their compute nodes to the version supporting shares using virtiofs.

Users will be able to mount the attached shares using a mount tag, which is either the share UUID from Manila or a string provided by the user with the request to attach the share.

user@instance $ mount -t virtiofs $tag /mnt/mount/path

Share mapping status:

                     +----------------------------------------------------+   Reboot VM
    Start VM         |                                                    | --------------+
    Share mounted    |                       active                       |               |
+------------------> |                                                    | <-------------+
|                    +----------------------------------------------------+
|                      |                   |             |
|                      | Stop VM           |             |
|                      | Fail to umount    |             |
|                      v                   |             |
|                    +------------------+  |             |
|                    |      error       | <+-------------+-------------------+
|                    +------------------+  |             |                   |
|                      |                   |             |                   |
|                      | Detach share or   |             |                   |
|                      | delete VM         | Delete VM   |                   |
|                      v                   |             |                   |
|                    +------------------+  |             |                   |
|    +-------------> | detaching --> φ  | <+             |                   | Start VM
|    |               +------------------+                |                   | Fail to mount
|    |                 |                                 |                   |
|    | Detach share    |                                 | Stop VM           |
|    | or delete VM    | Attach share                    | Share unmounted   |
|    |                 v                                 v                   |
|    |               +----------------------------------------------------+  |
|    +-------------- |          attaching --> inactive                    | -+
|                    +----------------------------------------------------+
|                      |
+----------------------+

φ: means no entry in the database. No association between a share and a server.
Attach share: means POST /servers/{server_id}/shares
Detach share: means DELETE /servers/{server_id}/shares

This chart describes the share mapping status in Nova, which is independent of the status of the Manila share.

Share attachment/detachment can only be done if the VM state is STOPPED or ERROR.

The operation to start a VM might fail if the attachment of an underlying share fails or if the share is not in an inactive state.
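
The chart can also be read as a transition table. The sketch below is one such reading, under stated assumptions: the event names are made up for this illustration, None stands for φ, and the transient attaching and detaching statuses are collapsed into the status they lead to.

```python
# (current status, event) -> next status. None stands for φ (no mapping row).
TRANSITIONS = {
    (None, "attach_share"): "inactive",          # passes through "attaching"
    ("inactive", "start_vm_mounted"): "active",
    ("inactive", "start_vm_mount_failed"): "error",
    ("inactive", "detach_share"): None,          # passes through "detaching"
    ("inactive", "delete_vm"): None,
    ("active", "reboot_vm"): "active",
    ("active", "stop_vm_unmounted"): "inactive",
    ("active", "stop_vm_umount_failed"): "error",
    ("active", "delete_vm"): None,
    ("error", "detach_share"): None,
    ("error", "delete_vm"): None,
}


def next_status(current, event):
    """Return the status that follows `event`, or raise if not allowed."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in status {current!r}") from None
```

For example, next_status("active", "attach_share") raises ValueError, because the chart has no such edge.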

Note

The umount operation is only actually performed when the share is mounted and no longer used by another server.

With the above mount and umount operations, the state is stored in memory and does not require a lookup in the database.

Instance Deletion Processes:

Standard Deletion:

During a normal deletion process on the compute side, both the unmount and Manila policy removal are attempted.
- If both operations succeed, the corresponding share mapping is also removed.
- If either the unmount or policy removal fails, the instance itself is deleted, but a share mapping record may remain in the database. A future enhancement will include a periodic task designed to unmount, remove the policy, and clean up any leaked share mappings.

Local Deletion:

When the VM is marked as DELETED in the database due to unavailable compute during the delete request, no unmounting or Manila policy removal occurs via the API.
- Once the compute is operational again, it identifies instances marked as DELETED that have not yet been cleaned up. During the initialization of the instance, the compute attempts to complete the deletion process, which includes unmounting the share and removing the access policy.
  - If these actions are successful, the share mapping will be removed.
  - If either action fails, the deletion remains incomplete; however, the compute’s startup process continues unaffected, and the error is merely logged. For security reasons, it’s crucial not to retain the mount, necessitating a retry mechanism for cleanup. This situation parallels the standard deletion scenario and requires a similar periodic task for resolution.

Manila share removal issue:

A solution was identified with the Manila team to attach metadata to the share access policy that will lock the share and prevent its deletion until the lock is removed.

This solution was implemented in the Antelope cycle. The proposal here will use the lock mechanism in Nova.

Instance metadata:

Add instance shares in the instance metadata. Extend DeviceMetadata with ShareMetadata object containing shareId and tag used to mount the virtiofs on an instance by the user. See Other end user impact.

Alternatives

REST API impact

A new server level shares API will be introduced under a new microversion with the following methods:

GET /servers/{server_id}/shares

List all shares attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "shares": [
        {
            "shareId": "48c16a1a-183f-4052-9dac-0e4fc1e498ad",
            "status": "active",
            "tag": "foo"
        },
        {
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar"
        }
    ]
}

GET /servers/{server_id}/shares/{shareId}

Show details of a specific share attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar"
    }
}

PROJECT_ADMIN will be able to see details of the attachment id and export location stored within Nova:

{
    "share": {
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

POST /servers/{server_id}/shares

Attach a share to an instance.

Prerequisite(s):

Instance must be in the SHUTOFF state.
Instance should have the required capabilities to enable virtiofs (see above).

This API operates asynchronously. Consequently, the share_mapping is defined and its status is marked as “attaching” in the database.

Return Code(s): 202,400,401,403,404,409

Request body:

Note

tag will be an optional parameter in the request body. When it is not provided, it defaults to the shareId (UUID), which is always provided in the request.

tag, if provided by the user, must be an ASCII string with a maximum length of 64 bytes.

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}

Response body:

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}
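
The tag rules noted for the attach request can be sketched as a small helper. The name resolve_mount_tag is hypothetical and this is only an illustration of the stated rules (optional, ASCII, at most 64 bytes, defaulting to the shareId), not the actual request validation.

```python
def resolve_mount_tag(share_id, tag=None):
    """Return the mount tag to use for a share attachment."""
    if tag is None:
        # No tag supplied: fall back to the share UUID.
        return share_id
    if not tag.isascii():
        raise ValueError("tag must be an ASCII string")
    if len(tag.encode("ascii")) > 64:
        raise ValueError("tag must be at most 64 bytes long")
    return tag
```

With no tag, the share UUID itself becomes the value passed to mount -t virtiofs inside the guest.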

DELETE /servers/{server_id}/shares/{shareId}

Detach a share from an instance.

Prerequisite(s): Instance must be in the SHUTOFF or ERROR state.

This API functions asynchronously, leading to the share_mapping status being marked as detaching.

Concurrently, the compute system conducts a verification to see if the share is no longer being utilized by another instance. If found unused, it requests Manila to unlock the share and deny access.

Two checks are necessary:

To unmount, it’s important to verify whether any other virtual machines are using the share on the same compute system. This mechanism is already implemented by the driver.
For removing the access policy, we need to ensure that no compute system is currently using the share. Once this process is finalized, the association of the share is eliminated from the database.
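
The two checks can be sketched as follows. This is a simplified illustration: the Attachment record, its host field and the function detach_actions are assumptions, and any other existing attachment of the same share is treated as "in use".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attachment:
    instance_uuid: str
    share_id: str
    host: str  # compute host running the instance (hypothetical field)


def detach_actions(target, attachments):
    """Decide which cleanup steps apply when detaching `target`."""
    others = [a for a in attachments
              if a.share_id == target.share_id
              and a.instance_uuid != target.instance_uuid]
    return {
        # Unmount only if no other instance on the same host uses the share.
        "unmount": not any(a.host == target.host for a in others),
        # Remove the access policy only if no host uses the share at all.
        "deny_access": not others,
    }
```

The first check is local to one compute host, while the second needs a view across all hosts.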

Return Code(s): 202,400,401,403,404,409

Data model impact

A new share_mapping database table will be introduced.

id - Primary key autoincrement
uuid - Unique UUID to identify the particular share attachment
instance_uuid - The UUID of the instance the share will be attached to
share_id - The UUID of the share in Manila
status - The status of the share attachment within Nova (attaching, detaching, active, inactive, error)
tag - The device tag to be used by users to mount the share within the instance.
export_location - The export location used to attach the share to the underlying host
share_proto - The Shared File Systems protocol (NFS, CEPHFS)

A new base ShareMapping versioned object will be introduced to encapsulate the above database entries and to be used as the parent class of specific virt driver implementations.

Fields containing text will use String and not Text type in the database schema to limit the column width and be stored inline in the database.

This base ShareMapping object will provide stub attach and detach methods that will need to be implemented by any child objects.

New ShareMappingLibvirt, ShareMappingLibvirtNFS and ShareMappingLibvirtCephFS objects will be introduced as part of the libvirt implementation.
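
The object layout described above could look roughly like the sketch below. It uses plain dataclasses rather than Nova versioned objects, and the inheritance of the NFS and CephFS classes from ShareMappingLibvirt is an assumption; the method bodies are placeholders that return a description instead of mounting anything.

```python
from dataclasses import dataclass


@dataclass
class ShareMapping:
    """Plain-Python stand-in for the proposed base object."""
    uuid: str
    instance_uuid: str
    share_id: str
    status: str
    tag: str
    export_location: str
    share_proto: str

    def attach(self):
        # Stub: child objects must provide the real behavior.
        raise NotImplementedError

    def detach(self):
        raise NotImplementedError


class ShareMappingLibvirt(ShareMapping):
    """Common parent for libvirt-specific mappings (assumed hierarchy)."""


class ShareMappingLibvirtNFS(ShareMappingLibvirt):
    def attach(self):
        # Placeholder only: a real driver would mount the export here.
        return f"mount nfs {self.export_location}"

    def detach(self):
        return f"umount nfs {self.export_location}"


class ShareMappingLibvirtCephFS(ShareMappingLibvirt):
    def attach(self):
        return f"mount cephfs {self.export_location}"

    def detach(self):
        return f"umount cephfs {self.export_location}"
```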

Security impact

Notifications impact

New notifications will be added:

One to add new notifications for share attach and share detach.
One to extend the instance update notification with the share mapping information.

Share mapping in the instance payload will be optional and controlled via the include_share_mapping notification configuration parameter. It will be disabled by default.

Proposed payload for attached and detached notification will be the same as the one returned by the show command with admin rights.

{
    "share": {
        "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

The proposed instance payload for instance update will be the list of shares attached to this instance.

{
    "shares":
    [
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar",
            "export_location": "server.com/nfs_mount,foo=bar"
        },
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-ffffffffffff",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7987",
            "status": "active",
            "tag": "baz",
            "export_location": "server2.com/nfs_mount,foo=bar"
        }
    ]
}

Other end user impact

Users will need to mount the shares within their guest OS using the returned tag.

Users could use the instance metadata to discover and auto mount the share.

Performance Impact

Other deployer impact

None

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)
Other contributors:: lyarwood (initial contributor)

Feature Liaison

Feature liaison:: uggla

Work Items

Add new capability traits within os-traits
Add support within the libvirt driver for cold attach and detach
Add new shares API and microversion

Dependencies

None

Testing

Functional libvirt driver and API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Yoga	Introduced
Zed	Reproposed
Antelope	Reproposed
Bobcat	Reproposed
Caracal	Reproposed
Dalmatian	Updated and reproposed

Enforce remote console session timeout

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/enforce-remote-console-session-timeout

Currently, providing a VNC console consists of three parts:

1 - Working Console for Nova instance.

Once a Nova instance is created in the hypervisor, the hypervisor itself provides a console without the need for additional installations within the instance (as per nova.conf). To access the console, operators can use virsh console instance-xxx, which provides a serial console (character terminal access) and shows the instance login prompt.

2 - Provide console access outside compute node via browser.

A user creates a console URL to access the console via a web browser:

$ openstack console url show <vm>

The command calls the Nova API, which in turn communicates over RPC with the compute service, which returns a new URL for connecting to an existing console.

The command does not create a new console but rather generates a URL for connecting to the existing console. This URL includes a token for authentication via the proxy.

This URL can be used to connect to the Nova instance console. The console token is used to authenticate with the proxy, enabling new sessions to be established until the token TTL expires. Existing sessions continue to function even after token expiration, until the TCP connection is closed.

3 - Controller’s Nova Proxy: Bridging Client Browser and Compute Node

When a user connects to the provided URL via a browser, the Nova Proxy acts as an intermediary. It establishes a WebSocket connection to the hypervisor and proxies the console to the client. For VNC consoles, the Nova Proxy serves an HTML page with a JavaScript application that runs at client side in the user’s browser, providing a VNC client experience. In the case of a serial console, the Nova Proxy provides a direct WebSocket connection without a pre-built client, allowing users to create their own clients that interact with the WebSocket.

                            [ Nova API, Compute, virt driver ]
[client browser] <======>                                       <======> [target virtual machine]
                                [ Nova proxy ]

                              Controller Node                                 Compute Node

Problem description

Today, there is no mechanism in place to enforce the termination of a console session when the console token expires. Users can continue to access the console beyond the token expiration, and there is a need to address this behavior to enhance security measures.

Use Cases

As an operator, I want console sessions to be closed when the console authentication token TTL expires, so that the user is automatically disconnected from the console.

Proposed change

Implement a timer mechanism that automatically closes the target socket connection from the server side once the token has expired, based on the exact token expiration time. This will interrupt the real-time console session in the client-side browser or other application.

Also, introduce a new consoleauth config option, enforce_session_timeout, that allows operators to enable or disable the token expiry check. It defaults to False, so enforcement is disabled by default. This gives flexibility to existing console users based on their specific requirements.
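
A minimal sketch of such a timer, assuming a thread-based proxy and a token expiry given as a Unix timestamp. The function names and the use of threading.Timer are assumptions made for this illustration; the actual proxy code may schedule the close differently.

```python
import socket
import threading
import time


def _terminate(sock: socket.socket) -> None:
    """Shut down and close the target connection."""
    try:
        # Shutdown wakes up a reader that may be blocked on the socket.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # already disconnected
    sock.close()


def schedule_session_close(target_sock, expires_at,
                           enforce_session_timeout, now=None):
    """Close `target_sock` when the console token expires.

    Returns the started timer, or None when enforcement is disabled.
    """
    if not enforce_session_timeout:
        return None
    now = time.time() if now is None else now
    delay = max(0.0, expires_at - now)
    timer = threading.Timer(delay, _terminate, args=(target_sock,))
    timer.daemon = True
    timer.start()
    return timer
```

A proxy would keep the returned timer and cancel it if the client disconnects before the token expires.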

Alternatives

Client-side polling to check for token expiration. But as there are many VNC clients, it’s better to address the issue on the server side to ensure consistent session timeouts.

Data model impact

None

REST API impact

None

Security impact

This change enforces a strict time span for console access. While it doesn’t inherently enhance the safety of console access, it ensures that users must reauthenticate after a specified time period.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

A new optional config option will be added.

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: auniyal

Feature Liaison

Feature liaison:: auniyal

Work Items

Update Nova webproxy code
tests

Dependencies

None

Testing

functional

Documentation Impact

release notes

References

None

History

Revisions
Release Name	Description
2024.1 Caracal	Introduced

Ironic Shards

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/ironic-shards

Note

The deprecation for the [ironic]peer_list config option, explained below in Config changes and Deprecations, was landed in 2023.2 (Bobcat). The rest of the feature was reverted due to a late-discovered bug and is being re-submitted in 2024.1 (Caracal).

Problem description

Nova’s Ironic driver involves a single nova-compute service managing many compute nodes, where each compute node record maps to an Ironic node. Some deployments support 1000s of ironic nodes, but a single nova-compute service is unable to manage 1000s of nodes and 1000s of instances.

Currently we support setting a partition key, where nova-compute only cares about a subset of ironic nodes, those associated with a specific conductor group. However, some conductor groups can be very large, served by many ironic-conductor services.

To help with this, Nova has attempted to dynamically spread ironic nodes between a set of nova-compute peers. While this works some of the time, there are some major limitations:

when one nova-compute is down, only unassigned ironic nodes can move to another nova-compute service
i.e. when one nova-compute is down, all ironic nodes with nova instances associated with the down nova-compute service are unable to be managed, i.e. reboot will fail
moreover, when the old nova-compute comes back up, which might take some time, there are lots of bugs as the hash ring slowly rebalances. In part because every nova-compute fetches all nodes, in a large enough cloud, this can take over 24 hours.

This spec is about tweaking the way we shard Ironic compute nodes. We need to stop violating deep assumptions in the compute manager code by moving to more static ironic node partitions.

Use Cases

Any users of the ironic driver that have more than one nova-compute service per conductor group should move to an active-passive failover mode.

The new static sharding will be of particular interest for clouds with ironic conductor groups that are greater than around 1000 baremetal nodes.

Proposed change

We add a new configuration option:

[ironic] shard_key

By default, there will be no shard_key set, and we will continue to expose all ironic nodes from a single nova-compute process. Mostly, this is to keep things simple for smaller deployments, i.e. when you have less than 500 ironic nodes.

When the operator sets a shard_key, the compute-node process will use the shard_key when querying a list of nodes in Ironic. We must never try to list all Ironic nodes when the Ironic shard key is defined in the config.

When we look up a specific ironic node via a node uuid or instance uuid, we should not restrict that to either the shard key or conductor group.

Similar to checking the instance uuid is still present on the Ironic node before performing an action, or ensuring there is no instance uuid before provisioning, we should also check the node is in the correct shard (and conductor group) before doing anything with that Ironic node.
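
The membership check described above might look like this. It is a sketch only: the function name node_is_managed is hypothetical, and the node is a plain dict whose shard and conductor_group keys are assumed to mirror the corresponding Ironic node fields.

```python
def node_is_managed(node, shard_key=None, conductor_group=None):
    """Return True if `node` belongs to this nova-compute's partition.

    A filter that is not configured (None) does not restrict anything.
    """
    if shard_key is not None and node.get("shard") != shard_key:
        return False
    if (conductor_group is not None
            and node.get("conductor_group") != conductor_group):
        return False
    return True
```

Looking a node up by uuid stays unrestricted; a check like this would run afterwards, before acting on the node.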

Config changes and Deprecations

We will keep the option to target a specific conductor group, but this option will be renamed from partition_key to conductor_group. This is additive to the shard_key above: the target ironic nodes are those in both the correct shard_key and the correct conductor_group, when both are configured.

We will deprecate the use of the peer_list. We should log a warning when the hash ring is being used, i.e. when it has more than one member added to the hash ring.

In addition, we need the logic that tries to move Compute Nodes to never work unless the peer_list is larger than one. More details in the data model impact section.

When deleting a ComputeNode object, we need to have the driver confirm that it is safe. In the case of Ironic we will check to see if the configured Ironic has a node with that uuid, searching across all conductor groups and all shard keys. When the ComputeNode object is not deleted, we should not delete the entry in placement.

nova-manage move ironic node

We will create a new nova-manage command:

nova-manage ironic-compute-node-move <ironic-node-uuid> \
    --service <destination-service>

This command will do the following:

Find the ComputeNode object for this ironic-node-uuid
Error if the ComputeNode type does not match the ironic driver.
Find the related Service object for the above ComputeNode (i.e. the host)
Error if the service object is not reported as down, and has not also been put into maintenance. We do not require forced down, because we might only be moving a subset of nodes associated with this nova-compute service.
Check the Service object for the destination service host exists
Find all non-deleted instances for this (host,node)
Error if there is more than 1 non-deleted instance found. It is OK if we find zero or 1 instances.
In one DB transaction: move the ComputeNode object to the destination service host and move the Instance (if there is one) to the destination service host
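
The steps above can be sketched with in-memory records in place of the database. Everything here is illustrative: the class and field names are assumptions, the disabled flag stands in for "put into maintenance", and the source-service check uses one reading of the rule (the service must be both down and in maintenance).

```python
from dataclasses import dataclass


class MoveError(Exception):
    pass


@dataclass
class Service:
    host: str
    is_up: bool
    disabled: bool  # stand-in for "put into maintenance" (assumption)


@dataclass
class ComputeNode:
    uuid: str
    hypervisor_type: str
    host: str


@dataclass
class Instance:
    uuid: str
    host: str
    node: str
    deleted: bool = False


def move_ironic_node(node_uuid, dest_host, nodes, services, instances):
    node = nodes.get(node_uuid)
    if node is None:
        raise MoveError("no ComputeNode found for this ironic node uuid")
    if node.hypervisor_type != "ironic":
        raise MoveError("ComputeNode is not managed by the ironic driver")
    source = services.get(node.host)
    if source is None:
        raise MoveError("source service not found")
    # Interpretation: require the source to be down AND in maintenance.
    if source.is_up or not source.disabled:
        raise MoveError("source service must be down and in maintenance")
    if dest_host not in services:
        raise MoveError("destination service does not exist")
    found = [i for i in instances
             if not i.deleted and i.host == node.host and i.node == node.uuid]
    if len(found) > 1:
        raise MoveError("more than one non-deleted instance on this node")
    # Stand-in for the single DB transaction: update both records together.
    for inst in found:
        inst.host = dest_host
    node.host = dest_host
    return node, found
```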

The above tool is expected to be used as part of this wider process of migrating from the old peer_list to the new shard key. There are two key scearios (although the tool may help operator recover from other issues as well):

moving from a peer_list to a single nova-compute
moving from peer_list to shard_key, while keeping multiple nova-compute processes (for a single conductor group)

Migrate from peer_list to single nova-compute

Small deployments (i.e. less than 500 ironic nodes) are recommended to move from a peer_list of, for example, three nova-compute services, to a single nova-compute service. On failure of the nova-compute service, operators can either manually start the processes on a new host, or use an automatic active-passive HA scheme.

The process would look something like this:

ironic and nova both default to an empty shard_key, such that all ironic nodes are in the same default shard
start a new nova-compute service running the ironic driver, ideally with a synthetic value for [DEFAULT]host, e.g. ironic. This will log warnings about the need to use the nova-compute migration tool before being able to manage any nodes
stop all existing nova-compute services
mark them as forced-down via the API
Now loop around all ironic nodes and call this, assuming your nova-compute service has its host value of just ironic: nova-manage ironic-compute-node-move <uuid> --service ironic

The periodic tasks in the new nova-compute service will gradually pick up the new ComputeNodes, and will start being able to receive commands such as reboot for all the moved instances.

While you could start the new nova-compute service after having migrated all the ironic compute nodes, that would lead to higher downtime during the migration.

Migrate from peer_list to shard_key

The process to move from the hash ring based peer_list to the static shard_key from ironic is very similar to the above process:

Set the shard_key on all your ironic nodes, such that you can spread the nodes out between your nova-compute processes
Start your new nova compute processes, one for each shard_key, possibly setting a synthetic [DEFAULT]host value that matches the my_shard_key.
Shut down all the older nova-compute processes with [ironic]peer_list set
Mark those older services as in maintenance via the Nova API
For each shard_key in Ironic, work out which service host you have mapped each one to above, then run this for each ironic node uuid in the shard: nova-manage ironic-compute-node-move <uuid> --service my_shard_key
Delete the old services via the Nova API, now there are no instances or compute nodes on those services
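The per-shard loop in the steps above could be scripted along these lines. This is only a sketch: the command syntax is the one proposed earlier in this spec, while the helper name, the shard names and the host names are hypothetical, and how the node UUIDs per shard are obtained from Ironic is left out.

```python
import shlex


def build_move_commands(nodes_by_shard, host_for_shard):
    """Return one nova-manage argv list per ironic node.

    nodes_by_shard maps a shard key to the ironic node UUIDs in it;
    host_for_shard maps a shard key to the destination service host.
    """
    commands = []
    for shard, node_uuids in sorted(nodes_by_shard.items()):
        host = host_for_shard[shard]  # a KeyError flags an unmapped shard
        for node_uuid in node_uuids:
            commands.append([
                "nova-manage", "ironic-compute-node-move", node_uuid,
                "--service", host,
            ])
    return commands


# Hypothetical mapping, for illustration only.
nodes_by_shard = {"shard-a": ["uuid-1", "uuid-2"], "shard-b": ["uuid-3"]}
host_for_shard = {"shard-a": "ironic-shard-a", "shard-b": "ironic-shard-b"}
for argv in build_move_commands(nodes_by_shard, host_for_shard):
    print(shlex.join(argv))
```

Printing the commands rather than executing them lets the operator review the plan before anything is moved.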

While you could start the new nova-compute services after the migration, that would lead to a slightly longer downtime.

Adding new compute nodes

In general, there is no change when adding nodes into existing shards.

Similarly, you can add a new nova-compute process for a new shard and then start to fill that up with nodes.

Move an ironic node between shards

When removing nodes from ironic at the end of their life, or adding large numbers of new nodes, you may need to rebalance the shards.

To move some ironic nodes, you need to move the nodes in groups associated with a specific nova-compute process. For each nova-compute and the associated ironic nodes you want to move to a different shard you need to:

Shutdown the affected nova-compute process
Put the nova-compute service into maintenance
In the Ironic API, update the shard key on the Ironic node
Now move each ironic node to the correct new nova-compute process for the shard key it was moved into: nova-manage ironic-compute-node-move <uuid> --service my_shard_key
Now unset maintenance mode for the nova-compute, and start that service back up

Move shards between nova-compute services

To move a shard between nova-compute services, you need to replace the nova-compute process with a new one:

ensure the destination nova-compute is configured with the shard you want to move, and is running
stop the nova-compute process currently serving the shard
force-down the service via the API
for each ironic node uuid in the shard call nova-manage to move it to the new nova-compute process

Alternatives

We could require nova-compute processes to be explicitly forced down, before allowing the nova-manage to move the ironic nodes about, in a similar way to evacuate. But this creates problems when trying to re-balance shards as you remove nodes at the end of their life.

We could consider a list of shard keys, rather than a single shard key per nova-compute. But for this first version, we have chosen the simpler path, that appears to have few limitations.

We could attempt to keep fixing the hash ring recovery within the ironic driver, but it’s very unclear what will break next due to all the deep assumptions made about the nova-compute process. The specific assumptions include:

when nova-compute breaks, it’s usually the hypervisor hardware that has broken, which includes all the nova servers running on that.
all locking and management of a nova server object is done by the currently assigned nova-compute node, and this is only ever changed by explicit move operations like resize, migrate, live-migration and evacuate. As such we can use simple local locks to ensure concurrent operations don’t conflict, along with DB state checking.

Data model impact

A key thing we need to ensure is that ComputeNode objects are only automatically moved between service objects when in legacy hash ring mode. Currently, this only happens for unassigned ComputeNodes.

In this new explicit shard mode, only nova-manage is able to move ComputeNode objects. In addition, nova-manage will also move associated instances. However, similar to evacuate, this will only be allowed when the currently associated service is forced down.

Note, this applies when a nova-compute finds a ComputeNode that it should own, but the Nova database says it’s already owned by a different service. In this scenario, we should log a warning to the operator to ensure they have migrated that ComputeNode from its old location before this nova-compute service is able to manage it.

In addition, we should ensure we only delete a ComputeNode object when the driver explicitly says it’s safe to delete. In the case of the Ironic driver, we should ensure the node no longer exists in Ironic, being sure to search across all shards.

This is all closely related to this spec on robustifying the Compute Node and Service object relationship: https://review.opendev.org/c/openstack/nova-specs/+/853837

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

Users will experience a more reliable Ironic and Nova integration.

Performance Impact

It should help users more easily support large ironic deployments integrated with Nova.

Other deployer impact

We will rename the “partition_key” configuration to be explicitly “conductor_group”.

We will deprecate the peer list key. When we start up and see anything set, we emit a warning about the bugs in using this legacy auto sharding, and recommend moving to the explicit sharding.

There is a new shard_key config, as described above.

There is a new nova-manage CLI command to move Ironic compute nodes on forced-down nova-compute services to a new one.

Developer impact

None

Upgrade impact

For those currently using peer_list, we need to document how they can move to the new sharding approach.

Implementation

Assignee(s)

Primary assignee:: JayF
Other contributors:: johnthetubaguy

Feature Liaison

Feature liaison: None

Work Items

rename conductor group partition key config
deprecate peer_list config, with warning log messages
add compute node move and delete protections, when peer_list not used
add new shard_key config, limit ironic node list using shard_key
add nova-manage tool to move ironic nodes between compute services
document operational processes around above nova-manage tool

Dependencies

The deprecation of the peer list can happen right away.

But the new sharding depends on the Ironic shard key getting added: https://review.opendev.org/c/openstack/ironic-specs/+/861803

Ideally we add this into Nova after robustify compute node has landed: https://review.opendev.org/c/openstack/nova/+/842478

Testing

We need some functional tests for the nova-manage command to ensure all of the safety guards work as expected.

We need to ensure a tempest test exists which has multiple shards, with only one shard containing valid, functional Ironic nodes. Then, ensure that only the valid nodes are scheduled to.

Documentation Impact

A lot of docs needed for the Ironic driver on the operational procedures around the shard_key.

References

None

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced
2023.2 Bobcat	Re-proposed, Partially implemented
2024.1 Caracal	Re-proposed

Move to using Libvirt device aliases

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-dev-alias

Currently we identify devices in Libvirt guest XML by a variety of methods, which differs based on the device type (at least). Libvirt now provides a device alias mechanism by which we can tie virtual guest devices to an identifier we can use to look them up in a stable and generic way. Nova should move to using that, which will increase consistency, decrease some complexity, and also work around some issues with our current strategy.

Problem description

Nova currently looks up guest devices in XML for attach/detach and other modifications using a variety of methods. For example, disk devices use the serial property to identify them uniquely. However, libvirt and qemu do not support setting this property on all disk device types, which means Nova cannot use that to look up disk devices in a generic way. Further, if we have multiple network interfaces with the same MAC address, using that as a unique identifier is not sufficient.

Example volume attachment:

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/sda' index='5'/>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
  <serial>ada5af06-300e-4d07-931d-3cc2bff8a8a9</serial>
  <alias name='virtio-disk1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>

Use Cases

As a developer, I want Nova to be able to manage libvirt guest devices in a stable and consistent way.

As a deployer, I want Nova to support things like SCSI LUN passthrough, which does not support setting the device serial in libvirt.

Proposed change

Nova’s libvirt driver should move to using the device alias mechanism [1] for identifying all types of devices that are attach- or detach-able. For devices like volumes and network interfaces, the volume or port UUID should be used. For other devices, some other stable identifier that correlates to something in Nova or another service’s database is required. Libvirt has specific requirements for the format of the alias, which must be followed. However, for most devices that use a UUID as the primary identifier, we should be able to embed that within the alias.

This is what the above disk example would look like with a nova-specified alias:

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/sda' index='5'/>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
  <serial>ada5af06-300e-4d07-931d-3cc2bff8a8a9</serial>
  <alias name='ua-ada5af06-300e-4d07-931d-3cc2bff8a8a9'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
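A minimal sketch of an alias-first lookup with a fallback to the serial, using only the Python standard library. The helper names are hypothetical and this is not the libvirt driver's code; the ua- prefix follows the example above, and the sample domain XML is trimmed for illustration.

```python
import xml.etree.ElementTree as ET

ALIAS_PREFIX = "ua-"  # prefix used for user-defined aliases, as shown above


def make_alias(device_uuid):
    """Build the alias name for a volume or port UUID."""
    return ALIAS_PREFIX + device_uuid


def find_disk(domain_xml, volume_id):
    """Find a <disk> by alias, falling back to the legacy <serial> match."""
    root = ET.fromstring(domain_xml)
    disks = root.findall("./devices/disk")
    wanted = make_alias(volume_id)
    for disk in disks:
        alias = disk.find("alias")
        if alias is not None and alias.get("name") == wanted:
            return disk
    # Older guests may not carry a Nova-specified alias yet.
    for disk in disks:
        if disk.findtext("serial") == volume_id:
            return disk
    return None


# Trimmed, hypothetical guest XML: one aliased disk, one legacy disk.
DOMAIN_XML = """
<domain>
  <devices>
    <disk type='block' device='disk'>
      <target dev='vdb' bus='virtio'/>
      <alias name='ua-ada5af06-300e-4d07-931d-3cc2bff8a8a9'/>
    </disk>
    <disk type='block' device='disk'>
      <target dev='vdc' bus='virtio'/>
      <serial>11111111-2222-3333-4444-555555555555</serial>
      <alias name='virtio-disk2'/>
    </disk>
  </devices>
</domain>
"""
```

The first disk is found through its alias even though it has no serial, which is the case the serial-based lookup cannot handle.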

Alternatives

We could keep what we have and continue to not support disk devices that do not support using serial.

We could maintain our own mapping in our database for those device types.

Data model impact

Nova’s own data model is not affected by this and this is limited to nova-compute and the libvirt driver. However, the libvirt XML data that we currently maintain will need to change (and existing instances migrated) to set the device aliases accordingly.

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

End users can currently request SCSI LUN-based disk device mapping, but it does not work because we are unable to specify the device serial in that configuration. After this change, that existing mechanism will begin to work.

Performance Impact

No major performance impact to Nova itself, although looking up devices by alias will be easier and less computationally intense. Further a detach-by-alias routine [2] is provided by libvirt which may be significantly easier than what we currently need to do by generating and providing an XML blob for detach.

Other deployer impact

None.

Developer impact

The libvirt driver will ultimately be simpler after this change.

Upgrade impact

The only upgrade impact comes from migrating existing instance XML documents to specify the device alias. Because we may be migrating instances to/from older nodes, we should retain compatibility with alias-less XMLs for some time to come.

Implementation

Assignee(s)

Primary assignee:

dansmith

Other contributors:

kashyap
sean-k-mooney

Feature Liaison

dansmith

Work Items

Enable setting and parsing the device alias on disk, interface, and pci devices
Actually set those device aliases in the various parts of the driver that create those configs
Make the code that looks up devices by device-specific identifiers prefer the alias and fall back to the old way
Migrate existing instance XMLs on startup when device aliases are missing

Dependencies

Libvirt 3.9.0: https://libvirt.org/formatdomain.html#devices

Testing

Existing devstack jobs should provide sufficient coverage other than the unit and functional coverage that will be added. Potentially enabling (and using) the LUN passthrough attachment mechanism would be beneficial, but that is somewhat beyond the scope of this effort which is just changing the enumeration behavior.

Documentation Impact

There really is not much in the way of documentation impact because this should be transparent to the operators and users.

References

History

Revisions
Release Name	Description
2024.1	Introduced

Add maxphysaddr support for Libvirt

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-maxphysaddr-support

This blueprint proposes new flavor extra_specs and image properties to control the physical address bits of vCPUs in Libvirt guests.

Problem description

When booting a guest with 1TB+ RAM, the default physical address bits are too small and the boot fails [1]. So a knob is needed to specify the appropriate physical address bits.

Use Cases

Booting a guest with large RAM.

Proposed change

In Libvirt v8.7.0+ and QEMU v2.7.0+, physical address bits can be specified with the following XML elements [2] [3]. The former sets an explicit number of physical address bits; the latter adopts the physical address bits of the host CPU.

<maxphysaddr mode='emulate' bits='42'/>
<maxphysaddr mode='passthrough'/>

Flavor extra_specs and image properties

Here I suggest the following two for flavor extra_specs and image properties. Of course, if these are omitted, the behavior is the same as before.

hw:maxphysaddr_mode can be either emulate or passthrough.
hw:maxphysaddr_bits takes a positive integer value. Only meaningful and must be specified if hw:maxphysaddr_mode=emulate.
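The rules for the two properties can be sketched as a small validator that also renders the XML element shown earlier. The function name is hypothetical, and rejecting a bits value in passthrough mode is an assumption of this sketch; the spec only says the value is not meaningful there.

```python
def maxphysaddr_xml(mode, bits=None):
    """Validate mode/bits and render the <maxphysaddr> element."""
    if mode not in ("emulate", "passthrough"):
        raise ValueError("mode must be 'emulate' or 'passthrough'")
    if mode == "passthrough":
        if bits is not None:
            # Assumption: treat a stray bits value as an error.
            raise ValueError("bits is only meaningful with mode=emulate")
        return "<maxphysaddr mode='passthrough'/>"
    if bits is None:
        raise ValueError("bits is required with mode=emulate")
    bits = int(bits)  # extra_specs values arrive as strings
    if bits <= 0:
        raise ValueError("bits must be a positive integer")
    return "<maxphysaddr mode='emulate' bits='%d'/>" % bits
```

For example, maxphysaddr_xml("emulate", "42") yields the first element shown above, and maxphysaddr_xml("passthrough") yields the second.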

So the overall flavor extra_specs look like the following:

openstack flavor set <flavor> \
  --property hw:maxphysaddr_mode=emulate \
  --property hw:maxphysaddr_bits=42

Similarly, the overall image properties look like the following:

openstack image set <image> \
  --property hw_maxphysaddr_mode=emulate \
  --property hw_maxphysaddr_bits=42

Nova scheduler changes

Nova scheduler also needs to be modified to take these two properties into account.

hw:maxphysaddr_mode

There can be a mix of supported and unsupported hosts depending on Libvirt and QEMU versions. So add new traits COMPUTE_ADDRESS_SPACE_PASSTHROUGH and COMPUTE_ADDRESS_SPACE_EMULATED to check the scheduled host supports this feature. trait:COMPUTE_ADDRESS_SPACE_PASSTHROUGH=required is automatically added if hw:maxphysaddr_mode=passthrough is specified in flavor extra_specs or image properties. And same for hw:maxphysaddr_mode=emulate. This can be implemented inside the from_request_spec method of ResourceRequest class.
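A sketch of the automatic trait selection described above, written as a standalone function rather than as part of ResourceRequest. The function name is hypothetical, and letting the flavor value take precedence over the image value is an assumption of this sketch, not something the spec states.

```python
TRAIT_FOR_MODE = {
    "passthrough": "COMPUTE_ADDRESS_SPACE_PASSTHROUGH",
    "emulate": "COMPUTE_ADDRESS_SPACE_EMULATED",
}


def maxphysaddr_required_traits(extra_specs, image_props):
    """Return the set of traits to require for the requested mode."""
    # Assumption: the flavor extra spec wins if both are set.
    mode = extra_specs.get("hw:maxphysaddr_mode")
    if mode is None:
        mode = image_props.get("hw_maxphysaddr_mode")
    if mode is None:
        return set()  # feature not requested: behavior is unchanged
    if mode not in TRAIT_FOR_MODE:
        raise ValueError("unknown maxphysaddr mode: %s" % mode)
    return {TRAIT_FOR_MODE[mode]}
```

When neither property is set, no trait is required, which matches the statement that omitting the properties leaves behavior unchanged.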

Passthrough and emulate modes have different properties. So let’s consider the two separately.

The case of hw:maxphysaddr_mode=passthrough. In this case, cpu_mode=host-passthrough is a requirement, which is already taken into account in nova scheduling, and no additional modifications are required in this proposal. It is not guaranteed whether the instance can be migrated by nova. So the admin needs to make sure that targets of cold and live migration have similar hardware and software. This restriction is similar for cpu_mode=host-passthrough.

The case of hw:maxphysaddr_mode=emulate. In nova scheduling, it is necessary to check that the hypervisor supports at least hw:maxphysaddr_bits. Numerical comparison is implemented differently for flavor extra_specs and image properties, so it is divided into two cases.

hw:maxphysaddr_bits

The maximum number of bits supported by hypervisor can be obtained by using libvirt capabilities [4].

If hw:maxphysaddr_bits is set to flavor extra_specs, ComputeCapabilitiesFilter can be used to compare the number of bits in scheduling. For example, this can be accomplished by adding capabilities:cpu_info:maxphysaddr:bits>=42 automatically.

If hw_maxphysaddr_bits is set to image properties, perform a numeric comparison with ImagePropertiesFilter.

Cold migration and live migration can also be realized with these filters and the COMPUTE_ADDRESS_SPACE_EMULATED trait.

Alternatives

Before the maxphysaddr option was introduced into Libvirt, it was specified as a workaround with the QEMU command-line parameter. But this alternative is not allowed in nova.

Also, some Linux distributions may have machine types with host-phys-bits=true [5]. For example, pc-i440fx-bionic-hpb and pc-q35-bionic-hpb. However, this alternative has the following two issues and cannot be adopted for general-purpose use cases.

Ubuntu package maintainers are applying a patch to QEMU [6]. It means this is not included in vanilla QEMU and is not available in other distributions.
This is only the case for hw:maxphysaddr_mode=passthrough and does not include hw:maxphysaddr_mode=emulate. Since hw:maxphysaddr_mode=passthrough requires cpu_mode=host-passthrough to be used [7], this alternative cannot be used with cpu_mode=custom or cpu_mode=host-model. So, this alternative is not sufficient for a cloud with many different CPU models.

As for scheduling, placement does not currently support numeric traits, so the maximum number of bits supported by hypervisor cannot be checked by this mechanism. Numeric comparisons can also be performed with JsonFilter. However, JsonFilter appears to be vulnerable to changes in HostState and its child attributes, which is mentioned as a warning [10]. So this spec employs ComputeCapabilitiesFilter and ImagePropertiesFilter.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Operators should specify appropriate flavor extra_specs or image properties as needed.

Developer impact

None

Upgrade impact

As described earlier, the new traits COMPUTE_ADDRESS_SPACE_PASSTHROUGH and COMPUTE_ADDRESS_SPACE_EMULATED signal if the upgraded compute nodes support this feature.

Implementation

Assignee(s)

Primary assignee:: nmiki
Other contributors:: None

Feature Liaison

Feature liaison:: Liaison Needed

Work Items

This spec is addressed across multiple dev cycles. The merged and missing items are shown below, respectively.

Merged Items

Add new traits to check Libvirt and QEMU versions [8] [9]

Missing Items

Add new guest configs
Add new fields in nova/api/validation/extra_specs/hw.py
Add new fields in nova/objects/image_meta.py
Add new fields in LibvirtConfigCPU in nova/virt/libvirt/config.py
Add new field maxphysaddr to cpu_info in nova/virt/libvirt/driver.py
Add docs and release notes for new flavor extra_specs
Support for hw:maxphysaddr_bits numeric comparison in ComputeCapabilitiesFilter
Support for hw_maxphysaddr_bits numeric comparison in ImagePropertiesFilter

Dependencies

Libvirt v8.7.0+. QEMU v2.7.0+.

Testing

Add the following unit tests:

check that proposed flavor extra_specs are properly validated
check that proposed image properties are properly validated
check that intended XML elements are output
check that traits are properly added and used
check that new field in ComputeCapabilitiesFilter is properly added and used
check that new field in ImagePropertiesFilter is properly added and used

Documentation Impact

For operators, the documentation describes what proposed flavor extra_specs and image properties mean and how they should be set.

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced
2023.2 Bobcat	Reproposed
2024.1 Caracal	Reproposed

Mediated device live migration with libvirt

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-mdev-live-migrate

Starting with libvirt-8.6.0, QEMU-8.1.0 and Linux kernel 5.18.0, guests using mediated devices can be live migrated by using a target mediated device using the same mediated device type (and we don’t need to unplug/plug the mdevs). Now, we need to support this for Nova, which means that Nova should provide a target mediated device UUID (that exists) to the source compute service by the pre-live-migrating call so the target XML created by the source would use it.

Problem description

For the moment, it is not possible to live-migrate an instance if it uses a mediated device, as the target wouldn’t create it. For now, you can only cold-migrate the instance or do other move operations like shelve. Fortunately, libvirt 8.6.0 now supports live-migrating a guest by using a target mediated device UUID in the target XML, so we want to directly support this in Nova.

Use Cases

As an operator, I want to move my instance using a vGPU to another host without the user being aware of it.

As an operator, I want to make sure I will only live-migrate by using the same mediated device type between the source and the target.

Proposed change

In order to successfully live-migrate a guest with libvirt, you need to modify the target guest XML to use another mediated device using the same mdev (mediated device) type. In order to do it, we propose the following workflow:

First, during the conductor compatibility checks, we will verify the type compatibility on the destination and we will claim a specific list of target mediated devices (either to be created or just kept reserved) this way:

check_can_live_migrate_source() (run on the source) will check the libvirt version of the source and, only if the instance has mediated devices, fail by raising a MigrationPreCheckError if the version is below the minimum required (see Dependencies). It will also check the LibvirtLiveMigrateData version returned by the destination and will raise a MigrationPreCheckError exception if it is older than the one supporting the new fields (see both Upgrade Impact and Data model impact). Finally, it will return the list of mdevs with their types back to the target in the LibvirtLiveMigrateData object.
driver’s post_claim_migrate_data() will first check, based on the LibvirtLiveMigrateData object, whether the libvirt version is below the minimum required and then check whether those mdev types are compatible with the types the target supports, and will raise a MigrationPreCheckError if not. If successful, it will pick N (N being the requested number) of the available mediated resources (either by creating new mdevs or taking existing ones), based on the list that was passed through LibvirtLiveMigrateData, and will persist that list of target mediated devices in some internal dictionary field of the LibvirtDriver instance, keyed by the instance UUID. We will also pass those mdev UUIDs in the LibvirtLiveMigrateData object that we return over the wire to the source compute (we will later call it the migrate data object).

Note

the current spec proposal is to use the existing NUMA-live-migration related method called post_claim_migrate_data() but we could create a specific new virt driver API method for this usage. This will be discussed at implementation stage.

Later, once the source host starts the live-migration, we will update the guest XML information with those mediated device UUIDs this way:

in source’s driver _live_migration_operation() we lookup the migrate data object we got and we update the target guest XML in get_updated_guest_xml() by getting those mediated device UUIDs from the migrate data object.
in destination’s driver post_live_migration_at_destination(), we delete the mdevs tracked in the internal dictionary field of the LibvirtDriver instance by getting them from the dictionary which is keyed by the instance UUID.
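The XML update on the source can be sketched with the standard library as follows. The function name update_mdev_uuids is hypothetical and this is not the code of get_updated_guest_xml(); it assumes the usual libvirt layout where an mdev hostdev carries its UUID on a source/address element, and the sample XML and UUIDs are made up.

```python
import xml.etree.ElementTree as ET


def update_mdev_uuids(guest_xml, target_mdevs):
    """Rewrite source mdev UUIDs to their claimed target UUIDs.

    target_mdevs maps a source mdev UUID to a target mdev UUID.
    """
    root = ET.fromstring(guest_xml)
    path = "./devices/hostdev[@type='mdev']/source/address"
    for address in root.findall(path):
        src_uuid = address.get("uuid")
        if src_uuid in target_mdevs:
            address.set("uuid", target_mdevs[src_uuid])
    return ET.tostring(root, encoding="unicode")


# Trimmed, hypothetical guest XML with a single mediated device.
GUEST_XML = """
<domain>
  <devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
      <source>
        <address uuid='aaaaaaaa-0000-0000-0000-000000000001'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""
new_xml = update_mdev_uuids(
    GUEST_XML,
    {"aaaaaaaa-0000-0000-0000-000000000001":
     "bbbbbbbb-0000-0000-0000-000000000002"},
)
```

Devices whose UUID is not in the mapping are left untouched, so non-mdev hostdevs and unrelated devices pass through unchanged.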

In case of any live migration abort or exception, the only residue we need to clean up is the list of claimed mediated devices for the migration that are set in the dictionary field of the LibvirtDriver instance. Accordingly, we propose to delete those records this way:

if the exception occurred during pre-live-migration, it eventually calls rollback_live_migration_at_destination() on the destination, depending on the _live_migration_cleanup_flags() result. We will modify that verification method to look up whether we have mediated device UUIDs in the migrate data object. Then, rollback_live_migration_at_destination() will again look at the dictionary to know which mediated devices to remove from the internal dictionary in the LibvirtDriver.
if the exception happened during the live-migration (or if the operator asked to abort it), then it eventually calls _rollback_live_migration() which also calls rollback_live_migration_at_destination() like above, so it would also remove the mdevs from the LibvirtDriver dictionary field.

As a side note, the current method we have for knowing which mediated devices are used by instances will be modified to also take into account the list of mediated devices that are currently set in the internal dictionary field of the LibvirtDriver that we’ll be using for tracking which mdevs are claimed for migrations.

Alternatives

Operators could continue to only do cold migrations or we could try to unplug and then plug mediated devices during live-migration like we do at the moment for SR-IOV VFs.

Data model impact

While we won’t describe the internal dictionary we would use in the LibvirtDriver class instance as this is just an implementation question, we still need to explain which objects will be passed between computes RPC services. As we said earlier, we need to augment the LibvirtLiveMigrateData object.

New fields will be added on that object (we can create a nested object if people prefer):

source_mdev_types: fields.DictOfStringsField() : dictionary where the key is a source mediated device UUID and the value is its mdev type.
target_mdevs: fields.DictOfStringsField() : dictionary where the key is a mediated device UUID of the source and the value a mdev UUID of the target, implicitly matching the relationship between both for the live-migration.
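How the two dictionaries relate can be sketched as a pairing step on the destination. The function name and the available_by_type input are hypothetical, and ValueError stands in for the MigrationPreCheckError the spec mentions; the real claim logic may differ.

```python
def pair_mdevs(source_mdev_types, available_by_type):
    """Build target_mdevs ({source UUID: target UUID}).

    source_mdev_types maps each source mdev UUID to its mdev type.
    available_by_type maps an mdev type to the free target mdev UUIDs.
    """
    # Copy the pools so the caller's data is not modified on failure.
    pools = {t: list(uuids) for t, uuids in available_by_type.items()}
    target_mdevs = {}
    for src_uuid, mdev_type in sorted(source_mdev_types.items()):
        pool = pools.get(mdev_type)
        if not pool:
            raise ValueError("no free target mdev of type %s" % mdev_type)
        target_mdevs[src_uuid] = pool.pop(0)
    return target_mdevs
```

Each source mdev is matched only with a target mdev of the same type, which is the compatibility requirement stated in the use cases.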

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

Operators wanting to use vGPU live-migration will need to support a recent libvirt release, so they probably need to upgrade their OS. They will also need to upgrade all their compute services, see Upgrade Impact for more details.

Developer impact

None.

Upgrade impact

Operators will need to make sure that the target computes are upgraded. That said, if the destination is not upgraded (and therefore doesn’t support live migration with mediated devices), it would return a LibvirtLiveMigrateData object with a previous version. The source will then know that the target doesn’t support it and will accordingly raise MigrationPreCheckError (we detailed that above in Proposed change).

Implementation

Assignee(s)

Primary assignee:: sylvain-bauza
Other contributors:: None

Feature Liaison

N/A

Work Items

add the LibvirtDriver internal dictionary
augment the LibvirtLiveMigrateData object
add the conductor checks
add the live-migration changes

Dependencies

As said above, it requires:

libvirt-8.6.0 and newer
QEMU-8.1.0 and newer
Linux kernel 5.18.0 and newer

Testing

Unit and functional tests are a very bare minimum but we’re actively chasing the idea to use the mtty kernel samples framework as a way to do some Tempest testing that’s yet unwritten. We may need to build a custom kernel in order to get the latest version of mtty that includes live-migration support.

Documentation Impact

We’ll augment the usual virtual GPU documentation with a section on how to live-migrate and its requirements.

As a note, the specific proprietary nVidia vfio-mdev driver that provides mediated device types and live-migration support currently has limitations and doesn’t support pausing a VM or the autoconverge feature. Besides, live-migration downtime depends heavily on the hardware, so we need to document those hardware-specific knobs in some abstract manner in our upstream docs, pointing as much as we can to the vendor documentation where it exists.

References

History

Revisions
Release Name	Description
2024.1 Caracal	Introduced

List requested Availability Zones

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/list-requested-az

Currently, the server show and server list --long output displays the current AZ of the instance. That is, the AZ to which the host of the instance belongs. There is no way to tell from this information whether the instance create request included an AZ or not.

This implementation enables users to validate that their request for Availability Zone was correctly processed and satisfied, by returning back information, not only about current placement of the instance, but also original request.

Problem description

As of today, the server show and server list --long output displays the current AZ of the instance. That is, the AZ to which the host of the instance belongs. There is no way to tell from this information whether the instance create request included an AZ or not.

Also, when the cross_az_attach option is False and an instance is booted from a volume, the instance can be pinned to an AZ; in that case, the instance will be scheduled on a host belonging to the pinned AZ.

Also, when the default_schedule_zone config option is set to a specific AZ, the instance is pinned to that AZ and will be scheduled on a host belonging to the pinned AZ.

Use Cases

As an operator, I want to know whether the instance create request asked for an AZ explicitly or not, and whether the requested AZ and the current AZ are the same or different.

Proposed change

The instances table in the nova cell database does not have the requested availability zone information. It can be obtained from the request_specs table in the nova_api database.

For server show output, use the existing get_by_instance_uuid method from RequestSpec object and display it in the output.

For server list --long output, implement a method get_by_instance_uuids for the RequestSpec object, which takes the list of instance uuids of the instances that will be shown in the listed output and returns a list of RequestSpec objects for those instances.
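The batched lookup for the list case can be sketched as follows, with plain dicts standing in for RequestSpec objects and server views. Both helper names are hypothetical, and the single list passed in represents the result of one get_by_instance_uuids call; the response key is the one proposed in the REST API impact section.

```python
def requested_az_by_instance(request_specs, instance_uuids):
    """Map each listed instance UUID to its requested AZ (or None)."""
    wanted = set(instance_uuids)
    found = {
        spec["instance_uuid"]: spec.get("availability_zone")
        for spec in request_specs
        if spec["instance_uuid"] in wanted
    }
    # Instances without a request spec simply report None.
    return {uuid: found.get(uuid) for uuid in instance_uuids}


def add_pinned_az(servers, request_specs):
    """Attach pinned_availability_zone to each server view."""
    azs = requested_az_by_instance(request_specs, [s["id"] for s in servers])
    for server in servers:
        server["pinned_availability_zone"] = azs[server["id"]]
    return servers
```

Fetching all request specs in one call and joining them in memory avoids issuing one query per listed server.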

Alternatives

As an alternative, we could add the requested availability zone information to the instances table and, when doing server list --long or server show, use only the data from the instances table and display it to users. However, this would duplicate the data across the request_specs and instances tables.

Data model impact

For implementation, we need to add a method get_by_instance_uuids to the RequestSpec object, which takes a list of instance uuids as input and returns a list of RequestSpec objects for those instances.

REST API impact

This change will be done with a new microversion bump.

Below are the two APIs that will be changed.

GET /servers/{server_id}

The server show API response will include the availability zone requested during server creation.

{
  "server":
      {
          ...
          "pinned_availability_zone": null
          ...
      }
}

GET /servers/detail

The server list --long API response will include the availability zone requested during server creation.

{
  "servers": [
      {
          ...
          "pinned_availability_zone": null
          ...
      },
      {
          ...
          "pinned_availability_zone": null
          ...
      }
  ]
}
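As an illustration of how a client might use the new field, the sketch below compares it with the current AZ. It assumes the current AZ is read from the OS-EXT-AZ:availability_zone field of the server body. The function name and the returned strings are hypothetical.

```python
from typing import Optional


def az_status(server: dict) -> str:
    """Classify a server by comparing its requested and current AZ."""
    pinned: Optional[str] = server.get("pinned_availability_zone")
    # Assumption: the current AZ is exposed under this existing key.
    current: Optional[str] = server.get("OS-EXT-AZ:availability_zone")
    if pinned is None:
        return "no AZ requested"
    if pinned == current:
        return "requested AZ matches current AZ"
    return "requested AZ differs from current AZ"
```

For example, a server body with pinned_availability_zone set to null falls into the first case regardless of where it currently runs.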

Security impact

None

Notifications impact

None

Other end user impact

Performance Impact

There will be a minor performance impact when a user calls server list --long, because an additional database call is needed to fetch the request_specs of the listed instances.

Other deployer impact

None

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: ratailor

Feature Liaison

Feature liaison:: ratailor

Work Items

Implement API changes
Add tests

Dependencies

openstackclient and openstacksdk need to be updated to support this change.

Testing

Add unit tests
Add functional tests (API samples)

Documentation Impact

The api-ref will be updated to reflect the changes.

References

None

History

Revisions
Release Name	Description
2024.1 Caracal	Introduced

VirtIO PackedRing Configuration support

Wed, 13 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/virtio-packedring-configuration-support

This blueprint proposes to expose the LibVirt packed option, which allows a guest to negotiate support for the VirtIO packed-ring feature. This blueprint is used to solicit the community's input.

Problem description

A VM using a virtio-net paravirtual network device uses virtual queues (virtqs) to send and receive data between the virtio-net driver and the virtual or physical backend. The VirtIO standard originally defined a single type of virtq, called the split-ring queue. The latest edition of the standard (v1.1) adds a different type of virtq, called the packed-ring queue. Its different layout of queue elements increases performance in both virtual and physical backends.

Split-ring support is the default option in VirtIO. Backends supporting packed-ring virtqs advertise this by setting the VIRTIO_F_RING_PACKED feature bit during feature negotiation. A guest driver then chooses the virtq layout based on what it supports. As both options are identical feature-wise and the packed ring is more efficient, the latter is typically chosen.

QEMU added support for packed virtqs in v4.2 and LibVirt in v6.3. Both QEMU and LibVirt support packed-ring virtqs via the packed option. Note, however, that this option does not force the VM to use the packed-ring virtq. It acts as a mask, allowing the backend to advertise the support when set. The driver in the VM is still responsible for choosing the layout of virtqs.
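For orientation, the sketch below builds a fragment of libvirt domain XML in which the packed attribute is set on the driver element of a virtio interface. It is an incomplete fragment for illustration only (a real interface definition needs more elements, such as a source), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def virtio_interface_xml(packed: bool) -> str:
    """Return a minimal <interface> fragment for a virtio-net device."""
    iface = ET.Element("interface", {"type": "bridge"})
    ET.SubElement(iface, "model", {"type": "virtio"})
    if packed:
        # Lets the backend advertise packed-ring support. The guest driver
        # still decides which virtq layout it actually uses.
        ET.SubElement(iface, "driver", {"packed": "on"})
    return ET.tostring(iface, encoding="unicode")


if __name__ == "__main__":
    print(virtio_interface_xml(packed=True))
```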

This blueprint proposes to add a Nova flavor extra_spec and a Glance image property that set the packed option to true on the node. This way all VMs running on the node are allowed to choose the virtq layout based on what is offered by the backend, rather than being forced to use split-ring.

Use Cases

As an operator, I want to benefit from the increase in the virtio-net performance, by using a more efficient virtq structure.

Proposed change

Add hw_virtio_packed_ring as an image property and hw:virtio_packed_ring as a flavor extra spec. Users will control the packed virtqueue feature and be able to disable it if desired.

hw_virtio_packed_ring=true|false (default false)
hw:virtio_packed_ring=true|false (default false)
Provide a new COMPUTE_NET_VIRTIO_PACKED compute capability trait. This trait can be required or forbidden by the user. The nova-compute agent will automatically set this trait on the resource provider, as long as the minimum libvirt version supported by OpenStack is higher than 6.3, the version required by the feature.
This spec will update the scheduling process. ALL_REQUEST_FILTERS will be extended with a new filter, packed_virtqueue_filter. When the image property or the flavor extra spec is enabled, the filter will add the new trait to the RequestSpec, to avoid migration to a node without packed virtqueue support.
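The request filter step could look roughly like the sketch below. The property names, the trait name and the filter name come from this spec. The signature, the set-based representation of required traits and the rule that either source enables the feature are assumptions made for illustration and do not reflect the actual Nova request filter interface.

```python
from typing import Mapping, Set

TRAIT = "COMPUTE_NET_VIRTIO_PACKED"
_TRUE_VALUES = {"true", "1", "yes", "on"}


def _is_true(value: object) -> bool:
    """Loosely interpret a string-like boolean value."""
    return str(value).strip().lower() in _TRUE_VALUES


def packed_ring_requested(
    extra_specs: Mapping[str, str], image_props: Mapping[str, str]
) -> bool:
    # Assumption: the feature is requested if either source enables it.
    return _is_true(extra_specs.get("hw:virtio_packed_ring", "false")) or _is_true(
        image_props.get("hw_virtio_packed_ring", "false")
    )


def packed_virtqueue_filter(
    required_traits: Set[str],
    extra_specs: Mapping[str, str],
    image_props: Mapping[str, str],
) -> bool:
    """Add the trait to the request; return True if the request changed."""
    if not packed_ring_requested(extra_specs, image_props):
        return False
    required_traits.add(TRAIT)
    return True
```

Requiring the trait at scheduling time is what keeps an instance from being placed on, or migrated to, a compute node that does not report packed virtqueue support.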

Alternatives

Leave as-is; operators will not get the additional performance benefit.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

VMs using virtio-net will see an increase in performance. The increase can be anywhere between 10-20% (see DPDK Intel Vhost/virtio perf. reports) and 75% (using Napatech SmartNICs).

Other deployer impact

None

Developer impact

None

Upgrade impact

If operators upgrade their computes one by one, only new upgraded computes will support that feature for a while.

Implementation

Assignee(s)

Primary assignee:: justas_napa on IRC and Gerrit
Additional assignees:: dvo-plv on IRC and Gerrit

Feature Liaison

Sean Mooney (sean-k-mooney)

Work Items

Add image property and flavor extra specs
Provide new compute capability trait
Update scheduling process

Dependencies

None

Testing

New Unit and Functional tests will be added:

Verify image property and flavor extra spec options for correct configuration.
Verify nodes filtering by trait.
Verify correct Libvirt XML with the driver packed option configured.

Documentation Impact

Configuration options reference will require an update.

References

VirtIO standard: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html
LibVirt Domain XML reference https://libvirt.org/formatdomain.html#virtio-related-options

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced
2023.2 Bobcat	Accepted
2024.1 Caracal	Reproposed

libvirt driver support for flavor and image defined ephemeral encryption

Tue, 05 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-encryption-libvirt

This spec outlines the specific libvirt virt driver implementation to support the Flavor and Image defined ephemeral storage encryption [1] spec.

Problem description

The libvirt virt driver currently provides very limited support for ephemeral disk encryption through the LVM imagebackend and the use of the PLAIN encryption format provided by dm-crypt.

As outlined in the Flavor and Image defined ephemeral storage encryption [1] spec this current implementation is controlled through compute host configurables and is transparent to end users, unlike block storage volume encryption via Cinder.

With the introduction of the Flavor and Image defined ephemeral storage encryption [1] spec we can now implement support for encrypting ephemeral disks via images and flavors, allowing support for newer encryption formats such as LUKSv1. This also has the benefit of being natively supported by QEMU, as already seen in the libvirt driver when attaching LUKSv1 encrypted volumes provided by Cinder.

Use Cases

As a user of a cloud with libvirt based computes I want to request that all of my ephemeral storage be encrypted at rest through the selection of a specific flavor or image.
As a user of a cloud with libvirt based computes I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want each encrypted ephemeral disk attached to my instance to have a separate unique secret associated with it.
As an operator I want to allow users to request that the ephemeral storage of their instances is encrypted using the flexible LUKSv1 encryption format.

Proposed change

Deprecate the legacy implementation within the libvirt driver

The legacy implementation using dm-crypt within the libvirt virt driver needs to be deprecated ahead of removal in a future release. This includes the following options:

[ephemeral_storage_encryption]/enabled
[ephemeral_storage_encryption]/cipher
[ephemeral_storage_encryption]/key_size

Limited support for dm-crypt will be introduced using the new framework before this original implementation is removed.

Populate disk_info with encryption properties

The libvirt driver has an additional disk_info dict built from the contents of the previously mentioned block_device_info and image metadata associated with an instance. With the introduction of the DriverImageBlockDevice within the Flavor and Image defined ephemeral storage encryption [1] spec we can now avoid the need to look again at image metadata while also adding some ephemeral encryption related metadata to the dict.

This dict currently contains the following:

disk_bus: The default bus used by disks
cdrom_bus: The default bus used by cd-rom drives
mapping: A nested dict keyed by disk name including information about each disk.

Each item within the mapping dict contains the following keys:

bus: The bus for this disk
dev: The device name for this disk as known to libvirt
type: A type from the BlockDeviceType enum (‘disk’, ‘cdrom’, ‘floppy’, ‘fs’, or ‘lun’)

It can also contain the following optional keys:

format: Used to format swap/ephemeral disks before passing to instance (e.g. ‘swap’, ‘ext4’)
boot_index: The 1-based boot index of the disk.

In addition to the above this spec will also optionally add the following keys for encrypted disks:

encryption_format: The encryption format used by the disk
encryption_options: A dict of encryption options
encryption_secret_uuid: The UUID of the encryption secret associated with the disk
backing_encryption_secret_uuid: The UUID of the encryption secret for the backing file associated with the disk in the case of qcow2.
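To make the layout concrete, here is an illustrative disk_info dict using the keys listed above. All values, including the bus names, device names and UUIDs, are made-up placeholders, and the helper encrypted_disks is hypothetical.

```python
from typing import List

# Illustrative only: key names follow the lists above, values are invented.
disk_info = {
    "disk_bus": "virtio",
    "cdrom_bus": "scsi",
    "mapping": {
        "disk": {
            "bus": "virtio",
            "dev": "vda",
            "type": "disk",
            "boot_index": 1,
            "encryption_format": "luks",
            "encryption_options": {},
            "encryption_secret_uuid": "11111111-1111-1111-1111-111111111111",
            # Only relevant for qcow2 disks with an encrypted backing file.
            "backing_encryption_secret_uuid": "22222222-2222-2222-2222-222222222222",
        },
        "disk.eph0": {
            "bus": "virtio",
            "dev": "vdb",
            "type": "disk",
            "format": "ext4",
            "encryption_format": "luks",
            "encryption_options": {},
            "encryption_secret_uuid": "33333333-3333-3333-3333-333333333333",
        },
        # A disk without the optional encryption keys is left unencrypted.
        "disk.config": {"bus": "scsi", "dev": "sda", "type": "cdrom"},
    },
}


def encrypted_disks(info: dict) -> List[str]:
    """Return the names of the disks that carry encryption properties."""
    return sorted(
        name for name, disk in info["mapping"].items() if "encryption_format" in disk
    )
```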

Handle ephemeral disk encryption within imagebackend

With the above in place we can now add encryption support within each image backend. As highlighted at the start of this spec this initial support will only be for the LUKSv1 encryption format.

Generic key management code will be introduced into the base nova.virt.libvirt.imagebackend.Image class and used to create and store the encryption secret within the configured key manager. The initial LUKSv1 support will store a passphrase for each disk within the key manager. This is unlike the current ephemeral storage encryption or encrypted volume implementations that currently store raw keys in the key manager due to their support of the dm-crypt PLAIN encryption format. With LUKSv1 it is not necessary to store raw keys as it does not directly encrypt data with the provided key [1].
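The per-disk passphrase handling could be sketched as follows. FakeKeyManager is an in-memory stand-in for the configured key manager, and the passphrase length and hex encoding are assumptions. None of these names are Nova or key manager APIs.

```python
import secrets
import uuid
from typing import Dict, Iterable


class FakeKeyManager:
    """In-memory stand-in for a key manager service (illustration only)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def store(self, passphrase: str) -> str:
        secret_uuid = str(uuid.uuid4())
        self._store[secret_uuid] = passphrase
        return secret_uuid

    def get(self, secret_uuid: str) -> str:
        return self._store[secret_uuid]


def create_disk_secrets(
    key_manager: FakeKeyManager, disk_names: Iterable[str]
) -> Dict[str, str]:
    """Create one random passphrase per disk and return disk -> secret UUID."""
    # Assumption: 32 random bytes, hex encoded, serve as the passphrase.
    return {name: key_manager.store(secrets.token_hex(32)) for name in disk_names}
```

The returned UUIDs are what would end up in the encryption_secret_uuid key of each disk, while the passphrases themselves stay in the key manager.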

The base nova.virt.libvirt.imagebackend.Image class will also be extended to accept and store the optional encryption details provided by disk_info above including the format, options and secret UUID(s).

Each backend will then be modified to encrypt disks during nova.virt.libvirt.imagebackend.Image.create_image using the provided format, options and secret.

Enable the `COMPUTE_EPHEMERAL_ENCRYPTION_LUKS` trait

Finally, with the above support in place the COMPUTE_EPHEMERAL_ENCRYPTION and COMPUTE_EPHEMERAL_ENCRYPTION_LUKS traits can be enabled when using a backend that supports LUKSv1. This will in turn enable scheduling to the compute of any user requests asking for ephemeral storage encryption using the format.

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

As discussed above the ephemeral encryption keys will be added to the disk_info for individual disks within the libvirt driver.

REST API impact

N/A

Security impact

This should hopefully be positive given the unique secret per disk and user visible choice regarding how their ephemeral storage is encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

QEMU will natively decrypt these LUKSv1 ephemeral disks for us using the libgcrypt library. While there have been performance issues with this in the past, workarounds [2] can be implemented that use dm-crypt instead.

Other deployer impact

N/A

Developer impact

This spec will aim to implement LUKSv1 support for all imagebackends but in the future any additional encryption formats supported by these backends will need to ensure matching traits are also enabled.

Upgrade impact

The legacy implementation is deprecated but will continue to work for the time being. As the new implementation is separate there is no further upgrade impact.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: lyarwood

Feature Liaison

Feature liaison:: melwitt

Work Items

Populate the individual disk dicts within disk_info with any ephemeral encryption properties.
Provide these properties to the imagebackends when creating each disk.
Introduce support for LUKSv1 based encryption within the imagebackends.
Enable the COMPUTE_EPHEMERAL_ENCRYPTION_LUKS trait when the selected imagebackend supports LUKSv1.

Dependencies

Flavor and Image defined ephemeral storage encryption [1]

Testing

Unlike the parent spec, once imagebackends support LUKSv1 and enable the required trait, we can introduce Tempest based testing of this implementation in addition to extensive functional and unit based tests.

Documentation Impact

New user documentation around the specific LUKSv1 support for ephemeral encryption within the libvirt driver.
Reference documentation around the changes to the virt block device layer.
Document that for the raw imagebackend, both [libvirt]images_type = raw and [DEFAULT]use_cow_images = False must be configured in order for resize to work. This is also true without encryption but it may still be helpful to users.
Document that a user must have policy permission to create secrets in Barbican in order for encryption to work for that user. Secrets are created in Barbican using the user’s auth token. Admins have permission to create secrets in Barbican by default.

References

Revisions
Release Name	Description
Wallaby	Introduced
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed
2023.2 Bobcat	Reproposed
2024.1 Caracal	Reproposed
2024.2 Dalmatian	Reproposed

Flavor and Image defined ephemeral storage encryption

Tue, 05 Mar 2024 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-storage-encryption

This spec outlines a new approach to ephemeral storage encryption in Nova allowing users to select how their ephemeral storage is encrypted at rest through the use of flavors with specific extra specs or images with specific properties. The aim being to bring the ephemeral storage encryption experience within Nova in line with the block storage encryption implementation provided by Cinder where user selectable encrypted volume types are available.

Note

This spec will only cover the high level changes to the API and compute layers, implementation within specific virt drivers is left for separate specs.

Problem description

At present the only in-tree ephemeral storage encryption support is provided by the libvirt virt driver when using the lvm imagebackend. The current implementation provides basic operator controlled and configured host specific support for ephemeral disk encryption at rest where all instances on a given compute are forced to use encrypted ephemeral storage using the dm-crypt PLAIN encryption format.

This is not ideal and makes ephemeral storage encryption completely opaque to the end user as opposed to the block storage encryption support provided by Cinder where users are able to opt-in to using admin defined encrypted volume types to ensure their storage is encrypted at rest.

Additionally the current implementation uses a single symmetric key to encrypt all ephemeral storage associated with the instance. As the PLAIN encryption format is used there is no way to rotate this key in-place.

Use Cases

As a user I want to request that all of my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As an admin/operator I want to either enforce ephemeral encryption per flavor or per image.
As an admin/operator I want to provide sane choices to my end users regarding how their ephemeral storage is encrypted at rest.
As a virt driver maintainer/developer I want to indicate that my driver supports ephemeral storage encryption using a specific encryption format.
As a virt driver maintainer/developer I want to provide sane default encryption format and options for users looking to encrypt their ephemeral storage at rest. I want these associated with the encrypted storage until it is deleted.

Proposed change

To enable this, new flavor extra specs, image properties and host configurables will be introduced. These will control when and how ephemeral storage encryption at rest is enabled for an instance.

Note

The following hw_ephemeral_encryption image properties do not relate to if an image is encrypted at rest within the Glance service. They only relate to how ephemeral storage will be encrypted at rest when used by a provisioned instance within Nova.

Allow ephemeral encryption to be configured by flavor, image, or config

To enable ephemeral encryption per instance the following boolean based flavor extra spec and image property will be introduced:

hw:ephemeral_encryption
hw_ephemeral_encryption

The above will enable ephemeral storage encryption for an instance but does not control the encryption format used. For this, a configuration option will be used to provide a default format per compute which will initially default to luks with no other choices at this time.

[ephemeral_storage_encryption]/default_format

To enable snapshot and shelve of instances using ephemeral encryption, the UUID of the encryption secret and the encryption format for the resultant image will be kept with the image using the standardized Glance image properties [1]:

os_encrypt_key_id
os_encrypt_format

The secret UUID and encryption format are needed when creating an instance from an ephemeral encrypted snapshot or when unshelving an ephemeral encrypted instance.

The other os_encrypt* Glance image properties will also be set at the time of snapshot:

os_encrypt_cipher - the cipher algorithm, e.g. ‘AES256’
os_encrypt_key_deletion_policy - on image deletion indicates whether the key should be deleted too
os_decrypt_container_format - format change, e.g. from ‘compressed’ to ‘bare’
os_decrypt_size - size after payload decryption

Possible future work

In the future, we could consider supporting a cloud with a mix of compute hosts providing either LUKSv1 (qcow2|raw|rbd) or legacy dm-crypt PLAIN (LVM) encryption formats.

The encryption format used would be controlled by the following flavor extra specs and image properties:

hw:ephemeral_encryption_format
hw_ephemeral_encryption_format

and would be used to schedule to a compute host which supports the specified format.

The format would be provided as a string that maps to a BlockDeviceEncryptionFormatTypeField oslo.versionedobjects field value:

legacy_dmcrypt_plain for the dm-crypt PLAIN format
luks for the LUKSv1 format

and if neither are specified, the format would be taken from the os_encrypt_format [1] if the source image is encrypted. If the source image is not encrypted, the format would be taken from [ephemeral_storage_encryption]/default_format after an instance lands on a compute host.
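The lookup order described in this possible future work could be sketched like this. The property names and format values come from the text above. The function name, the precedence of the flavor over the image when both set a format, and passing the compute host default as a parameter are assumptions.

```python
from typing import Mapping

VALID_FORMATS = ("legacy_dmcrypt_plain", "luks")


def resolve_encryption_format(
    extra_specs: Mapping[str, str],
    image_props: Mapping[str, str],
    default_format: str = "luks",
) -> str:
    """Pick the ephemeral encryption format for an instance."""
    # 1. An explicit request via flavor extra spec or image property.
    #    Assumption: the flavor wins if both are set.
    explicit = extra_specs.get("hw:ephemeral_encryption_format") or image_props.get(
        "hw_ephemeral_encryption_format"
    )
    if explicit:
        if explicit not in VALID_FORMATS:
            raise ValueError(f"unsupported encryption format: {explicit}")
        return explicit
    # 2. The format of an encrypted source image. Returned as stored on
    #    the image; no normalization is attempted in this sketch.
    image_format = image_props.get("os_encrypt_format")
    if image_format:
        return image_format
    # 3. The per-compute default, [ephemeral_storage_encryption]/default_format.
    return default_format
```

In the design described above, step 3 can only happen after the instance lands on a compute host, since the default is a per-host configuration value.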

Management of secret data with the Key Manager service

The passphrases of encrypted disks are managed using a Key Manager service such as Barbican.

Nova will create, retrieve, and delete disk passphrases using the authorization token of the user calling Nova API. The cloud operator must consider the implications of secret ownership with regard to server actions and who is allowed to perform them:

  ┌─────────────────────┐                        ┌────────────────────┐
  │                     │                        │                    │
  │                     │                        │                    │
  │       Nova API      │◄───────────────────────┤    Barbican API    │
  │                     │                        │                    │
  │                     ├─────┬────────────┬────►│                    │
  │                     │     │ User token │     │                    │
  │                     │     └────────────┘     │                    │
  │                     │                        │                    │
  └──────────▲──────────┘                        └────────────────────┘
             │
             │
             │
             │
             │
┌────────────┤
│ User token │
└────────────┤
             │
             │
             │
             │
        ┌─────────┐
        │         │
        │  User   │
        │         │
        └─────────┘

By default, Barbican scopes the ownership of a secret at the project level. This means that many calls in the Barbican API will perform an additional check to ensure that the project_id of the token matches the project_id stored as the secret owner. Users who are members of the same project have access to each other’s secrets in this configuration.

For admin-only APIs such as cold migrate, live migrate, and evacuate, the user calling Nova API to perform these server actions needs both:

Access to the Barbican secrets of the owner of the server
The admin role in order to call admin-only Nova APIs such as cold migration, live migration, evacuate, etc

In a default Barbican configuration, secret ownership will be scoped to the project which created it, so in such an environment a user would need to be a project administrator or any user who has both project membership and the admin role.

Note that it is possible for cloud operators to implement more fine-grained control of secrets in Barbican using access control lists. Secrets could be scoped at the user level, for example, instead of at the project level. In such a configuration, a project administrator would not be allowed to perform admin-only API server actions on a server belonging to a different user in the project.

Operators must plan ahead to determine what configuration and access control of Barbican secrets they need in their environments.

Important

For legacy deployments using [oslo_policy]enforce_scope = False in their service configuration files, an additional step is required to allow users to create servers with encrypted local disks.

In a legacy deployment, users must have the creator role or the admin role assigned to them in Keystone in order to be allowed to create secrets in the Barbican key manager service. Otherwise, user requests to create servers with encrypted local disks will fail.

$ openstack role list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 068b4910f0eb4a1cb6a4a2a1e94c3dfe | reader                    |
| 25dc4ed8f3814fd1941a580d78f2b635 | service                   |
| 7e832eeb2c2842c9b03c376bf3113247 | creator                   |
| 59df386beb0f460095b7622fc1a45e22 | member                    |
| 655bbf1b9f844399bcfbfbbef4248045 | admin                     |
+----------------------------------+---------------------------+

Create a new key manager secret for each block device mapping

The approach for disk image secrets is that each disk image has a unique secret.

For example:

Let’s say Instance A has 3 disks: one root disk, one ephemeral disk, and one swap disk. Each disk will have its own secret.

With qcow2, if an instance is created from an encrypted source image, the resulting backing file will have the same passphrase as the source image in order for the backing file to be shared among multiple instances. For each instance sharing the backing file, the instance has its own “copy” of the secret (a new Barbican secret that has the same passphrase).

This prevents a single point of failure with regard to Barbican secret deletion. For example, if 100 instances share the same encrypted backing file and a user mistakenly deletes a Barbican secret for the backing file, only one instance or image will be affected. If one Barbican secret were shared by the 100 instances using the same encrypted backing file, 100 instances and the source image would be affected.
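The effect of giving each instance its own copy of the backing file secret can be illustrated with a small in-memory model. The store, the helper names and the passphrase below are placeholders and do not correspond to any Barbican or Nova API.

```python
import uuid
from typing import Dict, List

# In-memory stand-in for the key manager: secret UUID -> passphrase.
secret_store: Dict[str, str] = {}


def store_secret(passphrase: str) -> str:
    secret_uuid = str(uuid.uuid4())
    secret_store[secret_uuid] = passphrase
    return secret_uuid


def copy_secret(secret_uuid: str) -> str:
    """Create a new secret holding the same passphrase as an existing one."""
    return store_secret(secret_store[secret_uuid])


# Source image secret, plus one copy per instance sharing the backing file.
image_secret = store_secret("example-passphrase")
instance_secrets: List[str] = [copy_secret(image_secret) for _ in range(100)]

# A user mistakenly deletes the secret belonging to one instance.
del secret_store[instance_secrets[0]]

# Only that instance lost its secret; the other 99 and the image are intact.
still_usable = [s for s in instance_secrets if s in secret_store]
```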

Barbican does have a reference counting API for secret consumers which increments and decrements an internal counter over HTTP. If the count for a given secret incorrectly becomes zero over time for any reason (race conditions, etc.), the API will allow deletion of that secret even if it is in use.

This table is intended to illustrate the way secrets are handled in various scenarios.

Instance or Image	Disk	Secret (passphrase)	Notes
Instance A	disk (root)	Secret 1	Secret 1, 2, and 3 will be automatically deleted by Nova when Instance A is deleted and its disks are destroyed
	disk.eph0	Secret 2
	disk.swap	Secret 3
Image Z (snapshot) created from Instance A	disk (root)	Secret 4 (copy of Secret 1 by default)	Secret 4 will not be automatically deleted and manual deletion will be needed if/when Image Z is deleted from Glance
Instance B created from Image Z (snapshot)	disk (root)	Secret 5 ^* (copy of Secret 4)	Secret 5, 6, 7, and 8 will be automatically deleted by Nova when Instance B is deleted and its disks are destroyed
	disk (root)	Secret 6
	disk.eph0	Secret 7
	disk.swap	Secret 8
Instance C	disk (root)	Secret 9	Secret 9, 10, and 11 will be automatically deleted by Nova when Instance C is deleted and its disks are destroyed
	disk.eph0	Secret 10
	disk.swap	Secret 11
Image Y (snapshot) created by shelve of Instance C	disk (root)	Secret 9	Secret 9 is reused when Instance C is shelved in part to prevent the possibility of a change in ownership of the root disk secret if, for example, an admin user shelves a non-admin user’s instance. This approach could be avoided if there is some way we could create a new secret using the instance’s user/project rather than the shelver’s user/project
Rescue disk created by rescue of Instance A	disk (root)	None	A rescue disk is only encrypted if an encrypted rescue image was specified.
Rescue disk created by rescue of Instance A using encrypted rescue image	disk (root)	Secret of encrypted rescue image	The secret of the encrypted rescue image will be reused and no new secrets will be created or deleted

^* backing file secret for qcow2 only

Encrypted source images

The default behavior when creating an instance from an encrypted source image will be to create encrypted disks. The reasoning is that we aim to avoid “surprise” decryption of images and that decryption should be something that a user or flavor or image has to opt-in to and explicitly request so the intent is clear.

Encrypted source images will have the os_encrypt_key_id, os_encrypt_format, and other [1] image properties in their image metadata. Access to the secret of the encrypted source image is determined by the key manager API policy and/or access control lists.

At this time, we expect to use a subset of the standardized image properties:

os_encrypt_format - to know how to interpret the image format

os_encrypt_key_id - to copy/convert/etc the source image if needed

When creating an instance with encrypted disks from an encrypted source image when hw_ephemeral_encryption has not been set, we will either use the presence of the automatically stored image_os_encrypt_key_id in system metadata or potentially store image_hw_ephemeral_encryption=true in the instance system metadata and use it to ensure an instance will be scheduled to a compute host which supports ephemeral encryption.

If the os_encrypt_key_id image property is set on the encrypted image and the image or flavor also has hw_ephemeral_encryption=false or hw:ephemeral_encryption=false explicitly set, we will reject the API request with a 409 conflict error at this time.
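A minimal sketch of that rejection rule follows. The property names come from this spec. The exception class, the function name and the string comparison used to detect an explicit false are assumptions for illustration.

```python
from typing import Mapping


class Conflict(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict API error."""

    status_code = 409


def check_encrypted_image_request(
    extra_specs: Mapping[str, str], image_props: Mapping[str, str]
) -> None:
    """Reject explicit requests for unencrypted disks from an encrypted image."""
    if "os_encrypt_key_id" not in image_props:
        return  # The source image is not encrypted, nothing to check.
    requested = (
        extra_specs.get("hw:ephemeral_encryption"),
        image_props.get("hw_ephemeral_encryption"),
    )
    # Only an explicit "false" conflicts; an unset value does not.
    if any(v is not None and str(v).strip().lower() == "false" for v in requested):
        raise Conflict(
            "Ephemeral encryption is explicitly disabled but the source "
            "image is encrypted."
        )
```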

We could consider future work to interpret the aforementioned combination of image property settings as an intentional request to create an instance with unencrypted disks from the encrypted source image and perform the decryption.

Encrypted backing files (qcow2)

Backing files are encrypted if and only if the source image from which they were created is encrypted. If the source image is not encrypted, the backing file stored internally in Nova will also not be encrypted.

An encrypted backing file uses the same passphrase as the source image from which it was created. This is required for the encrypted backing file to be shared among multiple instances in the same project.

Backing files for ephemeral disks and swap disks are never encrypted as they are always created from blank disks.

Snapshots of instances with ephemeral encryption

When an instance with ephemeral encryption is snapshotted, the behavior for encrypting the image snapshot is determined by request parameters which will be added to the snapshot API.

The API request parameters are intended to support workflows that involve sharing of encrypted image snapshots with other projects or users.

Examples:

An instance owner wants to back up their disk
An instance owner wants to make a copy of their disk that is encrypted with a new key
An instance owner wants to make a copy of their disk using an existing key that belongs to a different project or user (provided that project or user has created the necessary access control list for the secret)
An instance owner wants to create an unencrypted public copy of their disk
An instance owner with an unencrypted disk wants to make an encrypted copy to facilitate secure exfiltration of their disk to another location

New API microversion for Create Image (createImage Action)

A new microversion will be added to the create image API to support ephemeral encryption options. Users will be able to choose how they want encryption of the new image snapshot to be handled. They can use the same key as the image being snapshotted (the default), have Nova generate a new key and use it to encrypt the image snapshot, provide their own key secret UUID to use to encrypt the image snapshot, or not encrypt the image snapshot at all.

Request for `POST /servers/{server_id}/action` with `createImage`

Request
Name	In	Type	Description
server_id	path	string	The UUID of the server.
createImage	body	object	The action to create a snapshot of the image or the volume(s) of the server.
name	body	string	The display name of an Image.
metadata (Optional)	body	object	Metadata key and value pairs for the image. The maximum size for each metadata key and value pair is 255 bytes.
encryption (Optional)	body	object	Encryption options for the image to create. These options apply only to encrypted local disks.
encryption.key	body	string	The key to use to encrypt the image snapshot. Valid values are: `same`: Use the same key to encrypt the image snapshot. This is the default. `new`: Generate a new key and use it to encrypt the image snapshot. `existing`: The user will provide the UUID of an existing secret in the key manager service to use to encrypt the image snapshot. `none`: Do not encrypt the image snapshot.
encryption.secret_uuid (Optional)	body	string	The UUID of the key manager service secret that was used to encrypt the image snapshot.

{
    "createImage" : {
        "name" : "foo-image",
        "metadata": {
            "meta_var": "meta_val"
        },
        "encryption": {
            "key": "same|new|existing|none",
            "secret_uuid": "<secret uuid> if 'key' is 'existing', or absent"
        }
    }
}

Request choices for encryption.key:

same: Use the same key to encrypt the new disk image. This is the default.
new: Generate a new key to encrypt the new disk image.
existing: Use the provided <secret uuid> to encrypt the new disk image.
none: Do not encrypt the new disk image.
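The choices above can be sketched as a small validation routine. This is only an illustration of the rules described in this section, not the JSON schema Nova would ship; the function name `validate_encryption_options` and its return shape are assumptions made for the example.

```python
import uuid

# The four values of encryption.key listed in this spec.
VALID_KEY_CHOICES = ("same", "new", "existing", "none")


def validate_encryption_options(encryption):
    """Normalize the optional 'encryption' object of a createImage request.

    Hypothetical helper for illustration only; raises ValueError on
    semantically invalid input.
    """
    encryption = dict(encryption or {})
    # 'same' is the default when no key choice is given.
    key = encryption.get("key", "same")
    if key not in VALID_KEY_CHOICES:
        raise ValueError(
            "encryption.key must be one of %s" % ", ".join(VALID_KEY_CHOICES))

    secret_uuid = encryption.get("secret_uuid")
    if key == "existing":
        if secret_uuid is None:
            raise ValueError("secret_uuid is required when key is 'existing'")
        # uuid.UUID raises ValueError for a malformed UUID string.
        uuid.UUID(str(secret_uuid))
    elif secret_uuid is not None:
        raise ValueError("secret_uuid is only valid when key is 'existing'")

    return {"key": key, "secret_uuid": secret_uuid}
```

A request without an `encryption` object normalizes to `{"key": "same", "secret_uuid": None}`, matching the documented default.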

Note

Ceph release Quincy (v17) and older do not support creating a cloned image with an encryption key different from its parent. For this reason, the encryption.key request parameter with a value of new will not be supported with the rbd image backend for those versions of Ceph.

If a user requests a snapshot with encryption.key set to new on a deployment running Ceph Quincy (v17) or older, the plan is to mark the snapshot server action as failed with a message explaining that new is not supported in the deployment.

See https://github.com/ceph/ceph/commit/1d3de19 for reference.
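The Ceph restriction in this note could be expressed roughly as follows. The function and exception names are hypothetical, and treating an unknown Ceph version as unsupported is an assumption of this sketch rather than something the spec states.

```python
QUINCY_MAJOR = 17  # Ceph Quincy (v17), per the note above


class NewKeyNotSupported(Exception):
    """Stand-in for the failure recorded on the snapshot server action."""


def new_key_supported(image_backend, ceph_major_version=None):
    """Return True if encryption.key == 'new' can be honoured.

    Only the rbd image backend is restricted. An unknown Ceph version is
    conservatively treated as unsupported (an assumption of this sketch).
    """
    if image_backend != "rbd":
        return True
    if ceph_major_version is None:
        return False
    return ceph_major_version > QUINCY_MAJOR


def check_key_choice(key, image_backend, ceph_major_version=None):
    """Raise if the requested key choice cannot be used in this deployment."""
    if key == "new" and not new_key_supported(image_backend,
                                              ceph_major_version):
        raise NewKeyNotSupported(
            "encryption.key 'new' is not supported in this deployment")
```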

Response for `POST /servers/{server_id}/action` with `createImage`

(There will be no change to the response parameters.)

Response
Name	In	Type	Description
image_id	body	string	The UUID for the resulting image snapshot.

{
    "image_id": "0e7761dd-ee98-41f0-ba35-05994e446431"
}

Create Server Back Up (createBackup Action) API

The POST /servers/{server_id}/action API with createBackup will not be changed. Image snapshots created by this API will be encrypted using the same key.

Image metadata for image snapshots of encrypted disks

When an encrypted image snapshot is created, its image properties will be set to contain encryption information when Nova uploads it to Glance. There is a Glance spec proposed to establish a set of standardized image properties for all projects to use when working with encrypted Glance images:

os_encrypt_format - the main mechanism used, e.g. ‘LUKS’

os_encrypt_cipher - the cipher algorithm, e.g. ‘AES256’

os_encrypt_key_id - reference to key in the key manager

os_encrypt_key_deletion_policy - on image deletion indicates whether the key should be deleted too

os_decrypt_container_format - format after payload decryption, e.g. ‘qcow’

os_decrypt_size - size after payload decryption

and will be used for snapshots of encrypted images in Nova.

When a new instance is created from an encrypted image, the image properties are passed down to the lower layers by their presence in the instance’s system metadata with image_ prefix. The system metadata is used because at the lower layers (where qemu-img convert is called, for example) we no longer have access to the image metadata and nontrivial refactoring to pass image metadata to several lower layer methods, or similar, would be required otherwise.

Snapshots created by shelving instances with ephemeral encryption

When an instance with ephemeral encryption is shelved, the existing root disk encryption secret is reused and will be used to unshelve the instance later. This is done to prevent a potential change in ownership of the root disk encryption secret in a scenario where an admin user shelves a non-admin user’s instance, for example. If a new secret were created owned by the admin user, the non-admin user who owns the instance would be unable to unshelve the instance.

This behavior could be avoided however if there is some way we could create a new encryption secret using the instance’s user and project rather than the shelver’s user and project. If that were possible, we would not need to reuse the encryption secret.

Rescue disk images created when the rescue image is encrypted

When rescuing an instance and an encrypted rescue image is specified, the rescue image secret UUID from the image property will be used to encrypt the rescue disk. A new key manager secret will not be created.

The rescue image secret is used because it will exist whether the instance has an encrypted root disk or not. It is technically possible to specify an encrypted rescue image for an instance that does not otherwise have encrypted local disks.

The rescue disk will be encrypted if and only if the rescue image is encrypted, with the objective of not creating unencrypted data at rest from data that is currently encrypted at rest.

A new virt driver secret will be created for the rescue disk and deleted when the instance is unrescued.

Cleanup of ephemeral encryption secrets

Ephemeral encryption secrets are deleted from the key manager and the virt driver when the corresponding instance is deleted and its disks are destroyed.

Virt driver secrets may be created on destination hosts and deleted from source hosts as needed during instance migrations.

Key manager secrets are however only deleted when the disks associated with them are destroyed.

Encryption secrets that are created when a snapshot is created are never deleted by Nova. It would only be acceptable to delete the secret if and when the image snapshot is deleted from Glance. There is an os_encrypt_key_deletion_policy image property proposed in the standardized Glance image properties that Nova will set to tell Glance to delete the key manager secret for the image at the same time the image is deleted.

BlockDeviceMapping changes

The BlockDeviceMapping object will be extended to include the following fields encapsulating some of the above information per ephemeral disk within the instance:

encrypted: A simple boolean to indicate if the block device is encrypted. This will initially only be populated when ephemeral encryption is used but could easily be used for encrypted volumes as well in the future.
encryption_secret_uuid: As the name suggests this will contain the UUID of the associated encryption secret for the disk. The type of secret used here will be specific to the encryption format and virt driver used. It should not be assumed that this will always be a symmetric key, as is currently the case with all encrypted volumes provided by Cinder. For example, for LUKS-based ephemeral storage this secret will be a passphrase.
backing_encryption_secret_uuid: This will contain the UUID of the associated encryption secret for the backing file for the disk in the case of qcow2.
encryption_format: A new BlockDeviceEncryptionFormatType enum and associated BlockDeviceEncryptionFormatTypeField field listing the encryption format. The available options will be kept in line with the constants currently provided by os-brick, and the two could potentially be merged in the future if both can share these types and fields somehow.
encryption_options: A simple unversioned dict of strings containing encryption options specific to the virt driver implementation, underlying hypervisor and format being used.

Note

The encryption_options field may be used to store the encryption parameters that were used to create the disk such as cipher algorithm, cipher mode, and initialization vector generator algorithm.

The intention will be to be able to track the encryption attributes of each disk to aid in handling future upgrade scenarios such as removal of an algorithm or a change in a default in QEMU.
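The per-disk attributes listed above can be pictured with a plain dataclass. This is a minimal sketch rather than the actual versioned object: the class names and enum values are illustrative assumptions, and the real BlockDeviceMapping carries many other fields.

```python
import enum
from dataclasses import dataclass, field
from typing import Dict, Optional


class EncryptionFormat(enum.Enum):
    """Illustrative stand-in for BlockDeviceEncryptionFormatType.

    The member values are assumptions chosen for this example.
    """
    PLAIN = "plain"
    LUKS = "luks"
    LUKSV2 = "luksv2"


@dataclass
class BlockDeviceEncryption:
    """Only the encryption-related attributes described in this section."""
    encrypted: bool = False
    encryption_secret_uuid: Optional[str] = None
    # Only meaningful when the disk has a backing file (qcow2).
    backing_encryption_secret_uuid: Optional[str] = None
    encryption_format: Optional[EncryptionFormat] = None
    # Unversioned dict of strings, e.g. cipher algorithm or cipher mode.
    encryption_options: Dict[str, str] = field(default_factory=dict)
```

Recording the options used at creation time is what would later allow an upgrade check to find disks created with a parameter that has since been removed or had its default changed.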

Populate ephemeral encryption BlockDeviceMapping attributes during build

When launching an instance with ephemeral encryption requested via either the image or flavor the BlockDeviceMapping.encrypted attribute will be set to True for each BlockDeviceMapping record with a destination_type value of local. This will happen after the original API BDM dicts have been transformed into objects within the Compute API but before scheduling the instance(s).

The encryption_format attribute will also take its value from the image or flavor if provided. Any differences or conflicts between the image and flavor for this will result in a 409 Conflict error being raised by the API.
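A rough sketch of this build-time step follows. It uses plain dicts in place of BlockDeviceMapping objects, and the helper and exception names are hypothetical; only the image property and extra spec names come from this spec.

```python
class EncryptionFormatConflict(Exception):
    """Stand-in for the error the API would translate into 409 Conflict."""


def resolve_encryption_format(image_props, flavor_extra_specs):
    """Pick the requested format, refusing conflicting image/flavor values."""
    image_fmt = image_props.get("hw_ephemeral_encryption_format")
    flavor_fmt = flavor_extra_specs.get("hw:ephemeral_encryption_format")
    if image_fmt and flavor_fmt and image_fmt != flavor_fmt:
        raise EncryptionFormatConflict(
            "image requests %r but flavor requests %r"
            % (image_fmt, flavor_fmt))
    # May be None, leaving the choice of format to the compute host.
    return image_fmt or flavor_fmt


def mark_local_bdms_encrypted(bdms, encryption_format=None):
    """Flag every local-destination BDM dict as encrypted, in place."""
    for bdm in bdms:
        if bdm.get("destination_type") == "local":
            bdm["encrypted"] = True
            if encryption_format is not None:
                bdm["encryption_format"] = encryption_format
    return bdms
```

Volume-backed mappings are left untouched because only records with a destination_type of local are in scope.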

Use `COMPUTE_EPHEMERAL_ENCRYPTION` compatibility traits

A COMPUTE_EPHEMERAL_ENCRYPTION compute compatibility trait was introduced during Wallaby and will be reported by virt drivers to indicate overall support for ephemeral storage encryption using this new approach. This trait will always be used by the pre-filter outlined in the following section when ephemeral encryption has been requested, regardless of any format being specified in the request, allowing the compute that eventually handles the request to select a format it supports using the [ephemeral_storage_encryption]/default_format configurable.

COMPUTE_EPHEMERAL_ENCRYPTION_$FORMAT compute compatibility traits were also added to os-traits during Wallaby and will be reported by virt drivers to indicate support for specific ephemeral storage encryption formats. For example:

COMPUTE_EPHEMERAL_ENCRYPTION_LUKS
COMPUTE_EPHEMERAL_ENCRYPTION_LUKSV2
COMPUTE_EPHEMERAL_ENCRYPTION_PLAIN

These traits will only be used alongside the COMPUTE_EPHEMERAL_ENCRYPTION trait when the hw_ephemeral_encryption_format image property or hw:ephemeral_encryption_format extra spec has been provided in the initial request.

Introduce an ephemeral encryption request pre-filter

A new pre-filter will be introduced that adds the above traits as required to the request spec when the aforementioned image properties or flavor extra specs are provided. As outlined above this will always include the COMPUTE_EPHEMERAL_ENCRYPTION trait when ephemeral encryption has been requested and may optionally include one of the format specific traits if a format is included in the request.
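The trait selection performed by the pre-filter might look like the sketch below. It works on plain dicts rather than a real request spec, the function names are hypothetical, and the accepted boolean spellings and the upper-casing of the format value into a trait suffix are assumptions of the example.

```python
BASE_TRAIT = "COMPUTE_EPHEMERAL_ENCRYPTION"


def _is_true(value):
    # Assumed boolean spellings; the real parsing rules may differ.
    return str(value).strip().lower() in ("true", "1", "yes", "on")


def required_encryption_traits(image_props, flavor_extra_specs):
    """Return the set of traits to require for this request."""
    requested = (
        _is_true(image_props.get("hw_ephemeral_encryption", ""))
        or _is_true(flavor_extra_specs.get("hw:ephemeral_encryption", "")))
    if not requested:
        # Nothing to add, so the pre-filter can return immediately.
        return set()

    traits = {BASE_TRAIT}
    fmt = (image_props.get("hw_ephemeral_encryption_format")
           or flavor_extra_specs.get("hw:ephemeral_encryption_format"))
    if fmt:
        # e.g. "luks" -> COMPUTE_EPHEMERAL_ENCRYPTION_LUKS
        traits.add("%s_%s" % (BASE_TRAIT, str(fmt).upper()))
    return traits
```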

Expose ephemeral encryption attributes via block_device_info

Once the BlockDeviceMapping objects have been updated and the instance scheduled to a compute the objects are transformed once again into a block_device_info dict understood by the virt layer that at present contains the following:

root_device_name: The root device path used by the instance.
ephemerals: A list of DriverEphemeralBlockDevice dict objects detailing the ephemeral disks attached to the instance. Note this does not include the initial image based disk used by the instance that is classified as an ephemeral disk in terms of the ephemeral encryption feature.
block_device_mapping: A list of DriverVol*BlockDevice dict objects detailing the volume based disks attached to the instance.
swap: An optional DriverSwapBlockDevice dict object detailing the swap device.

For example:

{
    "root_device_name": "/dev/vda",
    "ephemerals": [
        {
            "guest_format": null,
            "device_name": "/dev/vdb",
            "device_type": "disk",
            "size": 1,
            "disk_bus": "virtio"
        }
    ],
    "block_device_mapping": [],
    "swap": {
        "swap_size": 1,
        "device_name": "/dev/vdc",
        "disk_bus": "virtio"
    }
}

As noted above block_device_info does not provide a complete overview of the storage associated with an instance. In order for it to be useful in the context of ephemeral storage encryption we would need to extend the dict to always include information relating to local image based disks.

As such a new DriverImageBlockDevice dict class will be introduced covering image based block devices and provided to the virt layer via an additional image key within the block_device_info dict when the instance uses such a disk. As with the other Driver*BlockDevice dict classes this will proxy access to the underlying BlockDeviceMapping object allowing the virt layer to lookup the previously listed encrypted and encryption_* attributes.
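The proxying behaviour described here can be illustrated with a simplified dict subclass. The class below is a sketch only: its name, the keys it copies, and the use of a `SimpleNamespace` as a stand-in BlockDeviceMapping are all assumptions, and whether the image key holds a single entry or a list is not specified above.

```python
from types import SimpleNamespace


class DriverImageBlockDeviceSketch(dict):
    """Dict view of an image-backed disk that proxies encryption attributes."""

    _proxied = (
        "encrypted",
        "encryption_secret_uuid",
        "backing_encryption_secret_uuid",
        "encryption_format",
        "encryption_options",
    )

    def __init__(self, bdm):
        super().__init__(device_name=bdm.device_name, disk_bus=bdm.disk_bus)
        self._bdm = bdm

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in self._proxied:
            return getattr(self._bdm, name)
        raise AttributeError(name)


# Stand-in for a BlockDeviceMapping object describing the image based disk.
fake_bdm = SimpleNamespace(
    device_name="/dev/vda", disk_bus="virtio", encrypted=True,
    encryption_secret_uuid="0e7761dd-ee98-41f0-ba35-05994e446431",
    backing_encryption_secret_uuid=None, encryption_format="luks",
    encryption_options={})

block_device_info = {
    "root_device_name": "/dev/vda",
    # Shown as a list here purely for illustration.
    "image": [DriverImageBlockDeviceSketch(fake_bdm)],
    "ephemerals": [],
    "block_device_mapping": [],
    "swap": None,
}
```

With this shape the virt layer could read `block_device_info["image"][0].encrypted` without needing a separate lookup of the underlying object.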

While outside the scope of this spec the above highlights a huge amount of complexity and technical debt still residing in the codebase around how storage configurations are handled between the different layers. In the long term we should plan to remove block_device_info and replace it with direct access to BlockDeviceMapping based objects ensuring the entire configuration is always exposed to the virt layer.

Report that a disk is encrypted at rest through the metadata API

Extend the metadata API so that users can confirm that their ephemeral storage is encrypted at rest through the metadata API, accessible from within their instance.

{
    "devices": [
        {
            "type": "nic",
            "bus": "pci",
            "address": "0000:00:02.0",
            "mac": "00:11:22:33:44:55",
            "tags": ["trusted"]
        },
        {
            "type": "disk",
            "bus": "virtio",
            "address": "0:0",
            "serial": "12352423",
            "path": "/dev/vda",
            "encrypted": "True"
        },
        {
            "type": "disk",
            "bus": "ide",
            "address": "0:0",
            "serial": "disk-vol-2352423",
            "path": "/dev/sda",
            "tags": ["baz"]
        }
    ]
}

This should also be extended to cover disks provided by encrypted volumes but this is obviously out of scope for this implementation.

Block resize between flavors with different `hw:ephemeral_encryption` values

Ephemeral data is expected to persist through a resize, so any resize between flavors that differ in their ephemeral encryption configuration (one enabled and another disabled, different formats, etc.) would require us to convert this data in place. This isn’t trivial, so for this initial implementation resizing between flavors that differ will be blocked.

Support for resizing between flavors with different ephemeral encryption parameters is planned to be added in a separate patch later in the series.
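The initial restriction amounts to comparing the encryption-related extra specs of the two flavors. A minimal sketch, assuming the two extra spec names used elsewhere in this spec are the only ones compared and that values are compared as raw strings:

```python
ENCRYPTION_EXTRA_SPECS = (
    "hw:ephemeral_encryption",
    "hw:ephemeral_encryption_format",
)


def resize_allowed(old_extra_specs, new_extra_specs):
    """Return True only if both flavors agree on ephemeral encryption.

    Hypothetical helper; a missing extra spec is treated as distinct from
    any explicitly set value.
    """
    return all(
        old_extra_specs.get(key) == new_extra_specs.get(key)
        for key in ENCRYPTION_EXTRA_SPECS)
```

Unrelated extra specs are ignored, so a resize that only changes other flavor properties is unaffected by this check.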

Provide a migration path from the legacy implementation

New nova-manage and nova-status commands will be introduced to migrate any instances using the legacy libvirt virt driver implementation ahead of the removal of this in a future release.

The nova-manage command will ensure that any existing instances with ephemeral_key_uuid set will have their associated BlockDeviceMapping records updated to reference said secret key, the legacy_dmcrypt_plain encryption format and configured options on the host before clearing ephemeral_key_uuid.

Additionally the libvirt virt driver will also attempt to migrate instances with ephemeral_key_uuid set during spawn. This should allow at least some of the instances to be moved during the W release ahead of X.

The nova-status command will simply report on the existence of any instances with ephemeral_key_uuid set that do not have the corresponding BlockDeviceMapping attributes enabled etc.
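The two commands described above can be sketched against plain dicts. The function names are hypothetical, the restriction to local-destination mappings is an assumption, and real code would operate on Instance and BlockDeviceMapping objects and persist the changes.

```python
LEGACY_FORMAT = "legacy_dmcrypt_plain"


def _local(bdms):
    return [b for b in bdms if b.get("destination_type") == "local"]


def migrate_legacy_instance(instance, bdms, host_options):
    """nova-manage style step: move the legacy key reference onto the BDMs.

    Returns True if the instance was migrated, False if nothing was needed.
    """
    key_uuid = instance.get("ephemeral_key_uuid")
    if not key_uuid:
        return False
    for bdm in _local(bdms):
        bdm.update(
            encrypted=True,
            encryption_secret_uuid=key_uuid,
            encryption_format=LEGACY_FORMAT,
            # Options configured on the host at migration time.
            encryption_options=dict(host_options),
        )
    # Clear the legacy field only after the BDMs reference the secret.
    instance["ephemeral_key_uuid"] = None
    return True


def needs_migration(instance, bdms):
    """nova-status style check: legacy key set but BDMs not yet updated."""
    if not instance.get("ephemeral_key_uuid"):
        return False
    return not all(b.get("encrypted") for b in _local(bdms))
```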

Deprecate the now legacy implementation

The legacy implementation within the libvirt virt driver will be deprecated for removal in a future release once the ability to migrate is in place.

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

See above for the various flavor extra spec, image property, BlockDeviceMapping and DriverBlockDevice object changes.

REST API impact

A new API microversion will be created to add encryption options to the createImage server action API.
Flavor extra spec and image property validation will be introduced for any provided ephemeral encryption options.
Attempts to resize between flavors that differ in their ephemeral encryption options will be rejected.
Attempts to rebuild between images that differ in their ephemeral encryption options will be allowed only when made by the user who owns the instance. Requests from any other user to rebuild between images that differ in their ephemeral encryption options will be rejected. This is to prevent a change in the ownership of secrets for the instance disks.
The metadata API will be changed to allow users to determine if their ephemeral storage is encrypted as discussed above.

Security impact

This should hopefully be positive given the unique secret per disk and user visible choice regarding how their ephemeral storage is encrypted at rest.

Additionally this should allow additional virt drivers to support ephemeral storage encryption while also allowing the libvirt virt driver to increase coverage of the feature across more image backends such as qcow2 and rbd.

Notifications impact

N/A

Other end user impact

Users may be able to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavor.

Performance Impact

The additional pre-filter will add a small amount of overhead when scheduling instances but this should fail fast if ephemeral encryption is not requested through the image or flavor.

The performance impact of increased use of ephemeral storage encryption by instances is left to be discussed in the virt driver specific specs as this will vary between hypervisors.

Other deployer impact

N/A

Developer impact

Virt driver developers will be able to indicate support for specific ephemeral storage encryption formats using the newly introduced compute compatibility traits.

Upgrade impact

The compute traits should ensure that requests to schedule instances using ephemeral storage encryption with mixed computes (N-1 and N) will work during a rolling upgrade.

As discussed earlier in the spec future upgrades will need to provide a path for existing ephemeral storage encryption users to migrate from the legacy implementation. This should be trivial but may require an additional grenade based job in CI during the W cycle to prove out the migration path.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: lyarwood

Feature Liaison

Feature liaison:: melwitt

Work Items

Introduce hw_ephemeral_encryption image properties and hw:ephemeral_encryption flavor extra specs.
Introduce new encrypted, encryption_secret_uuid, backing_encryption_secret_uuid, encryption_format and encryption_options attributes to the BlockDeviceMapping object.
Wire up the new BlockDeviceMapping object attributes through the Driver*BlockDevice layer and block_device_info dict.
Report ephemeral storage encryption through the metadata API.
Introduce new nova-manage and nova-status commands to allow existing users to migrate to this new implementation. This should however be blocked outside of testing until a virt driver implementation is landed.
Validate all of the above in functional tests ahead of any virt driver implementation landing.

Dependencies

None

Testing

At present without a virt driver implementation this will be tested entirely within our unit and functional test suites.

Once a virt driver implementation is available additional integration tests in Tempest and whitebox tests can be written.

Testing of the migration path from the legacy implementation will require an additional grenade job but this will require the libvirt virt driver implementation to be completed first.

Documentation Impact

The new host configurables, flavor extra specs and image properties should be documented.
New user documentation should be written covering the overall use of the feature from a Nova point of view.
Reference documentation around BlockDeviceMapping objects etc should be updated to make note of the new encryption attributes.

References

https://review.opendev.org/c/openstack/glance-specs/+/915726

History

Revisions
Release Name	Description
Wallaby	Introduced
Xena	Reproposed
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed
2023.2 Bobcat	Reproposed
2024.1 Caracal	Reproposed
2024.2 Dalmatian	Reproposed

Example Spec - The title of your blueprint

Fri, 19 Jan 2024 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it's possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in ReSTructured text, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ascii diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ascii diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which can not be viewed in gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated, for example if you need to take into account existing instances, or modify other existing data describe how that will work.

REST API impact

Each API method which is either added or changed should have the following

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal http response code(s)
- Expected error http response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - URL should not include underscores, and use hyphens instead.
- Parameters which can be passed via the url
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to login?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds or removes fields to the API responses. For example when the feature adds a new field to the GET /servers API response consider adding similar information to the payload of the instance action notifications
When the feature adds a new action to the existing API entities. For example adding a new action to the server might mean you want to emit a corresponding new instance action notification
When the feature adds a new resource (noun) to the REST API consider adding new notifications about the creation and deletion of such resource

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it's merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others. For example, extending the size of an attached volume. Since Nova does not yet have any type of sophisticated capabilities API so a user can know what actions can be performed on a given instance, consider adding a new policy rule to at least let operators that cannot support a virt-specific feature disable it in their cloud which is at least presented to the user in an understandable way by getting a 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However we suggest to find a liaison for your feature as it will help getting your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of library?

Testing

Is this untestable in gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc).

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section intended to be used each time the spec is updated, to describe any new design, API or database schema changes. Useful to let the reader understand how the spec has changed over time.

Revisions
Release Name	Description
2024.2 Dalmatian	Introduced

PCI Passthrough Groups

Tue, 31 Oct 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/pci-passthrough-groups

This spec allows operators to create a flavor using a PCI alias to request a group of PCI devices. These groups of PCI devices are tracked as a single indivisible unit within Placement. The default custom resource class used to track these PCI groups is derived from the PCI group type name, and the name of the inventory is derived from the PCI group name. The pci_alias config already supports mapping to a specific placement resource class.

Problem description

Some PCI devices only make sense to be consumed as a group. When you assign the grouped PCI devices to a VM, all of the devices in the group are always consumed together by a single VM. Currently Nova does not understand any grouping other than NUMA affinity.

While there are some cases where a device could be consumed by multiple different groups, that are dynamically picked based on demand, we are ignoring these use cases for now. In particular, we make the simplifying restriction that a tracked PCI device can only be a member of a single group, and when a PCI device is a member of a group, it can only be used as part of that PCI group.

Use Cases

Some GPUs expose both a graphics physical function and an audio function. In order to support passing through both devices, we need to ensure that we pass through a matching pair of devices. This spec would allow a device group to be created such that operators configure the matching pairs of audio and graphics devices, and users can request one (or more) of those pairs via the usual PCI alias.

Note, we are currently excluding the use case of users requesting either the pair of devices or just the graphics device, as that would result in additional complexity that should be considered in a separate follow on specification.

Let us consider the specific case of the Graphcore C200 device, where a set of PCI cards are connected together via IPU-Link: https://docs.graphcore.ai/projects/C600-datasheet/en/latest/product-description.html#ipu-link-cables

Each physical card presents two PCI devices. The card can be used independently of other cards if a matched pair of devices is presented to the VM. PCI groups allow this device to be correctly passed through to VMs by ensuring a matched pair of PCI devices is always assigned to each VM.

In addition, some servers can be statically configured to group either two devices, four devices or eight devices as a single group. These can all be statically configured using PCI groups to ensure we always respect the non-PCI connectivity between those PCI devices.

Proposed change

The key parts of this change include:

extend [pci]device_spec to model groups of PCI devices
devices are linked by both a group type name, and a specific group name
the group type name is used to generate a custom resource class, i.e. CUSTOM_PCI_GROUP_<group_type_name>. Note that this is only the default, which takes effect when you specify a group type name; it can be overridden by explicitly specifying a different resource_class tag.
Each group is registered in placement in a similar way to a device: each group is a separate resource provider with a single inventory item of the custom resource class for its group type, and a name that is generated from the group_name rather than the PCI device address
[pci]alias simply maps to the resource class mentioned above, such as CUSTOM_PCI_GROUP_<group_type_name>.
PCI tracker will have the group_name and group_type_name added to each device that is being tracked, such that we can look up a group of devices associated with each specific named group tracked in placement.
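
As a rough illustration of the default naming rule above, the following Python sketch derives a resource class from a group type name. The function name and the normalisation step (upper-casing and replacing any character outside A-Z, 0-9 and underscore) are assumptions made for this example, not part of the proposed implementation.

```python
import re


def default_group_resource_class(group_type_name: str) -> str:
    """Return CUSTOM_PCI_GROUP_<group_type_name> in a normalised form.

    Hypothetical helper: the normalisation below is an assumption.
    """
    # Upper-case the name, then replace anything that is not A-Z, 0-9 or "_".
    normalised = re.sub(r"[^A-Z0-9_]", "_", group_type_name.upper())
    return f"CUSTOM_PCI_GROUP_{normalised}"


print(default_group_resource_class("c200_x1"))
```

An explicit resource_class tag would simply bypass this default.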

There will be configuration validation checks:

pci groups are only supported when PCI devices are tracked in placement
all device groups must have two or more PCI devices
each physical PCI device can only be in one group, and must only be tracked in placement once
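
The checks listed above could be sketched roughly as follows. This is only an illustration: validate_pci_groups and its tracked_in_placement argument are hypothetical names, and the sketch assumes each device_spec entry is a JSON object that names a single address.

```python
import json
from collections import defaultdict


def validate_pci_groups(device_specs, tracked_in_placement=True):
    """Validate device_spec JSON strings; return group name -> addresses.

    Hypothetical helper illustrating the three checks listed above.
    """
    groups = defaultdict(list)
    owner = {}  # address -> name of the group that already claimed it
    for raw in device_specs:
        spec = json.loads(raw)
        name = spec.get("group_name")
        if name is None:
            continue  # this device is not part of any group
        if not tracked_in_placement:
            raise ValueError("PCI groups need PCI devices tracked in placement")
        address = spec["address"]
        if address in owner:
            raise ValueError(f"{address} is already in group {owner[address]}")
        owner[address] = name
        groups[name].append(address)
    for name, addresses in groups.items():
        if len(addresses) < 2:
            raise ValueError(f"group {name} has fewer than two devices")
    return dict(groups)


specs = [
    '{"address": ":4e:00.0", "group_name": "g1", "group_type": "t"}',
    '{"address": ":4f:00.0", "group_name": "g1", "group_type": "t"}',
]
print(validate_pci_groups(specs))
```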

For example, let's consider the following PCI devices:

4e:00.0 Processing accelerators: Graphcore Ltd Device 0003
4f:00.0 Processing accelerators: Graphcore Ltd Device 0003
89:00.0 Processing accelerators: Graphcore Ltd Device 0003
8a:00.0 Processing accelerators: Graphcore Ltd Device 0003

The two physical cards, spread across two NUMA nodes, can be presented in two possible ways: either as two groups or as a single group, depending on the use case. For example, two separate groups would be::

[pci]
device_spec = {"address": ":4e:00.0", "group_name":"graphcore_1", "group_type":"c200_x1"}
device_spec = {"address": ":4f:00.0", "group_name":"graphcore_1", "group_type":"c200_x1"}
device_spec = {"address": ":89:00.0", "group_name":"graphcore_2", "group_type":"c200_x1"}
device_spec = {"address": ":8a:00.0", "group_name":"graphcore_2", "group_type":"c200_x1"}
alias = {"name":"c200_x1", "resource_class":"CUSTOM_PCI_GROUP_C200_X1"}

But presenting both cards, four PCI devices in total, as a single group would look like this::

[pci]
device_spec = {"address": ":4e:00.0", "group_name":"graphcore_1", "group_type":"c200_x2"}
device_spec = {"address": ":4f:00.0", "group_name":"graphcore_1", "group_type":"c200_x2"}
device_spec = {"address": ":89:00.0", "group_name":"graphcore_1", "group_type":"c200_x2"}
device_spec = {"address": ":8a:00.0", "group_name":"graphcore_1", "group_type":"c200_x2"}
alias = {"name":"c200_x2", "resource_class":"CUSTOM_PCI_GROUP_C200_X2"}

Alternatives

For some simple cases, NUMA affinity can simulate what is required. But currently hardware like Graphcore C200 does not work well with Nova.

Data model impact

PCI tracker needs to be extended to include group_name and group_type for each PCI device.

REST API impact

No impact

Security impact

No impact

Notifications impact

No impact

Other end user impact

No impact

Performance Impact

No impact

Other deployer impact

The device spec configuration gets some extra options to help define groups, and the default resource class changes when you use the new group_name and group_type tags, as discussed above.

Developer impact

None

Upgrade impact

Devices that are exposed as a group must not already be tracked in placement when you start to expose them as a group.

Only new compute nodes will report the new resource classes, which should naturally remove the need for older compute nodes to know what to do with the new PCI device configuration.

Implementation

Assignee(s)

Primary assignee:: johngarbutt
Other contributors:: nathanharper

Feature Liaison

Feature liaison:: gibi?

Work Items

Update pci device config to support pci groups
Update PCI device tracker to know about pci groups
Attach groups of devices when device alias requests a resource class that maps to a PCI device group
Update placement with the available resources from the described pci groups

Dependencies

None

Testing

Add a functional test, similar to vgpu tests.

Documentation Impact

Configuration changes need to be documented correctly.

References

None

History

Optional section intended to be used each time the spec is updated, to describe any new design, API or database schema changes. Useful to let the reader understand how the spec has changed over time.

Revisions
Release Name	Description
2024.1 Caracal	Introduced

Allow Manila shares to be directly attached to an instance when using libvirt

Mon, 16 Oct 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-virtiofs-attach-manila-shares

Problem description

Use Cases

As an operator I want the Manila datapath to be separate from any tenant accessible networks.
As a user I want to attach Manila shares directly to my instance and have a simple interface with which to mount them within the instance.
As a user I want to detach a directly attached Manila share from my instance.
As a user I want to track the Manila shares attached to my instance.

Proposed change

Support for move operations once a share is attached will also not be covered by this spec. Any request to shelve, evacuate, resize, cold migrate or live migrate an instance with a share attached will be rejected with an HTTP 409 response for the time being.

A new server shares API will be introduced under a new microversion. This will list current shares, show their details and allow a share to be attached or detached.

Note

The libvirt driver will be extended to support the above with initial support for cold attach and detach. Future work will aim to add live attach and detach once support lands in libvirt itself.

COMPUTE_STORAGE_VIRTIO_FS trait

and either the

COMPUTE_MEM_BACKING_FILE trait

or that the instance is configured with the hw:mem_page_size extra spec.

From an operator’s point of view, COMPUTE_STORAGE_VIRTIO_FS support requires upgrading all compute nodes to a version that supports shares using virtiofs.

Users will be able to mount the attached shares using a mount tag. This is either the share UUID from Manila or a string provided by the user with their request to attach the share.

user@instance $ mount -t virtiofs $tag /mnt/mount/path

Share mapping status:

                     +----------------------------------------------------+   Reboot VM
    Start VM         |                                                    | --------------+
    Share mounted    |                       active                       |               |
+------------------> |                                                    | <-------------+
|                    +----------------------------------------------------+
|                      |                   |             |
|                      | Stop VM           |             |
|                      | Fail to umount    |             |
|                      v                   |             |
|                    +------------------+  |             |
|                    |      error       | <+-------------+-------------------+
|                    +------------------+  |             |                   |
|                      |                   |             |                   |
|                      | Detach share or   |             |                   |
|                      | delete VM         | Delete VM   |                   |
|                      v                   |             |                   |
|                    +------------------+  |             |                   |
|    +-------------> |        φ         | <+             |                   | Start VM
|    |               +------------------+                |                   | Fail to mount
|    |                 |                                 |                   |
|    | Detach share    |                                 | Stop VM           |
|    | or delete VM    | Attach share                    | Share unmounted   |
|    |                 v                                 v                   |
|    |               +----------------------------------------------------+  |
|    +-------------- |                      inactive                      | -+
|                    +----------------------------------------------------+
|                      |
+----------------------+

φ: means no entry in the database. No association between a share and a server.
Attach share: means POST /servers/{server_id}/shares
Detach share: means DELETE /servers/{server_id}/shares/{shareId}

This chart describes the share mapping status in Nova, which is independent of the status of the Manila share.
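
The chart can also be read as a transition table. The sketch below encodes it in Python, with None standing for φ. The event names are invented labels for the arrows in the chart; they are not identifiers from Nova.

```python
# None stands for φ: no share mapping entry exists in the database.
TRANSITIONS = {
    (None, "attach_share"): "inactive",
    ("inactive", "start_vm_share_mounted"): "active",
    ("inactive", "start_vm_mount_failed"): "error",
    ("inactive", "detach_share"): None,
    ("inactive", "delete_vm"): None,
    ("active", "reboot_vm"): "active",
    ("active", "stop_vm_share_unmounted"): "inactive",
    ("active", "stop_vm_umount_failed"): "error",
    ("active", "delete_vm"): None,
    ("error", "detach_share"): None,
    ("error", "delete_vm"): None,
}


def next_status(status, event):
    """Return the status reached from `status` on `event`, per the chart."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in status {status!r}"
        ) from None


status = next_status(None, "attach_share")
status = next_status(status, "start_vm_share_mounted")
print(status)
```

Any pair that is missing from the table, such as attaching a share that is already active, is rejected.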

Share attachment/detachment can only be done if the VM state is STOPPED or ERROR. These are operations only on the database, and no RPC calls will be required to the compute API. This is an intentional design for this spec. As a result, this could lead to a situation where the VM start operation fails because an underlying share attach fails.

The umount operation will actually be performed only when the share is mounted and is no longer used by any other server.

With the above mount and umount operations, the state is stored in memory and does not require a lookup in the database.
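
One way to picture this in-memory bookkeeping is a per-host record of which servers use each share. The class below is a hypothetical sketch of that idea, not the proposed implementation; it only decides whether a real mount or umount would be needed.

```python
from collections import defaultdict


class ShareMountTracker:
    """In-memory record of which servers use each share (hypothetical)."""

    def __init__(self):
        self._users = defaultdict(set)  # share_id -> set of server ids

    def mount(self, share_id, server_id):
        """Register a user; return True if the share must really be mounted."""
        first_user = not self._users[share_id]
        self._users[share_id].add(server_id)
        return first_user

    def umount(self, share_id, server_id):
        """Drop a user; return True if the share must really be unmounted."""
        users = self._users.get(share_id)
        if not users or server_id not in users:
            return False  # nothing recorded for this server
        users.discard(server_id)
        if users:
            return False  # another server still uses the share
        del self._users[share_id]
        return True


tracker = ShareMountTracker()
first = tracker.mount("share-1", "server-a")
second = tracker.mount("share-1", "server-b")
```

Here only the first mount call and the last umount call for a share would touch the host.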

Manila share removal issue:

Despite a share being used by instances, it can be removed by the user. As a result, the instances will lose access to the data, and it may be difficult to remove the missing share and fix the instance. This is an identified issue that requires Manila modifications. A solution was identified with the Manila team: attach metadata to the share access-allow policy that will lock the share and prevent its deletion until the lock is removed. If the above Manila change can land in the Zed cycle, the proposal here is to use the lock mechanism in Nova. Otherwise, clearly document the known issue as unsupported and warn users that they should take care to avoid this pitfall.

Instance metadata:

Add instance shares to the instance metadata. Extend DeviceMetadata with a ShareMetadata object containing the shareId and the tag used by the user to mount the virtiofs on an instance. See Other end user impact.

Alternatives

REST API impact

A new server level shares API will be introduced under a new microversion with the following methods:

GET /servers/{server_id}/shares

List all shares attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "shares": [
        {
            "shareId": "48c16a1a-183f-4052-9dac-0e4fc1e498ad",
            "status": "active",
            "tag": "foo"
        },
        {
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar"
        }
    ]
}

GET /servers/{server_id}/shares/{shareId}

Show details of a specific share attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar"
    }
}

PROJECT_ADMIN will be able to see details of the attachment id and export location stored within Nova:

{
    "share": {
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

POST /servers/{server_id}/shares

Attach a share to an instance.

Prerequisite(s):

Instance must be in the SHUTOFF state.
Instance should have the required capabilities to enable virtiofs (see above).

This is a synchronous API. As a result, the VM share attachment state is defined in the database and set as inactive. Powering on the VM will then perform the operations required to attach the share and set it as active if there are no errors.

Return Code(s): 202,400,401,403,404,409

Request body:

Note

tag is an optional parameter in the request body. When it is not provided, it defaults to the shareId (UUID), which is always provided in the request.

If provided by the user, tag must be an ASCII string with a maximum length of 64 bytes.

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}

Response body:

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}
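
The tag rules for this request can be sketched in Python as follows. Both helper names are hypothetical, and the sketch only covers the constraints stated above (optional tag, ASCII only, at most 64 bytes, defaulting to the shareId).

```python
def resolve_share_tag(share_id, tag=None):
    """Return the mount tag for a share, defaulting to the share UUID."""
    if tag is None:
        return share_id
    if not tag.isascii():
        raise ValueError("tag must be an ASCII string")
    if len(tag.encode("ascii")) > 64:
        raise ValueError("tag must be at most 64 bytes long")
    return tag


def build_attach_body(share_id, tag=None):
    """Build a request body for POST /servers/{server_id}/shares."""
    share = {"shareId": share_id}
    if tag is not None:
        # Only send the tag when the user supplied one.
        share["tag"] = resolve_share_tag(share_id, tag)
    return {"share": share}


body = build_attach_body("e8debdc0-447a-4376-a10a-4cd9122d7986")
print(body)
```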

DELETE /servers/{server_id}/shares/{shareId}

Detach a share from an instance.

Prerequisite(s): Instance must be in the SHUTOFF or ERROR state.

Return Code(s): 202,400,401,403,404,409

Data model impact

A new share_mapping database table will be introduced.

id - Primary key autoincrement
uuid - Unique UUID to identify the particular share attachment
instance_uuid - The UUID of the instance the share will be attached to
share_id - The UUID of the share in Manila
status - The status of the share attachment within Nova (active, inactive, error)
tag - The device tag to be used by users to mount the share within the instance.
export_location - The export location used to attach the share to the underlying host
share_proto - The Shared File Systems protocol (NFS, CEPHFS)

A new base ShareMapping versioned object will be introduced to encapsulate the above database entries and to be used as the parent class of specific virt driver implementations.

Fields containing text will use String and not Text type in the database schema to limit the column width and be stored inline in the database.

This base ShareMapping object will provide stub attach and detach methods that will need to be implemented by any child objects.

New ShareMappingLibvirt, ShareMappingLibvirtNFS and ShareMappingLibvirtCephFS objects will be introduced as part of the libvirt implementation.
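
The relationship between the base object and the libvirt objects might look roughly like the sketch below. It uses plain dataclasses rather than Nova versioned objects, the attach and detach bodies are placeholders, and make_share_mapping is a hypothetical helper; only the class names and fields come from this spec.

```python
from dataclasses import dataclass


@dataclass
class ShareMapping:
    uuid: str
    instance_uuid: str
    share_id: str
    status: str
    tag: str
    export_location: str
    share_proto: str

    def attach(self):
        raise NotImplementedError  # stub, implemented by child objects

    def detach(self):
        raise NotImplementedError  # stub, implemented by child objects


class ShareMappingLibvirt(ShareMapping):
    """Behaviour common to all libvirt share mappings would live here."""


class ShareMappingLibvirtNFS(ShareMappingLibvirt):
    def attach(self):
        # Placeholder: a real driver would mount the NFS export on the host.
        return ("attach", "NFS", self.export_location)

    def detach(self):
        return ("detach", "NFS", self.export_location)


class ShareMappingLibvirtCephFS(ShareMappingLibvirt):
    def attach(self):
        # Placeholder: a real driver would mount the CephFS export.
        return ("attach", "CEPHFS", self.export_location)

    def detach(self):
        return ("detach", "CEPHFS", self.export_location)


BY_PROTO = {"NFS": ShareMappingLibvirtNFS, "CEPHFS": ShareMappingLibvirtCephFS}


def make_share_mapping(**fields):
    """Pick the libvirt class that matches the share_proto field."""
    return BY_PROTO[fields["share_proto"]](**fields)
```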

Security impact

Notifications impact

New notifications will be added:

New notifications for share attach and share detach.
An extension of the instance update notification to carry the share mapping information.

Share mapping in the instance payload will be optional and controlled via the include_share_mapping notification configuration parameter. It will be disabled by default.

The proposed payload for the attach and detach notifications will be the same as the one returned by the show command with admin rights.

{
    "share": {
        "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

The proposed instance payload for instance update will be the list of shares attached to this instance.

{
    "shares":
    [
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar",
            "export_location": "server.com/nfs_mount,foo=bar"
        },
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-ffffffffffff",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7987",
            "status": "active",
            "tag": "baz",
            "export_location": "server2.com/nfs_mount,foo=bar"
        }
    ]
}

Other end user impact

Users will need to mount the shares within their guest OS using the returned tag.

Users could use the instance metadata to discover and auto mount the share.

Performance Impact

Other deployer impact

None

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)
Other contributors:: lyarwood (initial contributor)

Feature Liaison

Feature liaison:: uggla

Work Items

Add new capability traits within os-traits
Add support within the libvirt driver for cold attach and detach
Add new shares API and microversion

Dependencies

None

Testing

Functional libvirt driver and API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Yoga	Introduced
Zed	Reproposed
Antelope	Reproposed
Bobcat	Reproposed
Caracal	Reproposed

libvirt driver support for flavor and image defined ephemeral encryption

Thu, 05 Oct 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-encryption-libvirt

This spec outlines the specific libvirt virt driver implementation to support the Flavor and Image defined ephemeral storage encryption [1] spec.

Problem description

The libvirt virt driver currently provides very limited support for ephemeral disk encryption through the LVM imagebackend and the use of the PLAIN encryption format provided by dm-crypt.

Use Cases

As a user of a cloud with libvirt based computes I want to request that all of my ephemeral storage be encrypted at rest through the selection of a specific flavor or image.
As a user of a cloud with libvirt based computes I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want each encrypted ephemeral disk attached to my instance to have a separate unique secret associated with it.
As an operator I want to allow users to request that the ephemeral storage of their instances is encrypted using the flexible LUKSv1 encryption format.

Proposed change

Deprecate the legacy implementation within the libvirt driver

The legacy implementation using dm-crypt within the libvirt virt driver needs to be deprecated ahead of removal in a future release. This includes the following options:

[ephemeral_storage_encryption]/enabled
[ephemeral_storage_encryption]/cipher
[ephemeral_storage_encryption]/key_size

Limited support for dm-crypt will be introduced using the new framework before this original implementation is removed.

Populate disk_info with encryption properties

The disk_info dict currently contains the following:

disk_bus: The default bus used by disks
cdrom_bus: The default bus used by cd-rom drives
mapping: A nested dict keyed by disk name including information about each disk.

Each item within the mapping dict contains the following keys:

bus: The bus for this disk
dev: The device name for this disk as known to libvirt
type: A type from the BlockDeviceType enum (‘disk’, ‘cdrom’, ‘floppy’, ‘fs’, or ‘lun’)

It can also contain the following optional keys:

format: Used to format swap/ephemeral disks before passing to instance (e.g. ‘swap’, ‘ext4’)
boot_index: The 1-based boot index of the disk.

In addition to the above this spec will also optionally add the following keys for encrypted disks:

encryption_format: The encryption format used by the disk
encryption_options: A dict of encryption options
encryption_secret_uuid: The UUID of the encryption secret associated with the disk
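
To make the shape of this dict concrete, here is a small Python sketch of a disk_info value with one encrypted and one unencrypted disk. All values are made up for illustration, including the placeholder secret UUID, and encrypted_disks is a hypothetical helper.

```python
disk_info = {
    "disk_bus": "virtio",
    "cdrom_bus": "sata",
    "mapping": {
        "disk": {
            "bus": "virtio",
            "dev": "vda",
            "type": "disk",
            "boot_index": "1",
            # Optional keys added by this spec for encrypted disks.
            "encryption_format": "luks",
            "encryption_options": {},
            "encryption_secret_uuid": "00000000-0000-0000-0000-000000000000",
        },
        "disk.eph0": {
            "bus": "virtio",
            "dev": "vdb",
            "type": "disk",
            "format": "ext4",
        },
    },
}


def encrypted_disks(info):
    """Return the names of disks that carry an encryption_format key."""
    return sorted(
        name
        for name, disk in info["mapping"].items()
        if "encryption_format" in disk
    )


print(encrypted_disks(disk_info))
```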

Handle ephemeral disk encryption within imagebackend

With the above in place we can now add encryption support within each image backend. As highlighted at the start of this spec, this initial support will only be for the LUKSv1 encryption format.

Generic key management code will be introduced into the base nova.virt.libvirt.imagebackend.Image class and used to create and store the encryption secret within the configured key manager. The initial LUKSv1 support will store a passphrase for each disk within the key manager. This is unlike the current ephemeral storage encryption or encrypted volume implementations that currently store a symmetric key in the key manager. This remains a long running piece of technical debt in the encrypted volume implementation as LUKSv1 does not directly encrypt data with the provided key.
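
The idea of one passphrase per disk, stored in the key manager and referenced by UUID, can be sketched as below. InMemoryKeyManager is a stand-in invented for this example and does not reflect the API of any real key manager; the passphrase length is also an arbitrary choice.

```python
import secrets
import uuid


class InMemoryKeyManager:
    """Stand-in for a real key manager (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}

    def store(self, passphrase):
        """Keep the passphrase and return the UUID that identifies it."""
        secret_uuid = str(uuid.uuid4())
        self._store[secret_uuid] = passphrase
        return secret_uuid

    def get(self, secret_uuid):
        return self._store[secret_uuid]


def create_disk_secrets(key_manager, disk_names):
    """Create one random passphrase per disk; return name -> secret UUID."""
    return {
        name: key_manager.store(secrets.token_urlsafe(32))
        for name in disk_names
    }


km = InMemoryKeyManager()
secret_uuids = create_disk_secrets(km, ["disk", "disk.eph0", "disk.swap"])
```

Each disk ends up with its own secret UUID, which is the value that would be recorded as encryption_secret_uuid.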

Each backend will then be modified to encrypt disks during nova.virt.libvirt.imagebackend.Image.create_image using the provided format, options and secret.

Enable the `COMPUTE_EPHEMERAL_ENCRYPTION_LUKS` trait

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

As discussed above the ephemeral encryption keys will be added to the disk_info for individual disks within the libvirt driver.

REST API impact

N/A

Security impact

This should hopefully be positive given the unique secret per disk and user visible choice regarding how their ephemeral storage is encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

Other deployer impact

N/A

Developer impact

Upgrade impact

The legacy implementation is deprecated but will continue to work for the time being. As the new implementation is separate there is no further upgrade impact.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: lyarwood

Feature Liaison

Feature liaison:: melwitt

Work Items

Populate the individual disk dicts within disk_info with any ephemeral encryption properties.
Provide these properties to the imagebackends when creating each disk.
Introduce support for LUKSv1 based encryption within the imagebackends.
Enable the COMPUTE_EPHEMERAL_ENCRYPTION_LUKS trait when the selected imagebackend supports LUKSv1.

Dependencies

Flavor and Image defined ephemeral storage encryption [1]

Testing

Documentation Impact

New user documentation around the specific LUKSv1 support for ephemeral encryption within the libvirt driver.
Reference documentation around the changes to the virt block device layer.
Document that for the raw imagebackend, both [libvirt]images_type = raw and [DEFAULT]use_cow_images = False must be configured in order for resize to work. This is also true without encryption but it may still be helpful to users.
Document that a user must have policy permission to create secrets in Barbican in order for encryption to work for that user. Secrets are created in Barbican using the user’s auth token. Admins have permission to create secrets in Barbican by default.

References

Revisions
Release Name	Description
Wallaby	Introduced
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed
2023.2 Bobcat	Reproposed
2024.1 Caracal	Reproposed

Flavour and Image defined ephemeral storage encryption

Thu, 05 Oct 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-storage-encryption

Note

This spec will only cover the high level changes to the API and compute layers, implementation within specific virt drivers is left for separate specs.

Problem description

Use Cases

As a user I want to request that all of my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As an admin/operator I want to either enforce ephemeral encryption per flavor or per image.
As an admin/operator I want to provide sane choices to my end users regarding how their ephemeral storage is encrypted at rest.
As a virt driver maintainer/developer I want to indicate that my driver supports ephemeral storage encryption using a specific encryption format.
As a virt driver maintainer/developer I want to provide sane default encryption format and options for users looking to encrypt their ephemeral storage at rest. I want these associated with the encrypted storage until it is deleted.

Proposed change

To enable this, new flavor extra specs, image properties and host configurables will be introduced. These will control when and how ephemeral storage encryption at rest is enabled for an instance.

Note

Separate image properties have been documented in the Glance image encryption and Cinder image encryption specs to cover how images can be encrypted at rest within Glance.

Allow ephemeral encryption to be configured by flavor, image or config

To enable ephemeral encryption per instance the following boolean based flavor extra spec and image property will be introduced:

hw:ephemeral_encryption
hw_ephemeral_encryption

The above will enable ephemeral storage encryption for an instance but does not control the encryption format used or the associated options. For this the following flavor extra specs, image properties and configurables will be introduced.

The encryption format used will be controlled by the following flavor extra specs and image properties:

hw:ephemeral_encryption_format
hw_ephemeral_encryption_format

When neither of the above is provided but ephemeral encryption is still requested, an additional host configurable will be used to provide a default format per compute. This will initially default to luks:

[ephemeral_storage_encryption]/default_format

This could lead to requests against different clouds resulting in a different ephemeral encryption format being used but as this is transparent to the end user from within the instance it shouldn’t have any real impact.

The format will be provided as a string that maps to a BlockDeviceEncryptionFormatTypeField oslo.versionedobjects field value:

plain for the plain dm-crypt format
luks for the LUKSv1 format
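
The selection of a format from flavor, image and host default could be sketched as follows. This is a simplified illustration: the function names are hypothetical, the boolean parsing is an assumption, and treating a flavor/image mismatch as an error is an assumption made for this example rather than something stated above.

```python
SUPPORTED_FORMATS = ("plain", "luks")


def _as_bool(value):
    """Loosely parse a boolean extra spec or image property (assumption)."""
    return str(value).strip().lower() in ("1", "true", "yes", "on")


def resolve_encryption_format(extra_specs, image_props, default_format="luks"):
    """Return the format to use, or None if encryption is not requested."""
    requested = _as_bool(
        extra_specs.get("hw:ephemeral_encryption", False)
    ) or _as_bool(image_props.get("hw_ephemeral_encryption", False))
    if not requested:
        return None
    flavor_format = extra_specs.get("hw:ephemeral_encryption_format")
    image_format = image_props.get("hw_ephemeral_encryption_format")
    if flavor_format and image_format and flavor_format != image_format:
        # Assumption: conflicting requests are rejected.
        raise ValueError("flavor and image request different formats")
    chosen = flavor_format or image_format or default_format
    if chosen not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported encryption format: {chosen}")
    return chosen


print(resolve_encryption_format({"hw:ephemeral_encryption": "true"}, {}))
```

The default_format argument plays the role of [ephemeral_storage_encryption]/default_format on the compute host.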

To enable snapshot and shelve of instances using ephemeral encryption, the UUID of the encryption secret stored in the key manager for the resultant image will be kept with the image as an image property:

hw_ephemeral_encryption_secret_uuid

The secret UUID is needed when creating an instance from an ephemeral encrypted snapshot or when unshelving an ephemeral encrypted instance.

Create a new key manager secret for every new encrypted disk image

The approach for disk image secrets is to never share secrets between different disk images and that each disk image has a unique secret. This is done to address both 1) the security implications and 2) the logistics of cleaning up secrets that are no longer in use.

For example:

Let’s say Instance A has 3 disks: one root disk, one ephemeral disk, and one swap disk. Each disk will have its own secret.

This table is intended to illustrate the way secrets are handled in various scenarios.

+--------------------+-------------+--------------+------------------------------------------------------+
| Instance or Image  | Disk        | Secret       | Notes                                                |
|                    |             | (passphrase) |                                                      |
+====================+=============+==============+======================================================+
| Instance A         | disk (root) | Secret 1     | Secret 1, 2, and 3 will be automatically deleted     |
|                    +-------------+--------------+ by Nova when Instance A is deleted and its disks are |
|                    | disk.eph0   | Secret 2     | destroyed                                            |
|                    +-------------+--------------+                                                      |
|                    | disk.swap   | Secret 3     |                                                      |
+--------------------+-------------+--------------+------------------------------------------------------+
| Image Z (snapshot) | disk (root) | Secret 4     | Secret 4 will *not* be automatically deleted and     |
| created from       |             | (new secret  | manual deletion will be needed if/when Image Z is    |
| Instance A         |             |  is created) | deleted from Glance                                  |
+--------------------+-------------+--------------+------------------------------------------------------+
| Instance B         | disk (root) | Secret 5     | Secret 5, 6, and 7 will be automatically deleted     |
| created from       +-------------+--------------+ by Nova when Instance B is deleted and its disks are |
| Image Z (snapshot) | disk.eph0   | Secret 6     | destroyed                                            |
|                    +-------------+--------------+                                                      |
|                    | disk.swap   | Secret 7     |                                                      |
+--------------------+-------------+--------------+------------------------------------------------------+
| Instance C         | disk (root) | Secret 8     | Secret 8, 9, and 10 will be automatically deleted    |
|                    +-------------+--------------+ by Nova when Instance C is deleted and its disks are |
|                    | disk.eph0   | Secret 9     | destroyed                                            |
|                    +-------------+--------------+                                                      |
|                    | disk.swap   | Secret 10    |                                                      |
+--------------------+-------------+--------------+------------------------------------------------------+
| Image Y (snapshot) | disk (root) | Secret 8     | Secret 8 is *retained* when Instance C is shelved in |
| created by shelve  |             |              | part to prevent the possibility of a change in       |
| of Instance C      |             |              | ownership of the root disk secret if, for example,   |
|                    |             |              | an admin user shelves a non-admin user's instance.   |
|                    |             |              | This approach could be avoided if there is some way  |
|                    |             |              | we could create a new secret using the instance's    |
|                    |             |              | user/project rather than the shelver's user/project  |
+--------------------+-------------+--------------+------------------------------------------------------+
| Rescue disk        | disk (root) | Secret 11    | Secret 11 is stashed in the instance's system        |
| created by rescue  |             | (new secret  | metadata with key                                    |
| of Instance A      |             |  is created) | ``rescue_disk_ephemeral_encryption_secret_uuid``.    |
|                    |             |              | This is done because a BDM record for the rescue     |
|                    |             |              | disk is not going to be persisted to the database.   |
+--------------------+-------------+--------------+------------------------------------------------------+

Snapshots of instances with ephemeral encryption

When an instance with ephemeral encryption is snapshotted, a new encryption secret is created. Its key manager secret UUID is kept in the image property hw_ephemeral_encryption_secret_uuid, and the image is uploaded to Glance.

When a new instance is created from an encrypted image, the image property hw_ephemeral_encryption_secret_uuid is passed down to the lower layers by storing it in the instance’s system metadata with key image_hw_ephemeral_encryption_secret_uuid. This is done because at the lower layers (where qemu-img convert is called, for example) we no longer have access to the image metadata and refactoring to pass image metadata to several lower layer methods, or similar, would be required otherwise.

Snapshots created by shelving instances with ephemeral encryption

When an instance with ephemeral encryption is shelved, the existing root disk encryption secret is retained and will be used to unshelve the instance later. This is done to prevent a potential change in ownership of the root disk encryption secret in a scenario where an admin user shelves a non-admin user’s instance, for example. If a new secret were created and owned by the admin user, the non-admin user who owns the instance would be unable to unshelve the instance.

This behavior could be avoided however if there is some way we could create a new encryption secret using the instance’s user and project rather than the shelver’s user and project. If that is possible, we would not need to reuse the encryption secret.

Rescue disk images created by rescuing instances with ephemeral encryption

When rescuing an instance and an encrypted rescue image is specified, the rescue image secret UUID from the image property will be stashed in the instance’s system metadata with key rescue_image_hw_ephemeral_encryption_secret_uuid to pass it down to the lower layers. This is kept separate from image_hw_ephemeral_encryption_secret_uuid, which refers to the encrypted image from which the instance was created. Another reason to keep it separate is to avoid confusion for those reading or working on the code.

A new encryption secret is created when the rescue disk is created and its UUID is stashed in the instance’s system metadata with key rescue_disk_ephemeral_encryption_secret_uuid. This is done because a block device mapping record for the rescue disk is not going to be persisted to the database.

The corresponding virt driver secret name pattern is <instance UUID>_rescue_disk and any existing secrets with that name are deleted by the virt driver when a new rescue is requested.

The new encryption secret for the rescue disk is deleted from the key manager and the virt driver secret is also deleted when the instance is unrescued.

Cleanup of ephemeral encryption secrets

Ephemeral encryption secrets are deleted from the key manager and the virt driver when the corresponding instance is deleted and its disks are destroyed. The approach is that encryption secrets are only deleted when the disks associated with them are destroyed.

Encryption secrets that are created when a snapshot is created are never deleted by Nova. It would only be acceptable to delete the secret if and when the snapshot image is deleted. Secrets whose images have been deleted from Glance must be deleted manually by the user or an admin.

Note

At the time of this writing, the newest Ceph release v17 (Quincy) does not support creating a cloned image with an encryption key different from its parent. For this reason, copy-on-write cloning will not be enabled for instances which have specified ephemeral encryption.

Creating a cloned image with an encryption key different from its parent should be supported in the next release of Ceph. When we are able to require a Ceph version >= v18, copy-on-write cloning with ephemeral encryption can be enabled. See https://github.com/ceph/ceph/commit/1d3de19 for reference.

BlockDeviceMapping changes

The BlockDeviceMapping object will be extended to include the following fields encapsulating some of the above information per ephemeral disk within the instance:

encrypted: A simple boolean to indicate if the block device is encrypted. This will initially only be populated when ephemeral encryption is used but could easily be used for encrypted volumes as well in the future.
encryption_secret_uuid: As the name suggests this will contain the UUID of the associated encryption secret for the disk. The type of secret used here will be specific to the encryption format and virt driver used. It should not be assumed that this will always be a symmetric key, as is currently the case with all encrypted volumes provided by Cinder. For example, for luks based ephemeral storage this secret will be a passphrase.
encryption_format: A new BlockDeviceEncryptionFormatType enum and associated BlockDeviceEncryptionFormatTypeField field listing the encryption format. The available options will be kept in line with the constants currently provided by os-brick, and the two could potentially be merged in the future if both can share these types and fields somehow.
encryption_options: A simple unversioned dict of strings containing encryption options specific to the virt driver implementation, underlying hypervisor and format being used.
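As a rough illustration of how the four fields relate to each other, the sketch below models them with a plain Python dataclass. It is not the Nova versioned object: the class names `EncryptionFormat` and `EncryptedBDMFields`, the enum values, and the `validate` helper are all assumptions made for this example.

```python
import enum
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


class EncryptionFormat(str, enum.Enum):
    # Assumed values, mirroring the trait names used in this spec.
    PLAIN = "plain"
    LUKS = "luks"
    LUKSV2 = "luksv2"


@dataclass
class EncryptedBDMFields:
    """Hypothetical stand-in for the proposed BlockDeviceMapping fields."""
    encrypted: bool = False
    encryption_secret_uuid: Optional[str] = None
    encryption_format: Optional[EncryptionFormat] = None
    # Unversioned dict of strings; unused in the first pass of the feature.
    encryption_options: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject an encrypted device that lacks a secret UUID or a format."""
        if not self.encrypted:
            return
        if self.encryption_secret_uuid is None or self.encryption_format is None:
            raise ValueError("encrypted devices need a secret UUID and a format")
        # Raises ValueError if the string is not a well-formed UUID.
        uuid.UUID(self.encryption_secret_uuid)
```

Keeping the secret UUID per block device, rather than per instance, is what allows each disk to have its own secret as shown in the table above.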

Note

The encryption_options field will be unused and not exposed to end users initially because of the security and upgrade implications around it. For the first pass, sensible defaults for the cipher algorithm, cipher mode, and initialization vector generator algorithm will be hard-coded instead.

Encryption options could be exposed to end users in the future when a proper design which addresses security and handles all upgrade scenarios is developed.

Populate ephemeral encryption BlockDeviceMapping attributes during build

The encryption_format attribute will also take its value from the image or flavor if provided. Any differences or conflicts between the image and flavor for this will result in a 409 Conflict error being raised by the API.
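The conflict rule can be sketched as a small resolver. This is a minimal sketch under stated assumptions: the function name `resolve_encryption_format` and the exception class `EncryptionFormatConflict` are hypothetical, and the mapping of the exception to an HTTP 409 response is left to the API layer.

```python
from typing import Optional


class EncryptionFormatConflict(Exception):
    """Hypothetical error that an API layer could translate to HTTP 409."""
    status_code = 409


def resolve_encryption_format(
    flavor_format: Optional[str], image_format: Optional[str]
) -> Optional[str]:
    """Return the format to store on the BlockDeviceMapping.

    Either source may supply the value. If both supply one, they must agree.
    """
    if flavor_format and image_format and flavor_format != image_format:
        raise EncryptionFormatConflict(
            f"flavor requests {flavor_format!r} but image requests "
            f"{image_format!r}"
        )
    return flavor_format or image_format
```

For example, `resolve_encryption_format("luks", None)` returns `"luks"`, while passing `"luks"` and `"plain"` raises the conflict error.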

Use `COMPUTE_EPHEMERAL_ENCRYPTION` compatibility traits

COMPUTE_EPHEMERAL_ENCRYPTION_LUKS
COMPUTE_EPHEMERAL_ENCRYPTION_LUKSV2
COMPUTE_EPHEMERAL_ENCRYPTION_PLAIN
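One way a scheduler pre-filter could turn a request into required traits is sketched below. The helper name `required_traits` is hypothetical, and the use of a generic `COMPUTE_EPHEMERAL_ENCRYPTION` trait alongside the format-specific ones is an assumption of this sketch rather than something stated in the list above.

```python
from typing import Optional, Set

# Assumed generic trait; the format-specific names are derived from it.
BASE_TRAIT = "COMPUTE_EPHEMERAL_ENCRYPTION"
KNOWN_FORMATS = {"luks", "luksv2", "plain"}


def required_traits(
    encryption_requested: bool, encryption_format: Optional[str] = None
) -> Set[str]:
    """Build the set of traits a compute host must report for a request."""
    if not encryption_requested:
        # Fail fast: nothing to add when encryption was not requested.
        return set()
    traits = {BASE_TRAIT}
    if encryption_format is not None:
        fmt = encryption_format.lower()
        if fmt not in KNOWN_FORMATS:
            raise ValueError(f"unknown encryption format: {encryption_format}")
        traits.add(f"{BASE_TRAIT}_{fmt.upper()}")
    return traits
```

Returning an empty set early keeps the cost negligible for requests that do not ask for ephemeral encryption.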

Introduce an ephemeral encryption request pre-filter

Expose ephemeral encryption attributes via block_device_info

root_device_name: The root device path used by the instance.
ephemerals: A list of DriverEphemeralBlockDevice dict objects detailing the ephemeral disks attached to the instance. Note this does not include the initial image based disk used by the instance that is classified as an ephemeral disk in terms of the ephemeral encryption feature.
block_device_mapping: A list of DriverVol*BlockDevice dict objects detailing the volume based disks attached to the instance.
swap: An optional DriverSwapBlockDevice dict object detailing the swap device.

For example:

{
    "root_device_name": "/dev/vda",
    "ephemerals": [
        {
            "guest_format": null,
            "device_name": "/dev/vdb",
            "device_type": "disk",
            "size": 1,
            "disk_bus": "virtio"
        }
    ],
    "block_device_mapping": [],
    "swap": {
        "swap_size": 1,
        "device_name": "/dev/vdc",
        "disk_bus": "virtio"
    }
}

Report that a disk is encrypted at rest through the metadata API

Extend the metadata API so that users can confirm that their ephemeral storage is encrypted at rest through the metadata API, accessible from within their instance.

{
    "devices": [
        {
            "type": "nic",
            "bus": "pci",
            "address": "0000:00:02.0",
            "mac": "00:11:22:33:44:55",
            "tags": ["trusted"]
        },
        {
            "type": "disk",
            "bus": "virtio",
            "address": "0:0",
            "serial": "12352423",
            "path": "/dev/vda",
            "encrypted": "True"
        },
        {
            "type": "disk",
            "bus": "ide",
            "address": "0:0",
            "serial": "disk-vol-2352423",
            "path": "/dev/sda",
            "tags": ["baz"]
        }
    ]
}

This should also be extended to cover disks provided by encrypted volumes but this is obviously out of scope for this implementation.
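From inside the guest, a user could check the device list with a few lines of Python. The sketch below only parses an already-fetched document. The function name `encrypted_disk_paths` is hypothetical, and because the example above shows the flag as the string "True", the sketch accepts both a boolean and a string value as an assumption.

```python
import json
from typing import List


def encrypted_disk_paths(metadata: dict) -> List[str]:
    """Return the paths of disk devices reported as encrypted at rest."""
    paths = []
    for device in metadata.get("devices", []):
        if device.get("type") != "disk":
            continue
        flag = device.get("encrypted", False)
        # Accept a JSON boolean or the string form shown in the example.
        if flag is True or str(flag).lower() == "true":
            paths.append(device.get("path"))
    return paths


sample = json.loads("""
{
    "devices": [
        {"type": "nic", "bus": "pci", "mac": "00:11:22:33:44:55"},
        {"type": "disk", "bus": "virtio", "path": "/dev/vda",
         "encrypted": "True"},
        {"type": "disk", "bus": "ide", "path": "/dev/sda"}
    ]
}
""")
print(encrypted_disk_paths(sample))
```

Devices without the `encrypted` key are treated as not encrypted, which matches the second disk in the example above.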

Block resize between flavors with different hw:ephemeral_encryption settings

Provide a migration path from the legacy implementation

New nova-manage and nova-status commands will be introduced to migrate any instances using the legacy libvirt virt driver implementation ahead of the removal of this in a future release.

The nova-manage command will ensure that any existing instances with ephemeral_key_uuid set will have their associated BlockDeviceMapping records updated to reference said secret key, the plain encryption format and configured options on the host before clearing ephemeral_key_uuid.

The nova-status command will simply report on the existence of any instances with ephemeral_key_uuid set that do not have the corresponding BlockDeviceMapping attributes enabled etc.

Deprecate the now legacy implementation

The legacy implementation within the libvirt virt driver will be deprecated for removal in a future release once the ability to migrate is in place.

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

See above for the various flavor extra spec, image property, BlockDeviceMapping and DriverBlockDevice object changes.

REST API impact

Flavor extra specs and image property validation will be introduced for any provided ephemeral encryption options.
Attempts to resize between flavors that differ in their ephemeral encryption options will be rejected.
Attempts to rebuild between images that differ in their ephemeral encryption options will be allowed.
The metadata API will be changed to allow users to determine if their ephemeral storage is encrypted as discussed above.

Security impact

This should hopefully be positive given the unique secret per disk and user visible choice regarding how their ephemeral storage is encrypted at rest.

Note

Internal base images stored locally in Nova will not be encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

The additional pre-filter will add a small amount of overhead when scheduling instances but this should fail fast if ephemeral encryption is not requested through the image or flavor.

The performance impact of increased use of ephemeral storage encryption by instances is left to be discussed in the virt driver specific specs as this will vary between hypervisors.

Other deployer impact

N/A

Developer impact

Virt driver developers will be able to indicate support for specific ephemeral storage encryption formats using the newly introduced compute compatibility traits.

Upgrade impact

The compute traits should ensure that requests to schedule instances using ephemeral storage encryption with mixed computes (N-1 and N) will work during a rolling upgrade.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: lyarwood

Feature Liaison

Feature liaison:: melwitt

Work Items

Introduce hw_ephemeral_encryption* image properties and hw:ephemeral_encryption flavor extra specs.
Introduce new encrypted, encryption_secret_uuid, encryption_format and encryption_options attributes to the BlockDeviceMapping object.
Wire up the new BlockDeviceMapping object attributes through the Driver*BlockDevice layer and block_device_info dict.
Report ephemeral storage encryption through the metadata API.
Introduce new nova-manage and nova-status commands to allow existing users to migrate to this new implementation. This should however be blocked outside of testing until a virt driver implementation is landed.
Validate all of the above in functional tests ahead of any virt driver implementation landing.

Dependencies

None

Testing

At present without a virt driver implementation this will be tested entirely within our unit and functional test suites.

Once a virt driver implementation is available additional integration tests in Tempest and whitebox tests can be written.

Testing of the migration path from the legacy implementation will require an additional grenade job but this will require the libvirt virt driver implementation to be completed first.

Documentation Impact

The new host configurables, flavor extra specs and image properties should be documented.
New user documentation should be written covering the overall use of the feature from a Nova point of view.
Reference documentation around BlockDeviceMapping objects etc should be updated to make note of the new encryption attributes.

References

History

Optional section intended to be used each time the spec is updated to describe new design, API or any database schema updated. Useful to let reader understand what’s happened along the time.

Revisions
Release Name	Description
Wallaby	Introduced
Xena	Reproposed
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed
2023.2 Bobcat	Reproposed
2024.1 Caracal	Reproposed

Per Process Healthcheck endpoints

Tue, 03 Oct 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks

Problem description

Monitoring the health of a Nova service today requires experience to develop and implement a series of external heuristics that infer the state of the service binaries.

The existing Oslo middleware does not address this problem statement because:

It can only be used by the API and metadata binaries
The middleware does not tell you the service is alive if it’s hosted by a WSGI server like Apache, since the middleware is executed independently from the WSGI application, i.e. the middleware can pass while the nova-api can’t connect to the DB and is otherwise broken.
The Oslo middleware in detailed mode leaks information about the host kernel, Python version and hostname, which can be used to determine if the host is vulnerable to CVEs. This means it should never be exposed to the Internet. e.g.

platform: 'Linux-5.15.2-xanmod1-tt-x86_64-with-glibc2.2.5',
python_version: '3.8.12 (default, Aug 30 2021, 16:42:10) \n[GCC 10.3.0]'

Use Cases

As an operator, I want a simple REST endpoint I can consume to know if a Nova process is healthy.

As an operator I want this health check to not impact the performance of the service so it can be queried frequently at short intervals.

As an operator I would like to be able to use health-check of the Nova API and metadata services to manage the membership of endpoints in my load-balancer or reverse proxy automatically.

Proposed change

Definitions

TTL: The time interval for which a health check item is valid.

pass: all health indicators are passing and their TTLs have not expired.

warn: any health indicator has an expired TTL, or there is a partial transient failure.

fail: any health indicator is reporting an error or all TTLs are expired.

Warn vs fail

Services in the warn state are still considered healthy in most cases but they may be about to fail soon or be partially degraded.
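The definitions above can be read as a small aggregation rule. The following is a minimal sketch of that rule, not the proposed health check manager: the `Check` dataclass and the `summarize` function are hypothetical names, and treating an empty set of checks as "pass" is an assumption of the sketch.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Check:
    status: str        # "pass", "warn" or "fail"
    updated_at: float  # seconds since the epoch of the last update
    ttl: float         # seconds for which the result stays valid

    def expired(self, now: float) -> bool:
        return now - self.updated_at > self.ttl


def summarize(checks: Dict[str, Check], now: Optional[float] = None) -> str:
    """Collapse individual health indicators into pass, warn or fail."""
    now = time.time() if now is None else now
    if not checks:
        return "pass"  # assumption: nothing recorded yet counts as healthy
    expired = [check.expired(now) for check in checks.values()]
    statuses = [check.status for check in checks.values()]
    # fail: any indicator reports an error, or every TTL has expired.
    if "fail" in statuses or all(expired):
        return "fail"
    # warn: some TTL has expired, or an indicator reports a partial failure.
    if any(expired) or "warn" in statuses:
        return "warn"
    return "pass"
```

Checking the fail condition first matters: a service with one failing indicator and one stale indicator should report fail, not warn.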

Code changes

A new top-level Nova health check module will be created to encapsulate the common code and data structure required to implement this feature.

A new health check manager class will be introduced which will maintain the health-check state and all functions related to retrieving, updating and summarizing that state.

The health check manager will be responsible for creating the health check endpoint when it is enabled in the nova.conf and exposing the health check over HTTP.

e.g.

@healthcheck('database', [SQLAlchemyError])
def my_db_func(self):
    pass

@healthcheck('database', [SQLAlchemyError])
def my_other_db_func(self):
    pass

By default all exceptions will be caught and re-raised by the decorator.
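A decorator with this behavior could look roughly like the sketch below. It is illustrative only: the `HealthCheckManager` class, its `record` method, and the module-level `MANAGER` instance are hypothetical stand-ins for the manager described above.

```python
import functools
from typing import Dict, Iterable, Optional, Type


class HealthCheckManager:
    """Hypothetical in-memory store of the latest result per check name."""

    def __init__(self) -> None:
        self.checks: Dict[str, dict] = {}

    def record(self, name: str, status: str, output: Optional[str] = None):
        self.checks[name] = {"status": status, "output": output}


MANAGER = HealthCheckManager()


def healthcheck(
    name: str,
    exceptions: Optional[Iterable[Type[BaseException]]] = None,
    manager: HealthCheckManager = MANAGER,
):
    """Record pass or fail for `name` around each call of the function."""
    # With no explicit list, every Exception is caught and re-raised.
    watched = tuple(exceptions) if exceptions else (Exception,)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except watched as exc:
                manager.record(name, "fail", str(exc))
                raise  # the caller still sees the original exception
            manager.record(name, "pass")
            return result
        return wrapper
    return decorator
```

Because the exception is re-raised unchanged, decorating a function does not alter its error handling contract. It only adds the side effect of updating the health state.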

If implemented, the etag will be incremented whenever the service state changes and will reset to 0 when the service is restarted.

Example output

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 200 OK
Content-Type: application/health+json
Cache-Control: max-age=3600
Connection: close

{
    "status": "pass",
    "version": "1.0",
    "serviceId": "e3c22423-cd7a-47dc-b6e9-e18d1a8b3bdf",
    "description": "nova-api",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "api_db": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"}
    }
}

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 503 Service Unavailable
Content-Type: application/health+json
Cache-Control: no-cache
Connection: close

{
    "status": "fail",
    "version": "1.0",
    "serviceId": "0a47dceb-11b1-4d94-8b9c-927d998be320",
    "description": "nova-compute",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "hypervisor": {
            "status": "fail", "time": "2021-12-17T16:05:55+00:00",
            "output": "Libvirt Error: ..."
        }
    }
}

Alternatives

Data model impact

The Nova context object will be extended to store a reference to the health check manager.

REST API impact

None

While this change will expose a new REST API endpoint it will not be part of the existing Nova API.

Security impact

Notifications impact

None

Other end user impact

None

At present, it is not planned to extend the Nova client or the unified client to query the new endpoint. cURL, socat, or any other UNIX socket or TCP HTTP client can be used to invoke the endpoint.

Performance Impact

None

Other deployer impact

A new config section healthcheck will be added in the nova.conf

A uri config option will be introduced to enable the health check functionality. The config option will be a string opt that supports a comma-separated list of URIs with the following format

uri=<scheme>://[host:port|path],<scheme>://[host:port|path]

e.g.

[healthcheck]
uri=tcp://localhost:424242

[healthcheck]
uri=unix:///run/nova/nova-compute.sock

[healthcheck]
uri=tcp://localhost:424242,unix:///run/nova/nova-compute.sock
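Parsing the comma-separated value could be done along the following lines. This is a sketch under stated assumptions: the function name `parse_healthcheck_uris` and the returned tuple shape are hypothetical, and only the two schemes shown above (tcp and unix) are accepted.

```python
from typing import List, Tuple, Union
from urllib.parse import urlsplit

Endpoint = Tuple[str, Union[str, Tuple[str, int]]]


def parse_healthcheck_uris(value: str) -> List[Endpoint]:
    """Split the uri option into ("tcp", (host, port)) or ("unix", path)."""
    endpoints: List[Endpoint] = []
    for raw in value.split(","):
        raw = raw.strip()
        if not raw:
            continue
        parts = urlsplit(raw)
        if parts.scheme == "tcp":
            # Accessing .port raises ValueError outside the range 0-65535.
            port = parts.port
            if not parts.hostname or port is None:
                raise ValueError(f"tcp URI needs host and port: {raw}")
            endpoints.append(("tcp", (parts.hostname, port)))
        elif parts.scheme == "unix":
            if not parts.path:
                raise ValueError(f"unix URI needs a socket path: {raw}")
            endpoints.append(("unix", parts.path))
        else:
            raise ValueError(f"unsupported scheme in: {raw}")
    return endpoints
```

Note that TCP port numbers must fall in the range 0-65535, so a value such as 424242 would be rejected by this sketch. A real deployment would need a port within that range.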

Developer impact

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: sean-k-mooney
Other contributors:: melwitt

Feature Liaison

Feature liaison:: sean-k-mooney

Work Items

Add new module
Introduce decorator
Extend context object to store a reference to health check manager
Add config options
Expose TCP endpoint
Expose UNIX socket endpoint support
Add docs

Dependencies

None

Testing

This can be tested entirely with unit and functional tests, however, Devstack will be extended to expose the endpoint and use it to determine whether the Nova services have started.

Documentation Impact

The config options will be documented in the config reference and a release note will be added for the feature.

References

Yoga PTG topic:
https://etherpad.opendev.org/p/r.e70aa851abf8644c29c8abe4bce32b81#L415

History

Revisions
Release Name	Description
Yoga	Introduced
2023.1 Antelope	Reproposed
2024.1 Caracal	Reproposed

Ironic Shards

Wed, 27 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/ironic-shards

Note

The series was implemented but eventually reverted due to a bug that was found late. It should be merged again in the next release, i.e. 2024.1. That said, we kept the deprecation of the [ironic]peer_list config option, which is explained below in Config changes and Deprecations.

Problem description

To help with this, Nova has attempted to dynamically spread ironic nodes between a set of nova-compute peers. While this works some of the time, there are some major limitations:

when one nova-compute is down, only unassigned ironic nodes can move to another nova-compute service
i.e. when one nova-compute is down, all ironic nodes with nova instances associated with the down nova-compute service are unable to be managed, e.g. reboot will fail
moreover, when the old nova-compute comes back up, which might take some time, there are lots of bugs as the hash ring slowly rebalances. In part because every nova-compute fetches all nodes, in a large enough cloud, this can take over 24 hours.

This spec is about tweaking the way we shard Ironic compute nodes. We need to stop violating deep assumptions in the compute manager code by moving to more static ironic node partitions.

Use Cases

Any users of the ironic driver that have more than one nova-compute service per conductor group should move to an active-passive failover mode.

The new static sharding will be of particular interest for clouds with ironic conductor groups that are greater than around 1000 baremetal nodes.

Proposed change

We add a new configuration option:

[ironic] shard_key

When we look up a specific ironic node via a node uuid or instance uuid, we should not restrict that to either the shard key or conductor group.

Config changes and Deprecations

We will deprecate the use of the peer_list. We should log a warning when the hash ring is being used, i.e. when it has more than one member added to the hash ring.

In addition, we need the logic that tries to move Compute Nodes to never work unless the peer_list is larger than one. More details in the data model impact section.

nova-manage move ironic node

We will create a new nova-manage command:

nova-manage ironic-compute-node-move <ironic-node-uuid> \
    --service <destination-service>

This command will do the following:

Find the ComputeNode object for this ironic-node-uuid
Error if the ComputeNode type does not match the ironic driver.
Find the related Service object for the above ComputeNode (i.e. the host)
Error if the service object is not reported as down, and has not also been put into maintenance. We do not require forced down, because we might only be moving a subset of nodes associated with this nova-compute service.
Check the Service object for the destination service host exists
Find all non-deleted instances for this (host,node)
Error if there is more than 1 non-deleted instance found. It is OK if we find zero or 1 instances.
In one DB transaction: move the ComputeNode object to the destination service host and move the Instance (if there is one) to the destination service host
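The checks listed above can be sketched with plain in-memory objects. This is not the nova-manage implementation: the dataclasses, the `MoveError` exception, and the `move_ironic_node` function are hypothetical, the source service check is read here as requiring the service to be both down and in maintenance, and the single DB transaction is replaced by simple attribute updates.

```python
from dataclasses import dataclass
from typing import Dict, List


class MoveError(Exception):
    """Hypothetical error raised when a safety check fails."""


@dataclass
class Service:
    host: str
    is_down: bool
    in_maintenance: bool


@dataclass
class ComputeNode:
    uuid: str
    hypervisor_type: str
    host: str


@dataclass
class Instance:
    uuid: str
    host: str
    node: str
    deleted: bool = False


def move_ironic_node(
    node: ComputeNode,
    services: Dict[str, Service],
    instances: List[Instance],
    destination: str,
) -> List[Instance]:
    """Move a compute node, and its instance if any, to another service."""
    if node.hypervisor_type != "ironic":
        raise MoveError("compute node is not managed by the ironic driver")
    source = services[node.host]
    if not (source.is_down and source.in_maintenance):
        raise MoveError("source service must be down and in maintenance")
    if destination not in services:
        raise MoveError("destination service does not exist")
    found = [
        inst for inst in instances
        if not inst.deleted and inst.host == node.host and inst.node == node.uuid
    ]
    if len(found) > 1:
        raise MoveError("more than one non-deleted instance on this node")
    # Stand-in for the single DB transaction described above.
    node.host = destination
    for inst in found:
        inst.host = destination
    return found
```

All checks run before any state is changed, so a failed check leaves both the compute node and the instance untouched.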

moving from a peer_list to a single nova-compute
moving from peer_list to shard_key, while keeping multiple nova-compute processes (for a single conductor group)

Migrate from peer_list to single nova-compute

The process would look something like this:

ironic and nova both default to an empty shard_key, such that all ironic nodes are in the same default shard
start a new nova-compute service running the ironic driver, ideally with a synthetic value for [DEFAULT]host, e.g. ironic. This will log warnings about the need to use the nova-compute migration tool before being able to manage any nodes
stop all existing nova-compute services
mark them as forced-down via the API
Now loop around all ironic nodes and call this, assuming your nova-compute service has its host value of just ironic: nova-manage ironic-compute-node-move <uuid> --service ironic

The periodic tasks in the new nova-compute service will gradually pick up the new ComputeNodes, and will start being able to receive commands such as reboot for all the moved instances.

While you could start the new nova-compute service after having migrated all the ironic compute nodes, that would lead to higher downtime during the migration.

Migrate from peer_list to shard_key

The process to move from the hash key based peer_list to the static shard_key from ironic is very similar to the above process:

Set the shard_key on all your ironic nodes, such that you can spread the nodes out between your nova-compute processes,
Start your new nova compute processes, one for each shard_key, possibly setting a synthetic [DEFAULT]host value that matches the my_shard_key.
Shut down all the older nova-compute processes with [ironic]peer_list set
Mark those older services as in maintenance via the Nova API
For each shard_key in Ironic, work out which service host you have mapped each one to above, then run this for each ironic node uuid in the shard: nova-manage ironic-compute-node-move <uuid> --service my_shard_key
Delete the old services via the Nova API, now there are no instances or compute nodes on those services

While you could start the new nova-compute services after the migration, that would lead to a slightly longer downtime.
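The per-node loop in both migration procedures can be scripted. The sketch below only builds the command lines, using the command form given earlier in this spec. The function name `build_move_commands` and the placeholder node identifiers are assumptions, and fetching the real node UUIDs from Ironic is left out.

```python
import shlex
from typing import Iterable, List


def build_move_commands(
    node_uuids: Iterable[str], destination_service: str
) -> List[List[str]]:
    """Return one nova-manage argument list per ironic node."""
    return [
        [
            "nova-manage",
            "ironic-compute-node-move",
            node_uuid,
            "--service",
            destination_service,
        ]
        for node_uuid in node_uuids
    ]


# Placeholder identifiers; real values would come from the Ironic node list.
for argv in build_move_commands(["<uuid-1>", "<uuid-2>"], "ironic"):
    print(shlex.join(argv))
```

Building argument lists rather than shell strings means each command can be passed directly to `subprocess.run` without quoting concerns.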

Adding new compute nodes

In general, there is no change when adding nodes into existing shards.

Similarly, you can add a new nova-compute process for a new shard and then start to fill that up with nodes.

Move an ironic node between shards

When removing nodes from ironic at the end of their life, or adding large numbers of new nodes, you may need to rebalance the shards.

Shutdown the affected nova-compute process
Put the nova-compute services into maintenance
In Ironic API update the shard key on the Ironic node
Now move each ironic node to the correct new nova-compute process for the shard key it was moved into: nova-manage ironic-compute-node-move <uuid> --service my_shard_key
Now unset maintenance mode for the nova-compute, and start that service back up

Move shards between nova-compute services

To move a shard between nova-compute services, you need to replace the nova-compute process with a new one:

ensure the destination nova-compute is configured with the shard you want to move, and is running
stop the nova-compute process currently serving the shard
force-down the service via the API
for each ironic node uuid in the shard call nova-manage to move it to the new nova-compute process

Alternatives

We could consider a list of shard keys, rather than a single shard key per nova-compute. But for this first version, we have chosen the simpler path, that appears to have few limitations.

when nova-compute breaks, it’s usually the hypervisor hardware that has broken, which includes all the nova servers running on that.
all locking and management of a nova server object is done by the currently assigned nova-compute node, and this is only ever changed by explicit move operations like resize, migrate, live-migration and evacuate. As such we can use simple local locks to ensure concurrent operations don’t conflict, along with DB state checking.

Data model impact

A key thing we need to ensure is that ComputeNode objects are only automatically moved between service objects when in legacy hash ring mode. Currently, this only happens for unassigned ComputeNodes.

This is all closely related to this spec on robustifying the Compute Node and Service object relationship: https://review.opendev.org/c/openstack/nova-specs/+/853837

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

Users will experience a more reliable Ironic and Nova integration.

Performance Impact

It should help users more easily support large ironic deployments integrated with Nova.

Other deployer impact

We will rename the “partition_key” configuration to be explicitly “conductor_group”.

We will deprecate the peer list key. When we start up and see anything set, we emit a warning about the bugs in using this legacy auto sharding, and recommend moving to the explicit sharding.

There is a new shard_key config, as described above.

There is a new nova-manage CLI command to move Ironic compute nodes on forced-down nova-compute services to a new one.

Developer impact

None

Upgrade impact

For those currently using peer_list, we need to document how they can move to the new sharding approach.

Implementation

Assignee(s)

Primary assignee:: JayF
Other contributors:: johnthetubaguy

Feature Liaison

Feature liaison: None

Work Items

rename conductor group partition key config
deprecate peer_list config, with warning log messages
add compute node move and delete protections, when peer_list not used
add new shard_key config, limit ironic node list using shard_key
add nova-manage tool to move ironic nodes between compute services
document operational processes around above nova-manage tool

Dependencies

The deprecation of the peer list can happen right away.

But the new sharding depends on the Ironic shard key getting added: https://review.opendev.org/c/openstack/ironic-specs/+/861803

Ideally we add this into Nova after robustify compute node has landed: https://review.opendev.org/c/openstack/nova/+/842478

Testing

We need some functional tests for the nova-manage command to ensure all of the safety guards work as expected.

Documentation Impact

A lot of documentation is needed for the Ironic driver on the operational procedures around the shard_key.

References

None

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced
2023.2 Bobcat	Re-proposed

Use extend volume completion action

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/assisted-volume-extend

Problem description

In this case, only the QEMU process holding the lock can resize the volume, which can be triggered through the QEMU monitor command block-resize.

There is currently no adequate way for Cinder to use this feature, so the NFS, NetApp NFS, Powerstore NFS, and Quobyte volume drivers all disable extending attached volumes.

Use Cases

Proposed change

Currently, Cinder will send the volume-extended external server event to Nova only after it has finalized the extend operation and reset the volume status from extending back to in-use.

Compute Agent

Nova’s compute agent will use the volume status to differentiate between the two behaviors when handling volume-extended events:

If the volume status is extending, then it will attempt to read extend_new_size from the volume’s metadata and use this value as the new size of the volume, instead of the volume size field.

After successfully extending the volume, it will call the extend volume completion action of the volume, with "error": false.

If anything goes wrong, including extend_new_size being missing from the metadata, or being smaller than the current size of the volume, it will log the error and call the os-extend_volume_completion action with "error": true, so Cinder can roll back the operation.
For any other volume status, including in-use, the event will be handled as before.
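
The branching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Nova's implementation: the function name handle_volume_extended, the plain-dict volume representation, and the extend callable are hypothetical. Only the extending status, the extend_new_size metadata key, and the "error" flag come from this spec.

```python
import logging

LOG = logging.getLogger(__name__)


def handle_volume_extended(volume, current_size, extend):
    """Return the completion payload to send to Cinder, or None.

    volume: hypothetical dict with 'status', 'size' and 'metadata' keys.
    current_size: size currently known for the attached volume.
    extend: hypothetical callable taking the new size; raises on failure.
    """
    if volume["status"] != "extending":
        # Any other status, including in-use: handle the event as before,
        # using the volume size field and sending no feedback to Cinder.
        extend(volume["size"])
        return None

    try:
        # A missing key raises KeyError, which is handled like any other error.
        new_size = int(volume["metadata"]["extend_new_size"])
        if new_size < current_size:
            raise ValueError("extend_new_size is smaller than the current size")
        extend(new_size)
    except Exception:
        LOG.exception("Extending the volume failed")
        return {"error": True}
    return {"error": False}
```

The returned dictionary stands in for the arguments of the os-extend_volume_completion action; returning None models the legacy path, where no completion action is called.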

API

Nova’s API will introduce a new microversion, so that Cinder can make sure the new behavior is available, before leaving an extend operation unfinished.

Alternatives

A previous change tried to use the volume-extended external server event to support online extend for the NFS driver [1], but did not rely on feedback from Nova to Cinder at all. Instead, it would just set the new size of the volume, change the status back to in-use, notify Nova, and hope for the best.

If anything went wrong on Nova’s side, this would still result in a volume state indicating that the operation was successful, which is not acceptable.
A previous version of this spec proposed a new synchronous API in Nova [2], that would directly call CompVirtAPI.extend_image of the nova-compute instance managing the guest that a volume was attached to. This API would provide a single mechanism to trigger the resize operation, communicate the new size to Nova, and get feedback on the success of the operation.

The problem with a synchronous API is that RPC and API timeouts limit the maximum time an extend operation can take. For QEMU, this seemed to be acceptable, because storage preallocation is hard disabled for the block-resize command, and because all currently plausible file systems support sparse file operations.

However, this may not be true for other volume or virt drivers that might require this API in the future. It would also break with the established pattern of asynchronous coordination between Nova and Cinder, which includes the assisted snapshot and volume migration features.
Following this pattern, we could make the proposed API asynchronous and use a new callback in Cinder, similar to Nova’s os-assisted-volume-snapshots API, which uses the os-update_snapshot_status snapshot action to provide feedback to Cinder.

The function of the new Nova API would then just be to trigger the operation and to communicate the new size. The question is then whether that warrants adding a new API to Nova, since there are existing mechanisms that could be used for either.
The existing mechanism for triggering the extend operation in Nova is of course the volume-extended external server event. Using it for this purpose, as this spec proposes, requires the target size to be transferred separately, because external server events only have a single text field that is freely usable, which for volume-extended is already used for the volume ID.

Besides storing it in the admin metadata, as [3] and this spec propose, there is also the option of updating the size field of the volume, as [1] was essentially doing.

This would require the volume size field to be reset on a failure. If an error response from Nova was lost, the volume would just keep the new size. We would need to extend os-reset_status to allow a size reset, or something similar to clean up volumes like this. This would be possible, but updating the size field only after the volume was successfully extended seems like a cleaner solution.
We could also extend the external server event API to accept additional data for events, and use this to communicate the new size to Nova.

This option was judged favorably by reviewers on the previous version of this spec, [2], but it would be a more complex change to the Nova API.

However, if additional data fields become available in a future version of the external server event API, it would be a relatively minor change to use this instead of volume metadata.

Data model impact

None

REST API impact

The behavior of the external server event API will change.

If Nova receives a volume-extended event, and the referenced volume has status of extending, Nova will look for the extend_new_size key in the volume metadata, and use this instead of the volume size field as the target size to update the block device mapping and to pass to the virt driver’s extend_volume method.

Nova will also attempt to call Cinder’s new os-extend_volume_completion volume action proposed in [3] to let Cinder know if the operation was successful or not.
Otherwise, the API will behave as before.

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

Checking the target compute service version allows the API to handle rolling upgrades gracefully.

Implementation

Assignee(s)

Primary assignee:: kgube
Other contributors:: None

Feature Liaison

Feature liaison:: None yet

Work Items

Update the external server event API to check the target compute service version for volume-extended events.
Update the ComputeVirtAPI.extend_volume method to follow the behavior outlined in Compute Agent.
Add unit tests.
Adapt NFS job in the Nova gate to validate online extend.

Dependencies

The extend volume completion action [3]

Testing

We should test that the os-extend_volume_completion action gets called correctly in all possible error and success conditions if a volume has extending status.

We should test the case that the call to os-extend_volume_completion fails.

We also need to test that volume-extended continues to be handled correctly for volumes not in extending status.

Documentation Impact

The new behavior of the volume-extended event should be added to the documentation of the external server event API.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Accepted
2023.2 Bobcat	Reproposed

VirtIO PackedRing Configuration support

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/virtio-packedring-configuration-support

This blueprint proposes to expose the LibVirt packed option that allows a guest to negotiate support for the VirtIO packed-ring feature. This blueprint is used to solicit community’s input.

Problem description

A VM using a virtio-net paravirtual network device uses virtual queues (virtqs) to send and receive data between the virtio-net driver and the virtual or physical backend. The VirtIO standard originally defined a single type of virtq, called the split-ring queue. The latest edition of the standard (v1.1) adds a different type of virtq, called the packed-ring queue. Its different layout of queue elements makes it possible to increase performance in both virtual and physical backends.

QEMU added support for packed virtqs in v4.2 and LibVirt in v6.3. QEMU and LibVirt support the packed-ring virtqs via the packed option. However, note that this option does not force the VM to use the packed-ring virtq. It acts as a mask, allowing the backend to advertise the support when set. The driver in the VM is still responsible for choosing the layout of virtqs.

Use Cases

As an operator, I want to benefit from the increase in the virtio-net performance, by using a more efficient virtq structure.

Proposed change

Add the hw_virtio_packed_ring image property and the hw:virtio_packed_ring flavor extra spec. Users will be able to control the packed virtqueue feature, and disable it if desired.

hw_virtio_packed_ring=true|false (default false)
hw:virtio_packed_ring=true|false (default false)
Provide a new COMPUTE_NET_VIRTIO_PACKED compute capability trait. This trait can be required or forbidden by the user. The nova-compute agent will automatically set this trait on the resource provider summary if the libvirt version is higher than 6.3.
This spec will update the scheduling process. ALL_REQUEST_FILTERS will be extended with a new filter, packed_virtqueue_filter. It will add the new trait to the RequestSpec if the image property or flavor extra spec is enabled, to avoid migration to a node without packed virtqueue support.
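
The request filter described above could look roughly like the following sketch. It is illustrative only: the function signature, the plain dict and set arguments, and the boolean return value are assumptions made for the example and do not reflect Nova's actual request filter interface. Only the property names and the trait name come from this spec.

```python
PACKED_TRAIT = "COMPUTE_NET_VIRTIO_PACKED"


def _enabled(value):
    """Interpret the true|false values proposed for the new properties."""
    return str(value).strip().lower() == "true"


def packed_virtqueue_filter(flavor_extra_specs, image_properties,
                            required_traits):
    """Require the packed-ring trait when the feature is requested.

    flavor_extra_specs / image_properties: hypothetical plain dicts.
    required_traits: hypothetical mutable set standing in for the traits
    carried by the RequestSpec.
    Returns True if the trait was added, False otherwise.
    """
    wants_packed = (
        _enabled(flavor_extra_specs.get("hw:virtio_packed_ring", "false"))
        or _enabled(image_properties.get("hw_virtio_packed_ring", "false"))
    )
    if not wants_packed:
        return False
    required_traits.add(PACKED_TRAIT)
    return True
```

Because the trait ends up in the request itself, every scheduling decision for the instance, including migrations, is restricted to hosts reporting the trait.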

Alternatives

Leave as-is; operators will not get the additional performance benefit.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

VMs using virtio-net will see an increase in performance. The increase can be anywhere between 10/20% (see DPDK Intel Vhost/virtio perf. reports) and 75% (using Napatech SmartNICs).

Other deployer impact

None

Developer impact

None

Upgrade impact

This spec will update the scheduling process. The new trait COMPUTE_NET_VIRTIO_PACKED will be added to the resource provider trait list automatically if this feature is supported on the host.
New Functional and Unit tests will be provided.

Implementation

Assignee(s)

Primary assignee:: justas_napa on IRC and Gerrit

The feature can be implemented by the Napatech devs dvo-plv@napatech.com and obu-plv@napatech.com.

Feature Liaison

Sean Mooney (sean-k-mooney)

Work Items

N/A at this stage.

Dependencies

None

Testing

None

Documentation Impact

Configuration options reference will require an update.

References

VirtIO standard: https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html
LibVirt Domain XML reference https://libvirt.org/formatdomain.html#virtio-related-options

History

Revisions
Release Name	Description
2023.2 Bobcat	Introduced

Allow local scaphandre directory to be mapped to an instance using virtiofs

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/virtiofs-scaphandre

Scaphandre is a tool that can be used to measure compute and VM power consumption down to processes. (https://github.com/hubblo-org/scaphandre)

If you want to know more, the BBC folks proposed an interesting use case to create environmental dashboards: https://superuser.openstack.org/articles/environmental-reporting-dashboards-for-openstack-from-bbc-rd/

Problem description

Currently, it is not possible to get consumption per VM, as scaphandre requires a directory on the compute node to be accessible from within running VMs. This directory contains the data required by the scaphandre instance (guest agent) running in the VM to correctly report the consumption of the VM and its associated processes.

Scaphandre's proposed solution for getting this data is to mount the directory in the VM using virtiofs. However, the user cannot do that, as it requires the VM XML definition file to be modified. Nova fully manages this file, and as a result, only Nova can change it.

Use Cases

As a user, I want to know the consumption of my compute node and drill down to the individual consumption of VMs and VM processes.
As an administrator, I want to allow this usage but make sure the user can mount only the configured required directory. I also do not want to leak cloud design insights.

Proposed change

To simplify specifications, the feature will be named virtiofs-scaphandre.

Although this feature is implemented to support scaphandre, other tools could have the same need, so the implementation will try to be as generic as possible.

This change relies partially on https://specs.openstack.org/openstack/nova-specs/specs/2023.2/approved/libvirt-virtiofs-attach-manila-shares.html specification to build the VM XML file including virtiofs settings (mostly driver part).

This implies the same requirements and limitations:

QEMU >=5.0 and libvirt >= 6.2
Associated instances use file backed memory or huge pages
Live migration of an instance will not be supported, as live attach and detach has landed only “recently” in libvirt.

Change description:

Add a compute configuration option share_local_fs that specifies mappings between compute source directories and VM destination mount_tags.

share_local_fs = { "/var/lib/libvirt/scaphandre": "scaphandre" }

If the above configuration option is present when the compute node starts, add a compute trait COMPUTE_SHARE_LOCAL_FS indicating that the virtiofs-scaphandre feature is available on this compute node.
Users can add hw:power_metrics as a flavor extra spec or hw_power_metrics as an image property, and then two things will happen:
1. Nova will schedule the instance to a host that has share_local_fs.
2. Nova will add the virtiofs settings in the instance XML file as specified by the following example.

<filesystem type='mount' accessmode='passthrough'>
    <driver type='virtiofs'/>
    <source dir='/var/lib/libvirt/scaphandre/<DOMAIN_NAME>'/>
    <target dir='mount_tag'/>
    <readonly />
</filesystem>

Note

The <DOMAIN_NAME> is the name reported by virsh list or OS-EXT-SRV-ATTR:instance_name. This is the name shared by OpenStack and the QEMU process, which scaphandre uses to get the VM name.

The instance name can be defined using the instance_name_template. https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.instance_name_template

Example:

"OS-EXT-SRV-ATTR:instance_name": "instance-00000034"
/usr/bin/qemu-system-x86_64 -name guest=instance-00000034…
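
As an illustration of how the filesystem element shown earlier might be derived from the share_local_fs mapping and the domain name, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name build_virtiofs_element is hypothetical, and Nova's libvirt driver builds guest XML through its own config classes rather than this way.

```python
import xml.etree.ElementTree as ET


def build_virtiofs_element(source_root, domain_name, mount_tag):
    """Build a read-only virtiofs <filesystem> element for one mapping.

    source_root and mount_tag are one key/value pair of share_local_fs;
    domain_name is the instance name, for example instance-00000034.
    """
    fs = ET.Element("filesystem", type="mount", accessmode="passthrough")
    ET.SubElement(fs, "driver", type="virtiofs")
    # Each instance only sees its own sub-directory of the shared root.
    ET.SubElement(fs, "source", dir=f"{source_root}/{domain_name}")
    ET.SubElement(fs, "target", dir=mount_tag)
    ET.SubElement(fs, "readonly")
    return fs


share_local_fs = {"/var/lib/libvirt/scaphandre": "scaphandre"}
for source_root, mount_tag in share_local_fs.items():
    element = build_virtiofs_element(
        source_root, "instance-00000034", mount_tag)
    print(ET.tostring(element, encoding="unicode"))
```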

As a result, users will be able to mount the compute source directory in their VM using the following command line.

user@instance $ mount -t virtiofs mount_tag /var/scaphandre

Note

The user can see the mount_tag in the instance metadata. Mount automation can be built based on this mechanism.

Alternatives

REST API impact

Data model impact

Introduce the hw_power_metrics image property as a new property object.

Extend the flavor extra spec validation to check hw:power_metrics.

Security impact

The compute node filesystem will be shared read-only. This is to prevent any modification on the host by VM users.

Notifications impact

Other end user impact

The scaphandre installation and configuration on compute nodes is left to the openstack administrator.

Performance Impact

Other deployer impact

None

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)

Feature Liaison

Feature liaison:: uggla

Work Items

New configuration option.
Add new trait.
Changes to share the compute node filesystem if requested by an image property or a flavor extra spec.

Dependencies

None

Testing

Functional API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Antelope	Introduced
Bobcat	Reproposed

Cleanup dangling volumes block device mapping

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/cleanup-dangling-volume-attachments

Find out if there are any dangling/unattached volumes in the Nova and Cinder databases and remove them, if they exist.

Problem description

After some volume-related operations, a volume can get detached from an instance without Nova being notified. Nova then still thinks the volume is attached to the instance, because the volume attachment ID is still listed in Nova's BDM table.

This can lead to issues in functionality that requires volume details from the block_device_mapping table, such as live migration and resizing of an instance.

Similarly, an attachment for the instance may exist on the Cinder side but not in the Nova DB.

Use Cases

As an operator, I want all dangling volume attachments safely removed from my instance, as having these attachments in the BDM table may cause the instance to go into an error state on startup.
As an operator, I want all dangling volume attachments safely removed from my instance, so that volume-related operations are not affected.
As an admin, I want all dangling attachments listed in Cinder that claim to be for the instance safely removed from the Cinder DB.

Proposed change

Notes

To spawn a new instance, Nova retrieves a copy of the base OS image from Glance. This image becomes the instance storage, which means any file we create will persist in this storage. Nova creates a BDM for it in the block_device_mapping database with source_type as image and destination_type as local.

Similarly, when we ask Nova to attach a volume to an instance, Nova creates a BDM for it in the block_device_mapping database and sets both source_type and destination_type to volume.

Changes

While restarting the instance, verify on the basis of source_type and destination_type whether the attached BDM is a volume. If it is a volume, verify whether this volume exists in Cinder. If it exists, verify whether its status is “in-use” or “available”. If it’s “in-use”, the volume attachment is correct, and both Nova and Cinder are aware of this attachment. If it’s “available”, the volume is not attached properly to the instance, so remove or soft delete the BDM from the block_device_mapping database.

Also log the update info, so operators can be aware of the reason for this modification in the database.

Code Changes

To delete the BDMs from the database, we must first shut down the instance, so that the instance domain gets redefined at the virt level. We need to make sure the BDMs are updated before generating the new XML.

Hence, this functionality should be added to the instance reboot process. While rebooting, update the block_device_mapping DB on the Nova side and the volume_attachment DB on the Cinder side via a Cinder API call. Once the instance has shut off properly and is starting again, the virt driver (such as libvirt) will generate a new domain XML with the updated BDMs.

A new method _delete_dangling_bdms() should be added to ComputeManager and called from ComputeManager.reboot_instance. It should verify that the source and destination types of the target BDM are volume rather than image and local. Then, if the target volume is not listed in Cinder, or its status in Cinder is ‘available’ rather than ‘in-use’, it should delete the BDM from the block_device_mapping table.

Once a dangling volume is found, log a message saying that stale volume attachments are being removed.
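
The selection logic of _delete_dangling_bdms() could be sketched as below. This is a simplified illustration, assuming BDMs are plain dicts and that the Cinder state has already been fetched into a mapping of volume ID to status. Both representations, and the helper name find_dangling_bdms, are hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


def find_dangling_bdms(bdms, cinder_volumes):
    """Return the volume BDMs that do not match an in-use Cinder volume.

    bdms: iterable of hypothetical dicts with 'source_type',
        'destination_type' and 'volume_id' keys.
    cinder_volumes: hypothetical mapping of volume ID to the status
        reported by Cinder; missing IDs mean the volume does not exist.
    """
    dangling = []
    for bdm in bdms:
        if (bdm["source_type"] != "volume"
                or bdm["destination_type"] != "volume"):
            # Image/local BDMs are instance storage, not attached volumes.
            continue
        status = cinder_volumes.get(bdm["volume_id"])
        if status is None or status == "available":
            LOG.warning("Removing stale volume attachment for volume %s",
                        bdm["volume_id"])
            dangling.append(bdm)
    return dangling
```

The caller would then delete the returned BDMs before the new domain XML is generated, so that the restarted guest no longer references them.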

Alternatives

A cleanup command for the nova-manage utility, which takes an instance and removes all dangling volumes from it.
```
$ nova-manage volume_attachment cleanup <server-id>
```
A cron job which checks each instance in the Nova BDM and Cinder volume_attachment tables and, if the instance has dangling volumes, removes the volume entries from the tables. This job does not require an instance UUID.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

The server might take more time to reboot, as there will be GET and DELETE API calls to the Cinder service.

The added time primarily depends on the number of attachments to delete.

Other deployer impact

None

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: auniyal

Feature Liaison

Feature liaison:: None

Work Items

Create the cleanup functionality and add it to the instance restart process.
Add unit and functional tests for cleanup.

Dependencies

None

Testing

Unit and Functional tests will be added.

Documentation Impact

A release note for cleaning up dangling volumes during server restart will be added.
Update the admin documentation on managing volumes.

References

None

History

Revisions
Release Name	Description
2023.2 Bobcat	Introduced

Link Compute Objects by ID

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/compute-object-ids

Nova has long had a dependency on an unchanging hostname on the compute nodes. This spec aims to address this limitation, at least from the perspective of being able to detect an accidental change and avoiding catastrophe in the database that can currently result from a hostname change, whether intentional or not.

As a continuation of the effort to robustify compute hostnames, this spec describes the next phase which involves strengthening the linkage between the primary database objects managed by the compute nodes.

Problem description

The ComputeNode, Service, and Instance objects form the primary data model for our compute nodes. Instances run on compute nodes, which are managed by services. We rely on this hierarchy to know where instances are (physically) as well as which RPC endpoint to send messages to for management. Currently, the linkage between all three objects is a relatively loose, string-based association using the hostname of the compute node and/or the CONF.host values. This not only makes an actual/intentional rename very difficult, but also risks breaking critical links as a result of an accidental one.

Use Cases

As an operator I want an accidental or transient hostname rename to not cause corruption of my Nova data structures.

As a developer, I want a stronger association between the primary objects in the data model for robustness and performance reasons.

Proposed change

We already have a service_id field on our ComputeNode object. We should resume populating that when we create a new ComputeNode and we should fix existing records during ComputeManager.init_host(), similar to how we added checks for hostname discrepancies in the earlier phase of this effort.

We will need to add a compute_id field to the Instance object, which will require a schema migration. This field will need to remain nullable, and will be NULL for instances before scheduling, as well as instances in SHELVED_OFFLOADED state. The compute_id field can be populated at the same time we currently set Instance.node, and similar to ComputeNode records above, we can migrate existing records during ComputeManager._init_instance(). In order to ensure that we keep the node and compute_id fields in sync, the Instance.create() and Instance.update() methods will perform a check to ensure that the former is never changed without the latter also being changed. This check will (by the nature of those two @remotable methods) be run on the conductor nodes, and will only enforce the requirement if the version of the objects is new enough.
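
The consistency check between node and compute_id could take a shape like the following sketch. The exception class, the function name, the set of changed field names, and the tuple-based object versions are all hypothetical stand-ins. The real check would live inside Instance.create() and Instance.update() and use the object's own change tracking.

```python
class NodeComputeIdMismatch(Exception):
    """Hypothetical error: Instance.node changed without compute_id."""


def check_node_compute_id_sync(changed_fields, object_version, min_version):
    """Refuse an update that changes node without changing compute_id.

    changed_fields: set of field names modified in this create/update.
    object_version / min_version: hypothetical (major, minor) tuples; the
    check is only enforced once the object version is new enough.
    """
    if object_version < min_version:
        # Older objects do not carry compute_id yet; skip the enforcement.
        return
    if "node" in changed_fields and "compute_id" not in changed_fields:
        raise NodeComputeIdMismatch(
            "Instance.node changed without Instance.compute_id")
```

Gating on the object version is what lets old and new compute nodes coexist during a rolling upgrade without the conductor rejecting saves from nodes that cannot set the new field.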

Many of the times we update Instance.node, we do so from a Migration object, using either source_node for reverted migrations or dest_node for successful ones. Thus, our handling of migrations will need some work as well, which is described in the subsection below.

It is important to note that this spec defines one part of a two-part effort. The setup described here will require a subsequent step to change how we look up these objects to use the new relationships once all the data has been migrated.

Migration handling

Currently we update Instance.node from a Migration object in a number of places. In most of these, it is being performed on the node where the instance will remain. For those cases, we will get the ComputeNode object from the resource tracker (still by name, from the Migration object) and use it to set the new field. Aside from saving a loosely-coupled DB lookup each time we need it, this has the additional benefit of double-checking that the node specified (loosely, by name) in the Migration object is the (or a) correct one for the current host.

The only place where we currently update Instance.node from a location that is not the host where the Instance is staying is during the early part of resize, where _resize_instance() runs on the sending host with information provided by the destination. In this case, we will modify the Migration object to have one additional dest_compute_id field, which will be filled by the destination host with its known-correct value, to be used by the sending host when it modifies Instance.node (and Instance.compute_id) to be the values for the new host.

Upgrade Concerns

Since the Instance and Migration objects will be growing new fields, older nodes will not be populating these fields when migrating between old and new nodes. In the case of Instance, the compute_id field will not be actually used until a later release when we know it has been populated. The dest_compute_id field in Migration will be used if present, and if not, a fallback to finding the node’s ID will rely on a call to ComputeNode.get_by_host_and_nodename(), which is “easy” since the Migration has all the fields necessary to make that call.
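
The fallback described above can be illustrated with a short sketch. The migration is modelled as a plain dict and the database lookup as an injected callable standing in for ComputeNode.get_by_host_and_nodename(). Both are simplifications for the example, and the helper name resolve_dest_compute_id is hypothetical.

```python
def resolve_dest_compute_id(migration, lookup_node_id):
    """Return the destination compute node ID for a migration.

    migration: hypothetical dict with 'dest_compute' and 'dest_node' keys
        and, when set by a new-enough destination, 'dest_compute_id'.
    lookup_node_id: callable (host, nodename) -> compute node ID.
    """
    compute_id = migration.get("dest_compute_id")
    if compute_id is not None:
        # Filled in by the destination host with its known-correct value.
        return compute_id
    # Older destination: fall back to the name-based lookup.
    return lookup_node_id(migration["dest_compute"], migration["dest_node"])
```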

Alternatives

This is not required for proper operation, so we could choose to do nothing.

We could also choose to keep the string-based association, strengthened by Foreign Key relationships.

For the Migration changes, we could also make the destination compute ID be a new RPC parameter that is passed from the destination compute back to the source to avoid needing to change the Migration object. However that brings with it more upgrade concerns.

We could also use the ComputeNode.uuid on the Migration object instead of the ID. There is no real reason to do that because cross-cell migration already creates two migration objects, one per cell. It would also perform worse and would not be a 1:1 mapping of the field we need to set on the instance, which would mean another DB lookup as well.

Data model impact

All changes will be confined to the Cell database:

Instance will grow a compute_id field
Migration will grow a dest_compute_id field
Consistency checks for both of these will need to be added to the object lifecycle operations.
ComputeNode’s existing service_id field will be populated
Both will be populated during new record creation, and for existing records at runtime during nova-compute startup.

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

While not the primary intent, a follow-on effort to this will enable querying these objects by integer ID relation instead of by string, which should be both faster as well as lower impact on the database server.

Other deployer impact

No additional deployer impact other than a tiny amount of online data migration traffic on the next startup after upgrade, as well as improved performance and robustness going forward once the effort is completed.

Developer impact

Some additional re-learning about the relationships between the objects being based on IDs instead of hostnames.

Upgrade impact

No real upgrade impact here, other than what is already expected. A simple database migration will be added, with no specific requirements about ordering or simultaneous code change. Compute nodes will migrate existing records during the first post-upgrade restart.

Implementation

Assignee(s)

Primary assignee:: danms

Work Items

Start populating ComputeNode.service_id on creation
Migrate existing ComputeNode objects on startup (init_host())
Add a migration to add the Instance.compute_id and Migration.dest_compute_id fields
Start populating Migration.dest_compute_id for migrations
Start populating Instance.compute_id on completion of scheduling and migrations.
Migrate existing Instance objects on startup (_init_instance())

Dependencies

None

Testing

Unit and Functional tests will be added to verify that new and existing objects are properly linked and migrated.

Documentation Impact

No documentation changes required.

References

This is part of a larger multi-cycle effort to robustify compute hostnames.
This follows the first robustification stage, completed in 2023.1

History

Revisions
Release Name	Description
2023.2 Bobcat	Introduced

Add maxphysaddr support for Libvirt

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-maxphysaddr-support

This blueprint proposes new flavor extra_specs to control the physical address bits of vCPUs in Libvirt guests.

Problem description

When booting a guest with 1TB+ RAM, the default physical address bits are too small and the boot fails [1]. So a knob is needed to specify the appropriate physical address bits.

Use Cases

Booting a guest with large RAM.

Proposed change

<maxphysaddr mode='emulate' bits='42'/>
<maxphysaddr mode='passthrough'/>

Flavor extra_specs

Here I suggest the following two flavor extra_specs. Of course, if these are omitted, the behavior is the same as before.

hw:maxphysaddr_mode can be either emulate or passthrough.
hw:maxphysaddr_bits takes a positive integer value. Only meaningful and must be specified if hw:maxphysaddr_mode=emulate.
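
The constraints on the two extra specs can be expressed as a small validation sketch. The function name validate_maxphysaddr is hypothetical, and the choice to reject hw:maxphysaddr_bits outside of emulate mode is an assumption of this example. The spec only states that the value is meaningful in emulate mode.

```python
VALID_MODES = ("emulate", "passthrough")


def validate_maxphysaddr(extra_specs):
    """Return (mode, bits), or None when neither extra spec is set."""
    mode = extra_specs.get("hw:maxphysaddr_mode")
    bits = extra_specs.get("hw:maxphysaddr_bits")

    if mode is None:
        if bits is not None:
            raise ValueError("hw:maxphysaddr_bits needs hw:maxphysaddr_mode")
        return None  # Both omitted: behavior is the same as before.
    if mode not in VALID_MODES:
        raise ValueError(f"invalid hw:maxphysaddr_mode: {mode!r}")
    if mode == "passthrough":
        if bits is not None:
            # Assumption of this sketch: reject rather than silently ignore.
            raise ValueError("hw:maxphysaddr_bits requires mode=emulate")
        return mode, None
    if bits is None:
        raise ValueError("hw:maxphysaddr_bits is required for mode=emulate")
    bits = int(bits)  # Raises ValueError for non-integer strings.
    if bits <= 0:
        raise ValueError("hw:maxphysaddr_bits must be a positive integer")
    return mode, bits
```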

Nova scheduler changes

Nova scheduler also needs to be modified to take these two properties into account.

Passthrough and emulate modes have different properties. So let’s consider the two separately.

The case of hw:maxphysaddr_mode=emulate. In nova scheduling, it is necessary to check that the hypervisor supports at least hw:maxphysaddr_bits. The maximum number of bits supported by hypervisor can be obtained by using libvirt capabilities [4]. Therefore, ComputeCapabilitiesFilter can be used to compare the number of bits in scheduling. For example, this can be accomplished by adding capabilities:cpu_info:maxphysaddr:bits>=42 automatically. Cold migration and live migration can also be realized with this filter and COMPUTE_ADDRESS_SPACE_EMULATED trait. So the overall flavor extra_specs look like the following:

openstack flavor set <flavor> \
  --property hw:maxphysaddr_mode=emulate \
  --property hw:maxphysaddr_bits=42

Note

Since ComputeCapabilitiesFilter only supports flavor extra_specs and not image properties [5], this proposal is out of scope for image properties.

Alternatives

Before the maxphysaddr option was introduced into Libvirt, it was specified as a workaround with a QEMU command-line parameter. But this alternative is not allowed in Nova.

Also, some Linux distributions may have machine types with host-phys-bits=true [6]. For example, pc-i440fx-bionic-hpb and pc-q35-bionic-hpb. However, this alternative has the following two issues and cannot be adopted for general-purpose use cases.

Ubuntu package maintainers are applying a patch to QEMU [7]. It means this is not included in vanilla QEMU and is not available in other distributions.
This is only the case for hw:maxphysaddr_mode=passthrough and does not include hw:maxphysaddr_mode=emulate. Since hw:maxphysaddr_mode=passthrough requires cpu_mode=host-passthrough to be used [8], this alternative cannot be used with cpu_mode=custom or cpu_mode=host-model. So, this alternative is not sufficient for a cloud with many different CPU models.

As for scheduling, placement does not currently support numeric traits, so the maximum number of bits supported by hypervisor cannot be checked by this mechanism.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Operators should specify appropriate flavor extra_specs as needed.

Developer impact

None

Upgrade impact

As described earlier, the new traits COMPUTE_ADDRESS_SPACE_PASSTHROUGH and COMPUTE_ADDRESS_SPACE_EMULATED signal if the upgraded compute nodes support this feature.

Implementation

Assignee(s)

Primary assignee:: nmiki
Other contributors:: None

Feature Liaison

Feature liaison:: Liaison Needed

Work Items

Add new guest configs
Add new fields in nova/api/validation/extra_specs/hw.py
Add new fields in LibvirtConfigCPU in nova/virt/libvirt/config.py
Add new traits to check Libvirt and QEMU versions
Add new field maxphysaddr to cpu_info in nova/virt/libvirt/driver.py
Add docs and release notes for new flavor extra_specs

Dependencies

Libvirt v8.7.0+. QEMU v2.7.0+.

Testing

Add the following unit tests:

check that proposed flavor extra_specs are properly validated
check that intended XML elements are output
check that traits are properly added and used
check that new field in ComputeCapabilitiesFilter is properly added and used

Documentation Impact

For operators, the documentation describes what proposed flavor extra_specs mean and how they should be set.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced
2023.2 Bobcat	Reproposed

Tooling and Docs for Unified Limits

Tue, 19 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/unified-limits-nova-tool-and-docs

In the Yoga release, support for Unified Limits was added in Nova as an experimental feature to get early feedback and fix issues that were found by operators trying it out. Now that a few releases have passed, we want to go ahead and formalize the unified limits quota driver by creating a tool to help operators copy their existing legacy quota limits from Nova to unified limits in Keystone, publishing official documentation in the Nova quota documentation, and removing the note on the [quota]driver=nova.quota.UnifiedLimitsDriver config option indicating its experimental status.

Note

There are no immediate plans to deprecate the legacy quota system in Nova. The objective of this work is to provide a better experience for users who opt in to using unified limits in Nova.

Problem description

Currently there is no documentation in the Nova docs about unified limits and there isn’t any automated tool for generating unified limits in Keystone from existing Nova legacy quota limits.

Use Cases

As an operator, I would like to use a tool to automatically copy my existing legacy quota limits from Nova to unified limits in Keystone.
As an operator, I would like formal documentation for unified limits quotas to be available.

Proposed change

We propose to create an automated tool, for example, nova-manage limits migrate_to_unified_limits that will read existing legacy quota limits from the Nova database and config options and create equivalent unified limits for them in Keystone using the Keystone REST API. It will be able to migrate both default limits and project-scoped limits. It will not migrate user-scoped limits as they are not supported by unified limits.

The nova-manage command will follow the precedence for checking quota and:

Check the nova_api.quotas database table and for each row call the Keystone POST /limits API with the project_id, resource name, and resource_limit. These are the project-scoped limits.
Check the nova_api.quota_classes database table to see if there are rows with class_name default. If there are, for each row with class_name default call the Keystone POST /registered_limits API with the resource_name and default_limit. These are the default limits that apply to all projects.
Check the following config options:
```
[quota]
instances
cores
ram
metadata_items
injected_files
injected_file_content_bytes
injected_file_path_length
key_pairs
server_groups
server_group_members
```
For each config option, use its set value or default value to call the Keystone POST /registered_limits API with the resource_name and default_limit, if the resource_name does not already have a registered limit in Keystone. These are default limits that apply to all projects.
The nova_api.project_user_quotas database table will be ignored because user-scoped limits are not supported by unified limits.
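
The precedence above can be sketched as a pure planning function that decides which limits would be created, without touching a database or Keystone. This is a minimal sketch under stated assumptions: the function name plan_unified_limits and its plain-dict inputs are hypothetical stand-ins for the database rows and config options, resource names are passed through unchanged, and any request fields beyond those named in the text are omitted.

```python
def plan_unified_limits(project_quotas, default_class_quotas, config_quotas,
                        existing_registered=frozenset()):
    """Plan the Keystone calls without performing them (hypothetical helper).

    project_quotas: iterable of (project_id, resource, limit) tuples,
        standing in for rows of nova_api.quotas.
    default_class_quotas: {resource: limit} for class_name "default",
        standing in for rows of nova_api.quota_classes.
    config_quotas: {resource: value} taken from the [quota] config options.
    existing_registered: resource names that already have a registered
        limit in Keystone.
    """
    # Project-scoped limits: one POST /limits entry per nova_api.quotas row.
    limits = [
        {"project_id": project_id, "resource_name": resource,
         "resource_limit": limit}
        for project_id, resource, limit in project_quotas
    ]

    # Default limits: rows of the "default" quota class come first.
    registered = dict(default_class_quotas)
    # Config options only fill in resources that still lack a default.
    for resource, value in config_quotas.items():
        if resource not in registered and resource not in existing_registered:
            registered[resource] = value

    # One POST /registered_limits entry per resource, in a stable order.
    registered_limits = [
        {"resource_name": resource, "default_limit": value}
        for resource, value in sorted(registered.items())
    ]
    return registered_limits, limits
```

User-scoped rows never enter the function, which mirrors the way nova_api.project_user_quotas is ignored. Returning a plan rather than issuing requests keeps the precedence rules easy to test in isolation.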

We will add formal docs about unified limits to the Nova docs and remove the note on the [quota]driver config option about the nova.quota.UnifiedLimitsDriver being in a development state.

Alternatives

Operators can create unified limits without a provided tool by using the openstack limit commands of the OpenStack client.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

End users will be able to read documentation about how quotas work with unified limits.

Performance Impact

None

Other deployer impact

Deployers will have the option of using the quota limit migration tool to copy existing legacy Nova quota limits into Keystone unified limits instead of using openstackclient commands or otherwise calling the Keystone REST API manually.

Developer impact

None

Upgrade impact

There is no upgrade impact with the quota limit migrate tool in that there is no restriction on when operators can run the tool. They can copy quota limits into Keystone at any time, unrelated to an upgrade. The only requirements are that the Keystone API needs to be available and nova-manage must have access to a Nova config that has [api_database]connection configured so that it can access the Nova quota database tables.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: None

Feature Liaison

Feature liaison:: melwitt

Work Items

Develop a nova-manage limits migrate_to_unified_limits command to copy existing legacy Nova quota limits from the Nova database and config options to unified limits by calling the Keystone REST API
Write documentation for unified limits in Nova
Remove note from [quota]driver=nova.quota.UnifiedLimitsDriver about the driver being in a development state
Collaborate with the Keystone team to remove the docs warning in https://docs.openstack.org/keystone/latest/admin/unified-limits.html that labels the unified limits API as experimental

Dependencies

https://specs.openstack.org/openstack/nova-specs/specs/yoga/implemented/unified-limits-nova.html

Testing

Unit and/or functional testing for the quota limit migrate tool will be added.

We can also test the quota limit migrate tool alongside the existing nova/tools/hooks/post_test_hook.sh unified limits coverage in the nova-next CI job.

Documentation Impact

Operators will be most affected by the addition of Nova unified limits documentation. The following docs will need to be updated:

References

History

Revisions
Release Name	Description
2023.2 Bobcat	Introduced

Use extend volume completion action

Mon, 18 Sep 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/assisted-volume-extend

Problem description

In this case, only the QEMU process holding the lock can resize the volume, which can be triggered through the QEMU monitor command block-resize.

There is currently no adequate way for Cinder to use this feature, so the NFS, NetApp NFS, Powerstore NFS, and Quobyte volume drivers all disable extending attached volumes.

Use Cases

Proposed change

Currently, Cinder will send the volume-extended external server event to Nova only after it has finalized the extend operation and reset the volume status from extending back to in-use.

Compute Agent

Nova’s compute agent will use the volume status to differentiate between the two behaviors when handling volume-extended events:

If the volume status is extending, then it will attempt to read extend_new_size from the volume’s metadata and use this value as the new size of the volume, instead of the volume size field.

After successfully extending the volume, it will call the extend volume completion action of the volume, with "error": false.

If anything goes wrong, including extend_new_size being missing from the metadata, or being smaller than the current size of the volume, it will log the error and call the os-extend_volume_completion action with "error": true, so Cinder can roll back the operation.
For any other volume status, including in-use, the event will be handled as before.
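
The branching described above can be sketched as follows. This is only an illustration under stated assumptions: handle_volume_extended is a hypothetical name, the volume is modelled as a plain dict, and the extend and complete callables stand in for the virt driver resize and the os-extend_volume_completion call respectively.

```python
import logging

LOG = logging.getLogger(__name__)


def handle_volume_extended(volume, current_size, extend, complete):
    """Handle a volume-extended event for one attached volume (sketch).

    volume: dict with "status", "size" and "metadata" keys.
    current_size: current size of the volume, in the same unit as "size".
    extend: callable(new_size) that performs the resize; may raise.
    complete: callable(error=bool) standing in for the completion action.
    """
    if volume["status"] != "extending":
        # Any other status, including in-use: use the volume size field
        # and do not report back, as before.
        extend(volume["size"])
        return

    try:
        # A missing key raises KeyError, a malformed value ValueError.
        new_size = int(volume["metadata"]["extend_new_size"])
        if new_size < current_size:
            raise ValueError("extend_new_size is smaller than current size")
        extend(new_size)
    except Exception:
        # Log the failure and report it so the operation can be rolled back.
        LOG.exception("Extending the attached volume failed")
        complete(error=True)
        return

    complete(error=False)
```

Catching every exception in the extending branch reflects the "if anything goes wrong" wording: the completion callback is invoked exactly once on that path, whether the resize succeeded or not.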

API

Nova’s API will introduce a new microversion, so that Cinder can make sure the new behavior is available, before leaving an extend operation unfinished.

Alternatives

A previous change tried to use the volume-extended external server event to support online extend for the NFS driver [1], but did not rely on feedback from Nova to Cinder at all. Instead, it would just set the new size of the volume, change the status back to in-use, notify Nova, and hope for the best.

If anything went wrong on Nova’s side, this would still result in a volume state indicating that the operation was successful, which is not acceptable.
A previous version of this spec proposed a new synchronous API in Nova [2] that would directly call ComputeVirtAPI.extend_volume of the nova-compute instance managing the guest that a volume was attached to. This API would provide a single mechanism to trigger the resize operation, communicate the new size to Nova, and get feedback on the success of the operation.

The problem with a synchronous API is that RPC and API timeouts limit the maximum time an extend operation can take. For QEMU, this seemed to be acceptable, because storage preallocation is hard disabled for the block-resize command, and because all currently plausible file systems support sparse file operations.

However, this may not be true for other volume or virt drivers that might require this API in the future. It would also break with the established pattern of asynchronous coordination between Nova and Cinder, which includes the assisted snapshot and volume migration features.
Following this pattern, we could make the proposed API asynchronous and use a new callback in Cinder, similar to Nova’s os-assisted-volume-snapshots API, which uses the os-update_snapshot_status snapshot action to provide feedback to Cinder.

The function of the new Nova API would then just be to trigger the operation and to communicate the new size. The question is then, whether that warrants adding a new API to Nova, since there are existing mechanisms that could be used for either.
The existing mechanism for triggering the extend operation in Nova is of course the volume-extended external server event. Using it for this purpose, as this spec proposes, requires the target size to be transferred separately, because external server events only have a single text field that is freely usable, which for volume-extended is already used for the volume ID.

Besides storing it in the admin metadata, as [3] and this spec propose, there is also the option of updating the size field of the volume, as [1] was essentially doing.

This would require the volume size field to be reset on a failure. If an error response from Nova was lost, the volume would just keep the new size. We would need to extend os-reset_status to allow a size reset, or something similar to clean up volumes like this. This would be possible, but updating the size field only after the volume was successfully extended seems like a cleaner solution.
We could also extend the external server event API to accept additional data for events, and use this to communicate the new size to Nova.

This option was judged favorably by reviewers on the previous version of this spec, [2], but it would be a more complex change to the Nova API.

However, if additional data fields become available in a future version of the external server event API, it would be a relatively minor change to use this instead of volume metadata.

Data model impact

None

REST API impact

The behavior of the external server event API will change.

If Nova receives a volume-extended event, and the referenced volume has status of extending, Nova will look for the extend_new_size key in the volume metadata, and use this instead of the volume size field as the target size to update the block device mapping and to pass to the virt driver’s extend_volume method.

Nova will also attempt to call Cinder’s new os-extend_volume_completion volume action proposed in [3] to let Cinder know if the operation was successful or not.
Otherwise, the API will behave as before.

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

Checking the target compute service version allows the API to handle rolling upgrades gracefully.

Implementation

Assignee(s)

Primary assignee:: kgube
Other contributors:: None

Feature Liaison

Feature liaison:: None yet

Work Items

Update the external server event API to check the target compute service version for volume-extended events.
Update the ComputeVirtAPI.extend_volume method to follow the behavior outlined in Compute Agent.
Add unit tests.
Adapt NFS job in the Nova gate to validate online extend.

Dependencies

The extend volume completion action [3]

Testing

We should test that os-extend_volume_completion gets called correctly in all possible error and success conditions if a volume has extending status.

We should test the case that the call to os-extend_volume_completion fails.

We also need to test that volume-extended continues to be handled correctly for volumes not in extending status.

Documentation Impact

The new behavior of the volume-extended event should be added to the documentation of the external server event API.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Accepted
2023.2 Bobcat	Accepted
2024.1 Caracal	Reproposed

Example Spec - The title of your blueprint

Fri, 21 Jul 2023 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it's possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in ReSTructured text, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ascii diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ascii diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which can not be viewed in gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated, for example if you need to take into account existing instances, or modify other existing data describe how that will work.

REST API impact

Each API method which is either added or changed should have the following

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal http response code(s)
- Expected error http response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - URL should not include underscores, and use hyphens instead.
- Parameters which can be passed via the url
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to login?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds or removes fields to the API responses. For example when the feature adds a new field to the GET /servers API response consider adding similar information to the payload of the instance action notifications
When the feature adds a new action to the existing API entities. For example adding a new action to the server might mean you want to emit a corresponding new instance action notification
When the feature adds a new resource (noun) to the REST API consider adding new notifications about the creation and deletion of such resource

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it's merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others. For example, extending the size of an attached volume. Since Nova does not yet have a sophisticated capabilities API that lets a user know what actions can be performed on a given instance, consider adding a new policy rule so that operators who cannot support a virt-specific feature can at least disable it in their cloud. The user then sees this in an understandable way, as a 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However, we suggest finding a liaison for your feature, as it will help get your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of library?

Testing

Is this untestable in gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc.)?

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section, intended to be used each time the spec is updated, to describe new design, API, or database schema changes. It helps readers understand how the spec has evolved over time.

Revisions
Release Name	Description
2024.1 Caracal	Introduced

Flavour and Image defined ephemeral storage encryption

Thu, 06 Jul 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-storage-encryption

Note

This spec will only cover the high level changes to the API and compute layers, implementation within specific virt drivers is left for separate specs.

Problem description

Use Cases

As a user I want to request that all of my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As an admin/operator I want to either enforce ephemeral encryption per flavor or per image.
As an admin/operator I want to provide sane choices to my end users regarding how their ephemeral storage is encrypted at rest.
As a virt driver maintainer/developer I want to indicate that my driver supports ephemeral storage encryption using a specific encryption format.
As a virt driver maintainer/developer I want to provide sane default encryption format and options for users looking to encrypt their ephemeral storage at rest. I want these associated with the encrypted storage until it is deleted.

Proposed change

To enable this, new flavor extra specs, image properties and host configurables will be introduced. These will control when and how ephemeral storage encryption at rest is enabled for an instance.

Note

Separate image properties have been documented in the Glance image encryption and Cinder image encryption specs to cover how images can be encrypted at rest within Glance.

Allow ephemeral encryption to be configured by flavor, image or config

To enable ephemeral encryption per instance the following boolean based flavor extra spec and image property will be introduced:

hw:ephemeral_encryption
hw_ephemeral_encryption

The encryption format used will be controlled by the following flavor extra specs and image properties:

hw:ephemeral_encryption_format
hw_ephemeral_encryption_format

[ephemeral_storage_encryption]/default_format

The format will be provided as a string that maps to a BlockDeviceEncryptionFormatTypeField oslo.versionedobjects field value:

plain for the plain dm-crypt format
luks for the LUKSv1 format
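
The string-to-format mapping can be illustrated with a standard library enum. This is a sketch only: it does not use oslo.versionedobjects, and the names EncryptionFormat and parse_encryption_format are hypothetical rather than the proposed field classes.

```python
import enum


class EncryptionFormat(enum.Enum):
    """Illustrative stand-in for the proposed encryption format enum."""

    PLAIN = "plain"  # plain dm-crypt format
    LUKS = "luks"    # LUKSv1 format


def parse_encryption_format(value):
    """Map a flavor, image or config string to an EncryptionFormat."""
    try:
        return EncryptionFormat(value)
    except ValueError:
        allowed = ", ".join(fmt.value for fmt in EncryptionFormat)
        raise ValueError(
            f"unsupported encryption format {value!r}; "
            f"expected one of: {allowed}"
        ) from None
```

Rejecting unknown strings at parse time means an unsupported format is reported at the point where the value is read, not later when a disk is created.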

hw_ephemeral_encryption_secret_uuid

The secret UUID is needed when creating an instance from an ephemeral encrypted snapshot or when unshelving an ephemeral encrypted instance.

Create a new key manager secret for every new encrypted disk image

For example:

Let’s say Instance A has 3 disks: one root disk, one ephemeral disk, and one swap disk. Each disk will have its own secret.

This table is intended to illustrate the way secrets are handled in various scenarios.

+--------------------+-------------+--------------+------------------------------------------------------+
| Instance or Image  | Disk        | Secret       | Notes                                                |
|                    |             | (passphrase) |                                                      |
+====================+=============+==============+======================================================+
| Instance A         | disk (root) | Secret 1     | Secret 1, 2, and 3 will be automatically deleted     |
|                    +-------------+--------------+ by Nova when Instance A is deleted and its disks are |
|                    | disk.eph0   | Secret 2     | destroyed                                            |
|                    +-------------+--------------+                                                      |
|                    | disk.swap   | Secret 3     |                                                      |
+--------------------+-------------+--------------+------------------------------------------------------+
| Image Z (snapshot) | disk (root) | Secret 4     | Secret 4 will *not* be automatically deleted and     |
| created from       |             | (new secret  | manual deletion will be needed if/when Image Z is    |
| Instance A         |             |  is created) | deleted from Glance                                  |
+--------------------+-------------+--------------+------------------------------------------------------+
| Instance B         | disk (root) | Secret 5     | Secret 5, 6, and 7 will be automatically deleted     |
| created from       +-------------+--------------+ by Nova when Instance B is deleted and its disks are |
| Image Z (snapshot) | disk.eph0   | Secret 6     | destroyed                                            |
|                    +-------------+--------------+                                                      |
|                    | disk.swap   | Secret 7     |                                                      |
+--------------------+-------------+--------------+------------------------------------------------------+
| Instance C         | disk (root) | Secret 8     | Secret 8, 9, and 10 will be automatically deleted    |
|                    +-------------+--------------+ by Nova when Instance C is deleted and its disks are |
|                    | disk.eph0   | Secret 9     | destroyed                                            |
|                    +-------------+--------------+                                                      |
|                    | disk.swap   | Secret 10    |                                                      |
+--------------------+-------------+--------------+------------------------------------------------------+
| Image Y (snapshot) | disk (root) | Secret 8     | Secret 8 is *retained* when Instance C is shelved in |
| created by shelve  |             |              | part to prevent the possibility of a change in       |
| of Instance C      |             |              | ownership of the root disk secret if, for example,   |
|                    |             |              | an admin user shelves a non-admin user's instance.   |
|                    |             |              | This approach could be avoided if there is some way  |
|                    |             |              | we could create a new secret using the instance's    |
|                    |             |              | user/project rather than the shelver's user/project  |
+--------------------+-------------+--------------+------------------------------------------------------+
| Rescue disk        | disk (root) | Secret 11    | Secret 11 is stashed in the instance's system        |
| created by rescue  |             | (new secret  | metadata with key                                    |
| of Instance A      |             |  is created) | ``rescue_disk_ephemeral_encryption_secret_uuid``.    |
|                    |             |              | This is done because a BDM record for the rescue     |
|                    |             |              | disk is not going to be persisted to the database.   |
+--------------------+-------------+--------------+------------------------------------------------------+

Snapshots of instances with ephemeral encryption

Snapshots created by shelving instances with ephemeral encryption

This behavior could be avoided, however, if there were a way to create a new encryption secret using the instance’s user and project rather than the shelver’s. In that case, the encryption secret would not need to be reused.

Rescue disk images created by rescuing instances with ephemeral encryption

The corresponding virt driver secret name pattern is <instance UUID>_rescue_disk and any existing secrets with that name are deleted by the virt driver when a new rescue is requested.

The new encryption secret for the rescue disk is deleted from the key manager and the virt driver secret is also deleted when the instance is unrescued.

Cleanup of ephemeral encryption secrets

Note

BlockDeviceMapping changes

The BlockDeviceMapping object will be extended to include the following fields encapsulating some of the above information per ephemeral disk within the instance:

encrypted: A simple boolean to indicate if the block device is encrypted. This will initially only be populated when ephemeral encryption is used but could easily be used for encrypted volumes as well in the future.
encryption_secret_uuid: As the name suggests, this will contain the UUID of the associated encryption secret for the disk. The type of secret used here will be specific to the encryption format and virt driver used. It should not be assumed that this will always be a symmetric key, as is currently the case with all encrypted volumes provided by Cinder. For example, for LUKS-based ephemeral storage this secret will be a passphrase.
encryption_format: A new BlockDeviceEncryptionFormatType enum and associated BlockDeviceEncryptionFormatTypeField field listing the encryption format. The available options will be kept in line with the constants currently provided by os-brick, and the two could be merged in the future if both projects can share these types and fields.
encryption_options: A simple unversioned dict of strings containing encryption options specific to the virt driver implementation, underlying hypervisor and format being used.
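
As a rough illustration of the four fields above, the following sketch models them with a plain dataclass. It is not the Nova BlockDeviceMapping object, which is an oslo.versionedobjects class. The class name EphemeralDiskEncryption, the enum member values, and the rule that an encrypted disk must carry both a secret UUID and a format are assumptions made for the example.

```python
import enum
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


class BlockDeviceEncryptionFormatType(str, enum.Enum):
    # Assumed values; the spec only says they track os-brick's constants.
    PLAIN = "plain"
    LUKS = "luks"
    LUKSV2 = "luksv2"


@dataclass
class EphemeralDiskEncryption:
    """Hypothetical stand-in for the new BlockDeviceMapping fields."""

    encrypted: bool = False
    encryption_secret_uuid: Optional[str] = None
    encryption_format: Optional[BlockDeviceEncryptionFormatType] = None
    # Unversioned dict of strings, specific to driver/hypervisor/format.
    encryption_options: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if not self.encrypted:
            return
        # Assumption: an encrypted disk always records a secret and format.
        if self.encryption_secret_uuid is None:
            raise ValueError("encrypted disk requires encryption_secret_uuid")
        if self.encryption_format is None:
            raise ValueError("encrypted disk requires encryption_format")
        # Raises ValueError if the string is not a well-formed UUID.
        uuid.UUID(self.encryption_secret_uuid)
```

Keeping encryption_options as a free-form dict of strings mirrors the "unversioned" wording above: the virt driver interprets the keys, and nothing outside the driver depends on them.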

Note

Encryption options could be exposed to end users in the future when a proper design which addresses security and handles all upgrade scenarios is developed.

Populate ephemeral encryption BlockDeviceMapping attributes during build

Use `COMPUTE_EPHEMERAL_ENCRYPTION` compatibility traits

COMPUTE_EPHEMERAL_ENCRYPTION_LUKS
COMPUTE_EPHEMERAL_ENCRYPTION_LUKSV2
COMPUTE_EPHEMERAL_ENCRYPTION_PLAIN
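
A minimal sketch of how a request could be translated into required traits from the list above. The function name required_traits, the hw:ephemeral_encryption_format extra spec key, and the use of a bare COMPUTE_EPHEMERAL_ENCRYPTION trait when no format is requested are assumptions for illustration, not confirmed behavior.

```python
TRAIT_PREFIX = "COMPUTE_EPHEMERAL_ENCRYPTION"
KNOWN_FORMATS = ("luks", "luksv2", "plain")


def required_traits(extra_specs):
    """Return the set of traits a host must expose for this request.

    ``extra_specs`` is a plain dict of flavor extra specs. The
    ``hw:ephemeral_encryption_format`` key is hypothetical.
    """
    enabled = str(extra_specs.get("hw:ephemeral_encryption", "false"))
    if enabled.lower() != "true":
        # Fail fast: encryption was not requested, nothing to require.
        return set()

    traits = {TRAIT_PREFIX}
    fmt = extra_specs.get("hw:ephemeral_encryption_format")
    if fmt is not None:
        if fmt.lower() not in KNOWN_FORMATS:
            raise ValueError("unknown ephemeral encryption format: %s" % fmt)
        # e.g. "luks" -> COMPUTE_EPHEMERAL_ENCRYPTION_LUKS
        traits.add("%s_%s" % (TRAIT_PREFIX, fmt.upper()))
    return traits
```

Returning an empty set when encryption is not requested keeps the common scheduling path cheap.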

Introduce an ephemeral encryption request pre-filter

Expose ephemeral encryption attributes via block_device_info

root_device_name: The root device path used by the instance.
ephemerals: A list of DriverEphemeralBlockDevice dict objects detailing the ephemeral disks attached to the instance. Note this does not include the initial image based disk used by the instance that is classified as an ephemeral disk in terms of the ephemeral encryption feature.
block_device_mapping: A list of DriverVol*BlockDevice dict objects detailing the volume based disks attached to the instance.
swap: An optional DriverSwapBlockDevice dict object detailing the swap device.

For example:

{
    "root_device_name": "/dev/vda",
    "ephemerals": [
        {
            "guest_format": null,
            "device_name": "/dev/vdb",
            "device_type": "disk",
            "size": 1,
            "disk_bus": "virtio"
        }
    ],
    "block_device_mapping": [],
    "swap": {
        "swap_size": 1,
        "device_name": "/dev/vdc",
        "disk_bus": "virtio"
    }
}

Report that a disk is encrypted at rest through the metadata API

Extend the metadata API so that users can confirm that their ephemeral storage is encrypted at rest through the metadata API, accessible from within their instance.

{
    "devices": [
        {
            "type": "nic",
            "bus": "pci",
            "address": "0000:00:02.0",
            "mac": "00:11:22:33:44:55",
            "tags": ["trusted"]
        },
        {
            "type": "disk",
            "bus": "virtio",
            "address": "0:0",
            "serial": "12352423",
            "path": "/dev/vda",
            "encrypted": "True"
        },
        {
            "type": "disk",
            "bus": "ide",
            "address": "0:0",
            "serial": "disk-vol-2352423",
            "path": "/dev/sda",
            "tags": ["baz"]
        }
    ]
}

This should also be extended to cover disks provided by encrypted volumes but this is obviously out of scope for this implementation.
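
From inside a guest, the device metadata shown in the example could be checked along these lines. This is a sketch only: the helper name encrypted_disk_paths is made up, the sample document is a trimmed copy of the example, and how the guest retrieves the JSON is left out.

```python
import json


def encrypted_disk_paths(device_metadata):
    """Return the paths of disk devices reported as encrypted at rest."""
    paths = []
    for device in device_metadata.get("devices", []):
        if device.get("type") != "disk":
            continue
        # The example reports the flag as the string "True"; a missing
        # key is treated here as "not encrypted".
        if str(device.get("encrypted", "False")).lower() == "true":
            paths.append(device.get("path"))
    return paths


SAMPLE = json.loads("""
{
    "devices": [
        {"type": "nic", "bus": "pci", "mac": "00:11:22:33:44:55"},
        {"type": "disk", "bus": "virtio", "path": "/dev/vda",
         "encrypted": "True"},
        {"type": "disk", "bus": "ide", "path": "/dev/sda",
         "tags": ["baz"]}
    ]
}
""")
```

With the sample above, only /dev/vda is reported, since /dev/sda carries no encrypted key.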

Block resize between flavors with different hw:ephemeral_encryption settings

Provide a migration path from the legacy implementation

New nova-manage and nova-status commands will be introduced to migrate any instances using the legacy libvirt virt driver implementation ahead of the removal of this in a future release.

The nova-manage command will ensure that any existing instances with ephemeral_key_uuid set will have their associated BlockDeviceMapping records updated to reference said secret key, the plain encryption format and configured options on the host before clearing ephemeral_key_uuid.

The nova-status command will simply report on the existence of any instances with ephemeral_key_uuid set that do not have the corresponding BlockDeviceMapping attributes enabled etc.

Deprecate the now legacy implementation

The legacy implementation within the libvirt virt driver will be deprecated for removal in a future release once the ability to migrate is in place.

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

See above for the various flavor extra spec, image property, BlockDeviceMapping and DriverBlockDevice object changes.

REST API impact

Flavor extra spec and image property validation will be introduced for any provided ephemeral encryption options.
Attempts to resize between flavors that differ in their ephemeral encryption options will be rejected.
Attempts to rebuild between images that differ in their ephemeral encryption options will be allowed.
The metadata API will be changed to allow users to determine if their ephemeral storage is encrypted as discussed above.
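
The resize rule in the list above could be checked roughly as follows. This is a sketch under assumptions: it compares every extra spec whose key starts with hw:ephemeral_encryption verbatim, so an absent key and an explicit "false" count as different, which may not match the final implementation. Both helper names are hypothetical.

```python
EE_PREFIX = "hw:ephemeral_encryption"


def ephemeral_encryption_specs(extra_specs):
    """Pick out only the ephemeral encryption related extra specs."""
    return {
        key: value
        for key, value in extra_specs.items()
        if key.startswith(EE_PREFIX)
    }


def resize_allowed(old_extra_specs, new_extra_specs):
    """Reject a resize when the encryption related extra specs differ."""
    old = ephemeral_encryption_specs(old_extra_specs)
    new = ephemeral_encryption_specs(new_extra_specs)
    return old == new
```

Unrelated extra specs are ignored, so a resize that only changes CPU or memory settings is unaffected by this check.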

Security impact

The security impact should be positive: each disk gets a unique secret, and users have a visible choice in how their ephemeral storage is encrypted at rest.

Note

Internal base images stored locally in Nova will not be encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

The additional pre-filter will add a small amount of overhead when scheduling instances but this should fail fast if ephemeral encryption is not requested through the image or flavor.

The performance impact of increased use of ephemeral storage encryption by instances is left to be discussed in the virt driver specific specs as this will vary between hypervisors.

Other deployer impact

N/A

Developer impact

Virt driver developers will be able to indicate support for specific ephemeral storage encryption formats using the newly introduced compute compatibility traits.

Upgrade impact

The compute traits should ensure that requests to schedule instances using ephemeral storage encryption with mixed computes (N-1 and N) will work during a rolling upgrade.

Implementation

Assignee(s)

Primary assignee: melwitt
Other contributors: lyarwood

Feature Liaison

Feature liaison: melwitt

Work Items

Introduce hw_ephemeral_encryption* image properties and hw:ephemeral_encryption flavor extra specs.
Introduce new encrypted, encryption_secret_uuid, encryption_format and encryption_options attributes to the BlockDeviceMapping object.
Wire up the new BlockDeviceMapping object attributes through the Driver*BlockDevice layer and block_device_info dict.
Report ephemeral storage encryption through the metadata API.
Introduce new nova-manage and nova-status commands to allow existing users to migrate to this new implementation. This should however be blocked outside of testing until a virt driver implementation is landed.
Validate all of the above in functional tests ahead of any virt driver implementation landing.

Dependencies

None

Testing

At present without a virt driver implementation this will be tested entirely within our unit and functional test suites.

Once a virt driver implementation is available additional integration tests in Tempest and whitebox tests can be written.

Testing of the migration path from the legacy implementation will require an additional grenade job but this will require the libvirt virt driver implementation to be completed first.

Documentation Impact

The new host configurables, flavor extra specs and image properties should be documented.
New user documentation should be written covering the overall use of the feature from a Nova point of view.
Reference documentation around BlockDeviceMapping objects etc should be updated to make note of the new encryption attributes.

References

History

Optional section intended to be used each time the spec is updated to describe new design, API or database schema changes. Useful to let the reader understand what has happened over time.

Revisions
Release Name	Description
Wallaby	Introduced
Xena	Reproposed
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed
2023.2 Bobcat	Reproposed

Handling Reshaped Provider Trees

Wed, 05 Jul 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree

Virt drivers need to be able to change the structure of the provider trees they expose. When moving existing resources, existing allocations need to be moved along with the inventories. And this must be done in such a way as to avoid races where a second entity can create or remove allocations against the moving inventories.

Problem description

Use Cases

The libvirt driver currently inventories VGPU resources on the compute node provider. In order to exploit provider trees, libvirt needs to create one child provider per physical GPU and move the VGPU inventory from the compute node provider to these GPU child providers. In a live deployment where VGPU resources are already allocated to instances, the allocations need to be moved along with the inventories.
Drivers wishing to model NUMA must similarly create child providers and move inventory and allocations of several classes (processor, memory, VFs on NUMA-affined NICs, etc.) to those providers.
A driver is using a custom resource class. That resource class is added to the standard set (under a new, non-CUSTOM_ name). In order to use the standard name, the driver must move inventory and allocations from the old name to the new.

These are just example cases that may exist now or in the future. We’re describing a generic pivot system here.

Proposed change

The overall flow is as follows. The parts in red only happen when a reshape is needed. This represents the happy path on compute startup only.

Note that, for Fast-Forward Upgrades, the Resource Tracker lane is actually the Offline Upgrade Script.

SchedulerReportClient.get_allocations_for_provider_tree()

A new SchedulerReportClient method shall be implemented:

def get_allocations_for_provider_tree(self):
    """Retrieve allocation records associated with all providers in the
    provider tree.

    :returns: A dict, keyed by consumer UUID, of allocation records.
    """

A consumer isn’t always an instance (it may be a “migration” - or other things not created by Nova, in the future), so we can’t just use the instance list as the consumer list.

We can’t get all allocations for associated sharing providers because some of those will belong to consumers on other hosts.

So we have to discover all the consumers associated with the providers in the local tree:

for each "local" provider:
    GET /resource_providers/{provider.uuid}/allocations

We can’t use just those allocations because we would miss allocations for sharing providers. So we have to get all the allocations for just the consumers discovered above:

for each consumer in ^:
    GET /allocations/{consumer.uuid}

Note

We will still miss data if all of a consumer’s allocations live on sharing providers. I don’t have a good way to close that hole. But that scenario won’t happen in the near future, so it’ll be noted as a limitation via a code comment.

Return a dict, keyed by {consumer.uuid}, of the resulting allocation records. This is the form of the new Allocations Parameter expected by update_provider_tree() and update_from_provider_tree().
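
The two-pass discovery described above can be sketched as a standalone function. It is not the SchedulerReportClient method itself: the get callable (taking a URL path and returning the decoded JSON body) and the explicit list of local provider UUIDs are stand-ins introduced for the example. The sketch also assumes that GET /resource_providers/{uuid}/allocations returns an "allocations" dict keyed by consumer UUID.

```python
def get_allocations_for_provider_tree(get, local_provider_uuids):
    """Collect full allocation records for every consumer that has
    at least one allocation against a provider in the local tree.

    :param get: hypothetical callable, ``get(path) -> dict``.
    :param local_provider_uuids: UUIDs of providers in the local tree.
    :returns: dict keyed by consumer UUID of allocation records.
    """
    # Pass 1: discover consumers via the local providers only.
    consumers = set()
    for rp_uuid in local_provider_uuids:
        body = get("/resource_providers/%s/allocations" % rp_uuid)
        consumers.update(body["allocations"])

    # Pass 2: fetch each consumer's complete allocations, which also
    # picks up anything it holds against sharing providers.
    return {
        consumer: get("/allocations/%s" % consumer)
        for consumer in sorted(consumers)
    }
```

A consumer whose allocations live only on sharing providers never appears in pass 1, which is the limitation called out in the note above.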

ReshapeNeeded exception

A new exception, ReshapeNeeded, will be introduced. It is used as a signal from update_provider_tree() to indicate that a reshape must be performed. This is for performance reasons, so that we don’t call get_allocations_for_provider_tree() unless it’s necessary.

Changes to update_provider_tree()

Allocations Parameter

A new allocations keyword argument will be added to update_provider_tree():

def update_provider_tree(self, provider_tree, nodename, allocations=None):

If None, the update_provider_tree() method must not perform a reshape. If it decides a reshape is necessary, it must raise the new ReshapeNeeded exception.

When not None, the allocations argument is a dict, keyed by consumer UUID, of allocation records of the form:

{ $CONSUMER_UUID: {
      # NOTE: The shape of each "allocations" dict below is identical to the
      # return from GET /allocations/{consumer_uuid}...
      "allocations": {
          $RP_UUID: {
              "generation": $RP_GEN,
              "resources": {
                  $RESOURCE_CLASS: $AMOUNT,
                  ...
              },
          },
          ...
      },
      "project_id": $PROJ_ID,
      "user_id": $USER_ID,
      # ...except for this, which is coming in bp/add-consumer-generation
      "consumer_generation": $CONSUMER_GEN,
  },
  ...
}

If update_provider_tree() is moving allocations, it must edit the allocations dict in place.

Note

I don’t love the idea of the method editing the dict in place rather than returning a copy, but it’s consistent with how we’re handling the provider_tree arg.
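
As an illustration of editing the allocations dict in place, the helper below moves one resource class from one provider to another for every consumer. The function name move_allocations and its arguments are hypothetical. A real driver would also have to adjust the provider inventories, which this sketch leaves out.

```python
def move_allocations(allocations, resource_class, from_rp, to_rp):
    """Move ``resource_class`` from ``from_rp`` to ``to_rp`` in place.

    ``allocations`` has the shape described above: a dict keyed by
    consumer UUID whose values contain an "allocations" dict keyed by
    resource provider UUID.
    """
    for record in allocations.values():
        per_rp = record["allocations"]
        source = per_rp.get(from_rp)
        if not source or resource_class not in source["resources"]:
            continue  # this consumer holds nothing to move

        amount = source["resources"].pop(resource_class)
        target = per_rp.setdefault(to_rp, {"resources": {}})
        target["resources"][resource_class] = (
            target["resources"].get(resource_class, 0) + amount)

        # Drop the source provider entry if nothing is left on it.
        if not source["resources"]:
            del per_rp[from_rp]
```

Nothing is returned: the caller keeps its reference to the same dict and later hands it to update_from_provider_tree().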

Virt Drivers

Virt drivers currently overriding update_provider_tree() will need to change the signature to accommodate the new parameter. That work will be done within the scope of this blueprint.

As virt drivers begin to model resources in nested providers, their implementations will need to:

determine whether a reshape is necessary and raise ReshapeNeeded as appropriate;
perform the reshape by processing provider inventories and the specified allocations.

That work is outside the scope of this blueprint.
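
The two driver responsibilities listed above might look roughly like the following. Everything here is a simplified stand-in: the provider tree is a plain dict of provider name to {resource class: total} rather than a ProviderTree object, providers are identified by name rather than UUID, ReshapeNeeded is redefined locally, and the child provider name is invented.

```python
class ReshapeNeeded(Exception):
    """Local stand-in for the exception proposed in this spec."""


class SketchDriver:
    """Hypothetical driver moving VGPU inventory to a child provider."""

    def update_provider_tree(self, provider_tree, nodename,
                             allocations=None):
        root = provider_tree[nodename]
        if "VGPU" not in root:
            return  # already in the new shape, nothing to do

        if allocations is None:
            # We may not reshape without allocations; ask the caller
            # to gather them and call us again.
            raise ReshapeNeeded()

        # Move the inventory to the (invented) child provider.
        child = nodename + "_gpu0"
        provider_tree.setdefault(child, {})["VGPU"] = root.pop("VGPU")

        # Move the matching allocations, editing the dict in place.
        for record in allocations.values():
            per_rp = record["allocations"]
            resources = per_rp.get(nodename, {}).get("resources", {})
            if "VGPU" in resources:
                child_alloc = per_rp.setdefault(child, {"resources": {}})
                child_alloc["resources"]["VGPU"] = resources.pop("VGPU")
```

The first call without allocations raises, the second call with allocations performs the move, and any later call is a no-op because the root provider no longer holds VGPU inventory.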

Changes to update_from_provider_tree()

The SchedulerReportClient.update_from_provider_tree() method is changed to accept a new parameter allocations:

def update_from_provider_tree(self, context, new_tree, allocations):
    """Flush changes from a specified ProviderTree back to placement.

    ...

    ...
    :param allocations: A dict, keyed by consumer UUID, of allocation records
            of the form returned by GET /allocations/{consumer_uuid}. The
            dict must represent the comprehensive final picture of the
            allocations for each consumer therein. A value of None indicates
            that no reshape is being performed.
    ...
    """

When allocations is None, the behavior of update_from_provider_tree() is as it was previously (in Queens).

Changes to Resource Tracker _update()

The _update() method will get a new parameter, startup, which is percolated down from update_available_resource().

Where update_provider_tree() and update_from_provider_tree() are currently invoked, the code flow will be changed to approximately:

# allocs stays None unless a reshape is performed below.
allocs = None
try:
    self.driver.update_provider_tree(prov_tree, nodename)
except exception.ReshapeNeeded:
    if not startup:
        # Treat this like a regular exception during periodic
        raise
    LOG.info("Performing resource provider inventory and "
             "allocation data migration during compute service "
             "startup or FFU.")
    allocs = reportclient.get_allocations_for_provider_tree()
    self.driver.update_provider_tree(prov_tree, nodename,
                                     allocations=allocs)
...
reportclient.update_from_provider_tree(context, prov_tree, allocs)

Changes to _update_available_resource_for_node()

This is currently where all exceptions for the Resource Tracker _update() periodic task are caught, logged, and otherwise ignored.

We will add a new parameter, startup, percolated down from update_available_resource(), and a new except clause of the form:

except exception.ResourceProviderUpdateFailed:
    if startup:
        # Kill the compute service.
        raise
    # Else log a useful exception reporting what happened and maybe even how
    # to fix it; and then carry on.

The purpose of this is to make exceptions in update_from_provider_tree() fatal on startup only.

Placement POST /reshaper

In a new placement microversion, a new POST /reshaper operation will be introduced. The payload is of the form:

{
  "inventories": {
    $RP_UUID: {
      # This is the exact payload format for
      # PUT /resource_providers/$RP_UUID/inventories.
      # It should represent the final state of the entire set of resources
      # for this provider. In particular, omitting a $RC dict will cause the
      # inventory for that resource class to be deleted if previously present.
      "inventories": { $RC: { <total, reserved, etc.> } },
      "resource_provider_generation": <gen of this RP>,
    },
    $RP_UUID: { ... },
  },
  "allocations": {
    # This is the exact payload format for POST /allocations
    $CONSUMER_UUID: {
      "project_id": $PROJ_ID,
      "user_id": $USER_ID,
      # This field is part of the consumer generation series under review,
      # not yet in the published POST /allocations payload.
      "consumer_generation": $CONSUMER_GEN,
      "allocations": {
        $RP_UUID: {
          "resources": { $RC: $AMOUNT, ... }
        },
        $RP_UUID: { ... }
      }
    },
    $CONSUMER_UUID: { ... }
  }
}

In a single atomic transaction, placement replaces the inventories for each $RP_UUID in the inventories dict; and replaces the allocations for each $CONSUMER_UUID in the allocations dict.

Return values:

204 No Content on success.
409 Conflict on any provider or consumer generation conflict; or if a concurrent transaction is detected. Appropriate error codes should be used for at least the former so the caller can tell whether a fresh GET is necessary before recalculating the necessary reshapes and retrying the operation.
400 Bad Request on any other failure.
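
A sketch of how a caller might assemble the request body and react to the return values listed above. The function names, the argument shapes, and the three action strings are assumptions made for the example. The per-provider "generation" key found in gathered allocation records is dropped because the payload's allocation entries carry only "resources".

```python
def build_reshaper_payload(inventories_by_rp, generations_by_rp,
                           allocations):
    """Assemble a POST /reshaper body from already reshaped data."""
    return {
        "inventories": {
            rp_uuid: {
                "inventories": inventories,
                "resource_provider_generation": generations_by_rp[rp_uuid],
            }
            for rp_uuid, inventories in inventories_by_rp.items()
        },
        "allocations": {
            consumer: {
                "project_id": record["project_id"],
                "user_id": record["user_id"],
                "consumer_generation": record.get("consumer_generation"),
                "allocations": {
                    rp_uuid: {"resources": alloc["resources"]}
                    for rp_uuid, alloc in record["allocations"].items()
                },
            }
            for consumer, record in allocations.items()
        },
    }


def next_step(status_code):
    """Map a POST /reshaper status code to a caller action."""
    if status_code == 204:
        return "done"
    if status_code == 409:
        # Generation conflict or concurrent transaction: refresh the
        # data, recalculate the reshape and try again.
        return "refresh-and-retry"
    return "fail"
```

Because the payload describes the complete end state, a retry after a 409 rebuilds the whole body from fresh data rather than patching the previous one.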

Direct Interface to Placement

To make the Offline Upgrade Script possible, we need to make placement accessible by importing Python code rather than as a standalone service. The quickest path forward is to use wsgi-intercept to allow HTTP interactions, using the requests library, to work with only database traffic going over the network. This allows client code to make changes to the placement data store using the same API, but without running a placement service.

An implementation of this, as a context manager called PlacementDirect, has been merged. The context manager accepts an oslo config, populated by the caller. This allows the calling code to control how it wishes to discover configuration settings, most importantly the database being used by placement.

This implementation provides a quick solution to the immediate needs of offline use of Placement POST /reshaper while allowing options for prettier solutions in the future.

Offline Upgrade Script

To facilitate Fast Forward Upgrades, we will provide a script that can perform this reshaping while all services (except databases) are offline. It will look like:

nova-manage placement migrate_compute_inventory

…and operate as follows, for each nodename (one, except for ironic) on the host:

Spin up a SchedulerReportClient with a Direct Interface to Placement.
Retrieve a ProviderTree via SchedulerReportClient.get_provider_tree_and_ensure_root().
Instantiate the appropriate virt driver.
Perform the algorithm noted in Resource Tracker _update(), as if startup is True.

We may refer to https://review.openstack.org/#/c/501025/ for an example of an upgrade script that requires a virt driver.

Alternatives

Reshaper API

Alternatives to Placement POST /reshaper were discussed in the mailing list thread, the etherpad, IRC, hangout, etc. They included:

Don’t have an atomic placement operation - do the necessary operations one at a time from the resource tracker. Rejected due to race conditions: the scheduler can schedule against the moving inventories, based on incorrect capacity information due to the moving allocations.
“Lock” the moving inventories - either by providing a locking API or by setting reserved = total - while the resource tracker does the reshape. Rejected because it’s a hack; and because recovery from partial failures would be difficult.
“Merge” forms of the new placement operation:
- PATCH (or POST) with RFC 6902-style "operation", "path"[, "from", "value"] instructions.
- PATCH (or POST) with RFC 7396 semantics. The JSON payload would look like a sparse version of that described in Placement POST /reshaper, but with only changes included.
Other payload formats for the placement operation (see the etherpad). We chose the one we did because it reuses existing payload syntax (and may therefore be able to reuse code) and it provides a full specification of the expected end state, which is RESTy.

Direct Placement

Alternatives to the wsgi-intercept model for the Direct Interface to Placement:

Directly access the object methods (with some refactoring/cleanup). Rejected because we lose things like schema validation and microversion logic.
Create cleaner, pythonic wrappers around those object methods. Rejected (in the short term) for the sake of expediency. We might take this approach longer-term as/when the demand for direct placement expands beyond FFU scripting.
Use wsgi-intercept but create the pythonic wrappers outside of the REST layer. This is also a long-term option.

Reshaping Via update_provider_tree()

We considered passing allocations to update_provider_tree() every time, but gathering the allocations will be expensive, so we needed a way to do it only when necessary. Enter ReshapeNeeded exception.
We considered running the check-and-reshape-if-needed algorithm on every periodic interval, but decided we should never need to do a reshape except on startup.

Data model impact

None.

REST API impact

See Placement POST /reshaper.

Security impact

None.

Notifications impact

None.

Other end user impact

See Upgrade Impact.

Performance Impact

The new Placement POST /reshaper operation has the potential to be slow, and to lock several tables. Its use should be restricted to reshaping provider trees. Initially we may use the reshaper from update_from_provider_tree() even if no reshape is being performed; but if this is found to be problematic for performance, we can restrict it to only reshape scenarios, which will be very rare.

Gathering allocations, particularly in large deployments, has the potential to be heavy and slow, so we only do this at compute startup, and then only if update_provider_tree() indicates that a reshape is necessary.

Other deployer impact

See Upgrade Impact.

Developer impact

See Virt Drivers.

Upgrade impact

Live upgrades are covered. The Resource Tracker _update() flow will run on compute start and perform the reshape as necessary. Since we do not support skipping releases on live upgrades, any virt driver-specific changes can be removed from one release to the next.

The Offline Upgrade Script is provided for Fast-Forward Upgrade. Since code is run with each release’s codebase for each step in the FFU, any virt driver-specific changes can be removed from one release to the next. Note, however, that the script must always be run since only the virt driver, running on a specific compute, can determine whether a reshape is required for that compute. (If no reshape is necessary, the script is a no-op.)

Implementation

Assignee(s)

Placement POST /reshaper: jaypipes (SQL-fu), cdent (API plumbing)
Direct Interface to Placement: cdent
Report client, resource tracker, virt driver parity: efried
Offline Upgrade Script: dansmith
Reviews and general heckling: mriedem, bauzas, gibi, edleafe, alex_xu

Work Items

See Proposed change.

Dependencies

Testing

Functional test enhancements for everyone, including gabbi tests for Placement POST /reshaper.

Live testing in Xen (naichuans) and libvirt (bauzas) via their VGPU work.

Documentation Impact

Placement POST /reshaper (placement API reference)
Offline Upgrade Script (nova-manage db)

References

Consumer Generations spec
Nested Resource Providers - Allocation Candidates
Placement reshaper API discussion etherpad
Upgrade concerns… mailing list thread
RFC 6902 (PATCH with json-patch+json)
RFC 7396 (PATCH with merge-patch+json)
nova-manage db migration helper docs
wsgi-intercept
Python requests
PlacementDirect implementation
oslo config library

History

Revisions
Release Name	Description
Rocky	Introduced

Handling Reshaped Provider Trees

Wed, 05 Jul 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree

Problem description

Use Cases

The libvirt driver currently inventories VGPU resources on the compute node provider. In order to exploit provider trees, libvirt needs to create one child provider per physical GPU and move the VGPU inventory from the compute node provider to these GPU child providers. In a live deployment where VGPU resources are already allocated to instances, the allocations need to be moved along with the inventories.
Drivers wishing to model NUMA must similarly create child providers and move inventory and allocations of several classes (processor, memory, VFs on NUMA-affined NICs, etc.) to those providers.
A driver is using a custom resource class. That resource class is added to the standard set (under a new, non-CUSTOM_ name). In order to use the standard name, the driver must move inventory and allocations from the old name to the new.

These are just example cases that may exist now or in the future. We’re describing a generic pivot system here.

Proposed change

The overall flow is as follows. The parts in red only happen when a reshape is needed. This represents the happy path on compute startup only.

Note that, for Fast-Forward Upgrades, the Resource Tracker lane is actually the Offline Upgrade Script.

SchedulerReportClient.get_allocations_for_provider_tree()

A new SchedulerReportClient method shall be implemented:

def get_allocations_for_provider_tree(self, context, nodename):
    """Retrieve allocation records associated with all providers in the
    provider tree.

    :param context: The security context
    :param nodename: The name of a node for whose tree we are getting
            allocations.
    :returns: A dict, keyed by consumer UUID, of allocation records:
            { $CONSUMER_UUID: {
                  # The shape of each "allocations" dict below is identical
                  # to the return from GET /allocations/{consumer_uuid}
                  "allocations": {
                      $RP_UUID: {
                          "generation": $RP_GEN,
                          "resources": {
                              $RESOURCE_CLASS: $AMOUNT,
                              ...
                          },
                      },
                      ...
                  },
                  "project_id": $PROJ_ID,
                  "user_id": $USER_ID,
                  "consumer_generation": $CONSUMER_GEN,
              },
              ...
            }
    """

A consumer isn’t always an instance (it may be a “migration” - or other things not created by Nova, in the future), so we can’t just use the instance list as the consumer list.

We can’t get all allocations for associated sharing providers because some of those will belong to consumers on other hosts.

So we have to discover all the consumers associated with the providers in the “local” tree (identified by nodename):

for each "local" provider:
    GET /resource_providers/{provider.uuid}/allocations

We can’t use just those allocations because we would miss allocations for sharing providers. So we have to get all the allocations for just the consumers discovered above:

for each consumer in ^:
    GET /allocations/{consumer.uuid}

Note

Return a dict, keyed by {consumer.uuid}, of the resulting allocation records. This is the form of the new Allocations Parameter expected by update_provider_tree and update_from_provider_tree.

ReshapeNeeded exception

A new exception, ReshapeNeeded, will be introduced. It is used as a signal from update_provider_tree to indicate that a reshape must be performed. This is for performance reasons, so that we don’t call get_allocations_for_provider_tree unless it’s necessary.

ReshapeFailed exception

A new exception, ReshapeFailed, will be introduced. It is raised from update_from_provider_tree only when reshaping is needed, attempted, and unsuccessful (i.e. when the Placement POST /reshaper call fails). This is so we can trap it explicitly in _update_available_resource_for_node and kill the compute service.
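
A minimal sketch of where ReshapeFailed would be raised. The function name flush_with_optional_reshape and the post callable (taking a path and a JSON-serializable body and returning the HTTP status code) are stand-ins for illustration. The real update_from_provider_tree does considerably more when no reshape is requested.

```python
class ReshapeFailed(Exception):
    """Local stand-in for the exception proposed in this spec."""


def flush_with_optional_reshape(post, inventories, allocations):
    """Call POST /reshaper only when allocations were supplied.

    :param post: hypothetical callable, ``post(path, body) -> int``.
    :returns: True if a reshape was performed, False otherwise.
    :raises ReshapeFailed: if the reshaper call does not return 204.
    """
    if allocations is None:
        # No reshape requested; the ordinary per-provider flush
        # would happen here instead.
        return False

    body = {"inventories": inventories, "allocations": allocations}
    status = post("/reshaper", body)
    if status != 204:
        raise ReshapeFailed("POST /reshaper returned %d" % status)
    return True
```

Raising a dedicated exception, rather than a generic update failure, is what lets the caller single out the reshape case and stop the compute service.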

Changes to update_provider_tree()

Allocations Parameter

A new allocations keyword argument will be added to update_provider_tree():

def update_provider_tree(self, provider_tree, nodename, allocations=None):

If None, the update_provider_tree() method must not perform a reshape. If it decides a reshape is necessary, it must raise the new ReshapeNeeded exception.

When not None, the allocations argument is a dict, keyed by consumer UUID, of allocation records of the form:

{ $CONSUMER_UUID: {
      # NOTE: The shape of each "allocations" dict below is identical to the
      # return from GET /allocations/{consumer_uuid}...
      "allocations": {
          $RP_UUID: {
              "generation": $RP_GEN,
              "resources": {
                  $RESOURCE_CLASS: $AMOUNT,
                  ...
              },
          },
          ...
      },
      "project_id": $PROJ_ID,
      "user_id": $USER_ID,
      # ...except for this, which is coming in bp/add-consumer-generation
      "consumer_generation": $CONSUMER_GEN,
  },
  ...
}

If update_provider_tree() is moving allocations, it must edit the allocations dict in place.

Note

I don’t love the idea of the method editing the dict in place rather than returning a copy, but it’s consistent with how we’re handling the provider_tree arg.

Virt Drivers

Virt drivers currently overriding update_provider_tree() will need to change the signature to accommodate the new parameter. That work will be done within the scope of this blueprint.

As virt drivers begin to model resources in nested providers, their implementations will need to:

determine whether a reshape is necessary and raise ReshapeNeeded as appropriate;
perform the reshape by processing provider inventories and the specified allocations.

That work is outside the scope of this blueprint.

Changes to update_from_provider_tree()

The SchedulerReportClient.update_from_provider_tree() method is changed to accept a new parameter allocations:

def update_from_provider_tree(self, context, new_tree, allocations):
    """Flush changes from a specified ProviderTree back to placement.

    ...

    ...
    :param allocations: A dict, keyed by consumer UUID, of allocation records
            of the form returned by GET /allocations/{consumer_uuid}. The
            dict must represent the comprehensive final picture of the
            allocations for each consumer therein. A value of None indicates
            that no reshape is being performed.
    ...
    """

When allocations is None, the behavior of update_from_provider_tree() is as it was previously (in Queens).

Changes to Resource Tracker _update()

The _update() method will get a new parameter, startup, which is percolated down from update_available_resource().

Where update_provider_tree and update_from_provider_tree are currently invoked, the code flow will be changed to approximately:

allocs = None
try:
    self.driver.update_provider_tree(prov_tree, nodename)
except exception.ReshapeNeeded:
    if not startup:
        # This isn't supposed to happen during periodic, so raise
        # it up; the compute manager will treat it specially,
        # killing the compute service.
        raise
    LOG.info("Performing resource provider inventory and "
             "allocation data migration during compute service "
             "startup or FFU.")
    allocs = reportclient.get_allocations_for_provider_tree()
    self.driver.update_provider_tree(prov_tree, nodename,
                                     allocations=allocs)
...
reportclient.update_from_provider_tree(context, prov_tree, allocs)
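
The flow above can be exercised with stand-ins. The following is only a sketch: FakeDriver, the dict-based provider tree, and the get_allocations callable are assumptions made for illustration, and the exception class is a local placeholder for the one this spec proposes.

```python
class ReshapeNeeded(Exception):
    """Local stand-in for the exception this spec proposes."""


class FakeDriver:
    """Toy driver that needs a reshape until it is handed allocations."""

    def update_provider_tree(self, prov_tree, nodename, allocations=None):
        if "child" in prov_tree:
            return  # already reshaped, nothing to do
        if allocations is None:
            raise ReshapeNeeded()
        # Move the VGPU inventory from the root to a child provider.
        prov_tree["child"] = {"VGPU": prov_tree["root"].pop("VGPU")}


def update(driver, prov_tree, nodename, get_allocations, startup):
    """Return the allocations to flush, or None if no reshape happened."""
    allocs = None
    try:
        driver.update_provider_tree(prov_tree, nodename)
    except ReshapeNeeded:
        if not startup:
            raise  # periodic run: let the caller kill the service
        allocs = get_allocations()
        driver.update_provider_tree(prov_tree, nodename,
                                    allocations=allocs)
    return allocs


tree = {"root": {"VCPU": 8, "VGPU": 4}}
flushed = update(FakeDriver(), tree, "node1", lambda: {}, startup=True)
```

Allocations are fetched only on the exception path, which keeps the common no-reshape case cheap.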

Changes to _update_available_resource_for_node()

This is currently where all exceptions for the Resource Tracker _update periodic task are caught, logged, and otherwise ignored. We will add new except conditions for reshape-related exceptions that will actually blow up the compute service (i.e. not log-and-otherwise-ignore). These exceptions should only legitimately reach this method on startup.

Placement POST /reshaper

In a new placement microversion, a new POST /reshaper operation will be introduced. The payload is of the form:

{
  "inventories": {
    $RP_UUID: {
      # This is the exact payload format for
      # PUT /resource_providers/$RP_UUID/inventories.
      # It should represent the final state of the entire set of resources
      # for this provider. In particular, omitting a $RC dict will cause the
      # inventory for that resource class to be deleted if previously present.
      "inventories": { $RC: { <total, reserved, etc.> } },
      "resource_provider_generation": <gen of this RP>,
    },
    $RP_UUID: { ... },
  },
  "allocations": {
    # This is the exact payload format for POST /allocations
    $CONSUMER_UUID: {
      "project_id": $PROJ_ID,
      "user_id": $USER_ID,
      # This field is part of the consumer generation series under review,
      # not yet in the published POST /allocations payload.
      "consumer_generation": $CONSUMER_GEN,
      "allocations": {
        $RP_UUID: {
          "resources": { $RC: $AMOUNT, ... }
        },
        $RP_UUID: { ... }
      }
    },
    $CONSUMER_UUID: { ... }
  }
}

In a single atomic transaction, placement replaces the inventories for each $RP_UUID in the inventories dict; and replaces the allocations for each $CONSUMER_UUID in the allocations dict.

Return values:

204 No Content on success.
409 Conflict on any provider or consumer generation conflict; or if a concurrent transaction is detected. Appropriate error codes should be used for at least the former so the caller can tell whether a fresh GET is necessary before recalculating the necessary reshapes and retrying the operation.
400 Bad Request on any other failure.
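
To make the all-or-nothing semantics concrete, here is a minimal in-memory sketch. It is not the placement implementation: the store layout and the reshape function are assumptions for illustration, consumer generations are ignored, and payload validation is reduced to a check for the two top-level keys.

```python
import copy


def reshape(store, payload):
    """Apply a reshaper-style payload to an in-memory store.

    Returns an HTTP-like status code. ``store`` is a hypothetical stand-in
    for the placement database:
    {"providers": {rp_uuid: {"generation": int, "inventories": {...}}},
     "allocations": {consumer_uuid: {rp_uuid: {"resources": {...}}}}}
    """
    try:
        inventories = payload["inventories"]
        allocations = payload["allocations"]
    except (KeyError, TypeError):
        return 400
    staged = copy.deepcopy(store)  # work on a copy: all or nothing
    for rp_uuid, body in inventories.items():
        provider = staged["providers"].get(rp_uuid)
        if provider is None:
            return 400
        if provider["generation"] != body["resource_provider_generation"]:
            return 409  # generation conflict: caller must GET and retry
        provider["inventories"] = body["inventories"]
        provider["generation"] += 1
    for consumer_uuid, body in allocations.items():
        staged["allocations"][consumer_uuid] = body["allocations"]
    store.clear()
    store.update(staged)  # "commit" only after every check has passed
    return 204


store = {
    "providers": {
        "root": {"generation": 1,
                 "inventories": {"VCPU": {"total": 8},
                                 "VGPU": {"total": 4}}},
        "child": {"generation": 0, "inventories": {}},
    },
    "allocations": {"c1": {"root": {"resources": {"VCPU": 2, "VGPU": 1}}}},
}
payload = {
    "inventories": {
        "root": {"inventories": {"VCPU": {"total": 8}},
                 "resource_provider_generation": 1},
        "child": {"inventories": {"VGPU": {"total": 4}},
                  "resource_provider_generation": 0},
    },
    "allocations": {
        "c1": {"project_id": "p", "user_id": "u",
               "allocations": {"root": {"resources": {"VCPU": 2}},
                               "child": {"resources": {"VGPU": 1}}}},
    },
}
first = reshape(store, payload)
second = reshape(store, payload)  # same generations again: now stale
```

The second call fails with a conflict and leaves the store untouched, which is the behavior a caller relies on when deciding whether to re-GET and retry.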

Direct Interface to Placement

This implementation provides a quick solution to the immediate needs of offline use of Placement POST /reshaper while allowing options for prettier solutions in the future.

Offline Upgrade Script

To facilitate Fast Forward Upgrades, we will provide a script that can perform this reshaping while all services (except databases) are offline. It will look like:

nova-manage placement migrate_compute_inventory

…and operate as follows, for each nodename (one, except for ironic) on the host:

Spin up a SchedulerReportClient with a Direct Interface to Placement.
Retrieve a ProviderTree via SchedulerReportClient.get_provider_tree_and_ensure_root().
Instantiate the appropriate virt driver.
Perform the algorithm noted in Resource Tracker _update, as if startup is True.

We may refer to https://review.openstack.org/#/c/501025/ for an example of an upgrade script that requires a virt driver.

Alternatives

Reshaper API

Alternatives to Placement POST /reshaper were discussed in the mailing list thread, the etherpad, IRC, hangout, etc. They included:

Don’t have an atomic placement operation - do the necessary operations one at a time from the resource tracker. Rejected due to race conditions: the scheduler can schedule against the moving inventories, based on incorrect capacity information due to the moving allocations.
“Lock” the moving inventories - either by providing a locking API or by setting reserved = total - while the resource tracker does the reshape. Rejected because it’s a hack; and because recovery from partial failures would be difficult.
“Merge” forms of the new placement operation:
- PATCH (or POST) with RFC 6902-style "operation", "path"[, "from", "value"] instructions.
- PATCH (or POST) with RFC 7396 semantics. The JSON payload would look like a sparse version of that described in Placement POST /reshaper, but with only changes included.
Other payload formats for the placement operation (see the etherpad). We chose the one we did because it reuses existing payload syntax (and may therefore be able to reuse code) and it provides a full specification of the expected end state, which is RESTy.

Direct Placement

Alternatives to the wsgi-intercept model for the Direct Interface to Placement:

Directly access the object methods (with some refactoring/cleanup). Rejected because we lose things like schema validation and microversion logic.
Create cleaner, pythonic wrappers around those object methods. Rejected (in the short term) for the sake of expediency. We might take this approach longer-term as/when the demand for direct placement expands beyond FFU scripting.
Use wsgi-intercept but create the pythonic wrappers outside of the REST layer. This is also a long-term option.

Reshaping Via update_provider_tree()

We considered passing allocations to update_provider_tree every time, but gathering the allocations will be expensive, so we needed a way to do it only when necessary. Enter the ReshapeNeeded exception.
We considered running the check-and-reshape-if-needed algorithm on every periodic interval, but decided we should never need to do a reshape except on startup.

Data model impact

None.

REST API impact

See Placement POST /reshaper.

Security impact

None.

Notifications impact

None.

Other end user impact

See Upgrade Impact.

Performance Impact

The new Placement POST /reshaper operation has the potential to be slow, and to lock several tables. Its use should be restricted to reshaping provider trees. Initially we may use the reshaper from update_from_provider_tree even if no reshape is being performed; but if this is found to be problematic for performance, we can restrict it to only reshape scenarios, which will be very rare.

Gathering allocations, particularly in large deployments, has the potential to be heavy and slow, so we only do this at compute startup, and then only if update_provider_tree indicates that a reshape is necessary.

Other deployer impact

See Upgrade Impact.

Developer impact

See Virt Drivers.

Upgrade impact

Live upgrades are covered. The Resource Tracker _update flow will run on compute start and perform the reshape as necessary. Since we do not support skipping releases on live upgrades, any virt driver-specific changes can be removed from one release to the next.

Implementation

Assignee(s)

Placement POST /reshaper: jaypipes (SQL-fu), cdent (API plumbing)
Direct Interface to Placement: cdent
Report client, resource tracker, virt driver parity: efried
Offline Upgrade Script: dansmith
Reviews and general heckling: mriedem, bauzas, gibi, edleafe, alex_xu

Work Items

See Proposed change.

Dependencies

Testing

Functional test enhancements for everyone, including gabbi tests for Placement POST /reshaper.

Live testing in Xen (naichuans) and libvirt (bauzas) via their VGPU work.

Documentation Impact

Placement POST /reshaper (placement API reference)
Offline Upgrade Script (nova-manage db)

References

Consumer Generations spec
Nested Resource Providers - Allocation Candidates
Placement reshaper API discussion etherpad
Upgrade concerns… mailing list thread
RFC 6902 (PATCH with json-patch+json)
RFC 7396 (PATCH with merge-patch+json)
nova-manage db migration helper docs
wsgi-intercept
Python requests
PlacementDirect implementation
oslo config library

History

Revisions
Release Name	Description
Rocky	Introduced
Stein	Reproposed

Nova - Cyborg Interaction

Wed, 05 Jul 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/nova-cyborg-interaction

This specification describes the Nova - Cyborg interaction needed to create and manage instances with accelerators, and the changes needed in Nova to accomplish that.

Problem description

Scope

Representation: Cyborg shall represent devices as nested resource providers under the compute node (except possibly for disaggregated servers), accelerator types as resource classes and accelerators as inventory in Placement. The properties needed for scheduling are represented as traits. This is specified by [1]. This spec does not dwell on this topic.
Discovery and Updates: Among the devices discovered in a host, Cyborg intends to claim only those that are not included under the PCI Whitelisting mechanism. Cyborg shall update Placement in a way that is compatible with the virt driver’s update of Placement. These aspects are addressed in sections Coexistence with PCI whitelists and Placement update respectively.
User requests for accelerators: Users usually request compute resources via flavors. However, since the requests for devices may be highly varied, placing them in flavors may result in flavor explosion. We avoid that by expressing device requests in a device profile [2]. The relationship between device profiles and flavors is explored in Section User requests.

When an instance creation (boot) request is made, the contents of a device profile shall be translated to request groups in the request spec; the syntax in request groups is covered in Section Updating the Request Spec.
Instance scheduling: Nova shall use the Placement data populated by Cyborg to schedule instances. This spec does not dwell on this topic.
Assignment of accelerators: We introduce the concept of Accelerator Request objects in Section Accelerator Requests. The workflow to create and use them is summarized in Section Nova changes for Assignment workflow. The same section also highlights the Nova changes needed. The details of the Cyborg API implementation for this workflow are covered in Cyborg specs ([3]).
Instance operations: The behavior with respect to accelerators for all standard instance operations is defined in [4]. This spec does not dwell on this topic.

Use Cases

A user requests an instance with one or more accelerators of different types assigned to it.
An operator may provide users with both Device as a Service and Accelerated Function as a Service in the same cluster (see [1]).

The following use cases are not addressed in Train but are of long term interest:

A user requests to add one or more accelerators to an existing instance.
Live migration with accelerators.

Proposed change

Coexistence with PCI whitelists

Ideally, there should be a single way for the operator to identify which PCI devices should be claimed by Nova and which by Cyborg. This could be along the lines suggested in [5] or [6]. If such a mechanism could be agreed upon by all stakeholders, Cyborg could adopt it.

Until that point, the operator tells Cyborg which devices to claim by using Cyborg’s configuration file. The operator must ensure that this is compatible with the PCI whitelists configured in Nova.

Placement update

Cyborg shall call Placement API directly to represent devices and accelerators. Some of the intended use cases for the API invocation are:

Create or delete child RPs under the compute node RP.
Create or delete custom RCs and custom traits.
Associate traits with RPs or remove such association.
Update RP inventory.

Cyborg shall not modify the RPs created by any other component, such as Nova virt drivers.

User requests

The user request for accelerators is encapsulated in a device profile [2], which is created and managed by the admin via the Cyborg API.

In the initial phase, the Nova API remains as it is today. The device profile is folded into the flavor as an extra spec by the operator, as below:

openstack flavor set --property 'accel:device_profile=<profile_name>' flavor

Thus the standard Nova API can be used to create an instance with only the flavor (without device profiles), like this:

openstack server create --flavor f ....  # instance creation

In the future, device profile may be used by itself to specify accelerator resources for the instance creation API.

Updating the Request Spec

When the device profile request groups are added to other request groups in the flavor, the group_policy of the flavor shall govern the overall semantics of all request groups.

Accelerator Requests

Each ARQ needs to be matched to the specific RP in the allocation candidate that Nova has chosen, before the ARQ is bound. Since Placement does not match RPs to request groups, this must be done in the Cyborg client module of Nova (cyborg-client-module). The matching is done using the requester_id field in the RequestGroup object ([8]) as below:

The order of request groups in a device profile is not significant, but it is preserved by Cyborg. Thus, each device profile request group has a unique index.
When the device profile request groups returned by Cyborg are added to the request spec, the requester_id field is set to ‘device_profile_<N>’ for the N-th device profile request group (starting from zero). The device profile name need not be included here because there is only one device profile per request spec.
When Cyborg creates an ARQ for a device profile, it embeds the device profile request group index in the ARQ before returning it to Nova.
The matching is done in two steps:
- Each ARQ is mapped to a specific request group in the request spec using the requester_id field.
- Each request group is mapped to a specific RP using the same logic as the Neutron bandwidth provider ([9]).
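
The two-step matching can be sketched as follows. The data shapes are assumptions made for illustration: the ARQ field name device_profile_group_id, the plain-dict request groups, and the group_to_rp mapping (standing in for the result of the second step) are not taken from this spec.

```python
def match_arqs_to_rps(arqs, request_groups, group_to_rp):
    """Map each ARQ UUID to the RP UUID chosen for its request group.

    Hypothetical data shapes:
    - arqs: [{"uuid": ..., "device_profile_group_id": N}, ...]
    - request_groups: [{"requester_id": "device_profile_<N>"}, ...]
    - group_to_rp: {requester_id: rp_uuid}, the outcome of step two.
    """
    known = {group["requester_id"] for group in request_groups}
    mapping = {}
    for arq in arqs:
        # Step one: rebuild the requester_id from the embedded group index.
        requester_id = "device_profile_%d" % arq["device_profile_group_id"]
        if requester_id not in known:
            raise ValueError("no request group for ARQ %s" % arq["uuid"])
        # Step two: look up the provider picked for that request group.
        mapping[arq["uuid"]] = group_to_rp[requester_id]
    return mapping


arq_to_rp = match_arqs_to_rps(
    arqs=[{"uuid": "arq-a", "device_profile_group_id": 1},
          {"uuid": "arq-b", "device_profile_group_id": 0}],
    request_groups=[{"requester_id": "device_profile_0"},
                    {"requester_id": "device_profile_1"}],
    group_to_rp={"device_profile_0": "rp-fpga", "device_profile_1": "rp-gpu"},
)
```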

Nova changes for Assignment workflow

This section summarizes the workflow details for Phase 1. The changes needed in Nova are marked with NEW.

NEW: A Cyborg client module is added to nova (cyborg-client-module). All Cyborg API calls are routed through that.

The Nova API server receives a POST /servers API request with a flavor that includes a device profile name.
NEW: The Nova API server calls the Cyborg API GET /v2/device_profiles?name=$device_profile_name and gets back the device profile request groups. These are added to the request spec.
The Nova scheduler invokes Placement and gets a list of allocation candidates. It selects one of those candidates and makes claim(s) in Placement. The Nova conductor then sends an RPC message build_and_run_instances to the Nova compute manager.
NEW: Nova calls the Cyborg API POST /v2/accelerator_requests with the device profile name. Cyborg creates a set of unbound ARQs for that device profile and returns them to Nova. (The call may originate from Nova conductor or the compute manager; that will be settled in code review.)
NEW: The Cyborg client in Nova matches each ARQ to the resource provider picked for that accelerator. See Section Accelerator Requests.
NEW: The Nova compute manager calls the Cyborg API PATCH /v2/accelerator_requests to bind the ARQ with the host name, device’s RP UUID and instance UUID. This is an asynchronous call which prepares or reconfigures the device in the background.

NEW: Cyborg, on completion of the bindings (successfully or otherwise), calls Nova’s POST /os-server-external-events API with:

{
   "events": [
      { "name": "arq_resolved",
        "tag": $arq_uuid,
        "server_uuid": $instance_uuid,
        "status": "ok" # or "failed"
      },
      ...
   ]
}

NEW: The Nova virt driver waits for the notification, subject to the timeout mentioned in Section Other deployer impact. It then calls the Cyborg REST API GET /v2/accelerator_requests?instance=<uuid>&bind_state=resolved.
NEW: The Nova virt driver uses the attach handles returned from the Cyborg call to compose PCI passthrough devices into the VM’s definition.
NEW: If there is any error after binding has been initiated, Nova must unbind the relevant ARQs by calling Cyborg API. It may then retry on another host or delete the (unbound) ARQs for the instance.
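
For illustration, the external event body described in the workflow could be assembled like this. The helper name and its arq_results argument are hypothetical; only the field names come from the payload shown in this section.

```python
import json


def build_arq_events(instance_uuid, arq_results):
    """Build a body for POST /os-server-external-events.

    ``arq_results`` maps ARQ UUID -> True (bound) or False (failed).
    Hypothetical helper, not Cyborg code.
    """
    return {
        "events": [
            {"name": "arq_resolved",
             "tag": arq_uuid,
             "server_uuid": instance_uuid,
             "status": "ok" if bound else "failed"}
            for arq_uuid, bound in arq_results.items()
        ]
    }


body = build_arq_events("inst-1", {"arq-a": True, "arq-b": False})
print(json.dumps(body, indent=2))
```

One event is emitted per ARQ, so Nova can tell which binding failed when only some of them do.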

This flow is captured by the following sequence diagram, in which the Nova conductor and scheduler are together represented as the Nova controller. The ARQ creation is shown to happen in Nova compute manager only for concreteness; it may be in the controller instead.

Alternatives

It is possible to have the Nova virt driver poll for the Cyborg ARQ binding completion. That is not preferable, partly because that is not the pattern of interaction with other services like Neutron.

Data model impact

None

REST API impact

None. A new extra_spec key accel:device_profile is added to the flavor.

Security impact

None

Notifications impact

Nova may choose to add additional notifications for Cyborg API calls.

Other end user impact

None

Performance Impact

The extra calls to the Cyborg REST API may impact Nova conductor/scheduler throughput. This has been mitigated by making some critical Cyborg operations asynchronous tasks.

Other deployer impact

The deployer needs to set up the clouds.yaml file so that Nova can call the Cyborg REST API.

The deployer needs to configure a new tunable in nova-cpu.conf:

* arq_binding_timeout (integer): Time in seconds for Nova compute
  manager to wait for Cyborg to notify that ARQ binding is done.
  Timeout is fatal, i.e., VM startup is aborted with an exception.
  Default: 300.
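
The fatal-timeout behavior can be sketched with the standard library. This is an assumption-laden illustration: it uses threading.Event rather than whatever event mechanism Nova actually uses, and ArqBindingTimeout and wait_for_arq_binding are hypothetical names.

```python
import threading


class ArqBindingTimeout(Exception):
    """Hypothetical exception for a binding that never resolved."""


def wait_for_arq_binding(resolved_event, arq_binding_timeout=300):
    """Block until Cyborg's notification arrives or the timeout expires."""
    if not resolved_event.wait(timeout=arq_binding_timeout):
        raise ArqBindingTimeout(
            "ARQ binding not resolved after %s seconds" % arq_binding_timeout)


# The external-event handler would call resolved.set() on "arq_resolved".
resolved = threading.Event()
threading.Timer(0.05, resolved.set).start()
wait_for_arq_binding(resolved, arq_binding_timeout=5)
```

If the event is never set, the function raises, which corresponds to aborting VM startup with an exception.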

Developer impact

Define two new standard resource classes: FPGA and PGPU.

We have VGPU and VGPU_DISPLAY_HEAD RCs defined already. But we propose a PGPU as a different RC for the following reasons:

Both VGPU and VGPU_DISPLAY_HEAD RCs specifically refer to virtual GPUs. We need a different one for physical GPUs.

It will be subject to separate quotas/limits in Keystone.

Using PCI_DEVICE RC is too general: we want quotas for GPU RC specifically.

Upgrade impact

None

Implementation

Assignee(s)

Sundar Nadathur

Work Items

See the steps marked NEW in Nova changes for Assignment workflow section.

Dependencies

Specification for device profiles [2].
Cyborg API specification [10].

Testing

There need to be unit tests and functional tests for the Nova changes. Specifically, there needs to be a functional test fixture that mocks the Cyborg API calls.

Documentation Impact

Device profile creation needs to be documented in Cyborg, as noted in [2].

The need for the operator to fold the device profile into the flavor needs to be documented.

History

Revisions
Release Name	Description
Train	Introduced

libvirt driver support for flavor and image defined ephemeral encryption

Mon, 26 Jun 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-encryption-libvirt

This spec outlines the specific libvirt virt driver implementation to support the Flavor and Image defined ephemeral storage encryption [1] spec.

Problem description

The libvirt virt driver currently provides very limited support for ephemeral disk encryption through the LVM imagebackend and the use of the PLAIN encryption format provided by dm-crypt.

Use Cases

As a user of a cloud with libvirt based computes I want to request that all of my ephemeral storage be encrypted at rest through the selection of a specific flavor or image.
As a user of a cloud with libvirt based computes I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want each encrypted ephemeral disk attached to my instance to have a separate unique secret associated with it.
As an operator I want to allow users to request that the ephemeral storage of their instances is encrypted using the flexible LUKSv1 encryption format.

Proposed change

Deprecate the legacy implementation within the libvirt driver

The legacy implementation using dm-crypt within the libvirt virt driver needs to be deprecated ahead of removal in a future release. This includes the following options:

[ephemeral_storage_encryption]/enabled
[ephemeral_storage_encryption]/cipher
[ephemeral_storage_encryption]/key_size

Limited support for dm-crypt will be introduced using the new framework before this original implementation is removed.

Populate disk_info with encryption properties

This dict currently contains the following:

disk_bus: The default bus used by disks
cdrom_bus: The default bus used by cd-rom drives
mapping: A nested dict keyed by disk name including information about each disk.

Each item within the mapping dict contains the following keys:

bus: The bus for this disk
dev: The device name for this disk as known to libvirt
type: A type from the BlockDeviceType enum (‘disk’, ‘cdrom’, ‘floppy’, ‘fs’, or ‘lun’)

It can also contain the following optional keys:

format: Used to format swap/ephemeral disks before passing to instance (e.g. ‘swap’, ‘ext4’)
boot_index: The 1-based boot index of the disk.

In addition to the above this spec will also optionally add the following keys for encrypted disks:

encryption_format: The encryption format used by the disk
encryption_options: A dict of encryption options
encryption_secret_uuid: The UUID of the encryption secret associated with the disk
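
As an illustration of the resulting structure, the dict below shows one unencrypted and one encrypted disk. All values (bus names, disk names, the empty options dict, the placeholder UUID) are made up for the example, and encrypted_disks is a hypothetical helper.

```python
# Illustrative values only; none of these are taken from a real deployment.
disk_info = {
    "disk_bus": "virtio",
    "cdrom_bus": "sata",
    "mapping": {
        "disk": {
            "bus": "virtio",
            "dev": "vda",
            "type": "disk",
        },
        "disk.eph0": {
            "bus": "virtio",
            "dev": "vdb",
            "type": "disk",
            "format": "ext4",
            # Optional keys added by this spec for encrypted disks:
            "encryption_format": "luks",
            "encryption_options": {},
            "encryption_secret_uuid": "00000000-0000-0000-0000-000000000000",
        },
    },
}


def encrypted_disks(info):
    """Return the names of disks that carry encryption properties."""
    return sorted(name for name, disk in info["mapping"].items()
                  if "encryption_format" in disk)


print(encrypted_disks(disk_info))
```

Since the new keys are optional, code that consumes disk_info has to treat their absence as "not encrypted", as the helper does.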

Handle ephemeral disk encryption within imagebackend

With the above in place we can now add encryption support within each image backend. As highlighted at the start of this spec this initial support will only be for the LUKSv1 encryption format.

Generic key management code will be introduced into the base nova.virt.libvirt.imagebackend.Image class and used to create and store the encryption secret within the configured key manager. The initial LUKSv1 support will store a passphrase for each disk within the key manager. This is unlike the current ephemeral storage encryption and encrypted volume implementations, which store a symmetric key in the key manager. This remains a long running piece of technical debt in the encrypted volume implementation as LUKSv1 does not directly encrypt data with the provided key.

Each backend will then be modified to encrypt disks during nova.virt.libvirt.imagebackend.Image.create_image using the provided format, options and secret.
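
A rough sketch of the one-passphrase-per-disk idea follows. FakeKeyManager is an in-memory stand-in, not the key manager interface Nova uses, and create_disk_secrets and the passphrase length are assumptions chosen for the example.

```python
import secrets
import uuid


class FakeKeyManager:
    """In-memory stand-in for the configured key manager."""

    def __init__(self):
        self._secrets = {}

    def store(self, passphrase):
        secret_uuid = str(uuid.uuid4())
        self._secrets[secret_uuid] = passphrase
        return secret_uuid

    def get(self, secret_uuid):
        return self._secrets[secret_uuid]


def create_disk_secrets(key_manager, disk_names):
    """Create one random passphrase per disk; return {disk: secret_uuid}."""
    return {name: key_manager.store(secrets.token_urlsafe(32))
            for name in disk_names}


km = FakeKeyManager()
secret_ids = create_disk_secrets(km, ["disk", "disk.eph0"])
```

Each returned UUID is what would be recorded as encryption_secret_uuid for the corresponding disk.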

Enable the COMPUTE_EPHEMERAL_ENCRYPTION_LUKS trait

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

As discussed above the ephemeral encryption keys will be added to the disk_info for individual disks within the libvirt driver.

REST API impact

N/A

Security impact

This should hopefully be positive given the unique secret per disk and user visible choice regarding how their ephemeral storage is encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

Other deployer impact

N/A

Developer impact

Upgrade impact

The legacy implementation is deprecated but will continue to work for the time being. As the new implementation is separate there is no further upgrade impact.

Implementation

Assignee(s)

Primary assignee: melwitt
Other contributors: lyarwood

Feature Liaison

Feature liaison: melwitt

Work Items

Populate the individual disk dicts within disk_info with any ephemeral encryption properties.
Provide these properties to the imagebackends when creating each disk.
Introduce support for LUKSv1 based encryption within the imagebackends.
Enable the COMPUTE_EPHEMERAL_ENCRYPTION_LUKS trait when the selected imagebackend supports LUKSv1.

Dependencies

Flavor and Image defined ephemeral storage encryption [1]

Testing

Documentation Impact

New user documentation around the specific LUKSv1 support for ephemeral encryption within the libvirt driver.
Reference documentation around the changes to the virt block device layer.
Document that for the raw imagebackend, both [libvirt]images_type = raw and [DEFAULT]use_cow_images = False must be configured in order for resize to work. This is also true without encryption but it may still be helpful to users.
Document that a user must have policy permission to create secrets in Barbican in order for encryption to work for that user. Secrets are created in Barbican using the user’s auth token. Admins have permission to create secrets in Barbican by default.

References

Revisions
Release Name	Description
Wallaby	Introduced
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed
2023.2 Bobcat	Reproposed

Policy Service Role Default

Fri, 28 Apr 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/policy-service-role-default

Problem description

Use Cases

As an operator, I want service role users to access service-to-service APIs with least privilege.

Proposed change

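We need to make sure all the policy rules for internal service-to-service APIs default to the service role only. Example:

from oslo_policy import policy

policy.DocumentedRuleDefault(
    name='os_compute_api:os-server-external-events:create',
    check_str='role:service',
    description='Create one or more external events',
    operations=[{'method': 'POST', 'path': '/os-server-external-events'}],
    scope_types=['project']
)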

As Nova has dropped the system scope implementation, service-to-service communication with the service role will be done with a project-scoped token (which is currently done in the devstack setup).

The policies for the following APIs will default to the service role:

os_compute_api:os-assisted-volume-snapshots:create
os_compute_api:os-assisted-volume-snapshots:delete
os_compute_api:os-volumes-attachments:swap
os_compute_api:os-server-external-events:create
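
The effect of the new default can be illustrated with a toy check. This is deliberately not oslo.policy: is_authorized and the credentials dict are assumptions for the example, and the function only mimics what a check string of role:service means for the four rules listed above.

```python
SERVICE_ONLY_RULES = {
    "os_compute_api:os-assisted-volume-snapshots:create",
    "os_compute_api:os-assisted-volume-snapshots:delete",
    "os_compute_api:os-volumes-attachments:swap",
    "os_compute_api:os-server-external-events:create",
}


def is_authorized(rule, credentials):
    """Toy evaluation of check_str='role:service'; not oslo.policy."""
    if rule not in SERVICE_ONLY_RULES:
        raise KeyError("unknown rule: %s" % rule)
    return "service" in credentials.get("roles", [])


service_ok = is_authorized(
    "os_compute_api:os-volumes-attachments:swap", {"roles": ["service"]})
admin_ok = is_authorized(
    "os_compute_api:os-volumes-attachments:swap", {"roles": ["admin"]})
```

Under this default, a token that carries only the admin role no longer passes; operators who need that would override the rule in policy.yaml.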

Alternatives

Keep the service-to-service API defaults as they are and expect operators to manage service role user access by overriding them in policy.yaml.

Data model impact

None

REST API impact

The policies for the following APIs will default to the service role:

os_compute_api:os-assisted-volume-snapshots:create
os_compute_api:os-assisted-volume-snapshots:delete
os_compute_api:os-volumes-attachments:swap
os_compute_api:os-server-external-events:create

Security impact

Service-to-service API policies become easier to understand and are restricted to least privilege.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Developer impact

New APIs must add policies that follow the new pattern.

Upgrade impact

Implementation

Assignee(s)

Primary assignee: gmann

Feature Liaison

Feature liaison: dansmith

Work Items

Modify the service-to-service APIs defaults
Modify policy rule unit tests

Dependencies

None

Testing

Modify or add the policy unit tests.

Add a job enabling the new defaults and run the tempest tests to make sure existing service-to-service API communication still works. If needed, modify the token used by services as per the new defaults.

Documentation Impact

The API Reference should be updated to list all the service-to-service APIs under a separate section and to mention the service role as their default.

References

History

Revisions
Release Name	Description
2023.1	Introduced
2023.2	Re-proposed

Use extend volume completion action

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/assisted-volume-extend

Problem description

In this case, only the QEMU process holding the lock can resize the volume, which can be triggered through the QEMU monitor command block-resize.

There is currently no adequate way for Cinder to use this feature, so the NFS, NetApp NFS, Powerstore NFS, and Quobyte volume drivers all disable extending attached volumes.

Use Cases

Proposed change

Currently, Cinder will send the volume-extended external server event to Nova only after it has finalized the extend operation and reset the volume status from extending back to in-use.

Compute Agent

Nova’s compute agent will use the volume status to differentiate between the two behaviors when handling volume-extended events:

If the volume status is extending, then it will attempt to read extend_new_size from the volume’s metadata and use this value as the new size of the volume, instead of the volume size field.

After successfully extending the volume, it will call the extend volume completion action of the volume, with "error": false.

If anything goes wrong, including extend_new_size being missing from the metadata, or being smaller than the current size of the volume, it will log the error and call the os-extend_volume_completion action with "error": true, so Cinder can roll back the operation.
For any other volume status, including in-use, the event will be handled as before.
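
The decision logic above can be sketched as a small pure function. The function name, the dict-shaped volume, and the assumption that extend_new_size and the current size use the same unit are all illustrative; the actual extend call, which can itself fail and trigger "error": true, is left out.

```python
def resolve_extend(volume, current_size):
    """Decide how to handle a volume-extended event.

    Returns (new_size, completion_error). completion_error is None when
    the completion action should not be called (legacy behavior);
    otherwise it is the boolean to send as "error".
    """
    if volume["status"] != "extending":
        return volume["size"], None  # handled as before
    raw = volume.get("metadata", {}).get("extend_new_size")
    try:
        new_size = int(raw)
    except (TypeError, ValueError):
        return current_size, True  # missing or malformed target size
    if new_size < current_size:
        return current_size, True  # refuse to shrink
    return new_size, False


legacy = resolve_extend({"status": "in-use", "size": 20}, current_size=10)
assisted = resolve_extend(
    {"status": "extending", "size": 10,
     "metadata": {"extend_new_size": "20"}}, current_size=10)
missing = resolve_extend(
    {"status": "extending", "size": 10, "metadata": {}}, current_size=10)
```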

API

Nova’s API will introduce a new microversion, so that Cinder can make sure the new behavior is available, before leaving an extend operation unfinished.

Alternatives

A previous change tried to use the volume-extended external server event to support online extend for the NFS driver [1], but did not rely on feedback from Nova to Cinder at all. Instead, it would just set the new size of the volume, change the status back to in-use, notify Nova, and hope for the best.

If anything went wrong on Nova’s side, this would still result in a volume state indicating that the operation was successful, which is not acceptable.
A previous version of this spec proposed a new synchronous API in Nova [2], that would directly call CompVirtAPI.extend_image of the nova-compute instance managing the guest that a volume was attached to. This API would provide a single mechanism to trigger the resize operation, communicate the new size to Nova, and get feedback on the success of the operation.

The problem with a synchronous API is that RPC and API timeouts limit the maximum time an extend operation can take. For QEMU, this seemed to be acceptable, because storage preallocation is hard disabled for the block-resize command, and because all currently plausible file systems support sparse file operations.

However, this may not be true for other volume or virt drivers that might require this API in the future. It would also break with the established pattern of asynchronous coordination between Nova and Cinder, which includes the assisted snapshot and volume migration features.
Following this pattern, we could make the proposed API asynchronous and use a new callback in Cinder, similar to Nova’s os-assisted-volume-snapshots API, which uses the os-update_snapshot_status snapshot action to provide feedback to Cinder.

The function of the new Nova API would then just be to trigger the operation and to communicate the new size. The question, then, is whether that warrants adding a new API to Nova, since there are existing mechanisms that could be used for either.
The existing mechanism for triggering the extend operation in Nova is of course the volume-extended external server event. Using it for this purpose, as this spec proposes, requires the target size to be transferred separately, because external server events only have a single text field that is freely usable, which for volume-extended is already used for the volume ID.

Besides storing it in the admin metadata, as [3] and this spec propose, there is also the option of updating the size field of the volume, as [1] was essentially doing.

This would require the volume size field to be reset on a failure. If an error response from Nova was lost, the volume would just keep the new size. We would need to extend os-reset_status to allow a size reset, or something similar to clean up volumes like this. This would be possible, but updating the size field only after the volume was successfully extended seems like a cleaner solution.
We could also extend the external server event API to accept additional data for events, and use this to communicate the new size to Nova.

This option was judged favorably by reviewers on the previous version of this spec, [2], but it would be a more complex change to the Nova API.

However, if additional data fields become available in a future version of the external server event API, it would be a relatively minor change to use this instead of volume metadata.

Data model impact

None

REST API impact

The behavior of the external server event API will change.

If Nova receives a volume-extended event, and the referenced volume has status of extending, Nova will look for the extend_new_size key in the volume metadata, and use this instead of the volume size field as the target size to update the block device mapping and to pass to the virt driver’s extend_volume method.

Nova will also attempt to call Cinder’s new os-extend_volume_completion volume action proposed in [3] to let Cinder know if the operation was successful or not.
Otherwise, the API will behave as before.
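
The size selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Nova's implementation: the `volume` dict layout, the `resolve_target_size` helper name, the `metadata` key, and the error raised when the key is missing are all hypothetical. Only the `extending` status and the `extend_new_size` key come from this spec.

```python
def resolve_target_size(volume):
    """Pick the target size for a volume-extended event.

    `volume` is a hypothetical dict with "status", "size" and "metadata"
    keys. Returns (target_size, needs_completion_callback).
    """
    if volume.get("status") == "extending":
        # New behavior: the target size is read from the volume metadata
        # and Cinder expects feedback through the completion action.
        raw = volume.get("metadata", {}).get("extend_new_size")
        if raw is None:
            # Assumption: a missing key is treated as an error here.
            raise ValueError("extend_new_size missing for extending volume")
        return int(raw), True
    # Previous behavior: the size field already holds the new size.
    return int(volume["size"]), False
```

A caller would use the second element of the result to decide whether to report the outcome back through os-extend_volume_completion.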

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

Checking the target compute service version allows the API to handle rolling upgrades gracefully.

Implementation

Assignee(s)

Primary assignee:: kgube
Other contributors:: None

Feature Liaison

Feature liaison:: None yet

Work Items

Update the external server event API to check the target compute service version for volume-extended events.
Update the ComputeVirtAPI.extend_volume method to follow the behavior outlined in Compute Agent.
Add unit tests.
Adapt NFS job in the Nova gate to validate online extend.

Dependencies

The extend volume completion action [3]

Testing

We should test that os-extend_volume_completion gets called correctly in all possible error or success conditions if a volume has extending status.

We should test the case that the call to os-extend_volume_completion fails.

We also need to test that volume-extended continues to be handled correctly for volumes not in extending status.

Documentation Impact

The new behavior of the volume-extended event should be added to the documentation of the external server event API.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced

VNC console support for Ironic driver

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/ironic-vnc-console

This feature aims to provide a VNC console for instances provisioned through the Ironic driver.

Problem description

End users often have to troubleshoot their instances because they might have broken their boot configuration or locked themselves out with a firewall. Keyboard-Video-Mouse (KVM) access is often required for troubleshooting these types of issues as serial access is not always available or correctly configured. Also, KVM provides a better user experience as compared to serial console.

Horizon’s VNC console is not supported for the Ironic nodes provisioned by Nova. This spec proposes adding graphical console support for these nodes via the noVNC proxy.

Use Cases

The end user will be able to get a working VNC console URL for a baremetal server:

switch the console type on the baremetal side to vnc
openstack baremetal node console enable
openstack console url show --novnc

nova_novncproxy should be deployed

Proposed change

The Ironic virt driver will have to implement get_vnc_console and return a ctype.ConsoleVNC with the required connection information (port/ip). It will raise ConsoleTypeUnavailable if the VNC console is unavailable for the instance. To get the VNC console, the node.get_console Ironic API will be used (the same API that is used for the serial console).
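
A rough sketch of that method's logic is shown below. It is self-contained and uses stand-in classes in place of Nova's ConsoleVNC and ConsoleTypeUnavailable. The shape of the console state returned by Ironic (a `console_info` dict whose `type` is `vnc` and whose `url` looks like `vnc://host:port`) is an assumption made for illustration; this spec does not define it.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


class ConsoleTypeUnavailable(Exception):
    """Stand-in for Nova's ConsoleTypeUnavailable exception."""


@dataclass
class ConsoleVNC:
    """Stand-in for Nova's ConsoleVNC console type."""
    host: str
    port: int


def get_vnc_console(console_state):
    """Translate a (hypothetical) Ironic console state into a ConsoleVNC."""
    info = console_state.get("console_info") or {}
    if not console_state.get("console_enabled") or info.get("type") != "vnc":
        raise ConsoleTypeUnavailable("vnc")
    # Assumed URL form: vnc://<host>:<port>
    parsed = urlparse(info.get("url", ""))
    if parsed.hostname is None or parsed.port is None:
        raise ConsoleTypeUnavailable("vnc")
    return ConsoleVNC(host=parsed.hostname, port=parsed.port)
```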

Alternatives

Accept this limitation and only offer a serial console. We can configure kvm access including access to the bios via the serial proxy and shell in a box for nova provisioned ironic baremetal instances.

Use out-of-band KVM access provided by administrator without Ironic support.

Data model impact

None

REST API impact

None

Security impact

The VNC connection to the nodes is secured by a token generated while creating the console in Nova. This bearer token is the only thing required to connect to the proxy, so the connection between the user and the proxy should be protected via SSL, the same as with VMs.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Most changes will be on the Ironic side. In Ironic, the operator has to choose which console will be used, serial or VNC. This choice does not affect Nova. A VNC proxy will also be implemented on the Ironic side to handle the RFB handshake.

Additions to the configuration will be similar to those for the serial console.

nova-novncproxy/nova.conf:

[vnc]
novncproxy_host = …
novncproxy_port = …
server_listen = …
server_proxyclient_address = …
auth_schemes = vnc

nova-compute-ironic/nova.conf:

[vnc]
enabled = true
novncproxy_host = …
novncproxy_port = …
server_listen = …
server_proxyclient_address = …
novncproxy_base_url = …

nova-conductor/nova.conf:

[vnc]
novncproxy_host = …
novncproxy_port = …
server_listen = …
server_proxyclient_address = …

Developer impact

None

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: kirillgermanov
Other contributors:: None

Feature Liaison

None

Work Items

nova-compute-ironic: add new method get_vnc_console to Ironic virt driver

Dependencies

https://specs.openstack.org/openstack/ironic-specs/specs/not-implemented/vnc-graphical-console.html#id2

Testing

Add related unit test

Documentation Impact

The following documentation needs to be updated:

https://docs.openstack.org/nova/latest/admin/remote-console-access.html
https://docs.openstack.org/ironic/latest/admin/console.html

References

https://blueprints.launchpad.net/nova/+spec/ironic-vnc-console - nova blueprint
https://review.opendev.org/c/openstack/ironic/+/860689 - gerrit review ironic
https://review.opendev.org/c/openstack/nova/+/863177 - gerrit review nova
https://stackoverflow.com/questions/16469487/vnc-des-authentication-algorithm

History

None

Allow local scaphandre directory to be mapped to an instance using virtiofs

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/virtiofs-scaphandre

Scaphandre is a tool that can be used to measure compute and VM power consumption down to processes. (https://github.com/hubblo-org/scaphandre)

Problem description

Use Cases

As a user, I want to know the consumption of my compute node and drill down to VM and VM processes individual consumption.
As an administrator, I want to allow this usage but make sure the user can mount only the configured directory. I also want to avoid leaking cloud design insights.

Proposed change

To simplify specifications, the feature will be named virtiofs-scaphandre.

Although this feature is implemented to support scaphandre, other tools could have the same need, so the implementation will try to be as generic as possible.

This change relies partially on https://specs.openstack.org/openstack/nova-specs/specs/2023.1/approved/libvirt-virtiofs-attach-manila-shares.html specification to build the VM XML file including virtiofs settings (mostly driver part).

This implies the same requirements and limitations:

QEMU >=5.0 and libvirt >= 6.2
Associated instances use file backed memory or huge pages
Live migration of an instance will not be supported, as live attach and detach landed only “recently” in libvirt.

Change description:

Add a compute configuration option share_local_fs that specifies mappings between compute source directories and VM destination mount tags.

share_local_fs = { "/var/lib/libvirt/scaphandre": "scaphandre" }

If the above configuration option is present when the compute node starts, add a compute trait COMPUTE_SHARE_LOCAL_FS indicating that the virtiofs-scaphandre feature is available on this compute.
Users can set the hw:power_metrics flavor extra spec or the hw_power_metrics image property, in which case two things will happen:
1. Nova will schedule the instance to a host that has share_local_fs.
2. Nova will add the virtiofs settings in the instance XML file as specified by the following example.

<filesystem type='mount' accessmode='passthrough'>
    <driver type='virtiofs'/>
    <source dir='/var/lib/libvirt/scaphandre/<DOMAIN_NAME>'/>
    <target dir='mount_tag'/>
    <readonly />
</filesystem>

Note

The <DOMAIN_NAME> is the name reported by virsh list or OS-EXT-SRV-ATTR:instance_name. This is the name shared between OpenStack and the qemu process, which scaphandre uses to get the VM name.

The instance name can be defined using the instance_name_template. https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.instance_name_template

Example:

"OS-EXT-SRV-ATTR:instance_name": "instance-00000034"
/usr/bin/qemu-system-x86_64 -name guest=instance-00000034…

As a result, the user will be able to mount the compute source directory on their VM using the following command line.

user@instance $ mount -t virtiofs mount_tag /var/scaphandre

Note

The user can see the mount_tag in the instance metadata. Mount automation can be built on this mechanism.
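
As an illustration, the filesystem XML element shown earlier could be generated from a share_local_fs mapping roughly as follows. This is a hedged sketch using only the Python standard library; the `build_filesystem_xml` helper is hypothetical and is not part of Nova's libvirt config classes.

```python
import xml.etree.ElementTree as ET


def build_filesystem_xml(source_dir, domain_name, mount_tag):
    """Build the read-only virtiofs <filesystem> element as an XML string."""
    fs = ET.Element("filesystem", type="mount", accessmode="passthrough")
    ET.SubElement(fs, "driver", type="virtiofs")
    # Each instance only sees its own per-domain subdirectory.
    ET.SubElement(fs, "source", dir=f"{source_dir.rstrip('/')}/{domain_name}")
    ET.SubElement(fs, "target", dir=mount_tag)
    ET.SubElement(fs, "readonly")
    return ET.tostring(fs, encoding="unicode")


# Mapping taken from the share_local_fs example in this spec.
share_local_fs = {"/var/lib/libvirt/scaphandre": "scaphandre"}
for source, tag in share_local_fs.items():
    print(build_filesystem_xml(source, "instance-00000034", tag))
```

Appending the domain name to the source directory is what keeps one instance from reading data that belongs to another.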

Alternatives

REST API impact

Data model impact

Introduce the hw_power_metrics image property as a new property object.

Extend the flavor extra spec validation to check hw:power_metrics.

Security impact

The compute node filesystem will be shared read-only. This is to prevent any modification on the host by VM users.

Notifications impact

Other end user impact

The scaphandre installation and configuration on compute nodes is left to the openstack administrator.

Performance Impact

Other deployer impact

None

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)

Feature Liaison

Feature liaison:: uggla

Work Items

New configuration option.
Add new trait.
Changes to share the compute node filesystem if requested by an image property or a flavor extra spec.

Dependencies

None

Testing

Functional API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Antelope	Introduced

Allowing target state for evacuate

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/allowing-target-state-for-evacuate

In certain circumstances the operator may desire to evacuate running instances to stopped state regardless of the current state of the instance.

Problem description

The current evacuate instance API does not allow operators to set a desired target state for the evacuated instances. Restoring the original state of an instance that was active on the source host may cause issues if the guest requires a valid token to be started, or may prevent evacuation altogether when encrypted volumes are used.

Use Cases

As an operator, I would like to be able to evacuate instances to a shut-off state because my tenant workloads may have specific security requirements that do not allow them to be started by the administrator.
As an operator, I would like to be able to evacuate VMs with encrypted volumes without making the barbican secret readable by admins and reducing the security.
As a user, if my instance is offline due to a host outage, I don’t necessarily want an admin evacuating it and bringing it back online without my knowledge as I may have already replaced it and the zombie coming back may cause a conflict.

Proposed change

As of the bumped version, the API will force the stopped state for evacuated instances. Before the bumped version, the behavior is expected to stay the same: instances with state active or stopped will keep their state at the destination.

With the new microversion nova will always evacuate the instance to SHUTOFF state.
The only way to keep the instance state after the evacuation is to use an older microversion.
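
The rule can be summarized in a few lines. The sketch below is illustrative only: the `evacuated_state` helper and the tuple representation of microversions are assumptions, and the version number used in the usage lines is a placeholder, not the microversion this spec will introduce.

```python
def evacuated_state(original_state, request_version, bumped_version):
    """Return the state an instance has after evacuation.

    Versions are (major, minor) tuples. `bumped_version` is the new
    microversion introduced by this spec, passed in rather than hardcoded.
    """
    if request_version >= bumped_version:
        # New microversion: always land in the stopped state.
        return "stopped"
    # Older microversions: keep whatever state the instance had.
    return original_state


PLACEHOLDER_BUMP = (2, 999)  # placeholder value for illustration only
print(evacuated_state("active", (2, 999), PLACEHOLDER_BUMP))
print(evacuated_state("active", (2, 1), PLACEHOLDER_BUMP))
```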

Alternatives

It may be possible to enhance the resetState API to accept RUNNING and SHUTOFF.
It may be possible to allow the stop action to work while the compute node is down, but that would create an inconsistency between the database and the real state of the instance.

Data model impact

None.

REST API impact

A microversion bump is expected, but no changes to the request schema are needed.

POST /servers/{server_id}/action

{
    "evacuate": {
        "host": "b419863b7d814906a68fb31703c0dbd6"
    }
}

Security impact

None.

Notifications impact

None.

Other end user impact

The nova api-ref will be updated to reflect the changes.
For the OpenStack client, nothing is expected to change other than a no-op microversion bump.

Performance Impact

None.

Other deployer impact

None.

Developer impact

It has been agreed that this spec will not resolve the design issue whereby the evacuate server action starts the virtual machine and then stops it when the target state is stopped. An issue has been reported at:

https://bugs.launchpad.net/nova/+bug/1994967

Upgrade impact

Upgrade note will be added describing new behavior.
An RPC change is expected to make the compute manager handle the new target state, resulting in the version being incremented.
At the API level, a minimum version check will ensure that all services are new enough to accept the request; if not, the request will be rejected with a NotSupported exception.

Implementation

Assignee(s)

Primary assignee:: sahid-ferdjaoui
Other contributors:: None

Feature Liaison

Feature liaison:: None

Work Items

API changes with microversion
Testing for the changes.

Dependencies

None.

Testing

Unit and functional testing for API change.

Documentation Impact

The api-ref will be updated to reflect the changes.

References

https://docs.openstack.org/api-ref/compute/?expanded=evacuate-server-evacuate-action-detail

History

Revisions
Release Name	Description
2023.1 - Antelope	First introduction

Allow FQDN in hostname field

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/fqdn-in-hostname

Enable end users to specify an FQDN as the instance hostname

Problem description

Originally when nova was created the instance hostname was set from the instance display-name. Nova did not allow FQDNs in the display name and explicitly blocked it, but that was later removed as a bugfix.

Given that the filtering of FQDN-like strings was not done as a spec or feature, nova never provided an API guarantee that FQDNs can be used when creating a server. After several bug reports of undesirable interactions with Designate, we decided to extend the normalization that removes non-ASCII-alphanumeric characters to also remove periods (.) from the hostname when initializing it from the display name.

In the Xena release we also introduced configurable instance hostnames by exposing the hostname field directly in the API but maintained our prohibition on FQDNs. https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/configurable-instance-hostnames.html

This spec seeks to extend the hostname field to allow FQDNs to be used as the hostname of an instance.

Use Cases

As an operator, I want to allocate domains to my tenants and use automation to validate that the VMs are created with an FQDN that is derived from that domain.

As a VNF vendor, I want to set the value of /etc/hostname to an FQDN automatically when creating an instance via the api leveraging cloud-init or another tool without using user-data.

Proposed change

Add a new API microversion to opt into using an FQDN in the hostname field.
Increase the character length limit on the hostname field from 63 to 255.
Remove the rejection of “.” and other characters that are legal in an FQDN.
Currently, attempting to use multi-create with the --hostname parameter results in an error 400. This spec continues this behavior: multi-create with --hostname, be it an FQDN or a short name, continues to be disallowed.
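
To make the difference between the two modes concrete, here is a minimal validation sketch. It assumes RFC 1123 style labels (letters, digits and hyphens, 1 to 63 characters, not starting or ending with a hyphen); the spec itself only fixes the overall limits of 63 and 255 characters and the acceptance of periods, so the label rule and the `is_valid_hostname` helper are assumptions, not Nova's actual schema.

```python
import re

# One DNS label: 1-63 characters, alphanumeric at both ends (assumed rule).
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?", re.IGNORECASE)


def is_valid_hostname(value, allow_fqdn):
    """Validate a hostname under the old (short name) or new (FQDN) rules."""
    max_length = 255 if allow_fqdn else 63
    if not 0 < len(value) <= max_length:
        return False
    # Without FQDN support the whole value must be a single label, so any
    # period makes it invalid.
    labels = value.split(".") if allow_fqdn else [value]
    return all(_LABEL.fullmatch(label) for label in labels)
```

With `allow_fqdn=False` a value such as `web-1.example.com` is rejected, while with `allow_fqdn=True` it is accepted, which mirrors the opt-in nature of the new microversion.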

Note

Today the instance.hostname field is propagated into the hostname field of the instance metadata. With this change the instance.hostname field can be an FQDN and that will also be propagated as done today without alteration.

Alternatives

Nova could add an FQDN field, e.g. openstack server create --FQDN …

This raises ambiguity about when to use --hostname vs --FQDN and requires a change to the data model to store the new field.

Nova could add a domain field, e.g. openstack server create --hostname my-host --domain my.domain.com …

This is better than --FQDN but still requires database and object changes for little benefit.

Nova could try to propagate hostname changes to neutron ports and floating IPs. This is seen as risky, complex and hard to understand.

First, if nova were to propagate hostname changes to the port dns_name field, it would only be able to do so on ports that were created by nova, not on pre-created ports passed in by the API user. If we updated ports that were passed in, it could break existing use cases where an end user set the desired name.

Second, the floating IP dns_name is not typically the same as the instance FQDN. The instance hostname or FQDN is typically an internal name used in the application, and the floating IP is used to expose a public name for the service. For example, the instance might be called webserver.cloud.com whereas the dns_name of the floating IP might be blog.mysite.example.com.

Given the two reasons above, and the fact nova does not want to manage networking, propagation of hostname changes to neutron ports is out of scope.

Since this is only useful when using designate, and designate already monitors nova’s notification endpoint to update DNS records using the designate-sink component, this functionality can be implemented in designate in the future if the designate team desires.

For these reasons, updating the hostname in other services when it is updated in nova is out of scope.

Note

As is done today, if the instance.hostname is updated on an instance it will be updated in the metadata service but not in the config-drive. If the config-drive is ever regenerated, such as via a cross cell resize, then the new value will be available to the guest via the config drive. This does not change the behavior from before this change.

Data model impact

None

The database field is already large enough to hold any valid FQDN so no changes are required to the db. The instance object declares the hostname field as a string and also requires no changes.

REST API impact

A new API microversion will be introduced to allow FQDNs in the hostname field. Minor changes will be required to conditionally change the length restriction and disable some of the current validation when the new microversion is used but the code will remain for older microversions.

Security impact

None

Notifications impact

None

Other end user impact

Users will be able to set the hostname of their vm to an FQDN, however without using an external service like designate to advertise the FQDN it may not be resolvable without manual intervention.

Nova provides no guarantee of uniqueness or reachability of the FQDN provided by the end user.

As is the case today, nova will only set the dns_name on a neutron port once when the server is first created. If the end user updates the instance.hostname, it will be updated in the nova db and become visible in the metadata API hostname field. It is out of scope of the nova project to propagate this hostname change to any neutron ports, floating IPs or dns records.

Performance Impact

None

Other deployer impact

Deployers should be aware that unique FQDNs or hostnames cannot be enforced using the existing [DEFAULT]/osapi_compute_unique_server_name_scope config option as that provides uniqueness of the display name, not the hostname.

This spec does not introduce a way to force the hostnames or FQDNs to be unique in any scope.

Developer impact

osc should be extended to support the new microversion.

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: notartom
Other contributors:: None

Feature Liaison

Feature liaison:: sean-k-mooney

Work Items

remove API restriction
update API sample tests
provide new microversion and API ref
update osc

Dependencies

None

Testing

This can be entirely tested with API/functional tests.

Documentation Impact

The API ref will be updated

References

None

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced

CPU state management in libvirt

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-cpu-state-mgmt

In many telecom or semi-static clouds, there is a bin packing problem where a node for all practical use cases is full since no further applications can be scheduled to the host but in reality, it still has unused CPU cores.

In a typical server system, an idle CPU core consumes 3-5 watts of power and produces 3-5 watts of heat which the data center cooling solution must accommodate. To reduce power usage and heat output, and so make data centers greener and less expensive, it makes sense to allow unused CPUs to be powered off or switched to another governor.

A simple way to achieve this is to have the libvirt driver modify power states through the kernel sysfs interface.

Problem description

Large telco operators often find that 2-4 CPUs per host are unusable due to CPU pinning/packing requirements. Each idle core consumes 3-5 watts, or ~12-20 watts per host. Assuming a nominal cost of $0.20 per kWh and 1000 hosts, that means they pay up to $35,040 a year in wasted electricity from idle CPUs alone, plus the additional cost of dissipating all of the heat generated.
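
The yearly figure follows from the upper bound of 20 watts per host. The short calculation below reproduces it; the variable names are only for illustration, and a 365-day year is assumed.

```python
watts_per_host = 20        # upper bound: 4 idle cores at 5 W each
hosts = 1000
hours_per_year = 24 * 365  # assumes a 365-day year
price_per_kwh = 0.20       # nominal cost in dollars

# Convert watt-hours to kilowatt-hours before applying the price.
kwh_per_year = watts_per_host * hosts * hours_per_year / 1000
cost_per_year = kwh_per_year * price_per_kwh

print(f"{kwh_per_year:,.0f} kWh/year -> ${cost_per_year:,.2f}")
```

At the lower bound of 12 watts per host, the same calculation gives 60% of that amount.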

Furthermore, while many telco use-cases require low latency and high throughput, not all of them require the CPU to run at the max frequency.

Use Cases

As an operator using the nova libvirt driver, I would like to be able to disable or run slower cpu cores using the kernel sysfs interface.

As an operator, I want my nova-compute service to enable or put at max performance a CPU core if it will be in use for a new instance that is currently starting.

Proposed change

There are two parts to this proposal:

add a config option for declaring that CPUs are managed
add a config option for telling the performance strategy to use

Declaring CPUs as managed

We can add a config option to the [libvirt] section to declare that the host CPUs are managed [1].

# registered in group [libvirt]
cfg.BoolOpt('cpu_power_management',
            default=False,
            help='Use libvirt to manage CPU cores performance.')

If this option is set to True, then at nova-compute startup, all the CPUs that are defined by the [compute]/cpu_dedicated_set option as dedicated will be tuned for minimum performance (either offlined or set to powersave) depending on the other CPU tuning configuration option explained below, but only if they don’t run an instance.

Note

Shared CPUs will not have their performance modified, as instances float between them. If we wanted to support them too, all shared CPU cores would have to be modified at the same time, and as soon as one instance arrives all of them would need to be set to maximum performance.

Define the performance strategy per compute service

Note

For the sake of simplicity, we explicitly state here that the performance strategy will be defined for all CPU cores of the host, and not per CPU core.

Since different operators may prefer different performance strategies, we let them decide which one to use per compute service. The current list of performance strategies will be:

online/offline CPU cores
flip between performance and powersave CPU core governors

The initial implementation won’t propose any other strategy (like reducing the CPU clock) and we don’t expect those other strategies to be implemented in a foreseeable future.

Note

This is the operator’s responsibility to verify that the OS kernel is recent enough to support CPU core tuning and that those CPU cores have their governors supporting both the performance and the powersave profiles.

A configuration option will accordingly be defined for choosing between those two strategies:

# registered in group [libvirt]
cfg.StrOpt('cpu_power_management_strategy',
           choices=['cpu_state', 'governor'],
           default='cpu_state',
           help='Tuning strategy to reduce CPU power consumption when '
                'unused')

Two specific config options will be defined for telling which governors to use.

# registered in group [libvirt]
cfg.StrOpt('cpu_power_governor_low',
           default='powersave',
           help='Governor to use in order to reduce CPU power consumption')

cfg.StrOpt('cpu_power_governor_high',
           default='performance',
           help='Governor to use in order to have best CPU performance')

Instance lifecycle

When an instance is spawned (or migrated or resumed), we will use the performance strategy to either online the core or switch to the high-performance governor. When an instance is stopped (or powered off, suspended, shelve offloaded, or in confirm-resize state on the source host), we will either offline the core or switch to the power-saving governor.

Note that even though it is the operator’s responsibility to verify that their compute kernels support the two strategies above, we will raise an exception if onlining the core or changing the governor fails, so the instance could end up in the ERROR state. If we cannot offline (or powersave) the CPU core when the instance is stopped, we will only log a WARNING in the compute logs.
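
The two strategies map onto two sysfs files per core. The sketch below shows the idea under stated assumptions: it uses the standard Linux paths `cpuN/online` and `cpuN/cpufreq/scaling_governor`, while the helper names and the `root` parameter (added so the code can be exercised against a temporary directory) are hypothetical and not the framework this spec will add. Writing to the real sysfs tree requires root privileges.

```python
from pathlib import Path

SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_core_online(core, online, root=SYSFS_CPU_ROOT):
    """Online ("1") or offline ("0") a single CPU core."""
    (root / f"cpu{core}" / "online").write_text("1" if online else "0")


def set_core_governor(core, governor, root=SYSFS_CPU_ROOT):
    """Select the cpufreq governor of a single CPU core."""
    path = root / f"cpu{core}" / "cpufreq" / "scaling_governor"
    path.write_text(governor)


def apply_strategy(core, in_use, strategy="cpu_state",
                   low="powersave", high="performance",
                   root=SYSFS_CPU_ROOT):
    """Tune one dedicated core depending on whether an instance uses it."""
    if strategy == "cpu_state":
        set_core_online(core, in_use, root)
    elif strategy == "governor":
        set_core_governor(core, high if in_use else low, root)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
```

The defaults for `strategy`, `low` and `high` mirror the defaults of the configuration options proposed above.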

Alternatives

We could just do the first step and provide a way to disable checking the CPU online state in the libvirt driver with no synchronization, but this would require the operators to statically manage their cloud, which is cumbersome.

We could do this directly in nova and amend Nova every time we want a new use case. It is unlikely that we would appreciate this.

We could make reporting CPUs to the Placement API controllable via config, but this would only solve one use case and would still require some tooling for adjusting the config option.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Upgrade impact

None, config defaults disable this feature.

Implementation

Assignee(s)

Primary assignee:: bauzas
Other contributors:: sean-k-mooney

Feature Liaison

Feature liaison:: N/A

Work Items

Add a config option for cpu state management [1]
Add a config option for cpu tuning
provide a cpu framework for managing cpu core tuning through sysfs
modify libvirt to online a CPU core (or set it to performance) when an instance is spawning
modify libvirt to offline/powersave a CPU core when an instance is stopped
amend init_host() to put CPU cores to low performance (or offline)

Dependencies

None

Testing

Testing of this spec will be done with unit and functional tests.

Documentation Impact

Well, usual bits.

References

History

Revisions
Release Name	Description
Yoga	Introduced
Antelope	Reproposed

Add configuration options to set SPICE compression settings

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/nova-support-spice-compression-algorithm

This spec proposes to add SPICE-related options to a Nova configuration. These options can be used to enable and set the SPICE compression settings for libvirt (QEMU/KVM) provisioned instances. Note that those options are only taken into account if SPICE support is enabled (and the VNC support is disabled).

Problem description

Sometimes, network bandwidth is limited especially if physical network hardware is involved in an OpenStack setup, e.g. if old network switches with limited uplink bandwidth are used. Nevertheless, a data-intensive transfer of console data between compute nodes and remote console clients should be possible in such an infrastructure. Here it would be beneficial if builtin compression settings could be activated for transport protocols (currently only SPICE) in order to transmit graphic-intense desktop content in networks with limited bandwidth while gaining an acceptable quality of experience (QoE).

Use Cases

An operator should be able to decide how to configure a desktop (console) transport via SPICE. In particular, they should be able to configure the SPICE compression algorithms and modes in order to
- lower network bandwidth for graphical console accesses from (remote) networks with limited bandwidth. Users can benefit from such a configuration especially if they access a graphical console of an instance from home.
- completely turn off default compression settings for local console accesses while keeping latency as low as possible within a local (wired) network. Such a configuration can be useful if users should only have local access to graphical instances for visualizing computation results.
- select an appropriate SPICE video detection/streaming for graphic-intense use cases such as office work, media editing, and visualization of computation results, depending on the available network bandwidth and the QoE to be achieved.
A user should be able to access the graphical console of an instance from Horizon’s built-in spice-html5 client as before, even from (remote) networks with limited bandwidth (e.g. from home).

Proposed change

This spec proposes to add configuration options for all transport protocols in OpenStack that support the explicit activation of builtin compression settings. Currently only the integrated SPICE protocol allows the activation of various image [1] and video [2] compression settings to lower the network bandwidth while improving data transmission for graphic-intense desktops. Since SPICE is only supported by the libvirt hypervisor (through the QEMU backend), all other hypervisors and transport protocols are not affected by this proposed change.

Libvirt already provides an automatic configuration of SPICE-related compression settings for the QEMU backend (see the spice documentation in the libvirt XML domain documentation [3]). Therefore, the change only requires making the libvirt hypervisor driver capable of generating a valid libvirt XML config with activated SPICE compression settings.

The OpenStack configuration for the config generation should be stored in the spice configuration group of a Nova configuration. This configuration group should be extended with configuration options that are capable of specifying the SPICE-related compression settings (choose compression algorithms and toggle compression modes on/off).

Alternatives

None.

Data model impact

None.

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

The following SPICE-related options will be added to the spice configuration group of a Nova configuration:

image_compression = [ auto_glz | auto_lz | quic | glz | lz | off ];
jpeg_compression = [ auto | never | always ];
zlib_compression = [ auto | never | always ];
playback_compression = [ True | False ];
streaming_mode = [ filter | all | off ];

Each configuration option is optional and can be set explicitly to configure the associated SPICE compression setting for libvirt. If none of the configuration options is set, then none of the SPICE compression settings will be configured for libvirt, which corresponds to the behavior before this proposed change. In this case, the built-in defaults from the libvirt backend (e.g. QEMU) are used.
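
The mapping from these options to the libvirt domain XML could look roughly like the sketch below. The child element names (image, jpeg, zlib, playback, streaming) follow the libvirt graphics element documentation, but the `build_spice_graphics` helper is hypothetical and uses the Python standard library rather than Nova's libvirt config classes.

```python
import xml.etree.ElementTree as ET


def build_spice_graphics(image_compression=None, jpeg_compression=None,
                         zlib_compression=None, playback_compression=None,
                         streaming_mode=None):
    """Build a <graphics type='spice'> element from optional settings.

    Options left as None produce no child element, so the backend
    defaults stay in effect.
    """
    graphics = ET.Element("graphics", type="spice")
    if image_compression is not None:
        ET.SubElement(graphics, "image", compression=image_compression)
    if jpeg_compression is not None:
        ET.SubElement(graphics, "jpeg", compression=jpeg_compression)
    if zlib_compression is not None:
        ET.SubElement(graphics, "zlib", compression=zlib_compression)
    if playback_compression is not None:
        # libvirt expects on/off here rather than a boolean.
        value = "on" if playback_compression else "off"
        ET.SubElement(graphics, "playback", compression=value)
    if streaming_mode is not None:
        ET.SubElement(graphics, "streaming", mode=streaming_mode)
    return graphics


element = build_spice_graphics(image_compression="auto_glz",
                               streaming_mode="filter")
print(ET.tostring(element, encoding="unicode"))
```

Calling the helper with no arguments yields a bare graphics element, which matches the unchanged behavior described above.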

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:: bahnwaerter
Other contributors:: None

Feature Liaison

Feature liaison:: None

Work Items

Add SPICE-related configuration options to the Nova configuration.
Create documentation for the SPICE-related configuration options.
Extend the SPICE config generation in the libvirt hypervisor driver.

Dependencies

None.

Testing

Implement unit tests for each function to cover testing of added and changed methods.

Documentation Impact

Extend the Nova configuration documentation and add documentation for the SPICE-related compression settings.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced

PCI Device Tracking In Placement

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/pci-device-tracking-in-placement

The OpenStack Placement service was designed to provide tracking of quantitative resources via resource class inventories and qualitative characteristics via traits. Over the last few cycles, nova has utilized Placement to track basic resources such as CPUs, RAM, and disk, and more complex resources such as Virtual GPUs. This spec describes how Nova can utilize Placement to track generic PCI devices without going into the details of the NUMA awareness of such devices.

Problem description

Nova has supported generic stateless PCI passthrough for many releases using a dedicated PCI tracker in conjunction with a PciPassthroughFilter scheduler post filter.

The PCI tracker is responsible for tracking which PCI devices are available, claimed, and allocated, the capabilities of the device, its consumer when claimed or allocated as well as the type of PCI device and location.

The PciPassthroughFilter is responsible for ensuring that devices, requested by the VM, exist on a host during scheduling. These PCI requests come from two sources: flavor-based PCI requests that are generated using the pci_passthrough:alias flavor extra specs and neutron based PCI requests generated from SR-IOV backed neutron ports.

While the current approach to PCI tracking works there are some limitations in the current design and there is room for optimization.

Limitations

During server creation PCI devices are not claimed until the instance_claim is created on the compute node. As a result, it is possible for two concurrent server create requests to race for the last device on a host resulting in re-schedules.
While Nova today tracks the capabilities of network interfaces in the extra_info field of the pci_devices table and the PciPassthroughFilter could match on those capabilities there is no user-facing way to express a request for an SR-IOV neutron port with a specific network capability e.g. TSO.
There is no admin-facing interface to check the available and allocated PCI resources in the cloud. The only way is to look into the Nova database.

Optimizations

Today when the virt driver is assigning a PCI device on the compute hosts it needs to look at all available PCI devices on the host and select one that fulfills the PCI and NUMA requirements. If we model PCI devices in Placement we only need to consider the devices associated with the PCI resource allocation in Placement.
Today when we schedule we perform host filtering of viable hosts based on PCI devices in python. By utilizing Placement we can move that filtering to SQL.

Use Cases

As an operator I want instance creation to atomically claim resources to decrease the chance of retries.
As an operator, I want to shorten the time it takes to select a host by running fewer filters.
As an operator, I want to utilize traits and resource classes to model PCI aliases and requests for more expressive device management.
As an operator, I want to be able to associate quotas with PCI device usage.
As an operator, I want to be able to use the Placement API to query available and allocated PCI devices in my cloud.

Note

Device quotas would require unified limits to be implemented. Implementing quotas is out of the scope of this spec beyond enabling the use case by modeling PCI devices in Placement.

This spec will also only focus on flavor-based PCI passthrough. Neutron SR-IOV port will be addressed in a follow-up spec to limit the scope.

Proposed change

Important

The inventory tracking and allocation healing part of this feature has already been implemented in the Zed release. This spec still contains the description of the already implemented parts to serve as context. The specification of the missing pieces starts at Scheduling.

Opt-in reporting of PCI devices in Placement

To support the upgrade of existing deployments with PCI passthrough configured, and to be able to deprecate and eventually remove some of the functionality of the current PCI device tracker, the new Placement based PCI device tracking will be disabled by default in the first release. The new [pci]report_in_placement config option can be used to enable the functionality. It will default to False at first; once it has been set to True, nova-compute will refuse to start if it is disabled again. In a future release, after the PCI tracking in Placement is feature complete, the default will be changed to True.

PCI device_spec configuration

Below we propose a change to the [pci]passthrough_whitelist configuration option. While we are making this change we will take this opportunity to update the name of the configuration option. The old name of the [pci]passthrough_whitelist config option will be deprecated for eventual removal and a new name [pci]device_spec will be added. Both the old and the new name will support the newly proposed resource_class and traits tags.

The syntax of the PCI passthrough device list configuration option will be extended to support two additional standard tags resource_class and traits. These new tags will only take effect if the [pci]report_in_placement config option is set to True.

device_spec = {
  "vendor_id": "1002",
  "product_id": "67FF",
  "resource_class": "CUSTOM_GPU",
  "traits": "CUSTOM_RADEON_RX_560,CUSTOM_GDDR5"
}
device_spec = {
  "address": "0000:82:00.0",
  "resource_class":"CUSTOM_FPGA",
  "traits": "CUSTOM_XILINX_XC7VX690T"
}

The resource_class tag will be accepted only when the physical_network tag is not defined and will enable a PCI device to be associated with a custom resource class. Each PCI device_spec entry may have at most one resource class associated with it. Devices that have a physical_network tag will not be reported in Placement at this time as Neutron based SR-IOV is out of the scope of the current spec.

Where a PCI device does not have a physical_network or a resource_class tag present it will be reported with a generated custom resource class. The resource class will be CUSTOM_PCI_<vendor_id>_<product_id>.

The traits tag will be a comma-separated list of standard or custom trait names that will be reported for the device RP in Placement.

Nova will normalize and prefix the resource class and trait names with CUSTOM_, if it isn’t already prefixed, before creating them in Placement. Nova will first check the provided trait name in os_traits and if it exists as a standard trait then that will be used instead of creating a custom one.
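The naming rules above can be sketched as follows. This is a minimal illustration, not the Nova implementation: the helper names are hypothetical, STANDARD_TRAITS is a small stand-in for the trait names provided by os_traits, and the input dict is assumed to carry the vendor_id and product_id of the matched device.

```python
import re

# Stand-in for the standard trait names shipped in os_traits; the real
# library is not imported so the sketch stays self-contained.
STANDARD_TRAITS = {"HW_GPU_API_VULKAN", "HW_NIC_SRIOV"}


def normalize_name(name):
    """Upper-case `name`, replace invalid characters, add CUSTOM_."""
    norm = re.sub(r"[^0-9A-Z_]+", "_", name.upper())
    return norm if norm.startswith("CUSTOM_") else "CUSTOM_" + norm


def resource_class_for(spec):
    """Pick the resource class for a device_spec-like dict."""
    rc = spec.get("resource_class")
    if rc is None:
        # Generated default: CUSTOM_PCI_<vendor_id>_<product_id>
        rc = "PCI_%s_%s" % (spec["vendor_id"], spec["product_id"])
    return normalize_name(rc)


def traits_for(spec):
    """Split the comma-separated traits tag, preferring standard names."""
    result = []
    for raw in spec.get("traits", "").split(","):
        raw = raw.strip()
        if not raw:
            continue
        upper = raw.upper()
        result.append(upper if upper in STANDARD_TRAITS
                      else normalize_name(raw))
    return result


print(resource_class_for({"vendor_id": "1002", "product_id": "67FF"}))
print(traits_for({"traits": "hw_nic_sriov, gddr5"}))
```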

Note

Initially traits will only be additive, in the future if we need to we can allow traits to be removed using a +/- syntax but this is not included in the scope of this spec.

As detailed in the Modeling PCI devices in Placement section, each physical device (PF) will be its own resource provider with inventories of the relevant PF and VF resource classes. As such traits cannot vary per VF device under the same parent PF. If VFs are individually matched by different device_spec entries, then defining different traits for different VFs under the same PF is a configuration error and will be rejected.

While it would be possible to support defining different resource_class names for different VFs under the same parent PF, this is considered bad practice and unnecessary complexity. Such a configuration will be rejected.

Note

Nova will detect if the resource_class or traits configuration of an already reported device is changed at a nova-compute service restart. If the affected device is free then Nova will apply the change in Placement. If the device is allocated then changing the resource_class would require removing existing allocations, which is rejected by Placement, and therefore the compute service will refuse to start.

Note

In the future, when PCI tracking in Placement is extended to device_spec entries with a physical_network tag, these entries will not allow specifying a resource_class; instead nova will use the standard SRIOV_NET_VF, PCI_NETDEV and VDPA_NETDEV classes. This will not prevent type-VF and type-PF devices from being consumed via PCI alias, as the alias can request these standard resource classes too.

The new Placement based PCI tracking feature won’t support the devname tag in the [pci]device_spec configuration. Usage of this tag is already limited as not all PCI devices have a device name. Also, devname only works properly if the names are kept stable across hypervisor reboots and upgrades. If [pci]report_in_placement is set to True and the [pci]device_spec has any entry with a devname tag then the nova-compute service will refuse to start.

Modeling PCI devices in Placement

PCI device modeling in Placement will closely mirror that of vGPUs. Each PCI device of type type-PCI and type-PF will be modeled as a Placement resource provider (RP) with the name <hypervisor_hostname>_<pci_address>. The hypervisor_hostname prefix will be the same string as the name of the root RP. The pci_address part of the name will be the full PCI address in the format DDDD:BB:AA.FF.

Note

The pGPU RPs use the libvirt nodedev name, but this spec does not try to follow that naming scheme as the libvirt nodedev names are not considered stable. Also, nova always uses the RP UUID to identify an RP instead of its name, so these names are only for troubleshooting purposes.

Each PCI device RP will have an inventory of resource class and traits based on the [pci]device_spec entry matching with the given device. If the device has children devices (VFs) matching with any device_spec entry then the resource inventory and traits of the children will be reported to the parent PF RP too.

If a PCI device is matching a device_spec entry without a physical_network tag then an inventory of 1 is reported of the resource_class specified in the matching device_spec entry or if resource_class is not specified there then with the generated CUSTOM_PCI_<vendor_id>_<product_id> resource class.

If a type-VF device is matching a device_spec entry then the related resource inventory will be reported on RP representing its parent PF device. The PF RP will be created even if the type-PF device is not matching any device_spec entry but in that case, only VF inventory will exist on the RP.

If multiple VFs from the same parent PF match the device_spec then the total resource inventory of VFs will be the total number of matching VF devices.
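The reporting rules above can be sketched as a small aggregation. This is only an illustration under stated assumptions: the function name and the dict layout of the matched devices are hypothetical and do not correspond to Nova's PciDevice objects.

```python
from collections import defaultdict


def build_inventories(hostname, devices):
    """Map RP name -> {resource class: total}.

    `devices` is a hypothetical list of dicts describing devices that
    matched a device_spec entry: `address`, `dev_type` (type-PCI,
    type-PF or type-VF), `resource_class`, and for VFs `parent_addr`.
    """
    inventories = defaultdict(lambda: defaultdict(int))
    for dev in devices:
        # VF inventory is reported on the RP of the parent PF.
        owner = (dev["parent_addr"] if dev["dev_type"] == "type-VF"
                 else dev["address"])
        rp_name = "%s_%s" % (hostname, owner)
        inventories[rp_name][dev["resource_class"]] += 1
    return {rp: dict(inv) for rp, inv in inventories.items()}


matched = [
    {"address": "0000:81:00.1", "dev_type": "type-VF",
     "parent_addr": "0000:81:00.0", "resource_class": "CUSTOM_VF"},
    {"address": "0000:81:00.2", "dev_type": "type-VF",
     "parent_addr": "0000:81:00.0", "resource_class": "CUSTOM_VF"},
    {"address": "0000:82:00.0", "dev_type": "type-PCI",
     "resource_class": "CUSTOM_GPU"},
]
print(build_inventories("compute1", matched))
```

Note that the PF RP appears in the result even though the PF itself is not in the matched list, so only VF inventory exists on it.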

Each PCI device RP will have traits reported according to the traits tag of the matching device_spec entry. Additionally, Nova will report the COMPUTE_MANAGED_PCI_DEVICE standard trait on the device RPs automatically. This is used by the nova-compute service to reject a reconfiguration where [pci]report_in_placement is disabled after it was enabled.

Listing both the parent PF device and any of its children VF devices at the same time will not be supported if [pci]report_in_placement is enabled. See the Dependent device handling section for more details.

Note

Even though neutron-requested PCI devices are out of the scope of this spec the handling of type-PF and type-VF devices cannot be ignored as those device types can also be requested via PCI alias by setting the device_type tag accordingly.

Note

The PCI alias can only request devices based on vendor_id and product_id today and that information will be automatically included in the Placement inventory as the resource class.

Note

In the future Nova can be extended to automatically report PCI device capabilities as custom traits in placement. However this is out of scope of the current spec. If needed the deployer can add these traits via the [pci]device_spec configuration manually.

Reporting inventories from the ResourceTracker to Placement

The ResourceTracker and the PciDevTracker implement virt driver agnostic PCI device inventory and allocation handling. This logic is extended to provide PCI inventory information to Placement by translating PciDevice and PciDeviceSpec objects to Placement resource providers, resource inventories, and traits.

This new translator logic is also capable of healing missing PCI resource allocations of existing instances based on the instance_uuid field of the allocated PciDevice objects. The missing allocations will be created in Placement via the /reshape API.

To aid PCI scheduling via Placement this logic also records the UUID of the resource provider representing a PCI device in the PciDevice object. The existing PCI pooling logic then translates this into a mapping between PCI device pools and resource provider UUIDs. Note that the scheduler needs a one-to-one mapping between resource providers and PCI device pools, so the PCI pooling logic is changed to represent each type-PCI and type-PF device as a separate pool and to pool together only VFs from the same parent PF.

The inventory and allocation handling logic will run in the update_available_resource periodic task as well as during resource tracker updates triggered by instance actions.

The allocation healing part of this implementation is temporary, to support upgrading existing deployments with PCI allocations to the new Placement based logic. As soon as a deployment is upgraded and the scheduler logic is enabled, the healing is expected to be a no-op as the scheduler creates all the necessary allocations in Placement. Therefore we plan to remove the healing logic from the codebase in a future release.

Note

The compute restart logic needs to handle the case when a device is not present any more either due to changes in the [pci]device_spec config option or due to a physical device removal from the hypervisor. The driver needs to modify the VF resource inventory on the PF RP (when a VF is removed) or delete the PF RP (if the PF is removed and no children VFs matched). Nova cannot prevent the removal of a PCI device from the hypervisor while the device is allocated to a VM. Still Nova will emit a warning in such case.

PCI alias configuration

The PCI alias definition in the [pci]alias configuration option will be extended to support two new tags, resource_class and traits. The resource_class tag can hold exactly one resource class name, while the traits tag can hold a comma-separated list of trait names. Trait names in traits can also be prefixed with ! to express a forbidden trait. When the resource_class is specified, the vendor_id and product_id tags will no longer be required.

Note

If both resource_class and vendor_id and product_id fields are provided in the alias then Nova will use the resource_class for the Placement query but the vendor_id and product_id filtering will happen in the PciPassthroughFilter.
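As a minimal sketch of the traits syntax described for the alias, the hypothetical helper below splits a traits value into required and forbidden trait names. It only illustrates the ! prefix rule and is not the Nova parsing code.

```python
def parse_alias_traits(traits):
    """Split a comma-separated traits value into (required, forbidden)."""
    required, forbidden = set(), set()
    for name in traits.split(","):
        name = name.strip()
        if not name:
            continue
        if name.startswith("!"):
            # A leading "!" marks a forbidden trait.
            forbidden.add(name[1:])
        else:
            required.add(name)
    return required, forbidden


required, forbidden = parse_alias_traits("CUSTOM_GDDR5,!CUSTOM_LEGACY")
print(sorted(required), sorted(forbidden))
```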

Note

Later if more complex trait requirements are needed we can add support for multiple traits tag by adding a free postfix. Also later we can add support for in: prefix in the value of the traits tag to express an OR relationship. E.g.

{
    "traits1": "T1,!T2",
    "traits2": "in:T3,T4"
}

Requesting PCI devices

The syntax and handling of the pci_passthrough:alias flavor extra specs will not change. Also, Nova will continue using the InstancePCIRequest to track the requested PCI devices for a VM.

Scheduling

The RequestSpec creation logic is extended to translate InstancePCIRequest objects to RequestGroup objects and store the new groups in the resource_request field of the RequestSpec. At this time nova will only translate flavor based InstancePCIRequests; the neutron port based requests will be handled in a later release.

This translation logic is disabled by default and can be enabled via the new [filter_scheduler]pci_in_placement configuration option after every compute in the deployment is upgraded and the [pci]report_in_placement configuration option is enabled.

To be able to unambiguously connect InstancePCIRequest to RequestGroups the request_id field of the InstancePCIRequest object always needs to be filled to a UUID. In the past nova only filled that field for neutron based requests.

A single InstancePCIRequest object can potentially request multiple devices as the count field can be set to greater than 1 for a flavor based request. In this case a single request object is split into multiple RequestGroup objects to allow fulfilling those device requests from independent resource providers. The requester_id of each resulting RequestGroup object is filled with a value generated by the InstancePCIRequest.request_id-<index> formula, where index is a running index in the range 0..count taken from the request.

The resources and required_traits fields of the RequestGroup object are filled based on the spec field of the InstancePCIRequest, which in turn is filled from the fields of the matching [pci]alias entry requested via the flavor extra specs. If a request comes from an alias that does not have a resource_class associated with it, then it will be defaulted to CUSTOM_PCI_<vendor_id>_<product_id>.
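The translation described in this section can be sketched as below. The PciRequest and Group classes are simplified stand-ins for InstancePCIRequest and RequestGroup, the spec is flattened to a single dict, and the index is assumed to start at 0 and stop before count; none of this is taken from the Nova code.

```python
from dataclasses import dataclass, field


@dataclass
class PciRequest:
    """Simplified stand-in for InstancePCIRequest."""
    request_id: str
    count: int
    spec: dict


@dataclass
class Group:
    """Simplified stand-in for RequestGroup."""
    requester_id: str
    resources: dict
    required_traits: set = field(default_factory=set)
    forbidden_traits: set = field(default_factory=set)


def to_request_groups(request):
    """Split one PCI request into `count` single-device groups."""
    spec = request.spec
    # Fall back to the generated resource class if the alias has none.
    rc = spec.get("resource_class") or (
        "CUSTOM_PCI_%s_%s" % (spec["vendor_id"], spec["product_id"])).upper()
    required, forbidden = set(), set()
    for name in spec.get("traits", "").split(","):
        name = name.strip()
        if name.startswith("!"):
            forbidden.add(name[1:])
        elif name:
            required.add(name)
    return [
        Group(requester_id="%s-%d" % (request.request_id, index),
              resources={rc: 1},
              required_traits=set(required),
              forbidden_traits=set(forbidden))
        for index in range(request.count)  # assumed range: 0..count-1
    ]


groups = to_request_groups(PciRequest(
    "req-1", 2,
    {"vendor_id": "1002", "product_id": "67ff", "traits": "T1,!T2"}))
print([g.requester_id for g in groups])
```

Because every group asks for exactly one device, Placement is free to satisfy the groups from different resource providers.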

The existing scheduler implementation can be used to generate the /allocation_candidates query to Placement including the new PCI related groups.

Dependent device handling

Today nova allows matching both a parent PF and its children VFs in the configuration and these devices are tracked as separate resources. However, they cannot be consumed independently. When the PF is consumed its children VFs become unavailable. Also, when a VF is consumed its parent PF becomes unavailable. This dynamic device type selection will be deprecated and the new Placement based PCI tracking will only allow configuring either the PF device or its children VF devices. The old PCI tracker will continue to support this functionality, but as soon as [pci]report_in_placement is set to True on a compute, that compute will reject configurations that enable both the PF and its children VFs.

PCI NUMA affinity

The PCI NUMA affinity code (mostly in hardware.py) will need to be modified to limit the PCI devices considered to just those included in the allocation candidate summary. Also at the same time, this code should provide information to the scheduler about which allocation candidate is valid from affinity perspective.

To enable this the allocation candidates will be added to the HostState object of the filter scheduler. The PciPassthroughFilter and NUMATopologyFilter will then need to pass the allocation candidates to the hardware.py functions which will need to remove any allocation candidates from that list that do not fulfill the PCI or NUMA requirements. The filter should then pop any invalid allocation candidates from the HostState object. At the end of the scheduling process, the filter scheduler will have to reconstruct the host allocation candidate set from the HostState object.

By extending the HostState object with the allocation candidate we will enable the filters to filter not just by the host but optionally by the allocation candidates of the host without altering the filter API therefore maintaining compatibility with external filters.
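A minimal sketch of this idea follows. HostState here is a simplified stand-in with only the fields needed for the example, and the validity check is a placeholder for the PCI and NUMA fitting done in hardware.py; the names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    """Simplified stand-in for the scheduler's HostState."""
    host: str
    allocation_candidates: list = field(default_factory=list)


def filter_host(host_state, candidate_is_valid):
    """Drop invalid candidates; the host passes if any candidate remains.

    The return value keeps the usual "does this host pass" contract,
    so the filter API does not change.
    """
    host_state.allocation_candidates = [
        candidate for candidate in host_state.allocation_candidates
        if candidate_is_valid(candidate)
    ]
    return bool(host_state.allocation_candidates)


# Each hypothetical candidate maps a PCI RP name to the NUMA node of
# the device behind it.
state = HostState("compute1", [{"rp-a": 0}, {"rp-b": 1}])
passes = filter_host(state, lambda cand: all(n == 0 for n in cand.values()))
print(passes, state.allocation_candidates)
```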

The PCI stats module

The stats module will have to be enhanced to support allocation-aware claims. To do this, PciDevicePool objects need to be mapped to resource providers. This will be done by the PCI device inventory reporting logic in the PciDevTracker. During a scheduling attempt the scheduler filters can provide the resource provider UUIDs that the current allocation candidate is mapped to, which restricts the PCI fitting logic to that candidate.

After the scheduling decision is made the selected mapping is recorded in the InstancePCIRequest objects, so that during the PCI claim this information can be provided from those objects to ensure that the claim consumes the PCI devices that are allocated for this request in Placement.
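The restriction of the fitting logic to one allocation candidate can be sketched as follows. The pool representation (a dict with rp_uuid and count keys) and both function names are assumptions for this example and do not mirror the PciDevicePool object.

```python
def pools_for_candidate(pools, rp_uuids):
    """Keep only the pools mapped to RPs used by the candidate."""
    return [pool for pool in pools if pool["rp_uuid"] in rp_uuids]


def can_fit(pools, rp_uuids, requested_count):
    """Check whether the restricted pools hold enough free devices."""
    available = sum(
        pool["count"] for pool in pools_for_candidate(pools, rp_uuids))
    return available >= requested_count


pools = [
    {"rp_uuid": "uuid-pf-1", "count": 4},   # VFs pooled under one PF
    {"rp_uuid": "uuid-pci-1", "count": 1},  # a single type-PCI device
]
print(can_fit(pools, {"uuid-pci-1"}, 1))
print(can_fit(pools, {"uuid-pci-1"}, 2))
```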

VM lifecycle operations

The initial scheduling is very similar to the later scheduling done due to move operations. So, the existing implementation can be reused. Also, the current logic that switches the source node Placement allocation to be held by the migration UUID can be reused.

Live migration is not supported with PCI alias-based PCI devices and this will not be changed by the current spec.

Attaching and detaching PCI devices are only supported via Neutron SR-IOV ports and that is out of the scope of this spec.

Neutron SR-IOV ports (out of scope)

This is out of scope in the current spec. But certain aspects of the problem are already known and therefore listed here.

There is a list of Neutron port vnic_type values (e.g. direct, direct-physical, etc.) where the port needs to be backed by VF or PF PCI devices.

In the simpler case when a port only requires a PCI device but does not require any other resources (e.g. bandwidth) then Nova needs to create Placement request groups for each Neutron port with the already proposed prefilter. See Scheduling for more details. In this case, neither the name of the resource class nor the vendor ID, product ID pair is known at scheduling time (compared to the PCI alias case) therefore the prefilter does not know what resource class needs to be requested in the Placement request group.

To resolve this, PCI devices that are intended to be used for Neutron-based SR-IOV should not use the resource_class tag in the [pci]device_spec. Instead Nova will use standard resource classes to model these resources.

Today nova allows consuming type-PCI or type-VF for direct ports. This is mostly there due to historical reasons and it should be cleaned up. A better device categorization is suggested:

A device in the device_spec will be consumable only via PCI alias if it does not have physical_network tag attached.
A device that has physical_network tag attached will be considered a network device and it will be modelled as PCI_NETDEV resource.
A device that has physical_network tag and also has the capability to provide VFs will have a trait HW_NIC_SRIOV but still use the PCI_NETDEV resource class.
A device that has physical_network tag and is a VF will be modelled as SRIOV_NET_VF resource.

This way every Neutron vnic_type can be mapped to one single resource class by Nova. The following vnic_type -> resource class mapping is suggested:

direct, macvtap, virtio-forwarder, remote-managed -> SRIOV_NET_VF
direct-physical -> PCI_NETDEV
vdpa -> VDPA_NETDEV

Nova will use these resource classes to report device inventories to Placement. Then the prefilter can translate the vnic_type of the ports to request the specific resource class during scheduling.
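The suggested mapping can be written down as a simple lookup table. The sketch below only restates the mapping above; the constant and function names are hypothetical, and returning None for other vnic types is an assumption of this example.

```python
VNIC_TYPE_TO_RESOURCE_CLASS = {
    "direct": "SRIOV_NET_VF",
    "macvtap": "SRIOV_NET_VF",
    "virtio-forwarder": "SRIOV_NET_VF",
    "remote-managed": "SRIOV_NET_VF",
    "direct-physical": "PCI_NETDEV",
    "vdpa": "VDPA_NETDEV",
}


def resource_class_for_vnic_type(vnic_type):
    """Return the resource class to request, or None if the vnic type
    is not backed by a PCI device in this mapping."""
    return VNIC_TYPE_TO_RESOURCE_CLASS.get(vnic_type)


print(resource_class_for_vnic_type("direct-physical"))
```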

Another specialty of Neutron-based SR-IOV is that the devices listed in the device_spec always have a physical_network tag. This information needs to be reported as a trait to the PF RP in Placement. Also, the port’s requested physnet needs to be included in the Placement request group by the prefilter.

There is a more complex case when the Neutron port not only requests a PCI device but also requests additional resources (e.g. bandwidth) via the port resource_request attribute. In this case, Nova already generates Placement request groups from the resource_request and as in the simple case will generate a request group from the PCI request. The resource request of these groups of a neutron port needs to be correlated to ensure that a port gets the PCI device and the bandwidth from the same physical device. However today the bandwidth is modeled under the Neutron RP subtree while PCI devices will be modeled right under the root RP. So the two RPs to allocate from are not within the same subtree. (Note that Placement always fulfills a named request group from a single RP but allows correlating such request groups within the same subtree.) We have multiple options here:

Create a scheduler filter that removes allocation candidates where these request groups are fulfilled from different physical devices
Report the bandwidth and the PCI device resource on the same RP. This breaks the clear ownership of a single RP as the bandwidth is reported by the neutron agent while the PCI device is reported by Nova.
Move the two RPs (bandwidth and PCI dev) into the same subtree. This needs an agreement between Nova and Neutron devs where to move the RPs and needs an extra reshape to implement the move.
Enhance Placement to allow sharing of resources between RPs within the same RP tree. By that, we could make the bandwidth RP a sharing RP that shares resources with the PCI device RP representing the physical device.

Based on the selected solution either:

Neutron requests the specific resource class for the SRIOV port via the port resource_request field
Nova can include these resources to the request when the InstancePCIRequest objects are created based on the requested ports.

Alternatives

We could keep using the legacy tracking with all its good and bad properties.
We could track each PCI device record as a separate RP. This would result in each VF having its own RP allowing each VF to have different traits. This is not proposed as it will significantly increase the possible permutations of allocation candidates per host.
We could keep supporting the dynamic PF or VF consumption in Placement but it is deemed more complex than useful. We will keep supporting it via the legacy code path but the new code path will not support it.
We could model each PCI device under a NUMA node. This can be done in the future by moving the RP under a NUMA node RP instead of the compute node RP but it is declared out of the scope of this initial spec.

Data model impact

InstancePCIRequest object will be extended to include the required and forbidden traits and the resource class requested via the PCI alias in the flavor and defined in the PCI alias configuration.

PciDevicePool object will be extended to store a resource provider UUID so that the PCI device allocated in Placement can be correlated to the PCI device to be claimed by the PCI tracker.

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

In general, this is expected to improve the scheduling performance but should have no runtime performance impact on guests.

The introduction of new PCI RequestGroup objects will make the computation of the Placement query slightly longer and the resulting execution time may increase for instances with PCI requests, but it should have no effect for instances without PCI requests. This added cost is expected to be offset by offloading the filtering to Placement and by the removal of reschedules caused by racing for the last PCI device on a host, so the overall performance is expected to improve.

Other deployer impact

To utilize the new feature the operator will have to define two new config options. One to enable the placement scheduling logic and a second to enable the reporting of the PCI devices to Placement.

Developer impact

None

Upgrade impact

The new Placement based PCI tracking will be disabled by default. Deployments already using PCI devices can freely upgrade to the new Nova version without any impact. At this state the PCI device management will be done by the PciPassthroughFilter in the scheduler and the PCI claim in the PCI device tracker in the compute service same as in the previous version of Nova. Then after the upgrade the new PCI device tracking can be enabled in two phases.

First, the PCI inventory reporting needs to be enabled by [pci]report_in_placement on each compute host. During the startup of the nova-compute service with [pci]report_in_placement = True the service will reshape the provider tree and start reporting PCI device inventory to Placement. Nova compute will also heal the PCI allocations of the existing instances in Placement. This healing will also be done for new instances with PCI requests until a future release where the prefilter is enabled by default. This is needed to keep the resource usage in sync in Placement even if the instance scheduling is done without the prefilter requesting PCI allocations in Placement.

Note

Operators are encouraged to take the opportunity to rename the [pci]passthrough_whitelist config option to the new [pci]device_spec option. The syntax of the two options is the same.

Note

The devname tag is no longer supported in [pci]device_spec or [pci]passthrough_whitelist if [pci]report_in_placement is enabled. We suggest using the address tag instead.

Note

If the deployment is configured to rely on the dynamic dependent device behavior, i.e. both the PF and its children VFs match the device_spec, then reconfiguration will be needed as the new code path will not support this and the nova-compute service will refuse to start with such a configuration. To do the reconfiguration the deployer needs to look at the current allocation of the PCI devices on each compute node:

if neither the PF nor any of its children VFs are allocated then the deployer can decide which device(s) to keep in the device_spec.
if the PF is already allocated then the PF needs to be kept in the device_spec but all children VFs have to be removed.
if any of the children VF devices is allocated then the parent PF needs to be removed from the device_spec and at least the currently allocated VFs need to be kept in the config, while other non-allocated children VFs can be kept or removed from the device_spec at will.
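The three cases above can be expressed as a small decision helper. This is a sketch for illustration only: the function name, its inputs and the shape of the returned dict are assumptions, and treating a simultaneously allocated PF and VF as an error follows from the dependent device behavior described earlier.

```python
def plan_reconfiguration(pf_allocated, vf_allocated):
    """Suggest what to keep in device_spec for one PF and its VFs.

    `pf_allocated` is a bool; `vf_allocated` maps each VF address to
    whether that VF is currently allocated.
    """
    used = sorted(addr for addr, alloc in vf_allocated.items() if alloc)
    free = sorted(addr for addr, alloc in vf_allocated.items() if not alloc)
    if pf_allocated and used:
        raise ValueError("a PF and its VFs cannot be allocated together")
    if pf_allocated:
        # Keep the PF, drop every child VF.
        return {"pf": "keep", "vfs_required": [], "vfs_optional": []}
    if used:
        # Drop the PF, keep at least the allocated VFs.
        return {"pf": "remove", "vfs_required": used, "vfs_optional": free}
    # Nothing is allocated: keep either the PF or any of the VFs.
    return {"pf": "operator choice", "vfs_required": [],
            "vfs_optional": free}


print(plan_reconfiguration(
    False, {"0000:81:00.1": True, "0000:81:00.2": False}))
```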

Note

Once [pci]report_in_placement is enabled for a compute host it cannot be disabled any more.

Second, after every compute has been configured to report PCI inventories to Placement the scheduling logic needs to be enabled in the nova-scheduler configuration via the [filter_scheduler]pci_in_placement configuration option.

Implementation

Assignee(s)

Primary assignee:: balazs-gibizer

Feature Liaison

Feature liaison:: sean-k-mooney

Work Items

translate InstancePCIRequest objects to RequestGroup objects in the RequestSpec
support adding resource class and required traits to PCI alias
split PCI pools by PCI or PF devices
map each PCI pool to the RP it represents
extend the HostState object with an allocations candidate list
change the PciPassthroughFilter and the NUMATopologyFilter to filter on allocation candidates
extension of stats and hardware module to consider allocation candidates when filtering for PCI NUMA affinity.
store the allocated RP UUIDs in the InstancePCIRequest
extend the PCI claim code path to consume devices based on the placement allocations.
ensure that InstancePCIRequest to RequestGroup translation happens before each move operation.

Dependencies

The unified limits feature exists in an opt-in, experimental state and will allow defining limits for the new PCI resources if enabled.

Testing

As this is a PCI passthrough related feature it cannot be tested in upstream tempest. Testing will be primarily done via the extensive unit and functional test suites that exist for instances with PCI devices and NUMA topology in the libvirt functional tests.

Documentation Impact

The PCI passthrough doc will have to be rewritten to document the new resource_class and trait tags for the PCI device_spec and PCI alias.

References

CPU resource tracking spec: https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
Unified Limits Integration in Nova: https://specs.openstack.org/openstack/nova-specs/specs/ussuri/approved/unified-limits-nova.html
Support virtual GPU resources: https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html

History

Revisions
Release Name	Description
Xena	Introduced
Zed	Extended and re-proposed. The inventory tracking and allocation healing has been implemented.
2023.1 - Antelope	Re-proposed to finish the scheduling support

Stable Compute UUIDs

Wed, 08 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/stable-compute-uuid

Problem description

The nova-compute service does not have a strong correlation with the unique identifier used to represent itself. In most cases, we use the system hostname as the identifier by which we locate our Service and ComputeNode records in the database. However, hostnames can change (both intentionally and unintentionally) which makes this problematic. The Nova project has long said “don’t do that” although in reality, we must be less fragile and able to detect and protect against database corruption if it happens.

Use Cases

As an operator, I want nova to be able to survive an accidental system hostname change without damage or silent data corruption.

As an operator, I want nova to detect a hostname mismatch and avoid corrupting its database.

As a deployment tool developer, I want to be able to pre-generate the UUID for a given compute host being deployed so that I will know it ahead of time, before starting the service.

Proposed change

Nova will use a persistent file for storing the compute node UUID for non-Ironic environments. If this file does not exist on startup, then it will be created once and only once. This UUID will serve to provide a stable lookup of the ComputeNode object in the database which represents a given nova-compute instance. This identification file should be able to live in a non-writable (by nova-compute) location and be treated like config, but also in a writable location and be treated like state. The latter is important to avoid adding a mandatory deployment step.

The compute service will use this locally-persisted UUID to reliably find the ComputeNode record, and will check for a potential hostname (or CONF.host) change on startup. If such a rename is detected, nova-compute will fail to start and warn about the situation.

This file will be named compute_id and will be honored in the first of the following locations in which it is found:

The parent directory of any file in CONF.config_files
The directory specified in CONF.state_path

For safety, all of the above locations will always be searched and any compute_id files found will be examined. If there are any discrepancies (i.e. more than one file, with non-identical contents), an error will be logged and nova-compute will refuse to start.

The file format will be a single 36-character string containing a UUID in canonical text representation (i.e. uuidgen > /path/to/file).

If nova-compute is started and no compute_id file is found, it will be created once and initialized with a UUID in the CONF.state_path location.
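The lookup, validation and creation rules above can be sketched roughly as follows. This is a minimal illustration, not Nova's implementation: the function names and the InvalidComputeID exception are hypothetical, and config_files and state_path stand in for CONF.config_files and CONF.state_path.

```python
import os
import uuid


class InvalidComputeID(Exception):
    """Hypothetical error for malformed or conflicting compute_id files."""


def candidate_dirs(config_files, state_path):
    """Return the directories to search, config directories first."""
    dirs = []
    for path in config_files:
        parent = os.path.dirname(os.path.abspath(path))
        if parent not in dirs:
            dirs.append(parent)
    state_path = os.path.abspath(state_path)
    if state_path not in dirs:
        dirs.append(state_path)
    return dirs


def read_compute_id(directory):
    """Return the UUID stored in <directory>/compute_id, or None if absent."""
    path = os.path.join(directory, "compute_id")
    try:
        with open(path) as f:
            content = f.read().strip()
    except FileNotFoundError:
        return None
    try:
        parsed = uuid.UUID(content)
    except ValueError:
        raise InvalidComputeID("%s does not contain a valid UUID" % path)
    # Only the canonical 36-character text form is accepted.
    if str(parsed) != content.lower():
        raise InvalidComputeID("%s is not in canonical UUID form" % path)
    return str(parsed)


def get_or_create_compute_id(config_files, state_path):
    """Examine every location; create the file only if none was found."""
    found = set()
    for directory in candidate_dirs(config_files, state_path):
        value = read_compute_id(directory)
        if value is not None:
            found.add(value)
    if len(found) > 1:
        raise InvalidComputeID(
            "conflicting compute_id files: %s" % sorted(found))
    if found:
        return found.pop()
    new_id = str(uuid.uuid4())
    # Mode "x" fails if the file already exists, so it is written only once.
    with open(os.path.join(state_path, "compute_id"), "x") as f:
        f.write(new_id + "\n")
    return new_id
```

A deployment tool that pre-generates the UUID would simply write the file into one of the config directories before the first start, in which case the creation branch is never reached.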

For configurations where the driver is set to Ironic, we will do no persistence of the compute node, since there is not a 1:1 mapping between nova-compute instances and Ironic nodes. The mapping that Ironic pushes up (via get_available_nodes()) will be assumed to be correct.

Note that all drivers in Nova other than Ironic manage a single compute node. Ironic is “special” in this regard and thus will be special-cased for this effort.

Alternatives

We could choose a more complex format with room for additional data or attributes in the future. I would argue that files are cheap, easy(er) for deployment tools to write (i.e. uuidgen > /path/to/file), and avoids the potential need for versioning and migration.

We could make CONF.hostname not optional and not defaulted to socket.gethostname(). This may be a simpler approach, but it is unlikely to be favored by deployers and deployment tool writers. It also does not provide a path to being able to actually support hostname changes in the future.

There is already some data persistence in the ${state_dir}/instances/compute_nodes file, which is JSON-encoded and maintained by the image cache code. I think this is a less-good idea because it’s stored in a place that is potentially shared among multiple (but not all) compute nodes and thus may provide a difficult path to stable “who am I?” determination.

We could use /etc/machine-id or some amount of it. It’s not a UUID, but it’s close. It’s also a freedesktop/systemd thing and may not exist everywhere, especially in a containerized environment.

Data model impact

Right now we generate new UUIDs for records in compute_nodes in two ways:

For most drivers, it occurs rather deep in the object, in the remotable create() method. That means they actually get generated on the conductor node, if the virt driver does not provide a uuid resulting in the resource tracker calling create() with no UUID specified.
For Ironic, the virt driver provides a uuid in the resources dict, which causes it to be created with the desired node id from the start.

So, while not a data model impact directly, this effort will move to always providing a ComputeNode.uuid value when the record is created, either because we read it from the persistent file, or pre-generated it to write the file.

REST API impact

None.

Security impact

The preferred location for the compute_id file is in one of the config file directories, which should be non-writable by Nova itself. If one is not provided, nova will create that file in CONF.state_path, which will leave it writable by the user under which nova-compute runs. This could potentially provide a path to disruption, although if an attacker gains access to write things owned by that user, all the instance disks and configs are similarly exposed.

Notifications impact

None.

Other end user impact

None.

Performance Impact

None.

Other deployer impact

The deployer will not be impacted by default, but will gain the ability to pin the compute node’s UUID as config, if desired.

Developer impact

None.

Upgrade impact

For the 2023.1 cycle, nova-compute will need to gracefully handle the case where there is a ComputeNode that represents its service, which has not yet been persisted to the compute_id file. We will need to communicate this in the release notes, warning of the danger of getting it wrong (which is pretty much the same as a rename today). For the period in which we support this compatibility behavior, we can use the Service.version that we find attached to our ComputeNode object to determine whether or not we should write an existing UUID to the compute_id file or generate it from scratch. In a subsequent release we should remove that behavior (although potentially retain a start-blocking check if the version is being upgraded across that boundary).

Implementation

Assignee(s)

Primary assignee:: danms

Feature Liaison

Feature liaison:: sean-k-mooney

Work Items

Write and test routines for reading, writing, and sanity-checking the compute_id files.
Wire up the init_host() logic to ensure the compatibility behavior of writing existing compute node UUIDs to the file.
Modify the existing compute node creation logic to honor/generate the persistent compute_id.

Dependencies

None.

Testing

Unit and functional testing will be sufficient coverage for this. We will get grenade and greenfield devstack coverage by default, and perhaps we can ensure that the file is created in job post scripts.

Documentation Impact

The installation guide will need changes to describe the purpose and behavior of this file. Obviously release notes will be needed for signaling.

References

This is part of a larger multi-cycle effort to robustify compute hostnames.

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced

Allow Manila shares to be directly attached to an instance when using libvirt

Tue, 07 Mar 2023 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-virtiofs-attach-manila-shares

Problem description

Use Cases

As an operator I want the Manila datapath to be separate from any tenant accessible networks.
As a user I want to attach Manila shares directly to my instance and have a simple interface with which to mount them within the instance.
As a user I want to detach a directly attached Manila share from my instance.
As a user I want to track the Manila shares attached to my instance.

Proposed change

A new server shares API will be introduced under a new microversion. This will list current shares, show their details and allow a share to be attached or detached.

Note

The libvirt driver will be extended to support the above with initial support for cold attach and detach. Future work will aim to add live attach and detach once support lands in libvirt itself.

Virtiofs attachment requires a compute host with the COMPUTE_STORAGE_VIRTIO_FS trait and either the COMPUTE_MEM_BACKING_FILE trait or that the instance is configured with the hw:mem_page_size extra spec.

From an operator’s point of view, COMPUTE_STORAGE_VIRTIO_FS support requires operators to upgrade all their compute nodes to a version that supports shares using virtiofs.

Users will be able to mount the attached shares using a mount tag. This is either the share UUID from Manila or a string provided by the user with the request to attach the share.

user@instance $ mount -t virtiofs $tag /mnt/mount/path

Share mapping status:

                     +----------------------------------------------------+   Reboot VM
    Start VM         |                                                    | --------------+
    Share mounted    |                       active                       |               |
+------------------> |                                                    | <-------------+
|                    +----------------------------------------------------+
|                      |                   |             |
|                      | Stop VM           |             |
|                      | Fail to umount    |             |
|                      v                   |             |
|                    +------------------+  |             |
|                    |      error       | <+-------------+-------------------+
|                    +------------------+  |             |                   |
|                      |                   |             |                   |
|                      | Detach share or   |             |                   |
|                      | delete VM         | Delete VM   |                   |
|                      v                   |             |                   |
|                    +------------------+  |             |                   |
|    +-------------> |        φ         | <+             |                   | Start VM
|    |               +------------------+                |                   | Fail to mount
|    |                 |                                 |                   |
|    | Detach share    |                                 | Stop VM           |
|    | or delete VM    | Attach share                    | Share unmounted   |
|    |                 v                                 v                   |
|    |               +----------------------------------------------------+  |
|    +-------------- |                      inactive                      | -+
|                    +----------------------------------------------------+
|                      |
+----------------------+

φ: means no entry in the database. No association between a share and a server.
Attach share: means POST /servers/{server_id}/shares
Detach share: means DELETE /servers/{server_id}/shares/{shareId}

This chart describes the share mapping status (nova), which is independent of the status of the Manila share.

The umount operation is actually performed only when the share is mounted and no longer used by another server.

With the above mount and umount operations, the state is stored in memory and does not require a lookup in the database.
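The transitions in the chart above can be written down as a small lookup table. This is only a sketch of the chart, not Nova code: the Status enum, the event names and next_status() are hypothetical labels chosen for the arrows.

```python
from enum import Enum


class Status(Enum):
    NONE = "φ"  # no entry in the database
    INACTIVE = "inactive"
    ACTIVE = "active"
    ERROR = "error"


# Event names are hypothetical labels for the arrows in the chart.
TRANSITIONS = {
    (Status.NONE, "attach_share"): Status.INACTIVE,
    (Status.INACTIVE, "start_vm_share_mounted"): Status.ACTIVE,
    (Status.INACTIVE, "start_vm_mount_failed"): Status.ERROR,
    (Status.INACTIVE, "detach_share"): Status.NONE,
    (Status.INACTIVE, "delete_vm"): Status.NONE,
    (Status.ACTIVE, "reboot_vm"): Status.ACTIVE,
    (Status.ACTIVE, "stop_vm_share_unmounted"): Status.INACTIVE,
    (Status.ACTIVE, "stop_vm_umount_failed"): Status.ERROR,
    (Status.ACTIVE, "delete_vm"): Status.NONE,
    (Status.ERROR, "detach_share"): Status.NONE,
    (Status.ERROR, "delete_vm"): Status.NONE,
}


def next_status(current, event):
    """Return the status reached from `current` on `event`.

    Raises ValueError for arrows that do not exist in the chart, for
    example detaching a share while the mapping is active.
    """
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            "no transition from %s on %s" % (current.value, event)
        ) from None
```

Keeping the chart as data makes it easy to see that the only way out of the error status is to detach the share or delete the VM.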

Manila share removal issue:

Instance metadata:

Alternatives

REST API impact

A new server level shares API will be introduced under a new microversion with the following methods:

GET /servers/{server_id}/shares

List all shares attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "shares": [
        {
            "shareId": "48c16a1a-183f-4052-9dac-0e4fc1e498ad",
            "status": "active",
            "tag": "foo"
        },
        {
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar"
        }
    ]
}

GET /servers/{server_id}/shares/{shareId}

Show details of a specific share attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar"
    }
}

PROJECT_ADMIN will be able to see details of the attachment id and export location stored within Nova:

{
    "share": {
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

POST /servers/{server_id}/shares

Attach a share to an instance.

Prerequisite(s):

Instance must be in the SHUTOFF state.
Instance should have the required capabilities to enable virtiofs (see above).

Return Code(s): 202,400,401,403,404,409

Request body:

Note

tag is an optional parameter in the request body. When it is not provided, it defaults to the shareId (UUID), which is always provided in the request.

tag, if provided by the user, must be an ASCII string with a maximum length of 64 bytes.

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}

Response body:

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}
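The tag rules for the attach request can be sketched as follows. This is an illustration under assumptions, not the API's validation code: resolve_tag() and MAX_TAG_BYTES are hypothetical names, and the handling of an empty tag is not specified here, so the sketch does not address it.

```python
import uuid

MAX_TAG_BYTES = 64  # maximum tag length stated in the request note


def resolve_tag(share_id, tag=None):
    """Return the mount tag for an attach request.

    share_id is always required and must be a UUID. When no tag is
    given, the share UUID is used as the tag.
    """
    uuid.UUID(share_id)  # raises ValueError if share_id is not a UUID
    if tag is None:
        return share_id
    if not tag.isascii():
        raise ValueError("tag must be an ASCII string")
    if len(tag.encode("ascii")) > MAX_TAG_BYTES:
        raise ValueError("tag must be at most %d bytes" % MAX_TAG_BYTES)
    return tag
```

With the request body shown above, which carries no tag, the resolved tag is the shareId itself, matching the response body.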

DELETE /servers/{server_id}/shares/{shareId}

Detach a share from an instance.

Prerequisite(s): Instance must be in the SHUTOFF or ERROR state.

Return Code(s): 202,400,401,403,404,409

Data model impact

A new share_mapping database table will be introduced.

id - Primary key autoincrement
uuid - Unique UUID to identify the particular share attachment
instance_uuid - The UUID of the instance the share will be attached to
share_id - The UUID of the share in Manila
status - The status of the share attachment within Nova (active, inactive, error)
tag - The device tag to be used by users to mount the share within the instance.
export_location - The export location used to attach the share to the underlying host
share_proto - The Shared File Systems protocol (NFS, CEPHFS)

A new base ShareMapping versioned object will be introduced to encapsulate the above database entries and to be used as the parent class of specific virt driver implementations.

Fields containing text will use String and not Text type in the database schema to limit the column width and be stored inline in the database.

This base ShareMapping object will provide stub attach and detach methods that will need to be implemented by any child objects.

New ShareMappingLibvirt, ShareMappingLibvirtNFS and ShareMappingLibvirtCephFS objects will be introduced as part of the libvirt implementation.
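The shape of the base object and its children might look roughly like the sketch below. It uses a plain dataclass rather than Nova's versioned objects, so it only illustrates the fields and the stub methods. The validation in __post_init__ and the return value of the NFS attach() are assumptions made for illustration, not the proposed behavior.

```python
from dataclasses import dataclass

VALID_STATUSES = ("active", "inactive", "error")
VALID_PROTOCOLS = ("NFS", "CEPHFS")


@dataclass
class ShareMapping:
    """Plain-Python stand-in for the proposed base object."""

    uuid: str
    instance_uuid: str
    share_id: str
    status: str
    tag: str
    export_location: str
    share_proto: str

    def __post_init__(self):
        # Assumed checks, based on the values listed for each column.
        if self.status not in VALID_STATUSES:
            raise ValueError("unknown status: %s" % self.status)
        if self.share_proto not in VALID_PROTOCOLS:
            raise ValueError("unknown protocol: %s" % self.share_proto)

    def attach(self):
        """Stub: must be implemented by driver-specific children."""
        raise NotImplementedError()

    def detach(self):
        """Stub: must be implemented by driver-specific children."""
        raise NotImplementedError()


class ShareMappingLibvirt(ShareMapping):
    """Common parent for the libvirt implementations."""


class ShareMappingLibvirtNFS(ShareMappingLibvirt):
    def attach(self):
        # Placeholder for the real host-side mount logic (hypothetical).
        return ("nfs", self.export_location)

    def detach(self):
        # Placeholder for the real host-side umount logic (hypothetical).
        return ("nfs", self.export_location)
```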

Security impact

Notifications impact

New notifications will be added:

New notifications for share attach and share detach.
An extension of the instance update notification with the share mapping information.

Share mapping in the instance payload will be optional and controlled via the include_share_mapping notification configuration parameter. It will be disabled by default.

The proposed payload for the attach and detach notifications will be the same as the one returned by the show command with admin rights.

{
    "share": {
        "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

The proposed instance payload for instance update will be the list of shares attached to this instance.

{
    "shares":
    [
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar",
            "export_location": "server.com/nfs_mount,foo=bar"
        },
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-ffffffffffff",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7987",
            "status": "active",
            "tag": "baz",
            "export_location": "server2.com/nfs_mount,foo=bar"
        }
    ]
}

Other end user impact

Users will need to mount the shares within their guest OS using the returned tag.

Users could use the instance metadata to discover and auto mount the share.

Performance Impact

Other deployer impact

None

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)
Other contributors:: lyarwood (initial contributor)

Feature Liaison

Feature liaison:: uggla

Work Items

Add new capability traits within os-traits
Add support within the libvirt driver for cold attach and detach
Add new shares API and microversion

Dependencies

None

Testing

Functional libvirt driver and API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Yoga	Introduced
Zed	Reproposed
Antelope	Reproposed
Bobcat	Reproposed

Example Spec - The title of your blueprint

Sun, 29 Jan 2023 00:00:00

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Some notes about the nova-spec and blueprint process:

Not all blueprints need a spec. For more information see https://docs.openstack.org/nova/latest/contributor/blueprints.html#specs
The aim of this document is first to define the problem we need to solve, and second agree the overall approach to solve that problem.
This is not intended to be extensive documentation for a new feature. For example, there is no need to specify the exact configuration changes, nor the exact details of any DB model changes. But you should still define that such changes are required, and be clear on how that will affect upgrades.
You should aim to get your spec approved before writing your code. While you are free to write prototypes and code before getting your spec approved, it’s possible that the outcome of the spec review process leads you towards a fundamentally different solution than you first envisaged.
But, API changes are held to a much higher level of scrutiny. As soon as an API change merges, we must assume it could be in production somewhere, and as such, we then need to support that API change forever. To avoid getting that wrong, we do want lots of details about API changes upfront.

Some notes about using this template:

Your spec should be in ReSTructured text, like this template.
Please wrap text at 79 columns.
The filename in the git repository should match the launchpad URL, for example a URL of: https://blueprints.launchpad.net/nova/+spec/awesome-thing should be named awesome-thing.rst
Please do not delete any of the sections in this template. If you have nothing to say for a whole section, just write: None
For help with syntax, see http://sphinx-doc.org/rest.html
To test out your formatting, build the docs using tox and see the generated HTML file in doc/build/html/specs/<path_of_your_file>
If you would like to provide a diagram with your spec, ascii diagrams are required. http://asciiflow.com/ is a very nice tool to assist with making ascii diagrams. The reason for this is that the tool used to review specs is based purely on plain text. Plain text will allow review to proceed without having to look at additional files which can not be viewed in gerrit. It will also allow inline feedback on the diagram itself.
If your specification proposes any changes to the Nova REST API such as changing parameters which can be returned or accepted, or even the semantics of what happens when a client calls into the API, then you should add the APIImpact flag to the commit message. Specifications with the APIImpact flag can be found with the following query:

https://review.openstack.org/#/q/status:open+project:openstack/nova-specs+message:apiimpact,n,z

Problem description

A detailed description of the problem. What problem is this blueprint addressing?

Use Cases

What use cases does this address? What impact on actors does this change have? Ensure you are clear about the actors in each use case: Developer, End User, Deployer etc.

Proposed change

Here is where you cover the change you propose to make in detail. How do you propose to solve this problem?

If this is one part of a larger effort make it clear where this piece ends. In other words, what’s the scope of this effort?

Alternatives

Data model impact

Questions which need to be addressed by this section include:

What new data objects and/or database schema changes is this going to require?
What database migrations will accompany this change?
How will the initial set of new data objects be generated, for example if you need to take into account existing instances, or modify other existing data describe how that will work.

REST API impact

Each API method which is either added or changed should have the following:

Specification for the method
- A description of what the method does suitable for use in user documentation
- Method type (POST/PUT/GET/DELETE)
- Normal http response code(s)
- Expected error http response code(s)
  - A description for each possible error code should be included describing semantic errors which can cause it such as inconsistent parameters supplied to the method, or when an instance is not in an appropriate state for the request to succeed. Errors caused by syntactic problems covered by the JSON schema definition do not need to be included.
- URL for the resource
  - URL should not include underscores, and use hyphens instead.
- Parameters which can be passed via the url
- JSON schema definition for the request body data if allowed
  - Field names should use snake_case style, not CamelCase or MixedCase style.
- JSON schema definition for the response body data if any
  - Field names should use snake_case style, not CamelCase or MixedCase style.
Example use case including typical API samples for both data supplied by the caller and the response
Discuss any policy changes, and discuss what things a deployer needs to think about when defining their policy.

Example JSON schema definitions can be found in the Nova tree https://opendev.org/openstack/nova/src/branch/master/nova/api/openstack/compute/schemas

Reuse of existing predefined parameter types such as regexps for passwords and user defined names is highly encouraged.

Security impact

Describe any potential security impact on the system. Some of the items to consider include:

Does this change touch sensitive data such as tokens, keys, or user data?
Does this change alter the API in a way that may impact security, such as a new way to access sensitive information or a new way to login?
Does this change involve cryptography or hashing?
Does this change require the use of sudo or any elevated privileges?
Does this change involve using or parsing user-provided data? This could be directly at the API level or indirectly such as changes to a cache layer.
Can this change enable a resource exhaustion attack, such as allowing a single API interaction to consume significant server resources? Some examples of this include launching subprocesses for each connection, or entity expansion attacks in XML.

Notifications impact

Please specify any changes to notifications. Be that an extra notification, changes to an existing notification, or removing a notification.

Consider proposing changes to the versioned notifications:

When the feature adds or removes fields to the API responses. For example when the feature adds a new field to the GET /servers API response consider adding similar information to the payload of the instance action notifications
When the feature adds a new action to the existing API entities. For example adding a new action to the server might mean you want to emit a corresponding new instance action notification
When the feature adds a new resource (noun) to the REST API consider adding new notifications about the creation and deletion of such resource

Other end user impact

Aside from the API, are there other ways a user will interact with this feature?

Does this change have an impact on python-novaclient and openstack client? What does the user interface there look like?

Performance Impact

Describe any potential performance impact on the system, for example how often will new code be called, and is there a major change to the calling pattern of existing code.

Examples of things to consider here include:

A periodic task might look like a small addition but if it calls conductor or another service the load is multiplied by the number of nodes in the system.
Scheduler filters get called once per host for every instance being created, so any latency they introduce is linear with the size of the system.
A small change in a utility function or a commonly used decorator can have a large impact on performance.
Calls which result in database queries (whether direct or via conductor) can have a profound impact on performance when called in critical sections of the code.
Will the change include any locking, and if so what considerations are there on holding the lock?

Other deployer impact

Discuss things that will affect how you deploy and configure OpenStack that have not already been mentioned, such as:

What config options are being added? Should they be more generic than proposed (for example a flag that other hypervisor drivers might want to implement as well)? Are the default values ones which will work well in real deployments?
Is this a change that takes immediate effect after it’s merged, or is it something that has to be explicitly enabled?
If this change is a new binary, how would it be deployed?
Please state anything that those doing continuous deployment, or those upgrading from the previous release, need to be aware of. Also describe any plans to deprecate configuration values or features. For example, if we change the directory name that instances are stored in, how do we handle instance directories created before the change landed? Do we move them? Do we have a special case in the code? Do we assume that the operator will recreate all the instances in their cloud?

Developer impact

Discuss things that will affect other developers working on OpenStack, such as:

If the blueprint proposes a change to the driver API, discussion of how other hypervisors would implement the feature is required.

Upgrade impact

Describe any potential upgrade impact on the system, such as:

If this change adds a new feature to the compute host that the controller services rely on, the controller services may need to check the minimum compute service version in the deployment before using the new feature. For example, in Ocata, the FilterScheduler did not use the Placement API until all compute services were upgraded to at least Ocata.
While we strive to have feature parity between all virt drivers, it is not uncommon for one virt driver to implement a new feature exposed out of the API before the others. For example, extending the size of an attached volume. Since Nova does not yet have any type of sophisticated capabilities API so a user can know what actions can be performed on a given instance, consider adding a new policy rule to at least let operators that cannot support a virt-specific feature disable it in their cloud which is at least presented to the user in an understandable way by getting a 403 Forbidden error.
Nova supports N-1 version nova-compute services for rolling upgrades. Does the proposed change need to consider older code running that may impact how the new change functions, for example, by changing or overwriting global state in the database? This is generally most problematic when making changes that involve multiple compute hosts, like move operations such as migrate, resize, unshelve and evacuate.

Implementation

Assignee(s)

Who is leading the writing of the code? Or is this a blueprint where you’re throwing it out there to see who picks it up?

If more than one person is working on the implementation, please designate the primary author and contact.

Primary assignee:: <launchpad-id or None>
Other contributors:: <launchpad-id or None>

Feature Liaison

Ideally feature work is sponsored by a member of the nova core team or other experienced and active nova developer. The purpose of a liaison is to:

Mentor developers through the arcana of nova’s development processes.
Advocate for (aka “care about”) the feature to the rest of the nova team.
Be the initial go-to for reviews.

See the Feature Liaison FAQ for more details.

Feature liaison:: <name and/or nick>

Feature liaison is optional. However, we suggest finding a liaison for your feature, as it will help get your feature merged. The Feature Liaison FAQ has details about how to find a liaison for your work.
If you do not already have agreement from a nova developer to act as your liaison, you may write “Liaison Needed” here and/or in your commit message.
If you are a core or experienced nova dev, you need not have a separate liaison; if you wish, you may just assign yourself, or put “None”/“N/A”.

Work Items

Dependencies

Include specific references to specs and/or blueprints in nova, or in other projects, that this one either depends on or is related to.
If this requires functionality of another project that is not currently used by Nova (such as the glance v2 API when we previously only required v1), document that fact.
Does this feature require any new library dependencies or code otherwise not included in OpenStack? Or does it depend on a specific version of library?

Testing

Is this untestable in gate given current limitations (specific hardware / software configurations available)? If so, are there mitigation plans (3rd party testing, gate enhancements, etc).

Documentation Impact

References

Links to mailing list or IRC discussions
Links to notes from a summit session
Links to relevant research, if appropriate
Related specifications as appropriate (e.g. if it’s an EC2 thing, link the EC2 docs)
Anything else you feel it is worthwhile to refer to

History

Optional section intended to be used each time the spec is updated, to describe a new design, API or database schema update. It is useful to let readers understand what has happened over time.

Revisions
Release Name	Description
2023.2 Bobcat	Introduced

Add maxphysaddr support for Libvirt

Thu, 15 Dec 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-maxphysaddr-support

This blueprint proposes new flavor extra_specs to control the physical address bits of vCPUs in Libvirt guests.

Problem description

When booting a guest with 1TB+ RAM, the default physical address bits are too small and the boot fails [1]. So a knob is needed to specify the appropriate physical address bits.

Use Cases

Booting a guest with large RAM.

Proposed change

<maxphysaddr mode='emulate' bits='42'/>
<maxphysaddr mode='passthrough'/>

Flavor extra_specs

Here I suggest the following two flavor extra_specs. Of course, if these are omitted, the behavior is the same as before.

hw:maxphysaddr_mode can be either emulate or passthrough.
hw:maxphysaddr_bits takes a positive integer value. Only meaningful and must be specified if hw:maxphysaddr_mode=emulate.
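The rules for the two extra_specs above can be sketched as a small validator that also renders the matching guest XML element. This is only an illustration under stated assumptions: the function names parse_maxphysaddr and to_xml are hypothetical, and rejecting hw:maxphysaddr_bits outside of emulate mode is an assumption, since the spec only says the value is not meaningful there.

```python
import xml.etree.ElementTree as ET

VALID_MODES = ("emulate", "passthrough")


def parse_maxphysaddr(extra_specs):
    """Return (mode, bits) from flavor extra_specs, or None if unset."""
    mode = extra_specs.get("hw:maxphysaddr_mode")
    bits = extra_specs.get("hw:maxphysaddr_bits")
    if mode is None:
        if bits is not None:
            raise ValueError("hw:maxphysaddr_bits requires "
                             "hw:maxphysaddr_mode=emulate")
        return None  # both omitted: behavior is the same as before
    if mode not in VALID_MODES:
        raise ValueError(f"invalid hw:maxphysaddr_mode: {mode!r}")
    if mode == "passthrough":
        if bits is not None:
            # Assumption: treat a stray bits value as an error.
            raise ValueError("hw:maxphysaddr_bits is only valid with "
                             "hw:maxphysaddr_mode=emulate")
        return (mode, None)
    if bits is None:
        raise ValueError("hw:maxphysaddr_bits must be set for emulate mode")
    bits = int(bits)  # raises ValueError for non-integer input
    if bits <= 0:
        raise ValueError("hw:maxphysaddr_bits must be a positive integer")
    return (mode, bits)


def to_xml(mode, bits=None):
    """Render the <maxphysaddr> element for the guest CPU config."""
    elem = ET.Element("maxphysaddr", mode=mode)
    if bits is not None:
        elem.set("bits", str(bits))
    return ET.tostring(elem, encoding="unicode")
```

For example, parse_maxphysaddr({"hw:maxphysaddr_mode": "emulate", "hw:maxphysaddr_bits": "42"}) yields ("emulate", 42), which to_xml turns into an element with mode and bits attributes.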

Nova scheduler changes

Nova scheduler also needs to be modified to take these two properties into account.

Passthrough and emulate modes have different properties. So let’s consider the two separately.

openstack flavor set <flavor> \
  --property hw:maxphysaddr_mode=emulate \
  --property hw:maxphysaddr_bits=42

Note

Since ComputeCapabilitiesFilter only supports flavor extra_specs and not image properties [5], this proposal is out of scope for image properties.

Alternatives

Before the maxphysaddr option was introduced into Libvirt, it was specified as a workaround with a QEMU command-line parameter. But this alternative is not allowed in nova.

Ubuntu package maintainers are applying a patch to QEMU [7]. It means this is not included in vanilla QEMU and is not available in other distributions.
This is only the case for hw:maxphysaddr_mode=passthrough and does not include hw:maxphysaddr_mode=emulate. Since hw:maxphysaddr_mode=passthrough requires cpu_mode=host-passthrough to be used [8], this alternative cannot be used with cpu_mode=custom or cpu_mode=host-model. So, this alternative is not sufficient for a cloud with many different CPU models.

As for scheduling, placement does not currently support numeric traits, so the maximum number of bits supported by the hypervisor cannot be checked by this mechanism.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Operators should specify appropriate flavor extra_specs as needed.

Developer impact

None

Upgrade impact

As described earlier, the new traits COMPUTE_ADDRESS_SPACE_PASSTHROUGH and COMPUTE_ADDRESS_SPACE_EMULATED signal if the upgraded compute nodes support this feature.

Implementation

Assignee(s)

Primary assignee:: nmiki
Other contributors:: None

Feature Liaison

Feature liaison:: Liaison Needed

Work Items

Add new guest configs
Add new fields in nova/api/validation/extra_specs/hw.py
Add new fields in LibvirtConfigCPU in nova/virt/libvirt/config.py
Add new traits to check Libvirt and QEMU versions
Add new field maxphysaddr to cpu_info in nova/virt/libvirt/driver.py
Add docs and release notes for new flavor extra_specs

Dependencies

Libvirt v8.7.0+. QEMU v2.7.0+.

Testing

Add the following unit tests:

check that proposed flavor extra_specs are properly validated
check that intended XML elements are output
check that traits are properly added and used
check that new field in ComputeCapabilitiesFilter is properly added and used

Documentation Impact

For operators, the documentation describes what proposed flavor extra_specs mean and how they should be set.

References

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced

Review usage of oslo-privsep library on Nova

Wed, 23 Nov 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/privsep-usage-review

Nova’s usage of the privsep library is too broad. A single global permission profile with all needed capabilities is defined for all functions that interact with privsep to use. While this works, it is not the best usage of the library as functions are getting a set of rights they do not need and thus should not receive. This spec seeks to fix this situation by defining a more specialized usage of the library.

Problem description

Nova compute services use the oslo-privsep library to obtain elevated privileges on the host system in order to invoke Python functions or Linux commands that affect areas of the host requiring such privileges.

Today, Nova’s usage of privsep follows best practices that were recommended by the library when it was first created:

Create a dedicated module for privileged functions.
Create a single context and restrict its usage to that module.
Limit scope of privileged functions and reuse their actions as unprivileged code.

Based on usage of the library over the years, it has become clear that this approach is neither secure nor desirable to continue. In the current design, a single profile is shared by all functions that make use of the library. This profile aggregates all capabilities required by all privileged functions in the code. This means that when a single function needs a capability to operate on the filesystem, every other function also receives that capability, even if it does not need it. This may lead to unexpected behaviors that can be avoided if more precise profiles are used for each case.

Use Cases

As a developer, I want to have a fine-tuned method for acquiring capabilities. As an admin, I want Nova to use as few elevated privileges as possible.

Proposed change

All current functions that use the privsep library are found under nova.privsep. The first step is to study each function and map it to the capabilities it requires. Next, a set of profiles can be defined for common use cases, such as network or system rights, covering as many functions as possible. The remaining functions will have to be divided into smaller functions that do fit into one of those profiles. If that is not possible, then the current all-capable profile will need to be kept for them until a better solution is found.

Profiles will now be defined under the __init__.py file found at: https://github.com/openstack/nova/blob/master/nova/__init__.py, while functions using these will be distributed through other packages. Here is an example of how the file may end up looking:

from oslo_privsep import capabilities
from oslo_privsep import priv_context

legacy_pctxt = priv_context.PrivContext(
    'nova',
    cfg_section='nova_sys_admin',
    pypath=__name__ + '.legacy_pctxt',
    capabilities=[capabilities.CAP_CHOWN,
                  capabilities.CAP_DAC_OVERRIDE,
                  capabilities.CAP_DAC_READ_SEARCH,
                  capabilities.CAP_FOWNER,
                  capabilities.CAP_NET_ADMIN,
                  capabilities.CAP_SYS_ADMIN],
)

sys_admin_pctxt = priv_context.PrivContext(
    'nova',
    cfg_section='privsep_sys_admin',
    pypath=__name__ + '.sys_admin_pctxt',
    capabilities=[capabilities.CAP_SYS_ADMIN],
)

net_admin_pctxt = priv_context.PrivContext(
    'nova',
    cfg_section='privsep_net_admin',
    pypath=__name__ + '.net_admin_pctxt',
    capabilities=[capabilities.CAP_NET_ADMIN],
)

file_admin_pctxt = priv_context.PrivContext(
    'nova',
    cfg_section='privsep_file_admin',
    pypath=__name__ + '.file_admin_pctxt',
    capabilities=[capabilities.CAP_CHOWN,
                  capabilities.CAP_DAC_OVERRIDE,
                  capabilities.CAP_DAC_READ_SEARCH,
                  capabilities.CAP_FOWNER],
)
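The mapping step described in this section (study each function, then assign it the narrowest profile that covers its needs, falling back to the all-capable profile) can be sketched without oslo.privsep at all. The following is a minimal illustration, assuming capabilities are represented as plain strings; the helper name pick_profile is hypothetical and not part of Nova.

```python
# Profile name -> capabilities it grants, mirroring the contexts above.
PROFILES = {
    "sys_admin_pctxt": frozenset({"CAP_SYS_ADMIN"}),
    "net_admin_pctxt": frozenset({"CAP_NET_ADMIN"}),
    "file_admin_pctxt": frozenset({
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_DAC_READ_SEARCH",
        "CAP_FOWNER",
    }),
}

# Fallback for functions that do not fit any specialized profile.
LEGACY_PROFILE = "legacy_pctxt"


def pick_profile(required_caps):
    """Return the smallest profile granting every required capability."""
    required = frozenset(required_caps)
    candidates = [
        (len(caps), name)
        for name, caps in PROFILES.items()
        if required <= caps
    ]
    if not candidates:
        return LEGACY_PROFILE
    # min() orders by capability count first, then by name for ties.
    return min(candidates)[1]
```

A function that only needs CAP_CHOWN maps to file_admin_pctxt, while one that needs both CAP_CHOWN and CAP_NET_ADMIN fits no specialized profile and stays on legacy_pctxt until it is split up.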

Each newly defined profile will spawn a daemon that consumes resources on the host. For this reason, no more than 4 profiles may be defined at any one time, to avoid overburdening the host.

For the sake of improving usability, shared code found across the package’s functions should be extracted into other, unprivileged functions with broader contracts. These will take care of performing more generic actions, like ‘chown’ or ‘mkdir’, that may not require more than the user’s rights to be done. When elevated permissions are required though, specialized single use functions with a narrow contract will be defined using one of the new privsep contexts. These functions will be created following these conditions:

Will contain the privileged_ prefix on their name.
Will be defined at the same package that uses them.
Will only be imported by a single module, excepting unit tests.

Here is an example of what this implementation would look like:

# in nova/common/filesystem.py

def write_file(
  path: str,
  data: str = None,
  mode: str = 'w'
) -> ty.Optional[str]:
    try:
        with open(path, mode=mode) as fd:
            fd.write(data)
    except (OSError, ValueError) as e:
        LOG.debug(e)
        raise

def chown_file(
  path: str,
  usr: str = None,
  grp: str = None
) -> ty.Optional[str]:
    try:
        shutil.chown(path, user=usr, group=grp)
    except (OSError, ValueError) as e:
        LOG.debug(e)
        raise

# in nova/virt/libvirt/driver.py
import nova

from nova.common import filesystem as fs
...

@nova.file_admin_pctxt.entrypoint
def privileged_write_tpm_data(
  instance: uuid,
  tpm_data: str
) -> ty.Optional[str]:
    if not oslo_utils.uuidutils.is_uuid_like(instance):
        raise ValueError(f"instance: {instance} is not a valid uuid")
    path = os.path.join(CONF.instance_state_dir, instance)
    try:
        fs.write_file(path, data=tpm_data, mode='wb')
        fs.chown_file(path, "nova", "qemu")
    except (OSError, ValueError) as e:
        LOG.debug(e)

Alternatives

None that I can think of. Please, provide any feedback on the scope of this spec and its approach.

Data model impact

None

REST API impact

None

Security impact

Requires the use of elevated privileges.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

If the OpenStack distribution in use does not rely on the defaults for the elevated privileges configuration, the privsep daemons introduced by this spec must be configured using the options at: https://docs.openstack.org/nova/latest/configuration/config.html#privsep.

Developer impact

Developers will need to analyze which capabilities are required for any new functions under nova.privsep and apply the correct profile accordingly.

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee:: jsanemet

Feature Liaison

Feature liaison:: sylvainb

Work Items

Study functions that already use oslo-privsep to determine which capabilities each need.
Define profiles for functions that share a common context, e.g. running a system command or modifying network settings.

Dependencies

None

Testing

Tempest tests must continue to pass without the need for any modifications, verifying that everything still works the same running under reduced permission sets.

Documentation Impact

None

References

First discussed at: https://etherpad.opendev.org/p/nova-privsep-review

History

Revisions
Release Name	Description
2023.1 Antelope	Introduced

Policy Service Role Default

Sun, 13 Nov 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/policy-service-role-default

Problem description

Use Cases

As an operator, I want service role users to access service-to-service APIs with the least privilege.

Proposed change

We need to make sure the policy rules for all internal service-to-service APIs default to the service role only. Example:

policy.DocumentedRuleDefault(
    name='os_compute_api:os-server-external-events:create',
    check_str='role:service',
    scope_types=['project']
)

As Nova has dropped the system scope implementation, service-to-service communication with the service role will be done with a project scoped token (as is currently done in the devstack setup).

The policies of the following APIs will default to the service role:

os_compute_api:os-assisted-volume-snapshots:create
os_compute_api:os-assisted-volume-snapshots:delete
os_compute_api:os-volumes-attachments:swap
os_compute_api:os-server-external-events:create
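To illustrate the effect of a role:service default, the sketch below evaluates that check string against a token's roles. It is a deliberately simplified stand-in and not oslo.policy itself: the helpers role_check and authorize are hypothetical, and only the "role:<name>" form of check string is handled.

```python
# Default check strings for the service-to-service APIs listed above.
SERVICE_DEFAULTS = {
    "os_compute_api:os-assisted-volume-snapshots:create": "role:service",
    "os_compute_api:os-assisted-volume-snapshots:delete": "role:service",
    "os_compute_api:os-volumes-attachments:swap": "role:service",
    "os_compute_api:os-server-external-events:create": "role:service",
}


def role_check(check_str, credentials):
    """Evaluate a 'role:<name>' check against the caller's roles."""
    kind, _, value = check_str.partition(":")
    if kind != "role" or not value:
        raise ValueError(f"unsupported check string: {check_str!r}")
    roles = [role.lower() for role in credentials.get("roles", [])]
    return value.lower() in roles


def authorize(rule_name, credentials):
    """Return True if the caller may use the API behind rule_name."""
    return role_check(SERVICE_DEFAULTS[rule_name], credentials)
```

With these defaults, a token carrying only the admin or member role is refused, while a token carrying the service role is accepted.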

Alternatives

Keep the current defaults for the service-to-service APIs and expect operators to manage the access permissions of service role users by overriding the defaults in policy.yaml.

Data model impact

None

REST API impact

The policies of the following APIs will default to the service role:

os_compute_api:os-assisted-volume-snapshots:create
os_compute_api:os-assisted-volume-snapshots:delete
os_compute_api:os-volumes-attachments:swap
os_compute_api:os-server-external-events:create

Security impact

Service-to-service API policies become easier to understand and are restricted to least privilege.

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Developer impact

New APIs must add policies that follow the new pattern.

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: gmann

Feature Liaison

Feature liaison:: dansmith

Work Items

Modify the service-to-service APIs defaults
Modify policy rule unit tests

Dependencies

None

Testing

Modify or add the policy unit tests.

Add a job enabling the new defaults and run the tempest tests to make sure existing service-to-service API communication works. If needed, modify the token used by services to match the new defaults.

Documentation Impact

The API Reference should be updated to list all the service-to-service APIs under a separate section and mention the service role as their default.

References

History

Revisions
Release Name	Description
2023.1	Introduced

Allow Manila shares to be directly attached to an instance when using libvirt

Thu, 10 Nov 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/libvirt-virtiofs-attach-manila-shares

Problem description

Use Cases

As an operator I want the Manila datapath to be separate from any tenant accessible networks.
As a user I want to attach Manila shares directly to my instance and have a simple interface with which to mount them within the instance.
As a user I want to detach a directly attached Manila share from my instance.
As a user I want to track the Manila shares attached to my instance.

Proposed change

A new server shares API will be introduced under a new microversion. This will list current shares, show their details and allow a share to be attached or detached.

Note

The libvirt driver will be extended to support the above with initial support for cold attach and detach. Future work will aim to add live attach and detach once support lands in libvirt itself.

Attaching a share requires that the compute host reports the COMPUTE_STORAGE_VIRTIO_FS trait and either reports the COMPUTE_MEM_BACKING_FILE trait or that the instance is configured with the hw:mem_page_size extra spec.

From an operator’s point of view, COMPUTE_STORAGE_VIRTIO_FS support requires operators to upgrade all their compute nodes to a version that supports shares using virtiofs.

Users will be able to mount the attached shares using a mount tag. This is either the share UUID from Manila or a string provided by the user in the request to attach the share.

user@instance $ mount -t virtiofs $tag /mnt/mount/path

Share mapping status:

                     +----------------------------------------------------+   Reboot VM
    Start VM         |                                                    | --------------+
    Share mounted    |                       active                       |               |
+------------------> |                                                    | <-------------+
|                    +----------------------------------------------------+
|                      |                   |             |
|                      | Stop VM           |             |
|                      | Fail to umount    |             |
|                      v                   |             |
|                    +------------------+  |             |
|                    |      error       | <+-------------+-------------------+
|                    +------------------+  |             |                   |
|                      |                   |             |                   |
|                      | Detach share or   |             |                   |
|                      | delete VM         | Delete VM   |                   |
|                      v                   |             |                   |
|                    +------------------+  |             |                   |
|    +-------------> |        φ         | <+             |                   | Start VM
|    |               +------------------+                |                   | Fail to mount
|    |                 |                                 |                   |
|    | Detach share    |                                 | Stop VM           |
|    | or delete VM    | Attach share                    | Share unmounted   |
|    |                 v                                 v                   |
|    |               +----------------------------------------------------+  |
|    +-------------- |                      inactive                      | -+
|                    +----------------------------------------------------+
|                      |
+----------------------+

φ: means no entry in the database. No association between a share and a server.
Attach share: means POST /servers/{server_id}/shares
Detach share: means DELETE /servers/{server_id}/shares

This chart describes the share mapping status (nova), which is independent of the status of the Manila share.
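The transitions in the chart can also be written down as a lookup table, which makes it easy to check which operations are legal from a given status. This is only a sketch: φ is represented by None, and the event names are hypothetical labels for the arrows in the chart, not identifiers used by Nova.

```python
# (current status, event) -> next status; None stands for φ (no entry).
TRANSITIONS = {
    (None, "attach_share"): "inactive",
    ("inactive", "start_vm_share_mounted"): "active",
    ("inactive", "start_vm_mount_failed"): "error",
    ("inactive", "detach_share"): None,
    ("inactive", "delete_vm"): None,
    ("active", "reboot_vm"): "active",
    ("active", "stop_vm_share_unmounted"): "inactive",
    ("active", "stop_vm_umount_failed"): "error",
    ("active", "delete_vm"): None,
    ("error", "detach_share"): None,
    ("error", "delete_vm"): None,
}


def next_status(status, event):
    """Return the share mapping status after event, per the chart."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in status {status!r}"
        ) from None
```

For instance, attaching a share to a server creates the mapping in the inactive status, and it only becomes active once the VM starts and the share is mounted.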

The umount operation will only actually be performed when the share is mounted and no longer used by another server.

With the above mount and umount operations, the state is stored in memory and does not require a lookup in the database.

Manila share removal issue:

Instance metadata:

Alternatives

REST API impact

A new server level shares API will be introduced under a new microversion with the following methods:

GET /servers/{server_id}/shares

List all shares attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "shares": [
        {
            "shareId": "48c16a1a-183f-4052-9dac-0e4fc1e498ad",
            "status": "active",
            "tag": "foo"
        },
        {
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar"
        }
    ]
}

GET /servers/{server_id}/shares/{shareId}

Show details of a specific share attached to an instance.

Return Code(s): 200,400,401,403,404

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar"
    }
}

PROJECT_ADMIN will be able to see details of the attachment id and export location stored within Nova:

{
    "share": {
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}
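The difference between the regular and the PROJECT_ADMIN response can be sketched as a view function that adds the attachment id and export location only for admins. The function name share_view and the is_admin flag are hypothetical; a real implementation would rely on Nova's policy checks rather than a boolean.

```python
# Fields every caller may see, and those reserved for PROJECT_ADMIN.
COMMON_FIELDS = ("shareId", "status", "tag")
ADMIN_ONLY_FIELDS = ("attachmentId", "export_location")


def share_view(mapping, is_admin=False):
    """Build the 'show share' response body for the given caller."""
    view = {field: mapping[field] for field in COMMON_FIELDS}
    if is_admin:
        for field in ADMIN_ONLY_FIELDS:
            view[field] = mapping[field]
    return {"share": view}


# Sample record reusing the values from the examples above.
mapping = {
    "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
    "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
    "status": "active",
    "tag": "bar",
    "export_location": "server.com/nfs_mount,foo=bar",
}
```

Calling share_view(mapping) returns only shareId, status and tag, while share_view(mapping, is_admin=True) also includes attachmentId and export_location.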

POST /servers/{server_id}/shares

Attach a share to an instance.

Prerequisite(s):

Instance must be in the SHUTOFF state.
Instance should have the required capabilities to enable virtiofs (see above).

Return Code(s): 202,400,401,403,404,409

Request body:

Note

tag is an optional parameter in the request body. When not provided, it defaults to the shareId (UUID), which is always provided in the request.

If provided by the user, tag must be an ASCII string with a maximum length of 64 bytes.

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}

Response body:

{
    "share": {
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "e8debdc0-447a-4376-a10a-4cd9122d7986"
    }
}
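The tag rules for the attach request (optional, defaulting to the shareId, ASCII only, at most 64 bytes) can be sketched as follows. The helper name resolve_tag is hypothetical, and how an empty string should be handled is not covered by this spec, so it is left unvalidated here.

```python
MAX_TAG_BYTES = 64


def resolve_tag(share_id, tag=None):
    """Return the mount tag to use for an attach request."""
    if tag is None:
        # No tag supplied: fall back to the share UUID.
        return share_id
    try:
        encoded = tag.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("tag must be an ASCII string") from None
    if len(encoded) > MAX_TAG_BYTES:
        raise ValueError(f"tag must be at most {MAX_TAG_BYTES} bytes")
    return tag
```

This matches the request and response bodies above: a request without a tag gets the shareId back as its tag.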

DELETE /servers/{server_id}/shares/{shareId}

Detach a share from an instance.

Prerequisite(s): Instance must be in the SHUTOFF or ERROR state.

Return Code(s): 202,400,401,403,404,409

Data model impact

A new share_mapping database table will be introduced.

id - Primary key autoincrement
uuid - Unique UUID to identify the particular share attachment
instance_uuid - The UUID of the instance the share will be attached to
share_id - The UUID of the share in Manila
status - The status of the share attachment within Nova (active, inactive, error)
tag - The device tag to be used by users to mount the share within the instance.
export_location - The export location used to attach the share to the underlying host
share_proto - The Shared File Systems protocol (NFS, CEPHFS)
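As a rough picture of one row in this table, the columns listed above can be modelled as a plain dataclass. This is a sketch only, not the actual database model or versioned object: the class name ShareMappingRow is hypothetical, and the validation simply mirrors the values named in the list.

```python
from dataclasses import dataclass

VALID_STATUSES = ("active", "inactive", "error")
VALID_PROTOCOLS = ("NFS", "CEPHFS")


@dataclass
class ShareMappingRow:
    """One share_mapping row, with the columns proposed above."""
    id: int
    uuid: str
    instance_uuid: str
    share_id: str
    status: str
    tag: str
    export_location: str
    share_proto: str

    def __post_init__(self):
        # Reject values outside the sets named in the column list.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")
        if self.share_proto not in VALID_PROTOCOLS:
            raise ValueError(f"invalid share_proto: {self.share_proto!r}")
```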

A new base ShareMapping versioned object will be introduced to encapsulate the above database entries and to be used as the parent class of specific virt driver implementations.

Fields containing text will use String and not Text type in the database schema to limit the column width and be stored inline in the database.

This base ShareMapping object will provide stub attach and detach methods that will need to be implemented by any child objects.

New ShareMappingLibvirt, ShareMappingLibvirtNFS and ShareMappingLibvirtCephFS objects will be introduced as part of the libvirt implementation.

Security impact

Notifications impact

New notifications will be added:

One to add new notifications for share attach and share detach.
One to extend the instance update notification with the share mapping information.

Share mapping in the instance payload will be optional and controlled via the include_share_mapping notification configuration parameter. It will be disabled by default.

The proposed payload for the attach and detach notifications will be the same as the one returned by the show command with admin rights.

{
    "share": {
        "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
        "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
        "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
        "status": "active",
        "tag": "bar",
        "export_location": "server.com/nfs_mount,foo=bar"
    }
}

The proposed instance payload for instance update will be the list of shares attached to this instance.

{
    "shares":
    [
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-9dc2a67bd8bf",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7986",
            "status": "active",
            "tag": "bar",
            "export_location": "server.com/nfs_mount,foo=bar"
        },
        {
            "instance_uuid": "7754440a-1cb7-4d5b-b357-9b37151a4f2d",
            "attachmentId": "715335c1-7a00-4dfe-82df-ffffffffffff",
            "shareId": "e8debdc0-447a-4376-a10a-4cd9122d7987",
            "status": "active",
            "tag": "baz",
            "export_location": "server2.com/nfs_mount,foo=bar"
        }
    ]
}

Other end user impact

Users will need to mount the shares within their guest OS using the returned tag.

Users could use the instance metadata to discover and auto mount the share.

Performance Impact

Other deployer impact

None

Developer impact

None

Upgrade impact

Implementation

Assignee(s)

Primary assignee:: uggla (rene.ribaud)
Other contributors:: lyarwood (initial contributor)

Feature Liaison

Feature liaison:: uggla

Work Items

Add new capability traits within os-traits
Add support within the libvirt driver for cold attach and detach
Add new shares API and microversion

Dependencies

None

Testing

Functional libvirt driver and API tests
Integration Tempest tests

Documentation Impact

Extensive admin and user documentation will be provided.

References

History

Revisions
Release Name	Description
Yoga	Introduced
Zed	Reproposed
Antelope	Reproposed

libvirt driver support for flavor and image defined ephemeral encryption

Wed, 09 Nov 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-encryption-libvirt

This spec outlines the specific libvirt virt driver implementation to support the Flavor and Image defined ephemeral storage encryption [1] spec.

Problem description

The libvirt virt driver currently provides very limited support for ephemeral disk encryption through the LVM imagebackend and the use of the PLAIN encryption format provided by dm-crypt.

Use Cases

As a user of a cloud with libvirt based computes I want to request that all of my ephemeral storage be encrypted at rest through the selection of a specific flavor or image.
As a user of a cloud with libvirt based computes I want to be able to pick how my ephemeral storage be encrypted at rest through the selection of a specific flavor or image.
As a user I want each encrypted ephemeral disk attached to my instance to have a separate unique secret associated with it.
As an operator I want to allow users to request that the ephemeral storage of their instances is encrypted using the flexible LUKSv1 encryption format.

Proposed change

Deprecate the legacy implementation within the libvirt driver

The legacy implementation using dm-crypt within the libvirt virt driver needs to be deprecated ahead of removal in a future release. This includes the following options:

[ephemeral_storage_encryption]/enabled
[ephemeral_storage_encryption]/cipher
[ephemeral_storage_encryption]/key_size

Limited support for dm-crypt will be introduced using the new framework before this original implementation is removed.

Populate disk_info with encryption properties

This dict currently contains the following:

disk_bus: The default bus used by disks
cdrom_bus: The default bus used by cd-rom drives
mapping: A nested dict keyed by disk name including information about each disk.

Each item within the mapping dict contains the following keys:

bus: The bus for this disk
dev: The device name for this disk as known to libvirt
type: A type from the BlockDeviceType enum (‘disk’, ‘cdrom’, ‘floppy’, ‘fs’, or ‘lun’)

It can also contain the following optional keys:

format: Used to format swap/ephemeral disks before passing to instance (e.g. ‘swap’, ‘ext4’)
boot_index: The 1-based boot index of the disk.

In addition to the above this spec will also optionally add the following keys for encrypted disks:

encryption_format: The encryption format used by the disk
encryption_options: A dict of encryption options
encryption_secret_uuid: The UUID of the encryption secret associated with the disk
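To make the shape concrete, a disk_info dict carrying the new optional keys might look like the sketch below. All values are made up for illustration (bus names, device names, UUIDs and the empty options dict), and the helper secret_uuids is hypothetical; treating an entry without the optional keys as unencrypted is an assumption.

```python
disk_info = {
    "disk_bus": "virtio",
    "cdrom_bus": "ide",
    "mapping": {
        # Encrypted disks carry the three new optional keys.
        "disk": {
            "bus": "virtio",
            "dev": "vda",
            "type": "disk",
            "encryption_format": "luks",
            "encryption_options": {},
            "encryption_secret_uuid": "11111111-1111-1111-1111-111111111111",
        },
        "disk.local": {
            "bus": "virtio",
            "dev": "vdb",
            "type": "disk",
            "encryption_format": "luks",
            "encryption_options": {},
            "encryption_secret_uuid": "22222222-2222-2222-2222-222222222222",
        },
        # No encryption keys: assumed to mean the device is unencrypted.
        "disk.config": {"bus": "ide", "dev": "hda", "type": "cdrom"},
    },
}


def secret_uuids(info):
    """Map each encrypted disk name to its encryption secret UUID."""
    return {
        name: disk["encryption_secret_uuid"]
        for name, disk in info["mapping"].items()
        if "encryption_format" in disk
    }
```

Note that each encrypted disk references its own secret, in line with the use case of a separate unique secret per disk.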

Handle ephemeral disk encryption within imagebackend

With the above in place we can now add encryption support within each image backend. As highlighted at the start of this spec this initial support will only be for the LUKSv1 encryption format.

Generic key management code will be introduced into the base nova.virt.libvirt.imagebackend.Image class and used to create and store the encryption secret within the configured key manager. The initial LUKSv1 support will store a passphrase for each disk within the key manager. This is unlike the current ephemeral storage encryption or encrypted volume implementations that currently store a symmetric key in the key manager. This remains a long running piece of technical debt in the encrypted volume implementation as LUKSv1 does not directly encrypt data with the provided key.
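The idea of one passphrase per disk, stored in the key manager and referenced by UUID, can be sketched with an in-memory stand-in. Nothing here is the real key manager API: InMemoryKeyManager and create_disk_secrets are hypothetical names, and the passphrase length is an arbitrary choice for the example.

```python
import secrets
import uuid


class InMemoryKeyManager:
    """Toy stand-in for a key manager; keeps secrets in a dict."""

    def __init__(self):
        self._secrets = {}

    def store(self, passphrase):
        secret_uuid = str(uuid.uuid4())
        self._secrets[secret_uuid] = passphrase
        return secret_uuid

    def get(self, secret_uuid):
        return self._secrets[secret_uuid]


def create_disk_secrets(key_manager, disk_names):
    """Create one random passphrase per disk; return name -> secret UUID."""
    return {
        name: key_manager.store(secrets.token_urlsafe(32))
        for name in disk_names
    }
```

The returned UUIDs are what would be recorded as encryption_secret_uuid for each disk, while the passphrases themselves never leave the key manager.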

Each backend will then be modified to encrypt disks during nova.virt.libvirt.imagebackend.Image.create_image using the provided format, options and secret.

Enable the `COMPUTE_EPHEMERAL_ENCRYPTION_LUKS` trait

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

As discussed above the ephemeral encryption keys will be added to the disk_info for individual disks within the libvirt driver.

REST API impact

N/A

Security impact

This should hopefully be positive given the unique secret per disk and user visible choice regarding how their ephemeral storage is encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

Other deployer impact

N/A

Developer impact

Upgrade impact

The legacy implementation is deprecated but will continue to work for the time being. As the new implementation is separate there is no further upgrade impact.

Implementation

Assignee(s)

Primary assignee:: melwitt
Other contributors:: lyarwood

Feature Liaison

Feature liaison:: melwitt

Work Items

Populate the individual disk dicts within disk_info with any ephemeral encryption properties.
Provide these properties to the imagebackends when creating each disk.
Introduce support for LUKSv1 based encryption within the imagebackends.
Enable the COMPUTE_EPHEMERAL_ENCRYPTION_LUKS trait when the selected imagebackend supports LUKSv1.

Dependencies

Flavor and Image defined ephemeral storage encryption [1]

Testing

Documentation Impact

New user documentation around the specific LUKSv1 support for ephemeral encryption within the libvirt driver.
Reference documentation around the changes to the virt block device layer.
Document that for the raw imagebackend, both [libvirt]images_type = raw and [DEFAULT]use_cow_images = False must be configured in order for resize to work. This is also true without encryption but it may still be helpful to users.
Document that a user must have policy permission to create secrets in Barbican in order for encryption to work for that user. Secrets are created in Barbican using the user’s auth token. Admins have permission to create secrets in Barbican by default.

References

Revisions
Release Name	Description
Wallaby	Introduced
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed

Flavour and Image defined ephemeral storage encryption

Wed, 09 Nov 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/ephemeral-storage-encryption

Note

This spec will only cover the high level changes to the API and compute layers, implementation within specific virt drivers is left for separate specs.

Problem description

Use Cases

As a user I want to request that all of my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As a user I want to be able to pick how my ephemeral storage is encrypted at rest through the selection of a specific flavor or image.
As an admin/operator I want to either enforce ephemeral encryption per flavor or per image.
As an admin/operator I want to provide sane choices to my end users regarding how their ephemeral storage is encrypted at rest.
As a virt driver maintainer/developer I want to indicate that my driver supports ephemeral storage encryption using a specific encryption format.
As a virt driver maintainer/developer I want to provide sane default encryption format and options for users looking to encrypt their ephemeral storage at rest. I want these associated with the encrypted storage until it is deleted.

Proposed change

To enable this, new flavor extra specs, image properties and host configurables will be introduced. These will control when and how ephemeral storage encryption at rest is enabled for an instance.

Note

Separate image properties have been documented in the Glance image encryption and Cinder image encryption specs to cover how images can be encrypted at rest within Glance.

Allow ephemeral encryption to be configured by flavor, image or config

To enable ephemeral encryption per instance the following boolean based flavor extra spec and image property will be introduced:

hw:ephemeral_encryption
hw_ephemeral_encryption

The encryption format used will be controlled by the following flavor extra specs and image properties:

hw:ephemeral_encryption_format
hw_ephemeral_encryption_format

[ephemeral_storage_encryption]/default_format

The format will be provided as a string that maps to a BlockDeviceEncryptionFormatTypeField oslo.versionedobjects field value:

plain for the plain dm-crypt format
luks for the LUKSv1 format
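The string-to-format mapping can be illustrated with a standard library enum. This is a simplified stand-in for the oslo.versionedobjects enum and field named above, not their actual definition; the helper parse_format is hypothetical.

```python
import enum


class BlockDeviceEncryptionFormatType(str, enum.Enum):
    """Encryption formats accepted by the extra spec and image property."""
    PLAIN = "plain"  # plain dm-crypt format
    LUKS = "luks"    # LUKSv1 format


def parse_format(value):
    """Convert a user-supplied string into a format enum member."""
    try:
        return BlockDeviceEncryptionFormatType(value)
    except ValueError:
        allowed = ", ".join(f.value for f in BlockDeviceEncryptionFormatType)
        raise ValueError(
            f"unsupported encryption format {value!r}; expected one of: "
            f"{allowed}"
        ) from None
```

A flavor setting hw:ephemeral_encryption_format=luks would then resolve to BlockDeviceEncryptionFormatType.LUKS, and any other string is rejected.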

BlockDeviceMapping changes

The BlockDeviceMapping object will be extended to include the following fields encapsulating some of the above information per ephemeral disk within the instance:

encrypted: A simple boolean to indicate if the block device is encrypted. This will initially only be populated when ephemeral encryption is used but could easily be used for encrypted volumes as well in the future.
encryption_secret_uuid: As the name suggests this will contain the UUID of the associated encryption secret for the disk. The type of secret used here will be specific to the encryption format and virt driver used. It should not be assumed that this will always be a symmetric key, as is currently the case with all encrypted volumes provided by Cinder. For example, for luks based ephemeral storage this secret will be a passphrase.
encryption_format: A new BlockDeviceEncryptionFormatType enum and associated BlockDeviceEncryptionFormatTypeField field listing the encryption format. The available options will be kept in line with the constants currently provided by os-brick and could be merged in the future if both can share these types and fields.
encryption_options: A simple unversioned dict of strings containing encryption options specific to the virt driver implementation, underlying hypervisor and format being used.
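The four fields listed above can be pictured as the following plain dataclass. This is only a sketch of the data shape: the real change would add oslo.versionedobjects fields to the existing BlockDeviceMapping object, and the class name and defaults used here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class EphemeralEncryptionFields:
    """Hypothetical container mirroring the new BlockDeviceMapping fields."""
    # True when the block device is encrypted.
    encrypted: bool = False
    # UUID of the secret; the secret type depends on format and virt driver.
    encryption_secret_uuid: Optional[str] = None
    # One of the format strings described earlier, e.g. "plain" or "luks".
    encryption_format: Optional[str] = None
    # Unversioned, driver-specific options as a dict of strings.
    encryption_options: Dict[str, str] = field(default_factory=dict)
```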

Note

Encryption options could be exposed to end users in the future when a proper design which addresses security and handles all upgrade scenarios is developed.

Populate ephemeral encryption BlockDeviceMapping attributes during build

Use `COMPUTE_EPHEMERAL_ENCRYPTION` compatibility traits

COMPUTE_EPHEMERAL_ENCRYPTION_LUKS
COMPUTE_EPHEMERAL_ENCRYPTION_LUKSV2
COMPUTE_EPHEMERAL_ENCRYPTION_PLAIN

Introduce an ephemeral encryption request pre-filter
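One way such a pre-filter could work is to translate the request into required traits, so that only hosts reporting matching support are considered. The sketch below assumes that approach. The function name and its return type are hypothetical, and the real pre-filter would operate on a RequestSpec rather than on plain arguments.

```python
from typing import Optional, Set

# Format-specific traits listed in this spec.
KNOWN_FORMAT_TRAITS = {
    "COMPUTE_EPHEMERAL_ENCRYPTION_LUKS",
    "COMPUTE_EPHEMERAL_ENCRYPTION_LUKSV2",
    "COMPUTE_EPHEMERAL_ENCRYPTION_PLAIN",
}


def required_traits(
    encryption_requested: bool, encryption_format: Optional[str] = None
) -> Set[str]:
    # Return early so requests without ephemeral encryption pay no cost.
    if not encryption_requested:
        return set()
    traits = {"COMPUTE_EPHEMERAL_ENCRYPTION"}
    if encryption_format:
        trait = "COMPUTE_EPHEMERAL_ENCRYPTION_" + encryption_format.upper()
        if trait not in KNOWN_FORMAT_TRAITS:
            raise ValueError("unsupported format: %s" % encryption_format)
        traits.add(trait)
    return traits
```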

Expose ephemeral encryption attributes via block_device_info

root_device_name: The root device path used by the instance.
ephemerals: A list of DriverEphemeralBlockDevice dict objects detailing the ephemeral disks attached to the instance. Note this does not include the initial image based disk used by the instance that is classified as an ephemeral disk in terms of the ephemeral encryption feature.
block_device_mapping: A list of DriverVol*BlockDevice dict objects detailing the volume based disks attached to the instance.
swap: An optional DriverSwapBlockDevice dict object detailing the swap device.

For example:

{
    "root_device_name": "/dev/vda",
    "ephemerals": [
        {
            "guest_format": null,
            "device_name": "/dev/vdb",
            "device_type": "disk",
            "size": 1,
            "disk_bus": "virtio"
        }
    ],
    "block_device_mapping": [],
    "swap": {
        "swap_size": 1,
        "device_name": "/dev/vdc",
        "disk_bus": "virtio"
    }
}

Report that a disk is encrypted at rest through the metadata API

Extend the metadata API so that users can confirm that their ephemeral storage is encrypted at rest through the metadata API, accessible from within their instance.

{
    "devices": [
        {
            "type": "nic",
            "bus": "pci",
            "address": "0000:00:02.0",
            "mac": "00:11:22:33:44:55",
            "tags": ["trusted"]
        },
        {
            "type": "disk",
            "bus": "virtio",
            "address": "0:0",
            "serial": "12352423",
            "path": "/dev/vda",
            "encrypted": "True"
        },
        {
            "type": "disk",
            "bus": "ide",
            "address": "0:0",
            "serial": "disk-vol-2352423",
            "path": "/dev/sda",
            "tags": ["baz"]
        }
    ]
}
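From inside the guest, a user could check the result with something like the following sketch. It assumes the device metadata has already been fetched as a JSON string in the shape shown above. The helper name is hypothetical, and the string comparison reflects the "True" string value used in the example rather than a confirmed wire format.

```python
import json
from typing import List


def encrypted_disk_paths(metadata_json: str) -> List[str]:
    """Return the paths of disks that report themselves as encrypted."""
    devices = json.loads(metadata_json).get("devices", [])
    return [
        dev["path"]
        for dev in devices
        if dev.get("type") == "disk"
        # Accept both the string "True" shown above and a JSON boolean.
        and str(dev.get("encrypted", "")).lower() == "true"
    ]
```

Disks without an encrypted key, such as the volume-backed /dev/sda entry above, are treated as not encrypted.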

This should also be extended to cover disks provided by encrypted volumes but this is obviously out of scope for this implementation.

Block resize between flavors with different hw:ephemeral_encryption settings

Provide a migration path from the legacy implementation

New nova-manage and nova-status commands will be introduced to migrate any instances using the legacy libvirt virt driver implementation ahead of the removal of this in a future release.

The nova-manage command will ensure that any existing instances with ephemeral_key_uuid set will have their associated BlockDeviceMapping records updated to reference said secret key, the plain encryption format and configured options on the host before clearing ephemeral_key_uuid.

The nova-status command will simply report on the existence of any instances with ephemeral_key_uuid set that do not have the corresponding BlockDeviceMapping attributes enabled etc.
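The intent of the two commands can be sketched with plain dicts standing in for Instance and BlockDeviceMapping records. The function names are hypothetical, and selecting disks by a destination_type of "local" is an assumption about which records count as ephemeral. The real commands would work through Nova objects and the database.

```python
from typing import Dict, List


def needs_migration(instance: Dict) -> bool:
    # Roughly what the nova-status check would report on.
    return bool(instance.get("ephemeral_key_uuid"))


def migrate_legacy_instance(
    instance: Dict, bdms: List[Dict], host_options: Dict[str, str]
) -> bool:
    """Move the legacy key reference onto the local block devices."""
    key_uuid = instance.get("ephemeral_key_uuid")
    if not key_uuid:
        return False
    for bdm in bdms:
        # Assumption: only local (non-volume) disks are ephemeral.
        if bdm.get("destination_type") != "local":
            continue
        bdm.update(
            encrypted=True,
            encryption_secret_uuid=key_uuid,
            encryption_format="plain",
            encryption_options=dict(host_options),
        )
    # Clear the legacy field only after the records are updated.
    instance["ephemeral_key_uuid"] = None
    return True
```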

Deprecate the now legacy implementation

The legacy implementation within the libvirt virt driver will be deprecated for removal in a future release once the ability to migrate is in place.

Alternatives

Continue to use the transparent host configurables and expand support to other encryption formats such as LUKS.

Data model impact

See above for the various flavor extra spec, image property, BlockDeviceMapping and DriverBlockDevice object changes.

REST API impact

Flavor extra spec and image property validation will be introduced for any provided ephemeral encryption options.
Attempts to resize between flavors that differ in their ephemeral encryption options will be rejected.
Attempts to rebuild between images that differ in their ephemeral encryption options will be allowed.
The metadata API will be changed to allow users to determine if their ephemeral storage is encrypted as discussed above.
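The resize rule above amounts to comparing the ephemeral encryption extra specs of the two flavors. A minimal sketch follows, assuming the comparison is a plain equality check on every extra spec whose key starts with hw:ephemeral_encryption. The helper names are hypothetical, and no normalisation of boolean strings is attempted.

```python
from typing import Dict, Mapping

PREFIX = "hw:ephemeral_encryption"


def ephemeral_encryption_specs(extra_specs: Mapping[str, str]) -> Dict[str, str]:
    # Matches hw:ephemeral_encryption and hw:ephemeral_encryption_format.
    return {k: v for k, v in extra_specs.items() if k.startswith(PREFIX)}


def resize_allowed(old_specs: Mapping[str, str], new_specs: Mapping[str, str]) -> bool:
    # Unrelated extra specs are ignored; only encryption options matter.
    return ephemeral_encryption_specs(old_specs) == ephemeral_encryption_specs(new_specs)
```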

Security impact

This should have a positive security impact, given the unique secret per disk and the user-visible choice of how ephemeral storage is encrypted at rest.

Notifications impact

N/A

Other end user impact

Users will now need to opt-in to ephemeral storage encryption being used by their instances through their choice of image or flavors.

Performance Impact

The additional pre-filter will add a small amount of overhead when scheduling instances but this should fail fast if ephemeral encryption is not requested through the image or flavor.

The performance impact of increased use of ephemeral storage encryption by instances is left to be discussed in the virt driver specific specs as this will vary between hypervisors.

Other deployer impact

N/A

Developer impact

Virt driver developers will be able to indicate support for specific ephemeral storage encryption formats using the newly introduced compute compatibility traits.

Upgrade impact

The compute traits should ensure that requests to schedule instances using ephemeral storage encryption with mixed computes (N-1 and N) will work during a rolling upgrade.

Implementation

Assignee(s)

Primary assignee: melwitt
Other contributors: lyarwood

Feature Liaison

Feature liaison: melwitt

Work Items

Introduce hw_ephemeral_encryption* image properties and hw:ephemeral_encryption flavor extra specs.
Introduce new encrypted, encryption_secret_uuid, encryption_format and encryption_options attributes to the BlockDeviceMapping object.
Wire up the new BlockDeviceMapping object attributes through the Driver*BlockDevice layer and block_device_info dict.
Report ephemeral storage encryption through the metadata API.
Introduce new nova-manage and nova-status commands to allow existing users to migrate to this new implementation. These should however be blocked outside of testing until a virt driver implementation has landed.
Validate all of the above in functional tests ahead of any virt driver implementation landing.

Dependencies

None

Testing

At present without a virt driver implementation this will be tested entirely within our unit and functional test suites.

Once a virt driver implementation is available additional integration tests in Tempest and whitebox tests can be written.

Testing of the migration path from the legacy implementation will require an additional grenade job but this will require the libvirt virt driver implementation to be completed first.

Documentation Impact

The new host configurables, flavor extra specs and image properties should be documented.
New user documentation should be written covering the overall use of the feature from a Nova point of view.
Reference documentation around BlockDeviceMapping objects etc should be updated to make note of the new encryption attributes.

References

History

Optional section intended to be used each time the spec is updated to describe a new design, API or database schema update. Useful to let readers understand how the spec has changed over time.

Revisions
Release Name	Description
Wallaby	Introduced
Xena	Reproposed
Yoga	Reproposed
Zed	Reproposed
2023.1 Antelope	Reproposed

Per Process Healthcheck endpoints

Wed, 09 Nov 2022 00:00:00

https://blueprints.launchpad.net/nova/+spec/per-process-healthchecks

Problem description

Monitoring the health of a Nova service today requires the experience to develop and implement a series of external heuristics that infer the state of the service binaries.

The existing Oslo middleware does not address this problem statement because:

It can only be used by the API and metadata binaries
The middleware does not tell you the service is alive if it's hosted by a WSGI server like Apache, since the middleware is executed independently from the WSGI application. That is, the middleware can pass while nova-api can’t connect to the DB and is otherwise broken.
The Oslo middleware in detailed mode leaks info about the host kernel, Python version and hostname, which can be used to determine if the host is vulnerable to CVEs. This means it should never be exposed to the Internet. e.g.

platform: 'Linux-5.15.2-xanmod1-tt-x86_64-with-glibc2.2.5',
python_version: '3.8.12 (default, Aug 30 2021, 16:42:10) \n[GCC 10.3.0]'

Use Cases

As an operator, I want a simple REST endpoint I can consume to know if a Nova process is healthy.

As an operator I want this health check to not impact the performance of the service so it can be queried frequently at short intervals.

As an operator I would like to be able to use the health check of the Nova API and metadata services to manage the membership of endpoints in my load-balancer or reverse proxy automatically.

Proposed change

Definitions

TTL: The time interval for which a health check item is valid.

pass: all health indicators are passing and their TTLs have not expired.

warn: any health indicator has an expired TTL, or there is a partial transient failure.

fail: any health indicator is reporting an error or all TTLs are expired.
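The three definitions above can be expressed as a small summarising function. The sketch below is one possible reading, not the planned implementation. The Indicator class, its field names and the treatment of a per-indicator "warn" status as a partial transient failure are assumptions. An empty set of indicators is treated as pass here, which the definitions do not address.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Indicator:
    status: str        # "pass", "warn" or "fail"
    updated_at: float  # seconds, on the same clock as `now`
    ttl: float         # seconds for which this reading stays valid

    def expired(self, now: float) -> bool:
        return now - self.updated_at > self.ttl


def summarize(indicators: Iterable[Indicator], now: float) -> str:
    items = list(indicators)
    expired = [i.expired(now) for i in items]
    # fail: any indicator reports an error, or every TTL has expired.
    if any(i.status == "fail" for i in items) or (items and all(expired)):
        return "fail"
    # warn: at least one expired TTL or a partial transient failure.
    if any(expired) or any(i.status == "warn" for i in items):
        return "warn"
    # pass: everything is passing and within its TTL.
    return "pass"
```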

Warn vs fail

Services in the warn state are still considered healthy in most cases but they may be about to fail soon or be partially degraded.

Code changes

A new top-level Nova health check module will be created to encapsulate the common code and data structure required to implement this feature.

A new health check manager class will be introduced which will maintain the health-check state and all functions related to retrieving, updating and summarizing that state.

The health check manager will be responsible for creating the health check endpoint when it is enabled in the nova.conf and exposing the health check over HTTP.

e.g.

@healthcheck('database', [SQLAlchemyError])
def my_db_func(self):
    pass

@healthcheck('database', [SQLAlchemyError])
def my_other_db_func(self):
    pass

By default all exceptions will be caught and re-raised by the decorator.
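A decorator with that catch-and-re-raise behaviour might look like the following sketch. The module-level HEALTH_STATE dict is a hypothetical stand-in for the health check manager, and the recorded fields are illustrative only. The decorator signature follows the usage shown above.

```python
import functools
from typing import Dict, Iterable, Type

# Hypothetical stand-in for state owned by the health check manager.
HEALTH_STATE: Dict[str, Dict[str, str]] = {}


def healthcheck(name: str, exceptions: Iterable[Type[BaseException]] = (Exception,)):
    exc_types = tuple(exceptions)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except exc_types as exc:
                # Record the failure, then re-raise so callers see no change.
                HEALTH_STATE[name] = {"status": "fail", "output": str(exc)}
                raise
            HEALTH_STATE[name] = {"status": "pass"}
            return result
        return wrapper
    return decorator
```

Because the exception is re-raised unchanged, decorating a function does not alter its behaviour for existing callers; it only records the outcome as a side effect.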

If implemented, the etag will be incremented whenever the service state changes and will reset to 0 when the service is restarted.

Example output

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 200 OK
Content-Type: application/health+json
Cache-Control: max-age=3600
Connection: close

{
    "status": "pass",
    "version": "1.0",
    "serviceId": "e3c22423-cd7a-47dc-b6e9-e18d1a8b3bdf",
    "description": "nova-api",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks": {
        "message_bus": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "api_db": {"status": "pass", "time": "2021-12-17T16:02:55+00:00"}
    }
}

GET /health HTTP/1.1
Host: example.org
Accept: application/health+json

HTTP/1.1 503 Service Unavailable
Content-Type: application/health+json
Cache-Control: no-cache
Connection: close

{
    "status": "fail",
    "version": "1.0",
    "serviceId": "0a47dceb-11b1-4d94-8b9c-927d998be320",
    "description": "nova-compute",
    "notes": {"host": "controller-1.cloud", "hostname": "controller-1.cloud"},
    "checks":{
        "message_bus":{"status": "pass", "time": "2021-12-17T16:02:55+00:00"},
        "hypervisor":{
             "status": "fail", "time": "2021-12-17T16:05:55+00:00",
             "output": "Libvirt Error: ..."
        }
    }
}

Alternatives

Data model impact

The Nova context object will be extended to store a reference to the health check manager.

REST API impact

None

While this change will expose a new REST API endpoint it will not be part of the existing Nova API.

Security impact

Notifications impact

None

Other end user impact

None

At present, it is not planned to extend the Nova client or the unified client to query the new endpoint. cURL, socat, or any other UNIX socket or TCP HTTP client can be used to invoke the endpoint.

Performance Impact

None

Other deployer impact

A new healthcheck config section will be added to nova.conf.

A uri config option will be introduced to enable the health check functionality. The config option will be a string opt that supports a comma-separated list of URIs with the following format:

uri=<scheme>://[host:port|path],<scheme>://[host:port|path]

e.g.

[healthcheck]
uri=tcp://localhost:42424

[healthcheck]
uri=unix:///run/nova/nova-compute.sock

[healthcheck]
uri=tcp://localhost:42424,unix:///run/nova/nova-compute.sock
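To make the format concrete, here is a sketch of how such a value could be split into listen endpoints using the standard library. The function name and the returned tuple shape are hypothetical. Only the tcp and unix schemes shown above are handled.

```python
from typing import List, Tuple, Union
from urllib.parse import urlsplit

Endpoint = Tuple[str, Union[str, Tuple[str, int]]]


def parse_healthcheck_uris(value: str) -> List[Endpoint]:
    endpoints: List[Endpoint] = []
    for raw in value.split(","):
        parts = urlsplit(raw.strip())
        if parts.scheme == "tcp":
            # parts.port raises ValueError if the port is not in 0-65535.
            endpoints.append(("tcp", (parts.hostname, parts.port)))
        elif parts.scheme == "unix":
            # unix:///run/x.sock has an empty netloc and an absolute path.
            endpoints.append(("unix", parts.path))
        else:
            raise ValueError("unsupported healthcheck URI: %r" % raw)
    return endpoints
```

Validating the port while parsing means a misconfigured value is rejected at startup rather than when the socket is bound.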

Developer impact

Upgrade impact

None

Implementation

Assignee(s)

Primary assignee: sean-k-mooney
Other contributors: melwitt

Feature Liaison

Feature liaison: sean-k-mooney

Work Items

Add new module
Introduce decorator
Extend context object to store a reference to health check manager
Add config options
Expose TCP endpoint
Expose UNIX socket endpoint support
Add docs

Dependencies

None

Testing

This can be tested entirely with unit and functional tests, however, Devstack will be extended to expose the endpoint and use it to determine whether the Nova services have started.

Documentation Impact

The config options will be documented in the config reference and a release note will be added for the feature.

References

Yoga PTG topic:
https://etherpad.opendev.org/p/r.e70aa851abf8644c29c8abe4bce32b81#L415

History

Revisions
Release Name	Description
Yoga	Introduced
2023.1 Antelope	Reproposed