Ironic Shards

https://blueprints.launchpad.net/nova/+spec/ironic-shards

Problem description

Nova’s Ironic driver involves a single nova-compute service managing many compute nodes, where each compute node record maps to an Ironic node. Some deployments support 1000s of ironic nodes, but a single nova-compute service is unable to manage 1000s of nodes and 1000s of instances.

Currently we support setting a partition key, where nova-compute only cares about a subset of ironic nodes: those associated with a specific conductor group. However, some conductor groups can be very large, served by many ironic-conductor services.

To help with this, Nova has attempted to dynamically spread ironic nodes between a set of nova-compute peers. While this works some of the time, there are some major limitations:

  • when one nova-compute is down, only unassigned ironic nodes can move to another nova-compute service

  • i.e. when one nova-compute is down, any ironic node whose nova instance is associated with the down nova-compute service cannot be managed, e.g. a reboot request will fail

  • moreover, when the old nova-compute comes back up, which might take some time, there are lots of bugs as the hash ring slowly rebalances. In part this is because every nova-compute fetches all nodes; in a large enough cloud, this can take over 24 hours.

This spec is about changing the way we shard Ironic compute nodes. We need to stop violating deep assumptions in the compute manager code by moving to more static ironic node partitions.

Use Cases

Any users of the ironic driver that have more than one nova-compute service per conductor group should move to an active-passive failover mode.

The new static sharding will be of particular interest for clouds with ironic conductor groups that contain more than around 1000 baremetal nodes.

Proposed change

We add a new configuration option:

  • [ironic] shard_key

By default, there will be no shard_key set, and we will continue to expose all ironic nodes from a single nova-compute process. Mostly, this is to keep things simple for smaller deployments, i.e. when you have less than 500 ironic nodes.

When the operator sets a shard_key, the nova-compute process will use the shard_key when querying the list of nodes from Ironic. We must never try to list all Ironic nodes when the Ironic shard key is defined in the config.
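For example, a nova-compute process that should only manage the ironic nodes tagged with a particular shard might be configured as follows (a minimal sketch; the shard value rack1 is purely illustrative):

  [ironic]
  # only manage ironic nodes whose shard key is "rack1"
  shard_key = rack1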

When we look up a specific ironic node via a node uuid or instance uuid, we should not restrict that to either the shard key or conductor group.

Similar to checking the instance uuid is still present on the Ironic node before performing an action, or ensuring there is no instance uuid before provisioning, we should also check the node is in the correct shard (and conductor group) before doing anything with that Ironic node.

Config changes and Deprecations

We will keep the option to target a specific conductor group, but this option will be renamed from partition_key to conductor_group. This is additive to the shard_key above: when both are configured, the target ironic nodes are those in both the correct shard_key and the correct conductor_group.
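For example, to have a nova-compute process manage only the nodes that are in both a particular conductor group and a particular shard, the two options are set together (a sketch with illustrative values):

  [ironic]
  # formerly partition_key
  conductor_group = eu-west-1
  # only nodes in this conductor group AND this shard are managed
  shard_key = rack1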

We will deprecate the use of the peer_list. We should log a warning when the hash ring is actually being used, i.e. when more than one member has been added to the hash ring.

In addition, the logic that tries to automatically move Compute Nodes must never run unless the peer_list has more than one member. More details in the data model impact section.

When deleting a ComputeNode object, we need to have the driver confirm that it is safe. In the case of Ironic we will check whether the configured Ironic has a node with that uuid, searching across all conductor groups and all shard keys. When the ComputeNode object is not deleted, we should not delete the entry in placement.

nova-manage move ironic node

We will create a new nova-manage command:

nova-manage ironic-compute-node-move <ironic-node-uuid> \
    --service <destination-service>
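
For example, to move a single node to the nova-compute service whose [DEFAULT]host value is ironic-shard-1 (both the uuid and the service name below are purely illustrative):

nova-manage ironic-compute-node-move 123e4567-e89b-12d3-a456-426614174000 \
    --service ironic-shard-1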

This command will do the following:

  • Find the ComputeNode object for this ironic-node-uuid

  • Error if the ComputeNode type does not match the ironic driver.

  • Find the related Service object for the above ComputeNode (i.e. the host)

  • Error if the Service object is not reported as down and has not also been put into maintenance. We do not require forced down, because we might only be moving a subset of the nodes associated with this nova-compute service.

  • Check the Service object for the destination service host exists

  • Find all non-deleted instances for this (host,node)

  • Error if more than one non-deleted instance is found. It is OK if we find zero or one instance.

  • In one DB transaction: move the ComputeNode object to the destination service host and move the Instance (if there is one) to the destination service host

The above tool is expected to be used as part of the wider process of migrating from the old peer_list to the new shard key. There are two key scenarios (although the tool may help operators recover from other issues as well):

  • moving from a peer_list to a single nova-compute

  • moving from peer_list to shard_key, while keeping multiple nova-compute processes (for a single conductor group)

Migrate from peer_list to single nova-compute

Small deployments (i.e. less than 500 ironic nodes) are recommended to move from a peer_list of, for example, three nova-compute services, to a single nova-compute service. On failure of the nova-compute service, operators can either manually start the process on a new host, or use an automatic active-passive HA scheme.

The process would look something like this:

  • ironic and nova both default to an empty shard key, such that all ironic nodes are in the same default shard

  • start a new nova-compute service running the ironic driver, ideally with a synthetic value for [DEFAULT]host, e.g. ironic. This will log warnings about the need to use the nova-manage migration tool before it is able to manage any nodes

  • stop all existing nova-compute services

  • mark them as forced-down via the API

  • Now loop over all ironic nodes and, assuming your nova-compute service has a host value of just ironic, call the following for each one (see the sketch below): nova-manage ironic-compute-node-move <uuid> --service ironic
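
A minimal sketch of that loop, assuming the openstack baremetal CLI is available and the new service uses a host value of ironic (the exact listing command is an assumption for illustration, not part of this spec):

  # move every ironic node's ComputeNode record to the service "ironic"
  for uuid in $(openstack baremetal node list -f value -c UUID); do
      nova-manage ironic-compute-node-move "$uuid" --service ironic
  done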

The periodic tasks in the new nova-compute service will gradually pick up the new ComputeNodes, and it will start being able to receive commands such as reboot for all the moved instances.

You could start the new nova-compute service after having migrated all the ironic compute nodes, but that would lead to higher downtime during the migration.

Migrate from peer_list to shard_key

The process to move from the hash ring based peer_list to the static shard_key from Ironic is very similar to the above process:

  • Set the shard_key on all your ironic nodes, such that you can spread the nodes out between your nova-compute processes.

  • Start your new nova-compute processes, one for each shard_key, possibly setting a synthetic [DEFAULT]host value that matches the shard key (e.g. my_shard_key).

  • Shut down all the older nova-compute processes with [ironic]peer_list set

  • Mark those older services as in maintenance via the Nova API

  • For each shard_key in Ironic, work out which service host you mapped it to above, then run this for each ironic node uuid in the shard (see the sketch after this list): nova-manage ironic-compute-node-move <uuid> --service my_shard_key

  • Delete the old services via the Nova API, now that there are no instances or compute nodes on those services
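
A sketch of the per-shard loop, assuming the Ironic CLI gains a --shard filter as proposed in the linked Ironic spec, and that each new nova-compute service reuses its shard key as the [DEFAULT]host value (all names illustrative):

  # move every node in shard "my_shard_key" to the matching service
  for uuid in $(openstack baremetal node list --shard my_shard_key -f value -c UUID); do
      nova-manage ironic-compute-node-move "$uuid" --service my_shard_key
  done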

While you could start the new nova-compute services after the migration, that would lead to a slightly longer downtime.

Adding new compute nodes

In general, there is no change when adding nodes into existing shards.

Similarly, you can add a new nova-compute process for a new shard and then start to fill that up with nodes.

Move an ironic node between shards

When removing nodes from ironic at the end of their life, or adding large numbers of new nodes, you may need to rebalance the shards.

To move some ironic nodes, you need to move the nodes in groups associated with a specific nova-compute process. For each nova-compute process and the associated ironic nodes you want to move to a different shard, you need to:

  • Shut down the affected nova-compute process

  • Put the nova-compute service into maintenance

  • In the Ironic API, update the shard key on the Ironic node

  • Now move each ironic node to the correct new nova-compute process for the shard key it was moved into (see the sketch after this list): nova-manage ironic-compute-node-move <uuid> --service my_shard_key

  • Now unset maintenance mode for the nova-compute service, and start that service back up
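
A sketch of the per-node steps, again assuming an Ironic CLI --shard option as proposed in the linked Ironic spec, with rack2 as both the destination shard and the matching nova-compute service name (both illustrative):

  # retag the node in Ironic, then move its ComputeNode record
  openstack baremetal node set <node-uuid> --shard rack2
  nova-manage ironic-compute-node-move <node-uuid> --service rack2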

Move shards between nova-compute services

To move a shard between nova-compute services, you need to replace the nova-compute process with a new one:

  • ensure the destination nova-compute is configured with the shard you want to move, and is running

  • stop the nova-compute process currently serving the shard

  • force-down the service via the API

  • for each ironic node uuid in the shard, call nova-manage to move it to the new nova-compute process (see the example below)
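
For example, assuming the outgoing service runs on host old-ironic-host (an illustrative name), the force-down step might look like the following, after which the same per-node nova-manage loop shown earlier can be reused against the destination service:

  # mark the old service as forced down so the moves are allowed
  openstack compute service set --down old-ironic-host nova-compute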

Alternatives

We could require nova-compute processes to be explicitly forced down before allowing nova-manage to move the ironic nodes about, in a similar way to evacuate. But this creates problems when trying to re-balance shards as you remove nodes at the end of their life.

We could consider a list of shard keys, rather than a single shard key per nova-compute. But for this first version, we have chosen the simpler path, that appears to have few limitations.

We could attempt to keep fixing the hash ring recovery within the ironic driver, but it is very unclear what will break next due to all the deep assumptions made about the nova-compute process. The specific assumptions include:

  • when nova-compute breaks, it is usually because the hypervisor hardware has broken, which takes down all the nova servers running on it.

  • all locking and management of a nova server object is done by the currently assigned nova-compute node, and this is only ever changed by explicit move operations like resize, migrate, live-migration and evacuate. As such we can use simple local locks to ensure concurrent operations don’t conflict, along with DB state checking.

Data model impact

A key thing we need to ensure is that ComputeNode objects are only automatically moved between service objects when in legacy hash ring mode. Currently, this only happens for unassigned ComputeNodes.

In this new explicit shard mode, only nova-manage is able to move ComputeNode objects. In addition, nova-manage will also move associated instances. However, similar to evacuate, this will only be allowed when the currently associated service is forced down.

Note, this applies when a nova-compute service finds a ComputeNode that it should own, but the Nova database says it is already owned by a different service. In this scenario, we should log a warning to the operator to ensure they have migrated that ComputeNode from its old location before this nova-compute service is able to manage it.

In addition, we should ensure we only delete a ComputeNode object when the driver explicitly says it is safe to delete. In the case of the Ironic driver, we should ensure the node no longer exists in Ironic, being sure to search across all shards.

This is all closely related to this spec on robustifying the ComputeNode and Service object relationship: https://review.opendev.org/c/openstack/nova-specs/+/853837

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

Users will experience a more reliable Ironic and Nova integration.

Performance Impact

It should help users more easily support large ironic deployments integrated with Nova.

Other deployer impact

We will rename the “partition_key” configuration option to be explicitly “conductor_group”.

We will deprecate the peer list key. When nova-compute starts up and sees the peer_list set, we emit a warning about the bugs in this legacy auto sharding and recommend moving to the explicit sharding.

There is a new shard_key config option, as described above.

There is a new nova-manage CLI command to move Ironic compute nodes from a forced-down nova-compute service to a new one.

Developer impact

None

Upgrade impact

For those currently using peer_list, we need to document how they can move to the new sharding approach.

Implementation

Assignee(s)

Primary assignee:

JayF

Other contributors:

johnthetubaguy

Feature Liaison

Feature liaison: None

Work Items

  • rename conductor group partition key config

  • deprecate peer_list config, with warning log messages

  • add compute node move and delete protections, when peer_list not used

  • add new shard_key config, limit ironic node list using shard_key

  • add nova-manage tool to move ironic nodes between compute services

  • document operational processes around above nova-manage tool

Dependencies

The deprecation of the peer list can happen right away.

But the new sharding depends on the Ironic shard key getting added: https://review.opendev.org/c/openstack/ironic-specs/+/861803

Ideally we add this into Nova after the robustify compute node series has landed: https://review.opendev.org/c/openstack/nova/+/842478

Testing

We need some functional tests for the nova-manage command to ensure all of the safety guards work as expected.

Documentation Impact

A lot of documentation is needed for the Ironic driver on the operational procedures around the shard_key.

References

None

History

Revisions

Release Name       Description

2023.1 Antelope    Introduced