Use conductor groups to partition nova-compute services for Ironic

https://blueprints.launchpad.net/nova/+spec/ironic-conductor-groups

Use ironic’s conductor group feature to limit the subset of nodes which a nova-compute service will manage. This allows nova-compute services to be partitioned to a particular location (building, aisle, rack, etc.), and provides a way for operators to manage the failure domain of a given nova-compute service.

Problem description

As OpenStack deployments become larger, and edge compute becomes a reality, there is a desire to be able to co-locate the nova-compute service with some subset of ironic nodes.

There is also a desire to be able to reduce the failure domain of a nova-compute service, and to be able to make the failure domain more predictable in terms of which ironic nodes can no longer be scheduled to.

Use Cases

Operators managing large and/or distributed ironic environments need more control over the failure domain of a nova-compute service.

Proposed change

A configuration option, partition_key, will be added to tell the nova-compute service which conductor_group (an ironic-ism) it is responsible for managing. This will be used as a filter when querying the list of nodes from ironic, so that only the subset of nodes whose conductor_group matches the partition_key is returned.
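As a rough illustration (not the final implementation), the driver's node listing could pass the configured group through to ironic as a filter. The helper name below, and the assumption that the ironic client's node.list call accepts a conductor_group keyword argument, are both illustrative:

    import nova.conf

    CONF = nova.conf.CONF


    def _get_node_list(ironic_client, **kwargs):
        # Illustrative sketch: when the proposed [ironic]/partition_key
        # option is set, ask ironic for only the nodes in that conductor
        # group; otherwise list every node, as the driver does today.
        partition_key = CONF.ironic.partition_key
        if partition_key is not None:
            kwargs['conductor_group'] = partition_key
        return ironic_client.node.list(**kwargs)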

As nova-compute services use a hash ring to further partition the set of nodes that each service manages, we need a mechanism to tell a service which other compute services are managing the same partition_key. To do this, we will add another configuration option, peer_list, a comma-separated list of hostnames of the other compute services managing the same subset of nodes. If set, this list will be used instead of the current behavior, which fetches the list of all compute services running the ironic driver from the database. To ensure that the hash ring splits nodes only between currently running compute services, we will check peer_list against the database and filter out any inactive services (i.e., services that have not checked in recently).
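A sketch of how the hash ring membership could be computed under this proposal; the helper below, the use of ServiceList.get_by_binary as the database lookup, and the treatment of peer_list as an already-parsed list of hostnames are assumptions for illustration only:

    import nova.conf
    from nova import context as nova_context
    from nova import objects
    from nova import servicegroup

    CONF = nova.conf.CONF


    def _hash_ring_members():
        # Illustrative sketch: which compute hosts should share this
        # service's hash ring.
        ctxt = nova_context.get_admin_context()
        # Compute services from the database (in practice this would be
        # narrowed to those running the ironic driver).
        services = objects.ServiceList.get_by_binary(
            ctxt, 'nova-compute', include_disabled=False)
        servicegroup_api = servicegroup.API()
        live_hosts = {svc.host for svc in services
                      if servicegroup_api.service_is_up(svc)}

        if CONF.ironic.partition_key is None:
            # Feature disabled: behave as today, all live computes are peers.
            return live_hosts
        # Feature enabled: only the configured peers that are actually live,
        # so the ring never includes a host that has stopped checking in.
        return set(CONF.ironic.peer_list) & live_hosts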

partition_key will default to None. If the value is None, this functionality will be disabled and the behavior will be the same as before: all nodes are eligible to be managed by the compute service, and all compute services are considered peers. Any other value will enable the feature, limiting the compute service to nodes in the conductor group matching partition_key and using the peer_list configuration option to determine the list of peers.

Both options will be added to the [ironic] config group and will be “mutable”, meaning that only a SIGHUP is required to update a running service with new values.
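For example, a compute service meant to manage only the nodes in a conductor group named rack1, alongside two other compute hosts managing the same group, might be configured like this (the group name and hostnames are illustrative):

    [ironic]
    partition_key = rack1
    peer_list = compute-rack1-a,compute-rack1-b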

Alternatives

Ideally, we wouldn’t need a peer_list configuration option at all, since it is prone to operator mistakes; instead, we would fetch the list of peers dynamically from the database.

One option to do this is to add a field to the compute service record to store the partition key. Compute services running the ironic driver could then use this field to determine their peer list. During the Stein PTG discussion about this feature, we agreed not to do this, as adding driver-specific fields or blobs to the service record is a layering violation.

Another option is for the ironic driver to manage its own list of live services in something like etcd, and to determine the peer list from there. This also feels like a layering violation, and requiring an etcd cluster just for a particular driver would be confusing at best from an operator's point of view.

Data model impact

None.

REST API impact

None.

Security impact

None.

Notifications impact

None.

Other end user impact

None.

Performance Impact

Using this feature slightly improves the performance of the resource tracker update. Instead of iterating over the list of all ironic nodes to determine which it should manage, the compute service only iterates over the subset of nodes in its conductor group.

Other deployer impact

The two configuration options mentioned above are added, but are optional. The feature isn’t enabled unless partition_key is set.

It’s worth noting what happens when a node’s conductor group changes. If the node has an instance, it continues to be managed by the compute service responsible for that instance, just as we handle hash ring rebalancing today. If the node does not have an instance, it will be picked up by a compute service managing the new group during the first resource tracker run after the change.
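The ownership rule described above amounts to something like the following check, where ring_hosts (the hosts the hash ring maps the node to) and local_instance_uuids (instances already assigned to this host) are assumed to be computed elsewhere; this is a sketch of the intent rather than the driver's actual code:

    def _manages_node(node, my_host, ring_hosts, local_instance_uuids):
        # A node that already hosts one of our instances stays with us,
        # even if its conductor group (and therefore its hash ring) has
        # changed.
        if node.instance_uuid and node.instance_uuid in local_instance_uuids:
            return True
        # An unoccupied node simply follows the hash ring built for the
        # conductor group it now belongs to.
        return my_host in ring_hosts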

Developer impact

None.

Upgrade impact

None.

Implementation

Assignee(s)

Primary assignee:

jroll

Work Items

  • Add the configuration options and the new code paths.

  • Add functional tests to ensure that the compute services manage the correct subset of nodes when this is enabled.

  • Add documentation for deployers and operators.

Dependencies

None.

Testing

This will need to be covered by functional tests, as properly exercising the feature requires spinning up at least three nova-compute services. While that is possible in integration tests, it would not be a good use of CI resources.

Documentation Impact

Deployer and operator documentation will need updates.

References

This feature and its implementation were roughly agreed upon during the Stein PTG. See line 662 or so (at the time of this writing): https://etherpad.openstack.org/p/nova-ptg-stein

History

Revisions

Release Name    Description
Stein           Introduced