Shard Key Introduction
https://storyboard.openstack.org/#!/story/2010378
After much discussion and repeated attempts to remedy the scalability issues with nova-compute and its connection to Ironic in large scale deployments, and upon newly discovered indicators of networking-baremetal having a similar scaling issue, the community has started to reach an agreement on a path forward. Specifically, to introduce a sharding model which would allow API consumers to map and lock on to specific sets of baremetal nodes, regardless of whether the relationship is semi-permanent or entirely situational. Only the consumer of the information performing the processing can make that determination, and it is up to Ironic to provide the substrate capabilities to operate efficiently against its API.
Problem description
The reality is that Ironic can be used at some absurd scales, in the hundreds of thousands of baremetal nodes. While most operators of Ironic run multiple smaller, distinct Ironic deployments with fewer than 500 physical machines each, some need a single deployment with thousands or tens of thousands of physical nodes. At increased scales, external services polling Ironic generally struggle to keep up. It is also easy for misconfigurations to degrade performance, because the scaling model and its limits are difficult to understand.
This is observable with the operation of Nova’s Compute process when running the nova.virt.ironic driver. It is operationally easy to get into situations where one is attempting to support thousands of baremetal nodes with too few nova-compute processes. This specific situation leads to each process attempting to take on more work than it was designed to handle.
Recently we discovered a case, while rooted in misconfiguration, where the same basic scaling issue exists with networking-baremetal, which is responsible for polling and updating physical network mappings in Neutron. It is the same basic pattern: a huge amount of work spread across multiple processes. In this specific case, multiple (3) Neutron services were stressing the Ironic API by retrieving all of the nodes and attempting to update all of the related physical network mapping records in Neutron, resulting in the same record being updated three times, once from each service.
The root issue is that the software consuming Ironic’s data needs to be able to self-delineate the overall node set and determine the local separation points for sharding the nodes. The delineation is required because the processes executed are far more processor intensive, which can introduce latency and lag that lead towards race conditions.
The challenge with the previously attempted model is that it required downloading the entire data set in order to build a hash ring from it.
Where things are also complicated is that Ironic has an operational model of a conductor_group, which is intended to help model a physical grouping or operational constraint. The challenge here is that conductor groups are not automatic in any way, shape, or form. As a result, conductor groups are not the solution we need here.
Proposed change
Overall, the idea is to introduce a shard field on the node object, which an API user (service) can utilize to retrieve a subset of nodes. This new field on the node object would be in line with existing API field behavior constraints and can be set via the API. We can provide a means to pre-set the shard, but ultimately it is still optional for Ironic, and the shard exists for the API consumer’s benefit.
In order to facilitate usage by an API client, /v1/nodes and /v1/ports would be updated to accept a shard parameter (i.e. GET /v1/nodes?shard=foo, GET /v1/ports?shard=foo, GET /v1/portgroups?shard=foo) in the query to allow API consumers to automatically scope-limit their data set and self-determine how to reduce the workset. For example, networking-baremetal may not care about assignment; it just needs to reduce the localized workset. Whereas nova-compute needs the shard field to remain static, unless nova-compute or some other API consumer were to request the shard to be updated on a node.
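As an illustration of the intended consumer pattern, the following is a minimal sketch using python requests; the endpoint URL, token, and microversion value shown are assumptions for illustration, not final values::

  # A minimal sketch of a consumer scope-limiting its node set by shard.
  import requests

  IRONIC = "https://ironic.example.com:6385"
  HEADERS = {
      "X-Auth-Token": "<keystone-token>",
      # The exact microversion is decided at implementation time.
      "X-OpenStack-Ironic-API-Version": "1.82",
  }

  # Retrieve only the nodes in the shard(s) this service is responsible for.
  resp = requests.get(f"{IRONIC}/v1/nodes",
                      params={"shard": "rack1,rack2"},
                      headers=HEADERS)
  resp.raise_for_status()
  for node in resp.json()["nodes"]:
      print(node["uuid"], node.get("shard"))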
Note
The overall process consumers use today is to retrieve everything and then limit the scope of work based upon the contents of the result set. This results in a large overhead of work and increased looping latency, which can also encourage race conditions. Both nova-compute and the networking-baremetal ML2 plugin operate in this way with different patterns of use. The advantage of the proposed solution is to enable the scope limiting/grouping into manageable chunks.
In terms of access controls, we would also add a new RBAC policy to restrict changes such that only the system itself or an appropriately scoped (i.e. administrative) user can change the field.
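A hedged sketch of what such a rule could look like with oslo.policy follows; the rule name and default check string are assumptions, not the final policy::

  # Illustrative policy rule guarding writes to the shard field.
  from oslo_policy import policy

  node_shard_policies = [
      policy.DocumentedRuleDefault(
          name='baremetal:node:update:shard',   # assumed name
          check_str='role:admin',               # assumed default
          description='Set or change the shard on a node',
          operations=[{'path': '/nodes/{node_ident}', 'method': 'PATCH'}],
      ),
  ]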
In this model, conductors do not care about the shard key; it is only a data storage field on the node. Lookups of the overall shard composition/layout, for GET /v1/shards, are to be performed directly against the nodes table using a SQL query.
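A minimal sketch of that aggregate query, assuming SQLAlchemy models similar to Ironic’s (names illustrative)::

  import sqlalchemy as sa

  def get_shard_counts(session, nodes_table):
      """Return (shard, node_count) pairs straight from the nodes table."""
      query = (
          sa.select(nodes_table.c.shard, sa.func.count().label("count"))
          .group_by(nodes_table.c.shard)
      )
      # Nodes with no shard set are returned with a NULL shard value.
      return session.execute(query).all()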
Alternatives
This is a complex solution to allow simplified yet delineated usage, and there are numerous other options for specific details. Ultimately, each item should be discussed and considered.
One key aspect, which has been recognized thus far, is that existing mechanisms can be inefficiently leveraged to achieve this. An example of this is that conductor_group, owner, and lessee all allow for filtering of the node result set. A conductor_group is an explicit aspect an API client can request, whereas owner and lessee are access control based filters tied to the API client’s submitted Project ID used for authentication. More information on why conductor_group is problematic is further on in this document.
Consensus in discussion with the Nova team seems to be that usage of the other fields, while useful in part and possibly even preferred in some limited and specific cases, doesn’t solve the general need to allow clients to self-delineate without first downloading the entire node list. The act of retrieving a complete list of nodes is itself a known scaling challenge, and it creates increased processing latency.
In the conductor_group case, there is no current way to discover the conductor groups, whereas owner and lessee are specific project ID value fields.
Why not Conductor Group?
It is important to stress that, similarity wise, this proposal is close to conductor groups; however, conductor groups were primarily purposed to model the physical constraints and structure of the baremetal infrastructure.
For example, if you have a set of conductors in Europe and a set of conductors in New York, you don’t want to try to run a deploy for servers in New York from Europe. Part of the attractiveness of exposing or using this in Nova was also to align with the physical structure. The immediately recognized bonus to operators was that the list of nodes was limited to the running nova-compute process, if so configured. It is known to the Ironic community that some infrastructure operators have utilized this setting and field to facilitate scaling of their nova-compute infrastructure; however, these operators have also encountered issues with this use pattern which we hope to avoid with a shard key implementation.
Where the needs differ between this effort and the pre-existing conductor groups is that conductor groups are part of the hash ring modeling behavior, whereas in the shards model conductors will operate without consideration of the shard key value. We need disjointed modeling to support API consumer centric usage, so consumers can operate in logical units with distinct selections of work.
Consumers may also care about the conductor_group in addition to the shard, because needing to geographically delineate is separate from needing smaller “chunks” of work, or in this case the “groups of baremetal nodes” for which a running process is responsible.
In this specific case, conductor_group is an entirely manually managed aspect, for which Nova has a separate setting name due to name perception reasons, and our hope is ultimately for something that is both simple and smart.
Note
The Nova project has agreed during Project Teams Gathering meetings to deprecate the peer_list parameter it previously forced use of to support conductor groups with the hash ring logic.
On top of this, today’s conductor_group functionality is reliant upon the hash ring model of use, which is something the Nova team wants to see removed from the Nova codebase in the next several development cycles. Ironic, by contrast, will continue to use the hash ring functionality for managing our conductors’ operating state, as it is also modeled for conductors to manage thousands of nodes. Those thousands of nodes just do not scale well into nova-compute services.
Why not owner or lessee?
With the RBAC model improvements which have taken place over the last few years, it is entirely possible to manage separate projects and credentials for a nova-compute to exist and operate within. The challenge here is the management of additional credentials and the mappings/interactions.
It might be “feasible” to do the same for scaling networking-baremetal interactions with Ironic’s API, but the overhead and self management of node groupings seems onerous and error prone.
Also, if this path were taken, it would be administratively prohibitive for nova-compute nodes, and they would be locked to the manual settings.
What if we just let the API consumer figure it out?
This could be an option, but it would lead to worse performance and a worse user experience.
The base conundrum is how to orderly and efficiently enumerate through, and then act upon, each and every node the API client is responsible for interacting with.
Today, Nova’s Compute service enumerates through every node using a list generated from one query, and it gets most of the data it needs to track/interact with a node, keeping the more costly single node requests to a minimum. If that client had to track things itself, it would still have to pull a full list, and then it would have to reconcile, track, and map individual nodes. We’ve already seen this approach not working with a hash ring today.
Similarly, networking-baremetal lists all ports. That is all it needs, but it has no concept of smaller chunking, blocks, or even enough information to really make a hash ring which would represent existing models. To just expect the client to “figure it out” and to “deal with that complexity” also means logic far away from the database. And for performance, the closer we can keep logic and decisions to an indexed column, the better and more performant, which is why the proposed solution has come forth.
Data model impact
- Node: Addition of a shard column/value string field, indexed, with a default value of None. This field is considered to be case sensitive, which is in line with the DB storage type. API queries would seek exact field value matches.
Note
We will need to confer with the Nova team on the nova.virt.ironic driver query pattern, to ensure we cover any compound indexes, if needed.
To facilitate this, database migrations and data model sanity checking will need to be added to ironic-status as part of the upgrade checks.
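Assuming Ironic’s alembic-based migrations, the column addition might look like the following sketch; revision details and the index name are illustrative assumptions::

  from alembic import op
  import sqlalchemy as sa

  def upgrade():
      # Nullable string column; nodes default to no shard (NULL/None).
      op.add_column('nodes', sa.Column('shard', sa.String(255),
                                       nullable=True))
      # Indexed so that API-layer shard filters stay cheap at scale.
      op.create_index('shard_idx', 'nodes', ['shard'])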
State Machine Impact
None
REST API impact
PATCH /v1/nodes/<node>
In order to set a shard value, a user will need to patch the field. This is canned functionality of the existing nodes controller, and will be API version and RBAC policy guarded in order to prevent inappropriate changes to the field once set. Like all other fields, this operation takes the shape of a JSON Patch.
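For example, a client might set the shard with a standard JSON Patch request; the token and microversion below are illustrative assumptions::

  import requests

  # JSON Patch to set the shard field on a node.
  patch = [{"op": "add", "path": "/shard", "value": "rack1"}]
  resp = requests.patch(
      "https://ironic.example.com:6385/v1/nodes/<node>",
      json=patch,
      headers={
          "X-Auth-Token": "<keystone-token>",
          "X-OpenStack-Ironic-API-Version": "1.82",  # illustrative
      },
  )
  resp.raise_for_status()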
GET /v1/nodes?shard=VALUE,VALUE2,VALUE3
Returns a subset of nodes limited by shard key. In this specific case, we will also allow a string value of “none”, “None”, or “null” to be utilized to retrieve a list of nodes which do not have a shard key set. Logic to handle that would be in the DB API layer.
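A hedged sketch of that DB API layer handling, assuming a SQLAlchemy query over a node model (names illustrative)::

  from sqlalchemy import or_

  def filter_by_shard(query, node_model, shards):
      """Filter nodes by shard, mapping 'none'/'null' to NULL shards."""
      wants_unsharded = any(s.lower() in ('none', 'null') for s in shards)
      values = [s for s in shards if s.lower() not in ('none', 'null')]
      criteria = []
      if values:
          criteria.append(node_model.shard.in_(values))
      if wants_unsharded:
          criteria.append(node_model.shard.is_(None))
      return query.filter(or_(*criteria))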
GET /v1/ports?shard=VALUE,VALUE2,VALUE3
GET /v1/portgroups?shard=VALUE,VALUE2,VALUE3
Returns a subset of ports (or port groups), limited by the shard key or list of keys provided by the caller. Specifically, this would utilize a query joined to the nodes table to facilitate it.
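A minimal sketch of such a joined query, assuming SQLAlchemy models resembling Ironic’s Port and Node (illustrative only)::

  def ports_by_shard(session, port_model, node_model, shards):
      # Ports carry no shard themselves; join to the owning node's shard.
      return (
          session.query(port_model)
          .join(node_model, port_model.node_id == node_model.id)
          .filter(node_model.shard.in_(shards))
          .all()
      )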
GET /v1/shards
Returns a JSON document representing the shard keys and the count of nodes utilizing each shard.
[{"Name": "Shard-10", "Count": 352}, {"Name": "Shard-11", "Count": 351}, {"Name": "Shard-12", "Count": 35}, {"Name": null, "Count": 921}]
Visibility wise, the new capabilities will be restricted by API microversion. Access wise, this field would be restricted in use to the system-reader, project-admin, and future service roles by default. A specific RBAC policy would be added for access to this endpoint.
Note
The /v1/shards endpoint will be read only.
Client (CLI) impact
“openstack baremetal” CLI
A baremetal shard list command would be added.
A baremetal node list --shard <shard> capability would be added to list all nodes in a shard.
A --shard node level parameter for baremetal node set would also be added.
A baremetal port list --shard <shard> capability would be added to limit the related ports to nodes in a shard. Similarly, the baremetal portgroup list --shard <shard> command would be updated as well.
“openstacksdk”
An SDK method would be added to get a shard list, and existing list methods would be checked to ensure we can query by shard.
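A hedged sketch of possible openstacksdk usage once support lands; the shards() method is hypothetical, and the shard query parameter assumes the SDK negotiates the new microversion::

  import openstack

  conn = openstack.connect(cloud='mycloud')

  # Hypothetical: enumerate shards and their node counts.
  for shard in conn.baremetal.shards():
      print(shard.name, shard.count)

  # Existing node listing, extended with the new shard query parameter.
  for node in conn.baremetal.nodes(shard='rack1'):
      print(node.id, node.name)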
RPC API impact
None anticipated at this time.
Driver API impact
None
Nova driver impact
A separate specification document is being proposed to the Nova project to help identify and navigate the overall change.
That being said, no direct negative impact is anticipated.
The overall discussion with Nova revolves around both facilitating a minimal-impact migration and not forcing invasive, breaking changes which may not realistically be needed by operators.
Note
An overall migration path is envisioned, but what is noted here is only a suggestion of how we perceive the overall process.
Ironic itself will not be providing an explicit process for setting the shard value on each node, aside from baremetal node set. Below is what we, Ironic, anticipate as the overall migration steps to move towards this model.
Anticipated initial Nova migration steps:
1. Complete the Ironic migration. Upon completion, executing the database status check (i.e. ironic-status upgrade check) should detect and warn if a shard key is present on some nodes in the database, but other nodes without a shard value are present in the database.
2. The nova-compute service being upgraded is shut down.
3. A nova-manage command would be executed to reassign nodes matching a user supplied shard value. Example: nova-manage ironic-reassign <shard-key> <compute-hostname>. Programmatically, this would retrieve a list of nodes matching the key from Ironic, and then change the host fields of the associated ComputeNode and Instance table records to the supplied compute hostname, matching an existing nova-compute service. A hedged sketch of this step appears after this list.

Note
The command likely needs to match/validate that this is/was a compute hostname.

Todo
As a final step before the nova-manage command exits, ideally it would double check the state of records in those tables to indicate whether there are other nodes the named compute hostname is responsible for. The last compute hostname in the environment should not generate any warning; any warning would be indicative of a lost ComputeNode, Instance, or baremetal node record.

4. The nova-compute.conf file for the upgraded nova-compute service gains a my_shard (or other appropriately named) parameter, and the service is restarted. This parameter signals to the nova.virt.ironic driver code to not utilize the hash ring, and to instead utilize the blend of what it thinks it is responsible for from the database and what matches the Ironic baremetal node inventory when queried with the configured shard key value.
5. As additional compute nodes are migrated to the new shard key setup, existing compute node imbalance should settle in terms of the internal compute-node logic retrieving what each node thinks it is responsible for, and would eventually match the shard key.
This would facilitate the ability to perform a rolling, yet isolated, outage as the new nova-compute configuration comes online, and also allows for a flow which should be automatable for larger operators.
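To make step 3 above concrete, a hedged sketch of what the hypothetical reassignment could do internally follows; the command, the client calls, and the direct table updates are illustrative assumptions, not proposed Nova code::

  import sqlalchemy as sa

  def reassign_shard(ironic_client, session, shard_key, compute_hostname):
      # Pull only the nodes in the shard being handed to this compute host.
      node_uuids = [n['uuid'] for n in ironic_client.nodes(shard=shard_key)]

      # Repoint the ComputeNode and Instance records at the new host,
      # one record at a time for clarity.
      for uuid in node_uuids:
          session.execute(
              sa.text("UPDATE compute_nodes SET host = :host "
                      "WHERE hypervisor_hostname = :uuid"),
              {"host": compute_hostname, "uuid": uuid},
          )
          session.execute(
              sa.text("UPDATE instances SET host = :host "
                      "WHERE node = :uuid"),
              {"host": compute_hostname, "uuid": uuid},
          )
      session.commit()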
The manageability, say if one needs to change a shard or rebalance shards, is not yet clear. The current discussion in the Nova project is that rebalance/reassociation will only be permitted if the compute service has been “forced down”, which is an irreversible action.
Ramdisk impact
None
Security impact
The shard key would be API user settable, as long as sufficient API access exists in the RBAC model.
The /v1/shards endpoint would also be restricted based upon the RBAC model.
No other security impacts are anticipated.
Other end user impact
None anticipated
Scalability impact
This model is anticipated to allow users of data stored in Ironic to be more scalable. No impacts to Ironic’s scalability are generally anticipated.
Performance Impact
No realistic impact is anticipated. While another field is being added, initial prototyping benchmarks have yielded highly performant response times for large sets (10,000) of baremetal nodes.
Other deployer impact
It is recognized that operators may wish to auto-assign or auto-shard the node set programmatically. The agreed upon limitation amongst Ironic contributors is that we (Ironic) would not automatically create new shards in the future. Creation of new shards would be driven by the operator setting a new shard key on any given node.
This may require a new configuration option to control this logic, but the logic overall is not viewed as a blocking aspect to the more critical need of being able to “assign” a node to a shard. This logic may be added later on; we will just endeavour to have updated documentation to explain the appropriate usage and options.
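As an illustration of operator-driven assignment, a hedged sketch of a round-robin script follows; the client method names and shard values are assumptions for illustration::

  import itertools

  SHARDS = ['shard-1', 'shard-2', 'shard-3']

  def assign_shards(client):
      """Spread unsharded nodes across a fixed set of shard keys."""
      shard_cycle = itertools.cycle(SHARDS)
      for node in client.nodes(shard='none'):  # nodes with no shard set
          patch = [{'op': 'add', 'path': '/shard',
                    'value': next(shard_cycle)}]
          client.update_node(node['uuid'], patch)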
Developer impact
None anticipated
Implementation
Assignee(s)
- Primary assignee:
Jay Faulkner (JayF)
- Other contributors:
Julia Kreger (TheJulia)
Work Items
- Propose a nova spec for the use of the keys (https://review.opendev.org/c/openstack/nova-specs/+/862833).
- Create database schema/upgrades/models.
- Update the object layer for the Node and Port objects in order to permit both objects to be queried by shard.
- Add query by shard capability to the Nodes and Ports database tables.
- Expose shard on the node API, with an incremented microversion, and implement a new RBAC policy which restricts the ability to change the shard value.
- Add a pre-upgrade status check to warn if there are fields which are not consistently populated, i.e. shard is not populated on all nodes. This will provide visibility into a mixed and possibly misconfigured operational state for future upgraders.
- Update OpenStack SDK and python-ironicclient.
Dependencies
This specification is loosely dependent upon Nova accepting a plan for use of the sharding model of data. At present, it is the Ironic team’s understanding that it is acceptable to Nova, and Ironic needs to merge this spec and related code to support this feature before Nova will permit the Nova spec to be merged.
Testing
Unit testing is expected for all the basic components and operations added to Ironic to support this functionality.
We may be able to add some tempest testing for the API field and access interactions.
Upgrades and Backwards Compatibility
To be determined. We anticipate that the standard upgrade process would apply and that there would not realistically be an explicit downgrade compatibility process; however, this capability and functionality is largely for external consumption, and details there are yet to be determined.
Documentation Impact
Admin documentation would need to include a document covering sharding, internal mechanics, and usage.
References
PTG Notes: https://etherpad.opendev.org/p/nova-antelope-ptg
Bug: https://launchpad.net/bugs/1730834
Bug: https://launchpad.net/bugs/1825876
Related Bug: https://launchpad.net/bugs/1853009