Scaling with the Ansible Inventory

https://blueprints.launchpad.net/tripleo/scaling-with-Ansible-inventory

Scaling an existing deployment should be possible by adding new host definitions directly to the Ansible inventory, and not having to increase the <Role>Count parameters.

Problem Description

Currently to scale a deployment, a Heat stack update is required. The stack update reflects the new desired node count of each role, which is then represented in the generated Ansible inventory. The inventory file is then used by the config-download process when ansible-playbook is executed to perform the software configuration on each node.

Updating the Heat stack with the new desired node count has posed some scaling challenges. Heat creates a set of resources associated with each node. As the number of nodes in a deployment increases, Heat has more and more resources to manage.

As the stack size grows, Heat must be tuned with software configurations or horizontally scaled with additional engine workers. However, horizontal scaling of Heat workers will only help so much as eventually other service workers would need to be scaled as well, such as database, messaging, or Keystone worker process. Having to increasingly scale worker processes results in additional physical resource consumption.

Heat performance also begins to degrade as stack size increases. It takes longer and longer for stack operations to complete as node count increases. The stack operation time often reaches into taking many hours, which is usually outside the range of typical maintenance windows.

It is also hard to predict what changes Heat will make. Often, no changes are desired other than to scale out to new nodes. However, unintended template changes or user error around forgetting to pass environment files poses additional unnecessary risk to the scaling operation.

Proposed Change

Overview

The proposed change would allow for users to directly add new node definitions to the Ansible inventory by way of a new Heat parameter to allow for scaling services onto those new nodes. No change in the <Role>Count parameters would be required.

A minimum set of data would be required when adding a new node to the Ansible inventory. Presently, this includes the TripleO role, and an IP address on each network that is used by that role.

Only scaling of already defined roles will be possible with this method. Defining new roles would still require a full Heat stack update which defined the new role.

Once the new node(s) are added to the inventory, ansible-playbook could be rerun with the config-download directory to scale the software services out on to the new nodes.

As increasing the node count in the Heat stack operation won’t be necessary when scaling, if baremetal provisioning is required for the new nodes, then this work depends on the nova-less-deploy work:

https://specs.openstack.org/openstack/tripleo-specs/specs/stein/nova-less-deploy.html

Once baremetal provisioning is migrated out of Heat with the above work, then new nodes can be provisioned with those new workflows before adding them directly to the Ansible inventory.

Since new nodes added directly to the Ansible inventory would still be consuming IP’s from the subnet ranges defined for the overcloud networks, Neutron needs to be made aware of those assignments so that there are no overlapping IP addresses. This could be done with a new interface in tripleo-heat-templates that allows for specifying the extra node inventory data. The parameter would be called ExtraInventoryData. The templates would take care of operating on that input and creating the appropriate Neutron ports to correspond to the IP addresses specified in the data.

When tripleo-ansible-inventory is used to generate the inventory, it would query Heat as it does today, but also layer in the extra inventory data as specified by ExtraInventoryData. The resulting inventory would be a unified view of all nodes in the deployment.

ExtraInventoryData may be a list of files that are consumed with Heat’s get_file function so that the deployer can keep their inventory data organized by file.

Alternatives

This change is primarily targeted at addressing scaling issues around the Heat stack operation. Alternative methods include using undercloud minions:

https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/undercloud_minion.html

Multi-stack/split-controlplane also addresses the issue somewhat by breaking up the deployment into smaller and more manageable stacks:

https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/distributed_compute_node.html

These alternatives are complimentary to the proposed solution here, and all of these solutions can be used together for the greatest benefits.

Direct manipulation of inventory data

Another alternative would be to not make use of any new interface in the templates such as the previously mentioned ExtraInventoryData. Users could just update the inventory file manually, or drop inventory files in a specified location (since Ansible can use a directory as an inventory source).

The drawbacks to this approach are that another tool would be necessary to create associated ports in Neutron so that there are no overlapping IP addresses. It could also be a manual step, although that is prone to error.

The advantages to this approach is that it would completely eliminate the stack update operation as part of the scaling. Not having any stack operation is appealing in some regards due to the potential to forget environment files or other user error (out of date templates, etc).

Security Impact

IP addresses and hostnames would potentially exist in user managed templates that have the value for ExtraInventoryData, however this is no different than what is present today.

Upgrade Impact

The upgrade process will need to be aware that not all nodes are represented in the Heat stack, and some will be represented only in the inventory. This should not be an issue as long as there is a consistent interface to get a single unified inventory as there exists now.

Any changes around creating the unified view of the inventory should be made within the implementation of that interface (tripleo-ansible-inventory) such that existing tooling continues to use an inventory that contains all nodes for a deployment.

Other End User Impact

Users will potentially have to manage additional environment files for the extra inventory data.

Performance Impact

Performance should be improved during scale out operations.

However, it should be noted that Ansible will face scaling challenges as well. While this change does not directly introduce those new challenges, it may expose them more rapidly as it bypasses the Heat scaling challenges.

For example, it is not expected that simply adding hundreds or thousands of new nodes directly to the Ansible inventory means that scaling operation would succeed. It would likely expose new scaling challenges in other tooling, such as the playbook and role tasks or Ansible itself.

Other Deployer Impact

Since this proposal is meant to align with the nova-less-deploy, all nodes (whether they are known to Heat or not) would be unprovisioned if the deployment is deleted.

If using pre-provisioned nodes, then there is no change in behavior in that deleting the Heat stack does not actually “undeploy” any software. This proposal does not change that behavior.

Developer Impact

Developers could more quickly test scaling by bypassing the Heat stack update completely if desired, or using the ExtraInventoryData interface.

Implementation

Assignee(s)

Primary assignee:

James Slagle <jslagle@redhat.com>

Work Items

  • Add new parameter ExtraInventoryData

  • Add Heat processing of ExtraInventoryData

    • create Neutron ports

    • add stack outputs

  • Update tripleo-ansible-inventory to consume from added stack outputs

  • Update HostsEntry to be generic

Dependencies

  • Depends on nova-less-deploy work for baremetal provisioning outside of Heat. If using pre-provisioned nodes, does not depend on nova-less-deploy.

  • All deployment configurations coming out of Heat need to be generic per role. Most of this work was complete in Train, however this should be reviewed. For example, the HostsEntry data is still static and Heat is calculating the node list. This data needs to be moved to an Ansible template.

Testing

Scaling is not currently tested in CI, however perhaps it could be with this change.

Manual test plans and other test automation would need to be updated to also test scaling with ExtraInventoryData.

Documentation Impact

Documentation needs to be added for ExtraInventoryData.

The feature should also be fully explained in that users and deployers need to be made aware of the change of how nodes may or may not be represented in the Heat stack.

References