Unifying TripleO Orchestration with Task-Core and Directord

Launchpad blueprint: https://blueprints.launchpad.net/tripleo/+spec/unified-orchestration

The purpose of this spec is to introduce core concepts around Task-Core and Directord, explain their benefits, and cover why the project should migrate from using Ansible to using Directord and Task-Core.

TripleO has long been established as an enterprise deployment solution for OpenStack. Different task executions have been used at different times. Originally, os-collect-config was used, then the switch to Ansible was completed. A new task execution environment will enable moving forward with a solution designed around the specific needs of TripleO.

The tools being introduced are Task-Core and Directord.

Task-Core:

A dependency management and inventory graph solution which allows operators to define tasks in simple terms with robust dominion over a given environment. Declarative dependencies will ensure that if a container/config is changed, only the necessary services are reloaded/restarted. Task-Core provides access to the right tools for a given job with provenance, allowing operators and developers to define outcomes confidently.

Directord:

A deployment framework built to manage the data center life cycle, which is both modular and fast. Directord focuses on consistently maintaining deployment expectations with a near real-time level of performance at almost any scale.

Problem Description

Task execution in TripleO is:

  • Slow

  • Resource intensive

  • Complex

  • Defined in a static and sequential order

  • Not optimized for scale

TripleO presently uses Ansible to achieve its task execution orchestration goals. The TripleO tooling around Ansible (playbooks, roles, modules, plugins) has worked, and is likely to continue working if maintainers bear an increased burden. However, the change in direction brought by Ansible Execution Environments provides an inflection point. These upstream changes, in which Ansible is fundamentally moving away from the TripleO use case, force TripleO maintainers to take on more ownership for no additional benefit. The TripleO use case is actively working against the future direction of Ansible.

Further, the Ansible lifecycle has never matched that of TripleO. A single consistent and backwards-compatible Ansible version cannot be used across a single version of TripleO without the tripleo-core team committing to maintain that version of Ansible, or committing to update the Ansible version in a stable TripleO release. The cost of maintaining a tool such as Ansible, which the core team does not own, is high compared with switching to custom tools designed specifically for the TripleO use case.

The additional cost of maintaining Ansible as the task execution engine for TripleO has a high likelihood of causing a significant disruption to the TripleO project; this is especially true as the project looks to support future OS versions.

Presently, there are diminishing benefits to be realized from any meaningful performance, scale, or configurability improvements. The simplification efforts and the work around custom Ansible strategies and plugins have reached a conclusion in terms of returns.

Other framework changes that expose scaling mechanisms, such as using --limit or partitioning the Ansible execution across multiple stacks or roles, do help with the scaling problem. However, they fall into the category of workarounds, as they do not directly address the inherent scaling issues with task execution.

Proposed Change

To make meaningful task execution orchestration improvements, TripleO must simplify the framework with new tools, enable developers to build intelligent tasks, and provide meaningful performance enhancements that scale to meet operators’ expectations. If TripleO can capitalize on this moment, it will improve the quality of life for day one deployers and day two operations and upgrades.

The proposal is to replace all usage of Ansible with Directord for task execution, and add the usage of Task-Core for dynamic task dependencies.

In some ways, the move toward Task-Core and Directord creates a General-Problem, as it’s proposing the replacement of many bespoke tools, which are well known, with two new homegrown ones. Be that as it may, much attention has been given to the user experience, addressing many well-known pain points commonly associated with TripleO environments, including: scale, barrier to entry, execution times, and the complex step process.

Overview

This specification consists of two parts that work together to achieve the project goals.

Task-Core:

Task-Core builds upon native OpenStack libraries to create a dependency graph and executes a compiled solution. With Task-Core, TripleO will be able to define a deployment with dependencies instead of brute-forcing one. While powerful, Task-Core keeps development easy and consistent, reducing the time to deliver and allowing developers to focus on their actual deliverable, not the orchestration details. Task-Core also guarantees reproducible builds, runtime awareness, and the ability to resume when issues are encountered.

  • Templates containing step-logic and ad-hoc tasks will be refactored into Task-Core definitions.

  • Each component can have its own Task-Core purpose, providing resources and allowing other resources to depend on it.

  • The invocation of Task-Core will be baked into the TripleO client; it will not have to be invoked as a separate deployment step.

  • Advanced users will be able to use Task-Core to meet their environment expectations without fully understanding the deployment nuance of multiple bespoke systems.

  • Task-Core employs a validation system around inputs to ensure they are correct before starting the deployment. While the validation won’t ensure an operational deployment, it will remove some issues caused by incorrect user input, such as missing dependent services or duplicate services, providing early feedback to deployers so they’re able to make corrections before running longer operations.
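The dependency-driven ordering and input validation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names, the `requires` edges, and the `validate` and `execution_order` helpers are hypothetical and are not Task-Core's actual data model or API. The ordering uses the standard library `graphlib` module rather than anything Task-Core specific.

```python
from graphlib import TopologicalSorter

# Hypothetical service definitions: (name, services it requires).
# Names and edges are illustrative only, not Task-Core's real schema.
SERVICES = [
    ("os-setup", []),
    ("mysql", ["os-setup"]),
    ("rabbitmq", ["os-setup"]),
    ("keystone", ["mysql", "rabbitmq"]),
]


def validate(services):
    """Return a list of input errors: duplicate or missing services."""
    errors = []
    seen = set()
    for name, _ in services:
        if name in seen:
            errors.append(f"duplicate service: {name}")
        seen.add(name)
    for name, requires in services:
        for dep in requires:
            if dep not in seen:
                errors.append(f"{name} requires missing service: {dep}")
    return errors


def execution_order(services):
    """Validate the input first, then return a dependency-respecting order."""
    errors = validate(services)
    if errors:
        # Fail early, before any long-running deployment work starts.
        raise ValueError("; ".join(errors))
    graph = {name: set(requires) for name, requires in services}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(SERVICES))
```

In this sketch, a definition that requires an undeclared service, or that declares a service twice, is rejected before any ordering is computed, which mirrors the early-feedback goal described in the list above.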

Directord:

Directord provides a modular execution platform that is aware of managed nodes. Because Directord leverages messaging, the platform can guarantee availability, transport, and performance. Directord has been built from the ground up, making use of industry-standard messaging protocols which ensure pseudo-real-time performance and limited resource utilization. The built-in DSL provides most of what the TripleO project will require out of the box. Because no solution is perfect, Directord utilizes a plugin system that will allow developers to create new functionality without compromise or needing to modify core components. Additionally, plugins are handled the same, allowing Directord to ensure the delivery and execution performance remain consistent.

  • Directord is a single application that is ideally suited for containers while also providing native hooks into systems; this allows Directord to operate in heterogeneous environments. Because Directord is a simplified application, operators can choose how they want to run it and are not forced into a one size fits all solution.

  • Directord is platform-agnostic, allowing it to run across systems, versions, and network topologies while simultaneously guaranteeing it maintains the smallest possible footprint.

  • Directord is built upon messaging, giving it the unique ability to span network topologies with varying latencies; messaging protocols compensate for high latency environments and will finally give TripleO the ability to address multiple data-centers and fully embrace “the edge.”

  • Directord client/server communication is secured (TLS, etc.) and encrypted.

  • Directord provides node management to address unreachable or flapping clients.

With Task-Core and Directord, TripleO will have an intelligent dependency graph that is both easy to understand and extend. TripleO will now be aware of things like service dependencies, making it possible to run day two operations more quickly and efficiently (e.g., update and restart only dependent services). Finally, TripleO will shrink its maintenance burden by eliminating Ansible.
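As a rough illustration of "update and restart only dependent services", the sketch below walks a dependency graph to find everything affected by a change to one service. The `REQUIRES` mapping and the `services_to_restart` helper are hypothetical names chosen for this example; they do not reflect how Task-Core stores or queries its graph.

```python
from collections import deque

# Hypothetical "requires" edges; names are illustrative only.
REQUIRES = {
    "os-setup": [],
    "mysql": ["os-setup"],
    "rabbitmq": ["os-setup"],
    "keystone": ["mysql", "rabbitmq"],
}


def services_to_restart(changed, requires):
    """Return the changed service plus every service that depends on it."""
    # Invert the graph: service -> services that directly require it.
    dependents = {name: [] for name in requires}
    for name, deps in requires.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk over the inverted edges.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    print(sorted(services_to_restart("mysql", REQUIRES)))
```

With these assumed edges, a change to `mysql` touches only `mysql` and `keystone`, while `rabbitmq` and `os-setup` are left alone.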

Alternatives

Stay the course with Ansible

Continuing with Ansible for task execution means that the TripleO core team embraces maintaining Ansible for the specific TripleO use case. Additionally, the TripleO project begins documenting the scale limitations and the boundaries that exist due to the nature of task execution. Focus needs to shift to the maintenance required to meet the functional expectations of TripleO. Specific Ansible versions also need to be maintained beyond their upstream lifecycle. This maintenance would likely include maintaining an Ansible branch where security and bug fixes could be backported, with our own project CI to validate functionality.

TripleO could also embrace the use of Ansible Execution Environments through continued investigative efforts, although if TripleO is already maintaining Ansible, this would not be strictly required.

Security Impact

Task-Core and Directord are two new tools and attack surfaces, which will require a new security assessment to be performed to ensure the tooling exceeds the standard already set. That said, steps have already been taken to ensure the new proposed architecture is FIPS compatible, and enforces transport encryption.

Directord also uses ssh-python for bootstrapping tasks.

Ansible will be removed, and will no longer have a security impact within TripleO.

Upgrade Impact

The undercloud can be upgraded in place to use Directord and Task-Core. There will be upgrade tasks that will migrate the undercloud as necessary to use the new tools.

The overcloud can also be upgraded in place with the new tools. Upgrade tasks will be migrated to use the Directord DSL just like deployment tasks. This spec proposes no changes to the overcloud architecture itself.

As part of the upgrade task migration, the tasks can be rewritten to take advantage of the new features exposed by these tools. With the introduction of Task-Core, upgrade tasks can use well-defined dependencies for dynamic ordering. Just like deployment, update/upgrade times will be decreased due to the anticipated performance increases.

Other End User Impact

When following the happy path, end users, deployers, and operators will not interact with this change, as the user interface will effectively remain the same. However, the user experience will change. Operators accustomed to Ansible tasks, logging, and output will instead need to become familiar with those same aspects of Directord and Task-Core.

If an operator wishes to leverage the advanced capabilities of either Task-Core or Directord, the tooling will have documented end-user interfaces for features such as custom components and orchestrations.

It should be noted that there’s a change in deployment architecture in that Directord follows a server/client model, albeit an ephemeral one. This change aims to be fully transparent; however, it is something that end users and deployers will need to be aware of.

Performance Impact

This specification will have a positive impact on performance. Due to the messaging architecture of Directord, near-realtime task execution will be possible in parallel across all nodes.

  • Performance analysis has been done comparing configurability and runtime of Directord vs. Ansible, the TripleO default orchestration tool. This analysis highlights some of the performance gains this specification will provide; initial testing suggests that Task-Core and Directord are more than 10x faster than our current tool chain, representing a potential 90% time savings in just the task execution overhead.

  • One of the goals of this specification is to remove impediments in the time to work. Deployers should not be spending exorbitant time waiting for tools to do work; in some cases, waiting longer for a worker to be available than it would take to perform a task manually.

  • Improvements will come from being able to execute more efficiently in parallel. The Ansible strategy work allowed us to run tasks from a given Ansible play in parallel across the nodes. However, this was effectively limited to a single play per node in terms of execution. The granularity was limited to a play, such that an Ansible play with 100 items of work for one role and 10 items of work for another role would be run in parallel on the nodes. The role with 10 items of work would likely finish first, and the overall execution would have to wait until the entire play was completed everywhere. The long pole for a play’s execution is the node with the largest set of tasks. With the transition to Task-Core and Directord, the overall unit of work is an orchestration, which may have 5 tasks. If we take the same 100 tasks for one role and split them into 20 orchestrations that can be run in parallel, and split the 10 items of work for the other role into two orchestrations, we are able to better execute the work in parallel when there are no specific ordering requirements. Improvements are expected around host prep tasks and other services where we do not have specific ordering requirements. Today these tasks get put in a random spot within a play and have to wait on other unrelated tasks to complete before being run. We expect there to be less execution overhead time per the other items in this section; however, the overall improvements are limited by how well we can remove unnecessary ordering requirements.

  • Deployers will no longer be required to run a massive server for medium-scale deployment. Regardless of size, the memory footprint and compute cores needed to execute a deployment will be significantly reduced.
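The 100-task and 10-task example above can be worked through with a simple back-of-the-envelope model. The sketch below is an assumption-laden illustration, not a benchmark: it assumes every task costs one unit of time, that orchestrations have no ordering requirements between them, and that each node can run a fixed number of orchestrations concurrently (the `workers` parameter, which is hypothetical and not a Directord setting named in this spec).

```python
import math

TASKS_PER_ORCHESTRATION = 5  # orchestration size used in the example above


def play_makespan(tasks_per_role):
    """Tasks run sequentially per node; the play ends with the slowest node."""
    return max(tasks_per_role.values())


def orchestration_makespan(tasks_per_role, workers):
    """Independent orchestrations run `workers` at a time on each node.

    Returns an upper bound in task-time units, assuming unit-cost tasks.
    """
    longest = 0
    for tasks in tasks_per_role.values():
        orchestrations = math.ceil(tasks / TASKS_PER_ORCHESTRATION)
        rounds = math.ceil(orchestrations / workers)
        longest = max(longest, rounds * TASKS_PER_ORCHESTRATION)
    return longest


if __name__ == "__main__":
    roles = {"role_a": 100, "role_b": 10}
    print("play:", play_makespan(roles))
    for workers in (1, 4):
        print(f"{workers} worker(s):", orchestration_makespan(roles, workers))
```

Under these assumptions, a single worker gives no gain over the play model, which matches the caveat above that the overall improvement depends on how many ordering requirements can be removed.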

Other Deployer Impact

Task-Core and Directord represent an unknown factor; as such, they are not battle-tested and will create uncertainty in an otherwise “stable” project.

Deployers will see time savings when running deployments. Deployers who implement new services will need to do so with Directord and Task-Core.

Extensive testing has been done; all known use-cases, from system-level configuration to container pod orchestration, have been covered, and automated tests have been created to ensure nothing breaks unexpectedly. Additionally, for the first time, these projects have expectations on performance, with tests backing up those claims, even at a large scale.

At present, TripleO assumes SSH access between the Undercloud and Overcloud is always present. Additionally, TripleO believes the infrastructure is relatively static, making day two operations risky and potentially painful. Task-Core will reduce the computational burden when crafting action plans, and Directord will ensure actions are always performed against the functional hosts.

Another area this specification will improve is vendor integrations. Vendors will be able to provide meaningful task definitions which leverage an intelligent inventory and dependency system. No longer will TripleO require vendors to have in-depth knowledge of every deployment detail, even those outside of the scope of their deliverable. Easing the job definitions, simplifying the development process, and speeding up the execution of tasks are all positive impacts on deployers.

Test clouds are still highly recommended sources of information; however, system requirements on the Undercloud will be reduced. By reducing the resources required to operate the Undercloud, the cost of test environments, in terms of both hardware and time, will be significantly lowered. With a lower barrier to entry, developers and operators alike will be able to more easily contribute to the overall project.

Developer Impact

To fully realize the benefits of this specification, Ansible tasks will need to be refactored into the Task-Core scheme. While Task-Core can run Ansible, and Directord has a plugin system which easily allows developers to port legacy modules into Directord plugins, there will be a developer impact as the TripleO development methodology will change. It’s fair to say that the potential developer impact will be huge, yet the shift isn’t monumental. Much of the Ansible presently in TripleO is shell-oriented and, as such, easily portable; as stated, compatibility layers exist, allowing the TripleO project to make the required shift gradually. Once the Ansible tasks are ported, the time saved in execution will be significant.

Example Task-Core and Directord implementation for Keystone:

While this implementation example is fairly basic, it does result in a functional Keystone environment in roughly 5 minutes and includes services like MySQL, RabbitMQ, and Keystone, as well as ensuring that the operating system is set up and configured for a cloud execution environment. The most powerful aspect of this example is the inclusion of the graph dependency system, which will allow us to easily externalize services.

  • The use of advanced messaging protocols instead of SSH means TripleO can more efficiently address deployments in local data centers or at the edge.

  • The Directord server and storage can be easily offloaded, making it possible for the TripleO Client to be executed from simple environments without access to the overcloud network; imagine running a massive deployment from a laptop.

Implementation

In terms of essential TripleO integration, most of the work will occur within the tripleoclient, with the following new workflow.

Execution Workflow:

┌────┐   ┌─────────────┐   ┌────┐   ┌─────────┐   ┌─────────┬──────┐   ┌─────────┐
│USER├──►│TripleOclient├──►│Heat├──►│Task-Core├──►│Directord│Server├──►│ Network │
└────┘   └─────────────┘   └────┘   └─────────┘   └─────────┴──────┘   └─────────┘
                ▲                                             ▲             ▲
                │                       ┌─────────┬───────┐   |             |
                └──────────────────────►│Directord│Storage│◄──┘             |
                                        └─────────┴───────┘                 |
                                                                            |
                                                  ┌─────────┬──────┐        |
                                                  │Directord│Client│◄───────┘
                                                  └─────────┴──────┘

  • Directord|Server - Task executor connecting to client.

  • Directord|Client - Client program running on remote hosts connecting back to the Directord|Server.

  • Directord|Storage - An optional component. When storage is not externalized, Directord will maintain the runtime storage internally; in this configuration, Directord is ephemeral.

To enable a gradual transition, ansible-runner has been implemented within Task-Core, allowing the TripleO project to convert playbooks into tasks that rely upon strongly typed dependencies without requiring a complete rewrite. The initial implementation should be transparent. Once the Task-Core hooks are set within tripleoclient, functional groups can then convert their tripleo-ansible roles or ad-hoc Ansible tasks into Directord orchestrations. Teams will have the flexibility to transition code over time and are incentivized by a significantly improved user experience and shorter time to delivery.

Assignee(s)

Primary assignee:
  • Cloudnull - Kevin Carter

  • Mwhahaha - Alex Schultz

  • Slagle - James Slagle

Other contributors:
  • ???

Work Items

  1. Migrate Directord and Task-Core to the OpenStack namespace.

  2. Package all of Task-Core, Directord, and dependencies for PyPI.

  3. RPM package all of Task-Core, Directord, and dependencies for RDO.

  4. Directord container image build integration within TripleO / tcib.

  5. Converge on a Directord deployment model (container, system, hybrid).

  6. Implement the Task-Core code path within TripleO client.

  7. Port in-template Ansible tasks to Directord orchestrations.

  8. Port Ansible roles into Directord orchestrations.

  9. Port Ansible modules and actions into pure Python or Directord components.

  10. Port Ansible workflows in tripleoclient into pure Python or Directord orchestrations.

  11. Migration tooling for Heat templates, Ansible roles/modules/actions.

  12. Port Ansible playbook workflows in tripleoclient to pure Python or Directord orchestrations.

  13. Undercloud upgrade tasks to migrate to the Directord + Task-Core architecture.

  14. Overcloud upgrade tasks to enable Directord client bootstrapping.

Dependencies

Both Task-Core and Directord are dependencies, as they’re new projects. These dependencies may or may not be brought into the OpenStack namespace; regardless, both of these projects, and their associated dependencies, will need to be packaged and provided for by RDO.

Testing

If successful, the implementation of Task-Core and Directord will leave the existing testing infrastructure unchanged. TripleO will continue to function as it currently does through the use of the tripleoclient.

New tests will be created to ensure the Task-Core and Directord components remain functional and provide an SLA around performance and configurability expectations.

Documentation Impact

Documentation around Ansible will need to be refactored.

New documentation will need to be created to describe the advanced usage of Task-Core and Directord. Much of the client interactions from the “happy path” will remain unchanged.

References