Provide a common Validation Framework inside python-tripleoclient

https://blueprints.launchpad.net/tripleo/+spec/validation-framework

Currently, we’re lacking a common validation framework in tripleoclient. This framework should provide an easy way to validate the environment prior to deployment and prior to update/upgrade, on both undercloud and overcloud.

Problem Description

Currently, we have two types of validations:

  • Those launched prior to the undercloud deploy, embedded into the deploy itself

  • Those launched at will via a Mistral Workflow

There is no unified, easy way to call a validation on its own, and we lack the capacity to easily add new validations for the undercloud preflight checks.

The current situation is not optimal: the operator must go through the UI in order to run validations. They can also be run from the CLI, but only through the exact same workflows as the UI. This can’t be used to get proper preflight validations, especially when we don’t have a working Mistral (prior to the undercloud deploy, or with all-in-one/standalone).

Moreover, there is a need to make the CLI and UI converge. The latter already uses the full list of validations. Adding the full support of tripleo-validations to the CLI will improve the overall quality, usability and maintenance of the validations.

Finally, a third type should be added: service validations called during the deploy itself. This doesn’t directly affect the tripleoclient codebase, but rather tripleo-heat-templates.

Proposed Change

Overview

In order to improve the current situation, we propose to create a new “branching” in the tripleoclient commands: openstack tripleo validator

This new subcommand will make it possible to list and run validations independently.

Doing so will give a clear and clean view of the validations we can run depending on the stage we’re in.

(Note: the subcommand has yet to be defined - this is only a “mock-up”.)

The following subcommands should be supported:

  • openstack tripleo validator list: will display all the available validations with a small description, like “validate network capabilities on undercloud”

  • openstack tripleo validator run: will run the validations. Should take options, like:

    • --validation-name: run only the passed validation.

    • --undercloud: runs all undercloud-related validations

    • --overcloud: runs all overcloud-related validations

    • --use-mistral: runs validations through Mistral

    • --use-ansible: runs validations directly via Ansible

    • --plan: runs validations against a specific plan. Defaults to $TRIPLEO_PLAN_NAME or “overcloud”

  • in addition, common options for all the subcommands:

    • --extra-roles: path to a local directory containing validation roles maintained by the operator, or a swift container holding extra validation roles.

    • --output: points to a valid Ansible stdout_callback, such as the native json or the custom validation_output. The default should be the latter, as it renders a “human readable” output. More callbacks can be added later.

The --extra-roles option must support both a local path and a remote swift container, since the custom validation support will push any validation to a dedicated swift directory.
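The options listed above can be sketched with plain argparse. This is only a mock-up under stated assumptions: tripleoclient would use its own command classes rather than argparse, the build_parser helper is hypothetical, and treating --use-mistral and --use-ansible as mutually exclusive is an assumption the spec does not make.

```python
import argparse
import os


def build_parser():
    """Build a mock-up parser for 'openstack tripleo validator'."""
    # Options shared by every subcommand.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--extra-roles", metavar="PATH_OR_CONTAINER",
                        help="local directory or swift container with roles")
    common.add_argument("--output", default="validation_output",
                        help="Ansible callback used to render the results")

    parser = argparse.ArgumentParser(prog="openstack tripleo validator")
    sub = parser.add_subparsers(dest="action", required=True)

    sub.add_parser("list", parents=[common],
                   help="display the available validations")

    run = sub.add_parser("run", parents=[common], help="run validations")
    run.add_argument("--validation-name", help="run only this validation")
    run.add_argument("--undercloud", action="store_true")
    run.add_argument("--overcloud", action="store_true")
    # Assumption: the two engines cannot be requested together.
    engine = run.add_mutually_exclusive_group()
    engine.add_argument("--use-mistral", action="store_true")
    engine.add_argument("--use-ansible", action="store_true")
    run.add_argument("--plan",
                     default=os.environ.get("TRIPLEO_PLAN_NAME", "overcloud"))
    return parser


args = build_parser().parse_args(["run", "--undercloud", "--use-ansible"])
print(args.action, args.undercloud, args.use_ansible, args.output)
```

Declaring the common options on a parent parser keeps list and run consistent without repeating each definition.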

The default engine will be determined by the presence of Mistral: if Mistral is present and accepting requests (meaning the Undercloud is most probably deployed), the validator has to use it by default. If no Mistral is present, it must fall back on ansible-playbook.
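The default rule described above could look like the following sketch. The select_engine function and its arguments are hypothetical; how Mistral availability is actually probed is left out, and letting an explicit option override the default is an assumption.

```python
def select_engine(mistral_available, use_mistral=False, use_ansible=False):
    """Return 'mistral' or 'ansible' for a validation run."""
    if use_mistral and use_ansible:
        raise ValueError("--use-mistral and --use-ansible are exclusive")
    if use_ansible:
        return "ansible"
    if use_mistral:
        # Assumption: an explicit request fails loudly instead of
        # silently falling back to Ansible.
        if not mistral_available:
            raise RuntimeError("Mistral was requested but is not reachable")
        return "mistral"
    # No explicit choice: prefer Mistral when it answers, else fall back.
    return "mistral" if mistral_available else "ansible"


print(select_engine(mistral_available=False))
```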

The validations should be in the form of Ansible roles, so that they can be easily accessed from Mistral as well (as is currently the case). This will also allow proper documentation and a common canvas, and makes it possible to validate the role before running it (ensuring it has metadata, output, and so on).

We might also create some dedicated roles in order to make a kind of “self validation”, ensuring we actually can run the validations (network, resources, and so on).

The UI uses Mistral workflows in order to run the validations. The CLI must be able to use those same workflows, but must also run at least some validations directly via Ansible, especially when we want to validate the undercloud environment before we even deploy it.

Also, in order to avoid Mistral modification, playbooks including validation roles will be created.

In the end, all the default validation roles should be in one and only one location: tripleo-validations. Since support for “custom validations” is being added, such custom validations should also be supported (see references for details).

In order to get a proper way to “aim” the validations, proper validation groups must be created and documented. Of course, one validation can be part of multiple groups.
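As an illustration of group-based selection, the sketch below filters a catalogue by group membership. The validation names, the group names (including “greedy”) and the select_by_group helper are all hypothetical.

```python
# Hypothetical catalogue: validation name -> groups it belongs to.
VALIDATIONS = {
    "check-ram": {"undercloud", "pre-deploy"},
    "check-network": {"undercloud", "overcloud", "pre-deploy"},
    "check-ceph-health": {"overcloud", "pre-upgrade", "greedy"},
}


def select_by_group(catalogue, wanted, excluded=()):
    """Return sorted names in any wanted group and in no excluded group."""
    wanted, excluded = set(wanted), set(excluded)
    return sorted(
        name for name, groups in catalogue.items()
        if groups & wanted and not groups & excluded
    )


print(select_by_group(VALIDATIONS, ["overcloud"]))
print(select_by_group(VALIDATIONS, ["overcloud"], excluded=["greedy"]))
```

Storing groups as sets lets one validation belong to several groups without being duplicated in the catalogue.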

In addition, proper documentation with examples describing the Good Practices regarding the roles’ content, format and outputs should be created.

For instance, a role should contain a description, a “human readable error output”, and if applicable a possible solution.
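A check of that kind could be automated along the lines of this sketch. The metadata key names and the missing_metadata helper are hypothetical, since the actual format is still to be defined.

```python
# Hypothetical metadata keys; the actual names are still to be defined.
# A "solution" entry is treated as optional and is therefore not checked.
REQUIRED_KEYS = ("description", "error_message")


def missing_metadata(metadata):
    """Return the required keys that are absent or blank."""
    return [
        key for key in REQUIRED_KEYS
        if not str(metadata.get(key) or "").strip()
    ]


role = {
    "description": "Validate network capabilities on undercloud",
    "error_message": "The undercloud cannot reach its gateway.",
    "solution": "Check the gateway address in the undercloud configuration.",
}
print(missing_metadata(role))
print(missing_metadata({"description": ""}))
```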

Proper testing for the default validations (i.e. those in tripleo-validations) might be added as well in order to ensure a new validation follows the Good Practices.

We might want to add support for “nagios-compatible outputs” and exit codes, but it is not sure running those validations through any monitoring tool is a good idea due to the possible load it might create. This has to be discussed later, once we get the framework in place.

Alternatives

There are no real alternatives. Currently, we have many ways to validate, but they are unrelated and uncoordinated. If we don’t provide a unified framework, we will get more and more “side validation” paths and the result won’t be maintainable.

Security Impact

Elevated rights might be needed for some validations. They should be added accordingly in the system sudoers, in a way that limits unwanted privilege escalations.

Other End User Impact

The end user will get a proper way to validate the environment prior to any action. This will give more confidence in the final product, and ease the update and upgrade processes.

It will also provide a good way to collect information about the systems in case of failures.

If a “nagios-compatible output” is to be created (mix of ansible JSON output, parsing and compatibility stuff), it might provide a way to get a daily report about the health of the stack - this might be a nice feature, but not in the current scope (will need a new stdout_callback for instance).

Performance Impact

The more validations we get, the more time it might take IF we decide to run them by default prior to any action.

The current way to disable them, either with a configuration file or a CLI option, will stay.

In addition, we can make good use of “groups” in order to filter out greedy validations.

Other Deployer Impact

Providing a CLI subcommand for validation will make the deployment easier.

Providing a unified framework will allow an operator to run the validations either from the UI, or from the CLI, without any surprise regarding the validation list.

Developer Impact

A refactoring will be needed in python-tripleoclient and probably in tripleo-common in order to get a proper subcommand and options.

A correct way to call Ansible from Python is to be decided (ansible-runner?).
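Pending that decision, one simple option is to shell out to ansible-playbook. The sketch below only builds the command and environment; the build_ansible_call and run_validation helpers are hypothetical and this is not a proposal for the final design. It relies on the ANSIBLE_STDOUT_CALLBACK and ANSIBLE_ROLES_PATH environment variables, which Ansible reads for its stdout callback and roles path.

```python
import os
import subprocess


def build_ansible_call(playbook, inventory, callback="validation_output",
                       extra_roles=None):
    """Return (argv, env) for an ansible-playbook run; nothing is executed."""
    env = dict(os.environ)
    env["ANSIBLE_STDOUT_CALLBACK"] = callback
    if extra_roles:
        # Assumption: extra roles live in a local directory at this point.
        env["ANSIBLE_ROLES_PATH"] = extra_roles
    argv = ["ansible-playbook", "-i", inventory, playbook]
    return argv, env


def run_validation(playbook, inventory, **kwargs):
    """Run the playbook; a non-zero return code means it did not succeed."""
    argv, env = build_ansible_call(playbook, inventory, **kwargs)
    return subprocess.run(argv, env=env).returncode


argv, _ = build_ansible_call("undercloud-ram.yaml", "localhost,")
print(" ".join(argv))
```

Keeping command construction separate from execution makes the call easy to unit test without Ansible installed.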

A correct way to call Mistral workflows from the CLI is to be created if it does not already exist.

In the end, the framework will allow other OpenStack projects to push their own validations, since they are the ones who know how and what to validate in the different services that make up OpenStack.

All validations will be centralized in the tripleo-validations repository. This means we might want to create a proper tree in order to avoid having 100+ validations in the same directory.

Implementation

Assignee(s)

Primary assignee:

cjeanner

Other contributors:

akrivoka ccamacho dpeacock florianf

Work Items

  • List the existing validations in both undercloud_preflight.py and openstack-tripleo-validations.

  • Decide if we integrate ansible-runner as a dependency (needs to be packaged).

  • Implement the undercloud_preflight validations as Ansible roles.

  • Implement a proper way to call Ansible from the tripleoclient code.

  • Implement support for a configuration file dedicated to the validations.

  • Implement the new subcommand tree in tripleoclient.

  • Validate, Validate, Validate.

Dependencies

Testing

The CI can’t possibly provide the “right” environment with all the requirements. The code has to implement a way to configure the validations so that the CI can override the production values we will set in the validations.
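One way to allow such overrides is to layer a configuration file over built-in defaults, as in this sketch. The section name, the setting names, their values and the load_settings helper are all hypothetical.

```python
import configparser

# Hypothetical production defaults; values are kept as strings, the way
# configparser stores them.
DEFAULTS = {"min_ram_gb": "16", "min_disk_gb": "100"}


def load_settings(config_text=""):
    """Return the defaults overridden by a [validations] section, if any."""
    parser = configparser.ConfigParser()
    parser.read_dict({"validations": DEFAULTS})
    # Later reads win, so values from the file replace the defaults.
    parser.read_string(config_text)
    return dict(parser["validations"])


ci_config = "[validations]\nmin_ram_gb = 4\n"
print(load_settings(ci_config))
```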

Documentation Impact

A new entry in the documentation must be created in order to describe this new framework (for the devs) and new subcommand (for the operators).

References