Validations in TripleO Workflows¶
The Newton release introduced TripleO validations – a set of extendable checks that identify potential deployment issues early and verify that the deployed OpenStack is set up properly. These validations are automatically being run by the TripleO UI, but there is no support for the command line workflow and they’re not being exercised by our CI jobs either.
When enabled, TripleO UI runs the validations at the appropriate phase of planning and deployment. This logic lives in the TripleO UI codebase and is therefore not available to python-tripleoclient or the CI.
The TripleO deployer can run the validations manually, but they need to know at which point to do so and they will need to do it by calling Mistral directly.
This causes a disparity between the command line and GUI experience and complicates the efforts to exercise the validations by the CI.
Each validation already advertises where in the planning/deployment process it should be run, in a dedicated section of the validation itself. In addition, the run_groups Mistral workflow lets us run all validations belonging to a given group.
For each validation group (the current groups include post-deployment) we will update the appropriate workflow in tripleo-common to optionally call run_groups.
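As a rough illustration of what running "all validations belonging to a given group" involves, the sketch below selects validations by the groups they advertise. The data structure, the validation names and the group name `example-group` are hypothetical stand-ins and not the actual tripleo-validations layout; only `post-deployment` comes from this document.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for validation definitions.
# Real validations advertise their groups in their own metadata.
VALIDATIONS = [
    {"id": "check-disk-space", "groups": ["example-group"]},
    {"id": "check-services", "groups": ["post-deployment"]},
    {"id": "check-ntp", "groups": ["example-group", "post-deployment"]},
]


def index_by_group(validations):
    """Map each group name to the ids of the validations that advertise it."""
    index = defaultdict(list)
    for validation in validations:
        for group in validation["groups"]:
            index[group].append(validation["id"])
    return dict(index)


def validations_for(groups, validations):
    """Return the ids to run for the given groups, without duplicates."""
    index = index_by_group(validations)
    selected = []
    for group in groups:
        for validation_id in index.get(group, []):
            if validation_id not in selected:
                selected.append(validation_id)
    return selected
```

A validation that belongs to several groups is returned only once, so requesting two groups together does not run it twice.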
Each of the workflows above will receive a new Mistral input called
run_validations. It will be a boolean value that indicates whether
the validations ought to be run as part of that workflow or not.
To expose this functionality to the command line user, we will add an option for enabling/disabling validations to python-tripleoclient (which will set the run_validations Mistral input) and a way to show the results of each validation in the screen output.
When the validations are run, they will report their status to Zaqar and any failures will block the deployment. The deployer can disable validations if they wish to proceed despite failures.
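The intended control flow can be sketched in plain Python. This is not Mistral workflow code: `run_workflow`, `run_group`, `deploy` and `ValidationFailed` are hypothetical names used only for illustration, and the Zaqar status reporting is left out.

```python
class ValidationFailed(Exception):
    """Raised when an enabled validation group reports failures."""


def run_workflow(deploy, run_group, group, run_validations=False):
    """Run `deploy`, optionally gated on the validations in `group`.

    `run_group` is assumed to return a mapping of validation id to a
    boolean success flag. Both callables are hypothetical stand-ins.
    """
    if run_validations:
        results = run_group(group)
        failures = sorted(name for name, ok in results.items() if not ok)
        if failures:
            # Any failure blocks the deployment.
            raise ValidationFailed(", ".join(failures))
    # With validations disabled, the deployment proceeds unconditionally.
    return deploy()
```

With `run_validations=False` the deploy step is reached even if the group would have failed, which mirrors the deployer's option to proceed despite failures.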
One unresolved question is the post-deployment validations. The Heat stack create/update Mistral action is currently asynchronous and we have no way of calling actions after the deployment has finished. Unless we change that, the post-deployment validations may have to be run manually (or via python-tripleoclient).
1. Document where and how to run each group and leave it at that. This risks that users already familiar with TripleO will miss the validations or will not bother to run them. We would still need to find a way to run validations in a CI job, though.
2. Add subcommands for running validations (and groups of validations) to python-tripleoclient and rely on people running them manually. This is similar to 1., but provides an easier way of running a validation and getting its result. Note that this may be a useful addition even with the proposal outlined in this specification.
3. Do what the GUI does in python-tripleoclient, too. The client will know when to run which validation and will report the results back. The drawback is that we would need to implement and maintain the same set of rules in two different codebases, with no API exposing them. This is exactly the problem the switch to Mistral is supposed to solve.
Other End User Impact¶
We will need to modify python-tripleoclient so that it can display the status of validations once they have finished. TripleO UI already does this.
The deployers may need to learn about the validations.
Running a validation can take about a minute (this depends on the nature of the validation, e.g. does it check a configuration file or does it need to log in to all compute nodes).
This may be a concern if we run multiple validations at the same time.
We should be able to run the whole group in parallel. It’s possible we’re already doing that, but this needs to be investigated. In particular, does with-items run the tasks in sequence or in parallel?
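To show why running a group in parallel matters for the total time, here is a minimal sketch using a thread pool. It says nothing about how Mistral's with-items actually schedules tasks; `run_group_parallel` and `run_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_group_parallel(validations, run_one, max_workers=4):
    """Run every validation in the group concurrently.

    `run_one` is a hypothetical callable that executes a single
    validation and returns True on success. Results come back in
    the same order as the input names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, validations))
    return dict(zip(validations, results))
```

If each validation takes about a minute, a sequential run of a group costs roughly one minute per validation, while a parallel run is bounded by the slowest validation and the number of workers.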
There are also some options that would allow us to speed up the running time of a validation itself, by using common ways of speeding up Ansible playbooks in general:
Disabling the default “setup” task for validations that don’t need it (this task gathers hardware and system information about the target node and it takes some time)
Using persistent SSH connections
Making each validation task run independently (by default, Ansible runs a task on all the nodes, waits for its completion everywhere and then moves on to another task)
Each validation runs the tripleo-ansible-inventory script which gathers information about deployed servers and configuration from Mistral and Heat. Running this script can be slow. When we run multiple validations at the same time, we should generate the inventory only once and cache the results.
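The "generate once, cache the results" idea can be sketched with an in-process cache. `generate_inventory`, the host names and the plan name are hypothetical stand-ins for the real script and its output. If validations run in separate processes, the cache would have to live in a shared file or service instead.

```python
import functools


def generate_inventory(plan):
    """Hypothetical stand-in for the slow inventory script."""
    return {"plan": plan, "hosts": ("controller-0", "compute-0")}


@functools.lru_cache(maxsize=None)
def cached_inventory(plan):
    """Generate the inventory for a plan once and reuse it afterwards."""
    return generate_inventory(plan)


def run_validations(names, plan):
    """Pair each validation with the hosts from the shared inventory."""
    return {name: cached_inventory(plan)["hosts"] for name in names}
```

Because the cached dictionary is shared between callers, validations should treat it as read-only.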
Since the validations are going to be optional, the deployer can always choose not to run them. On the other hand, any slowdown should ideally be outweighed by the time saved on investigating failed deployments.
We will also document the actual time difference. This information should be readily available from our CI environments, but we should also provide measurements on bare metal.
Other Deployer Impact¶
Whether or not the validations are run by default, the only impact should be an option that lets the deployer enable or disable them.
The TripleO developers may need to learn about validations, where to find them and how to change them.
- Primary assignee:
- Other contributors:
Work items or tasks – break the feature up into the things that need to be done to implement it. Those parts might end up being done by different people, but we’re mostly trying to understand the timeline for implementation.
Add the run_validations input and call run_groups from the deployment and node registration workflows
Add an option to run the validations to python-tripleoclient
Display the validations results with python-tripleoclient
Add or update a CI job to run the validations
Add a CI job to tripleo-validations
This should make the validations testable in CI. Ideally, we would verify the expected success/failure for the known validations given the CI environment. But having them go through the testing machinery would be a good first step to ensure we don’t break anything.
We will need to document the fact that we have validations, where they live, and when and how they are run.