The Newton release introduced TripleO validations, a set of extendable checks that identify potential deployment issues early and verify that the deployed OpenStack is set up properly. The TripleO UI runs these validations automatically, but the command line workflow has no support for them and our CI jobs do not exercise them either.
When enabled, TripleO UI runs the validations at the appropriate phase of planning and deployment. This logic lives in the TripleO UI codebase and is therefore not available to python-tripleoclient or the CI.
The TripleO deployer can run the validations manually, but they need to know at which point to do so and they have to call Mistral directly.
This causes a disparity between the command line and GUI experience and complicates the efforts to exercise the validations by the CI.
Each validation already advertises where in the planning/deployment process it should be run. This is under the vars/metadata/groups section. In addition, the tripleo.validations.v1.run_groups Mistral workflow lets us run all validations belonging to a given group.
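The group lookup described above can be sketched in a few lines of Python. This is only an illustration: the validation names are made up, and the only detail taken from this specification is the vars/metadata/groups layout and the three group names.

```python
# Hypothetical validation definitions; the names are invented for illustration.
# Only the vars -> metadata -> groups layout comes from the specification.
VALIDATIONS = {
    "check-disk-space": {"vars": {"metadata": {"groups": ["pre-introspection"]}}},
    "check-network-config": {"vars": {"metadata": {"groups": ["pre-deployment"]}}},
    "check-services-up": {"vars": {"metadata": {"groups": ["post-deployment"]}}},
    "check-ntp": {
        "vars": {"metadata": {"groups": ["pre-deployment", "post-deployment"]}}
    },
}


def validations_in_group(validations, group):
    """Return the sorted names of validations that advertise the given group."""
    return sorted(
        name
        for name, definition in validations.items()
        if group in definition.get("vars", {}).get("metadata", {}).get("groups", [])
    )


print(validations_in_group(VALIDATIONS, "pre-deployment"))
```

A validation may list more than one group, so the same check can appear in several phases.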
For each validation group (currently pre-introspection, pre-deployment and post-deployment) we will update the appropriate workflow in tripleo-common to optionally call run_groups.
Each of the workflows above will receive a new Mistral input called run_validations. It will be a boolean value that indicates whether the validations ought to be run as part of that workflow or not.
To expose this functionality to the command line user, we will add an option for enabling/disabling validations into python-tripleoclient (which will set the run_validations Mistral input) and a way to show the results of each validation to the screen output.
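As a rough sketch of how such an option could map onto the workflow input, consider the following argparse example. The flag names and the default value are assumptions made for illustration; they are not existing python-tripleoclient options, and whether validations run by default is still open.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical on/off switch for validations."""
    parser = argparse.ArgumentParser(prog="deploy-sketch")
    # Both flag names are assumptions, not existing python-tripleoclient options.
    parser.add_argument("--run-validations", dest="run_validations",
                        action="store_true")
    parser.add_argument("--no-run-validations", dest="run_validations",
                        action="store_false")
    # The default is an arbitrary choice for this sketch.
    parser.set_defaults(run_validations=False)
    return parser


def workflow_input(argv):
    """Translate command line arguments into the workflow input dictionary."""
    args = build_parser().parse_args(argv)
    return {"run_validations": args.run_validations}


print(workflow_input(["--run-validations"]))
```

The client would then pass the resulting dictionary as input when it starts the corresponding workflow.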
When the validations are run, they will report their status to Zaqar and any failures will block the deployment. The deployer can disable validations if they wish to proceed despite failures.
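The gating behaviour can be illustrated with a minimal Python sketch. It is not the Mistral workflow itself: run_group and proceed are hypothetical callables standing in for the run_groups workflow and for the rest of the phase, and the status strings are assumptions.

```python
def run_phase(group, run_validations, run_group, proceed):
    """Run one phase, optionally gated by the validations of `group`.

    `run_group` stands in for the run_groups workflow and returns a mapping
    of validation name to "success" or "failed". `proceed` stands in for the
    rest of the phase. Both callables are hypothetical.
    """
    results = run_group(group) if run_validations else {}
    failed = sorted(name for name, status in results.items() if status == "failed")
    if failed:
        # Any failure blocks the phase; nothing else is executed.
        return {"status": "blocked", "failed": failed}
    proceed()
    return {"status": "proceeded", "failed": []}


def fake_run_group(group):
    """Pretend that one validation in the group fails."""
    return {"check-ntp": "failed", "check-network-config": "success"}


print(run_phase("pre-deployment", True, fake_run_group, lambda: None))
print(run_phase("pre-deployment", False, fake_run_group, lambda: None))
```

With validations enabled the failing check blocks the phase; with validations disabled the phase proceeds without calling run_group at all.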
One unresolved question is how to run the post-deployment validations. The Heat stack create/update Mistral action is currently asynchronous and we have no way of calling actions after the deployment has finished. Unless we change that, the post-deployment validations may have to be run manually (or via python-tripleoclient).
1. Document where and how to run each group and leave it at that. The risk is that users already familiar with TripleO may miss the validations or not bother to run them.
We would still need to find a way to run validations in a CI job, though.
2. Add subcommands for running validations (and groups of validations) to python-tripleoclient and rely on people running them manually.
This is similar to 1., but provides an easier way of running a validation and getting its result.
Note that this may be a useful addition even with the proposal outlined in this specification.
3. Do what the GUI does in python-tripleoclient, too. The client will know when to run which validation and will report the results back.
The drawback is that we would need to implement and maintain the same set of rules in two different codebases, with no API exposing them. This is exactly the problem the switch to Mistral is supposed to solve.
We will need to modify python-tripleoclient to be able to display the status of validations once they have finished. TripleO UI already does this.
The deployers may need to learn about the validations.
Running a validation can take about a minute (this depends on the nature of the validation, e.g. does it check a configuration file or does it need to log in to all compute nodes).
This may be a concern if we run multiple validations at the same time.
We should be able to run the whole group in parallel. It’s possible we’re already doing that, but this needs to be investigated. Specifically, does with-items run the tasks in sequence or in parallel?
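To make the difference concrete, here is a small Python sketch that runs the same set of stand-in validations sequentially and in parallel. It says nothing about how Mistral's with-items actually behaves; fake_validation is a hypothetical placeholder that sleeps instead of running Ansible.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_validation(name, duration=0.05):
    """Stand-in for one validation run; sleeps instead of calling Ansible."""
    time.sleep(duration)
    return name, "success"


def run_sequentially(names):
    """Run the validations one after another."""
    return dict(fake_validation(name) for name in names)


def run_in_parallel(names, max_workers=4):
    """Run the validations concurrently in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fake_validation, names))


names = ["validation-a", "validation-b", "validation-c"]
for runner in (run_sequentially, run_in_parallel):
    start = time.perf_counter()
    results = runner(names)
    print(runner.__name__, len(results), round(time.perf_counter() - start, 2))
```

Sequential time grows with the sum of the individual run times, while parallel time is closer to that of the slowest validation, provided enough workers are available.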
There are also some options that would allow us to speed up the running time of a validation itself, by using common ways of speeding up Ansible playbooks in general:
Since the validations are going to be optional, the deployer can always choose not to run them. On the other hand, any slowdown should ideally be outweighed by the time saved on investigating failed deployments.
We will also document the actual time difference. This information should be readily available from our CI environments, but we should also provide measurements on the bare metal.
Depending on whether the validations will be run by default or not, the only impact should be an option that lets the deployer choose whether to run them.
The TripleO developers may need to learn about validations, where to find them and how to change them.
This should make the validations testable in CI. Ideally, we would verify the expected success/failure for the known validations given the CI environment. But having them go through the testing machinery would be a good first step to ensure we don’t break anything.
We will need to document the fact that we have validations, where they live, and when and how they are run.