Copyright (c) 2017 IBM

This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode

Zuul v3 Executor Security

Storyboard: https://storyboard.openstack.org/#!/story/2000910

Playbooks provided in project repos are already run with a set of Ansible plugins to protect the executor from compromise or information leaks. While this belt is keeping our security pants on, we definitely don’t want them to fall down if the belt fails, so we need suspenders in the form of OS level containment.

The goals of this effort are as follows:

  • Define simple, automated ways for Zuul to protect its own executor.
  • Provide operators with guidance on Executor security measures.
  • Keep Zuul simple.

Note that we will not discuss any methods to mitigate resource exhaustion outside the executor, such as filling up Swift with artifacts, using nodes for purposes outside the ToS agreed upon by the Zuul operator, etc.

Problem Description

If a bug in Ansible or our Ansible plugins allows users to break out of the insecure context, the executor will currently be vulnerable to several known attack vectors.

Local Privilege Escalation (LPE)

The executor runs as an unprivileged daemon user. It will run ansible-playbook with those same privileges. While administrators should lock this user down to the minimum amount of access required to launch jobs, Linux and other operating systems which might run Zuul are not immune from privilege escalation vulnerabilities.

Critical Information Leaks (CIL)

Systems which are not generally secured against local users may provide helpful information to malicious actors. This includes simple things like operating system kernel versions, networking configuration, and more critical information like files containing secrets that are accidentally exposed by incorrect local file permissions.

Denial of Service (DoS)

A bad actor that breaks out of Ansible protections may be able to consume all of the resources of the executor with very little effort.

Proposed Change

Execution Flow

Currently the executor functions like this, with untrusted context being secured by Ansible plugins. Any of the playbooks may be run in a trusted or untrusted context depending on whether or not they were defined in a project repo or in a config repo.

  1. Make a writable job dir
  2. Copy merged git repos to job dir
  3. Run pre playbooks
  4. Run in-repo playbooks
  5. Run post playbooks
  6. Nuke job dir
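
The six steps above can be sketched roughly as follows. This is only an illustration of the lifecycle, not the executor's actual code: `job_dir`, `run_job`, and the `run_playbook` callable are hypothetical names, and the sketch says nothing about how trusted and untrusted contexts are distinguished.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def job_dir(root):
    """Step 1: make a writable job dir. Step 6: nuke it afterwards."""
    path = Path(tempfile.mkdtemp(prefix="job-", dir=root))
    try:
        yield path
    finally:
        # Always remove the job dir, even if a playbook raised.
        shutil.rmtree(path, ignore_errors=True)


def run_job(root, repos, phases, run_playbook):
    """Run every phase in order inside a single job dir.

    repos:        {name: path of an already merged checkout}
    phases:       [(phase_name, [playbook, ...]), ...], e.g. pre/run/post
    run_playbook: hypothetical callable(job_path, playbook) -> bool
    """
    results = []
    with job_dir(root) as path:
        # Step 2: copy merged git repos into the job dir.
        for name, src in repos.items():
            shutil.copytree(src, path / "src" / name)
        # Steps 3-5: pre, in-repo, and post playbooks.
        for phase, playbooks in phases:
            for playbook in playbooks:
                ok = run_playbook(path, playbook)
                results.append((phase, playbook, ok))
    return results
```

Keeping the cleanup in a context manager means step (6) still runs when a playbook fails, which matters once the job dir also holds sandbox state.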

Two possible revisions to this are “Secure Execution on Executor” and “Secure Execution on Node”:

Secure Execution on Executor

We can rework the untrusted context by wrapping it in containment methods, outlined below. If we choose containment methods that require an image we’ll add two steps before step (1) above:

  -1. Build image for chroot periodically
  0. Copy image to job dir
  1. Make a writable job dir
  2. Copy merged git repos to job dir
  3. Run pre playbooks
  4. Run in-repo playbooks
  5. Run post playbooks
  6. Nuke job dir

Step (-1) above could be done out of band with the executor using diskimage-builder. It’s possible nodepool-builder could be used here, but for an initial implementation, I believe this can simply be done on a cron once a day and configured as a path to a template image directory or tarball. Any CoW system is an implementation/optimization detail to speed up the copy step and reduce disk footprint.

A more practical plan is to skip (-1) and (0) and just bind mount /usr into the working directory, which is already the default method used by Bubblewrap. This means that a contained attacker has access to all the tools from the executor host, but in terms of real security, depriving them of gcc, while we allow running python and contacting the internet, is only a very minor hurdle for an attacker to get over. We may want to advise deployers to keep the executor host software footprint to a minimum as a result.
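
As a rough illustration of the bind-mount approach, the sketch below only builds the argument list for a `bwrap` invocation; it does not run anything. The function name `bwrap_command` and the `extra_ro_binds` parameter are hypothetical, and a real invocation would almost certainly need host-specific extra read-only binds (for example /lib, /lib64, /bin, or parts of /etc) before Python and Ansible could start inside the sandbox.

```python
def bwrap_command(work_dir, command, extra_ro_binds=()):
    """Build an argv list that runs `command` under Bubblewrap.

    work_dir:       the job's writable directory (bound read-write)
    command:        argv of the program to contain, e.g. ansible-playbook
    extra_ro_binds: additional host paths to expose read-only
    """
    argv = [
        "bwrap",
        "--unshare-all",      # unshare every namespace bwrap supports
        "--share-net",        # ...but keep the network so nodes are reachable
        "--die-with-parent",  # kill the sandbox if the executor dies
        "--ro-bind", "/usr", "/usr",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
    ]
    for path in extra_ro_binds:
        argv += ["--ro-bind", path, path]
    argv += [
        "--bind", work_dir, work_dir,
        "--chdir", work_dir,
    ]
    return argv + list(command)
```

The resulting list could be handed to `subprocess.Popen` in place of the bare `ansible-playbook` command line, so the rest of the execution flow would not need to change.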

Diskspace monitoring

Because playbooks will need to transfer artifacts around, we will need to monitor artifact space usage by playbooks. While usage of an object storage service like Swift is also an option, there will always be some percentage of space needed on the executor for playbooks to use as scratch space, and we don’t want to require object storage services for effective use of Zuul.

A simple method will be to have a single process/thread which walks playbook artifact storage with du periodically. Any job that has exceeded its space allocation will be terminated immediately and have its artifact space emptied. A very fast consumer of space will be able to fill a disk before this can be done, so the limit should be relatively low in comparison to the size of the storage area.

A config option will be created to define a single disk space limit that applies to every job. This should simplify the initial implementation, but later on it may be necessary to define space limits on a per-job basis.
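
One monitoring pass might look like the sketch below. It is a pure-Python stand-in for du that sums apparent file sizes (comparable to `du --apparent-size`) rather than allocated blocks. It assumes each job has its own directory directly under a shared artifact root, and `terminate_job` is a hypothetical hook into the executor; none of these names come from Zuul.

```python
import os
import shutil


def dir_usage(path):
    """Sum apparent file sizes under `path` without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished while we walked; the job is still running.
                pass
    return total


def find_over_limit(artifact_root, limit_bytes):
    """Return {job_id: usage_in_bytes} for every job dir above the limit."""
    over = {}
    for entry in os.scandir(artifact_root):
        if entry.is_dir(follow_symlinks=False):
            usage = dir_usage(entry.path)
            if usage > limit_bytes:
                over[entry.name] = usage
    return over


def enforce(artifact_root, limit_bytes, terminate_job):
    """One pass: terminate each offending job and remove its artifacts."""
    offenders = find_over_limit(artifact_root, limit_bytes)
    for job_id in offenders:
        terminate_job(job_id)  # hypothetical executor hook
        shutil.rmtree(os.path.join(artifact_root, job_id), ignore_errors=True)
    return offenders
```

A thread would call `enforce` on a timer. Because the walk is not atomic, this inherits the race described above: a fast writer can still outrun the interval between passes.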

Evaluation of methods of containment will assume that this change precedes or accompanies any implementation.

Available Containment Methods

There are a number of different options available to address executor security.

Some known methods are listed below with general background information, including a list of pros and cons for each.

Many of these can be combined, some cannot. It seems likely that the end solution will have us adopting at least 2. We may also need to add in a layer of abstraction to Zuul to allow users to write their own security integrations based on their knowledge and abilities, but that is beyond the scope of this document.

ulimit

This limits what resources a user-space process can consume.

LPE

No coverage.

CIL

No coverage.

DoS
  • Can prevent exhaustion of user-space memory
  • Can prevent direct exhaustion of process space
  • Still vulnerable to exhaustion of kernel structures and I/O
Pros
  • Simple implementation
  • No filesystem changes needed
  • Built-in to all operating systems.
  • No performance overhead
Cons
  • Only covers a few DoS vectors and nothing else
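
For illustration, limits of this kind can be applied from Python with the standard `resource` module just before the child process is exec'd. The helper name `popen_with_limits` is hypothetical and which limits to set is left to the caller; this is a sketch of the mechanism, not a proposed policy.

```python
import resource
import subprocess


def popen_with_limits(cmd, limits, **popen_kwargs):
    """Start `cmd` with the given rlimits applied to the child only.

    limits maps resource.RLIMIT_* constants to a value that is used
    for both the soft and the hard limit.
    """
    def apply_limits():
        # Runs in the child between fork() and exec(), so the
        # executor's own limits are left untouched.
        for res, value in limits.items():
            resource.setrlimit(res, (value, value))

    return subprocess.Popen(cmd, preexec_fn=apply_limits, **popen_kwargs)
```

For example, passing `{resource.RLIMIT_AS: 2 * 1024**3, resource.RLIMIT_NPROC: 256}` would cap address space and process count; the numbers are placeholders, and RLIMIT_NPROC counts processes per user rather than per job.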

Chroot

This would involve building a directory with only the binaries needed to run playbooks, source trees bind mounted or copied in, and writable space for artifacts.

Special care would be taken to ensure the binary paths were readonly and any writable paths are mounted noexec.

LPE
  • Mitigates due to removal of most binaries [binaries]
  • Mitigates due to removal of access to directories outside chroot.
  • Vulnerable to kernel problems which allow chroot breakout or privilege escalation via Python.
CIL
  • Mitigates due to removal of most binaries [binaries]
  • Mitigates due to removal of access to directories outside chroot.
  • Still vulnerable to any kernel<->user space interaction which Python can do natively.
[binaries] This mitigation is complicated by the fact that an attacker could build binaries on a test node and transfer them back as artifacts. Getting the permissions and noexec mounts right would be key.
DoS
  • No significant improvement.
Pros
  • Simple, built-in to most operating systems
  • Well understood, can be fully achieved by unprivileged user.
Cons
  • Incomplete coverage
  • Known attack vectors
  • Requires building chroot filesystem carefully.

Cgroups

Cgroups allow one to limit a set of processes’ access to various kernel subsystems, and to identify them as a group.

Various helpers exist for them, and those will be evaluated separately to the fundamental cgroup capability.

The implementation would be to create a cgroup for each ansible-playbook execution, with the administrator being able to decide the template for that cgroup.

LPE
  • Mitigates somewhat by restricting access to some kernel subsystems.
CIL
  • Mitigates somewhat by restricting access to some kernel subsystems.
DoS
  • Significant mitigation due to limitations on all kernel subsystems.
  • Provides a convenient way to integrate with the du process: any job detected overrunning its disk space can have its cgroup ‘frozen’, stopping all processes in the cgroup.
  • Controls “noisy neighbor” by guaranteeing even consumption of CPU and IO.
Pros
  • Relatively simple to create and modify cgroups
Cons
  • Direct cgroup manipulation requires root privileges or setuid helper
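
To make the ‘frozen’ integration concrete, here is a minimal sketch that assumes the cgroup v1 freezer controller, a per-job cgroup directory named after the job, and a parent cgroup path chosen by the administrator. The path in `FREEZER_ROOT` and the function name are assumptions for illustration; writing to the real hierarchy also needs the privileges noted under Cons.

```python
import os

# Assumed location of the parent cgroup; an administrator would choose this.
FREEZER_ROOT = "/sys/fs/cgroup/freezer/zuul"


def set_frozen(job_id, frozen=True, root=FREEZER_ROOT):
    """Freeze or thaw every process in the job's cgroup (cgroup v1).

    Writes FROZEN or THAWED to the cgroup's freezer.state file and
    returns the path that was written.
    """
    state = "FROZEN" if frozen else "THAWED"
    path = os.path.join(root, job_id, "freezer.state")
    with open(path, "w") as f:
        f.write(state)
    return path
```

The disk monitor could call `set_frozen(job_id)` as soon as it detects an overrun, which stops further writes while the job is terminated and its artifact space is cleaned up.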

Seccomp

Seccomp is a system by which a process may restrict what syscalls it, and any of its children, may make. It is a relatively straightforward process to consider what syscalls Ansible would need to make, since its primary functions are local file CRUD, and network operations.

LPE
  • Reduces attack surface of the kernel by limiting to the needed syscalls.
  • Reduces ability of python to do real damage beyond what the needed syscalls can do.
CIL
  • Should reduce surface area again by limiting access to syscalls which leak information.
DoS
  • Same mitigations as LPE.
Pros
  • Well understood, universally available Linux security technology.
  • The syscall-oriented nature means it’s likely the set of syscalls needed will remain relatively static, reducing maintenance load as new versions of Ansible are released.
Cons
  • Tooling is a bit obtuse and user-unfriendly.

LXC

An LXC container is effectively a combination of chroot, cgroup, and Linux kernel namespaces.

A potential implementation would be to build a chroot filesystem using diskimage-builder and then launch an LXC container with that as the root filesystem, and bind mounts for readonly data (git trees) and writable space (artifacts).

LPE
  • Mitigates a bit more than Cgroup+Chroot by preventing crossing user namespace boundaries.
CIL
  • Mitigates a few more leaks by further partitioning processes’ access to data in the kernel that may belong to other processes.
DoS
  • No better than cgroups + chroot.
Pros
  • Simpler implementation than Docker
  • Well understood and mature set of technologies
Cons
  • Less popular than Docker, so there is a risk of it being abandoned
  • Single-vendor open source project (Canonical) makes this problematic for Zuul deployers on not-Ubuntu/Debian.
  • Still requires careful filesystem and mount crafting.

Docker

Docker started life as a daemon to control LXC, just like LXC 2.0 is now. It has grown quite a bit from there and provides all of the same LPE/CIL/DoS protections as LXC.

In addition to the LXC capabilities, it features a rich set of image build tools, and a daemon for storing and retrieving those images called the ‘Docker registry’. There is also a centralized public registry, Docker Hub, where users share their container images.

Pros
  • Industry wide attention means support and adoption will be less controversial.
  • Includes container storage limits as a feature, possibly mitigating the need for the du storage monitoring thread, or at least providing extra protection against the race condition.
Cons
  • A mountain of features which we don’t need means it is far more complex than needed. The net effect of downtime and confusion for operators of Zuul may not be worth the security mitigations.

rkt

Rkt is aimed at those who feel that Docker is overkill for containing things. It mostly sits as an abstraction for containment of things, with systemd-nspawn and kvm available. It provides all the same LPE/CIL/DoS protections as LXC.

Pros
  • Well thought out design that tries only to do one thing well
Cons
  • Single-vendor
  • Unknown how well tested it is

Bubblewrap

https://github.com/projectatomic/bubblewrap

Bubblewrap is similar to Docker or LXC, except that it may not require root privileges to sandbox an application. It is also aimed specifically at sandboxing rather than providing image based isolation like LXC and Docker. It would be used similarly to LXC or Docker, and provide around the same level of mitigation for LPE/CIL/DoS.

Pros
  • Small simple command line utility with no privileged daemons necessary.
  • Specifically built for sandboxing partially trusted apps only.
  • Supports Seccomp
Cons
  • User space is not included in Ubuntu 16.04 (Backporting is trivial).
  • Kernel on Ubuntu 16.04 is limited; a Yakkety backport is required to get the full set of USER_NS features.
  • The kernel side is relatively new and untested, and has already had a few local root exploits found in it.

systemd-nspawn

Similar to bubblewrap, but coming from the systemd project. It does have some unprivileged capabilities, but I believe for our use case we would need it to be setuid or run as root.

Its containment capabilities are comparable to Bubblewrap.

Pros
  • It can take advantage of Btrfs or LVM for CoW snapshots automatically, which is nice for scaling to lots of concurrent jobs.
Cons
  • Confusing relationship with systemd and machined.
  • Seems focused on running a whole OS rather than an app.

AppArmor

AppArmor is a relatively straightforward kernel security module that allows defining the behavior of individual binaries. Combined with chroot, this could be enough to mitigate most vulnerabilities.

LPE
  • Mitigates further by reducing surface area in the kernel and userspace
CIL
  • Mitigates further by reducing surface area in the kernel and userspace
DoS
  • No significant improvement.
Pros
  • Extremely simple profile language adds value without confusing admins too much.
Cons
  • Not supported on CentOS/Fedora/RHEL
  • Having AppArmor enforcing can complicate things if packages have defined AppArmor profiles that do not agree with how the executor wants to use those packages.

SELinux

SELinux is similar to AppArmor, but can offer more fine-grained control and thus more complete protection, at the cost of more complexity and thus a more difficult implementation. It has more or less the same LPE/CIL/DoS profile as AppArmor.

Pros
  • Extremely powerful tools allow extremely fine-grained control
  • Specifically limits chroot and/or container breakouts with the combination of process contexts and MCS (Multi-Category-Security)
Cons
  • Having SELinux enforcing means the whole executor system must have its SELinux configuration fully defined.

Recommendation

Based on the surface level evaluations, I believe Bubblewrap has the highest value for the lowest complexity. We can use it with the /usr from the executor bind mounted into the chroot, which is slightly less secure than managing our own overlays and images since we may end up with dangerous setuid binaries accessible to users. We are already building working directories for jobs so putting a chroot in there doesn’t seem like too far of a departure.

Bubblewrap can be used via setuid on Ubuntu 16.04 (via backports) without upgrading to a Yakkety kernel. It allows us to get a ton of containment without sacrificing much in the way of complexity. We can combine it with cgroups later to increase DoS protection once we have it containing the process. We can also add SELinux support fairly easily once this is known to work. Finally we can layer on seccomp and reduce surface area even further.

Building images for the chroot with minimal binaries would reduce surface area further, but this can be deferred until we have full container/COE support for testing nodes. This way we can keep image building where it is now, in Nodepool.

Alternatives

Secure Execution on a Test Node

Alternatively, we could rely on Ansible in the node, and keep the flow as-is, but make the untrusted context mean “inside a node”. In order to do that we would need to make one of the nodes an “untrusted executor” (simplest answer on which one to use is the first one in the node set). This would involve the following changes:

  • Build custom inventory
    • An inventory would need to have the untrusted executor set up specially so that it uses ansible_connection=local, or it would need to be able to SSH to itself.
  • Create and distribute creds
    • The untrusted executor would need an ephemeral private SSH key, and all other nodes in the nodeset would need this key installed.
  • Network Access
    • Currently we verify that nodepool -> nodes works, and assume executor -> nodes is equivalent. But this would require that we be able to SSH from node to node, which may not always be possible. We also likely will want to make sure inventories have the private IP.
  • Ansible setup on untrusted executor
    • We currently don’t put any restrictions on nodes other than the ability to SSH into them. We’d need to install ansible somehow, possibly in a chroot to keep it isolated from the user’s test execution and dependencies. Isolating Ansible in this way should be quite a bit simpler than isolating Ansible in a security context though.
Pros
  • Same containment for executor as tests means we could probably just drop the Ansible plugins.
  • Executor scales with test nodes
Cons
  • Ansible must be injected or present in all test nodes.
    • Injection is brittle, requiring extra download and build steps that add failure risk to test runs, potentially wasting resources.
    • Requiring Ansible to be present is a burden for those who want to take advantage of the fact that Zuul and nodepool allow custom images.
    • Ansible’s requirements are non-trivial, so if we can’t spare more test nodes for an executor-specific Ansible, at the very least we would need to inject a virtualenv or chroot to run Ansible in, contaminating the test nodes’ environment.
  • Resources normally allocated to running tests will be consumed by executor, or nodes will need to be allocated to running playbooks only.

Ultimately, this method is rejected for both of the Cons above. The Ansible plugin should provide medium level security, and a healthy dose of namespaces, cgroups, and chroot should keep any breakouts contained.

Implementation

Assignee(s)

Primary assignee:
  • SpamapS

Work Items

  • Request backport of bubblewrap userspace from latest Ubuntu stable to xenial-backports.
  • Create ansible minimal chroot image.
  • Add chroot-copy into job dir before insecure contexts.
  • Add code to call ansible-playbook via bwrap in the insecure context.

Repositories

openstack-infra/zuul (feature/zuulv3)

Servers

N/A

DNS Entries

N/A

Documentation

We will need to write thorough documentation outlining not only how to set up an executor, but what risks are still present.

Security

This spec is entirely focused on enhancing the process for securing Zuul v3.

Testing

Integration tests will need to be configured with the mitigation technologies we implement.

Dependencies

zuulv3