Global Request IDs

https://blueprints.launchpad.net/oslo?searchtext=global-req-id

Building a complex resource, like a boot instance, requires not only touching a number of Nova processes, but also other services such as Neutron, Glance, and possibly Cinder. When we make those service jumps we currently generate a new request-id, which makes tracing those flows quite manual.

Problem description

When a user creates a resource, such as a server, they are given a request-id back. This is generated very early in the paste pipeline of most services. It is eventually embedded into the context, which is then used implicitly for logging all activities related to the request. This works well for tracing requests inside of a single service as it passes through its workers, but breaks down when an operation spans multiple services. A common example of this is a server build, which requires Nova to call out multiple times to Neutron and Glance (and possibly other services) to create a server on the network.

It is extremely common for clouds to have an ELK (Elastic Search, Logstash, Kibana) infrastructure that is consuming their logs. The only way to query these flows is if there is a common identifier across all relevant messages. A global request-id immediately makes existing deployed tooling better for managing OpenStack.

Proposed change

The high level solution is as follows (details on specific points later):

  • accept an inbound X-OpenStack-Request-ID header on requests. Require that it looks like a uuid to prevent injection issues. Set this to the value of global_request_id
  • Keep the auto generated existing request_id
  • update oslo.log to default also log global_request_id when it is in a context logging mode.

Paste pipelines

The processing of incoming requests happens piecemeal through the set of paste pipelines. These are mostly common between projects, but there are enough local variation to highlight what this looks like for the base IaaS services, which will be the initial targets of this spec.

Neutron [1]

[composite:neutronapi_v2_0]
use = call:neutron.auth:pipeline_factory
noauth = cors http_proxy_to_wsgi request_id catch_errors extensions neutronapiapp_v2_0
keystone = cors http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
#                                  ^                                 ^
# request_id generated here -------+                                 |
# context built here ------------------------------------------------+

Glance [2]

# Use this pipeline for keystone auth
[pipeline:glance-api-keystone]
pipeline = cors healthcheck http_proxy_to_wsgi versionnegotiation osprofiler authtoken context  rootapp
#                                                                                      ^
# request_id & context built here -----------------------------------------------------+

Cinder [3]

[composite:openstack_volume_api_v3]
use = call:cinder.api.middleware.auth:pipeline_factory
noauth = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler noauth apiv3
keystone = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler authtoken keystonecontext apiv3
#                                  ^                                                   ^
# request_id generated here -------+                                                   |
# context built here ------------------------------------------------------------------+

Nova [4]

[composite:openstack_compute_api_v21]
use = call:nova.api.auth:pipeline_factory_v21
noauth2 = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler noauth2 osapi_compute_app_v21
keystone = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler authtoken keystonecontext osapi_compute_app_v21
#                                  ^                                                       ^
# request_id generated here -------+                                                       |
# context built here ----------------------------------------------------------------------+

oslo.middleware.request_id

In nearly all services the request_id generation happens very early, well before any local logic. The middleware sets an X-OpenStack-Request-ID response header, as well as variables in the environment that are later consumed by oslo.context.

We would accept an inbound X-OpenStack-Request-ID, and validate that it looked like req-$UUID before accepting it as the global_request_id.

The returned X-OpenStack-Request-ID would be the existing request_id. This is like the parent process getting the child process id on a fork() call.

oslo.context from_environ

Fortunately for us most projects now use the oslo.context from_environ constructor. This means that we can add content to the context, or adjust the context, without needing to change every project. For instance in Glance the context constructor looks like [5]:

kwargs = {
   'owner_is_tenant': CONF.owner_is_tenant,
   'service_catalog': service_catalog,
   'policy_enforcer': self.policy_enforcer,
   'request_id': request_id,
}

ctxt = glance.context.RequestContext.from_environ(req.environ,
                                                  **kwargs)

As all logging happens after the context is built. All required parts of the context will be there before logging starts.

oslo.log

oslo.log defaults should include global_request_id during context logging. This is something which can be done late, as users can always override there context logging string format.

projects and clients

With the infrastructure above implemented it will be a small change to python clients to save and emit the global_request_id when created. For instance, Nova calling Neutron, during the get_client call context.request_id would be stored in the client. [6]:

def _get_available_networks(self, context, project_id,
                            net_ids=None, neutron=None,
                            auto_allocate=False):
    """Return a network list available for the tenant.
    The list contains networks owned by the tenant and public networks.
    If net_ids specified, it searches networks with requested IDs only.
    """
    if not neutron:
        neutron = get_client(context)

    if net_ids:
        # If user has specified to attach instance only to specific
        # networks then only add these to **search_opts. This search will
        # also include 'shared' networks.
        search_opts = {'id': net_ids}
        nets = neutron.list_networks(**search_opts).get('networks', [])
    else:
        # (1) Retrieve non-public network list owned by the tenant.
        search_opts = {'tenant_id': project_id, 'shared': False}
        if auto_allocate:
            # The auto-allocated-topology extension may create complex
            # network topologies and it does so in a non-transactional
            # fashion. Therefore API users may be exposed to resources that
            # are transient or partially built. A client should use
            # resources that are meant to be ready and this can be done by
            # checking their admin_state_up flag.
            search_opts['admin_state_up'] = True
        nets = neutron.list_networks(**search_opts).get('networks', [])
        # (2) Retrieve public network list.
        search_opts = {'shared': True}
        nets += neutron.list_networks(**search_opts).get('networks', [])

    _ensure_requested_network_ordering(
        lambda x: x['id'],
        nets,
        net_ids)

    return nets

Note

There are some usage patterns where a client is built and kept for long running operations. In these cases we’d want to change the model to assume that clients are ephemeral, and should be discarded at the end of their flows.

This will also help tracking non user initiated tasks such as periodic jobs that touch other services for information refresh.

Alternatives

Log in the Caller

There was a previous OpenStack cross project spec to completely handle this in the caller - https://review.openstack.org/#/c/156508/. That was merged over 2 years ago, but has yet to gain traction.

It had a number of disadvantages. It turns out the client code is far less standardized here, so fixing every client was substantial work.

It also requires some standard convention for writing these things out to logs on the caller side that is consistent between all services.

It also does not allow people to use Elastic Search to trace their logs (which all large sites have running). A custom piece of analysis tooling would need to be built.

Verify trust in callers

A long time ago, in a galaxy far far away, in a summit room I was not in, I was told there was a concern about clients flooding this field. There has been no documented attack that seems feasable here if we strictly validate the inbound data.

There is a way we could use Service roles to validate trust here, but without a compelling case for why that is needed, we should do the simpler thing.

For reference Glance already accepts a user provided request-id of 64 characters or less. This has existed there for a long time, with no reports as to yet for abuse. We could consider dropping the last constraint and not doing role validation.

Swift multipart transaction id

Swift has a related approach where their transaction id, which is a multipart id that includes a piece generated by the server on inbound request, a timestamp piece, a fixed server piece (for tracking multiple clusters), and a user provided piece. Swift is not currently using any of the above oslo infrastructure, and targets syslog as their primary logging mechanism.

While there are interesting bits in this approach, it’s a less straight forward chunk of work to transition to, given the oslo components. Also, oslo.log has many structured log back ends (like json stream, fluentd, and systemd journal) where we really would want the global and local as separate fields so there is no heuristic parsing required.

Impact on Existing APIs

oslo.middleware request_id contract will change so that it accepts an inbound header, and sets a second env variable. Both are backwards compatible.

oslo.context will accept a new local_request_id. This requires plumbing local_request_id into all calls that take request_id. This looks fully backwards compatible.

oslo.log will need to be adjusted to support logging both request_ids. It should probably be enabled to do that by default, though log_context string is a user configured variable, so they can set whatever site local format works for them. An upgrade release note would be appropriate when this happens.

Security impact

There previously was a concern about trusting request ids from the user. It is an inbound piece of user data, so care should be taken.

  • Ensure it is not allowed to be so big as to create a DOS vector (size validation)
  • Ensure that it is not a possible code injection vector (strict validation)

These items can be handled with strict validation of the content that it looks like a valid uuid.

Performance Impact

Minimal. This is a few extra lines of instruction in existing through paths. No expensive activity is done in this new code.

Configuration Impact

The only configuration impact will be on the oslo.log context string.

Developer Impact

Developers will now have much easier tracing of build requests in their devstack environments!

Testing Impact

Unit tests provided with various oslo components.

Implementation

Assignee(s)

Primary assignee:
sdague
Other contributors:
None

Note

Could definitely use help to get this through the gauntlet, there are lots of little patches here to get right.

Milestones

Target Milestone for completion: TBD

Work Items

TBD

Documentation Impact

TBD - but presumably some updates to operators guide on tracing across services.

Dependencies

None