Global Request IDs¶
https://blueprints.launchpad.net/oslo?searchtext=global-req-id
Building a complex resource, like booting a server, requires not only touching a number of Nova processes, but also other services such as Neutron, Glance, and possibly Cinder. When we make those service jumps we currently generate a new request-id, which makes tracing those flows a largely manual exercise.
Problem description¶
When a user creates a resource, such as a server, they are given a
request-id back. This is generated very early in the paste pipeline of
most services. It is eventually embedded into the context, which
is then used implicitly for logging all activities related to the
request. This works well for tracing requests inside of a single
service as it passes through its workers, but breaks down when an
operation spans multiple services. A common example of this is a
server build, which requires Nova to call out multiple times to
Neutron and Glance (and possibly other services) to create a server on
the network.
It is extremely common for clouds to have an ELK (Elasticsearch, Logstash, Kibana) infrastructure consuming their logs. The only way to query these flows is if there is a common identifier across all relevant messages. A global request-id immediately makes existing deployed tooling better for managing OpenStack.
Proposed change¶
The high level solution is as follows (details on specific points later):
- Accept an inbound X-OpenStack-Request-ID header on requests. Require that it looks like a uuid to prevent injection issues, and set it as the value of global_request_id (a sketch of this validation follows the list).
- Keep the existing auto generated request_id.
- Update oslo.log to also log global_request_id by default when it is in context logging mode.
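For illustration, the strict validation could be as simple as a uuid-shaped check on the inbound header. The helper below is only a sketch; the helper name and exact rules are assumptions, not the final oslo.middleware API:
import re

# Sketch only: accept the inbound id solely when it looks like req-$UUID.
_GLOBAL_REQ_ID = re.compile(
    r'^req-[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}'
    r'-[a-f0-9]{4}-[a-f0-9]{12}$')


def accept_global_request_id(inbound_id):
    """Return the inbound header value only if it looks like req-$UUID."""
    if inbound_id and _GLOBAL_REQ_ID.match(inbound_id):
        return inbound_id
    return None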
Paste pipelines¶
The processing of incoming requests happens piecemeal through the set of paste pipelines. These are mostly common between projects, but there is enough local variation that it is worth highlighting what this looks like for the base IaaS services, which will be the initial targets of this spec.
Neutron [1]¶
[composite:neutronapi_v2_0]
use = call:neutron.auth:pipeline_factory
noauth = cors http_proxy_to_wsgi request_id catch_errors extensions neutronapiapp_v2_0
keystone = cors http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
# ^ ^
# request_id generated here -------+ |
# context built here ------------------------------------------------+
Glance [2]¶
# Use this pipeline for keystone auth
[pipeline:glance-api-keystone]
pipeline = cors healthcheck http_proxy_to_wsgi versionnegotiation osprofiler authtoken context rootapp
# ^
# request_id & context built here -----------------------------------------------------+
Cinder [3]¶
[composite:openstack_volume_api_v3]
use = call:cinder.api.middleware.auth:pipeline_factory
noauth = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler noauth apiv3
keystone = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler authtoken keystonecontext apiv3
# ^ ^
# request_id generated here -------+ |
# context built here ------------------------------------------------------------------+
Nova [4]¶
[composite:openstack_compute_api_v21]
use = call:nova.api.auth:pipeline_factory_v21
noauth2 = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler noauth2 osapi_compute_app_v21
keystone = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler authtoken keystonecontext osapi_compute_app_v21
# ^ ^
# request_id generated here -------+ |
# context built here ----------------------------------------------------------------------+
oslo.middleware.request_id¶
In nearly all services the request_id generation happens very early, well before any local logic. The middleware sets an X-OpenStack-Request-ID response header, as well as variables in the environment that are later consumed by oslo.context.
We would accept an inbound X-OpenStack-Request-ID, and validate that it looks like req-$UUID before accepting it as the global_request_id.
The returned X-OpenStack-Request-ID would be the existing request_id. This is like the parent process getting the child process id on a fork() call.
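As a rough sketch of the resulting contract (the environ key names below are assumptions modelled on the existing openstack.request_id convention, not a final API):
def process_request(environ, local_request_id):
    # The local id is always generated; the inbound header only ever
    # becomes the separate global_request_id after strict validation
    # (accept_global_request_id is the helper sketched earlier).
    inbound = environ.get('HTTP_X_OPENSTACK_REQUEST_ID')
    global_id = accept_global_request_id(inbound)
    if global_id:
        environ['openstack.global_request_id'] = global_id
    environ['openstack.request_id'] = local_request_id
    # The response header still carries the locally generated id.
    return {'X-OpenStack-Request-ID': local_request_id}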
oslo.context from_environ¶
Fortunately for us, most projects now use the oslo.context from_environ constructor. This means that we can add content to the context, or adjust the context, without needing to change every project. For instance, in Glance the context constructor looks like [5]:
kwargs = {
    'owner_is_tenant': CONF.owner_is_tenant,
    'service_catalog': service_catalog,
    'policy_enforcer': self.policy_enforcer,
    'request_id': request_id,
}
ctxt = glance.context.RequestContext.from_environ(req.environ,
                                                  **kwargs)
As all logging happens after the context is built, all required parts of the context will be there before logging starts.
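A rough illustration of the shape this could take (the attribute and environ key names here are assumptions, not the final oslo.context interface):
import uuid


class RequestContext(object):
    # Illustration only: the context keeps its locally generated request_id
    # and additionally carries the inbound global_request_id when present.
    def __init__(self, request_id=None, global_request_id=None, **kwargs):
        self.request_id = request_id or 'req-' + str(uuid.uuid4())
        self.global_request_id = global_request_id

    @classmethod
    def from_environ(cls, environ, **kwargs):
        kwargs.setdefault('request_id',
                          environ.get('openstack.request_id'))
        kwargs.setdefault('global_request_id',
                          environ.get('openstack.global_request_id'))
        return cls(**kwargs)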
oslo.log¶
oslo.log defaults should include global_request_id during context logging. This is something which can be done late, as users can always override their context logging format string.
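For example, an operator could include the new field alongside the existing one in their context format string. The snippet below is only illustrative; the %(global_request_id)s key assumes oslo.log exposes the new context field, and the rest mirrors a typical default:
[DEFAULT]
logging_context_format_string = %(asctime)s %(process)d %(levelname)s %(name)s [%(global_request_id)s %(request_id)s %(user_identity)s] %(message)s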
projects and clients¶
With the infrastructure above implemented, it will be a small change to the python clients to save and emit the global_request_id when created. For instance, when Nova calls Neutron, context.request_id would be stored in the client during the get_client call (a sketch of the client side follows the note below) [6]:
def _get_available_networks(self, context, project_id,
                            net_ids=None, neutron=None,
                            auto_allocate=False):
    """Return a network list available for the tenant.
    The list contains networks owned by the tenant and public networks.
    If net_ids specified, it searches networks with requested IDs only.
    """
    if not neutron:
        neutron = get_client(context)

    if net_ids:
        # If user has specified to attach instance only to specific
        # networks then only add these to **search_opts. This search will
        # also include 'shared' networks.
        search_opts = {'id': net_ids}
        nets = neutron.list_networks(**search_opts).get('networks', [])
    else:
        # (1) Retrieve non-public network list owned by the tenant.
        search_opts = {'tenant_id': project_id, 'shared': False}
        if auto_allocate:
            # The auto-allocated-topology extension may create complex
            # network topologies and it does so in a non-transactional
            # fashion. Therefore API users may be exposed to resources that
            # are transient or partially built. A client should use
            # resources that are meant to be ready and this can be done by
            # checking their admin_state_up flag.
            search_opts['admin_state_up'] = True
        nets = neutron.list_networks(**search_opts).get('networks', [])
        # (2) Retrieve public network list.
        search_opts = {'shared': True}
        nets += neutron.list_networks(**search_opts).get('networks', [])

    _ensure_requested_network_ordering(
        lambda x: x['id'],
        nets,
        net_ids)

    return nets
Note
There are some usage patterns where a client is built and kept for long running operations. In these cases we’d want to change the model to assume that clients are ephemeral, and should be discarded at the end of their flows.
This will also help with tracking non user initiated tasks, such as periodic jobs that touch other services to refresh information.
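To illustrate the client side of the change, here is a sketch only (TracingClient is a hypothetical stand-in, not the python-neutronclient API): the client remembers the caller's request_id and emits it as the outbound header on every call.
import requests


class TracingClient(object):
    # Hypothetical stand-in for a service client: it stores the caller's
    # request_id and sends it as X-OpenStack-Request-ID on every call.
    def __init__(self, endpoint, global_request_id=None):
        self.endpoint = endpoint
        self.global_request_id = global_request_id
        self.session = requests.Session()

    def _headers(self):
        if self.global_request_id:
            return {'X-OpenStack-Request-ID': self.global_request_id}
        return {}

    def list_networks(self, **search_opts):
        resp = self.session.get(self.endpoint + '/v2.0/networks',
                                params=search_opts,
                                headers=self._headers())
        return resp.json()

# Inside get_client() Nova would then do something along the lines of:
#     return TracingClient(neutron_endpoint,
#                          global_request_id=context.request_id)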
Alternatives¶
Log in the Caller¶
There was a previous OpenStack cross project spec to completely handle this in the caller - https://review.openstack.org/#/c/156508/. That was merged over 2 years ago, but has yet to gain traction.
It had a number of disadvantages. It turns out the client code is far less standardized here, so fixing every client would be substantial work.
It also requires a standard convention, consistent between all services, for writing these ids out to logs on the caller side.
It also does not allow people to use Elasticsearch (which all large sites have running) to trace their logs; a custom piece of analysis tooling would need to be built.
Verify trust in callers¶
A long time ago, in a galaxy far far away, in a summit room I was not in, I was told there was a concern about clients flooding this field. There has been no documented attack that seems feasible here if we strictly validate the inbound data.
There is a way we could use Service roles to validate trust here, but without a compelling case for why that is needed, we should do the simpler thing.
For reference, Glance already accepts a user provided request-id of 64 characters or less. This has existed there for a long time, with no reports of abuse as yet. We could consider dropping the last constraint and not doing role validation.
Swift multipart transaction id¶
Swift has a related approach: its transaction id is a multipart id that includes a piece generated by the server on the inbound request, a timestamp piece, a fixed server piece (for tracking multiple clusters), and a user provided piece. Swift is not currently using any of the above oslo infrastructure, and targets syslog as its primary logging mechanism.
While there are interesting bits in this approach, it is a less straightforward chunk of work to transition to, given the oslo components. Also, oslo.log has many structured log backends (like json stream, fluentd, and the systemd journal) where we really would want the global and local ids as separate fields so that no heuristic parsing is required.
Impact on Existing APIs¶
The oslo.middleware request_id contract will change so that it accepts an inbound header and sets a second environment variable. Both changes are backwards compatible.
oslo.context will accept a new global_request_id. This requires plumbing global_request_id into all calls that take request_id. This looks fully backwards compatible.
oslo.log will need to be adjusted to support logging both request ids. It should probably do so by default, though the context logging format string is a user configured option, so operators can set whatever site local format works for them. An upgrade release note would be appropriate when this happens.
Security impact¶
There previously was a concern about trusting request ids from the user. It is an inbound piece of user data, so care should be taken to:
- Ensure it is not allowed to be so big as to create a DOS vector (size validation)
- Ensure that it is not a possible code injection vector (strict validation)
Both items can be handled with strict validation that the content looks like a valid uuid.
Performance Impact¶
Minimal. This is a few extra lines of instruction in existing code paths. No expensive activity is done in this new code.
Configuration Impact¶
The only configuration impact will be on the oslo.log context string.
Developer Impact¶
Developers will now have much easier tracing of build requests in their devstack environments!
Testing Impact¶
Unit tests provided with various oslo components.
Implementation¶
Assignee(s)¶
- Primary assignee:
sdague
- Other contributors:
None
Note
Could definitely use help to get this through the gauntlet; there are lots of little patches here to get right.
Milestones¶
Target Milestone for completion: TBD
Work Items¶
TBD
Documentation Impact¶
TBD, but presumably some updates to the operators guide on tracing across services.
Dependencies¶
None
References¶
Note
This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode