OSprofiler is a distributed tracing toolkit library. It provides Pythonic helpers for trace generation, which avoids repeating code to trace WSGI, RPC, DB, and other important places. It also provides interfaces for various collection backends and helpers to parse data in and out of them.
OpenStack consists of multiple projects. Each project, in turn, is composed of multiple services. To process a request, e.g. to boot a virtual machine, OpenStack uses multiple services from different projects. When something works too slowly, it is extremely complicated to understand what exactly goes wrong and to locate the bottleneck.
To resolve this issue, we introduce a tiny but powerful library, OSprofiler, that can be used by all OpenStack projects and their Python clients. It makes it possible to generate a single trace per request that goes through all involved services and builds a tree of calls (see an example).
For more details about how exactly OSprofiler works, take a look at the readme.
OSprofiler is already used by Cinder, Heat, and Glance, and it is going to be used by most other projects like:
OSprofiler is currently not under the OpenStack umbrella in governance (https://github.com/openstack/governance/blob/master/reference/projects.yaml), so we propose that the Oslo program should be the home for OSprofiler.
Furthermore, just like other Oslo projects, OSprofiler should have its own core team.
How OSprofiler works
OSprofiler is a very tiny library that allows you to create nested traces. Basically, it just keeps an in-memory stack (list) of elements.
Each element has 3 trace IDs:
- base_id - all points of the same trace have the same base_id, which helps us to fetch all points related to some trace
- parent_id - the parent point’s id
- current_id - the current point’s id
It also has 2 methods, start() and stop(). start() pushes a new element onto the stack and calls the driver’s notify method with a payload; stop() removes the latest element from the stack and calls notify() once more.
This allows us to build a tree of calls with durations and payload info, like here.
For more details please read the docs.
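The stack mechanics described above can be illustrated with a minimal sketch. It is not the OSprofiler implementation: the `Profiler` class, the payload field names other than base_id, parent_id, and current_id, and the use of a plain list as the collector are assumptions made for this example.

```python
import time
import uuid


class Profiler:
    """Toy tracer: an in-memory stack of point ids plus a notify callback."""

    def __init__(self, notify, base_id=None, parent_id=None):
        self._notify = notify  # stand-in for the driver's notify method
        self._base_id = base_id or str(uuid.uuid4())
        # The bottom of the stack acts as the parent of the first point.
        self._ids = [parent_id or self._base_id]
        self._names = []

    def start(self, name, info=None):
        """Push a new point onto the stack and notify the driver."""
        self._ids.append(str(uuid.uuid4()))
        self._names.append(name)
        self._emit(name + "-start", info)

    def stop(self, info=None):
        """Notify the driver, then pop the latest point from the stack."""
        self._emit(self._names[-1] + "-stop", info)
        self._names.pop()
        self._ids.pop()

    def _emit(self, name, info):
        self._notify({
            "name": name,
            "base_id": self._base_id,
            "parent_id": self._ids[-2],
            "current_id": self._ids[-1],
            "timestamp": time.time(),
            "info": info or {},
        })


# Usage: a list collects the notifications instead of a real backend.
events = []
profiler = Profiler(events.append)
profiler.start("wsgi", {"path": "/servers"})
profiler.start("db")
profiler.stop()
profiler.stop()
```

Because every notification carries base_id, parent_id, and current_id, a collector can group points by base_id and rebuild the call tree from the parent links, while the timestamps of matching start and stop points give the durations.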
What is going to be used as an OSprofiler driver (trace collector)
- OSprofiler is going to have multi-driver support, which means that basically any centralized system can be used to collect data. We can even write trace information just to files.
- As a first driver we are going to use oslo.messaging & Ceilometer.
- In the future, the OSprofiler team is going to add drivers for MongoDB, InfluxDB, and ElasticSearch.
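As an illustration of multi-driver support, the sketch below selects a driver from the scheme of a connection string. Everything in it is hypothetical: the `FileDriver` and `MemoryDriver` classes, the `DRIVERS` registry, and the `file://` and `memory://` schemes are assumptions for this example, not part of OSprofiler. The only thing taken from the text is that a driver exposes a notify method.

```python
import json
import os
import tempfile
from urllib.parse import urlparse


class FileDriver:
    """Hypothetical driver: appends each notification as one JSON line."""

    def __init__(self, path):
        self.path = path

    def notify(self, payload):
        with open(self.path, "a", encoding="utf-8") as trace_file:
            trace_file.write(json.dumps(payload) + "\n")


class MemoryDriver:
    """Hypothetical driver: keeps notifications in a list (handy in tests)."""

    def __init__(self, _path=""):
        self.events = []

    def notify(self, payload):
        self.events.append(payload)


# Assumed registry: connection-string scheme -> driver class.
DRIVERS = {"file": FileDriver, "memory": MemoryDriver}


def get_driver(connection):
    """Pick a driver from the scheme of a connection string."""
    parsed = urlparse(connection)
    if parsed.scheme not in DRIVERS:
        raise ValueError("unknown profiler driver: %r" % parsed.scheme)
    return DRIVERS[parsed.scheme](parsed.path)


# Usage: write one trace point to a temporary file (POSIX-style path).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")
driver = get_driver("file://" + trace_path)
driver.notify({"name": "db-start", "base_id": "b1"})
```

Keeping the driver contract down to a single notify call is what makes it cheap to add further backends later.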
OSprofiler integration points:
Changes in project configuration:
Add a config group “profiler” with 3 config options inside (in all projects):
    enabled = False  # Fully disable OSprofiler by default

    # by default, because there are too many DB requests
    # and tracing them is useful only for deep debugging.

    # that activate OSprofiler, this is used to
    # block regular users to trigger OSProfiler.
    # They must be the same for all projects.

    # OSProfiler driver to use and credentials
    # for specified backend. Like
    # mongo://user:password@ip:port/schema
Keep a single trace between 2 projects:
Add the OSprofiler middleware to all pipelines in the paste.ini of every project.
This middleware initializes OSprofiler if the request carries a special HTTP header signed with a proper HMAC key.
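The signed-header check can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual OSprofiler middleware: the header names X-Trace-Info and X-Trace-HMAC, the base64 JSON encoding, the SHA-256 digest, and the `profiler.trace_info` environ key are all choices made for this example.

```python
import base64
import hashlib
import hmac
import json


def sign_trace_info(trace_info, hmac_key):
    """Client side: return (data, signature) strings for two HTTP headers."""
    data = base64.urlsafe_b64encode(json.dumps(trace_info).encode("utf-8"))
    signature = hmac.new(hmac_key.encode("utf-8"), data, hashlib.sha256)
    return data.decode("ascii"), signature.hexdigest()


def verify_trace_info(data, signature, hmac_keys):
    """Service side: return the trace info if any known key matches."""
    for key in hmac_keys:
        expected = hmac.new(
            key.encode("utf-8"), data.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        # compare_digest avoids leaking the signature through timing.
        if hmac.compare_digest(expected.encode("ascii"),
                               signature.encode("utf-8")):
            return json.loads(base64.urlsafe_b64decode(data))
    return None


class ProfilerMiddleware:
    """WSGI middleware that only activates tracing for signed requests."""

    def __init__(self, app, hmac_keys, enabled=False):
        self.app = app
        self.hmac_keys = list(hmac_keys)
        self.enabled = enabled

    def __call__(self, environ, start_response):
        trace_info = None
        data = environ.get("HTTP_X_TRACE_INFO")      # assumed header name
        signature = environ.get("HTTP_X_TRACE_HMAC")  # assumed header name
        if self.enabled and data and signature:
            trace_info = verify_trace_info(data, signature, self.hmac_keys)
        # A real middleware would initialize the profiler here; this
        # sketch only exposes the verified trace info to the application.
        environ["profiler.trace_info"] = trace_info
        return self.app(environ, start_response)


# Usage: a client that knows one of the configured keys signs its trace info.
data, signature = sign_trace_info(
    {"base_id": "b1", "parent_id": "p1"}, "SECRET_KEY2")
```

Requests without the headers, or with a signature that matches none of the configured keys, pass through untouched, so regular users cannot switch the profiler on.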
Keep a single trace between 2 services of a single project:
If the RPC caller has initialized the profiler, it should add a special payload containing trace information to all RPC messages. The callee process should initialize OSprofiler if it finds such a payload in a message.
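The RPC hand-off can be sketched like this. The function names, the `trace_info` message key, and the thread-local storage are assumptions for illustration; real projects would hook into their RPC layer (for example oslo.messaging serializers) instead. A separate thread stands in for the callee process.

```python
import threading

_local = threading.local()  # one profiler context per thread


def init_profiler(base_id, parent_id):
    """Mark profiling as active for the current thread."""
    _local.trace_info = {"base_id": base_id, "parent_id": parent_id}


def get_trace_info():
    """Return the current trace info, or None if profiling is inactive."""
    return getattr(_local, "trace_info", None)


def pack_rpc_message(message):
    """Caller side: attach trace info only if a profiler is initialized."""
    trace_info = get_trace_info()
    if trace_info is None:
        return message
    return dict(message, trace_info=dict(trace_info))


def unpack_rpc_message(message):
    """Callee side: initialize the profiler if the message carries a trace."""
    message = dict(message)
    trace_info = message.pop("trace_info", None)
    if trace_info is not None:
        init_profiler(trace_info["base_id"], trace_info["parent_id"])
    return message


def callee(wire_message, result):
    """Runs in another thread, which plays the role of the callee service."""
    result["payload"] = unpack_rpc_message(wire_message)
    result["trace_info"] = get_trace_info()


# Without an initialized profiler the message is sent unchanged.
plain = pack_rpc_message({"method": "ping"})

# With an initialized profiler the trace travels with the message.
init_profiler("base-1", "point-7")
wire = pack_rpc_message({"method": "boot", "args": {"flavor": "small"}})
result = {}
worker = threading.Thread(target=callee, args=(wire, result))
worker.start()
worker.join()
```

The callee ends up with the same base_id as the caller, which is what keeps both services in a single trace.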
Changes required in python clients
There are 2 changes required in each python client:
What points should be tracked by default?
I think that for all projects we should include by default 5 kinds of points:
- All HTTP calls - helps to get information about what HTTP requests were done, the duration of calls (latency of service), and the projects involved in the request.
- All RPC calls - helps to understand the duration of the parts of a request related to different services in one project. This information is essential to understand which service produces the bottleneck.
- All DB API calls - in some cases a slow DB query can produce the bottleneck, so it’s quite useful to track how much time a request spends in the DB layer.
- All driver layer calls - in the case of nova, cinder and others we have vendor drivers (e.g. nova.virt.driver).
- All DB API layer calls (e.g. nova.db.api)
- All raw SQL requests (turned off by default, because they produce a lot of traffic)
- Any other methods, classes, or code pieces that are important for a specific project
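One way to mark such points without repeating tracing code is a decorator. The sketch below is an assumption-laden illustration: the `traced` decorator, the kind names, the `ENABLED_KINDS` switch, and the example functions are hypothetical and do not reproduce the OSprofiler API.

```python
import functools
import time

EVENTS = []  # stand-in for a collector backend

# Kinds of points traced by default in this sketch; "sql" is left out
# to mirror "raw SQL requests are turned off by default".
ENABLED_KINDS = {"http", "rpc", "db_api", "driver"}


def traced(kind):
    """Record the name and duration of every call to the decorated function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if kind not in ENABLED_KINDS:
                return func(*args, **kwargs)
            started = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                EVENTS.append({
                    "kind": kind,
                    "name": func.__qualname__,
                    "duration": time.monotonic() - started,
                })
        return wrapper
    return decorator


@traced("db_api")
def instance_get(instance_id):
    """Hypothetical DB API method."""
    return {"id": instance_id}


@traced("sql")
def run_query(statement):
    """Hypothetical raw SQL call; not traced with the default kinds."""
    return []


instance = instance_get(42)
rows = run_query("SELECT 1")
```

Only the DB API call leaves a record in `EVENTS`; adding "sql" to `ENABLED_KINDS` would switch on the noisier raw SQL points for deep debugging.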
Points that should be tracked in future as well:
- All external commands, for example oslo.concurrency processutils.execute() calls, because the work done by external commands is a key part of many backend API implementations and takes non-trivial time.
- Worker threads spawned / run. Some API calls will result in single-use background threads being spawned to process some work asynchronously from the rest of the work. I think it is important to be able to capture this work in traces, by recording a trace point when a thread start is requested, and then recording trace points at the start and end of the thread’s main method.
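The worker-thread idea could look roughly like the sketch below: one point when the thread start is requested, then one at the start and one at the end of the thread's main method. The `spawn_traced` helper and the point names are hypothetical, and a plain list stands in for the collector.

```python
import threading


def spawn_traced(name, target, notify):
    """Start target in a thread and report request, start, and end points."""
    notify({"name": name + "-requested"})

    def main():
        notify({"name": name + "-start"})
        try:
            target()
        finally:
            # Reported even if the worker raises, so the trace is closed.
            notify({"name": name + "-stop"})

    thread = threading.Thread(target=main, name=name)
    thread.start()
    return thread


events = []
thread = spawn_traced(
    "cleanup", lambda: events.append({"name": "work"}), events.append)
thread.join()
```

The gap between the requested and start points shows how long the work waited to be scheduled, and the start and stop points show how long it actually ran.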
Why not cProfile and other python tracer/debugger?
The scope of this library is quite different:
What about Zipkin and other existing distributed tracing solutions?
OSprofiler is a small library that provides a tracing solution for OpenStack with no vendor lock-in.
OSprofiler doesn’t intend to implement a whole Zipkin-like stack. It’s just a tiny library that is used to integrate OpenStack with different collectors and to provide a native OpenStack tracer/profiler that is not hard-coded to any tracing service (e.g. Zipkin).
OSprofiler uses HMAC to sign trace headers. Only people who know the secret key used for HMAC are able to trigger the profiler. Since a valid signature cannot be produced without the key, regular users cannot switch the profiler on.
Even in the worst case, when an attacker knows the secret key, they will only be able to trigger the profiler, which will make their own requests a bit slower but won’t affect other users.
If it is turned off, there is negligible performance overhead: just a couple of “if None” checks.
If it is turned on, there are two different cases:
Trace every Nth request configuration (not done yet)
In such a configuration every Nth request will have OSprofiler overhead that depends on many things: the number of traced points (which depends on the API method), the OSprofiler backend, and other factors.
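Since the trace-every-Nth-request mode is not implemented yet, the following is only a hypothetical sketch of how such sampling could be decided per request. The `EveryNthSampler` class and its `should_trace` method are invented for this example.

```python
import itertools
import threading


class EveryNthSampler:
    """Hypothetical sampler: says yes to every Nth request, no otherwise."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be a positive integer")
        self._n = n
        self._counter = itertools.count(1)
        self._lock = threading.Lock()  # requests may arrive concurrently

    def should_trace(self):
        with self._lock:
            return next(self._counter) % self._n == 0


sampler = EveryNthSampler(3)
decisions = [sampler.should_trace() for _ in range(6)]
```

Requests that are not sampled would only pay for the counter update and the “if None” style checks, while the sampled ones would carry the full tracing overhead.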
We are adding new CONF group options to all projects:
    [profiler]
    # If False fully disable profiling feature.
    #enabled = False

    # If False doesn’t trace SQL requests.
    #trace_db = True

    # HMAC keys that are used to sign headers.
    # Because OpenStack contains many projects we are not able to update all HMAC
    # keys at the same point of time. To provide ability of no downtime rolling
    # updates of HMAC keys (security reasons) we need ability to specify many HMACs
    # The process of update OLDKey -> NEWKey will look like:
    # 1) Initial system configuration:
    #    ALL Services have OLDKey, users use OLDKey
    # 2) in the middle 1:
    #    All Services have OLDKey, part of them have both OLDKey and NEWKey, users
    #    use OLDKey
    # 3) in the middle 2:
    #    All Services have OLDKey and NEWKey, users use NEWKey
    # 4) in the middle 3:
    #    Part of service have both keys and some of services have only NEWKey,
    #    users use NEWKey
    # 5) end system configuration:
    #    All services have only NewKey
    #hmacs = SECRET_KEY1, SECRET_KEY2

    # Profiler driver collector connection string.
    connection = None
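To show how a service might read these options, here is a small sketch that uses the standard library configparser rather than oslo.config, purely to stay self-contained. The `load_profiler_options` helper and the sample values are assumptions; the option names and defaults follow the listing above.

```python
import configparser

# Sample configuration during a key rotation: both keys are accepted.
SAMPLE_CONF = """
[profiler]
enabled = True
trace_db = False
hmacs = OLD_KEY, NEW_KEY
connection = mongo://user:password@ip:port/schema
"""


def load_profiler_options(text):
    """Parse the [profiler] group, falling back to the listed defaults."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    raw_keys = parser.get("profiler", "hmacs", fallback="")
    return {
        "enabled": parser.getboolean("profiler", "enabled", fallback=False),
        "trace_db": parser.getboolean("profiler", "trace_db", fallback=True),
        # A comma-separated list lets old and new keys coexist.
        "hmacs": [key.strip() for key in raw_keys.split(",") if key.strip()],
        "connection": parser.get("profiler", "connection", fallback=None),
    }


options = load_profiler_options(SAMPLE_CONF)
defaults = load_profiler_options("")
```

While a rotation is in progress, a service holding both keys accepts headers signed with either one, which is what allows the rolling update without downtime.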
By default OSprofiler is turned off. However, it can be kept on in production, because it doesn’t add any overhead until it is triggered, and the profiler can be triggered only by a person who knows the HMAC key.
Developers will be able to profile OpenStack and fix different issues related to performance and scale.
All projects and Python clients need to add a fairly small amount of code to make cross-project/service tracing possible.
We should document in one place how to configure and use OSprofiler.
This work is licensed under a Creative Commons Attribution 3.0 Unported License. http://creativecommons.org/licenses/by/3.0/legalcode