Add pluggable metrics backend for Ironic and IPA

https://bugs.launchpad.net/ironic/+bug/1526219

This spec proposes adding metric data reporting features to Ironic and Ironic Python Agent (IPA). Initially, this will include a statsd reference implementation, but the design will be generic enough to permit alternative backends.

Problem description

Software metrics are extremely useful to operators for recognizing and diagnosing problems in running software, and can be used to monitor the real-time and historical performance of Ironic and IPA in a production environment.

Metrics can be used to determine how quickly (or slowly) parts of the system are running, how often errors (such as API error responses or BMC failures) occur, or the performance impact of a given change.

Currently, neither Ironic nor IPA reports any application metrics.

Proposed change

  • Design a shared pluggable metric reporting system.
  • Implement a generic MetricsLogger which includes:
    • Gauges (generic numerical data).
    • Counters (increment/decrement a counter).
    • Timers (time something).
    • Decorators and context managers for counters and timers.
  • Implement a StatsdMetricsLogger as the reference backend [1].
  • Instrument Ironic to report metric data including:
    • Counting and timing of API requests. This may be accomplished by hooking into Pecan.
    • Counting and timing of RPCs.
    • Counting and timing of most worker functions in ConductorManager.
    • Counting and timing of important driver functions.
    • Counting and timing of node state changes. By inspecting provision_updated_at during a state change, the time the node spent in the previous state can be calculated.
  • Instrument IPA to report metric data including, but not limited to:
    • Image download/write counts and times.
    • Deploy/cleaning counts and times.
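
The node state timing item above can be illustrated with a short calculation. This is a minimal sketch, assuming provision_updated_at holds a timezone-aware datetime recorded when the node entered its current state; the helper name time_in_state_ms is hypothetical and not part of this spec.

```python
from datetime import datetime, timedelta, timezone


def time_in_state_ms(provision_updated_at, now=None):
    """Return whole milliseconds since the node entered its current state.

    Hypothetical helper: call it just before provision_updated_at is
    overwritten by the next state change.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    elapsed = now - provision_updated_at
    # Integer division of timedeltas avoids float rounding.
    return elapsed // timedelta(milliseconds=1)
```

The result could then be reported as a timer metric for the state the node is leaving.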

Example code follows (based on Python logging module naming conventions):

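# Assumed import path; the metrics module is proposed for ironic-lib.
from ironic_lib import metrics

METRICS = metrics.getLogger(__name__)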

class Foo(object):
  def func1(self):
    # Emit gauge metric with value 1
    METRICS.send_gauge("one.fish", 1)

    # Increment counter metric by two
    METRICS.send_counter("two.fish", 2)

    # Decrement counter metric by one
    METRICS.send_counter("red.fish", -1)

    # Randomly sample the data (emit metric 10% of the time)
    METRICS.send_counter("blue.fish", 42, sample_rate=0.1)

    # Emit a timer metric with value of 125 (milliseconds)
    METRICS.send_timer("black.fish", 125)

    # Randomly sample the data (emit metric 1% of the time)
    METRICS.send_timer("blue.fish", 125, sample_rate=0.01)

  @METRICS.counter("func2.count")
  @METRICS.timer("func2.time", sample_rate=0.1)
  def func2(self):
    pass

  # Context managers for counting and timing code blocks
  def func3(self):

    with METRICS.counter("func3.thing_one.count", sample_rate=0.25):
      thing_one()

    with METRICS.timer("func3.thing_two.time"):
      thing_two()
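
One way the interface used above could be structured is sketched below. This is a sketch under stated assumptions, not the proposed implementation: the class names SketchMetricsLogger and _Timer are hypothetical, and an in-memory list named sent stands in for a real backend. It shows how a single timer object can serve as both a decorator and a context manager, and how sample_rate can be applied on the sending side.

```python
import functools
import random
import time


class _Timer(object):
    """Times a function or block; works as decorator and context manager."""

    def __init__(self, logger, name, sample_rate=None):
        self.logger = logger
        self.name = name
        self.sample_rate = sample_rate

    def __enter__(self):
        self._start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.logger.send_timer(self.name, elapsed_ms, self.sample_rate)
        return False  # never swallow exceptions from the timed code

    def __call__(self, func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            # A fresh timer per call keeps concurrent calls independent.
            with _Timer(self.logger, self.name, self.sample_rate):
                return func(*args, **kwargs)
        return wrapped


class SketchMetricsLogger(object):
    """Hypothetical logger that records metrics in memory."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.sent = []  # stand-in for a backend such as statsd

    def _send(self, kind, name, value, sample_rate):
        # With a sample rate, emit only that fraction of the calls.
        if sample_rate is not None and random.random() >= sample_rate:
            return
        self.sent.append((kind, "%s.%s" % (self.prefix, name), value))

    def send_gauge(self, name, value):
        self._send("gauge", name, value, None)

    def send_counter(self, name, value, sample_rate=None):
        self._send("counter", name, value, sample_rate)

    def send_timer(self, name, value, sample_rate=None):
        self._send("timer", name, value, sample_rate)

    def timer(self, name, sample_rate=None):
        return _Timer(self, name, sample_rate)
```

A counter() method could follow the same pattern, sending a count of one each time the wrapped function or block runs.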

Metric names follow this convention (optional parts indicated by []):

[global_prefix.][host_name.]prefix.metric_name

If --metrics-agent-prepend-host-reverse is set, then host.example.com becomes com.example.host to assist with hierarchical data representation.

For example, using the statsd backend with the relevant config options set, METRICS.send_timer("blue.fish", 125, sample_rate=0.25) is emitted to statsd as globalprefix.com.example.host.moduleprefix.blue.fish:125|ms|@0.25.
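
The naming convention can be sketched as two small functions. This is a minimal sketch, assuming the statsd line format name:value|type|@rate; the function names build_metric_name and format_statsd_timer and their parameters are hypothetical.

```python
def build_metric_name(metric_name, prefix, global_prefix=None,
                      host=None, reverse_host=False):
    """Join the optional and required name parts with dots."""
    parts = []
    if global_prefix:
        parts.append(global_prefix)
    if host:
        labels = host.split(".")
        if reverse_host:
            # host.example.com -> com.example.host
            labels.reverse()
        parts.append(".".join(labels))
    parts.append(prefix)
    parts.append(metric_name)
    return ".".join(parts)


def format_statsd_timer(name, value_ms, sample_rate=None):
    """Format a timer metric as a statsd line."""
    line = "%s:%s|ms" % (name, value_ms)
    if sample_rate is not None:
        line += "|@%s" % sample_rate
    return line
```

Reversing the host name puts the most general label first, so metrics from hosts in the same domain share a common prefix in hierarchical stores.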

Alternatives

Alternatively, we could implement a Ceilometer backend. Although Ironic already reports some measurements (such as IPMI sensor data) to Ceilometer, the metrics proposed in this spec do not fit the Ceilometer project mission, which is to “...collect measurements of the utilization of the physical and virtual resources comprising deployed clouds...” [2].

Instead, this spec proposes that we instrument parts of the Ironic/IPA codebase itself to report metrics and statistics about how and when the code is run, and how well it performs. This data is not directly related to “physical and virtual resources comprising deployed clouds.” Therefore, we do not propose the addition of a Ceilometer backend, nor do we propose that the existing Ceilometer measurements be converted to this system, as they represent fundamentally different types of data.

Data model impact

None.

State Machine Impact

None.

REST API impact

To support agent drivers, a config field will be added to the response for the /drivers/<drivername>/vendor_passthru/lookup endpoint in the Ironic API.

This field will contain the agent-related config options that an agent can use to configure itself to report metric data. For example: statsd host and statsd port.
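
The shape of that field could look like the sketch below. This is an illustration under stated assumptions, not the final API: it assumes the conductor's agent_* options described under "Other deployer impact" are available as a dict, and the function names, the "metrics" key, and the "node" key in the usage example are hypothetical.

```python
AGENT_OPTION_PREFIX = "agent_"


def build_agent_metrics_config(conductor_opts):
    """Select the agent_* options and strip the prefix for the agent."""
    return {
        name[len(AGENT_OPTION_PREFIX):]: value
        for name, value in conductor_opts.items()
        if name.startswith(AGENT_OPTION_PREFIX)
    }


def add_config_to_lookup_response(response, conductor_opts):
    """Return a copy of a lookup response with a "config" field added."""
    updated = dict(response)
    updated["config"] = {
        "metrics": build_agent_metrics_config(conductor_opts),
    }
    return updated
```

For example, passing {"node": {"uuid": "1234"}} with agent_backend set to "statsd" would yield a response whose config field tells the agent which backend to use, without changing the rest of the response.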

Client (CLI) impact

None.

RPC API impact

None.

Driver API impact

None.

Nova driver impact

None.

Ramdisk impact

N/A

Security impact

The statsd daemon [3] has no authentication, and consequently anyone who is able to send UDP datagrams to the daemon can send arbitrary metric data. However, the statsd daemon is typically configured to listen only on a local interface, which partially mitigates security concerns.

Other end user impact

None.

Scalability impact

Deployers must ensure that their statsd infrastructure is scaled correctly relative to the size of their deployment. However, even if the statsd daemon is overloaded, Ironic will not be negatively affected: statsd metrics are sent as UDP datagrams without waiting for a reply, so datagrams that the daemon cannot process are simply dropped.

Performance Impact

By default, metrics reporting will be disabled, reducing, but not totally eliminating, the performance impact for users who do not wish to collect metrics. At the very least, a conditional must be checked at each place where a metric could be reported. Furthermore, depending on exactly how and where the conditional checking occurs, arguments may be evaluated even if the metric data aren’t actually sent.
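
The point about argument evaluation can be demonstrated in a few lines. This is a minimal sketch: NoopMetricsLogger here is a stand-in for a disabled backend, and its enabled attribute and the expensive_queue_depth function are hypothetical names used only for illustration.

```python
class NoopMetricsLogger(object):
    """Stand-in for a disabled backend; discards everything it is given."""

    enabled = False  # hypothetical attribute, not part of this spec

    def send_gauge(self, name, value):
        pass


METRICS = NoopMetricsLogger()
evaluations = []


def expensive_queue_depth():
    """Pretend this walks a large data structure."""
    evaluations.append(1)
    return 42


# The argument is computed before send_gauge() runs, so its cost is paid
# even though the noop backend throws the value away.
METRICS.send_gauge("queue.depth", expensive_queue_depth())

# Guarding the call site skips the evaluation entirely when disabled.
if METRICS.enabled:
    METRICS.send_gauge("queue.depth", expensive_queue_depth())
```

After both call sites run, expensive_queue_depth() has been called once: by the unguarded call only.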

Reporting metrics via statsd affects performance minimally. The overhead of sending a single piece of metric data is very small. In particular, statsd metrics are sent via UDP (non-blocking) to a daemon [3] that aggregates the metrics before forwarding them to one of its supported backends. Should this backend become unresponsive or overloaded, metric data will be lost, but there will be no other performance effects.

After the metric data are aggregated by a local statsd daemon, they are periodically flushed to one of statsd’s configured backends, usually Graphite [4].
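
The fire-and-forget send described in this section can be sketched with the standard socket module. This is a minimal sketch, not the proposed backend: the function name send_statsd_line is hypothetical, and a real backend would likely reuse one socket rather than open one per metric.

```python
import socket


def send_statsd_line(line, host="localhost", port=8125):
    """Send one statsd line over UDP; return False if the send failed."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP has no handshake and no acknowledgment, so this returns
        # without knowing whether any daemon received the datagram.
        sock.sendto(line.encode("utf-8"), (host, port))
        return True
    except OSError:
        # Losing a metric is acceptable; breaking the caller is not.
        return False
    finally:
        sock.close()
```

Because errors are swallowed and nothing waits on the daemon, an overloaded or missing statsd daemon costs the caller only the lost metric.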

Other deployer impact

Two sets of configuration options will be added.

The first set will be defined in the ironic-lib metrics library, and will be used by any Ironic service that implements metrics:

[metrics]

# Backend options are "statsd" and "noop"
backend="noop"
statsd_host="localhost"
statsd_port=8125

# See proposed changes section for detailed description of how these are used
prepend_host=false
prepend_host_reverse=false
global_prefix=""

Additionally, the following options will be added to the ironic-conductor and used to configure the ironic-python-agent for metrics on lookup:

# Backend options are "statsd" and "noop"
agent_backend="noop"
agent_statsd_host="localhost"
agent_statsd_port=8125

# See proposed changes section for detailed description of how these are used
agent_prepend_host=false
agent_prepend_host_reverse=false
agent_prepend_uuid=false
agent_global_prefix=""

If the statsd metrics backend is enabled, then deployers must install and configure statsd, as well as any other metrics software that they wish to use (such as Graphite [4]). Additionally, if deployers wish to emit metrics from ironic-python-agent as well, the statsd backend must be accessible from the networks that agents run on.

Developer impact

None.

Implementation

Assignee(s)

Primary assignee:
alineb
Other contributors:
JayF

Work Items

  • Design/implement metric reporting into ironic-lib.
  • Implement statsd backend.
  • Instrument Ironic code to report metrics.
  • Instrument IPA code to report metrics.

Dependencies

None.

Testing

Additional care may be required to test the statsd network code.

Upgrades and Backwards Compatibility

None.

Documentation Impact

Appropriate documentation must be written.