Neutron Resource Diagnostics

RFE: https://bugs.launchpad.net/neutron/+bug/1507499

Problem Description

Cloud software is complex and problems are inevitable. Neutron is no exception to this rule. To cope today we have helper scripts [1], blogs [2], diagnostic tools [3], out-of-tree plugin utilities [4], etc. that could all benefit from additional Neutron diagnostic data APIs. And while we certainly don’t want Neutron to become a new diagnostic system, other projects [5] have found value in exposing some level of diagnostic data for users (typically operators) needing to reach under the hood.

At the same time, Neutron offers a broad array of resources backed by a heterogeneous set of technologies, so standardizing a set of concrete diagnostics across this diverse domain is no simple task. Moreover, a Neutron resource may be realized through multiple components (plugins, agents) spanning multiple nodes (hosts). Any proposal to collect resource diagnostic data must therefore support multiple, potentially disparate, diagnostic participants.

Proposed Change

This spec proposes to deliver a diagnostic framework consisting of the following high-level constructs:

  • Diagnostic checks that are effectively micro-plugins containing concrete logic to carry out a diagnostic check for a given set of resource types (ports, networks, etc.), as well as check-specific metadata such as the check’s name, description, etc.
  • Diagnostic providers that are capable of discovering, managing and executing diagnostic checks. These providers are the sources for diagnostics and will allow neutron plugins and agents to plug-and-play with the framework.
  • A diagnostics API extension that supports a /diagnostics URI for existing neutron resources and services each request by discovering and executing checks on applicable providers and rolling up the responses.

Each of these constructs is discussed in greater detail in the following sections.

Diagnostic Checks

Diagnostic checks are effectively small plugins that provide the following:

  • Check-specific metadata including:
    • A unique name for the check.
    • (optional) A description of the check’s diagnostics.
    • A boolean flag indicating if the check is required or not.
    • A set of resource types (ports, networks, etc.) the check is applicable to.
  • The actual diagnostic check logic. This logic will be invoked when collecting diagnostics for a specific resource instance of a resource type that’s supported by the check (as per the check’s metadata). Checks return a well-formed response including a response code, (optional) text message, etc. If a check supports multiple resource types, its implementation can key off the resource type that’ll be passed in by the framework.

As part of this work, a sample check will be provided for both a server-side and an agent-side plugin to illustrate how to use the framework.
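To make the shape of a check concrete, the following is a minimal sketch only. The names DiagnosticCheck, CheckResult, DnsmasqRunningCheck and the run() signature are hypothetical and are not defined by this spec; the status and remediation codes are borrowed from the REST API examples later in this document.

```python
import abc
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CheckResult:
    """Well-formed check response (hypothetical shape)."""
    code: str
    label: str
    message: Optional[str] = None
    remediation: dict = field(default_factory=dict)


class DiagnosticCheck(abc.ABC):
    """Hypothetical base class carrying the check metadata."""
    name = ""                      # unique name for the check
    description = ""               # optional description
    required = True                # False marks the check as optional
    resource_types = frozenset()   # resource types the check applies to

    @abc.abstractmethod
    def run(self, resource_type, resource_id):
        """Run the check for one resource instance and return a CheckResult."""


class DnsmasqRunningCheck(DiagnosticCheck):
    name = "dnsmasq_running"
    description = "Verifies that dnsmasq is running for the resource."
    resource_types = frozenset({"subnets"})

    def __init__(self, is_running):
        # is_running is an injected callable standing in for real probing
        # logic; it is an assumption made to keep the sketch runnable.
        self._is_running = is_running

    def run(self, resource_type, resource_id):
        if self._is_running(resource_type, resource_id):
            return CheckResult(code="DS000", label="OK")
        return CheckResult(
            code="DHCPE001",
            label="ERROR",
            message="The dnsmasq process is not running.",
            remediation={
                "code": "DHCPR001",
                "message": "Re-enable DHCP for this network, "
                           "then rerun this check.",
            },
        )


ok_result = DnsmasqRunningCheck(lambda t, i: True).run("subnets", "some-id")
bad_result = DnsmasqRunningCheck(lambda t, i: False).run("subnets", "some-id")
```

A check that supports several resource types could branch on the resource_type argument inside run().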

Diagnostic Providers

Diagnostic providers are simply ‘sources’ for diagnostic checks and contain the logic to discover, manage and execute checks. Services in a deployment wishing to contribute diagnostics will therefore implement a diagnostic provider. For this effort we’ll deliver a diagnostic provider binding for both the neutron service and neutron-based agents.

Diagnostic providers have the following characteristics:

  • The means to discover and load known checks. More details in subsequent sections.
  • An internal cache of loaded checks and public APIs to:
    • Determine if the provider has a check for a given resource type or list of resource types.
    • List the resource types the provider has checks for.
    • Execute all checks applicable to a specified resource type and resource instance, collect the well-formed results and return them.
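The provider characteristics above could be sketched roughly as follows. The class and method names (DiagnosticProvider, supports, supported_resource_types, run_checks) are hypothetical, and treating "a list of resource types" as "any of the given types" is an assumption, since the spec does not say whether any or all must match.

```python
from collections import defaultdict


class DiagnosticProvider:
    """Hypothetical provider: caches checks and runs them per resource type."""

    def __init__(self, checks):
        # Internal cache: resource type -> checks that support it.
        self._checks_by_type = defaultdict(list)
        for check in checks:
            for resource_type in check.resource_types:
                self._checks_by_type[resource_type].append(check)

    def supports(self, *resource_types):
        """True if a check exists for any of the given resource types."""
        return any(self._checks_by_type.get(t) for t in resource_types)

    def supported_resource_types(self):
        """List the resource types this provider has checks for."""
        return sorted(t for t, c in self._checks_by_type.items() if c)

    def run_checks(self, resource_type, resource_id):
        """Run every applicable check and collect the results."""
        return [
            {"name": check.name,
             "result": check.run(resource_type, resource_id)}
            for check in self._checks_by_type.get(resource_type, [])
        ]


class AlwaysOkCheck:
    """Stub check used only to exercise the sketch."""
    name = "always_ok"
    resource_types = ("ports", "networks")

    def run(self, resource_type, resource_id):
        return {"code": "DS000", "label": "OK", "message": None}


provider = DiagnosticProvider([AlwaysOkCheck()])
port_results = provider.run_checks("ports", "some-port-id")
```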

The neutron server diagnostic provider binding delivered by this effort will be implemented as a service plugin; consumers wishing to invoke its APIs can get a reference to the plugin instance via the neutron manager. In addition, other plugins can act as a diagnostic provider by supporting the extension and implementing the required diagnostic methods. This approach also works for agent-less plugins and mimics what such plugins typically do today when supporting extensions.

The neutron agent-based diagnostic provider binding delivered by this implementation will use the agent extension framework, which plumbs the support into all neutron-based agents. However, since the agent binding is remote to the neutron API server, it has the following notable differences:

  • To indicate the resource types the agent diagnostic provider supports, a set of (string) resource types is returned in the agent state report and stored in the agent database. On the agent, this set is built by collecting the supported resource types from the checks it manages. On the server side, the API extension controller logic can use the agent’s supported resource list to determine whether the agent should be called to service a given diagnostic request.
  • The public APIs to invoke checks on a specific resource type and resource instance are remotable so they can be called via RPC.
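As an illustration of the agent-side half of this, the sketch below builds the resource type set from the agent's checks and adds it to a state report. The function names are hypothetical, the state report is modeled as a plain dict, and reusing diagnostic_resource_types as the report key is an assumption (the spec only names the database column).

```python
from types import SimpleNamespace


def build_diagnostic_resource_types(checks):
    """Collect the unique resource types supported by the agent's checks."""
    supported = set()
    for check in checks:
        supported.update(check.resource_types)
    return sorted(supported)


def add_diagnostics_to_state(agent_state, checks):
    """Return a copy of the state report with the diagnostics key added."""
    state = dict(agent_state)
    # Key name is an assumption; it mirrors the proposed DB column name.
    state["diagnostic_resource_types"] = build_diagnostic_resource_types(checks)
    return state


# Stand-in checks; only the resource_types attribute matters here.
checks = [
    SimpleNamespace(resource_types=("subnets", "networks")),
    SimpleNamespace(resource_types=("networks",)),
]
state = add_diagnostics_to_state({"host": "dhcp-host1"}, checks)
```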

Diagnostic Providers: Loading Checks

Diagnostic providers are responsible for loading checks upon start-up (e.g. when the service plugin or the agent starts). There are multiple ways to support check plugins, including:

  • Having each check as a stevedore entry point; then loading and managing them via a driver manager.
  • Defining a specific check directory that contains only python modules, where each module is a check (plugin), and discovering and importing them from that directory.
  • Statically defining the checks in the diagnostic provider binding code. That is, the code carries a fixed list of checks rather than discovering and loading them dynamically.

Each approach above has its pros and cons. For the initial implementation we suggest the last approach of statically defining the checks in the code. While this is not the most robust approach, it minimizes complexity and gets a tire-kicker into users’ hands more rapidly, which we can then iterate on once we have some feedback.
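A rough sketch of the static approach might look like the following. The check classes, the STATIC_CHECKS tuple and load_checks() are hypothetical names; rejecting duplicate names follows from the requirement that each check has a unique name.

```python
class PortBindingCheck:
    """Placeholder check; real logic omitted."""
    name = "port_binding"
    resource_types = ("ports",)


class DhcpProcessCheck:
    """Placeholder check; real logic omitted."""
    name = "dhcp_process"
    resource_types = ("networks", "subnets")


# The checks are listed in code rather than discovered at runtime.
STATIC_CHECKS = (PortBindingCheck, DhcpProcessCheck)


def load_checks(check_classes=STATIC_CHECKS):
    """Instantiate each listed check once, keyed by its unique name."""
    loaded = {}
    for check_class in check_classes:
        check = check_class()
        if check.name in loaded:
            raise ValueError("duplicate check name: %s" % check.name)
        loaded[check.name] = check
    return loaded


loaded_checks = load_checks()
```

Moving to entry points or a check directory later would only need to change how the list of classes is produced, not how the provider consumes it.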

Diagnostic Providers: Data Model

As mentioned previously, neutron agent-based diagnostic providers report the resource types supported by the checks they manage. This is a list of unique string resource types that is carried in a new key/value pair of the agent state report. This list is stored in a new database table in a new column called diagnostic_resource_types that maps to its corresponding agent table entry.

None (default) or an empty list signifies the agent does not provide diagnostic data and thus will not be called to service a diagnostic request. Neutron (OVO) objects will be updated as necessary.

API Extension Plugin

The implementation delivered will include an API extension plugin that acts as the REST API controller for diagnostics. As described in the REST API section, a /diagnostics URI is dangled off existing neutron resources that only supports POST with an empty request body. Therefore this controller needs to service diagnostic requests for a specific resource type and ID.

The following pseudocode outlines the controller’s logic for request handling:

  • Get a list of all plugin instances from the neutron manager that support the diagnostic extension and filter them to only those that support the requested resource type.
  • Get a list of all active agents from the DB and filter to only those that have the requested resource type in their diagnostic_resource_types column.
  • From the list of plugin and agent providers that support the given resource type, invoke their API(s) to run the checks they know about for the requested resource type and resource instance.
  • Collect the diagnostic results, roll them into a single response and return it to the caller.
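The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the proposed implementation: agents are modeled as dicts with a hypothetical alive flag, call_agent stands in for the RPC call to an agent-side provider, and the results are returned as a flat list rather than grouped per diagnostic.

```python
def applicable_agents(agents, resource_type):
    """Keep active agents that advertise checks for resource_type.

    None or an empty list means the agent provides no diagnostics.
    """
    return [
        agent for agent in agents
        if agent.get("alive")
        and resource_type in (agent.get("diagnostic_resource_types") or [])
    ]


def collect_check_results(plugins, agents, call_agent,
                          resource_type, resource_id):
    """Run applicable checks on plugins and agents; return a flat list."""
    results = []
    for plugin in plugins:
        if plugin.supports(resource_type):
            results.extend(plugin.run_checks(resource_type, resource_id))
    for agent in applicable_agents(agents, resource_type):
        # call_agent is a stand-in for the remote (RPC) provider API.
        results.extend(call_agent(agent, resource_type, resource_id))
    return results


class StubServerPlugin:
    """Stub server-side provider used only for this example."""

    def supports(self, resource_type):
        return resource_type == "subnets"

    def run_checks(self, resource_type, resource_id):
        return [{"name": "server_check", "provider": "server"}]


def fake_rpc(agent, resource_type, resource_id):
    return [{"name": "agent_check", "provider": agent["host"]}]


agents = [
    {"host": "h1", "alive": True, "diagnostic_resource_types": ["subnets"]},
    {"host": "h2", "alive": True, "diagnostic_resource_types": None},
    {"host": "h3", "alive": False, "diagnostic_resource_types": ["subnets"]},
    {"host": "h4", "alive": True, "diagnostic_resource_types": []},
]
results = collect_check_results(
    [StubServerPlugin()], agents, fake_rpc,
    "subnets", "315ec9bb-34f5-4f7a-a44c-b13015a26803")
```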

REST API

When enabled, this implementation dangles a /diagnostics URI off Neutron resources. The only supported HTTP method for this URI is POST (with an empty request body), which triggers diagnostic data collection from all applicable registered diagnostic providers. While this spec proposes that diagnostic collection run synchronously in the initial iteration of the implementation, future work could make the collection job-based and run it asynchronously in the background.

The generic form of the diagnostics URL is:

POST /v2.0/{resource}/{resource_id}/diagnostics

The response is a list of diagnostic (dict) objects, one object per diagnostic. A diagnostic is an aspect of the {resource} checked, and contains a description as well as a diagnostic status object and list of individual checks run for the said diagnostic. Diagnostic status is set by the diagnostics framework based on the result of all checks for the said resource diagnostic.

The array of checks returned for each diagnostic includes high level details about the check such as name, description and provider. In addition, checks report their own status based on the result of their check execution. If a check is not successful, the check must return a remediation to describe potential ways to remediate the failed check.
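As a consumer-side illustration, the sketch below walks a response of the shape described here and pulls out each unsuccessful check with its remediation. The helper name failed_checks is hypothetical, and the sample response is a trimmed, hand-written stand-in using the field names described above.

```python
def failed_checks(response):
    """Yield (diagnostic, check name, host, remediation message) tuples
    for every check whose status label is not OK."""
    for diagnostic in response.get("diagnostics", []):
        for check in diagnostic.get("checks", []):
            if check["status"]["label"] != "OK":
                yield (
                    diagnostic["diagnostic"],
                    check["name"],
                    check["provider"]["host"],
                    check.get("remediation", {}).get("message"),
                )


sample_response = {
    "diagnostics": [{
        "diagnostic": "dhcp",
        "status": {"code": "DS002", "label": "ERROR"},
        "checks": [
            {
                "name": "check1",
                "status": {"code": "DHCPE001", "label": "ERROR"},
                "provider": {"name": "DHCP Agent", "host": "dhcp-host1"},
                "remediation": {
                    "code": "DHCPR001",
                    "message": "Re-enable DHCP for this network, "
                               "then rerun this check.",
                },
            },
            {
                "name": "check2",
                "status": {"code": "DS000", "label": "OK"},
                "provider": {"name": "DHCP Agent", "host": "dhcp-host2"},
                "remediation": {},
            },
        ],
    }]
}
failures = list(failed_checks(sample_response))
```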

The status at the diagnostic level is handled by the framework and can be one of the following:

  • OK: All checks for the diagnostic are successful.
  • ERROR: One or more checks failed.
  • INACTIVE: One or more providers are inactive and couldn’t be invoked to run the diagnostic checks.
  • DEGRADED: Checks can be registered as optional. If an optional check fails or is INACTIVE, the diagnostic status will be DEGRADED.
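One possible way to roll check results up into a diagnostic status is sketched below. The function name and the input shape are hypothetical, and the precedence (ERROR over INACTIVE over DEGRADED over OK) is an assumption, since the spec does not define how the statuses rank when several apply.

```python
def rollup_status(checks):
    """Derive the diagnostic-level status label.

    checks is an iterable of (label, required) pairs, where label is one
    of "OK", "ERROR" or "INACTIVE" and required is the check's flag.
    """
    checks = list(checks)
    required_problems = {
        label for label, required in checks if required and label != "OK"
    }
    optional_problem = any(
        label != "OK" for label, required in checks if not required
    )
    # Assumed precedence: ERROR > INACTIVE > DEGRADED > OK.
    if "ERROR" in required_problems:
        return "ERROR"
    if "INACTIVE" in required_problems:
        return "INACTIVE"
    if optional_problem:
        return "DEGRADED"
    return "OK"


status = rollup_status([("OK", True), ("ERROR", False)])
```

Here the only failing check is optional, so under the assumed precedence the diagnostic is reported as DEGRADED rather than ERROR.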

For example:

POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
EMPTY POST BODY
==> All successful DHCP diagnostics
{
    "diagnostics": [
        {
            "diagnostic": "dhcp",
            "description": "Neutron network DHCP diagnostics.",
            "status": {
                "code": "DS000",
                "label": "OK",
                "message": "All checks completed successfully."
            },
            "checks": [
                {
                    "name": "check1",
                    "description": "Check1 does this and that.",
                    "status": {
                        "code": "DS000",
                        "label": "OK",
                        "message": null
                    },
                    "provider": {
                        "name": "DHCP Agent",
                        "host": "dhcp-host1"
                    },
                    "remediation": {}
                },
                {
                    "name": "check2",
                    "description": "Check2 does cool stuff.",
                    "status": {
                        "code": "DS000",
                        "label": "OK",
                        "message": null
                    },
                    "provider": {
                        "name": "DHCP Agent",
                        "host": "dhcp-host2"
                    },
                    "remediation": {}
                }
            ]
        }
    ]
}

POST /v2.0/subnets/315ec9bb-34f5-4f7a-a44c-b13015a26803/diagnostics
EMPTY POST BODY
==> A failed DHCP diagnostic
{
    "diagnostics": [
        {
            "diagnostic": "dhcp",
            "description": "Neutron network DHCP diagnostics.",
            "status": {
                "code": "DS002",
                "label": "ERROR",
                "message": "Check 'check1' failed. See the check details for more info."
            },
            "checks": [
                {
                    "name": "check1",
                    "description": "Check1 does this and that.",
                    "status": {
                        "code": "DHCPE001",
                        "label": "ERROR",
                        "message": "The dnsmasq process is not running."
                    },
                    "provider": {
                        "name": "DHCP Agent",
                        "host": "dhcp-host1"
                    },
                    "remediation": {
                        "code": "DHCPR001",
                        "message": "Re-enable DHCP for this network, then rerun this check."
                    }
                },
                {
                    "name": "check2",
                    "description": "Check2 does cool stuff.",
                    "status": {
                        "code": "DS000",
                        "label": "OK",
                        "message": null
                    },
                    "provider": {
                        "name": "DHCP Agent",
                        "host": "dhcp-host2"
                    },
                    "remediation": {}
                }
            ]
        }
    ]
}

Access control to /diagnostics is handled via standard policy definition. The default access control is admin_only, but operators can change this in policy.json as needed.

Benefits

While the main purpose of this effort is to spearhead diagnostics in Neutron and start building out the plumbing, this functionality is immediately valuable for consumers and out-of-tree plugins alike.

For example:

  • The python-don project [3] can implement diagnostic data collection for data used in its analysis.
  • The vmware-nsx plugin can migrate some of its operator CLI functionality [4] into diagnostic data.
  • The implementation can be enhanced to collect interface stats similar to how Nova diagnostics [5] does.

Future work

  • Once the API solidifies, a CLI can be added to support diagnostics.