Per-Process Healthcheck Endpoints
This spec proposes adding per-process HTTP healthcheck endpoints to Manila services, enabling operators and orchestration frameworks to determine whether a service is healthy and not simply alive. The design follows the approach approved for Nova in the per-process-healthchecks specification and aligns with the IETF Health Check Response draft [1].
Problem description
Manila operators have no native, lightweight mechanism to determine whether
a Manila service (manila-api, manila-scheduler, manila-share,
manila-data) is healthy. Current approaches suffer from the following
limitations:
- Service heartbeat table (services.report_count): proves only that the periodic task loop is running, not that the service can serve requests or communicate with backends.
- Log scraping: fragile, delayed, and hard to integrate with modern monitoring stacks (Prometheus, Kubernetes liveness/readiness probes).
- Oslo healthcheck middleware: applies only to manila-api behind WSGI, and does not reflect the health of non-API services.
For manila-share specifically, the gap is critical: a share process can
appear “up” via heartbeat while its storage backend driver is unreachable or
unhealthy.
This problem is more evident in podified deployments, where Kubernetes probes
are the primary mechanism for managing service lifecycle. In this context, the
manila-operator provides a health probe script (healthcheck.py) that
runs an HTTP server on port 8080 inside the manila-share and
manila-scheduler pods. On every HTTP GET it executes
manila-manage service list, queries the services table, and returns
HTTP 200 if all expected rows show state = :-) and status = enabled.
This validates that the process is alive (the probe HTTP server responds) and
that the periodic heartbeat task is ticking (the :-) state is current).
However, it does not validate:
- Database reachability from the service process itself (the probe forks a separate manila-manage subprocess that has its own DB connection).
- RPC / message bus connectivity (a service disconnected from RabbitMQ still shows :-) as long as its heartbeat loop runs).
- Storage backend driver reachability (CephFS, NetApp, LVM can be completely unreachable while the heartbeat keeps ticking).
- The actual request processing ability (the service could be stuck, deadlocked, or failing every operation).
In a podified environment this is particularly damaging because Kubernetes relies on probe results to make restart and traffic-routing decisions.
Use Cases
- As an operator, I need a single HTTP endpoint per Manila process that tells me whether the service can serve requests, so that I can configure Kubernetes liveness and readiness probes with meaningful health signals. I need to run curl localhost:<port>/health | jq and see which subsystem is failing, when it last succeeded, and what the error was. This avoids the need for humans and tools to cross-reference logs, service lists, and backend connectivity manually.
- As a deployment tool (Kubernetes, baremetal), I need a local health check with no dependencies on external hosts, databases, or authentication so that I can integrate it with container runtime probes and service watchdogs.
- As a monitoring operator, I need structured, machine-parseable health data (per-check status, timestamps, error details) that I can scrape into Prometheus and use for granular alerting rules.
Proposed change
Add a lightweight, in-process HTTP endpoint (/health) to each Manila
binary, exposing cached health state as IETF-aligned JSON. The design
inherits its core principles from the Nova specification:
- Passive, not active: health state is updated as a side effect of normal operations (database queries, RPC calls, driver calls), not by active probes that add load. The @healthcheck decorator is applied at the manager layer and observes the outcome of calls that flow through drivers: drivers themselves do not need to implement health check logic (see Driver impact).
- Three-state model: pass, warn, fail per check, following the IETF Health Check Response draft.
- Disabled by default: opt-in via a [healthcheck] configuration section. The endpoint binds to localhost or a Unix socket by default.
- Unauthenticated: designed for local consumption by Kubernetes probes, systemd watchdogs, or human debugging via curl.
Main Health check definitions
- database: Tracks the outcome of database operations that occur naturally during service operation (share create, delete, manage, status updates). A successful DB query sets the check to pass; a caught DBError or DBConnectionError sets it to fail. For idle services (e.g., manila-data), the periodic heartbeat task (report_service_to_db, which runs every ~10 seconds) provides a natural source of database observations, keeping this check fresh even when no user-facing operations are in progress.
- message_bus: Tracks RPC connectivity by observing the outcome of actual RPC operations via the @healthcheck decorator. A successful RPC call sets pass; a transient connection error sets warn; a persistent failure sets fail. For idle services that receive infrequent RPC calls (e.g., manila-data), enabling oslo.messaging’s rpc_ping_enabled option provides a periodic RPC ping that keeps this check fresh without adding synthetic load (see Implementation details).
- backend_driver (manila-share only): Tracks whether the configured storage backend drivers can perform operations. In multi-backend deployments, where a single manila-share process hosts more than one backend, each backend is tracked as a separate element within the backend_driver check array, identified by its componentId (the host@backend identifier from the configuration, e.g., storage01@cephfs). This is the same identifier visible in manila-manage service list, logs, and the services table. This ensures that a failure in one backend does not mask the healthy state of another, and operators can pinpoint exactly which backend is degraded. At startup, the share manager’s init method invokes each driver’s _check_for_setup_error() to detect recoverable and irrecoverable failures, providing the initial backend_driver health state before any user-facing operations occur. The check is then updated from driver calls during the normal share lifecycle (create_share, delete_share, _update_share_stats). The periodic stats reporting task (_report_driver_status, which runs per backend) provides a natural heartbeat for this check. Depending on the driver implementation, the check can be extended to report more events.
Status definitions
- pass: All health indicators are functioning normally and their TTLs are valid. HTTP status code: 200.
- warn: One or more indicators show partial or transient failures, or a TTL has expired but the most recent observation was pass. The service may still be able to serve requests. HTTP status code: 200.
- fail: One or more critical indicators report persistent errors, or all TTLs have expired with no recent successful observation. The service cannot reliably serve requests. HTTP status code: 503.
The overall service status is the worst state across all registered checks:
if any check is fail, the service status is fail; if any check is
warn (and none is fail), the service status is warn; otherwise it is pass.
Each check entry may include an output field carrying a short,
sanitized reason for the current status (see Implementation details).
The output field is omitted when the status is pass.
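The worst-state rule and the HTTP status mapping above can be sketched in a few lines. This is only an illustration: the Status enum and both function names are hypothetical, and treating an empty set of checks as pass is an assumption the spec does not state.

```python
from enum import IntEnum


class Status(IntEnum):
    """Check states ordered so that a larger value means a worse state."""
    PASS = 0
    WARN = 1
    FAIL = 2


def aggregate(check_statuses):
    """Return the worst status among the registered checks."""
    # Assumption: with no registered checks the service reports "pass".
    return max(check_statuses, default=Status.PASS)


def http_status_code(overall):
    """Map the overall status to the HTTP code from the definitions above."""
    return 503 if overall is Status.FAIL else 200
```

With this ordering a single failing check is enough to turn the whole response into a 503, while warn still yields 200.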
TTL and staleness
Each health indicator has a configurable TTL (default: 300 seconds). If no
update (pass or fail) is received within the TTL window, the indicator
transitions to warn to signal staleness. This prevents a quiet service
(one receiving no traffic) from indefinitely reporting pass based on an
observation that may no longer reflect reality.
For services that can remain idle for extended periods (notably
manila-data), TTL expiry would cause spurious warn transitions even
though the service is perfectly healthy. This is mitigated by leveraging
existing periodic operations that naturally refresh each check:
- database: the periodic heartbeat task (report_service_to_db) runs every ~10 seconds across all services, keeping the database check well within its TTL.
- message_bus: enabling oslo.messaging’s rpc_ping_enabled option introduces a lightweight periodic RPC ping that refreshes the message_bus check without synthetic application-level probes.
- backend_driver: the _report_driver_status periodic task already runs on a short interval per backend in manila-share.
With these periodic sources in place, no Manila service should experience TTL-driven staleness under normal operation.
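As a rough sketch of the staleness rule, the status reported for an indicator could be derived from the last observation and its age. The Indicator class and effective_status function are hypothetical names. The sketch assumes that only a stale pass is downgraded to warn and that a stale fail stays fail, which is one possible reading of the status definitions.

```python
import time
from dataclasses import dataclass


@dataclass
class Indicator:
    status: str          # "pass", "warn" or "fail"
    updated_at: float    # time.monotonic() value of the last observation
    ttl: float = 300.0   # seconds; default taken from the text above


def effective_status(indicator, now=None):
    """Return the status to report, downgrading a stale "pass" to "warn"."""
    now = time.monotonic() if now is None else now
    stale = (now - indicator.updated_at) > indicator.ttl
    if stale and indicator.status == "pass":
        return "warn"
    # Assumption: a stale "warn" or "fail" is reported unchanged.
    return indicator.status
```

Passing now explicitly keeps the function easy to test without waiting for a real TTL to elapse.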
Response format
The endpoint returns JSON following the IETF Health Check Response draft,
adapted for Manila service names. The response uses the
application/health+json content type as defined by the IETF draft:
{
  "status": "pass",
  "version": "1.0",
  "serviceId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "description": "manila-share",
  "notes": {
    "host": "storage-node-1.cloud",
    "backend": "storage01@cephfs"
  },
  "checks": {
    "database": [
      {
        "status": "pass",
        "time": "2025-06-15T10:30:00Z"
      }
    ],
    "message_bus": [
      {
        "status": "pass",
        "time": "2025-06-15T10:30:00Z"
      }
    ],
    "backend_driver": [
      {
        "componentId": "storage01@cephfs",
        "componentType": "datastore",
        "status": "pass",
        "time": "2025-06-15T10:29:45Z"
      },
      {
        "componentId": "storage02@netapp",
        "componentType": "datastore",
        "status": "pass",
        "time": "2025-06-15T10:29:50Z"
      }
    ]
  }
}
Example failure response (HTTP 503):
{
  "status": "fail",
  "version": "1.0",
  "serviceId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "description": "manila-share",
  "notes": {
    "host": "storage-node-1.cloud",
    "backend": "storage01@cephfs"
  },
  "checks": {
    "database": [
      {
        "status": "pass",
        "time": "2025-06-15T10:30:00Z"
      }
    ],
    "message_bus": [
      {
        "status": "pass",
        "time": "2025-06-15T10:30:00Z"
      }
    ],
    "backend_driver": [
      {
        "componentId": "storage01@cephfs",
        "componentType": "datastore",
        "status": "fail",
        "time": "2025-06-15T10:28:00Z",
        "output": "CephFS: mon_command failed: [errno 110] Connection timed out"
      },
      {
        "componentId": "storage02@netapp",
        "componentType": "datastore",
        "status": "pass",
        "time": "2025-06-15T10:29:50Z"
      }
    ]
  }
}
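A monitoring script could consume such a response along the following lines. This is a minimal sketch, assuming the body follows the structure shown above; failing_checks is a hypothetical helper, not part of the proposed module.

```python
import json


def failing_checks(body):
    """Yield (check name, component id, output) for every non-pass entry."""
    response = json.loads(body)
    for name, entries in response.get("checks", {}).items():
        for entry in entries:
            if entry.get("status") != "pass":
                # componentId and output are optional, so both may be None.
                yield name, entry.get("componentId"), entry.get("output")
```

Applied to the failure example, the generator would yield a single tuple for the storage01@cephfs backend, which is the information an operator otherwise extracts with curl and jq.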
Implementation details
The implementation introduces a new manila/healthcheck/ module containing:
HealthcheckManager: a singleton that maintains an in-memory dictionary of
HealthcheckStatusItem objects, each tracking a named check’s current
status, last-updated timestamp, TTL, and optional error output. For checks
that cover multiple components (e.g., backend_driver in a multi-backend
manila-share), the manager stores one HealthcheckStatusItem per
(name, component_id) pair and groups them into the same IETF check
array in the response. The manager exposes
update(name, status, output=None, component_id=None) for programmatic
updates and a get_response() method that computes the aggregate status
and returns the JSON payload.
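The following sketch shows one way the manager described above could store entries per (name, component_id) pair and group them into check arrays. It is illustrative only: TTL handling, the singleton behaviour, and most IETF fields are left out, and the internal layout is an assumption.

```python
import json
from datetime import datetime, timezone

_SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


class HealthcheckManager:
    """In-memory store of check results keyed by (name, component_id)."""

    def __init__(self, description):
        self._description = description
        self._items = {}

    def update(self, name, status, output=None, component_id=None):
        entry = {
            "status": status,
            "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        if component_id is not None:
            entry["componentId"] = component_id
        # The output field is omitted when the status is "pass".
        if output is not None and status != "pass":
            entry["output"] = output
        self._items[(name, component_id)] = entry

    def get_response(self):
        # Work on a copy so later updates do not disturb the iteration.
        items = dict(self._items)
        checks = {}
        for (name, _component_id), entry in items.items():
            checks.setdefault(name, []).append(entry)
        overall = max((entry["status"] for entry in items.values()),
                      key=_SEVERITY.get, default="pass")
        return json.dumps({"status": overall,
                           "description": self._description,
                           "checks": checks})
```

Keying the dictionary on the pair means a later update for storage02@netapp replaces only that backend's entry and leaves the other backends untouched.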
@healthcheck decorator: a function decorator that wraps existing
methods and captures health state as a side effect. On successful return, it updates
the named check to pass. If a listed exception type is raised, it updates
the check to fail with a sanitized form of the exception message (see Output
field guidelines below) and re-raises the exception. An optional
component_attr parameter names an attribute on self that is resolved
at call time and used as the componentId for that check entry. This
allows a single check name (e.g., backend_driver) to track multiple
components independently (e.g., one per storage backend).
# manila/share/manager.py
from oslo_db import exception as db_exc

from manila import exception
from manila.healthcheck import healthcheck
from manila import manager


class ShareManager(manager.SchedulerDependentManager):

    @healthcheck('database', [db_exc.DBError, db_exc.DBConnectionError])
    def create_share_instance(self, context, share_id, ...):
        # existing implementation unchanged
        ...

    @healthcheck('backend_driver', [exception.ShareBackendException],
                 component_attr='host')
    def _report_driver_status(self, context):
        # periodic task runs every 60s per backend -- natural heartbeat
        # component_attr='host' resolves self.host at call time
        # (e.g., "storage-node-1@cephfs") and uses it as componentId
        ...
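For illustration, a decorator with the behaviour described above might look roughly like this. It is a sketch under stated assumptions: the module-level STATE dictionary stands in for the HealthcheckManager, FakeManager exists only to exercise the decorator, and recording the exception class name as output is a placeholder for the sanitization the spec requires.

```python
import functools

# Hypothetical stand-in for the manager's state; maps
# (check name, component id) -> (status, output).
STATE = {}


def healthcheck(name, exceptions, component_attr=None):
    """Record pass/fail for ``name`` based on the wrapped method's outcome."""
    watched = tuple(exceptions)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # Resolve the component id at call time from the instance.
            component = (getattr(self, component_attr)
                         if component_attr else None)
            try:
                result = func(self, *args, **kwargs)
            except watched as exc:
                STATE[(name, component)] = ("fail", type(exc).__name__)
                raise  # the caller still sees the original exception
            STATE[(name, component)] = ("pass", None)
            return result
        return wrapper
    return decorator


class FakeManager:
    """Toy manager used only to exercise the decorator."""
    host = "storage01@cephfs"

    @healthcheck("backend_driver", [ConnectionError], component_attr="host")
    def report(self, healthy=True):
        if not healthy:
            raise ConnectionError("backend unreachable")
        return "ok"
```

Exceptions that are not in the listed types propagate without touching the recorded state, so an unrelated programming error does not flip a check to fail in this sketch.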
Output field guidelines: the output field in each check entry
carries the most recent error or warning message.
Stack traces are never included in health responses; they belong to service
logs. Manila already provides an established pattern for surfacing backend
exceptions through the “Asynchronous User Messages” feature, which converts
raw exceptions into strings with enough diagnostic hints for troubleshooting
without leaking sensitive details. The healthcheck output field must apply
equivalent sanitization. Drivers that register custom checks via the optional
extension point control their own output format, but the same principle
applies: short, actionable messages suitable for operator dashboards and
alerts.
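One way to keep raw exception text out of the output field is to map exception types to fixed, pre-approved messages, loosely in the spirit of the user messages pattern. The mapping, the fallback string, and the sanitize_output name below are all hypothetical and only illustrate the idea.

```python
# Hypothetical mapping from exception type to a fixed operator-facing message.
_MESSAGES = {
    TimeoutError: "backend request timed out",
    ConnectionError: "backend connection failed",
}
_FALLBACK = "unexpected error; see service logs"


def sanitize_output(exc):
    """Return a fixed message for ``exc`` without echoing its raw text."""
    # Walk the class hierarchy so subclasses inherit the parent's message.
    for exc_type in type(exc).__mro__:
        if exc_type in _MESSAGES:
            return _MESSAGES[exc_type]
    return _FALLBACK
```

Because the returned strings are constants, credentials or connection strings embedded in an exception message cannot reach the unauthenticated endpoint through this path.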
Message bus tracking: RPC connectivity is tracked via the
@healthcheck decorator on methods that perform RPC calls, catching
connection errors during actual operations. For services that may remain
idle for extended periods (e.g., manila-data), enabling
oslo.messaging’s rpc_ping_enabled option is recommended: it
provides a periodic RPC ping that keeps the message_bus check fresh
without requiring user-facing traffic.
HTTP server: a minimal, single-threaded HTTP server (or Unix socket
listener) that serves the /health endpoint. It reads the in-memory state
from the HealthcheckManager and serializes the response. No
authentication is performed. A per-request timeout is enforced so that
idle or slow connections cannot block the server thread. The server runs
in a dedicated thread within
the service process. On graceful shutdown, the service’s stop() method
(called by oslo.service on SIGTERM [2]) signals the HTTP
server thread to stop accepting connections and exit cleanly.
Health state is stored in-memory only and is reset on process restart.
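A minimal version of such a server can be put together with the Python standard library. The sketch below covers only the TCP case; the handler factory, the start_health_server name, the 5-second timeout, and the hard-coded Cache-Control value are assumptions made for the example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(get_health):
    """Build a handler class; ``get_health`` returns a dict with "status"."""

    class HealthHandler(BaseHTTPRequestHandler):
        timeout = 5  # socket timeout in seconds for each connection

        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            payload = get_health()
            body = json.dumps(payload).encode("utf-8")
            self.send_response(503 if payload["status"] == "fail" else 200)
            self.send_header("Content-Type", "application/health+json")
            self.send_header("Cache-Control", "max-age=10")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler


def start_health_server(get_health, host="127.0.0.1", port=0):
    """Serve /health from a daemon thread; port 0 picks a free port."""
    server = HTTPServer((host, port), make_handler(get_health))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

On shutdown the service would call server.shutdown() followed by server.server_close(), which matches the stop() flow described above.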
Configuration
A new [healthcheck] configuration section is added to manila.conf:
[healthcheck]
# Listen URI. Supports tcp:// and unix:// schemes.
# Unix sockets are preferred: they avoid port conflicts when multiple
# services run on the same host and are not exposed to the network.
# When TCP is needed (e.g., for Kubernetes httpGet probes), customize
# the port per service to avoid conflicts.
# Disabled when empty (default).
# uri = unix:///run/manila/manila-share.sock
# uri = tcp://localhost:9399
# How long (seconds) a health indicator remains valid before going stale.
# Default: 300
# ttl = 300
# HTTP Cache-Control max-age header value (seconds).
# Default: 10
# cache_control = 10
# Enable backend driver health check (manila-share only).
# Default: true
# backend_check = true
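The uri option could be interpreted along the following lines. This is a sketch using only the standard library; parse_listen_uri is a hypothetical helper, and the real implementation may validate the value differently (for example through oslo.config option types).

```python
from urllib.parse import urlparse


def parse_listen_uri(uri):
    """Return ("tcp", (host, port)), ("unix", path), or None when disabled."""
    if not uri:
        return None  # an empty value leaves the endpoint disabled
    parsed = urlparse(uri)
    if parsed.scheme == "tcp":
        if parsed.hostname is None or parsed.port is None:
            raise ValueError(f"tcp URI needs host and port: {uri!r}")
        return "tcp", (parsed.hostname, parsed.port)
    if parsed.scheme == "unix":
        if not parsed.path:
            raise ValueError(f"unix URI needs a socket path: {uri!r}")
        return "unix", parsed.path
    raise ValueError(f"unsupported scheme in {uri!r}")
```

Rejecting unknown schemes at startup surfaces a misconfigured listener immediately rather than as a silently missing endpoint.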
Integration with Kubernetes
When Manila is deployed in Kubernetes, the healthcheck endpoint enables proper probe configuration.
Kubernetes defines three probe types with distinct semantics:
- Startup probe: gates liveness and readiness probes until the service has completed initialization (driver setup, _check_for_setup_error(), reconciliation). Neither liveness nor readiness probes run until the startup probe succeeds.
- Liveness probe: determines whether the service is stuck or deadlocked and should be restarted.
- Readiness probe: determines whether the service can accept traffic. A failing readiness probe removes the pod from Service endpoints without restarting it.
All three probes use the same /health endpoint. The differentiation
is achieved through HTTP status codes and probe thresholds: when any
check is fail, the endpoint returns HTTP 503; otherwise it returns
HTTP 200. The startup probe uses a generous budget
(failureThreshold: 30, i.e., ~300 seconds) to allow driver
initialization and reconciliation to complete. The readiness probe uses
tight thresholds to quickly remove a degraded pod from Service endpoints
(stop routing traffic), while the liveness probe uses lenient thresholds
(failureThreshold: 3) so that only sustained failures trigger a pod
restart.
# Startup: generous, to allow time for driver init and reconciliation
startupProbe:
  httpGet:
    path: /health
    port: 9399
  periodSeconds: 10
  failureThreshold: 30
# Liveness: lenient, only restart on sustained failure
livenessProbe:
  httpGet:
    path: /health
    port: 9399
  periodSeconds: 10
  failureThreshold: 3
# Readiness: tight, quickly stop traffic to degraded pods
readinessProbe:
  httpGet:
    path: /health
    port: 9399
  periodSeconds: 5
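The time budgets implied by these settings follow from multiplying the probe period by the failure threshold. The helper below is a hypothetical approximation that ignores timeoutSeconds and initialDelaySeconds.

```python
def failure_budget_seconds(period_seconds, failure_threshold=3):
    """Approximate time of continuous failure before the probe trips."""
    return period_seconds * failure_threshold


budgets = {
    "startup": failure_budget_seconds(10, 30),
    "liveness": failure_budget_seconds(10, 3),
    # No failureThreshold is set for readiness above, so the Kubernetes
    # default of 3 applies.
    "readiness": failure_budget_seconds(5),
}
```

Under these assumptions the startup probe tolerates about 300 seconds of failures, the liveness probe about 30 seconds, and the readiness probe about 15 seconds.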
Future revisions may introduce separate /livez and /readyz
endpoints following the pattern established by the Kubernetes API server,
if operational experience shows that the single-endpoint approach is
insufficient for distinguishing startup, liveness, and readiness
semantics.
This replaces the current approach described in the Problem description:
instead of forking a manila-manage subprocess on every probe invocation,
probes query the dedicated endpoint.
Alternatives
Extend the existing Oslo healthcheck middleware: the Oslo middleware only
works with WSGI-hosted services (manila-api). It cannot be attached to
RPC-based services (manila-share, manila-scheduler, manila-data),
which are the services with the most critical health monitoring gaps.
Additionally, the existing middleware leaks system information and does not
support the three-state model.
Active health probing (synthetic operations): instead of passively observing real operations, the service could periodically execute synthetic health checks (e.g., a test database query, a test RPC call). This adds load, introduces potential failures, and requires tuning intervals based on the system size.
External health aggregator service: a dedicated service that polls each Manila service for health data. This introduces a new dependency, a new failure mode, and additional network traffic. The in-process approach eliminates these concerns.
Shared Oslo library: the healthcheck framework could live in
oslo.service rather than being implemented per project. This is a
desirable long-term outcome and is not mutually exclusive with this spec.
If Nova’s implementation lands first and is generalized into Oslo, Manila
can adopt the shared library, but it would require additional coordination.
The Manila-specific aspects (backend driver checks) would
remain in Manila regardless.
Data model impact
None. Health state is maintained in-memory within each service process and is not persisted to the database. Process restart clears all health state, which is re-established from subsequent real operations.
REST API impact
No changes to the existing Manila public REST API.
The healthcheck endpoint (/health) is a separate, unauthenticated HTTP
server running on a distinct port or Unix socket. It is not part of the
Manila API service and is not versioned with the Manila API microversion
scheme.
In addition, the healthcheck endpoint is disabled by default, so existing
deployments that rely on other approaches are unaffected.
Driver impact
No mandatory driver changes. The @healthcheck decorator is applied to
share manager methods that call into drivers, not to the drivers themselves.
Driver exceptions are caught by the decorator and used to update health state.
An optional extension point allows drivers to register custom health checks
if they have domain-specific health signals (e.g., CephFS checking cluster
health via ceph status). This is not required for the initial
implementation.
Security impact
- The healthcheck endpoint is disabled by default. It must be explicitly enabled via the [healthcheck] configuration section.
- When enabled, the endpoint is unauthenticated and binds to localhost (127.0.0.1) by default. This limits access to processes on the same host (Kubernetes probes, human debugging via oc exec).
- The endpoint can optionally be bound to 0.0.0.0 or a specific IP address for external consumers (Prometheus, operator reconciler). This is an explicit opt-in configuration and should be properly documented.
- The response body might contain service-level metadata (hostname, backend name, error messages). While this information is not privileged in a typical deployment, binding to 0.0.0.0 in an untrusted network should be avoided. Documentation will recommend policies for restricting access when exposing the endpoint beyond localhost.
- Because the endpoint is unauthenticated, the output field must never contain raw exception messages, credentials, connection strings, or backend internals. All error output is sanitized using the same approach established by Manila’s Asynchronous User Messages feature (see Implementation details).
Notifications impact
None.
Other end user impact
None. The healthcheck endpoint is an operator-facing feature. End users interact with Manila through the public API, which is unaffected.
Performance Impact
Minimal. The @healthcheck decorator adds a dictionary update
(setting a status enum and timestamp) on the success or failure path of
decorated methods. This is a constant-time, in-memory operation with
negligible overhead.
The HTTP server is single-threaded and serves only /health requests.
Under normal Kubernetes probe intervals (every 5-10 seconds), this amounts
to a trivial workload.
Health state reads by the HTTP handler are not locked – eventual consistency between the decorator updating state and the handler reading it is acceptable for health reporting purposes.
The approach eliminates the per-probe subprocess fork overhead of the
current healthcheck.py sidecar, which spawns a manila-manage process
on every invocation. This is a net performance improvement.
Other deployer impact
- New configuration section [healthcheck] with options described in Configuration.
- The feature is disabled by default; no action required for deployments that do not opt in.
- Podified deployments using openstack-k8s-operators will benefit from operator-level integration once the manila-operator is updated to configure healthcheck URIs and Kubernetes probes.
Developer impact
Developers adding new Manila service functionality that interacts with
databases, RPC, or storage backends should apply the @healthcheck
decorator to relevant methods. This is a lightweight annotation that does
not change method signatures or behavior.
Upgrade impact
None. The healthcheck endpoint is disabled by default and has no effect on existing deployments that do not opt in.
Implementation
Assignee(s)
- Primary assignee:
Francesco Pantano (fmount)
- Other contributors:
Goutham Pacha Ravi (gouthampravi)
Work Items
- Core framework (manila/healthcheck/ module):
  - HealthcheckManager class with in-memory state store
  - @healthcheck decorator
  - HTTP server
  - [healthcheck] configuration options registration
- manila-share integration:
  - Apply @healthcheck('database', ...) to key share manager methods
  - Apply @healthcheck('backend_driver', ...) to driver-calling methods and _report_driver_status
- Other service integration:
  - manila-scheduler: database and message_bus checks
  - manila-api: database and message_bus checks (alongside or replacing Oslo healthcheck middleware)
  - manila-data: database and message_bus checks
Dependencies
None
Testing
Unit tests
- HealthcheckManager: state transitions, TTL expiry, per-component tracking, and aggregate status computation.
- @healthcheck decorator: pass/fail updates, exception re-raising, component_attr resolution.
- HTTP server: response format, status codes, and Cache-Control header.
Tempest tests
- Integration test in a multi-service deployment: verify that each Manila service exposes a healthy /health endpoint after successful startup and a valid JSON response is returned.
- No negative testing via Tempest (inducing failures requires infrastructure manipulation not suited to Tempest).
Documentation Impact
- Admin guide: new section on configuring and using per-process healthchecks and security considerations for exposing the endpoint beyond localhost.
- Configuration reference: document all [healthcheck] options.