Improve filesystem store driver to utilize NFS capabilities

https://blueprints.launchpad.net/glance/+spec/improve-filesystem-driver

Problem description

Glance's filesystem backend can be pointed at an NFS share mounted as a local filesystem, so no NFS-specific configuration is required on the Glance side. Glance does not know the NFS server address or the NFS share path at all; it simply assumes that each image is stored on the local filesystem. The downside of this assumption is that Glance is not aware of whether the NFS server is connected/available or whether the NFS share is actually mounted, and it keeps performing add/delete operations on the local filesystem directory, which can later cause synchronization problems once the NFS share comes back online.

Use case: In a k8s environment where OpenStack Glance is installed on top of OpenShift and the NFS share is mounted via the Volume/VolumeMount interface, the Glance pod will not start if the NFS share is not ready. If the NFS share becomes unavailable after the Glance pod is up, an upload operation fails with the following error:

sh-5.1$ openstack image create --container-format bare --disk-format raw --file /tmp/cirros-0.5.2-x86_64-disk.img cirros
ConflictException: 409: Client Error for url: https://glance-default-public-openstack.apps-crc.testing/v2/images/0ce1f894-5af7-44fa-987d-f4c47c77d0cf/file, Conflict

Even though the Glance Pod is still up, the liveness and readiness probes start failing and as a result the Glance Pods are marked as Unhealthy:

Normal   Started         12m                    kubelet            Started container glance-api
  Warning  Unhealthy       5m24s (x2 over 9m24s)  kubelet            Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       5m24s (x3 over 9m24s)  kubelet            Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       5m24s                  kubelet            Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       4m54s (x2 over 9m24s)  kubelet            Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy       4m54s                  kubelet            Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Later, once the failure threshold set for the Pod is reached, the kubelet marks the Pod as Failed and, because the restart policy requires it, attempts to recreate it:

glance-default-single-0                                         0/3     CreateContainerError   4 (3m39s ago)   28m

$ oc describe pod glance-default-single-0 | tail
Normal   Started    29m                    kubelet   Started container glance-api
Warning  Unhealthy  10m (x3 over 26m)      kubelet   Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  10m                    kubelet   Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  10m                    kubelet   Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  9m30s (x4 over 26m)    kubelet   Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  9m30s (x5 over 26m)    kubelet   Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  9m30s (x2 over 22m)    kubelet   Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  9m30s (x3 over 22m)    kubelet   Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  9m30s                  kubelet   Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning  Failed     4m47s (x2 over 6m48s)  kubelet   Error: context deadline exceeded

Unlike other (non-k8s) deployments, where the Glance service keeps running even when the NFS share is unavailable and simply uploads to or deletes data from the local filesystem, in this case we can definitely say that the NFS share is unavailable: Glance will not be able to upload any image to the filesystem local to the container, the Pod will be marked as failed, and it will fail to be recreated.

Proposed change

We plan to add a new plugin, enable_by_files, to the healthcheck wsgi middleware in oslo.middleware. All OpenStack components can use it to report a 503 <REASON> error when a desired path is not present, or 200 OK when everything is fine.

In Glance we can configure this healthcheck middleware as an application in glance-api-paste.ini:

[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = enable_by_files (optional, default: empty)
# used by the 'enable_by_files' backend
enable_by_file_paths = /var/lib/glance/images/filename,/var/lib/glance/cache/filename (optional, default: empty)

# Use this composite for keystone auth with caching and cache management
[composite:glance-api-keystone+cachemanagement]
paste.composite_factory = glance.api:root_app_factory
/: api-keystone+cachemanagement
/healthcheck: healthcheck

The middleware will return “200 OK” if everything is OK, or “503 <REASON>” if not, with the reason why this API should not be used.

“backends” is the name of a stevedore extension in the “oslo.middleware.healthcheck” namespace.
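
For illustration, the proposed backend could look roughly like the existing disable_by_file backend in oslo.middleware. The sketch below is only an assumption of this spec: the EnableByFilesHealthcheck class, its module and the enable_by_file_paths option do not exist in oslo.middleware yet, and option registration is omitted:

# Hypothetical oslo_middleware/healthcheck/enable_by_files.py (sketch only;
# the module, class and option names follow this spec's proposal).
import os

from oslo_middleware.healthcheck import pluginbase


class EnableByFilesHealthcheck(pluginbase.HealthcheckBaseExtension):
    """Report healthy only while every configured marker file exists."""

    def healthcheck(self, server_port):
        # 'enable_by_file_paths' is the comma-separated list option proposed
        # above; registering it as a ListOpt is omitted from this sketch.
        paths = self._conf_get('enable_by_file_paths')
        if not paths:
            return pluginbase.HealthcheckResult(
                available=True, reason="OK",
                details="No 'enable_by_file_paths' configuration value")
        missing = [path for path in paths if not os.path.exists(path)]
        if missing:
            return pluginbase.HealthcheckResult(
                available=False, reason="FILE PATH MISSING",
                details="Marker file(s) not found: %s" % ", ".join(missing))
        return pluginbase.HealthcheckResult(available=True, reason="OK")

The backend would then be registered as a stevedore entry point in the oslo.middleware.healthcheck namespace, so that "backends = enable_by_files" resolves to this class.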

In Glance, if the local filesystem path is mounted on an NFS share, we propose to add a marker file named .glance to the NFS share and then use that file path to configure the enable_by_files healthcheck middleware plugin as shown below:

[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = enable_by_files
enable_by_file_paths = /var/lib/glance/images/.glance

If NFS goes down, or the /healthcheck endpoint otherwise starts reporting 503 <REASON>, the admin can take appropriate action to make the NFS share available again.
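
As an illustration, an external monitor could consume the endpoint as sketched below; the URL is only a placeholder for the deployment-specific Glance endpoint:

# Minimal sketch of polling /healthcheck; the URL is a placeholder.
import sys
import urllib.error
import urllib.request

HEALTHCHECK_URL = "http://127.0.0.1:9292/healthcheck"

try:
    with urllib.request.urlopen(HEALTHCHECK_URL, timeout=5) as resp:
        # 200 OK: all configured marker files are present.
        print("healthy: %s %s" % (resp.status, resp.read().decode()))
        sys.exit(0)
except urllib.error.HTTPError as exc:
    # 503 <REASON>: a marker file (and hence the NFS share) is missing;
    # the reason is carried in the response body.
    print("unhealthy: %s %s" % (exc.code, exc.read().decode()))
    sys.exit(1)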

Alternatives

Introduce a few configuration options for the filesystem driver to help detect whether the NFS share has been unmounted from underneath the Glance service. We propose the following new configuration options:

  • filesystem_is_nfs_configured - boolean, whether NFS is configured for the filesystem store

  • filesystem_nfs_host - IP address of the NFS server

  • filesystem_nfs_share_path - NFS export path mapped to the local filesystem mount point

  • filesystem_nfs_mount_options - mount options to be passed to the NFS client

  • rootwrap_config - path to the rootwrap configuration used to run mount commands as root

If filesystem_is_nfs_configured is set, i.e. NFS is configured, then the deployer must also specify the filesystem_nfs_host and filesystem_nfs_share_path options in glance-api.conf; otherwise the respective glance store will be disabled and will not be used for any operation.

We plan to use the existing os-brick library (already used by the cinder driver of glance_store) to create an NFS client from the above configuration options and to check whether the NFS share is available during service initialization as well as before each image upload/import/delete operation. If the NFS share is not available during service initialization, add and delete operations will be disabled; if NFS goes down afterwards, we will return an HTTP 410 (Gone) response to the user.
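
A rough sketch of such a check, assuming the configuration options listed above live in the [glance_store] group; the helper function itself is hypothetical and error handling is omitted:

# Hypothetical helper checking NFS availability via os-brick (sketch only;
# the configuration option names are the ones proposed in this alternative).
from os_brick.remotefs import remotefs
from oslo_concurrency import processutils


def ensure_nfs_mounted(conf, mount_point_base):
    """Mount the configured NFS share and return its local mount point."""
    if not conf.glance_store.filesystem_is_nfs_configured:
        return None
    share = '%s:%s' % (conf.glance_store.filesystem_nfs_host,
                       conf.glance_store.filesystem_nfs_share_path)
    root_helper = 'sudo glance-rootwrap %s' % conf.glance_store.rootwrap_config
    client = remotefs.RemoteFsClient(
        'nfs', root_helper,
        execute=processutils.execute,
        nfs_mount_point_base=mount_point_base,
        nfs_mount_options=conf.glance_store.filesystem_nfs_mount_options)
    # mount() is a no-op when the share is already mounted and raises
    # ProcessExecutionError when the NFS server is unreachable.
    client.mount(share)
    return client.get_mount_point(share)

The store would call such a check at initialization and before each add/delete operation, and translate a mount failure into the behavior described above (disabling the store or returning HTTP 410).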

Glance still does not have the capability to check beforehand whether a particular NFS store has enough space to store a given image. It also cannot detect a network failure that occurs during an upload/import operation.

Data model impact

None

REST API impact

None

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

Need to configure healthcheck middleware for glance.

Developer impact

None

Implementation

Assignee(s)

Primary assignee:

abhishekk

Other contributors:

None

Work Items

  • Add enable_by_files healthcheck backend in oslo.middleware

  • Document how to configure enable_by_files healthcheck middleware

  • Unit/Functional tests for coverage

Dependencies

None

Testing

  • Unit Tests

  • Functional Tests

  • Tempest Tests

Documentation Impact

Need to document the new behavior of the filesystem driver when NFS and the healthcheck middleware are configured.

References