Improve filesystem store driver to utilize NFS capabilities¶
https://blueprints.launchpad.net/glance/+spec/improve-filesystem-driver
Problem description¶
The filesystem backend of Glance can be used with an NFS share mounted as a local filesystem, so no NFS-specific configuration is required on the Glance side. Glance does not know the NFS server address or the NFS share path at all; it simply assumes that each image is stored in the local filesystem. The downside of this assumption is that Glance is not aware of whether the NFS server is reachable or whether the NFS share is mounted, and it keeps performing add/delete operations on the local filesystem directory, which can later cause synchronization problems when the NFS share comes back online.
Use case: In a Kubernetes environment where OpenStack Glance is installed on top of OpenShift and the NFS share is mounted via the Volume/VolumeMount interface, the Glance pod won’t start if the NFS share isn’t ready. If the NFS share becomes unavailable after the Glance pod is already running, an upload operation fails with the following error:
sh-5.1$ openstack image create --container-format bare --disk-format raw --file /tmp/cirros-0.5.2-x86_64-disk.img cirros
ConflictException: 409: Client Error for url: https://glance-default-public-openstack.apps-crc.testing/v2/images/0ce1f894-5af7-44fa-987d-f4c47c77d0cf/file, Conflict
Even though the Glance Pod is still up, the liveness and readiness probes start failing and as a result the Glance Pods are marked as Unhealthy:
Normal Started 12m kubelet Started container glance-api
Warning Unhealthy 5m24s (x2 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s (x3 over 9m24s) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m24s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s (x2 over 9m24s) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 4m54s kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Later, once the failure threshold set for the Pod is reached, the kubelet marks the Pod as Failed and, given that the restart policy is supposed to recreate it, we can see the failure:
glance-default-single-0 0/3 CreateContainerError 4 (3m39s ago) 28m
$ oc describe pod glance-default-single-0 | tail
Normal Started 29m kubelet Started container glance-api
Warning Unhealthy 10m (x3 over 26m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 10m kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x4 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x5 over 26m) kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x2 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s (x3 over 22m) kubelet Readiness probe failed: Get "https://10.217.0.247:9292/healthcheck": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9m30s kubelet Liveness probe failed: Get "https://10.217.0.247:9292/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Failed 4m47s (x2 over 6m48s) kubelet Error: context deadline exceeded
Unlike other (non-Kubernetes) deployments, where the Glance service keeps running even when the NFS share is unavailable and keeps uploading to or deleting data from the local filesystem, in this case we can definitely say that the NFS share is not available: Glance won’t be able to upload any image to the filesystem local to the container, the Pod will be marked as failed, and it will fail to be recreated.
Proposed change¶
We are planning to add a new plugin, enable_by_files, to the healthcheck WSGI middleware in oslo.middleware, which can be used by all OpenStack components to check whether a desired path is present: if it is not, the middleware reports a 503 <REASON> error, otherwise 200 OK.
In Glance we can configure this healthcheck middleware as an application in glance-api-paste.ini:
[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = enable_by_files (optional, default: empty)
# used by the 'enable_by_files' backend
enable_by_file_paths = /var/lib/glance/images/filename,/var/lib/glance/cache/filename (optional, default: empty)
# Use this composite for keystone auth with caching and cache management
[composite:glance-api-keystone+cachemanagement]
paste.composite_factory = glance.api:root_app_factory
/: api-keystone+cachemanagement
/healthcheck: healthcheck
The middleware will return “200 OK” if everything is OK, or “503 <REASON>” if not, with the reason why this API should not be used.
“backends” will be the name of a stevedore extension in the namespace “oslo.middleware.healthcheck”.
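The oslo.middleware change is still under review, so the final implementation may differ. As a rough sketch, assuming the new backend follows the same plugin base class and result type as the existing disable_by_file backend, the enable_by_files extension could look roughly like this (the option name enable_by_file_paths matches the paste.ini example above):

import os

from oslo_middleware.healthcheck import pluginbase


class EnableByFilesHealthcheck(pluginbase.HealthcheckBaseExtension):
    """Sketch only: report the API as available while all paths exist."""

    def healthcheck(self, server_port):
        # 'enable_by_file_paths' is the option shown above; each entry is
        # expected to be a file living on the NFS share (e.g. a marker file).
        paths = self._conf_get('enable_by_file_paths') or []
        for path in paths:
            if not os.path.exists(path):
                # Missing path -> the share is most likely not mounted.
                return pluginbase.HealthcheckResult(
                    available=False,
                    reason='FILE PATH MISSING: %s' % path)
        return pluginbase.HealthcheckResult(available=True, reason='OK')

The backend would then be exposed as a stevedore entry point in the oslo.middleware.healthcheck namespace (for example, enable_by_files = ...:EnableByFilesHealthcheck in setup.cfg) so that it can be selected via the backends option.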
In Glance, if the local filesystem path is mounted on an NFS share, we propose to add a marker file named .glance to the NFS share and then use that file path to configure the enable_by_files healthcheck middleware plugin, as shown below:
[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = enable_by_files
enable_by_file_paths = /var/lib/glance/images/.glance
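The .glance marker file itself only needs to be created once by the deployer (or the deployment tooling) on the mounted share; a trivial illustration, using the same path as in the example above:

import pathlib

# Create the marker inside the NFS-backed image directory; this only
# succeeds while the share is actually mounted at /var/lib/glance/images.
pathlib.Path('/var/lib/glance/images/.glance').touch(exist_ok=True)

Because the marker lives on the share rather than on the local filesystem, it disappears from the path as soon as the share is unmounted, which is exactly the condition the enable_by_files check keys on.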
If NFS goes down, or the /healthcheck endpoint otherwise starts reporting 503 <REASON>, the admin can take appropriate actions to make the NFS share available again.
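For illustration, an external monitor (or a Kubernetes liveness/readiness probe) only needs to treat anything other than 200 as unhealthy; a minimal, hypothetical client using only the standard library (the URL is an assumption for this example):

import urllib.error
import urllib.request


def glance_is_healthy(url='http://127.0.0.1:9292/healthcheck'):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as exc:
        # A 503 <REASON> response ends up here; the body carries the reason.
        print('healthcheck failed: %s %s' % (exc.code, exc.read().decode()))
        return False
    except urllib.error.URLError:
        # Glance is not reachable at all.
        return False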
Alternatives¶
Introduce a few configuration options for the filesystem driver which will help detect whether the NFS share has been unmounted from underneath the Glance service. We propose to introduce the following new configuration options for this purpose:
filesystem_is_nfs_configured - boolean, indicates whether NFS is configured for the filesystem store
filesystem_nfs_host - IP address of the NFS server
filesystem_nfs_share_path - NFS share path mapped to the local filesystem mount point
filesystem_nfs_mount_options - Mount options to be passed to the NFS client
rootwrap_config - Rootwrap configuration used to run commands as the root user
If filesystem_is_nfs_configured is set, i.e. NFS is configured, then the deployer must specify the filesystem_nfs_host and filesystem_nfs_share_path config options in glance-api.conf; otherwise the respective glance store will be disabled and will not be used for any operation.
We are planning to use the existing os-brick library (already used by the cinder driver of glance_store) to create the NFS client with the help of the above configuration options, and to check whether the NFS share is available during service initialization as well as before each image upload/import/delete operation. If the NFS share is not available during service initialization, add and delete operations will be disabled; if NFS goes down afterwards, we will return an HTTP 410 (Gone) response to the user.
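Since this alternative was not selected, there is no reference implementation; the following sketch only illustrates how the filesystem driver might use os-brick’s RemoteFsClient together with the proposed configuration options. The option names from the list above do not exist in glance_store today, the helper functions are purely illustrative, and the existing filesystem_store_datadir option is assumed to serve as the mount point base:

import os

from os_brick.remotefs import remotefs


def build_nfs_client(conf):
    # Mirrors the rootwrap usage of the cinder driver of glance_store.
    root_helper = 'sudo glance-rootwrap %s' % conf.rootwrap_config
    return remotefs.RemoteFsClient(
        'nfs', root_helper,
        # Base directory under which os-brick creates the actual mount point.
        nfs_mount_point_base=conf.filesystem_store_datadir,
        nfs_mount_options=conf.filesystem_nfs_mount_options)


def nfs_share_available(client, conf):
    # Would be called at service init and before add/delete operations.
    share = '%s:%s' % (conf.filesystem_nfs_host,
                       conf.filesystem_nfs_share_path)
    try:
        client.mount(share)  # returns without action if already mounted
    except Exception:
        return False
    return os.path.ismount(client.get_mount_point(share))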
Glance still does not have the capability to check beforehand whether a particular NFS store has enough capacity to store a given image. It also cannot detect a network failure that occurs during an upload/import operation.
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None
Performance Impact¶
None
Other deployer impact¶
Need to configure the healthcheck middleware for Glance.
Developer impact¶
None
Implementation¶
Assignee(s)¶
- Primary assignee:
abhishekk
- Other contributors:
None
Work Items¶
Add enable_by_files healthcheck backend in oslo.middleware
Document how to configure enable_by_files healthcheck middleware
Unit/Functional tests for coverage
Dependencies¶
None
Testing¶
Unit Tests
Functional Tests
Tempest Tests
Documentation Impact¶
Need to document the new behavior of the filesystem driver when NFS and the healthcheck middleware are configured.
References¶
Oslo.Middleware Implementation - https://review.opendev.org/920055