NVMe Connector Healing Agent¶
https://blueprints.launchpad.net/cinder/+spec/nvmeof-client-raid-healing-agent
Daemon that monitors NVMe connections and MDRAID arrays created by the NVMe connector, identifies faulted volume replicas, requests new replicas and replaces faulted replicas with new ones.
Note
This spec has been superseded by NVMeoF Connection Agent.
Problem description¶
When the NVMe connector connects a replicated volume, OpenStack sees it as a single volume and has no way of monitoring, managing, or healing the replicas in the underlying MDRAID arrays. This agent will take care of that.
It will monitor the state of the MDRAID arrays and reconcile their physical state on the host with expected state from the volume provisioner, replacing broken legs.
For backend volume replicas, it’s the storage array that takes care of monitoring and replacing unhealthy replicas.
NVMe MDRAID moves the data replication responsibility from the backend to the consumer.
Currently there’s no mechanism to monitor and heal these replicated volumes.
We cannot do it on the Cinder side, because even if the Cinder driver detected the issue and created a replacement volume, we have no mechanism to report the connection information of the replacement volume to the consumer.
So the monitoring and healing needs to be on the volume consumer side.
This agent will also be greatly beneficial in scenarios where certain replicas of an attached replicated volume go faulty: by notifying the volume provisioner of the faulty devices, they can be marked as faulty to avoid using stale data on re-attachment and to replace them entirely.
Use Cases¶
When working with replicated NVMe volumes that are attached to an instance for a long time, one of the replicas may go faulty. This agent will detect it and attempt to replace it (self-heal the MDRAID array without the need to detach and re-attach the volume).
Proposed change¶
Add an “NVMe agent” class that will be initialized by the NVMe connector during volume connection on a host.
Initializing this agent will spawn a monitoring task which will repeat periodically. We are proposing this to be a native thread if possible, but if necessary it can be an independent process.
The first proposal was to use the Python event scheduler (sched.scheduler), but other alternatives, such as spawning a separate process communicated with over a socket, may be chosen instead. One key problem that this selection needs to address is the scenario where the compute service goes down while the VMs continue operating (and their volumes remain attached) - we don't want to lose the agent in that case.
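As a point of reference, below is a minimal sketch of the periodic monitoring task as a native daemon thread. The class and method names (NVMeAgent, _monitor_once) and the 30 second interval are illustrative assumptions, not an existing os-brick API:

    import threading

    class NVMeAgent(object):
        """Illustrative periodic monitor spawned by the NVMe connector."""

        def __init__(self, interval=30):
            self._interval = interval
            self._stop = threading.Event()
            self._thread = None

        def start(self):
            # Spawn the monitoring task once; subsequent calls are no-ops.
            if self._thread and self._thread.is_alive():
                return
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

        def _run(self):
            while not self._stop.is_set():
                try:
                    self._monitor_once()  # check NVMe devices and MDRAID arrays
                except Exception:
                    pass  # a real implementation would log and keep going
                self._stop.wait(self._interval)

        def _monitor_once(self):
            pass  # placeholder for the monitoring logic described below

Note that a daemon thread dies with its parent process, which is exactly the concern raised above about the compute service going down; a separate process communicating over a socket would survive that.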
When initialized, the agent will read access information for the volume provisioner from a pre-determined config file location. The file has a vendor-specific format, and its content should be provided there by the systems operator.
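Purely for illustration, and assuming a JSON file at a hypothetical /etc/nvme-agent/provisioner.conf (both the path and the keys are placeholders; the real format is vendor specific):

    import json

    CONF_PATH = '/etc/nvme-agent/provisioner.conf'  # hypothetical location

    def load_provisioner_access(path=CONF_PATH):
        # Vendor-specific content, e.g. management endpoint and credentials.
        with open(path) as f:
            return json.load(f)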
The task will monitor NVMe devices and the MDRAID arrays built over them (a detection sketch is shown below).
It will know which NVMe devices and MDRAID arrays to monitor based on metadata from the volume provisioner (backend) - which it will have a custom interface to.
It will notify volume provisioner if necessary of failed devices.
It will attempt to connect to new NVMe devices / replicas, replacing them in the MDRAID.
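One possible way to detect faulted MDRAID members is to parse /proc/mdstat, where failed legs are flagged with an "(F)" suffix. This is a simplified sketch; a real implementation could also inspect mdadm --detail output:

    import re

    def find_faulty_members(mdstat_path='/proc/mdstat'):
        """Return {md array name: [faulty member device names]}."""
        faulty = {}
        with open(mdstat_path) as f:
            for line in f:
                m = re.match(r'^(md\d+)\s*:\s*(.*)$', line)
                if not m:
                    continue
                md_name, members = m.groups()
                # Member entries look like "nvme1n1[2](F)" when the leg failed.
                bad = re.findall(r'(\S+?)\[\d+\]\(F\)', members)
                if bad:
                    faulty[md_name] = bad
        return faulty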
Typical self-healing flow (a rough sketch in Python follows the list):
volume replica goes faulty
agent notices faulty replica, reports to provisioner
provisioner marks replica as bad (so it won't be used later unless synced)
agent keeps pulling volume information from provisioner
a certain grace period passes with no state change of the faulty replica seen from the provisioner, so the agent sends an explicit request to replace the replica
provisioner replaces replica and updates volume information
agent pulls volume replica information, notices a replica has changed
agent carries out replica replacement
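A rough Python sketch of one reconciliation pass over this flow is shown below. All names (provisioner, replica attributes, agent helpers) are placeholders for the vendor interface and connector methods, not existing APIs:

    def heal_cycle(agent, provisioner, grace_period):
        # Steps 1-2: detect faulty replicas and report them to the provisioner.
        for md_name, devices in agent.find_faulty_members().items():
            for dev in devices:
                provisioner.report_faulty(md_name, dev)

        # Steps 4-8: keep pulling volume information and reconcile changes.
        for volume in provisioner.get_volumes():
            for replica in volume.replicas:
                if replica.is_faulty and replica.faulty_for > grace_period:
                    # No state change within the grace period: ask for a new one.
                    provisioner.request_replacement(volume, replica)
                elif replica.is_new:
                    # A replacement replica appeared: connect it and swap it in.
                    agent.connect_nvme(replica)
                    agent.replace_in_mdraid(volume.md_name, replica)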
Alternatives¶
The operator could use their own script to monitor connections and fix them manually
Data model impact¶
None
REST API impact¶
None
Security impact¶
The agent will call NVMe connector methods that do sudo executions of nvme and mdadm. This will happen in the new agent task that will be spawned from os-brick.
Active/Active HA impact¶
None
Notifications impact¶
None
Other end user impact¶
None
Performance Impact¶
None
Other deployer impact¶
None
Developer impact¶
To allow multiple vendor implementations, the specific methods / logic for:
probing the volume provisioner
pulling / parsing volume metadata from provisioner
reporting volume state changes to provisioner
requesting provisioner to replace replica
will need to be implemented on a per-vendor basis.
The architecture is such that the agent will be a generic class that provides the interface, and the Kioxia implementation will be the first example of a vendor-specific implementation.
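A minimal sketch of what that generic interface could look like; the method names mirror the list above but are illustrative rather than a settled API:

    import abc

    class VolumeProvisionerInterface(abc.ABC):
        """Per-vendor integration points used by the healing agent."""

        @abc.abstractmethod
        def probe(self):
            """Check that the volume provisioner is reachable."""

        @abc.abstractmethod
        def get_volume_metadata(self, volume_id):
            """Pull and parse replicated volume metadata from the provisioner."""

        @abc.abstractmethod
        def report_replica_state(self, volume_id, replica_id, state):
            """Report a replica state change (e.g. faulty) to the provisioner."""

        @abc.abstractmethod
        def request_replica_replacement(self, volume_id, replica_id):
            """Ask the provisioner to replace a faulted replica."""

A Kioxia class would then subclass this interface and implement each method against that vendor's management API.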
Implementation¶
Assignee(s)¶
- Zohar Mamedov
zoharm
Work Items¶
The NVMe connector will launch the monitoring task on connect_volume if it is not already running.
Task monitors NVMe devices and MDRAID arrays created by the connector.
When a replica goes faulty (as well as on other events such as disconnects), call the interface method for notifying the volume provisioner.
When replicated volume devices are changed by the volume provisioner, reconcile the physical state of NVMe devices and MDRAID arrays on the host (a sketch of the mdadm steps follows below).
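As an illustration of the last two items, replacing a faulted leg boils down to failing and removing the old member and adding the newly connected NVMe device to the array. The sketch below shells out to mdadm directly for brevity; the real agent would go through the NVMe connector's privileged execution instead:

    import subprocess

    def replace_mdraid_member(md_dev, old_dev, new_dev):
        """Swap a faulted member for a freshly connected replica (illustrative)."""
        subprocess.check_call(['mdadm', md_dev, '--fail', old_dev])
        subprocess.check_call(['mdadm', md_dev, '--remove', old_dev])
        subprocess.check_call(['mdadm', md_dev, '--add', new_dev])  # triggers resync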
Dependencies¶
None
Testing¶
We should be able to accept this with just unit tests.
Documentation Impact¶
Document that using the NVMe connector with replicated volumes will optionally launch this agent.
References¶
Architectural diagram https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png