NVMe Connector Healing Agent¶
Daemon that monitors NVMe connections and MDRAID arrays created by the NVMe connector, identifies faulted volume replicas, requests new replicas and replaces faulted replicas with new ones.
This spec has been superseded by NVMeoF Connection Agent.
When the NVMe connector connects a replicated volume, OpenStack sees it as one volume and has no way of monitoring, managing, and healing the replicas in these MDRAID arrays. This agent will take care of that.
It will monitor the state of the MDRAID arrays and reconcile their physical state on the host with the expected state from the volume provisioner, replacing broken legs.
For backend volume replicas, it’s the storage array that takes care of monitoring and replacing unhealthy replicas.
NVMe MDRAID moves the data replication responsibility from the backend to the consumer.
Currently there’s no mechanism to monitor and heal these replicated volumes.
We cannot do it on the Cinder side, because even if the Cinder driver detected the issue and created a replacing volume, we have no mechanism to report the connection information of the replacing volume to the consumer.
So the monitoring and healing needs to be on the volume consumer side.
This agent will also be greatly beneficial when some replicas of an attached replicated volume go faulty. By notifying the volume provisioner of the faulty devices, it allows them to be marked as faulty, which avoids using stale data on re-attachment and lets them be replaced entirely.
When working with replicated NVMe volumes that are attached to an instance for a long time, one of the replicas may go faulty. This agent will detect it and attempt to replace it (self heal the MDRAID, without the need to detach and re-attach the volume).
Add an “NVMe agent” class that will be initialized by the NVMe connector during volume connection on a host.
Initializing this agent will spawn a monitoring task which will repeat periodically. We are proposing this to be a native thread if possible, but if necessary it can be an independent process.
The first proposal was to use Python’s event scheduler, sched.scheduler, but alternatives, such as spawning a separate process that is communicated with over a socket, may be chosen instead. One key problem that this choice needs to address is the scenario where the compute service goes down while the VMs continue operating (and their volumes remain attached): we don’t want to lose the agent in this case.
When initialized, the agent will read the access information for the volume provisioner from a pre-determined config file location. The file has a vendor-specific format, and its content should be provided by the system operator.
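As a rough illustration of the sched.scheduler proposal, the sketch below shows a task that re-arms itself after every run. The class name `PeriodicTask`, its parameters, and the `max_runs` limit are assumptions made for this example and are not part of the spec.

```python
import sched
import time


class PeriodicTask:
    """Hypothetical helper: runs `action` every `interval` seconds."""

    def __init__(self, interval, action):
        self.interval = interval
        self.action = action
        # time.monotonic avoids jumps when the wall clock is adjusted.
        self.scheduler = sched.scheduler(time.monotonic, time.sleep)
        self.remaining = None

    def _run(self):
        self.action()
        if self.remaining is not None:
            self.remaining -= 1
            if self.remaining <= 0:
                return  # do not re-arm; scheduler.run() will return
        # Re-arm: schedule the next run `interval` seconds from now.
        self.scheduler.enter(self.interval, 1, self._run)

    def start(self, max_runs=None):
        """Blocks until the queue is empty (forever if max_runs is None)."""
        self.remaining = max_runs
        self.scheduler.enter(0, 1, self._run)
        self.scheduler.run()
```

In a native-thread variant, `start` could be passed as the target of a `threading.Thread`. Such a thread would stop together with the process that hosts it, which is the concern raised above about the compute service going down.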
The task will monitor NVMe devices and MDRAID arrays built over them.
It will know which NVMe devices and MDRAID arrays to monitor based on metadata from the volume provisioner (backend), to which it will have a custom interface.
It will notify the volume provisioner of failed devices when necessary.
It will attempt to connect to new NVMe devices / replicas, replacing them in the MDRAID.
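One possible way to detect faulty MDRAID members is to parse text in the format of `/proc/mdstat`, where the kernel marks failed members with `(F)`. This is an assumption made for illustration: the spec does not say how the state is read, and the agent may instead rely on the connector's `nvme` and `mdadm` calls. The function name `parse_mdstat` is hypothetical, and the sketch only handles the common `mdX : <state> ...` line shape.

```python
import re

# Array line, e.g. "md0 : active raid1 nvme1n1[1] nvme0n1[0](F)"
MD_LINE = re.compile(r"^(md\S+) : (\S+) (.*)$")
# Member token, e.g. "nvme0n1[0](F)": device name, slot, optional flags.
MEMBER = re.compile(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)")


def parse_mdstat(text):
    """Return {array: {"state", "healthy", "faulty"}} from mdstat-style text."""
    arrays = {}
    for line in text.splitlines():
        match = MD_LINE.match(line)
        if not match:
            continue
        name, state, rest = match.groups()
        healthy, faulty = [], []
        for device, flags in MEMBER.findall(rest):
            (faulty if "(F)" in flags else healthy).append(device)
        arrays[name] = {
            "state": state,
            "healthy": sorted(healthy),
            "faulty": sorted(faulty),
        }
    return arrays
```

On a host, the input would come from reading `/proc/mdstat`; passing the text in as a string keeps the parsing logic testable without real devices.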
Typical self healing flow:
volume replica goes faulty
agent notices faulty replica, reports to provisioner
provisioner marks replica as bad (so it won’t be used later unless synced)
agent keeps pulling volume information from provisioner
certain grace period passes, agent sees no state changes of faulty replica from provisioner, so it sends explicit request to replace replica
provisioner replaces replica and updates volume information
agent pulls volume replica information, notices a replica has changed
agent carries out replica replacement
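The flow above can be sketched as a small decision function that the monitoring task would call on every poll. The class `ReplicaTracker`, the action names, and the `grace_period` parameter are all hypothetical; the spec does not define them or fix a grace period value.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaTracker:
    """Hypothetical per-agent bookkeeping for faulty replicas."""

    grace_period: float  # seconds to wait before asking for a replacement
    first_seen_faulty: dict = field(default_factory=dict)
    requested: set = field(default_factory=set)

    def _forget(self, replica_id):
        self.first_seen_faulty.pop(replica_id, None)
        self.requested.discard(replica_id)

    def next_action(self, replica_id, locally_faulty, provisioner_replaced, now):
        """Decide what the agent should do for one replica on this poll."""
        if provisioner_replaced:
            # Provisioner metadata shows a new replica: swap it into the array.
            self._forget(replica_id)
            return "replace_in_mdraid"
        if not locally_faulty:
            self._forget(replica_id)
            return "none"
        if replica_id not in self.first_seen_faulty:
            # First sighting: report it so the provisioner can mark it bad.
            self.first_seen_faulty[replica_id] = now
            return "report_faulty"
        overdue = now - self.first_seen_faulty[replica_id] >= self.grace_period
        if overdue and replica_id not in self.requested:
            # No state change within the grace period: ask explicitly, once.
            self.requested.add(replica_id)
            return "request_replacement"
        return "wait"
```

Keeping the decision separate from the side effects (calling the provisioner, running `mdadm`) is a design choice of this sketch; it makes the flow testable with plain timestamps.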
Operators could use their own scripts to monitor connections and fix them manually.
Data model impact¶
REST API impact¶
The agent will call NVMe connector methods that execute nvme and mdadm with sudo. This will happen in the new agent task that will be spawned from os-brick.
Active/Active HA impact¶
Other end user impact¶
Other deployer impact¶
To allow multiple vendor implementations, the specific methods / logic for:
probing the volume provisioner
pulling / parsing volume metadata from provisioner
reporting volume state changes to provisioner
requesting provisioner to replace replica
will need to be implemented on a per-vendor basis.
The architecture is such that the agent will be a generic class that provides the interface, and the Kioxia implementation will be the first example of a vendor-specific implementation.
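A minimal sketch of what such a generic interface might look like, using Python's `abc` module. The class and method names are hypothetical and were chosen to mirror the four vendor-specific responsibilities listed above; `InMemoryProvisioner` is a stand-in for illustration only and does not reflect any real vendor API.

```python
import abc


class ProvisionerClient(abc.ABC):
    """Hypothetical vendor-neutral interface to the volume provisioner."""

    @abc.abstractmethod
    def probe(self):
        """Return True if the provisioner is reachable."""

    @abc.abstractmethod
    def get_volume_metadata(self, volume_id):
        """Return parsed volume metadata, including its replicas."""

    @abc.abstractmethod
    def report_replica_state(self, volume_id, replica_id, state):
        """Tell the provisioner that a replica changed state."""

    @abc.abstractmethod
    def request_replica_replacement(self, volume_id, replica_id):
        """Ask the provisioner to replace a replica."""


class InMemoryProvisioner(ProvisionerClient):
    """Toy implementation backed by a dict, for illustration and tests."""

    def __init__(self, volumes):
        self._volumes = volumes

    def probe(self):
        return True

    def get_volume_metadata(self, volume_id):
        return self._volumes[volume_id]

    def report_replica_state(self, volume_id, replica_id, state):
        self._volumes[volume_id]["replicas"][replica_id] = state

    def request_replica_replacement(self, volume_id, replica_id):
        replicas = self._volumes[volume_id]["replicas"]
        del replicas[replica_id]
        replicas[replica_id + "-new"] = "ok"
```

With this shape, the generic agent code would depend only on `ProvisionerClient`, and each vendor would supply its own subclass.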
- Zohar Mamedov
NVMe connector will launch monitoring task on connect_volume if not running.
Task monitors NVMe devices and MDRAID arrays created by the connector.
When a replica goes faulty (or on other events such as disconnects), call the interface method for notifying the volume provisioner.
When replicated volume devices are changed by the volume provisioner, reconcile the physical state of NVMe devices and MDRAID arrays on the host.
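The reconcile step in the last work item can be thought of as a set difference between what the provisioner expects and what is present on the host. The sketch below only computes the plan; the function name `plan_reconcile` and the dict-of-replica-ids input shape are assumptions for illustration, and carrying out the plan (connecting NVMe devices, changing array members) is left out.

```python
def plan_reconcile(expected, actual):
    """Compare expected and actual replicas per MDRAID array.

    Both arguments map an array name to an iterable of replica ids.
    Returns only the arrays that need changes.
    """
    plan = {}
    for array in sorted(set(expected) | set(actual)):
        want = set(expected.get(array, ()))
        have = set(actual.get(array, ()))
        add = sorted(want - have)      # replicas to connect and add
        remove = sorted(have - want)   # replicas to remove from the array
        if add or remove:
            plan[array] = {"add": add, "remove": remove}
    return plan
```

For example, if the provisioner lists replicas r1 and r3 for md0 while the host array still holds r1 and r2, the plan for md0 is to add r3 and remove r2.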
We should be able to accept this with just unit tests.
Document that using NVMe connector with replicated volumes will optionally launch this agent.
Architectural diagram https://wiki.openstack.org/wiki/File:Nvme-of-add-client-raid1-detail.png