Currently the Nailgun volume manager is not flexible or customizable enough to address many user needs. For example, some users want certain volumes to remain untouched during OS provisioning, while others want to deploy software RAID or configure file system mount options.
There are use cases which aren’t covered by the functionality of the current volume manager implementation.
These use cases include at least the following:
Sometimes, when a node is going to be re-provisioned, there are volumes (partitions, logical volumes, MD devices) which the user wants to remain untouched.
FS mount options
Sometimes a user needs to mount file systems with specific options, such as noatime or ro.
Currently we install the bootloader on all hard drives, but this does not always correspond to what the user wants.
Flexible partitioning scheme
Currently we have a predefined partition scheme which assumes, for example, that the root file system is placed on a logical volume and that no separate file system is created for /var. These assumptions limit users in their ability to create the partition scheme they want.
Pluggable partitioning scheme
Some Fuel plugins assume that additional partitions are needed on a node. Currently, plugin partitions can conflict with existing partitions, and we need to resolve these potential conflicts.
Provisioning process in general can be considered as the following set of steps:
+---------+    +----------+    +-----------+    +-----------+
|         |    |          |    |           |    |           |
|discovery+--> |allocation+--> |OS building+--> |OS copying |
|         |    |          |    |           |    |           |
+---------+    +----------+    +-----------+    +-----------+
All those main steps can be implemented in a monolithic manner or they can be a set of separable modules/plugins/extensions.
Discovery is the step where we find out which hard drives are available on a node. Anaconda and debian-installer do the same at the very beginning of the provisioning process. In our case this step is implemented as a separate service called Nailgun Agent.
Allocation is the step where the default partitioning scheme is generated. It can be data driven: for example, a user of a provisioning agent defines which file systems she needs and their priorities, but not their exact sizes. Anaconda and debian-installer do the same using default hard-coded or user-defined (kickstart/preseed) partitioning metadata. In our case it is implemented as the volume manager module in Nailgun.
OS building is the step where the OS is built from scratch using package repositories or any other available mechanism. Anaconda builds the OS using rpm packages and yum; debian-installer uses deb packages and debootstrap. In terms of Fuel, this step is exactly what we call OS image building. In contrast to anaconda and debian-installer, we build the OS just once, either on the master node or on a developer node during ISO building, and then copy this pre-built OS image to all provisioned nodes. This step indirectly depends on step 2 (allocation), because a user might want to assign specific options to a particular file system, and allocation is exactly where we define which partitions and file systems we need. OS building (or, equivalently, OS image building), being implemented in the scope of Fuel Agent, can potentially be run on the slave node if, for example, this node requires specific file system options.
OS copying makes sense only for the image-based approach, when the OS is built remotely. Anaconda and debian-installer, by contrast, build the OS right on the file system where it is going to live on a node.
Anaconda and debian-installer implement these four steps in a monolithic manner. For example, the OS building step cannot be separated from the rest of the provisioning process. In Fuel, all these steps are implemented as separate components. Currently, Fuel Agent implements steps 3 and 4, but it looks like the right place to implement steps 1 and 2 as well. This spec does not concern step 1: re-implementing the functionality of Nailgun Agent in the scope of Fuel Agent is a matter for a separate feature. This spec is entirely about step 2.
The suggestion is to implement dynamic volume allocation on the Volume Manager side, keeping as much code as possible in Fuel Agent and reusing it. The motivation behind this is:
On the other hand, we are moving towards a modular Fuel architecture, so this looks like the place where we can start putting effort into modularisation. The suggestion is to implement the current volume manager in Nailgun as an extension. Once installed, this volume manager extension imports Fuel Agent code in order to generate the volume allocation (metadata/UI driven). The default volume allocation should be configurable via allocation metadata. A user can then modify this default allocation on the disk management tab in the UI. If other extensions (ceph, mongo, etc.) need to modify the volume allocation scheme, they must do so through the volume manager extension, interacting with it only via its API.
So, the feature can be considered as two independent tasks:
The coverage scheme then will be as follows:
+-------------------------+    +----------------------------+
|Nailgun & vol. extension |    |         Fuel Agent         |
+-------------------------+    +----------------------------+

+---------+    +----------+    +-----------+    +-----------+
|         |    |          |    |           |    |           |
|discovery+--> |allocation+--> |OS building+--> |OS copying |
|         |    |          |    |           |    |           |
+---------+    +----------+    +-----------+    +-----------+
A more detailed scheme of how it will work:
         VolumeManager
    +--------------------+
    | +----------------+ |
    | |    objects     | |
    | |(from fuel_agent| |
    | | import objects)| |
    | +----------------+ |
    |                    |        +-----------+
    |   +------------+   +--->    |           |
    |   | new volumes|   |        |  nailgun  |
    |   | allocation |   <---+    |           |
    |   | algorithm  |   |        +-----------+
    |   +------------+   |
    +---------+----------+
              |
              |    +---------------+
              |    |   serialize   |
              |    |  ready to use |
              |    |PartitionScheme|
              |    +---------------+
              |
              | fuel_agent
    +---------v-------------------------+
    | +---------+  +----------------+   |
    | | objects |  | partitioning   |   |
    | +---------+  | provisioning   |   |
    |              +----------------+   |
    |                                   |
    | +-------------------------------+ |
    | |        NEW DataDriver         | |
    | |     (deserialize obtained     | |
    | |        PartitionScheme)       | |
    | +-------------------------------+ |
    +-----------------------------------+
The new volume allocation algorithm will be implemented first in terms of Fuel Agent and then used (imported but not moved) in the volume manager.
Dynamic allocation metadata could look like this (the exact format will be determined during actual implementation):
- id: 1
  type: "fs"
  mount: "/boot"
  device_id: 9
  fs_type: "ext2"

- id: 2
  type: "fs"
  mount: "/"
  device_id: 5
  fs_type: "ext4"

- id: 3
  type: "fs"
  mount: "swap"
  device_id: 6
  fs_type: "swap"

- id: 4
  type: "fs"
  device_id: 7
  mount: "/var/lib/mysql"
  fs_type: "ext4"
  block_size: "4K"

- id: 5
  type: "lv"
  vg_id: 8
  name: "root"
  minsize: "10G"
  bestsize: "15G"
  priority: 1000

- id: 6
  type: "lv"
  vg_id: 8
  minsize: "1G"
  maxsize: "8G"
  priority: 200
  name: "swap"

- id: 7
  type: "partition"
  minsize: "20G"
  device_id: __auto__

- id: 8
  type: "vg"
  name: "os"
  minsize: __auto__
  pvs_id: __auto__

- id: 9
  type: "md"
  level: "mirror"
  minsize: "200M"
  maxsize: "400M"
  bestsize: "200M"
  numactive: 2
  numspares: 1
  devices_id: __auto__
  spares_id: __auto__
The format of this metadata should be as close as possible to the format of Fuel Agent objects, which makes it easier to serialize and deserialize objects.
Let’s go through these metadata step by step.
Each item has an id field which is used to connect objects wherever they need to be connected, while avoiding non-trivial data hierarchies. However, id is used only for a serialized set of objects. In a set of Python objects, device_id becomes just device, a Python reference to the object. For the sake of readability, id can be an integer or a string. Python objects are identified by their contents. For example, there cannot be two file systems with the same mount point on a node, so the mount point can be considered a unique identifier for the file system object. Logical volumes are identified by the combination of volume group name and logical volume name.
The fact that the metadata is flat makes it easy to extend. Any plugin/extension can append or remove items. For example, the following item means we need to allocate an ext2 file system with the /boot mount point on the device with id equal to 10.
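The relation between serialized ids and Python references can be illustrated with a minimal sketch. The function name link_items and the rule "any field ending in _id names another item" are assumptions made for this example only; they are not part of Fuel Agent.

```python
def link_items(items):
    """Replace *_id fields with direct references to the items they name.

    Hypothetical helper: values that do not match any known id
    (for example __auto__) are left untouched.
    """
    by_id = {item["id"]: dict(item) for item in items}
    for item in by_id.values():
        for key in list(item):
            if key.endswith("_id") and item[key] in by_id:
                # "device_id: 9" becomes "device: <item with id 9>"
                item[key[:-3]] = by_id[item.pop(key)]
    return by_id


items = [
    {"id": 1, "type": "fs", "mount": "/boot", "device_id": 9, "fs_type": "ext2"},
    {"id": 9, "type": "md", "level": "mirror", "devices_id": "__auto__"},
]
linked = link_items(items)
```

After linking, linked[1]["device"] is the same Python object as linked[9], and the device_id field is gone.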
- id: 1
  type: "fs"
  mount: "/boot"
  device_id: 10
  fs_type: "ext2"

- id: 5
  type: "lv"
  vg_id: 8
  name: "root"
  minsize: "10G"
  bestsize: "15G"
  maxsize: "50G"
  priority: 1000
The fields minsize, maxsize and bestsize are used to set limits and give recommendations about the size of the logical volume. The field priority is going to be used for sharing the volume group space over all logical volumes in this group. The priority is used as the weight of a particular volume. For example, if two volumes are given and we need to share the whole space between these two volumes, we can use the following algorithm:
space_1 = total_space * priority_1 / (priority_1 + priority_2)
space_2 = total_space * priority_2 / (priority_1 + priority_2)
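The same weighting generalizes to any number of volumes. The following is a minimal sketch; the function name share_by_priority and the use of plain numbers (for instance GiB) for sizes are assumptions made for illustration.

```python
def share_by_priority(total_space, priorities):
    """Split total_space between volumes in proportion to their priorities."""
    total_weight = sum(priorities.values())
    return {
        name: total_space * weight / total_weight
        for name, weight in priorities.items()
    }


# Priorities taken from the "root" and "swap" logical volumes above.
shares = share_by_priority(120, {"root": 1000, "swap": 200})
print(shares)  # {'root': 100.0, 'swap': 20.0}
```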
Allocation algorithm for logical volumes should look like the following:
- Allocating minimal size for each logical volume (fail if there is not enough space)
- Allocating remaining space up to recommended size for each logical volume taking into account their priorities
- Allocating remaining space up to maximal size for each logical volume taking into account their priorities. If maximal size is not set, we assume there is no such limit.
Those size limitation/recommendation/priority fields are optional. If they are not set we can use some default (0) priority and allocate remaining space for the logical volume taking into account this default priority value.
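The three steps above can be sketched as follows. This is only an illustration under several assumptions that the spec does not fix: sizes are plain numbers in one unit, a volume without bestsize does not grow during the second step, volumes whose priorities are all 0 share space equally, and the names allocate and _fill are hypothetical.

```python
def _fill(sizes, caps, weights, free):
    """Share `free` among volumes by weight, never exceeding `caps`.

    Returns the space that is still unallocated afterwards.
    """
    active = [n for n in sizes if sizes[n] < caps[n]]
    while free > 1e-9 and active:
        total_weight = sum(weights[n] for n in active)
        spent = 0.0
        for n in active:
            if total_weight:
                portion = free * weights[n] / total_weight
            else:
                # Assumption: all priorities are 0, so share equally.
                portion = free / len(active)
            room = caps[n] - sizes[n]
            if portion >= room:
                sizes[n] = caps[n]
                spent += room
            else:
                sizes[n] += portion
                spent += portion
        free -= spent
        # Volumes that reached their cap drop out; leftovers are re-shared.
        active = [n for n in active if sizes[n] < caps[n]]
    return free


def allocate(total, volumes):
    inf = float("inf")
    weights = {v["name"]: v.get("priority", 0) for v in volumes}
    # Step 1: minimal sizes, fail if they do not fit.
    sizes = {v["name"]: float(v.get("minsize", 0)) for v in volumes}
    free = total - sum(sizes.values())
    if free < 0:
        raise ValueError("not enough space for minimal sizes")
    # Step 2: grow towards the recommended size.
    maxes = {v["name"]: v.get("maxsize", inf) for v in volumes}
    bests = {
        v["name"]: min(v.get("bestsize", sizes[v["name"]]), maxes[v["name"]])
        for v in volumes
    }
    free = _fill(sizes, bests, weights, free)
    # Step 3: grow towards the maximal size (no limit if maxsize is unset).
    _fill(sizes, maxes, weights, free)
    return sizes


volumes = [
    {"name": "root", "minsize": 10, "bestsize": 15, "priority": 1000},
    {"name": "swap", "minsize": 1, "maxsize": 8, "priority": 200},
]
result = allocate(28, volumes)
```

With 28 units in the volume group, root first gets 10 and swap 1, root then grows to its recommended 15, and the remaining 12 units are split 10 to 2 according to the priorities.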
Ideally, the dynamic allocation process should take into account many other parameters apart from the size of a volume. For example, we should avoid combining SSD and HDD disks in one volume group. Another example is that the file system block size needs to be set according to the type of hard drive, otherwise we can encounter serious performance issues. But due to the tight deadline for 7.0, let’s implement ONLY size-driven allocation. Other metadata can easily be introduced later.
Another important point is that Fuel Agent objects are currently often initialized with actual block device names (e.g. /dev/sda). With dynamic allocation, however, the actual device names are unknown when an object is instantiated: an actual block device name makes sense no earlier than when the parted command is run. The correct way to deal with this is to modify the objects so that actual device evaluation can be postponed (e.g. fuel_agent/objects/device.py:Loop). A partition scheme should not contain names like /dev/sda3 until it is evaluated and actualized.
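One possible shape for such postponed evaluation is sketched below. The class DeferredPartition and its actualize method are hypothetical and do not exist in Fuel Agent; the naming rule (disk name followed by partition number) is a simplification used only for this example.

```python
class DeferredPartition:
    """A partition whose block device name is unknown until actualized."""

    def __init__(self, minsize):
        self.minsize = minsize
        self._name = None  # not known at allocation time

    def actualize(self, disk, number):
        # Called once the partition table has actually been written.
        self._name = "{}{}".format(disk, number)

    @property
    def name(self):
        if self._name is None:
            raise RuntimeError("device name is not evaluated yet")
        return self._name


part = DeferredPartition(minsize="20G")
part.actualize("/dev/sda", 3)
```

Before actualize is called, reading part.name raises an error, so a half-evaluated partition scheme cannot silently leak a wrong device name.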
Several named sets of volume items (like those which are outlined above) can be defined and then these sets can be combined so as to define other sets. When a set defines another set as its element, then this element should be treated as a subset rather than an element. So, the resulting set is to remain flat. In the example below, Set_3 is a set of elements: Item_1, Item_2, Item_4.
Set_1:
  - Item_1
  - Item_2
Set_2:
  - Item_3
Set_3:
  - Set_1
  - Item_4

As mentioned above, every volume item must have an id field. This field is only used to connect items with each other inside a set. When a set includes another set as its subset, the ids of its own items should not coincide with those in the subset; otherwise, items with the same id will override those in the subset. This can be used deliberately to override one or more items of the subset.
Set_1:
  - id: 1
    type: "fs"
    ...
  - id: 2
    type: "partition"
    ...
Set_2:
  - Set_1
  - id: 2
    type: "lv"
    ...
  - id: 3
    type: "vg"
    ...
gives Set_2 equal to:
Set_2:
  - id: 1
    type: "fs"
    ...
  - id: 2
    type: "lv"
    ...
  - id: 3
    type: "vg"
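The flattening and override rules can be expressed in a few lines. In this sketch a subset reference is represented as a plain string holding the set name, which is an assumption about the eventual format; the function name flatten is hypothetical, and cyclic references between sets are not detected.

```python
def flatten(set_name, sets):
    """Expand nested sets into one flat list; later ids override earlier ones."""
    merged = {}
    for entry in sets[set_name]:
        if isinstance(entry, str):
            # A string names another set: treat its items as a subset.
            for item in flatten(entry, sets):
                merged[item["id"]] = item
        else:
            merged[entry["id"]] = entry
    return list(merged.values())


sets = {
    "Set_1": [
        {"id": 1, "type": "fs"},
        {"id": 2, "type": "partition"},
    ],
    "Set_2": [
        "Set_1",
        {"id": 2, "type": "lv"},
        {"id": 3, "type": "vg"},
    ],
}
flat = flatten("Set_2", sets)
```

Here flat contains three items, and the item with id 2 is the logical volume from Set_2 rather than the partition from Set_1.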
Some of the sets are to be named after node role names. If a set has the same name as a role, this set of volumes will be used for a node with this role assigned. For example, the following means Controller_Role will have three volume items: Item_1, Item_2, Item_3.
Set_1:
  - Item_1
  - Item_2
Controller_Role:
  - Set_1
  - Item_3

If several roles are assigned to a node and these roles define volume items with conflicting parameters, we need to be able to resolve the conflict where possible, or report an error if the conflict can’t be resolved.
Role_1:
  - type: "lv"
    name: "my_favorite_lv"
    vg_id: "my_favorite_vg"
    minsize: 10
    maxsize: 30
Role_2:
  - type: "lv"
    name: "my_favorite_lv"
    vg_id: "my_favorite_vg"
    minsize: 20
    maxsize: 50

The example above describes two roles which define the same logical volume differently. Neither role contains the other as a subset, so we cannot override the logical volume definition from one role with parameters from the other. Roles don’t have priorities; they have equal rights to define volume items. The only way to deal with this is to resolve the conflict.
Fortunately, it is always possible to consider parameter intervals (continuous or enumerable) as abstract sets which can intersect with one another. If the intersection is empty, we have to conclude that the parameters are incompatible and report an error. If the intersection is not empty, the new parameter interval is equal to the intersection. This is not always the most effective way to reconcile parameters, but it is general enough to be useful for all possible cases. How the parameter intersection is calculated depends on the nature of a particular parameter.
Let’s define the following set of rules:
def minsize(minsize_1, minsize_2, maxsize_1, maxsize_2):
    # The merged lower bound is the larger of the two lower bounds.
    result = max(minsize_1, minsize_2)
    if result > min(maxsize_1, maxsize_2):
        raise Exception("Incompatible parameters")
    return result

def maxsize(minsize_1, minsize_2, maxsize_1, maxsize_2):
    # The merged upper bound is the smaller of the two upper bounds.
    # Both lower bounds are passed in so the emptiness check can be done.
    result = min(maxsize_1, maxsize_2)
    if result < max(minsize_1, minsize_2):
        raise Exception("Incompatible parameters")
    return result

def bestsize(bestsize_1, bestsize_2, minsize, maxsize):
    # Average the two recommendations, then clamp to the merged interval.
    result = (bestsize_1 + bestsize_2) / 2.0
    if result > maxsize:
        return maxsize
    elif result < minsize:
        return minsize
    else:
        return result

def priority(priority_1, priority_2):
    return max(priority_1, priority_2)
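Applied to the Role_1 and Role_2 example above, the interval intersection can be sketched as follows. The function name merge_size_limits and the dictionary-based representation are assumptions made for illustration, and only the minsize and maxsize rules are covered.

```python
def merge_size_limits(first, second):
    """Intersect the [minsize, maxsize] intervals of two definitions."""
    minsize = max(first["minsize"], second["minsize"])
    maxsize = min(first["maxsize"], second["maxsize"])
    if minsize > maxsize:
        # Empty intersection: the two roles cannot be reconciled.
        raise ValueError("Incompatible parameters")
    return {"minsize": minsize, "maxsize": maxsize}


role_1 = {"minsize": 10, "maxsize": 30}
role_2 = {"minsize": 20, "maxsize": 50}
merged = merge_size_limits(role_1, role_2)
print(merged)  # {'minsize': 20, 'maxsize': 30}
```

If one of the roles required minsize 40 instead, the intervals [10, 30] and [40, 50] would not intersect and the merge would fail with an error.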
We could implement the volume management mechanism from scratch and fully independently of Fuel Agent. But it seems irrational to avoid reusing existing code and to ignore a sound architectural concept.
Fuel Agent object model is going to be changed so as to include dynamic allocation methods and data.
Volume data in Nailgun are stored as plain JSON in the Node data model. Since the Nailgun volume manager will be re-implemented as an extension, these volume data will be moved into an extension table with a foreign key to the Node.
That part of REST API which deals with volume data is going to be moved into volume manager extension.
Since Fuel Agent is installed into the bootstrap ramdisk, nodes booted with this ramdisk must be forced to reboot to make sure the newest version of Fuel Agent is available on slave nodes.
Also Fuel Agent package should be updated on the master node because Nailgun volume manager extension is going to use Fuel Agent modules.
Besides, we need to write a database migration which should create the new volume manager table and move volume data there.
In 7.0 there is no plan to expose the new format to users.
The volume manager should be implemented as a Fuel extension. Other plugins/extensions which need to modify volume allocation should use the volume manager extension API.
If a deployer needs a specific allocation mechanism other than the one available in Fuel Agent, she just needs to write her own volume manager extension implementing the corresponding API. But since the Fuel Agent allocation algorithm is going to be metadata driven, it will likely be possible to cover such specific cases without changing the Fuel Agent code.
After the volume manager extension is moved to the new volume allocation format and algorithm, new system tests need to be added to cover its usage.
The new volume allocation format needs to be described.