Volume manager refactoring

https://blueprints.launchpad.net/fuel/+spec/volume-manager-refactoring

Currently the Nailgun volume manager is not flexible or customizable enough to address many user needs. For example, some users want certain volumes to remain untouched during OS provisioning, while others want to deploy software RAID or configure FS mount options.

Problem description

There are use cases which aren’t covered by the functionality of the current volume manager implementation.

These use cases include at least the following:

  • Volume preservation

    Sometimes, when a node is going to be re-provisioned, there are volumes (partitions, logical volumes, MD devices) which the user wants to remain untouched.

  • FS mount options

    Sometimes user needs to mount some file systems using specific options, like noatime or ro, etc.

  • Bootable disks

    Currently we install the bootloader on all hard drives, which is not always what the user wants.

  • Flexible partitioning scheme

    Currently we have a predefined partition scheme which assumes, for example, that we put the root file system on a logical volume and that we don’t create a separate file system for /var. These assumptions limit the partition schemes users can create.

  • Pluggable partitioning scheme

    Some Fuel plugins assume that additional partitions exist on a node. Currently, plugin partitions can conflict with existing partitions and we need to resolve these potential conflicts.

Proposed change

Provisioning process in general can be considered as the following set of steps:

+---------+    +----------+    +-----------+    +-----------+
|         |    |          |    |           |    |           |
|discovery+--> |allocation+--> |OS building+--> |OS copying |
|         |    |          |    |           |    |           |
+---------+    +----------+    +-----------+    +-----------+

All those main steps can be implemented in a monolithic manner or they can be a set of separable modules/plugins/extensions.

  1. Discovery

    This step is when we try to find out which hard drives are available on a node. Anaconda and debian-installer do the same at the very beginning of provisioning process. In our case this step is implemented as a separate service which is called Nailgun Agent.

  2. Allocation

    On this step the default partitioning scheme is generated. This allocation step can be data driven when, for example, a user of a provisioning agent defines which file systems she needs to create and their priorities but not their exact sizes. Again anaconda and debian-installer do the same using some default hard coded or user defined (kickstart/preseed) partitioning metadata. In our case it is implemented as the volume manager module in Nailgun.

  3. OS building

    At this step the OS is built from scratch using package repositories or any other available mechanism. Anaconda builds the OS using rpm packages and yum. Debian-installer uses deb packages and debootstrap. In terms of Fuel this step is exactly what we call OS image building. In contrast to anaconda and debian-installer, we build the OS just once, either on the master node or on a developer node during ISO building, and then copy this pre-built OS image onto all provisioned nodes. This step indirectly depends on the previous step (step 2) because a user might want to assign specific options to a particular file system, and step 2 (allocation) is exactly where we define which partitions and file systems we need. Since OS building (or, equivalently, OS image building) is implemented in the scope of Fuel Agent, it can potentially be run on the slave node if, for example, this node requires specific file system options.

  4. OS copying

    This step makes sense only for image based approach when we build OS remotely. For example, anaconda and debian-installer build OS right on the file system where it is going to live on a node.

Anaconda and debian-installer implement these four steps in a monolithic manner. For example, the OS building step can not be separated from the rest of the provisioning process. In the case of Fuel all these steps are implemented as separate components. Currently, Fuel Agent implements steps 3 and 4, but it looks like Fuel Agent is also the right place to implement steps 1 and 2 [1]. This spec does not concern step 1: re-implementing the functionality of Nailgun Agent in the scope of Fuel Agent is a matter for a separate feature. This spec is entirely about step 2.

The suggestion is to implement dynamic volume allocation on the Volume Manager side while keeping as much code as possible in Fuel Agent and reusing it. The motivation is as follows:

  • Fuel Agent already has a quite detailed partitioning object model (fuel_agent/objects/partition.py) which just needs to be extended to support dynamic allocation over the existing hard drives on a node.
  • The allocation scheme can influence steps 3 and 4, so it is much easier to deal with the whole provisioning process when it is entirely implemented in terms of one modular component.
  • Being quite independent, Fuel Agent can be used without Fuel, and it would be great to make it able to dynamically allocate volumes when it is used outside of Fuel.
  • In the future we will need to allocate volumes based not only on their size but also on disk types and other parameters. It is going to be much easier to introduce those parameters in the scope of the Fuel Agent object model.

On the other hand, we are moving towards a modular Fuel architecture, so this looks like a good place to start our modularisation efforts. The suggestion is to implement the current volume manager in Nailgun as an extension. Once installed, this volume manager extension imports Fuel Agent code in order to generate the volume allocation (metadata/UI driven). The default volume allocation should be configurable via allocation metadata. A user can then modify this default allocation on the disk management tab in the UI. If other extensions (ceph, mongo, etc.) need to modify the volume allocation scheme, they must do so through the volume manager extension, interacting with it only via its API.

So, the feature can be considered as two independent tasks:

  1. Convert Nailgun volume manager into Nailgun volume manager extension
  2. Implement dynamic volume allocation procedure in the scope of Fuel Agent and introduce this functionality into Nailgun volume manager extension importing necessary modules from Fuel Agent.

The coverage scheme then will be as follows:

+-------------------------+    +----------------------------+
|Nailgun & vol. extension |    | Fuel Agent                 |
+-------------------------+    +----------------------------+
+---------+    +----------+    +-----------+    +-----------+
|         |    |          |    |           |    |           |
|discovery+--> |allocation+--> |OS building+--> |OS copying |
|         |    |          |    |           |    |           |
+---------+    +----------+    +-----------+    +-----------+

A more detailed scheme of how it will work:

    VolumeManager
+--------------------+
| +----------------+ |
| |    objects     | |
| |(from fuel_agent| |
| | import objects)| |
| +----------------+ |
|                    |   +-----------+
|   +------------+   +--->           |
|   | new volumes|   |   |  nailgun  |
|   | allocation |   <---+           |
|   | algorithm  |   |   +-----------+
|   +------------+   |
+---------+----------+
          |
          |  +---------------+
          |  |   serialize   |
          |  | ready to use  |
          |  |PartitionScheme|
          |  +---------------+
          |
          |     fuel_agent
          |
+---------v-------------------------+
| +---------+    +----------------+ |
| | objects |    |  partitioning  | |
| +---------+    |  provisioning  | |
|                +----------------+ |
|                                   |
| +-------------------------------+ |
| |         NEW DataDriver        | |
| |      (deserialize obtained    | |
| |         PartitionScheme)      | |
| +-------------------------------+ |
+-----------------------------------+

The new volume allocation algorithm will be implemented first in terms of Fuel Agent and then used (imported but not moved) in the volume manager.

Dynamic allocation

Dynamic allocation metadata could look like the following (the exact format will be determined during actual implementation):

- id: 1
  type: "fs"
  mount: "/boot"
  device_id: 9
  fs_type: "ext2"

- id: 2
  type: "fs"
  mount: "/"
  device_id: 5
  fs_type: "ext4"

- id: 3
  type: "fs"
  mount: "swap"
  device_id: 6
  fs_type: "swap"

- id: 4
  type: "fs"
  device_id: 7
  mount: "/var/lib/mysql"
  fs_type: "ext4"
  block_size: "4K"

- id: 5
  type: "lv"
  vg_id: 8
  name: "root"
  minsize: "10G"
  bestsize: "15G"
  priority: 1000

- id: 6
  type: "lv"
  vg_id: 8
  minsize: "1G"
  maxsize: "8G"
  priority: 200
  name: "swap"

- id: 7
  type: "partition"
  minsize: "20G"
  device_id: __auto__

- id: 8
  type: "vg"
  name: "os"
  minsize: __auto__
  pvs_id: __auto__

- id: 9
  type: "md"
  level: "mirror"
  minsize: "200M"
  maxsize: "400M"
  bestsize: "200M"
  numactive: 2
  numspares: 1
  devices_id: __auto__
  spares_id: __auto__

The format of these metadata should be as close to the format of Fuel Agent objects as possible. This makes it easier to serialize and deserialize objects.
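As an illustration of how such flat, id-linked metadata could be deserialized, here is a minimal sketch. The function link_items and the constant REF_FIELDS are hypothetical and are not part of Fuel Agent; only the field names (id, device_id, vg_id, __auto__) come from the example above.

```python
# Hypothetical loader (assumption, not Fuel Agent code): turns *_id fields
# of the flat metadata into direct references between items.
REF_FIELDS = ("device_id", "vg_id")


def link_items(items):
    """Return {id: item_copy} with device_id/vg_id resolved to the items they name."""
    by_id = {}
    for item in items:
        if item["id"] in by_id:
            raise ValueError("duplicate id: %r" % item["id"])
        by_id[item["id"]] = dict(item)
    for item in by_id.values():
        for field in REF_FIELDS:
            ref = item.get(field)
            if ref is None or ref == "__auto__":
                # Nothing to link: the field is absent or decided during allocation.
                continue
            if ref not in by_id:
                raise ValueError("item %r refers to unknown id %r" % (item["id"], ref))
            # "device_id" becomes "device", "vg_id" becomes "vg".
            item[field[:-3]] = by_id[ref]
    return by_id


items = [
    {"id": 1, "type": "fs", "mount": "/", "device_id": 5, "fs_type": "ext4"},
    {"id": 5, "type": "lv", "vg_id": 8, "name": "root", "minsize": "10G"},
    {"id": 8, "type": "vg", "name": "os", "minsize": "__auto__", "pvs_id": "__auto__"},
]
linked = link_items(items)
root_fs = linked[1]
print(root_fs["device"]["name"], root_fs["device"]["vg"]["name"])  # root os
```

Keeping the serialized form flat means a plugin only has to append or drop list entries; the references are rebuilt on load.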

Let’s go through these metadata step by step.

  1. Each item has an id field which is used to connect objects wherever they need to be connected while avoiding non-trivial data hierarchies. However, id is used only for a serialized set of objects. In a set of Python objects, device_id becomes just device, a Python reference to the object. For the sake of readability, id can be an integer or a string. Python objects are identified by their contents. For example, there can not be two file systems with the same mount point on a node, so the mount point can be considered a unique identifier for the file system object. Logical volumes are identified by the combination of volume group name and logical volume name.

    Because the metadata is flat, it is easy to extend: any plugin/extension can append or remove items. For example, the following item means we need to allocate an ext2 file system with the /boot mount point on the device with id equal to 10.

- id: 1
  type: "fs"
  mount: "/boot"
  device_id: 10
  fs_type: "ext2"

  2. Logical volume items have a vg_id field which identifies the volume group where the logical volume is to be placed.

- id: 5
  type: "lv"
  vg_id: 8
  name: "root"
  minsize: "10G"
  bestsize: "15G"
  maxsize: "50G"
  priority: 1000

The fields minsize, maxsize and bestsize are used to set limits and give recommendations about the size of the logical volume. The field priority is going to be used for sharing the volume group space over all logical volumes in this group. The priority is used as the weight of a particular volume. For example, if two volumes are given and we need to share the whole space between these two volumes, we can use the following algorithm:

space_1 = total_space * priority_1 / (priority_1 + priority_2)
space_2 = total_space * priority_2 / (priority_1 + priority_2)
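The same weighting generalizes to any number of volumes. The sketch below assumes Python 3 division and sizes expressed as plain numbers in one unit; the function name share_by_priority is hypothetical.

```python
def share_by_priority(total_space, priorities):
    """Split total_space between volumes proportionally to their priorities."""
    weight_sum = sum(priorities.values())
    if weight_sum <= 0:
        raise ValueError("at least one priority must be positive")
    return {name: total_space * prio / weight_sum
            for name, prio in priorities.items()}


# 120 units shared between "root" (priority 1000) and "swap" (priority 200).
shares = share_by_priority(120, {"root": 1000, "swap": 200})
print(shares)  # {'root': 100.0, 'swap': 20.0}
```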

Allocation algorithm for logical volumes should look like the following:

  • Allocating minimal size for each logical volume (fail if there is not enough space)
  • Allocating remaining space up to recommended size for each logical volume taking into account their priorities
  • Allocating remaining space up to maximal size for each logical volume taking into account their priorities. If maximal size is not set, we assume there is no such limit.

Those size limitation/recommendation/priority fields are optional. If they are not set we can use some default (0) priority and allocate remaining space for the logical volume taking into account this default priority value.
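The three steps above can be sketched as follows. This is only an illustration under stated assumptions, not the Fuel Agent implementation: sizes are plain numbers in one unit, the names allocate_lvs and _fill are hypothetical, a volume without bestsize is not grown at the second step, and when every remaining volume has priority 0 the space is split equally.

```python
INF = float("inf")


def _fill(alloc, caps, prio, free):
    """Hand out `free` space by priority without exceeding caps; return the leftover."""
    active = {n for n in alloc if alloc[n] < caps[n]}
    while free > 0 and active:
        total = sum(prio[n] for n in active)
        # Assumption: if all active priorities are 0, share the space equally.
        shares = {n: (free * prio[n] / total if total else free / len(active))
                  for n in active}
        capped = {n for n in active if alloc[n] + shares[n] >= caps[n]}
        if not capped:
            for n in active:
                alloc[n] += shares[n]
            return 0
        # Volumes that hit their cap take only what they need; the rest is
        # redistributed among the remaining volumes on the next iteration.
        for n in capped:
            free -= caps[n] - alloc[n]
            alloc[n] = caps[n]
        active -= capped
    return free


def allocate_lvs(vg_size, lvs):
    """Return ({lv name: size}, unallocated space) for one volume group."""
    alloc = {lv["name"]: lv.get("minsize", 0) for lv in lvs}
    prio = {lv["name"]: lv.get("priority", 0) for lv in lvs}
    # Step 1: minimal sizes must fit.
    free = vg_size - sum(alloc.values())
    if free < 0:
        raise ValueError("not enough space for minimal sizes")
    # Step 2: grow up to the recommended size.
    best = {lv["name"]: lv.get("bestsize", alloc[lv["name"]]) for lv in lvs}
    free = _fill(alloc, best, prio, free)
    # Step 3: grow up to the maximal size (no limit if maxsize is not set).
    maxi = {lv["name"]: lv.get("maxsize", INF) for lv in lvs}
    free = _fill(alloc, maxi, prio, free)
    return alloc, free


lvs = [
    {"name": "root", "minsize": 10, "bestsize": 15, "maxsize": 50, "priority": 1000},
    {"name": "swap", "minsize": 1, "maxsize": 8, "priority": 200},
]
sizes, leftover = allocate_lvs(28, lvs)
print(sizes, leftover)
```

With a 28 unit volume group, root first gets 10 and swap 1, root then grows to its recommended 15, and the remaining 12 units are split 10 to 2 according to the priorities.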

  3. A volume group can also have minsize, maxsize, bestsize and priority fields which are to be used exactly the same way as for logical volumes. If minsize is equal to __auto__, it should be calculated as the sum of the minimal sizes of all logical volumes in the volume group. The field pvs_id should define the set of physical volume identifiers which constitute the volume group. If this field is equal to __auto__, the physical volumes should be defined dynamically during allocation. For example, suppose we need to allocate 100G for the volume group, and there are two disks on the node partly allocated for other volume groups and partitions. Let’s say there is 50G of free space on the first disk and 50G of free space on the second disk. Then two physical volumes (50G each) will be allocated for the volume group.
  4. A plain partition can have the same limitation/recommendation fields minsize, maxsize, bestsize and priority, and these fields have the same meaning. Note that, unlike volume groups, plain partitions can not be split into parts (physical volumes). So, plain partitions should be allocated before volume groups, and the remaining free space can then be flexibly used for volume groups.
  5. An MD device has the same dynamic allocation fields, but the trick here is that we need to allocate several partitions for one MD device and these partitions are to be located on different hard drives.
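The placement constraints described for volume groups, plain partitions and MD devices can be sketched like this. All function names and the disk names sda/sdb/sdc are hypothetical, free space is tracked as a plain dict of numbers in one unit, and a simple first-fit strategy is assumed; the actual strategy is left to the implementation.

```python
def place_partition(free_by_disk, size):
    """A plain partition can not be split: use the first disk that holds it entirely."""
    for disk in sorted(free_by_disk):
        if free_by_disk[disk] >= size:
            free_by_disk[disk] -= size
            return disk
    raise ValueError("no single disk has %s free" % size)


def place_volume_group(free_by_disk, size):
    """A volume group may span disks: create one physical volume per disk used."""
    if sum(free_by_disk.values()) < size:
        raise ValueError("not enough free space for the volume group")
    pvs = {}
    remaining = size
    for disk in sorted(free_by_disk):
        chunk = min(free_by_disk[disk], remaining)
        if chunk > 0:
            pvs[disk] = chunk
            free_by_disk[disk] -= chunk
            remaining -= chunk
    return pvs


def place_md(free_by_disk, size, members):
    """An MD device needs `members` partitions of `size`, each on a different disk."""
    disks = [d for d in sorted(free_by_disk) if free_by_disk[d] >= size][:members]
    if len(disks) < members:
        raise ValueError("not enough disks for the MD device")
    for disk in disks:
        free_by_disk[disk] -= size
    return disks


# Plain partitions go first, then the volume group takes what is left.
free = {"sda": 70, "sdb": 50}
partition_disk = place_partition(free, 20)
vg_pvs = place_volume_group(free, 100)
print(partition_disk, vg_pvs)  # sda {'sda': 50, 'sdb': 50}
```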

Ideally, the dynamic allocation process must take into account many other parameters apart from just the size of a volume. For example, we’d better avoid using SSD and HDD disks together for one volume group. Another example is that we need to set the file system block size taking into account the type of hard drive, otherwise we can encounter serious performance issues. But due to the tight deadline for 7.0 let’s implement ONLY size driven allocation. Other metadata can be easily introduced later.

Another important point is that Fuel Agent objects are currently often initialized with actual block device names (e.g. /dev/sda). But in the case of dynamic allocation the actual device names are unknown when an object is instantiated. An actual block device name makes sense no earlier than when the parted command is run. The correct way to deal with this is to modify objects so that actual device evaluation can be postponed (e.g. fuel_agent/objects/device.py:Loop). The partition scheme should not contain names like /dev/sda3 until it is evaluated and actualized.
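One possible shape for such postponed evaluation is sketched below. The classes DeferredDevice and FileSystem are hypothetical illustrations and do not reproduce the existing Fuel Agent objects; the point is only that the device name is looked up when a command is built, not when the object is created.

```python
class DeferredDevice(object):
    """Placeholder for a block device whose name is not known yet."""

    def __init__(self, description):
        self.description = description
        self._name = None

    def resolve(self, name):
        # Called once partitioning has actually happened.
        self._name = name

    @property
    def name(self):
        if self._name is None:
            raise RuntimeError("device %r is not evaluated yet" % self.description)
        return self._name


class FileSystem(object):
    """Holds a reference to a device object instead of a device name."""

    def __init__(self, device, mount, fs_type):
        self.device = device
        self.mount = mount
        self.fs_type = fs_type

    def mkfs_command(self):
        # The device name is read only here, at command build time.
        return ["mkfs." + self.fs_type, self.device.name]


boot_device = DeferredDevice("partition for /boot")
boot_fs = FileSystem(boot_device, "/boot", "ext2")
# ... later, after the partition has been created:
boot_device.resolve("/dev/sda3")
print(boot_fs.mkfs_command())  # ['mkfs.ext2', '/dev/sda3']
```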

Volume sets, roles and compatibility

Several named sets of volume items (like those which are outlined above) can be defined and then these sets can be combined so as to define other sets. When a set defines another set as its element, then this element should be treated as a subset rather than an element. So, the resulting set is to remain flat. In the example below, Set_3 is a set of elements: Item_1, Item_2, Item_4.

Set_1:
  - Item_1
  - Item_2
Set_2:
  - Item_3
Set_3:
  - Set_1
  - Item_4

As mentioned above, every volume item is to have an id field. This field is only used to connect items with each other inside a set. When a set has another set as its subset, the ids of its own items should not coincide with those in the subset. Otherwise, items with the same id will override those in the subset. This can be used deliberately to override one or more items of the subset.

For example:

Set_1:
  - id: 1
    type: "fs"
    ...
  - id: 2
    type: "partition"
    ...
Set_2:
  - Set_1
  - id: 2
    type: "lv"
    ...
  - id: 3
    type: "vg"
    ...

gives Set_2 equal to:

Set_2:
  - id: 1
    type: "fs"
    ...
  - id: 2
    type: "lv"
    ...
  - id: 3
    type: "vg"
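The flattening and override rule can be sketched as follows. The function flatten is hypothetical; it assumes a set is a list whose elements are either item dicts or strings naming other sets, and that a set's own items always win over items of its subsets.

```python
def flatten(name, sets, _seen=()):
    """Return the flat item list of sets[name]; own items override subset items by id."""
    if name in _seen:
        raise ValueError("circular set reference: %s" % name)
    merged = {}  # id -> item, insertion order is kept (Python 3.7+)
    # First expand subsets ...
    for element in sets[name]:
        if isinstance(element, str):
            for item in flatten(element, sets, _seen + (name,)):
                merged[item["id"]] = item
    # ... then apply the set's own items, which override on equal id.
    for element in sets[name]:
        if not isinstance(element, str):
            merged[element["id"]] = element
    return list(merged.values())


sets = {
    "Set_1": [{"id": 1, "type": "fs"}, {"id": 2, "type": "partition"}],
    "Set_2": ["Set_1", {"id": 2, "type": "lv"}, {"id": 3, "type": "vg"}],
}
flat = flatten("Set_2", sets)
print([(item["id"], item["type"]) for item in flat])
# [(1, 'fs'), (2, 'lv'), (3, 'vg')]
```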

Some of the sets are to be named after node role names. If a set has the same name as a role, this set of volumes will be used for a node with this role assigned. For example, the following means Controller_Role will have three volume items: Item_1, Item_2, Item_3.

Set_1:
  - Item_1
  - Item_2
Controller_Role:
  - Set_1
  - Item_3

If we have several roles assigned for a node and these roles define volume items with parameters which conflict with each other, we need to be able to resolve the conflict if it is possible or report error if the conflict can’t be resolved.

Role_1:
  - type: "lv"
    name: "my_favorite_lv"
    vg_id: "my_favorite_vg"
    minsize: 10
    maxsize: 30
Role_2:
  - type: "lv"
    name: "my_favorite_lv"
    vg_id: "my_favorite_vg"
    minsize: 20
    maxsize: 50

The example above describes two roles which define the same logical volume differently. Neither role contains the other as a subset, so we can not override the logical volume definition from one role with parameters from the other. Roles don’t have priorities; they have equal rights to define volume items. The only way to deal with this is to resolve the conflict.

Fortunately, it is always possible to consider parameter intervals (continuous or enumerable) as abstract sets which can intersect with one another. If the intersection is empty, we need to conclude that the parameters are incompatible and report an error. If the intersection is not empty, the new parameter interval is equal to the intersection. This is not always the most effective way to reconcile parameters, but it is general enough to be useful in all possible cases. How we calculate the intersection depends on the nature of a particular parameter.

Let’s define the following set of rules:

def minsize(minsize_1, minsize_2, maxsize_1, maxsize_2):
  # The merged lower bound is the larger of the two lower bounds.
  result = max(minsize_1, minsize_2)
  if result > min(maxsize_1, maxsize_2):
    raise Exception("Incompatible parameters")
  return result

def maxsize(maxsize_1, maxsize_2, minsize_1, minsize_2):
  # The merged upper bound is the smaller of the two upper bounds.
  result = min(maxsize_1, maxsize_2)
  if result < max(minsize_1, minsize_2):
    raise Exception("Incompatible parameters")
  return result

def bestsize(bestsize_1, bestsize_2, minsize, maxsize):
  # Average the two recommendations, then clamp into [minsize, maxsize].
  result = (bestsize_1 + bestsize_2) / 2.0
  if result > maxsize:
    return maxsize
  elif result < minsize:
    return minsize
  else:
    return result

def priority(priority_1, priority_2):
  # Keep the higher of the two priorities.
  return max(priority_1, priority_2)
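Applied to the Role_1 and Role_2 example above, the interval intersection gives minsize 20 and maxsize 30. The sketch below shows this with a hypothetical helper merge_lv which applies the same max/min rules to two item dicts; it is an illustration, not a proposed API.

```python
def merge_lv(first, second):
    """Reconcile two definitions of the same logical volume by intersecting sizes."""
    merged = dict(first)
    merged["minsize"] = max(first["minsize"], second["minsize"])
    merged["maxsize"] = min(first["maxsize"], second["maxsize"])
    if merged["minsize"] > merged["maxsize"]:
        # Empty intersection: the roles can not share this volume.
        raise ValueError("Incompatible parameters")
    return merged


role_1_lv = {"type": "lv", "name": "my_favorite_lv", "vg_id": "my_favorite_vg",
             "minsize": 10, "maxsize": 30}
role_2_lv = {"type": "lv", "name": "my_favorite_lv", "vg_id": "my_favorite_vg",
             "minsize": 20, "maxsize": 50}

merged_lv = merge_lv(role_1_lv, role_2_lv)
print(merged_lv["minsize"], merged_lv["maxsize"])  # 20 30
```

If Role_1 instead allowed at most 15, the intervals [10, 15] and [20, 50] would not intersect and the merge would raise an error.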

Alternatives

We could implement the volume management mechanism from scratch and fully independently from Fuel Agent. But it would be irrational to avoid reusing existing code and to ignore a sound architectural concept.

Data model impact

Fuel Agent object model is going to be changed so as to include dynamic allocation methods and data.

Volume data in Nailgun are stored as plain json in the Node data model. Since the Nailgun volume manager will be re-implemented as an extension, these volume data will be moved into an extension table with a foreign key to the Node.

REST API impact

The part of the REST API which deals with volume data is going to be moved into the volume manager extension.

Upgrade impact

Since Fuel Agent is installed into the bootstrap ramdisk, nodes which are booted with this ramdisk must be rebooted to make sure the newest version of Fuel Agent is available on slave nodes.

Also Fuel Agent package should be updated on the master node because Nailgun volume manager extension is going to use Fuel Agent modules.

Besides, we need to write a database migration which should create the new volume manager table and move volume data there.

Security impact

None

Notifications impact

None

Other end user impact

In 7.0 there is no plan to expose the new format to the user.

Performance Impact

None

Plugin impact

The volume manager should be implemented as a Fuel extension. Other plugins/extensions which need to modify volume allocation should use the volume manager extension API.

Other deployer impact

If a deployer needs a specific allocation mechanism other than the one available in Fuel Agent, she just needs to write her own volume manager extension implementing the corresponding API. But since the Fuel Agent allocation algorithm is going to be metadata driven, it’ll likely be possible to cover such specific cases without changing the code of Fuel Agent.

Developer impact

None

Infrastructure impact

None

Implementation

Work Items

  1. Implement Nailgun volume manager extension
  2. Implement dynamic volume allocation in the scope of Fuel Agent
  3. Use new dynamic volume allocation in volume manager extension

Dependencies

None

Testing

After the volume manager extension is moved to the new volume allocation format and algorithm, new system tests need to be added to cover its usage.

Acceptance criteria

  • Current functionality works as usual with no regressions, except where this spec changes it.
  • Volume preservation: ability to reserve a partition as untouched while re-provisioning.
  • FS mount options: ability to specify different mount options for particular partitions.
  • Bootable disks: ability to choose which hard drives should contain the bootloader.
  • Flexible partitioning scheme: ability to create various partition schemes.
  • Pluggable partitioning scheme: ability for plugins to create own partitions without conflicts.

Documentation Impact

The new volume allocation format needs to be described.

References

[1] In fact, Fuel Agent currently implements discovery functionality but only for block devices (hard drives) and it is not compatible with Nailgun. So, if it is necessary, Fuel Agent is able to get the information about available hard drives on a node totally on its own.