Improve Corosync and Pacemaker management [1]

The next iteration of Corosync and Pacemaker improvements, driven by scaling requirements, better Pacemaker management, and new OS support.

Problem description

The current Pacemaker implementation has some limitations:

  • It does not allow deploying a large number of OpenStack Controllers
  • Operations on the CIB use almost 100% of the CPU on the Controller
  • The Corosync shutdown process takes a long time
  • No support for newer OSes such as CentOS 7 or Ubuntu 14.04
  • Current Fuel Architecture is limited to Corosync 1.x and Pacemaker 1.x
  • The Puppet service provider for Pacemaker doesn’t disable Upstart or System V services by default
  • The current implementation does not specify ordering between resources
  • Diff operations against the Corosync CIB require saving data to a file rather than keeping all data in memory
  • The debug process for OCF scripts is not unified and requires a lot of manual actions from the Cloud Operator
  • Pacemaker and Corosync installation is not granular enough
  • OpenStack services are not managed by Pacemaker
  • Compute nodes aren’t in the Pacemaker cluster and hence lack a viable control plane for their compute/nova services

Proposed change

  • Support Fuel Controllers with Corosync 2.0 packages
  • Get the puppet corosync module from puppetlabs and integrate it
  • Rename OCF resources. Remove __old from resource names
  • Refactor the service provider so that it disables the same services under systemd/Upstart/System V
  • Refactor the provider and remove the file-based diff operation
  • Add wrapper handler for OCF scripts or unify debug handling of OCF scripts
  • Move Pacemaker and Corosync installation to its own stage. Create a separate corosync.pp to make it more granular
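
As a rough illustration of the unified debug handling proposed above, an OCF script could be started through a small wrapper that traces its execution into a log file. This is only a sketch under stated assumptions: the function name ocf_debug_run, its argument order, and the log format are hypothetical and not part of Fuel or Pacemaker, and the agent is assumed to be a bash script.

```shell
# Sketch of a debug wrapper for OCF scripts. All names here are hypothetical.
# Usage: ocf_debug_run <log-file> <ocf-script> <action>
ocf_debug_run() {
    odr_log="$1"      # file that receives the trace
    odr_script="$2"   # path to the OCF resource agent
    odr_action="$3"   # OCF action, e.g. start, stop, monitor

    echo "$(date '+%Y-%m-%dT%H:%M:%S') start ${odr_script} ${odr_action}" >> "$odr_log"

    # 'bash -x' writes a trace of every executed command to stderr.
    # Only stderr goes to the log, so stdout (e.g. meta-data XML) is untouched.
    bash -x "$odr_script" "$odr_action" 2>> "$odr_log"
    odr_rc=$?

    echo "$(date '+%Y-%m-%dT%H:%M:%S') finish ${odr_script} ${odr_action} rc=${odr_rc}" >> "$odr_log"

    # Return the agent's own exit code, since the caller interprets it as the OCF result.
    return "$odr_rc"
}
```

Because OCF_RESKEY_* parameters are passed through the environment, the wrapped script sees them unchanged. Keeping the trace in a separate file is one way to get enough detail without flooding the system log.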

Permissive change:

  • Add all OpenStack services to Pacemaker and configure ordering between them
  • Use monit as an additional control plane for services on compute nodes
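
Ordering between Pacemaker resources is expressed as an order constraint. The helper below is a minimal sketch that only prints the crmsh and pcs forms of such a constraint and does not touch a cluster. The function name and the generated constraint ID are hypothetical, and the crmsh line uses the older score-based form (inf:), which may differ between crmsh releases.

```shell
# Illustrative only: print equivalent ordering commands for crmsh and pcs.
# Nothing is executed against a cluster; the resource IDs are supplied by the caller.
print_order_commands() {
    poc_first="$1"   # resource that must start first
    poc_then="$2"    # resource that starts afterwards

    # crmsh form (assumed older syntax): order <id> <score>: <first> <then>
    echo "crm configure order ${poc_first}-before-${poc_then} inf: ${poc_first} ${poc_then}"

    # pcs form: constraint order start <first> then start <then>
    echo "pcs constraint order start ${poc_first} then start ${poc_then}"
}
```

For example, print_order_commands p_first p_second (two made-up resource IDs) prints one line for each tool, which can then be reviewed before being run by hand.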


None of the changes are critical, and they do not affect deployment or cluster operation.

Data model impact


REST API impact


Upgrade impact

  • Since resources will be renamed, the upgrade process should delete the old resources on upgrade and delete the newly named resources on rollback.
  • Corosync 2.x is NOT compatible with previous versions of Corosync (1.3/1.4). Please make sure to upgrade all nodes at once (full-downtime patching)
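
An upgrade or rollback script needs to map each old resource name to its new one. The following is a minimal sketch of that mapping, assuming the rename simply drops a trailing suffix. The function name is hypothetical, and the default suffix _old is an assumption that should be adjusted to the names actually in use.

```shell
# Hypothetical helper: derive the new resource name by dropping a legacy suffix.
# Usage: strip_legacy_suffix <resource-name> [suffix]
strip_legacy_suffix() {
    sls_name="$1"
    sls_suffix="${2:-_old}"   # assumed default; pass another suffix if needed

    # ${var%pattern} removes the suffix if present and leaves the name unchanged otherwise.
    printf '%s\n' "${sls_name%"$sls_suffix"}"
}
```

On upgrade, the resource would be deleted under its original name and recreated under the name printed by the helper. On rollback, the roles are reversed.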

Security impact


Notifications impact


Other end user impact


Performance Impact

  • The deployment process will be improved and will require less time, as CIB operations will no longer require 100% CPU time
  • Corosync 2.0 has a lot of improvements that allow clusters of up to 100 Controllers. Corosync 1.0 scales up to 10-16 nodes

Other deployer impact


Developer impact

  • The enhanced Pacemaker provider requires some refactoring of the Puppet manifests in Fuel Library:
    • Upstream corosync manifests will replace our in-memory diff invention with the standard approach: calling crm, pcs, or cibadmin --patch '<xml patch>' directly.
    • Renaming VIP primitives could require additional orchestration refactoring as well.
  • The new Pacemaker/monit control plane for OpenStack services would require appropriate changes in manifests as well.
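
The patch-based approach mentioned above can be sketched as follows: generate an XML patch from the current and desired CIB XML with crm_diff, then apply it with cibadmin --patch. This is an assumption-laden sketch rather than the provider's actual implementation. The function name and the CIB_DRY_RUN switch are hypothetical, and the real commands require a node with Pacemaker installed.

```shell
# Sketch only: apply a CIB change as an XML patch.
# Usage: cib_apply_patch <original.xml> <desired.xml> <patch-file>
# With CIB_DRY_RUN set (hypothetical switch), the commands are printed, not run.
cib_apply_patch() {
    cap_original="$1"   # file holding the current CIB XML
    cap_desired="$2"    # file holding the desired CIB XML
    cap_patch="$3"      # where the generated XML patch is written

    if [ -n "${CIB_DRY_RUN:-}" ]; then
        echo "crm_diff --original $cap_original --new $cap_desired > $cap_patch"
        echo "cibadmin --patch --xml-file $cap_patch"
        return 0
    fi

    # crm_diff may exit non-zero when the inputs differ, so its status is not checked.
    crm_diff --original "$cap_original" --new "$cap_desired" > "$cap_patch"

    # Apply the patch to the live CIB; the exit code of cibadmin is returned.
    cibadmin --patch --xml-file "$cap_patch"
}
```

Running it with CIB_DRY_RUN=1 first is a cheap way to review what would be applied before changing a live cluster.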


Work Items

Mandatory items:

  • Replace Corosync 1.0 with Corosync 2.0
  • Synchronize corosync manifest with puppetlabs
  • Refactor puppet service core provider. It should:
    • Disable systemd/Upstart/System V when corosync system provider is enabled
  • Redesign Puppet manifests to start all OCF scripts via a wrapper

Permissive items:

  • Add OpenStack services to Pacemaker
  • Configure ordering between services in Pacemaker
  • Configure monit for OpenStack services on compute nodes


Dependencies

  • Corosync 2.x packages
  • Monit packages


Testing

  • Standard swarm testing is required.
  • Manual HA testing is required.
  • Rally testing is preferred but not mandatory.
  • The new control plane for OpenStack services requires manual testing.
  • New debug wrappers for OCF require manual testing.

Acceptance criteria

  • OpenStack clouds deployed by Fuel pass OSTF tests with Corosync 2.0 and the new Pacemaker/monit control plane for services, if any.
  • Debug wrappers for OCF produce enough information without being too verbose.
  • VIP resources do not contain an _old postfix in their names.
  • The Upstart/System V control plane is disabled for services managed via Pacemaker OCF.

Documentation Impact

  • The High Availability guide should be reviewed. For Ubuntu, the crm tool stays as is, but the documentation should also be enhanced with pcs equivalents for CentOS
  • Upgrade/patching impact should be described: upgrading to Corosync 2.0 assumes full downtime for the cloud
  • Changes to the OCF debugging approach with bash wrappers should be described
  • Renaming of VIP resources should be mentioned
  • If OpenStack services become managed by Pacemaker and monit, the related changes to their new control plane should be described