Improve Corosync and Pacemaker management
https://blueprints.launchpad.net/fuel/+spec/pacemaker-improvements
The next iteration of Corosync and Pacemaker improvements, driven by scaling
requirements, better Pacemaker management, and support for new OSes.
Problem description
The current Pacemaker implementation has some limitations:
- Does not allow deploying a large number of OpenStack Controllers
- Operations on the CIB use almost 100% of the CPU on the Controller
- The Corosync shutdown process takes a long time
- No support for newer OSes such as CentOS 7 or Ubuntu 14.04
- The current Fuel architecture is limited to Corosync 1.x and Pacemaker 1.x
- The Puppet service provider for Pacemaker doesn't disable Upstart or System V
services by default
- In the current implementation, ordering between resources is not specified
- Diff operations against the Corosync CIB require saving data to a file rather
than keeping all data in memory
- The debugging process for OCF scripts is not unified and requires a lot of
actions from the Cloud Operator
- Pacemaker and Corosync installation is not granular enough
- OpenStack services are not managed by Pacemaker
- Compute nodes aren't in the Pacemaker cluster and hence lack a viable
control plane for their compute/nova services.
Proposed change
- Support Fuel Controllers with Corosync 2.0 packages
- Get the puppet corosync module from puppetlabs and integrate it
- Rename OCF resources. Remove __old from resource names
- Refactor the service provider so that it also disables the same services under
systemd/Upstart/System V
- Refactor the provider and remove the file-based diff operation
- Add a wrapper handler for OCF scripts or unify debug handling of OCF scripts
- Move Pacemaker and Corosync installation to their own stage. Create a separate
corosync.pp to make it more granular
Permissive change:
- Add all OpenStack services to Pacemaker and define the ordering between them
- Use monit as an additional control plane for services on compute nodes
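The unified debug handling for OCF scripts proposed above could look roughly like the following. This is only a sketch in Python for illustration (the spec itself talks about bash wrappers): the function name `run_ocf_action`, the shape of the returned record, and the stderr line limit are all assumptions, not part of Fuel. It relies only on the general OCF conventions that the action is passed as the first argument, resource parameters arrive as `OCF_RESKEY_*` environment variables, and the exit code carries the result.

```python
import os
import subprocess
import sys
import time

# A subset of the standard OCF exit codes.
OCF_CODES = {0: "OCF_SUCCESS", 1: "OCF_ERR_GENERIC", 7: "OCF_NOT_RUNNING"}


def run_ocf_action(script_argv, action, params=None, max_stderr_lines=5):
    """Run one OCF action and return a short, uniform debug record.

    script_argv: command that starts the OCF script (list of strings).
    params: resource parameters, exported as OCF_RESKEY_<name>.
    """
    env = dict(os.environ)
    for name, value in (params or {}).items():
        env["OCF_RESKEY_" + name] = str(value)

    started = time.monotonic()
    proc = subprocess.run(
        list(script_argv) + [action],
        env=env,
        capture_output=True,
        text=True,
    )
    elapsed = time.monotonic() - started

    # Keep only the tail of stderr so the record stays readable.
    stderr_tail = proc.stderr.splitlines()[-max_stderr_lines:]
    return {
        "action": action,
        "rc": proc.returncode,
        "rc_name": OCF_CODES.get(proc.returncode, "OTHER"),
        "seconds": round(elapsed, 3),
        "stderr_tail": stderr_tail,
    }


if __name__ == "__main__":
    # Stand-in for a real OCF script: writes to stderr and exits with 7.
    fake = [sys.executable, "-c",
            "import sys; print('not running', file=sys.stderr); sys.exit(7)"]
    print(run_ocf_action(fake, "monitor"))
```

Truncating stderr to a fixed number of lines is one possible way to keep the output informative without making it too verbose; the right limit would have to be chosen during manual testing.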
Alternatives
None of the changes are critical, and they do not affect deployment or cluster
operation.
Upgrade impact
- Since resources will be renamed, the upgrade process should delete the old
resources on upgrade and delete the new resource names on rollback.
- Corosync 2.x is NOT compatible with previous versions of Corosync (1.3/1.4).
Please make sure to upgrade all nodes at once (full-downtime patching)
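The rename handling for upgrade and rollback described above can be sketched as a pair of small Python helpers. The function names and the sample resource names are hypothetical, and the suffix is left as a parameter because the exact string to strip has to match the deployed resource names. Each pair in a plan reads as (name to delete, name to create).

```python
def strip_suffix(name, suffix="_old"):
    """Return name without the trailing suffix, or unchanged if absent."""
    if suffix and name.endswith(suffix):
        return name[: -len(suffix)]
    return name


def upgrade_plan(existing, suffix="_old"):
    """List (delete, create) pairs for every resource carrying the suffix."""
    return [
        (name, strip_suffix(name, suffix))
        for name in existing
        if suffix and name.endswith(suffix)
    ]


def rollback_plan(plan):
    """Invert an upgrade plan: delete the new name, restore the old one."""
    return [(new, old) for old, new in plan]


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    resources = ["vip__management_old", "vip__public_old", "p_mysql"]
    plan = upgrade_plan(resources)
    print(plan)
    print(rollback_plan(plan))
```

Deriving the rollback from the recorded upgrade plan, rather than from the names found on the cluster, avoids touching resources that never carried the suffix.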
Notifications impact
None
Other end user impact
None
Other deployer impact
None
Developer impact
- The enhanced Pacemaker provider requires some refactoring of the Puppet
manifests in Fuel Library:
- Upstream corosync manifests will replace our in-memory diff invention with the
standard approach: crm, pcs, or cibadmin --patch '<xml patch>' directly.
- Renaming VIP primitives could require additional orchestration refactoring
as well.
- The new Pacemaker/monit control plane for OpenStack services would require
appropriate changes in the manifests as well.
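To make the diff discussion above more concrete, the following Python sketch compares two CIB snapshots held entirely in memory and reports which primitives were added, removed, or changed. It illustrates the idea only: it is neither the Fuel provider's code nor the patch format that cibadmin expects, it looks only at the attributes of each `primitive` element, and the sample XML, provider name, and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed CIB snapshots.
OLD = """<cib><configuration><resources>
  <primitive id="vip__public_old" class="ocf" provider="example" type="ip"/>
  <primitive id="p_mysql" class="ocf" provider="example" type="mysql"/>
</resources></configuration></cib>"""

NEW = """<cib><configuration><resources>
  <primitive id="vip__public" class="ocf" provider="example" type="ip"/>
  <primitive id="p_mysql" class="ocf" provider="example" type="mysql-v2"/>
</resources></configuration></cib>"""


def primitives(cib_xml):
    """Map primitive id -> its attributes, parsed from a CIB XML string."""
    root = ET.fromstring(cib_xml)
    return {p.get("id"): dict(p.attrib) for p in root.iter("primitive")}


def diff_primitives(old_xml, new_xml):
    """Compare two CIB snapshots in memory; no temporary files are written."""
    old, new = primitives(old_xml), primitives(new_xml)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }


if __name__ == "__main__":
    print(diff_primitives(OLD, NEW))
```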
Implementation
Work Items
Mandatory items:
- Replace Corosync 1.0 with Corosync 2.0
- Synchronize corosync manifest with puppetlabs
- Refactor the Puppet service core provider. It should:
- Disable systemd/Upstart/System V when the corosync system
provider is enabled
- Redesign Puppet manifests to start all OCF scripts via a
wrapper
Permissive items:
- Add OpenStack services to Pacemaker
- Configure ordering between services in Pacemaker
- Configure monit for the OpenStack services on compute nodes
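Ordering between services can be thought of as a set of pairwise "start this first, then that" constraints. The Python sketch below, which needs Python 3.9 or later for `graphlib`, derives one start sequence that satisfies such pairs and fails on contradictory ones. It only models the idea and does not generate Pacemaker configuration; the service names and the `start_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def start_order(constraints):
    """Return one start sequence that satisfies all (first, then) pairs."""
    graph = {}
    for first, then in constraints:
        graph.setdefault(first, set())
        # TopologicalSorter expects node -> set of predecessors.
        graph.setdefault(then, set()).add(first)
    # static_order() raises graphlib.CycleError on contradictory constraints.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    pairs = [
        ("p_mysql", "p_keystone"),
        ("p_rabbitmq", "p_nova_api"),
        ("p_keystone", "p_nova_api"),
    ]
    print(start_order(pairs))
```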
Dependencies
- Corosync 2.x packages
- Monit packages
Testing
- Standard swarm testing is required.
- Manual HA testing is required.
- Rally testing is preferred but not mandatory.
- The new control plane for OpenStack services requires manual testing.
- New debug wrappers for OCF require manual testing.
Acceptance criteria
- OpenStack clouds deployed by Fuel pass OSTF tests with
Corosync 2.0 and the new Pacemaker/monit control plane for services,
if any.
- Debug wrappers for OCF produce enough information without being too
verbose.
- VIP resources do not contain an _old suffix in their names.
- The Upstart/System V control plane is disabled for services managed via
Pacemaker OCF.
Documentation Impact
- The High Availability guide should be reviewed. For Ubuntu, the crm tool stays
as is, but the documentation should also be enhanced with pcs
equivalents for CentOS
- The upgrade/patching impact should be described: a Corosync 2.0 upgrade
assumes full downtime for the cloud
- Changes to the OCF debugging approach with bash wrappers should be described
- Renaming of VIP resources should be mentioned
- If OpenStack services become managed by Pacemaker + monit, the related
changes to their new control plane should be described
References