100 nodes support (fuel only)

https://blueprints.launchpad.net/fuel/+spec/100-nodes-support

Fuel is an enterprise tool for deploying OpenStack, so it should be able to deploy large clusters. Fuel should also be fast and responsive. It does not run any processor-intensive tasks, so there is no reason for it to be slow.

Problem description

  • With a large number of nodes, Fuel (nailgun, astute) becomes slow.
  • The probability of a provisioning failure also increases.
  • The MySQL DB works only in active/standby mode, which has very poor performance.

Proposed change

For nailgun

As a first step, it is necessary to write tests that reveal which parts of the code are not optimal. Some of the slow parts are already known. These tests should include (all in fake mode):

  • list 100 nodes
  • get cluster with 100 nodes
  • add 100 nodes to environment
  • remove 100 nodes from environment
  • run network verification for environment with 100 nodes
  • change settings in environment with 100 nodes
  • change network configuration in environment with 100 nodes
  • run deploy in environment with 100 nodes
  • run provision in environment with 100 nodes
  • ...

To detect specific code that is slow, run all of the tests above, measure their execution time, and compare it to the specification to see which operations are actually slow. Then run those operations under a profiler and analyse and fix all bottlenecks, non-optimal code, etc. The following tools may be used to measure and profile the code:

  • cProfile - Python module
  • osprofiler - Python module
  • Rally - testing framework
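The measure-then-profile approach described above can be sketched with the standard-library cProfile and pstats modules. This is only an illustration: `list_nodes`, `timed_profile`, and `TIME_BUDGET_SECONDS` are hypothetical names, and the one-second budget is an assumption, since this spec does not state concrete time limits. The measured time includes profiler overhead, so a real test would likely time the operation separately from the profiling run.

```python
import cProfile
import io
import pstats
import time


def list_nodes(count):
    """Hypothetical stand-in for the operation under test (e.g. listing nodes)."""
    return [{"id": i, "name": "node-%d" % i, "status": "discover"}
            for i in range(count)]


def timed_profile(func, *args, **kwargs):
    """Run func under cProfile; return (result, elapsed_seconds, report_text)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    elapsed = time.perf_counter() - start

    # Render the ten most expensive entries by cumulative time into a string.
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, elapsed, buf.getvalue()


# Assumed budget for illustration only; the spec gives no number.
TIME_BUDGET_SECONDS = 1.0

nodes, elapsed, report = timed_profile(list_nodes, 100)
if elapsed > TIME_BUDGET_SECONDS:
    # Print the profile only when the operation exceeds its budget.
    print(report)
```

Printing the profile only when the budget is exceeded keeps CI output short while still pointing at the slow call paths when a regression appears.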

For fuelclient

There should not be any performance bottlenecks in fuelclient, since it only parses JSON data. Tests for fuelclient should include at least:

  • list nodes
  • add nodes to environment
  • list environment with pending changes for 100 nodes
  • upload nodes from disk

For astute

Testing astute is harder because it involves interaction with hardware and other services such as cobbler, tftp, and dhcp. There is one known problem which can be addressed now. The remaining problems can be identified after testing on real hardware.

The known problem is related to the network and storage capacity of the Fuel Master node. When 100 nodes simultaneously try to fetch images and packages during provisioning, the Master node cannot handle the load. Astute should detect this situation and handle it. The user should also be able to tune astute manually, for example by configuring it to provision 10 nodes at a time. This will increase provisioning time but make provisioning more resilient. There should be a configuration option that sets the number of nodes to deploy in one run.
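The batching idea can be sketched as follows. Astute itself is written in Ruby; Python is used here purely for illustration, and `split_into_batches`, `provision_in_batches`, and the `provision_batch` callable are hypothetical names, not existing astute APIs.

```python
def split_into_batches(nodes, batch_size):
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def provision_in_batches(nodes, batch_size, provision_batch):
    """Provision nodes one batch at a time.

    provision_batch is a caller-supplied callable (hypothetical) that
    provisions a list of nodes and returns one result per node.
    """
    results = []
    for batch in split_into_batches(nodes, batch_size):
        # Only batch_size nodes hit the Master node at the same time.
        results.extend(provision_batch(batch))
    return results


# Example: 100 nodes, 10 at a time, with a dummy provisioner.
all_nodes = list(range(100))
outcome = provision_in_batches(all_nodes, 10, lambda batch: [(n, "ok") for n in batch])
```

With a batch size of 10, at most 10 nodes fetch images and packages from the Master node concurrently, at the cost of running the batches one after another.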

Currently, if provisioning fails on one of the nodes, astute stops the whole process. This is not optimal for larger deployments. Some nodes may fail because of random failures, and provisioning should still continue in this case. Provisioning will not be restarted for failed nodes. These nodes will be removed from the cluster. The user can re-add them to the cluster after a successful deployment. There should be a configuration option that sets the percentage of nodes which can fail during provisioning. If, for example, all controllers fail to provision, provisioning should be stopped. The user should be notified about each failure.
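One possible reading of the failure rules above is sketched below, again in Python for illustration only. The function name `should_continue`, the `max_failed_percent` parameter, and the node dictionary shape (`id`, `roles`) are assumptions made for this example, not part of astute.

```python
def should_continue(nodes, failed_ids, max_failed_percent):
    """Decide whether provisioning may continue after some nodes failed.

    nodes: list of dicts with "id" and "roles" keys (hypothetical shape).
    failed_ids: ids of nodes that failed to provision.
    max_failed_percent: allowed percentage of failed nodes (assumed option).
    """
    if not nodes:
        return True

    known_ids = {n["id"] for n in nodes}
    failed = set(failed_ids) & known_ids

    # Rule 1: stop when too large a share of the nodes has failed.
    failed_percent = 100.0 * len(failed) / len(nodes)
    if failed_percent > max_failed_percent:
        return False

    # Rule 2: stop when every controller has failed, regardless of the share.
    controllers = [n["id"] for n in nodes if "controller" in n["roles"]]
    if controllers and all(c in failed for c in controllers):
        return False

    return True


# Example cluster: 3 controllers and 97 compute nodes.
cluster = ([{"id": i, "roles": ["controller"]} for i in range(3)] +
           [{"id": i, "roles": ["compute"]} for i in range(3, 100)])
```

The controller check is separate from the percentage check because losing all three controllers is only 3% of this cluster, yet it leaves nothing usable to deploy.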

For UI

Our tests show that UI speed is acceptable for 100 nodes. In the future, for 1000 nodes, it will require some speed improvements.

For puppet manifests library

Configure HAProxy MySQL backends as active/active. There is a patch, https://review.openstack.org/#/c/124549/, addressing this change, but it requires further research and load testing.

Alternatives

None

Data model impact

Depends on bottlenecks found, but unlikely.

REST API impact

No API changes. All optimizations must be backward compatible.

Upgrade impact

Only if the database is changed, which is unlikely.

Security impact

None

Notifications impact

If any nodes fail, the user should be informed.

Other end user impact

None

Performance Impact

After the blueprint is implemented, Fuel should be able to deploy 100 nodes. Active/active load balancing for MySQL connections should improve the performance of DB operations.

Other deployer impact

Deployment rules will change. Provisioning can now continue even if some nodes fail.

Developer impact

None

Implementation

Assignee(s)

Primary assignee:
loles@mirantis.com ksambor@mirantis.com

Work Items

Blueprint will be implemented in several stages:

  • In the first stage, all tests will be written.
  • In the next stage, all known and discovered bottlenecks will be fixed.
  • After this, the tests will be run in a virtual environment which can create 100 nodes.
  • Finally, the tests will be run in a lab with 100 physical nodes. This test should reveal all astute bottlenecks.
  • To prevent bottlenecks from being reintroduced in future releases, all tests will be integrated with our CI infrastructure.
  • Additional integration with OSProfiler. It can help find bottlenecks in production systems.
  • Additional integration with Rally. It will help to test Fuel in a real live environment.
  • Additional Neutron load testing with Rally in HA for active/active MySQL. Even if active/active fails the testing, we can at least experiment with tuning the related parameters and share the results with the community.

Dependencies

None

Testing

When all bottlenecks are fixed, load tests will be added to the CI infrastructure so that non-optimal code is noticed immediately.

Documentation Impact

Deployment rules will change, and this should be documented. New notifications should be described. Active/active mode for MySQL should be documented.