https://blueprints.launchpad.net/fuel/+spec/100-nodes-support
Fuel is an enterprise tool for deploying OpenStack, so it should be able to deploy large clusters. Fuel should also be fast and responsive: it does not run any processor-intensive tasks, so there is no reason for it to be slow.
The first step is to write tests that expose the parts of the code that are not optimal. Some of the slow parts are already known. These tests should include (all in fake mode):
To detect the specific code that is slow, run all of the above tests, measure their execution time, and compare the results against the specification to see which operations are actually slow. Then run those operations under a profiler, analyse the results, and fix all bottlenecks, non-optimal code, etc. The following tools may be used to measure and profile the code:
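One possible choice, assumed here for illustration rather than taken from this spec, is Python's standard-library `time` and `cProfile` modules. The sketch below times a single call, compares it with a limit, and then profiles the same call. `list_nodes` and `TIME_LIMIT_SECONDS` are hypothetical stand-ins for a real operation under test and its specified limit.

```python
import cProfile
import io
import pstats
import time


def measure(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def profile(func, *args, top=5, **kwargs):
    """Run func under cProfile; return (result, text report of top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


def list_nodes(count):
    # Hypothetical stand-in for the operation being measured.
    return [{"id": i, "status": "ready"} for i in range(count)]


TIME_LIMIT_SECONDS = 1.0  # hypothetical limit taken from a specification

nodes, elapsed = measure(list_nodes, 100)
within_limit = elapsed <= TIME_LIMIT_SECONDS
_, report = profile(list_nodes, 100)
```

Operations for which `within_limit` is false are the ones worth profiling first; `report` then shows where the time is spent, sorted by cumulative time.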
There should not be any performance bottlenecks in fuelclient, because it only parses JSON data. There should still be tests for fuelclient, which should include at least:
Testing astute is harder because it involves interaction with hardware and with other services such as cobbler, tftp, and dhcp. There is one known problem that can be addressed now. The rest of the problems can be identified after testing on real hardware.
One known problem is related to the network and storage capacity of the Fuel Master node. When 100 nodes simultaneously try to fetch images and packages during provisioning, the Master node cannot handle the load. Astute should detect this situation and handle it. The user should also be able to tune astute's behaviour manually, for example by configuring it to provision 10 nodes at a time. This will increase provisioning time but will make provisioning more resilient. There should be a configuration option to set the number of nodes to deploy in one run.
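The batching idea can be sketched as follows. This is only an illustration of the logic, not astute's actual code: the function names, the `batch_size` parameter, and the `provision_batch` callback are all hypothetical.

```python
def split_into_batches(nodes, batch_size):
    """Split nodes into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def provision_in_batches(nodes, batch_size, provision_batch):
    """Provision one batch at a time and return the number of batches run.

    provision_batch is a hypothetical callback that is assumed to block
    until every node in the batch has finished provisioning.
    """
    batches = split_into_batches(nodes, batch_size)
    for batch in batches:
        provision_batch(batch)
    return len(batches)


# Example: 100 nodes provisioned 10 at a time.
node_ids = list(range(1, 101))
provisioned = []
batch_count = provision_in_batches(node_ids, 10, provisioned.extend)
```

With a batch size of 10, at most 10 nodes fetch images and packages from the Master node at once, at the cost of running 10 sequential rounds instead of one.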
Currently, if provisioning fails on one of the nodes, astute stops the whole process. This is not optimal for larger deployments. Some nodes may fail because of random failures, and provisioning should still continue in this case. Provisioning will not be restarted for failed nodes. These nodes will be removed from the cluster, and the user can re-add them after a successful deployment. There should be a configuration option to set the percentage of nodes that may fail during provisioning. If, for example, all controllers fail to provision, provisioning should be stopped. The user should be notified about each failure.
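A minimal sketch of this decision rule is shown below. It assumes each node is a dictionary with `id` and `roles` keys and that the threshold is given as `max_failed_percent`; these names are hypothetical and do not come from astute or nailgun.

```python
def should_continue(nodes, failed_ids, max_failed_percent):
    """Return True if provisioning may continue despite the failed nodes.

    Provisioning stops when the share of failed nodes exceeds
    max_failed_percent, or when every controller has failed.
    """
    if not nodes:
        return True
    # Ignore failure reports for nodes that are not part of this cluster.
    failed = {node["id"] for node in nodes} & set(failed_ids)
    failed_percent = 100.0 * len(failed) / len(nodes)
    if failed_percent > max_failed_percent:
        return False
    controllers = [n["id"] for n in nodes if "controller" in n["roles"]]
    if controllers and all(c in failed for c in controllers):
        return False
    return True


# Example cluster: 3 controllers (ids 1-3) and 97 compute nodes.
cluster = [
    {"id": i, "roles": ["controller"] if i <= 3 else ["compute"]}
    for i in range(1, 101)
]
few_computes_failed = should_continue(cluster, {10, 11}, 5)
all_controllers_failed = should_continue(cluster, {1, 2, 3}, 5)
too_many_failed = should_continue(cluster, set(range(10, 16)), 5)
```

In this example, two failed compute nodes stay under the 5% limit and provisioning continues. Losing all three controllers stops it even though only 3% of the nodes failed, and six failed compute nodes stop it because the limit is exceeded.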
Our tests show that UI speed is acceptable for 100 nodes. In the future, 1000 nodes will require some speed improvements.
Configure HAProxy MySQL backends as active/active. There is a patch, https://review.openstack.org/#/c/124549/, that addresses this change, but it requires additional research and load testing.
None
Depends on the bottlenecks found, but unlikely.
No API changes. All optimizations must be backward compatible.
Only if the database is changed, but unlikely.
None
If there are failed nodes, the user should be informed about them.
None
After the blueprint is implemented, Fuel should be able to deploy 100 nodes. Active/active load balancing for MySQL connections should improve DB operations.
The rules will change: some nodes are now allowed to fail.
None
The blueprint will be implemented in several stages:
None
When all bottlenecks are fixed, a load test will be added to the CI infrastructure so that non-optimal code is noticed immediately.
Deployment rules will change, and this should be documented. New notifications should be described. Active/active mode for MySQL should be documented.