This blueprint is a continuation of the blueprint “100 nodes support” from release 6.0.
For a large number of nodes, the probability of failure during the provision and deploy stages increases. If nodes fail to provision, deployment cannot continue. For large environments, network verification also takes a long time and may time out.
In the previous release, performance tests were added to nailgun to expose bottlenecks, and the biggest issues were fixed. During this release more tests will be added. For example:
Execution of the ClusterChangeHandler handler, which takes too much time, will be moved to the background because it is hard to optimize.
Graphs will be added to the CI job to show how performance changes between commits.
One known problem concerns the network and storage capacity of the Fuel Master node. When 200 nodes simultaneously try to fetch images and packages during provisioning, the Master node cannot handle the load. Astute should detect this situation and handle it. The user should also be able to tune Astute manually, for example by configuring it to provision 50 nodes at a time. This increases provisioning time but makes provisioning more resilient. There should be a configuration option that sets the number of nodes to deploy in one run.
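The batching idea can be sketched as follows. This is a minimal illustration in Python, not Astute's implementation (Astute itself is written in Ruby); `chunked`, `provision_in_batches` and the `provision_batch` callable are hypothetical names, and `batch_size` stands for the proposed configuration option.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(nodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


def provision_in_batches(
    nodes: Sequence[str],
    batch_size: int,
    provision_batch: Callable[[List[str]], None],
) -> int:
    """Provision nodes one batch at a time and return the number of batches.

    provision_batch is assumed to block until its batch has finished, so the
    Master node never serves more than batch_size nodes at once.
    """
    batches = 0
    for batch in chunked(nodes, batch_size):
        provision_batch(batch)
        batches += 1
    return batches
```

With 200 nodes and a batch size of 50, the Master node serves four batches in sequence instead of 200 simultaneous downloads.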
Some nodes may fail because of random failures; provisioning should still continue in this case. Provisioning will not be restarted for failed nodes. These nodes will have their status set to error. The user can re-provision these nodes after a successful deployment. There should be a configuration option that sets the percentage of nodes that may fail during provisioning. The user should be notified about each failure. The same applies to the deploy stage.
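The failure threshold could be evaluated along these lines. This is a sketch under assumptions: `can_continue` and `max_failed_percent` are hypothetical names for the check and the proposed configuration option, and treating the limit as inclusive is a choice made here, not something the blueprint specifies.

```python
def can_continue(total_nodes: int, failed_nodes: int,
                 max_failed_percent: float) -> bool:
    """Return True while the share of failed nodes is within the limit.

    The limit is treated as inclusive (an assumption): with 200 nodes and a
    limit of 5 percent, up to 10 failed nodes are tolerated.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    # Compare cross-multiplied values to avoid division rounding issues.
    return failed_nodes * 100 <= max_failed_percent * total_nodes
```

The same check would apply unchanged to the deploy stage, with the failed nodes marked as error and reported to the user in both cases.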
Another problem is network verification, which takes a long time for 100 nodes. Currently, connectivity between nodes is checked on one node at a time. The check should be parallelized to make it faster, while remaining backward compatible.
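A parallel check might be structured as in the sketch below. It is illustrative only: `verify_parallel` and the `probe` callable (assumed to return True when the source node can reach the destination node) are hypothetical and do not describe the existing verification code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def verify_parallel(
    nodes: Sequence[str],
    probe: Callable[[str, str], bool],
    max_workers: int = 10,
) -> Dict[str, List[str]]:
    """Return, for each node, the list of peers it could not reach.

    Each node's checks run as one task in a thread pool, so up to
    max_workers nodes are verified at the same time.
    """
    def check(node: str) -> List[str]:
        return [peer for peer in nodes
                if peer != node and not probe(node, peer)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with nodes.
        results = list(pool.map(check, nodes))
    return dict(zip(nodes, results))
```

Setting `max_workers=1` reproduces the current one-node-at-a-time behaviour, which is one way to keep the change backward compatible.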
Depends on bottlenecks found, but unlikely.
No API changes. All optimizations must be backward compatible.
Only if the database is changed, which is unlikely.
If there are failed nodes, the user should be informed about them.
After the blueprint is implemented, Fuel should be able to deploy 200 nodes.
The rules will change: some nodes are now allowed to fail.
Blueprint will be implemented in several stages:
More load tests will be added to the CI infrastructure so that non-optimal code is noticed immediately.
Changes to provisioning and deployment should be documented.