200 nodes support¶

https://blueprints.launchpad.net/fuel/+spec/200-nodes-support

This blueprint is a continuation of the blueprint “100 nodes support”[1] from release 6.0.

Problem description¶

As the number of nodes grows, the probability of a failure during the provision and deploy stages increases. If nodes fail to provision, deployment cannot continue. For large environments, network verification also takes a long time and may time out.

Proposed change¶

For nailgun¶

In the previous release, some performance tests[2] were added to nailgun to expose bottlenecks, and the biggest issues were fixed. During this release more tests will be added. For example:

Integration tests:

add 100 nodes, deploy, add 100 nodes, deploy
add 100 nodes, deploy cluster, stop deployment, deploy cluster

Unit tests:

Tests for handler ProvisionSelectedNodes
Tests for handler NodeGroupCollectionHandler
Tests for handler NodeCollectionNICsDefaultHandler
Check how NotificationCollectionHandler works with a large number of notifications

Execution of the ClusterChangeHandler handler, which takes too much time and is hard to optimize, will be moved to the background.

Graphs will be added to the CI job to show how performance changes between commits.

For astute¶

One known problem is related to the network and storage capabilities of the Fuel Master node. When 200 nodes simultaneously try to fetch images and packages during provisioning, the Master node cannot handle the load. Astute should detect this situation and handle it. The user should also be able to tune astute manually, for example by configuring it to provision 50 nodes at a time. This will increase provisioning time but make provisioning more resilient. There should be a configuration option to set the number of nodes to deploy in one run.
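The chunking idea can be sketched as follows. Astute itself is not written in Python, so this is only an illustration of the batching logic under stated assumptions: `chunked`, `provision_in_chunks` and the `provision_batch` callback are hypothetical names, and the callback is assumed to block until its batch has finished.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(nodes: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `chunk_size` nodes."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(nodes), chunk_size):
        yield list(nodes[start:start + chunk_size])


def provision_in_chunks(
    nodes: Sequence[str],
    chunk_size: int,
    provision_batch: Callable[[List[str]], None],
) -> int:
    """Provision nodes one batch at a time; return the number of batches run."""
    batches = 0
    for batch in chunked(nodes, chunk_size):
        # Assumed to block until every node in the batch is done, so the
        # master node never serves more than `chunk_size` nodes at once.
        provision_batch(batch)
        batches += 1
    return batches
```

With 200 nodes and a chunk size of 50, the callback is invoked four times, each time with 50 nodes.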

Some nodes may fail because of random failures; provisioning should still continue in this case. Provisioning will not be restarted for failed nodes. These nodes will have their status set to error. The user can re-provision these nodes after a successful deployment. There should be a configuration option to set the percentage of nodes that can fail during provisioning. The user should be notified about each failure. The same applies to the deploy stage.
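A minimal sketch of the failure-threshold check follows. The function names and the `max_failed_percent` parameter are hypothetical, and the sketch assumes that a failure count exactly equal to the configured percentage is still acceptable; the blueprint does not specify that boundary.

```python
from typing import Dict, List, Tuple


def exceeds_failure_limit(
    total_nodes: int, failed_nodes: int, max_failed_percent: float
) -> bool:
    """Return True when more than `max_failed_percent` of the nodes failed."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    # Compare without division to avoid rounding surprises.
    return failed_nodes * 100 > max_failed_percent * total_nodes


def summarize(
    results: Dict[str, bool], max_failed_percent: float
) -> Tuple[List[str], bool]:
    """Return the failed node names and whether the run may continue.

    `results` maps a node name to True (provisioned) or False (failed).
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    may_continue = not exceeds_failure_limit(
        len(results), len(failed), max_failed_percent
    )
    return failed, may_continue
```

The list of failed nodes returned by `summarize` is what would be marked with the error status and reported to the user, one notification per node.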

Another problem is related to network verification, which takes a long time for 100 nodes. Currently connectivity between nodes is checked on one node at a time. It should be parallelized to make it faster, but it should also remain backward compatible.
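One way to picture the parallelization is a worker pool that runs the per-node check concurrently. This is a sketch under stated assumptions, not the actual verification code: `verify_in_parallel` and the `check_node` callback are hypothetical, and the callback is assumed to return True when the node passes its connectivity check.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def verify_in_parallel(
    nodes: Iterable[str],
    check_node: Callable[[str], bool],
    max_workers: int = 16,
) -> Dict[str, bool]:
    """Run `check_node` for every node using up to `max_workers` threads."""
    node_list = list(nodes)
    if not node_list:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input.
        results = list(pool.map(check_node, node_list))
    return dict(zip(node_list, results))
```

Setting `max_workers=1` checks nodes one at a time, which is one possible way to keep the old sequential behaviour available for backward compatibility.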

Alternatives¶

None

Data model impact¶

Depends on the bottlenecks found, but changes are unlikely.

REST API impact¶

No API changes. All optimizations have to be backward compatible.

Upgrade impact¶

Only if the database is changed, which is unlikely.

Security impact¶

None

Notifications impact¶

If there are failed nodes, the user should be informed about them.

Other end user impact¶

None

Performance Impact¶

After the blueprint is implemented, Fuel should be able to deploy 200 nodes.

Other deployer impact¶

Deployment rules will change: some nodes are now allowed to fail.

Developer impact¶

None

Implementation¶

Assignee(s)¶

Primary assignee: loles@mirantis.com

Work Items¶

The blueprint will be implemented in several stages:

Allow provisioning to run in chunks
Improve network verification performance
Allow some nodes to fail during provisioning and deployment
Write new nailgun performance tests

Dependencies¶

None

Testing¶

More load tests will be added to the CI infrastructure, so that non-optimal code is noticed immediately.

Acceptance criteria¶

Nailgun performance jobs on CI are passing
Deployment of a 10-node cluster succeeds even when one node fails to provision
No more than 50 nodes are simultaneously provisioned when default settings are used
Network verification does not time out when testing 200 nodes

Documentation Impact¶

Changes to provisioning and deployment should be documented.
