Retry of all OpenStack clients calls

https://blueprints.launchpad.net/sahara/+spec/clients-calls-retry

This specification proposes to add ability of retrying OpenStack clients calls in case of occasional errors occurrence.

Problem description

Sahara uses a bunch of OpenStack clients to communicate with other OpenStack services. Sometimes during this clients calls can be occurred occasional errors that lead to Sahara errors as well. If you make a lot of calls, it may not be surprising if one of them doesn’t respond as it should - especially for a service under heavy load.

You make a valid call and it returns a 4xx or 5xx error. You make the same call again a moment later, and it succeeds. To prevent such kind of failures, all clients calls should be retried. But retries should be done only for certain error codes, because not all of the errors can be avoided just with call repetition.

Proposed change

Swift client provides the ability of calls retry by its own. So, only number of retries and retry_on_ratelimit flag should be set during client initialisation.

Neutron client provides retry ability too, but repeats call only if ConnectionError occurred.

Nova, Cinder, Heat, Keystone clients don’t offer such functionality at all.

To retry calls execute_with_retries(method, *args, **kwargs) method will be implemented. If after execution of given method (that will be passed with first param), error occurred, its http_status will be compared with http statuses in the list of the errors, that can be retried. According to that, client call will get another chance or not.

There is a list of errors that can be retried:

  • REQUEST_TIMEOUT (408)

  • OVERLIMIT (413)

  • RATELIMIT (429)

  • INTERNAL_SERVER_ERROR (500)

  • BAD_GATEWAY (502)

  • SERVICE_UNAVAILABLE (503)

  • GATEWAY_TIMEOUT (504)

Number of times to retry the request to clients before failing will be taken from retries_number config value (5 by default).

Time between retries will be configurable (retry_after option in config) and equal to 10 seconds by default. Additionally, Nova client provides retry_after field in OverLimit and RateLimit error classes, that can be used instead of config value in this case.

These two config options will be under timeouts config group.

All clients calls will be replaced with execute_with_retries wrapper. For example, instead of the following method call

nova.client().images.get_registered_image(id)

it will be

execute_with_retries(nova.client().images.get_registered_image, id)

Alternatives

None

Data model impact

None

REST API impact

None

Other end user impact

None

Deployer impact

None

Developer impact

None

Sahara-image-elements impact

None

Sahara-dashboard / Horizon impact

None

Implementation

Assignee(s)

Primary assignee:

apavlov-n

Work Items

  • Adding new options to Sahara config;

  • execute_with_retries method implementation;

  • Replacing OpenStack clients call with execute_with_retries method.

Dependencies

None

Testing

Unit tests will be added. They will check that only specified errors will be retried

Documentation Impact

None

References

None