Asynchronous Container Operations

Launchpad blueprint:

https://blueprints.launchpad.net/magnum/+spec/async-container-operations

At present, container operations are executed synchronously, end-to-end. This model does not scale well, and it forces the client to block until the operation completes.

Problem Description

At present, Magnum-Conductor executes the container operation as part of processing the request forwarded from Magnum-API. For container-create, if the image needs to be pulled, the operation may take a while depending on the responsiveness of the registry, which can add a substantial delay. At the same time, experiments suggest that even for a pre-pulled image, the time taken by each operation, namely create/start/delete, is of the same order, as each involves a complete round trip between the magnum-client and the COE-API, via Magnum-API and Magnum-Conductor[1].

Use Cases

For wider enterprise adoption of Magnum, we need it to scale better. For that, we need to replace some of these synchronous behaviors with suitable asynchronous alternatives.

To understand the use case better, we can look at the average time spent during container operations, as noted in [1].

Proposed Changes

The design has been discussed over the ML[6]. The conclusions have been kept on the ‘whiteboard’ of the Blueprint.

The amount of code change is expected to be significant. To ease adoption, code review, and functional testing, a phased implementation may be required. We can define the scope of the three phases of the implementation as follows -

  • Phase-0 will bring in the basic feature of asynchronous mode of operation in Magnum - (A) from API to Conductor and (B) from Conductor to COE-API. During phase-0, this mode will be optional through configuration.

    Both (A) and (B) are proposed to be made asynchronous to get the full benefit. Doing (A) alone does not gain us much, as (B) takes up the larger share of the operation's time. Doing (B) alone does not make sense, as (A) would wait synchronously for no meaningful data.

  • Phase-1 will concentrate on making the feature persistent to address various scenarios of conductor restart, worker failure etc. We will support this feature for multiple Conductor-workers in this phase.

  • Phase-2 will select asynchronous mode of operation as the default mode. At the same time, we can evaluate dropping the code for synchronous mode, too.

Phase-0 is required as a meaningful temporary step, to establish the importance and tangible benefits of phase-1. This is also to serve as a proof-of-concept at a lower cost of code changes with a configurable option. This will enable developers and operators to have a taste of the feature, before bringing in the heavier dependencies and changes proposed in phase-1.

A reference implementation for the phase-0 items has been put up for review[2].

Following is the summary of the design -

1. Configurable mode of operation - async

For ease of adoption, the asynchronous mode of communication between API and conductor, and between conductor and COE, can be controlled using a configuration option. The code paths for sync mode and async mode would therefore co-exist for now. To achieve this with minimal or no code duplication and a cleaner interface, we are using openstack/futurist[4]. The futurist interface hides the type of executor being used. With the async configuration, a green thread pool of the configured size gets created. Here is a sample of how the config would look:

[DEFAULT]
async_enable = False

[conductor]
async_threadpool_max_workers = 64

The futurist library is used in oslo.messaging and is thus, in effect, used by almost all OpenStack projects. Futurist is useful for running the same code under different execution models, which saves potential duplication of code.
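As a rough sketch of the idea, the snippet below selects an executor from a configuration flag while the calling code stays the same in both modes. It uses only the Python standard library so that it runs anywhere: `InlineExecutor` is a hypothetical stand-in for a synchronous executor such as futurist's `SynchronousExecutor`, and `ThreadPoolExecutor` stands in for the green thread pool that the async configuration would create. The names `make_executor`, `container_create` and `run` are illustrative assumptions, not Magnum code.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class InlineExecutor:
    """Hypothetical synchronous executor: runs the callable inside submit()."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:  # surface the failure through the future
            future.set_exception(exc)
        return future

    def shutdown(self, wait=True):
        pass  # nothing to release; work already ran inline


def make_executor(async_enable, max_workers=64):
    """Return a pool when async mode is on, an inline executor otherwise."""
    if async_enable:
        return ThreadPoolExecutor(max_workers=max_workers)
    return InlineExecutor()


def container_create(name):
    # Placeholder for the real work done against the COE API.
    return "created %s" % name


def run(async_enable):
    executor = make_executor(async_enable)
    # The submitting code is identical in both modes.
    future = executor.submit(container_create, "web-1")
    result = future.result()
    executor.shutdown(wait=True)
    return result
```

Because both executors expose the same `submit()` interface and return a future, the caller does not need separate code paths for the two modes.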

2. Type of operations

There are two classes of container operations. The first can be made async, namely create/delete/start/stop/pause/unpause/reboot, as these do not need data about the container in return. The second requires data in return, namely container-logs. For async-type container operations, magnum-API will use ‘cast’ instead of ‘call’ from oslo_messaging[5].

‘cast’ from oslo.messaging.rpcclient is used to invoke a method and return immediately, whereas ‘call’ invokes a method and waits for a reply. While operating in asynchronous mode, it is intuitive to use cast method, as the result of the response may not be available immediately.

Magnum-api first fetches the details of a container by doing ‘get_rpc_resource’. This function uses magnum objects and hence uses a ‘call’ method underneath. Once magnum-api gets back the details, it issues the container operation using another ‘call’ method. The above proposal is to replace the second ‘call’ with ‘cast’.

If a user issues a container operation when there is no listening conductor (because of process failure), there will be an RPC timeout at the first ‘call’ method. In this case, the user will observe the request blocking at the client and finally failing with an HTTP 500 error after the RPC timeout, which is 60 seconds by default. This behavior is independent of the usage of ‘cast’ or ‘call’ for the second message, mentioned above. This behavior does not influence our design, but it is documented here for clarity of understanding.
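The difference between the two invocation styles can be illustrated with a toy, in-process model. This is only a sketch: `ToyRpcClient` is a hypothetical stand-in and not the oslo.messaging client (it omits, for instance, the request context argument), and the method names `container_show` and `container_start` are assumptions made for the example.

```python
import queue
import threading


class ToyRpcClient:
    """Hypothetical in-process stand-in for an RPC client."""

    def __init__(self, handler):
        self._handler = handler
        self._inbox = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        # Plays the role of the conductor: handles one message at a time.
        while True:
            method, kwargs, reply = self._inbox.get()
            result = self._handler(method, **kwargs)
            if reply is not None:
                reply.put(result)
            self._inbox.task_done()

    def call(self, method, **kwargs):
        """Send the request and block until the reply arrives."""
        reply = queue.Queue(maxsize=1)
        self._inbox.put((method, kwargs, reply))
        return reply.get()

    def cast(self, method, **kwargs):
        """Send the request and return immediately, without a reply."""
        self._inbox.put((method, kwargs, None))

    def wait_idle(self):
        self._inbox.join()


statuses = {}


def conductor_handler(method, container_id):
    if method == "container_show":
        return {"id": container_id, "status": statuses.get(container_id, "unknown")}
    if method == "container_start":
        statuses[container_id] = "running"


client = ToyRpcClient(conductor_handler)
details = client.call("container_show", container_id="c1")       # first message: call
cast_result = client.cast("container_start", container_id="c1")  # second message: cast
client.wait_idle()  # only so the example can observe the side effect
```

Here the first message needs the container details and therefore blocks, while the second only triggers work whose outcome is later visible through the container status.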

3. Ensuring the order of execution - Phase-0

Magnum-conductor needs to ensure that, for a given bay and a given container, the operations are executed in sequence. In phase-0, we want to demonstrate how asynchronous behavior helps scaling. Asynchronous mode of container operations will be supported only for the single magnum-conductor scenario in phase-0. If magnum-conductor crashes, there will be no recovery for the operations accepted earlier; that is, phase-0 provides no persistence for operations accepted by magnum-conductor. The multiple-conductor scenario and persistence will be addressed in phase-1 [please refer to the next section for further details]. If the COE crashes or does not respond, the error will be detected, as it is in sync mode, and reflected in the container-status.

Magnum-conductor will maintain a job-queue. The job-queue is indexed by bay-id and container-id. A job-queue entry contains the sequence of operations requested for a given bay-id and container-id, in temporal order. A greenthread will execute the tasks/operations in order for a given job-queue entry, until the queue empties. Using a greenthread in this fashion saves us the cost and complexity of locking while preserving functional correctness. When a request for a new operation comes in, it gets appended to the corresponding queue entry.
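A minimal sketch of such a job-queue follows. It is not the reference implementation: it uses standard-library threads rather than greenthreads, so, unlike the design described above, it needs an explicit lock around the shared dictionary. The class name `JobQueue` and its methods are assumptions made for illustration.

```python
import collections
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class JobQueue:
    """Per-(bay, container) FIFO of operations, drained by one worker each."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()  # needed with OS threads only
        self._queues = {}  # (bay_id, container_id) -> deque of callables

    def submit(self, bay_id, container_id, operation):
        key = (bay_id, container_id)
        with self._lock:
            pending = self._queues.get(key)
            if pending is not None:
                # A worker is already draining this entry: just append.
                pending.append(operation)
                return
            self._queues[key] = collections.deque([operation])
        self._pool.submit(self._drain, key)

    def _drain(self, key):
        # Execute operations in arrival order until the entry is empty.
        while True:
            with self._lock:
                pending = self._queues[key]
                if not pending:
                    del self._queues[key]
                    return
                operation = pending.popleft()
            operation()

    def shutdown(self):
        self._pool.shutdown(wait=True)


log = []
jobs = JobQueue()
jobs.submit("bay-1", "c1", lambda: (time.sleep(0.05), log.append("create")))
jobs.submit("bay-1", "c1", lambda: log.append("start"))
jobs.submit("bay-1", "c1", lambda: log.append("stop"))
jobs.shutdown()
```

Operations for different (bay-id, container-id) keys may run concurrently, while operations for the same key always run one after another.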

For a sequence of container operations, if an intermediate operation fails, we will stop the sequence and not execute the remaining operations. The community feels more confident starting with this strictly defensive policy[17]. The failure will be logged and saved into the container-object, which will keep the operator better informed about the result of the sequence of container operations. We may revisit this policy later, if we think it is too restrictive.
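The stop-on-first-failure policy can be sketched as below. The dictionary standing in for the container-object, the `status_reason` and `completed` fields, and the `-failed` status suffix are all hypothetical and chosen only for illustration.

```python
import logging

LOG = logging.getLogger(__name__)


def run_sequence(container, operations):
    """Run (name, callable) pairs in order; stop at the first failure.

    Returns True if every operation succeeded, False otherwise.
    """
    for name, operation in operations:
        try:
            operation()
        except Exception as exc:
            # Record the failure on the (hypothetical) container object
            # and skip everything that was queued after it.
            LOG.error("container operation %s failed: %s", name, exc)
            container["status"] = "%s-failed" % name
            container["status_reason"] = str(exc)
            return False
        container["completed"].append(name)
    return True


def failing_start():
    raise RuntimeError("COE not responding")


container = {"status": None, "status_reason": None, "completed": []}
ok = run_sequence(container, [
    ("create", lambda: None),
    ("start", failing_start),
    ("stop", lambda: None),  # never executed under the defensive policy
])
```

In this run, ‘create’ completes, ‘start’ fails, and ‘stop’ is skipped; the failure reason stays on the container for the operator to inspect.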

4. Ensuring the order of execution - phase-1

The goal is to execute requests for a given bay and a given container in sequence. In phase-1, we want to address persistence and capability of supporting multiple magnum-conductor processes. To achieve this, we will reuse the concepts laid out in phase-0 and use a standard library.

We propose to use taskflow[7] for this implementation. Magnum-conductors will consume the AMQP message and post a task[8] on a taskflow jobboard[9]. Greenthreads from magnum-conductors would subscribe to the taskflow jobboard as taskflow-conductors[10]. The taskflow jobboard is maintained with a choice of persistent backend[11]. This will help address the concern of persistence for accepted operations when a conductor crashes. Taskflow will ensure that the tasks in a job, namely the container operations in a sequence of operations for a given bay and container, execute in sequence. Some of the concepts used in phase-0 are reused as is. For example, the job-queue maps to the jobboard here, and the use of a greenthread maps to the conductor concept of taskflow. Hence, we expect an easier migration from phase-0 to phase-1 with the choice of taskflow.

For taskflow jobboard[11], the available backend choices are Zookeeper and Redis. However, we plan to use MySQL as the default backend for the magnum conductor jobboard use-case. This support will be added to taskflow. Later, we may choose to support other backends like ZK/Redis via configuration. Phase-1 will keep the implementation simple with the MySQL backend and revisit this, if required.

Let’s consider the scenarios of Conductor crashing -
  • If a task is added to the jobboard and the conductor crashes after that, taskflow can assign the job to any available greenthread agent from other conductor instances. If the system was running with a single magnum-conductor, the job will wait for the conductor to come back and join.

  • A task is picked up and magnum-conductor crashes. In this case, the task is not complete from the jobboard point of view. As taskflow detects the conductor going away, it assigns the task to another available conductor.

  • When a conductor picks up a message from AMQP, it will acknowledge the message to AMQP only after persisting it to the jobboard. This will prevent losing the message if the conductor crashes after picking it up from AMQP. Explicit acknowledgement from the application may use NotificationResult.HANDLED[12] to AMQP. We may use the at-least-once-guarantee[13] feature in oslo.messaging[14], as it becomes available.
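The scenarios above can be walked through with a toy in-memory model. This is a sketch under stated assumptions and not the taskflow API: `ToyJobBoard`, its method names, and `on_message` are hypothetical, and a real jobboard would persist its state in the chosen backend.

```python
class ToyJobBoard:
    """Hypothetical in-memory model of a jobboard shared by conductors."""

    def __init__(self):
        self.jobs = {}  # job name -> owning conductor (None while unclaimed)

    def post(self, name):
        self.jobs.setdefault(name, None)

    def claim(self, name, conductor):
        # Only an existing, unowned job can be claimed.
        if name in self.jobs and self.jobs[name] is None:
            self.jobs[name] = conductor
            return True
        return False

    def conductor_lost(self, conductor):
        # Release every job held by a crashed conductor.
        for name, owner in list(self.jobs.items()):
            if owner == conductor:
                self.jobs[name] = None

    def consume(self, name, conductor):
        # Remove a finished job; only its owner may do so.
        if self.jobs.get(name) == conductor:
            del self.jobs[name]
            return True
        return False


def on_message(board, message, ack):
    """Persist the job first, acknowledge the message only afterwards."""
    board.post(message)
    ack()


acked = []
board = ToyJobBoard()
on_message(board, "bay-1/c1", lambda: acked.append("bay-1/c1"))

first = board.claim("bay-1/c1", "conductor-A")   # A picks up the job
second = board.claim("bay-1/c1", "conductor-B")  # B cannot take an owned job
board.conductor_lost("conductor-A")              # A crashes mid-job
third = board.claim("bay-1/c1", "conductor-B")   # B takes over
done = board.consume("bay-1/c1", "conductor-B")
```

Because the acknowledgement happens only after `post`, a crash between receiving the message and posting the job leaves the message unacknowledged, so it can be redelivered.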

To summarize some of the important outcomes of this proposal -
  • A taskflow job represents the sequence of container operations on a given bay and given container. At a given point of time, the sequence may contain a single or multiple operations.

  • There will be a single jobboard for all conductors.

  • Taskflow-conductors are multiple greenthreads from a given magnum-conductor.

  • Taskflow-conductor will run in ‘blocking’ mode[15], as those greenthreads have no other job than claiming and executing the jobs from jobboard.

  • Individual jobs are supposed to maintain a temporal sequence. So the taskflow-engine would be ‘serial’[16].

  • The proposed model for a ‘job’ is a temporal sequence of ‘tasks’ - operations on a given bay and a given container. Hence, it is expected that while a given operation, namely container-create, is in progress, a request for container-start may come in. Adding the task to the existing job is the intuitive way to maintain the sequence of operations.

To fit taskflow exactly into our use-case, we may need two enhancements in taskflow -

  • Support for a MySQL plugin as a DB backend for the jobboard. Support for Redis exists, so it will be similar. We do not see any technical roadblock to adding MySQL support for the taskflow jobboard. If the proposal is not approved by the taskflow team, we may have to use Redis as an alternative.

  • Support for dynamically adding tasks to a job on the jobboard. This also looks feasible, as discussed over #openstack-state-management [Unfortunately, this channel is not logged, but if we agree in this direction, we can initiate discussion over ML, too]. If the taskflow team does not allow adding this feature, even though they have agreed now, we will use the dependency feature in taskflow. We will explore and elaborate on this further, if required.

5. Status of progress

The progress of execution of a container operation is reflected on the status of a container as - ‘create-in-progress’, ‘delete-in-progress’ etc.

Alternatives

Without an asynchronous implementation, Magnum will suffer from complaints about poor scalability and slowness.

In this design, stack-lock[3] has been considered as an alternative to taskflow. Following are the reasons for preferring taskflow over stack-lock, as of now -

  • Stack-lock used in Heat is not a library, so it will require making a copy for Magnum, which is not desirable.

  • Taskflow is a relatively mature, well supported, feature-rich library.

  • Taskflow has in-built capacity to scale out[in] as multiple conductors can join in[out] the cluster.

  • Taskflow has a failure detection and recovery mechanism. If a process crashes, then worker threads from another conductor may continue the execution.

In this design, we describe futurist[4] as a choice of implementation. The choice was to prevent duplication of code for async and sync mode. For this purpose, we could not find any other solution to compare.

Data model impact

Phase-0 has no data model impact. But phase-1 may introduce an additional table into the Magnum database. As per the present proposal for using taskflow in phase-1, we have to introduce a new table for jobboard under magnum db. This table will be exposed to taskflow library as a persistent db plugin. Alternatively, an implementation with stack-lock will also require an introduction of a new table for stack-lock objects.

REST API impact

None.

Security impact

None.

Notifications impact

None

Other end user impact

None

Performance impact

Asynchronous mode of operation helps scalability. Hence, it improves responsiveness and significantly reduces the turnaround time. A small test on devstack, comparing both modes, demonstrates this with numbers[1].

Other deployer impact

None.

Developer impact

None

Implementation

Assignee(s)

Primary assignee

suro-patz (Surojit Pathak)

Work Items

For phase-0

  • Introduce config knob for asynchronous mode of container operations.

  • Changes for Magnum-API to use CAST instead of CALL for operations eligible for asynchronous mode.

  • Implement the in-memory job-queue in Magnum conductor, and integrate futurist library.

  • Unit tests and functional tests for async mode.

  • Documentation changes.

For phase-1

  • Get the dependencies on taskflow resolved.

  • Introduce jobboard table into Magnum DB.

  • Integrate taskflow in Magnum conductor to replace the in-memory job-queue with taskflow jobboard. Also, we need conductor greenthreads to subscribe as workers to the taskflow jobboard.

  • Add unit tests and functional tests for persistence and multiple conductor scenario.

  • Documentation changes.

For phase-2

  • We will promote asynchronous mode of operation as the default mode of operation.

  • We may decide to drop the code for synchronous mode and corresponding config.

  • Documentation changes.

Dependencies

For phase-1, if we choose to implement using taskflow, we need to get the following two features added to taskflow first -

  • Ability to add a new task to an existing job on the jobboard.

  • MySQL plugin support as a persistent DB.

Testing

All the existing test cases are run to ensure async mode does not break them. Additionally more functional tests and unit tests will be added specific to async mode.

Documentation Impact

Magnum documentation will include a description of the option for asynchronous mode of container operations and its benefits. We will also add guidelines to the developer documentation for implementing a container operation in both modes - sync and async. We will add a section on ‘how to debug container operations in async mode’. The phase-0 and phase-1 implementations and their support for single or multiple conductors will be clearly documented for operators.

References

[1] - Execution time comparison between sync and async modes:

https://gist.github.com/surojit-pathak/2cbdad5b8bf5b569e755

[2] - Proposed change under review:

https://review.openstack.org/#/c/267134/

[3] - Heat’s use of stacklock

http://docs.openstack.org/developer/heat/_modules/heat/engine/stack_lock.html

[4] - openstack/futurist

http://docs.openstack.org/developer/futurist/

[5] - openstack/oslo.messaging

http://docs.openstack.org/developer/oslo.messaging/rpcclient.html

[6] - ML discussion on the design

http://lists.openstack.org/pipermail/openstack-dev/2015-December/082524.html

[7] - Taskflow library

http://docs.openstack.org/developer/taskflow/

[8] - task in taskflow

http://docs.openstack.org/developer/taskflow/atoms.html#task

[9] - job and jobboard in taskflow

http://docs.openstack.org/developer/taskflow/jobs.html

[10] - conductor in taskflow

http://docs.openstack.org/developer/taskflow/conductors.html

[11] - persistent backend support in taskflow

http://docs.openstack.org/developer/taskflow/persistence.html

[12] - oslo.messaging notification handler

http://docs.openstack.org/developer/oslo.messaging/notification_listener.html

[13] - Blueprint for at-least-once-guarantee, oslo.messaging

https://blueprints.launchpad.net/oslo.messaging/+spec/at-least-once-guarantee

[14] - Patchset under review for at-least-once-guarantee, oslo.messaging

https://review.openstack.org/#/c/229186/

[15] - Taskflow blocking mode for conductor

http://docs.openstack.org/developer/taskflow/conductors.html#taskflow.conductors.backends.impl_executor.ExecutorConductor

[16] - Taskflow serial engine

http://docs.openstack.org/developer/taskflow/engines.html

[17] - Community feedback on policy to handle failure within a sequence

http://eavesdrop.openstack.org/irclogs/%23openstack-containers/%23openstack-containers.2016-03-08.log.html#t2016-03-08T20:41:17