Launchpad blueprint: https://blueprints.launchpad.net/barbican/+spec/add-worker-retry-update-support
The Barbican worker processes need a means to retry failed but recoverable tasks (such as when remote systems are unavailable) and to handle updates for long-running order processes such as certificate generation. This blueprint defines the requirements for this retry and update processing, and proposes an implementation to add this feature.
Barbican manages asynchronous tasks, such as generating secrets, via datastore tracking entities such as orders (currently the only tracking entity in Barbican). These entities have a status field that tracks their state, starting as PENDING for new entities and moving to either the ACTIVE or ERROR state when the asynchronous task terminates successfully or unsuccessfully, respectively.
Barbican worker processes implement these asynchronous tasks, as depicted on this wiki page: https://github.com/cloudkeep/barbican/wiki/Architecture
As shown in the diagram, a typical deployment can include multiple worker processes operating in parallel off a tasking queue. The queue invokes task methods on the worker processes via RPC. In some cases, these invoked tasks require the entity (e.g. an order) to stay PENDING, either to allow for follow-on processing in the future or else to retry processing due to a temporary blocking condition (e.g. a remote service is not available at this time).
The following are requirements for retrying tasks in the future and thus keeping the tracking entity in the PENDING state:
R-1) Barbican needs to support extended workflow processes whereby an entity might be PENDING for a long time, requiring periodic status checks to see if the workflow is completed.
R-2) Barbican needs to support re-attempting an RPC task at some point in the future if dependent services are temporarily unavailable.
Note that this blueprint does not handle concurrent updates made to the same entity, say to perform a periodic status check on an order and also apply client updates to that same order. This will be addressed in a future blueprint.
Note also that this blueprint does not handle entities that are ‘stuck’ in the PENDING state because of lost messages in the queue or workers that crash while processing an entity. This will also be addressed in a future blueprint.
In addition, the following non-functional requirements are needed in the final implementation:
NF-1) To keep entity state consistent, only one worker can work on an entity or manage retrying tasks at a time.
NF-2) For resilience of the worker cluster:
   a) Any worker process (of a cluster of workers) should be able to handle retrying entities independently of the other worker processes, even if these worker processes are intermittently available.
   b) If a worker comes back online after going down, it should be able to start processing retry tasks again, without needing to synchronize with other workers.
NF-3) In the default standalone Barbican implementation, it should be possible to demonstrate the periodic status check feature via the SimpleCertificatePlugin class in barbican.plugin.simple_certificate_manager.py.
The following assumptions are made:
A-1) Accurate retry times are not required:
   a) For example, if a task is to be retried in 5 minutes, it would be acceptable if the task was actually retried after more than 5 minutes. For SSL certificate workflows, where some certificate types can take days to process, such retry delays would not be significant.
   b) Relaxed retry schedules allow for more granular retry checking intervals and allow for delays due to excessive tasks in queues during busy times.
   c) Retry times delayed well beyond their expected times could indicate that worker nodes are overloaded. This blueprint does not address this issue, deferring instead to deployment monitoring and scaling processes.
This blueprint proposes that for requirements R-1 and R-2, the plugins used by worker tasks (such as the certificate plugin) determine if tasks should be retried and at what time in the future. If plugins determine that a task should be retried, then these tasks will be scheduled for a future retry attempt.
To implement this scheduling process, this blueprint proposes using the Oslo periodic task feature, described here:
A working example implementation with an older code base is shown here:
Each worker node could then execute a periodic task service that invokes a method on a configurable schedule (say every 15 seconds). This method would query for tasks that need to be retried (i.e. those whose retry time is at or before the current time), and for each one issue a retry task message to the queue. Once the tasks are enqueued, the method would remove the retry records from the retry list. Eventually the queue would invoke workers to implement these retry tasks.
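The sketch below shows one way such a periodic method could look, assuming the Oslo periodic task decorator; the repository and queue client objects (retry_repo, queue_client) are hypothetical stand-ins for Barbican's repository and queue layers, and the locking step described later is omitted here:

    from oslo_service import periodic_task
    from oslo_utils import timeutils


    class RetryScheduler(periodic_task.PeriodicTasks):
        """Periodically re-enqueues tasks whose retry time has passed."""

        def __init__(self, conf, retry_repo, queue_client):
            super(RetryScheduler, self).__init__(conf)
            self.retry_repo = retry_repo      # hypothetical repository for retry records
            self.queue_client = queue_client  # hypothetical RPC client for the worker queue

        @periodic_task.periodic_task(spacing=15)  # the proposed schedule_period_seconds
        def schedule_retries(self, context):
            # Find retry records whose retry time has been reached.
            for task in self.retry_repo.get_due(timeutils.utcnow()):
                # Re-issue the stored RPC method with its stored args/kwargs.
                self.queue_client.cast(task.retry_task,
                                       *task.retry_args, **task.retry_kwargs)
                # Once enqueued, remove the retry record.
                self.retry_repo.delete(task.id)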
To provide a means to evaluate the retry feature in standalone Barbican per NF-3, the SimpleCertificatePlugin class in barbican.plugin.simple_certificate_manager.py would be modified to have the issue_certificate_request() method return a retry time of 5 seconds (configurable). The check_certificate_status() method would then return a successful execution to terminate the order in the ACTIVE state.
This blueprint proposes adding two entities to the data model: OrderRetryTask and EntityLock.
The OrderRetryTask entity would manage which tasks need to be retried on which entities, and would have the following attributes:
1) id: Primary key for this record.
2) order_id: FK to the order record the retry task is intended for.
3) retry_task: The RPC method to invoke for the retry. This method could be a different method than the current one, such as to support an SSL certificate plugin checking for certificate updates after initiating the certificate process.
4) retry_at: The timestamp at or after which to retry the task.
5) retry_args: A list of args to send to the retry_task. This list includes the entity ID, so there is no need for an entity FK in this entity.
6) retry_kwargs: A JSON-ified dict of the kwargs to send to retry_task.
7) retry_count: A count of how many times this task has been retried.
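A hedged SQLAlchemy sketch of this entity follows; the table name, column types, declarative base, and foreign key target are assumptions rather than final schema choices:

    import sqlalchemy as sa
    from sqlalchemy.ext.declarative import declarative_base

    BASE = declarative_base()


    class OrderRetryTask(BASE):
        __tablename__ = 'order_retry_tasks'

        id = sa.Column(sa.String(36), primary_key=True)
        order_id = sa.Column(sa.String(36), sa.ForeignKey('orders.id'),
                             nullable=False)
        retry_task = sa.Column(sa.Text, nullable=False)    # RPC method name to invoke
        retry_at = sa.Column(sa.DateTime, nullable=False)  # retry at or after this time
        retry_args = sa.Column(sa.Text, nullable=False)    # JSON-ified positional args
        retry_kwargs = sa.Column(sa.Text, nullable=False)  # JSON-ified kwargs
        retry_count = sa.Column(sa.Integer, default=0)     # retries attempted so far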
New retry records would be added for tasks that need to be retried in the future, as determined by the plugin as part of workflow processing. The next periodic task method invocation would then send this task to the queue for another worker to implement later.
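As an illustration only, a worker task might record such a retry along the following lines, reusing the OrderRetryTask model sketched above; the function name, parameters, and session handling are hypothetical:

    import datetime
    import json
    import uuid


    def schedule_retry(order, method_name, args, kwargs, retry_seconds, session):
        """Persist an OrderRetryTask so the periodic scheduler re-enqueues it later."""
        retry = OrderRetryTask(
            id=str(uuid.uuid4()),
            order_id=order.id,
            retry_task=method_name,
            retry_at=(datetime.datetime.utcnow() +
                      datetime.timedelta(seconds=retry_seconds)),
            retry_args=json.dumps(args),
            retry_kwargs=json.dumps(kwargs),
            retry_count=0)
        session.add(retry)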
The EntityLock entity would manage which worker is allowed to delete from the OrderRetryTask table, since per NF-1 above only one worker should be able to delete from this table. This entity would have the following attributes:
1) entity_to_lock: The name of the entity to lock ('OrderRetryTask' here). This would be a primary key.
2) worker_host_name: The host name of the worker that has the OrderRetryTask entity 'locked'.
3) created_at: When this table was locked.
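A matching sketch of this entity, reusing the SQLAlchemy assumptions (and the BASE declarative base) from the OrderRetryTask sketch above:

    import sqlalchemy as sa


    class EntityLock(BASE):
        __tablename__ = 'entity_locks'

        # Name of the locked entity ('OrderRetryTask' here); the primary key.
        entity_to_lock = sa.Column(sa.String(255), primary_key=True)
        # Host name of the worker currently holding the lock.
        worker_host_name = sa.Column(sa.String(255), nullable=False)
        # When the lock was taken; used later to detect and clear stuck locks.
        created_at = sa.Column(sa.DateTime, nullable=False)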
This entity would only have zero or one records. So the periodic method above would execute the following pseudo code:
    Start SQLAlchemy session/transaction
    try:
        Attempt to insert a new record into the EntityLock table
        session.commit()
    except:
        session.rollback()
        Handle 'stuck' locks (see paragraph below)
        return
    try:
        Query for retry tasks
        Send retry tasks to the queue
        Remove enqueued retry tasks from OrderRetryTask table
        session.commit()
    except:
        session.rollback()
    finally:
        Remove record from EntityLock table
        Clear SQLAlchemy session/transaction
Lock tables can be problematic if the locking process crashes without removing the locks. However, the overall time a worker holds the lock should be brief, so the lock-attempt rollback path above should check for and remove a stale lock based on the 'created_at' time on the lock.
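For illustration, the lock acquisition and stale-lock cleanup could look roughly like the following; the session handling, broad exception handling, and the 60-second timeout default are assumptions (the timeout would come from the retry_lock_timeout_seconds option proposed below):

    import datetime
    import socket


    def process_retries(session, lock_timeout_seconds=60):
        now = datetime.datetime.utcnow()
        try:
            # Inserting the single allowed EntityLock row acts as taking the lock.
            session.add(EntityLock(entity_to_lock='OrderRetryTask',
                                   worker_host_name=socket.gethostname(),
                                   created_at=now))
            session.commit()
        except Exception:
            session.rollback()
            # Another worker holds the lock; remove it only if it looks stuck.
            stale_before = now - datetime.timedelta(seconds=lock_timeout_seconds)
            session.query(EntityLock).filter(
                EntityLock.created_at < stale_before).delete()
            session.commit()
            return
        try:
            # ...query due retry tasks, enqueue them, and delete their records...
            session.commit()
        except Exception:
            session.rollback()
        finally:
            # Release the lock and clear the session/transaction.
            session.query(EntityLock).delete()
            session.commit()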
To separate coding concerns, it makes sense to implement this process in a separate Oslo 'service' server process, similar to the Keystone listener approach. This service would only run the Oslo periodic task method to perform the retry updating process. If the method failed to operate, say due to another worker locking the resource, it could simply return/exit. The next periodic call would then start the process again.
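A minimal sketch of such a service, assuming the oslo.service library; RetryScheduler is the periodic task class sketched earlier, and schedule_period_seconds is the option proposed in the configuration discussion below:

    from oslo_config import cfg
    from oslo_service import service

    CONF = cfg.CONF


    class RetrySchedulerService(service.Service):
        """Runs only the periodic retry-query task, separate from the workers."""

        def __init__(self, scheduler):
            super(RetrySchedulerService, self).__init__()
            self.scheduler = scheduler  # a RetryScheduler instance (sketched earlier)

        def start(self):
            super(RetrySchedulerService, self).start()
            # Drive the periodic tasks from the service's thread group.
            self.tg.add_dynamic_timer(
                self.scheduler.run_periodic_tasks,
                initial_delay=None,
                periodic_interval_max=CONF.scheduler.schedule_period_seconds,
                context=None)

    # Launched, for example, from a bin/barbican-task-scheduler entry point:
    #     launcher = service.launch(CONF, RetrySchedulerService(scheduler))
    #     launcher.wait()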
Rather than having each worker process manage retrying tasks, a separate node could be designated to manage these retries. This would eliminate the need for the EntityLock entity. However, this approach would require configuring yet another node in the Barbican network, adding to deployment complexity. This manager node would also be a single point of failure for managing retry tasks.
As mentioned above, two new entities would be required. No migrations would be needed.
The addition of a periodic task to identify tasks to be retried presents an extra load on the worker nodes (assuming, as expected, these processes are co-located with the normal worker processes). However, this process does not perform the retry work itself, but rather issues tasks into the queue, which then distributes them evenly back to the worker processes. Hence the additional load on a given worker should be minimal.
This proposal includes utilizing locks to deal with concurrency concerns across the multiple worker nodes that could be handling retry tasks. This can result in two performance impacts: (1) multiple workers might fight to grab the lock simultaneously leading to degraded performance for the workers that fail to grab the lock, and (2) a lock could become ‘stuck’ if a worker holding the lock crashes.
Regarding (1), locks are only utilized on the worker nodes involved in processing asynchronous tasks, which are not time sensitive. Also, the time the lock is held will be very brief: just long enough to query for retry tasks and to send those tasks to the queue for follow-on processing. In addition, the periodic process of each worker node handles these retry tasks, so if the deployment of worker nodes is staggered the retry processes should not conflict. Another option is to randomly dither the periodic interval (e.g. 30 seconds +/- 5 seconds) so that worker nodes are less likely to conflict with each other.
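A small sketch of such dithering; the function name and values are illustrative only:

    import random


    def dithered_interval(base_seconds=30, jitter_seconds=5):
        """Delay before the next retry-query run, randomized +/- jitter."""
        return base_seconds + random.uniform(-jitter_seconds, jitter_seconds)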
Regarding concern (2) about 'stuck' locks: since the conditions that involve locks are either long-running orders that can tolerate delays until locks are restored, or else (hopefully) rare conditions when resources aren't available, this condition should not be critical to resolve. The proposal does, however, suggest a means to remove stuck locks using their created_at times.
The Barbican configuration file will need a configuration parameter, 'schedule_period_seconds', that controls how often the retry-query process runs, with a default value of 15 seconds. This parameter would be placed in a new '[scheduler]' group.
A configuration parameter called ‘retry_lock_timeout_seconds’ would be used to release ‘stuck’ locks on the retry tasks table, as described in the ‘Proposed Change’ section above. This parameter would also be added to the ‘[scheduler]’ group.
A configuration parameter called ‘delay_before_update_seconds’ would be used to configure the amount of time the SimpleCertificatePlugin delays from initiating a demo certificate order to the time the update certificate method is invoked. This parameter would be placed in a new ‘[simple_certificate]’ group.
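For illustration, these options might be registered with oslo.config roughly as follows; the default for retry_lock_timeout_seconds is an assumption beyond what this blueprint specifies:

    from oslo_config import cfg

    scheduler_opts = [
        cfg.IntOpt('schedule_period_seconds', default=15,
                   help='Interval between runs of the retry-query periodic task.'),
        cfg.IntOpt('retry_lock_timeout_seconds', default=60,
                   help='Age in seconds after which a lock on the retry task '
                        'table is considered stuck and may be removed.'),
    ]

    simple_certificate_opts = [
        cfg.IntOpt('delay_before_update_seconds', default=5,
                   help='Delay between initiating a demo certificate order and '
                        'invoking the certificate update/status-check method.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(scheduler_opts, group='scheduler')
    CONF.register_opts(simple_certificate_opts, group='simple_certificate')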
These configurations would be applied and utilized once the revised code base is deployed.
Note that for #6, the 'queue' and 'tasks' packages have to be modified somewhat to allow the server logic to send messages to the queue via the client logic, mainly to break circular dependencies. Again, see the example referenced above for a working demonstration of this server/client/retry processing.
In addition to planned unit testing, the functional Tempest-based tests in the Barbican repository would be augmented to add a test of the new retry feature for the default certificate plugin.
Developer guides will need to be updated to include the additional periodic retry process detailed above. Deployment guides will need to be updated to specify that a new process needs to be executed (the bin/barbican-task-scheduler.sh process).