Supporting HA for background jobs in Watcher

Problem description

The main problem with the current implementation of background job scheduling is that when two or more decision-engine (DE) services load the persisted jobs stored in the DB, every service picks up all of them, so each job is run twice or more.

There is no ‘job <-> service’ binding, so the main goal of this BP is to provide a mechanism that tags (routes) jobs to DEs according to the service instance they are associated with. A future goal for this feature is a job requeuing mechanism for the case when the bound ‘service_id’ is marked as failed.
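The binding and the requeuing idea can be illustrated with a minimal in-memory sketch. The names `Job`, `jobs_for` and `requeue_failed` are hypothetical and are not part of Watcher or APScheduler; the sketch only assumes that each job carries the ‘service_id’ of the DE that owns it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical stand-in for a persisted background job."""
    id: str
    service_id: int  # DE service instance the job is bound to


def jobs_for(jobs: List[Job], service_id: int) -> List[Job]:
    """Return only the jobs bound to the given service."""
    return [job for job in jobs if job.service_id == service_id]


def requeue_failed(jobs: List[Job], failed_id: int, target_id: int) -> int:
    """Rebind the jobs of a failed service to another one; return the count."""
    moved = 0
    for job in jobs:
        if job.service_id == failed_id:
            job.service_id = target_id
            moved += 1
    return moved


# Two DE services (ids 1 and 2) sharing one job list.
jobs = [Job("audit-1", 1), Job("audit-2", 2), Job("audit-3", 1)]
```

With such a binding, each DE only loads the result of `jobs_for(jobs, its_own_id)`, so no job is picked up by two services. How a service is detected as failed and which service receives its jobs are left open here.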

Use Cases

As a Watcher administrator, I want to be sure that all Watcher services that use background jobs and run in HA mode stay in sync with the scheduled jobs.

Proposed change

APScheduler already has a SQLAlchemyJobStore class that creates a table with the following columns:

  • id

  • next_run_time

  • job_state

The ‘job_state’ column contains the serialized Job object. As we can see, SQLAlchemyJobStore has no relation between a Decision Engine and its jobs. I suggest adding a new ‘service_id’ column that is a foreign key to the corresponding service record. This relation should fully define the link between a job and a service. A new ‘tag’ column should also be added; it will contain a dict whose keys and values identify the service uniquely. It is usually expected to contain the ‘host’ and ‘name’ keys. Each type of job should then have its own tag, so that we can easily triage and filter the list of jobs.
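The extended table and the per-service lookup could look roughly like the sketch below. It uses the standard-library `sqlite3` module instead of SQLAlchemy to stay dependency-free. The layout of the `services` table, the helpers `add_job` and `get_jobs`, the sample host and service names, and the choice to store ‘tag’ as JSON text are all assumptions made for illustration, not the actual Watcher schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Assumed minimal layout of the table that 'service_id' points to.
conn.execute(
    "CREATE TABLE services (id INTEGER PRIMARY KEY, name TEXT, host TEXT)"
)
# The three APScheduler columns plus the proposed 'service_id' and 'tag'.
conn.execute(
    """
    CREATE TABLE apscheduler_jobs (
        id            TEXT PRIMARY KEY,
        next_run_time REAL,
        job_state     BLOB NOT NULL,
        service_id    INTEGER NOT NULL REFERENCES services (id),
        tag           TEXT
    )
    """
)


def add_job(job_id, next_run_time, job_state, service_id, tag):
    """Store a job bound to one service; the tag dict is kept as JSON text."""
    conn.execute(
        "INSERT INTO apscheduler_jobs VALUES (?, ?, ?, ?, ?)",
        (job_id, next_run_time, job_state, service_id, json.dumps(tag)),
    )


def get_jobs(service_id):
    """Return the ids of the jobs bound to the given service, soonest first."""
    rows = conn.execute(
        "SELECT id FROM apscheduler_jobs WHERE service_id = ? "
        "ORDER BY next_run_time",
        (service_id,),
    )
    return [row[0] for row in rows]


# Illustrative data: two DE services on different hosts.
for sid, host in ((1, "node-1"), (2, "node-2")):
    name = "watcher-decision-engine"
    conn.execute("INSERT INTO services VALUES (?, ?, ?)", (sid, name, host))
    add_job("job-%d" % sid, 100.0 * sid, b"state", sid,
            {"host": host, "name": name})
add_job("job-3", 50.0, b"state", 1,
        {"host": "node-1", "name": "watcher-decision-engine"})
```

Each DE would then call `get_jobs` with its own service id when it starts, so a job stored for one service is never loaded by another.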



Data model impact

A new table ‘apscheduler_jobs’ is to be added with the following columns:

  • id

  • next_run_time

  • job_state

  • service_id

  • tag

Notifications impact

Since these are internal changes that should not affect the user experience or the common workflow, no new notifications are expected.

Other end user impact

The ‘apscheduler_jobs’ table should be included in a new Alembic version for the Watcher DB.

Primary assignee:

Alexander Chadin <alexchadin>

Work Items

  • Derive a new class from SQLAlchemyJobStore that adds the ‘service_id’ column and overrides the appropriate methods.

  • Update watcher/decision_engine/audit/ so that it works with the new job store.

  • Implement appropriate unit tests to test various scenarios.


The relevant unit tests will be adapted to the new changes.

