Part 1: Design

Before writing any code, it's worth spending time thinking through the domain model and key design decisions. This upfront investment saves rework later — the data model and task graph you sketch here will guide the database schema, task classes, and API endpoints you build in the following parts.

For real contributions

If you were adding a new system to DiracX (rather than following a tutorial), you'd typically start by opening an issue to discuss the design with maintainers, or writing an ADR for more significant changes. See Designing functionality for guidance.

Identify entities

Our system has two main entities:

| Entity | Description |
| --- | --- |
| Compute Element (CE) | A site where pilots can be submitted. Has a capacity and reliability. |
| Pilot Submission | A record of a pilot sent to a CE. Transitions through lifecycle states. |
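
To make the two entities concrete before the schema is defined, here is a minimal sketch using plain dataclasses. The field names (`name`, `pilot_id`, `ce_name`, `status`) and the interpretation of `reliability` as a value between 0 and 1 are assumptions for illustration only; the real schema is built in the database part of the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeElement:
    """A site where pilots can be submitted (illustrative, not the real schema)."""

    name: str           # hypothetical identifier for the CE
    capacity: int       # assumed meaning: maximum number of concurrent pilots
    reliability: float  # assumed meaning: fraction of pilots that succeed, 0..1

    def __post_init__(self) -> None:
        # Reject values that cannot describe a real CE.
        if self.capacity < 0:
            raise ValueError("capacity must be non-negative")
        if not 0.0 <= self.reliability <= 1.0:
            raise ValueError("reliability must be between 0 and 1")


@dataclass
class PilotSubmission:
    """A record of one pilot sent to one CE (illustrative, not the real schema)."""

    pilot_id: int              # hypothetical primary key
    ce_name: str               # the CE this pilot was sent to
    status: str = "SUBMITTED"  # every pilot starts in the SUBMITTED state


ce = ComputeElement(name="ce-a.example.org", capacity=10, reliability=0.9)
pilot = PilotSubmission(pilot_id=1, ce_name=ce.name)
```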

State machine

Pilots follow this lifecycle:

stateDiagram-v2
    [*] --> SUBMITTED
    SUBMITTED --> RUNNING
    RUNNING --> DONE
    RUNNING --> FAILED
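
The lifecycle above can be encoded as a transition table, which makes illegal moves easy to reject. This is a sketch rather than the tutorial's implementation: the `PilotStatus` enum, the `ALLOWED_TRANSITIONS` mapping, and the `transition` helper are hypothetical names, and DONE and FAILED are treated as terminal only because the diagram shows no edges leaving them.

```python
from enum import Enum


class PilotStatus(Enum):
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


# One entry per state; the value is the set of states reachable in one step.
ALLOWED_TRANSITIONS = {
    PilotStatus.SUBMITTED: {PilotStatus.RUNNING},
    PilotStatus.RUNNING: {PilotStatus.DONE, PilotStatus.FAILED},
    PilotStatus.DONE: set(),    # terminal: no outgoing edges in the diagram
    PilotStatus.FAILED: set(),  # terminal: no outgoing edges in the diagram
}


def transition(current: PilotStatus, new: PilotStatus) -> PilotStatus:
    """Return the new status, or raise if the diagram does not allow the move."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


status = transition(PilotStatus.SUBMITTED, PilotStatus.RUNNING)
status = transition(status, PilotStatus.DONE)
```

Keeping the rules in one mapping means the check task and any API endpoint can share a single definition of what a valid state change is.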

Task graph

We need four tasks:

| Task | Type | Schedule | Purpose |
| --- | --- | --- | --- |
| MySubmitPilotsTask | PeriodicVoAwareBaseTask | Every 60s | Finds available CEs, spawns MyPilotTask per slot |
| MyPilotTask | BaseTask (one-shot) | On demand | Submits a single pilot to a CE |
| MyCheckPilotsTask | PeriodicVoAwareBaseTask | Every 30s | Transitions pilot states |
| MyPilotReportTask | PeriodicBaseTask | Hourly (cron) | Logs aggregate statistics |
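
The "spawns MyPilotTask per slot" step can be pictured as a small planning function. The sketch below is plain Python and does not use any DiracX class. It assumes a slot is free whenever a CE's capacity exceeds its number of pilots in the SUBMITTED or RUNNING state, which is one plausible reading of "available" and not something this design section specifies. The name `plan_submissions` is hypothetical.

```python
from collections import Counter

# Assumption: pilots in these states still occupy a slot on their CE.
ACTIVE_STATES = {"SUBMITTED", "RUNNING"}


def plan_submissions(
    capacities: dict[str, int],
    pilots: list[tuple[str, str]],
) -> list[str]:
    """Return one CE name per pilot that should be submitted this cycle.

    capacities maps CE name to capacity; pilots is a list of
    (ce_name, status) pairs for existing submissions.
    """
    active = Counter(ce for ce, status in pilots if status in ACTIVE_STATES)
    plan: list[str] = []
    for ce_name, capacity in sorted(capacities.items()):
        free_slots = max(capacity - active[ce_name], 0)
        # One entry per free slot, i.e. one one-shot task per entry.
        plan.extend([ce_name] * free_slots)
    return plan


plan = plan_submissions(
    capacities={"ce-a": 2, "ce-b": 1},
    pilots=[("ce-a", "RUNNING"), ("ce-b", "DONE")],
)
# plan == ["ce-a", "ce-b"]: one free slot on each CE
```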

What does 'VO-aware' mean?

A Virtual Organisation (VO) is a group of users and resources organised around a common goal. VO-aware tasks run once per VO: with three VOs configured, MySubmitPilotsTask runs as three independent instances, each submitting pilots for its own VO. Non-VO-aware tasks such as MyPilotReportTask run once globally.
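
The difference between the two scheduling modes can be shown with a toy dispatcher. This is a conceptual sketch only: `run_periodic`, `submit_pilots`, `pilot_report`, and the VO names are all hypothetical, and the real fan-out is handled by the task base classes rather than by code like this.

```python
from typing import Callable


def run_periodic(task: Callable[..., str], vos: list[str], vo_aware: bool) -> list[str]:
    """Run one scheduling cycle and return the result of each invocation."""
    if vo_aware:
        # One independent invocation per configured VO.
        return [task(vo=vo) for vo in vos]
    # A single global invocation, regardless of how many VOs exist.
    return [task()]


def submit_pilots(vo: str) -> str:
    return f"submitting pilots for {vo}"


def pilot_report() -> str:
    return "reporting across all VOs"


configured_vos = ["vo_a", "vo_b", "vo_c"]  # hypothetical VO names
per_vo_runs = run_periodic(submit_pilots, configured_vos, vo_aware=True)
global_runs = run_periodic(pilot_report, configured_vos, vo_aware=False)
# len(per_vo_runs) == 3 and len(global_runs) == 1
```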

graph TD
    A[MySubmitPilotsTask<br/>periodic, VO-aware] -->|spawns| B[MyPilotTask<br/>one-shot]
    B -->|calls| L[gubbins.logic.my_pilots]
    C[MyCheckPilotsTask<br/>periodic, VO-aware] -->|calls| L
    E[MyPilotReportTask<br/>periodic, cron] -->|calls| L
    L -->|reads/writes| D[(MyPilotDB)]

Locking strategy

  • MyPilotTask: MutexLock(MY_PILOT, ce_name) — serialise submissions to the same CE
  • MySubmitPilotsTask: default VO-scoped mutex — each VO submits independently
  • MyCheckPilotsTask: default VO-scoped mutex — each VO checks independently
  • MyPilotReportTask: default class-level mutex — only one report at a time
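
One way to reason about these choices is to look at the key each lock would be held under: two task runs exclude each other only when their keys are equal. The sketch below models that with an in-memory table. It does not use the real `MutexLock` class, and the key shapes (a tuple of lock type plus CE name, task name plus VO, or task name alone) are assumptions made for illustration.

```python
from typing import Hashable


def pilot_lock(ce_name: str) -> tuple[str, str]:
    # Mirrors MutexLock(MY_PILOT, ce_name): one lock per CE.
    return ("MY_PILOT", ce_name)


def vo_scoped_lock(task_name: str, vo: str) -> tuple[str, str]:
    # Assumed shape of the default lock for VO-aware tasks: one per (task, VO).
    return (task_name, vo)


def class_lock(task_name: str) -> tuple[str]:
    # Assumed shape of the default class-level lock: one per task class.
    return (task_name,)


class LockTable:
    """Toy in-memory mutex registry, standing in for the real lock backend."""

    def __init__(self) -> None:
        self._held: set[Hashable] = set()

    def try_acquire(self, key: Hashable) -> bool:
        if key in self._held:
            return False  # another run already holds this lock
        self._held.add(key)
        return True

    def release(self, key: Hashable) -> None:
        self._held.discard(key)


locks = LockTable()
first_ce_a = locks.try_acquire(pilot_lock("ce-a"))   # True
second_ce_a = locks.try_acquire(pilot_lock("ce-a"))  # False: same CE is serialised
first_ce_b = locks.try_acquire(pilot_lock("ce-b"))   # True: different CE
```

Under these assumptions, submissions to different CEs proceed in parallel, runs for different VOs never block each other, and a second report run is refused while the first still holds the class-level key.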

Retry policy

MyPilotTask uses NoRetry(). Why? Because the periodic parent (MySubmitPilotsTask) will naturally re-evaluate available CEs on the next cycle and resubmit. Explicit retries would add complexity without benefit here.
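
A small simulation shows why the periodic parent makes per-task retries unnecessary. It is a sketch under stated assumptions: a single CE, a failed submission that leaves its slot free, and a scripted sequence of submission outcomes. The `simulate` function is hypothetical and involves no DiracX code.

```python
from collections.abc import Iterator


def simulate(capacity: int, outcomes: Iterator[bool], cycles: int) -> list[int]:
    """Return the number of active pilots after each periodic cycle.

    Each cycle attempts one submission per free slot. A failed attempt is
    simply dropped (no retry); the slot stays free, so the next cycle
    attempts it again as part of its normal work.
    """
    active = 0
    history: list[int] = []
    for _ in range(cycles):
        free_slots = capacity - active
        for _ in range(free_slots):
            if next(outcomes):  # True means the submission succeeded
                active += 1
        history.append(active)
    return history


# Scripted outcomes: the 2nd and 4th submission attempts fail.
history = simulate(capacity=3, outcomes=iter([True, False, True, False, True]), cycles=3)
# history == [2, 2, 3]: the CE fills up without any explicit retry logic
```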

Failed MyPilotTask runs are not eligible for the dead-letter queue (dlq_eligible = False). Pilots are ephemeral: more will be submitted on the next cycle, so there is no value in preserving failed submissions for manual recovery. The DLQ is reserved for tasks tied to external state that must always be recovered (e.g. a failure to optimise a job or to submit a transformation task).

What's next

With the design in hand, we'll start implementing. The order is: database → logic → tasks → router → tests.