

Everruns includes a custom durable execution engine built entirely on PostgreSQL. It provides workflow orchestration, automatic retries, circuit breakers, and distributed task execution without external dependencies.

Architecture

The durable engine is implemented in the everruns-durable crate and provides:
  • Event-sourced workflows — Deterministic state machines driven by events
  • PostgreSQL-backed persistence — No additional infrastructure required
  • Distributed task claiming — Scalable to 1000+ concurrent workers
  • Push-based notifications — Low-latency task distribution (<10ms P99)
  • Reliability primitives — Retries, circuit breakers, timeouts, dead letter queues
┌─────────────────────────────────────────────────────────────────────┐
│                          everruns-durable                           │
├─────────────────────────────────────────────────────────────────────┤
│  Workflow Engine    Activity Executor    Worker Pool    Scheduler   │
├─────────────────────────────────────────────────────────────────────┤
│                  WorkflowEventStore (PostgreSQL)                    │
│  - durable_workflow_instances     - durable_task_queue              │
│  - durable_workflow_events        - durable_workers                 │
│  - durable_circuit_breaker_state  - durable_signals                 │
├─────────────────────────────────────────────────────────────────────┤
│  Reliability: RetryPolicy, CircuitBreaker, TimeoutManager, DLQ      │
├─────────────────────────────────────────────────────────────────────┤
│  Observability: OTel Tracing, Metrics, Admin API                    │
└─────────────────────────────────────────────────────────────────────┘
The durable engine is self-contained — no Temporal, Cadence, or other external workflow systems required.

Core Abstractions

Workflow

A deterministic state machine driven by events:
  • Input/Output types — Serializable data structures
  • Event handlers — on_start, on_activity_completed, on_activity_failed, on_timer_fired, on_signal
  • Actions — Return WorkflowAction to schedule activities, timers, or complete
Example workflow:
struct AgentTurnWorkflow;

impl Workflow for AgentTurnWorkflow {
    type Input = TurnRequest;
    type Output = TurnResult;

    async fn on_start(&self, input: Self::Input) -> WorkflowAction {
        // Schedule input activity
        WorkflowAction::ScheduleActivity {
            activity_type: "input".to_string(),
            input: serde_json::to_value(input).unwrap(),
            retry_policy: RetryPolicy::default(),
        }
    }

    async fn on_activity_completed(
        &self,
        activity_type: &str,
        result: serde_json::Value,
    ) -> WorkflowAction {
        match activity_type {
            "input" => {
                // Schedule reason activity
                WorkflowAction::ScheduleActivity {
                    activity_type: "reason".to_string(),
                    input: result,
                    retry_policy: RetryPolicy::default(),
                }
            }
            "reason" => {
                // Schedule act activity or complete
                let has_tools = result["tool_calls"].as_array().map_or(false, |t| !t.is_empty());
                if has_tools {
                    WorkflowAction::ScheduleActivity {
                        activity_type: "act".to_string(),
                        input: result,
                        retry_policy: RetryPolicy::default(),
                    }
                } else {
                    WorkflowAction::CompleteWorkflow { result }
                }
            }
            "act" => {
                // Loop back to reason
                WorkflowAction::ScheduleActivity {
                    activity_type: "reason".to_string(),
                    input: result,
                    retry_policy: RetryPolicy::default(),
                }
            }
            _ => WorkflowAction::FailWorkflow {
                error: format!("Unknown activity: {}", activity_type),
            },
        }
    }
}

WorkflowAction

Actions a workflow can request:
  • ScheduleActivity — Queue an activity with retry policy, timeouts, priority
  • StartTimer — Delayed execution (future feature)
  • CompleteWorkflow — Terminal success state
  • FailWorkflow — Terminal failure state
  • ScheduleChildWorkflow — Nested workflows (future feature)
  • CancelActivity — Cancel pending work (future feature)
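The action set above could be modeled as a Rust enum along these lines. This is a simplified sketch, not the crate's actual definition — the real type carries JSON payloads and richer scheduling options, and the field names here are illustrative:

```rust
// Hypothetical, simplified shape of WorkflowAction for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum WorkflowAction {
    ScheduleActivity { activity_type: String, input: String },
    StartTimer { delay_ms: u64 },                    // future feature
    CompleteWorkflow { result: String },
    FailWorkflow { error: String },
    ScheduleChildWorkflow { workflow_type: String }, // future feature
    CancelActivity { activity_id: String },          // future feature
}

impl WorkflowAction {
    /// Terminal actions end the workflow; all others keep it running.
    pub fn is_terminal(&self) -> bool {
        matches!(
            self,
            WorkflowAction::CompleteWorkflow { .. } | WorkflowAction::FailWorkflow { .. }
        )
    }
}

fn main() {
    let next = WorkflowAction::ScheduleActivity {
        activity_type: "reason".to_string(),
        input: "{}".to_string(),
    };
    println!("terminal = {}", next.is_terminal()); // terminal = false
}
```

Distinguishing terminal from non-terminal actions is what lets the engine decide whether to persist a final result or keep dispatching work.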

Activity

A unit of work that may fail and be retried:
#[async_trait]
pub trait Activity: Send + Sync {
    type Input: DeserializeOwned + Send + Sync;
    type Output: Serialize + Send + Sync;

    async fn execute(
        &self,
        input: Self::Input,
        ctx: ActivityContext,
    ) -> Result<Self::Output, Error>;
}
ActivityContext provides:
  • attempt — Current attempt number (1-based)
  • heartbeat() — Liveness signal for long-running tasks
  • is_cancelled() — Check if task was cancelled
Example activity:
struct ReasonActivity;

#[async_trait]
impl Activity for ReasonActivity {
    type Input = ReasonInput;
    type Output = ReasonOutput;

    async fn execute(
        &self,
        input: Self::Input,
        ctx: ActivityContext,
    ) -> Result<Self::Output, Error> {
        // Call LLM
        let response = llm_driver.generate(input.messages).await?;
        
        // Heartbeat for long operations
        ctx.heartbeat().await?;
        
        Ok(ReasonOutput {
            content: response.content,
            tool_calls: response.tool_calls,
        })
    }
}

WorkflowSignal

External signals to running workflows:
  • cancel — Graceful cancellation
  • shutdown — Immediate stop
  • Custom signals for workflow-specific events
Signals are queued in the durable_signals table and delivered to running workflows.
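A hedged sketch of how signal names pulled from durable_signals might be dispatched — the enum and parser here are illustrative, not the crate's API:

```rust
// Illustrative signal model: "cancel" and "shutdown" are built in,
// anything else is treated as a workflow-specific custom signal.
#[derive(Debug, Clone, PartialEq)]
pub enum WorkflowSignal {
    Cancel,
    Shutdown,
    Custom(String),
}

impl WorkflowSignal {
    pub fn parse(name: &str) -> WorkflowSignal {
        match name {
            "cancel" => WorkflowSignal::Cancel,
            "shutdown" => WorkflowSignal::Shutdown,
            other => WorkflowSignal::Custom(other.to_string()),
        }
    }
}

fn main() {
    println!("{:?}", WorkflowSignal::parse("cancel"));     // Cancel
    println!("{:?}", WorkflowSignal::parse("pause_turn")); // Custom("pause_turn")
}
```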

Persistence

All tables are prefixed with durable_ to avoid naming conflicts:

durable_workflow_instances

Workflow state:
CREATE TABLE durable_workflow_instances (
    id UUID PRIMARY KEY DEFAULT uuidv7(),
    workflow_type TEXT NOT NULL,
    status TEXT NOT NULL,  -- running | completed | failed | cancelled
    input JSONB NOT NULL,
    result JSONB,
    error TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    completed_at TIMESTAMPTZ
);

durable_workflow_events

Append-only event log for replay:
CREATE TABLE durable_workflow_events (
    id UUID PRIMARY KEY DEFAULT uuidv7(),
    workflow_id UUID NOT NULL REFERENCES durable_workflow_instances(id),
    sequence_number INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    event_data JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
Event types:
  • workflow.started
  • activity.scheduled
  • activity.completed
  • activity.failed
  • workflow.completed
  • workflow.failed
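Because the log is append-only and workflows are deterministic, current workflow state can be recovered by folding the events in sequence order. A minimal sketch of that replay idea, with illustrative types (the real engine folds richer event payloads):

```rust
// Illustrative replay: fold an ordered event log into a workflow status.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    WorkflowStarted,
    ActivityScheduled(String),
    ActivityCompleted(String),
    ActivityFailed(String),
    WorkflowCompleted,
    WorkflowFailed(String),
}

#[derive(Debug, PartialEq)]
enum Status {
    Running,
    Completed,
    Failed(String),
}

fn replay(events: &[Event]) -> Status {
    let mut status = Status::Running;
    for event in events {
        match event {
            Event::WorkflowCompleted => status = Status::Completed,
            Event::WorkflowFailed(e) => status = Status::Failed(e.clone()),
            // Activity events keep the workflow running; a real replay
            // would also rebuild pending-activity state from them.
            _ => {}
        }
    }
    status
}

fn main() {
    let log = vec![
        Event::WorkflowStarted,
        Event::ActivityScheduled("reason".into()),
        Event::ActivityCompleted("reason".into()),
        Event::WorkflowCompleted,
    ];
    println!("{:?}", replay(&log)); // Completed
}
```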

durable_task_queue

Activity scheduling with claiming:
CREATE TABLE durable_task_queue (
    id UUID PRIMARY KEY DEFAULT uuidv7(),
    workflow_id UUID NOT NULL REFERENCES durable_workflow_instances(id),
    activity_type TEXT NOT NULL,
    input JSONB NOT NULL,
    status TEXT NOT NULL,  -- pending | claimed | completed | failed
    claimed_by TEXT,
    claimed_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error TEXT
);
Indexes:
  • Partial index on status = 'pending' for fast claiming
  • Index on activity_type for partitioned claiming

durable_workers

Worker registry for monitoring:
CREATE TABLE durable_workers (
    worker_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,  -- active | stopped
    last_heartbeat TIMESTAMPTZ DEFAULT NOW(),
    concurrency INTEGER NOT NULL,
    registered_at TIMESTAMPTZ DEFAULT NOW()
);

Task Distribution

Claiming Strategy

Critical for scalability at 1000+ workers:
-- Workers don't block each other
SELECT * FROM durable_task_queue
WHERE status = 'pending'
  AND activity_type = $1
ORDER BY id
LIMIT $2  -- batch_size
FOR UPDATE SKIP LOCKED;
Optimizations:
  • SKIP LOCKED — Workers don’t wait for each other
  • Partition by activity_type — Reduces row scanning
  • Batch claiming — Fewer round trips (default: 10 tasks per claim)
  • Partial index on status = 'pending' — Smaller index scan
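The claiming semantics can be illustrated in-memory: filter to pending tasks of one activity type, take up to a batch, and mark them claimed. In production this is the FOR UPDATE SKIP LOCKED query above; this sketch only shows the selection logic, with hypothetical types:

```rust
#[derive(Debug, Clone, PartialEq)]
enum TaskStatus { Pending, Claimed }

#[derive(Debug, Clone)]
struct Task {
    id: u64,
    activity_type: String,
    status: TaskStatus,
    claimed_by: Option<String>,
}

/// Claim up to `batch_size` pending tasks of one activity type for a worker.
/// Assumes `queue` is ordered by id, mirroring the ORDER BY id in the SQL.
fn claim_batch(queue: &mut [Task], activity_type: &str, worker: &str, batch_size: usize) -> Vec<u64> {
    let mut claimed = Vec::new();
    for task in queue.iter_mut() {
        if claimed.len() == batch_size {
            break;
        }
        if task.status == TaskStatus::Pending && task.activity_type == activity_type {
            task.status = TaskStatus::Claimed;
            task.claimed_by = Some(worker.to_string());
            claimed.push(task.id);
        }
    }
    claimed
}

fn main() {
    let mut queue = vec![
        Task { id: 1, activity_type: "reason".into(), status: TaskStatus::Pending, claimed_by: None },
        Task { id: 2, activity_type: "act".into(), status: TaskStatus::Pending, claimed_by: None },
        Task { id: 3, activity_type: "reason".into(), status: TaskStatus::Pending, claimed_by: None },
    ];
    // Claims only "reason" tasks, in id order: [1, 3]
    let claimed = claim_batch(&mut queue, "reason", "worker-1", 10);
    println!("{claimed:?}");
}
```

Partitioning by activity_type means each worker type scans only its own slice of the queue, which is what keeps claiming cheap at high worker counts.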

Push-Based Notifications

Low-latency task distribution via PostgreSQL NOTIFY:
-- Trigger fires on task INSERT
CREATE TRIGGER notify_task_available
AFTER INSERT ON durable_task_queue
FOR EACH ROW
WHEN (NEW.status = 'pending')
EXECUTE FUNCTION notify_task_available();

-- Function sends notification
CREATE FUNCTION notify_task_available() RETURNS TRIGGER AS $$
BEGIN
    PERFORM pg_notify('task_available', NEW.activity_type);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
Architecture:
  1. Control plane listens on task_available channel via PgListener
  2. Workers subscribe to SubscribeTaskNotifications gRPC stream
  3. Broadcaster pushes notifications from PostgreSQL to workers
  4. Fallback — Workers poll every 10s if stream disconnects
Latency improvement:
  • Polling (100ms interval): P50 ≈ 100ms, P99 ≈ 110ms
  • Push notifications: P50 ≈ 4ms, P99 ≈ 10ms
Push notifications reduce task pickup latency by roughly 96% compared to polling.

Reliability

RetryPolicy

Exponential backoff with jitter:
pub struct RetryPolicy {
    pub max_attempts: u32,           // Default: 3
    pub initial_interval: Duration,  // Default: 1s
    pub max_interval: Duration,      // Default: 60s
    pub backoff_coefficient: f64,    // Default: 2.0
    pub jitter: f64,                 // Default: 0.2 (20%)
    pub non_retryable_errors: Vec<String>,
}
Retry delay calculation:
let base_delay = initial_interval * backoff_coefficient.powi(attempt - 1);
let capped_delay = base_delay.min(max_interval);
let jittered_delay = capped_delay * (1.0 + rand(-jitter, jitter));
Example progression (1s initial, 2.0 coefficient, 20% jitter):
  • Attempt 1: 1s ± 0.2s
  • Attempt 2: 2s ± 0.4s
  • Attempt 3: 4s ± 0.8s
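Ignoring the random jitter factor, the delay schedule above can be computed directly. A sketch matching the stated defaults (the real implementation also applies jitter and works with Duration rather than raw seconds):

```rust
/// Retry delay in seconds for a 1-based attempt number, without jitter.
/// base_delay = initial * coefficient^(attempt - 1), capped at max.
fn retry_delay_secs(attempt: u32, initial: f64, coefficient: f64, max: f64) -> f64 {
    let base = initial * coefficient.powi(attempt as i32 - 1);
    base.min(max)
}

fn main() {
    for attempt in 1..=3 {
        // 1s initial, 2.0 coefficient -> 1s, 2s, 4s
        println!("attempt {attempt}: {}s", retry_delay_secs(attempt, 1.0, 2.0, 60.0));
    }
    // Later attempts are capped at max_interval (60s): 2^9 = 512s -> 60s
    println!("attempt 10: {}s", retry_delay_secs(10, 1.0, 2.0, 60.0));
}
```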

CircuitBreaker

Circuit breaker state is distributed via the database. Configuration:
pub struct CircuitBreakerConfig {
    pub failure_threshold: u32,  // Default: 5
    pub success_threshold: u32,  // Default: 2
    pub reset_timeout: Duration, // Default: 60s
}
State stored in durable_circuit_breaker_state table — shared across workers.
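The state machine behind that config can be sketched as follows. This is an illustrative in-memory version; the real engine persists the state in durable_circuit_breaker_state so all workers share one view:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: CircuitState,
    failure_threshold: u32, // open after this many consecutive failures
    success_threshold: u32, // close after this many successes in half-open
    failures: u32,
    successes: u32,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, success_threshold: u32) -> Self {
        Self { state: CircuitState::Closed, failure_threshold, success_threshold, failures: 0, successes: 0 }
    }

    fn record_failure(&mut self) {
        self.successes = 0;
        match self.state {
            // A failed probe reopens the circuit immediately.
            CircuitState::HalfOpen => {
                self.state = CircuitState::Open;
                self.failures = 0;
            }
            _ => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.state = CircuitState::Open;
                    self.failures = 0;
                }
            }
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        if self.state == CircuitState::HalfOpen {
            self.successes += 1;
            if self.successes >= self.success_threshold {
                self.state = CircuitState::Closed;
                self.successes = 0;
            }
        }
    }

    /// After reset_timeout elapses, an open circuit lets probes through.
    fn on_reset_timeout(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
        }
    }
}
```

With the defaults (failure_threshold 5, success_threshold 2), five consecutive failures open the circuit, the 60s reset timeout moves it to half-open, and two successful probes close it again.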

Timeouts

  • schedule_to_start_timeout — Max wait in queue (default: 5 minutes)
  • start_to_close_timeout — Max execution time (default: 10 minutes)
  • heartbeat_timeout — Liveness detection (default: 30 seconds)
Heartbeat mechanism:
// In activity
loop {
    // Do work...
    
    // Send heartbeat every 10s
    ctx.heartbeat().await?;
    
    // Check for cancellation
    if ctx.is_cancelled() {
        return Err(Error::Cancelled);
    }
}
If heartbeats stop (worker crash), control plane reclaims task after 30s.

Dead Letter Queue

Failed tasks after retry exhaustion:
CREATE TABLE durable_dead_letter_queue (
    id UUID PRIMARY KEY DEFAULT uuidv7(),
    workflow_id UUID NOT NULL,
    activity_type TEXT NOT NULL,
    input JSONB NOT NULL,
    error TEXT NOT NULL,
    attempts INTEGER NOT NULL,
    failed_at TIMESTAMPTZ DEFAULT NOW()
);
DLQ entries can be:
  • Investigated for debugging
  • Replayed after fixes
  • Manually completed/discarded

Backpressure

Workers report load status in heartbeats:

Worker-Side

High/low watermarks based on load ratio:
let load_ratio = active_tasks as f64 / max_concurrency as f64;
let accepting_tasks = load_ratio < 0.9;  // Stop accepting at 90%

System-Wide

Queue depth relative to total capacity:
let total_capacity = workers.iter().map(|w| w.concurrency).sum();
let load_percentage = (pending_tasks as f64 / total_capacity as f64) * 100.0;

if load_percentage > 80.0 {
    // System under pressure
}
Control plane uses load metrics to:
  • Reject new workflows (503 Service Unavailable)
  • Emit alerts
  • Trigger auto-scaling (future)
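Both checks are simple ratios. A combined sketch using the thresholds from the snippets above (function names are illustrative):

```rust
/// Worker-side: stop accepting new tasks at the 90% high watermark.
fn worker_accepting(active_tasks: u32, max_concurrency: u32) -> bool {
    (active_tasks as f64 / max_concurrency as f64) < 0.9
}

/// System-wide: pending tasks as a percentage of total worker capacity.
fn system_load_percentage(pending_tasks: u32, worker_concurrency: &[u32]) -> f64 {
    let total_capacity: u32 = worker_concurrency.iter().sum();
    pending_tasks as f64 / total_capacity as f64 * 100.0
}

fn main() {
    println!("{}", worker_accepting(8, 10)); // true  (80% < 90%)
    println!("{}", worker_accepting(9, 10)); // false (90% hits the watermark)
    // 3 workers with concurrency 10 each = capacity 30; 27 pending = 90% load
    println!("{}", system_load_percentage(27, &[10, 10, 10]));
}
```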

Observability

OpenTelemetry Integration

Spans for workflows and activities:
// Workflow span
let span = span!(
    Level::INFO,
    "durable.workflow",
    workflow.type = workflow_type,
    workflow.id = %workflow_id,
);

// Activity span
let span = span!(
    Level::INFO,
    "durable.activity",
    activity.type = activity_type,
    activity.attempt = attempt,
);
Semantic conventions:
  • durable.workflow.* — Workflow operations
  • durable.activity.* — Activity execution
  • durable.worker.* — Worker lifecycle

Metrics Dashboard

Real-time metrics via SSE streaming:
  1. Workflow Status — Running + pending (stacked area), completed/failed rate (lines)
  2. Task Status — Pending + claimed (stacked area), completed/failed rate (lines)
  3. Throughput — Completed/failed tasks per interval (delta rates)
  4. System Load — Load % (left axis), workers + DLQ size (right axis)
Architecture:
  • Background task samples SystemHealth every 10s
  • Stores in VecDeque<MetricsPoint> ring buffer (max 360 = 1 hour)
  • Exposed via SSE snapshot events and GET /v1/durable/metrics/timeseries
  • UI renders last 15 minutes with recharts
MetricsPoint fields:
pub struct MetricsPoint {
    pub timestamp: DateTime<Utc>,
    // Gauges
    pub running_workflows: u64,
    pub pending_workflows: u64,
    pub pending_tasks: u64,
    pub claimed_tasks: u64,
    pub active_workers: u64,
    pub load_percentage: f64,
    pub dlq_size: u64,
    // Cumulative totals (for delta rates)
    pub tasks_completed_total: u64,
    pub tasks_failed_total: u64,
    pub workflows_completed_total: u64,
    pub workflows_failed_total: u64,
}

Admin API

# List active workers
GET /v1/durable/workers

# Get workflow status
GET /v1/durable/workflows/{workflow_id}

# List DLQ entries
GET /v1/durable/dlq

# Metrics time series
GET /v1/durable/metrics/timeseries

Worker Communication

Workers communicate with control plane exclusively via gRPC:

gRPC Operations

  • ClaimDurableTasks — Poll for pending tasks
  • CompleteDurableTask — Report task success
  • FailDurableTask — Report task failure
  • HeartbeatDurableTask — Liveness signal
  • CreateDurableWorkflow — Start a workflow
  • GetDurableWorkflowStatus — Check workflow state
  • SubscribeTaskNotifications — Push notifications (streaming)

Authentication (TM-DURABLE-002)

Bearer Token (required in production):
WORKER_GRPC_AUTH_TOKEN=secret-token-here
The server panics at startup if this token is unset in production.

Mutual TLS (optional):
WORKER_GRPC_TLS_CA_CERT=/path/to/ca.pem
WORKER_GRPC_TLS_CERT=/path/to/worker-cert.pem
WORKER_GRPC_TLS_KEY=/path/to/worker-key.pem
Provides transport encryption + mutual identity verification.
Workers are intentionally cross-org — they process tasks from any organization’s queue. Org-scoping is enforced at the HTTP API layer.

Startup Retry

Workers retry the control plane connection for up to 5 seconds:
let backoff = ExponentialBackoff {
    initial_interval: Duration::from_millis(100),
    max_interval: Duration::from_secs(1),
    max_elapsed_time: Some(Duration::from_secs(5)),
    ..Default::default()
};
Handles startup race conditions when both services restart simultaneously.

Worker Heartbeat & Stale Detection

Workers send heartbeats every 5 seconds. The WORKER_HEARTBEAT_TIMEOUT_SECS constant (60s) is the single source of truth for staleness.

Stale worker detection:
  • get_system_health — Only counts workers with heartbeat within 60s as active
  • list_workers — Only returns workers with fresh heartbeats
  • reclaim_stale_tasks — Marks workers with stale heartbeats as stopped, reclaims their tasks
Stale task reclaim flow:
  1. Worker crashes without calling deregister_worker
  2. Heartbeats stop
  3. After 60s, control plane marks worker as stopped
  4. Tasks claimed by that worker are re-queued as pending
  5. Other workers can claim and execute them
Works in both dev mode (in-memory) and full mode (PostgreSQL).
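The staleness check itself is a timestamp comparison against the 60s constant. A sketch — the function name is illustrative, and plain Unix seconds stand in for the real timestamp types:

```rust
/// Heartbeat timeout, mirroring WORKER_HEARTBEAT_TIMEOUT_SECS.
const HEARTBEAT_TIMEOUT_SECS: u64 = 60;

/// A worker is stale when its last heartbeat is older than the timeout.
/// Timestamps are plain Unix seconds for illustration.
fn is_stale(last_heartbeat_secs: u64, now_secs: u64) -> bool {
    now_secs.saturating_sub(last_heartbeat_secs) > HEARTBEAT_TIMEOUT_SECS
}

fn main() {
    let now = 1_700_000_000;
    println!("{}", is_stale(now - 5, now));   // false: heartbeat 5s ago
    println!("{}", is_stale(now - 120, now)); // true: heartbeat 2 minutes ago
}
```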

Benchmarks

Load tests validate performance and scalability:

In-Memory Benchmarks

Fast iteration without database overhead:
just durable-bench
  • concurrent_workers — Task claiming at various worker counts
  • workflow_throughput — Multi-step workflow execution
  • cold_start_latency — Enqueue to pickup latency

PostgreSQL Benchmarks

Real database performance:
just durable-bench-db  # Auto-starts Docker
  • db_concurrent_workers — Task claiming with real PostgreSQL
  • db_workflow_throughput — Multi-step workflows with persistence
  • db_cold_start_latency — Polling vs push notification latency

Checkpointing

Save results for historical comparison:
just durable-bench --save
just durable-bench-db --save ci-4cpu-8gb
Checkpoints stored in crates/durable/benches/checkpoints/.

Best Practices

Workflow Design

  1. Keep workflows deterministic — Same input always produces same event sequence
  2. Use activities for side effects — LLM calls, database writes, HTTP requests
  3. Handle failures gracefully — Implement retry policies and fallbacks
  4. Avoid long-running activities — Use heartbeats, consider splitting

Activity Implementation

  1. Idempotent operations — Activities may be retried
  2. Heartbeat for long tasks — Send heartbeats every 10s
  3. Check cancellation — Respect ctx.is_cancelled()
  4. Structured errors — Use error types for retry decisions

Retry Policies

// Aggressive retries for transient errors
RetryPolicy {
    max_attempts: 5,
    initial_interval: Duration::from_millis(500),
    max_interval: Duration::from_secs(30),
    backoff_coefficient: 2.0,
    jitter: 0.2,
    non_retryable_errors: vec!["InvalidInput".to_string()],
}

// Conservative retries for expensive operations
RetryPolicy {
    max_attempts: 3,
    initial_interval: Duration::from_secs(2),
    max_interval: Duration::from_secs(60),
    backoff_coefficient: 2.0,
    jitter: 0.2,
    non_retryable_errors: vec![],
}

Monitoring

  1. Track DLQ size — Alert when DLQ entries accumulate
  2. Monitor load percentage — Scale workers when load > 80%
  3. Watch task latency — P99 latency indicates worker saturation
  4. Circuit breaker state — Alert when circuits open

Next Steps

Architecture Overview

Understand the overall system architecture

Sessions

Learn about session execution

Scheduled Tasks

Create cron-based scheduled tasks

Durable API

API reference for workflow management