
Docker Compose Deployment

Docker Compose is the simplest way to run Ironflow with PostgreSQL. This page covers both single-node and multi-node cluster configurations.

Ironflow supports running multiple nodes against a shared PostgreSQL database and external NATS server or cluster. This enables horizontal scaling, high availability, and zero-downtime deployments.

Requirements

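| Requirement | Single Node          | Multi-Node                              |
|-------------|----------------------|-----------------------------------------|
| Database    | SQLite or PostgreSQL | PostgreSQL only                         |
| NATS        | Embedded (automatic) | Bundled 3-node cluster or External NATS |
| Min nodes   | 1                    | 2+                                      |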

Starting Ironflow with --nats-url and a SQLite database fails at startup with the error external NATS requires PostgreSQL.
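
The startup rule can be pictured as a small validation step. The following is a minimal sketch, not Ironflow's actual code: the helper name validateBackend and the DSN prefix check are assumptions made for illustration, and only the error text is taken from the Troubleshooting section of this page.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNeedsPostgres reuses the message quoted in the Troubleshooting section.
var errNeedsPostgres = errors.New("external NATS requires PostgreSQL")

// validateBackend is a hypothetical helper. natsURL mirrors --nats-url and
// databaseURL mirrors IRONFLOW_DATABASE_URL.
func validateBackend(natsURL, databaseURL string) error {
	if natsURL == "" {
		// Single node: embedded NATS, so SQLite and PostgreSQL are both allowed.
		return nil
	}
	// Assumption: a PostgreSQL DSN is recognized by its URL scheme.
	if strings.HasPrefix(databaseURL, "postgres://") ||
		strings.HasPrefix(databaseURL, "postgresql://") {
		return nil
	}
	return errNeedsPostgres
}

func main() {
	fmt.Println(validateBackend("nats://nats:4222", ""))
	fmt.Println(validateBackend("nats://nats:4222", "postgres://user:pass@host:5432/db"))
	fmt.Println(validateBackend("", ""))
}
```

The first call models the failing combination (external NATS without a PostgreSQL DSN); the other two model a valid multi-node and a valid single-node setup.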

Architecture

Each Ironflow node is stateless. All coordination uses:

  • NATS KV for cron deduplication (first-wins atomic claim)
  • PostgreSQL SKIP LOCKED for distributed scheduler wakeups
  • NATS STEPS stream (WorkQueuePolicy) for job dispatch

Quick Start

This quick start uses a single NATS server, which is enough for testing multi-node behavior locally.

cd deploy/docker-compose/cluster
cp .env.example .env
docker compose up

This starts three Ironflow nodes on ports 9123, 9124, and 9125, with a load balancer on port 9000.

Configuration Reference

| Flag | Env Var | Default | Description |
|------|---------|---------|-------------|
| --nats-url | NATS_URL | | External NATS URL. nats://host:4222, tls://host:4222, or comma-separated seeds. Required for multi-node. |
| --nats-creds | NATS_CREDS_FILE | | Path to NATS .creds file for JWT/NKey auth. See NATS credentials below. |
| --node-id | IRONFLOW_NODE_ID | random UUID | Stable node identifier. Recommended in production for log correlation. |
| | IRONFLOW_DATABASE_URL | | PostgreSQL DSN (postgres://user:pass@host:5432/db). Required for multi-node. |
| | IRONFLOW_MASTER_KEY | (unencrypted) | Hex-encoded 32-byte AES-256 key for secrets at rest. Generate: openssl rand -hex 32. Omit only in local dev. |
| | IRONFLOW_STALE_CLAIM_THRESHOLD | 2m | How long a scheduler claim can be held before another node reclaims it. Go duration string (e.g. 5m). |
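
To catch a malformed IRONFLOW_MASTER_KEY before deploying, a check along these lines can help. This is a sketch under assumptions: parseMasterKey is a hypothetical helper, not part of Ironflow, and the sample key is a fixed placeholder rather than a real secret. It only verifies the shape described in the table (hex encoding of exactly 32 bytes).

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// parseMasterKey is a hypothetical helper that checks a value has the shape
// described for IRONFLOW_MASTER_KEY: hex-encoded, exactly 32 bytes.
func parseMasterKey(hexKey string) ([]byte, error) {
	key, err := hex.DecodeString(hexKey)
	if err != nil {
		return nil, fmt.Errorf("master key is not valid hex: %w", err)
	}
	if len(key) != 32 {
		return nil, fmt.Errorf("master key must be 32 bytes, got %d", len(key))
	}
	return key, nil
}

func main() {
	// Placeholder with 64 hex characters, the length that
	// `openssl rand -hex 32` produces. Do not use this value as a real key.
	key, err := parseMasterKey("00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff")
	fmt.Println(len(key), err)

	// Too short: decodes to 2 bytes and is rejected.
	_, err = parseMasterKey("abcd")
	fmt.Println(err)
}
```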

NATS Credentials

If your NATS server requires authentication, pass a .creds file via --nats-creds. NATS credentials files are generated with the nsc CLI; Ironflow does not issue or manage NATS credentials.

# Example: generate creds with nsc (NATS operator tooling)
nsc add user --account MyAccount --name ironflow-node
nsc generate creds --account MyAccount --name ironflow-node > /etc/ironflow/node.creds
# Pass to Ironflow
./build/ironflow serve --nats-url nats://nats:4222 --nats-creds /etc/ironflow/node.creds

How Coordination Works

Cron deduplication: When a cron expression fires, each node tries to Create a key in the SYS_cron_triggers NATS KV bucket. Only one node succeeds (atomic first-wins). The key expires after 24h. Duplicate runs are prevented without leader election.
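
The first-wins claim can be illustrated without a NATS server. The sketch below is an in-memory stand-in, not the NATS KV API: claimStore, create, and raceForTick are hypothetical names, the key format is invented for the example, and the 24h expiry is not modeled. It only shows why a create-if-absent operation lets exactly one of several racing nodes proceed.

```go
package main

import (
	"fmt"
	"sync"
)

// claimStore is an in-memory stand-in for a KV bucket whose create operation
// succeeds only when the key does not exist yet.
type claimStore struct {
	mu   sync.Mutex
	keys map[string]string
}

func newClaimStore() *claimStore {
	return &claimStore{keys: make(map[string]string)}
}

// create stores key only if it is absent and reports whether this caller won.
func (s *claimStore) create(key, nodeID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.keys[key]; exists {
		return false
	}
	s.keys[key] = nodeID
	return true
}

// raceForTick lets the given number of nodes race to claim the same cron tick
// and returns how many of them won the claim.
func raceForTick(s *claimStore, key string, nodes int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < nodes; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if s.create(key, fmt.Sprintf("node-%d", id)) {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return winners
}

func main() {
	s := newClaimStore()
	// Hypothetical key: cron name plus the scheduled fire time.
	fmt.Println(raceForTick(s, "nightly-report/2024-01-01T00:00", 3)) // prints 1
}
```

Whichever node wins runs the cron; the others see the key already exists and skip the tick, which is why no leader election is needed.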

Scheduler wakeups: When sleeping or waiting steps are ready, each node runs an UPDATE whose row selection uses FOR UPDATE SKIP LOCKED. Rows already locked by one node are skipped by the others, so PostgreSQL assigns each row to exactly one node and no step is processed twice. If a node crashes mid-processing, the stale claim recovery goroutine (runs every 60s) resets claimed rows after a configurable threshold (default: 2 minutes).

Job dispatch: All pull-mode jobs are published to the NATS STEPS stream (WorkQueuePolicy). Any node’s connected worker can consume them. Jobs survive node crashes because NATS redelivers unacknowledged messages after AckWait.

Known Limitations

  • WebSocket subscriptions are node-local. Clients that reconnect to a different node lose live subscription state. NATS KV-backed subscriptions (Phase 6) will address this.

Production Considerations

NATS cluster quorum: A 3-node NATS cluster requires 2-of-3 nodes to write. Losing 2 nodes simultaneously will stall JetStream writes. Plan your cluster size and replication factor accordingly.
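
The 2-of-3 figure is the usual majority rule, and the same arithmetic applies to other cluster sizes. A minimal sketch, with quorum and tolerated as hypothetical helper names:

```go
package main

import "fmt"

// quorum is the majority needed among n voting members.
func quorum(n int) int {
	return n/2 + 1
}

// tolerated is how many members can be lost while a majority remains.
func tolerated(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

For 3 members this gives a quorum of 2 and one tolerated failure, matching the statement above; 5 members tolerate two failures.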

JetStream replication: Set Replicas in your NATS stream config to match your cluster size for durability. The default is Replicas=1 (no replication: fastest, but not fault-tolerant).

PostgreSQL connection pool: Each Ironflow node uses 25 max connections by default. With 3 nodes, plan for ~75 total connections.
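
A quick capacity check makes the connection budget concrete. In this sketch the helper names are hypothetical, and the server-side figures (max_connections of 100 with 3 reserved connections, which are the stock PostgreSQL defaults) are assumptions about your database rather than anything Ironflow sets.

```go
package main

import "fmt"

// defaultPoolSize is the per-node maximum stated in the text.
const defaultPoolSize = 25

// totalConnections is the worst-case number of connections all nodes can open.
func totalConnections(nodes, perNode int) int {
	return nodes * perNode
}

// fitsServer reports whether that total stays within the connections the
// server leaves available to ordinary clients.
func fitsServer(nodes, perNode, maxConnections, reserved int) bool {
	return totalConnections(nodes, perNode) <= maxConnections-reserved
}

func main() {
	// Assumed server settings: max_connections=100, 3 reserved connections.
	fmt.Println(totalConnections(3, defaultPoolSize))   // 75
	fmt.Println(fitsServer(3, defaultPoolSize, 100, 3)) // true
	fmt.Println(fitsServer(4, defaultPoolSize, 100, 3)) // false
}
```

Under those assumed settings a fourth node would exceed the budget, so either raise max_connections or place a connection pooler in front of PostgreSQL.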

Stale claim threshold: Tune IRONFLOW_STALE_CLAIM_THRESHOLD (default: 2m) based on your acceptable recovery latency after a node failure.

Graceful shutdown: Send SIGTERM to allow in-flight executions to drain before the process exits. The drain timeout is 30s by default.

Troubleshooting

external NATS requires PostgreSQL

You set --nats-url but IRONFLOW_DATABASE_URL is not set. Multi-node requires PostgreSQL.

connect to external NATS: connection refused

NATS is not reachable at the given URL. Check the URL, firewall rules, and NATS health (curl http://nats-host:8222/healthz).

Steps stuck in waking status

A node crashed after claiming steps. Automatic recovery takes up to the stale claim threshold (IRONFLOW_STALE_CLAIM_THRESHOLD, default 2m) plus 60s. To reset the claims manually, run:

UPDATE steps SET status = pre_claim_status, claimed_by = NULL, claimed_at = NULL, pre_claim_status = NULL
WHERE status = 'waking' AND claimed_at < NOW() - INTERVAL '5 minutes';

Duplicate runs in database

Cron deduplication requires the NATS KV SYS_cron_triggers bucket. Check NATS connectivity:

nats kv ls # should show SYS_cron_triggers