Docker Compose Deployment
Docker Compose is the simplest way to run Ironflow with PostgreSQL. This page covers both single-node and multi-node cluster configurations.
Ironflow supports running multiple nodes against a shared PostgreSQL database and external NATS server or cluster. This enables horizontal scaling, high availability, and zero-downtime deployments.
Requirements
| Requirement | Single Node | Multi-Node |
|---|---|---|
| Database | SQLite or PostgreSQL | PostgreSQL only |
| NATS | Embedded (automatic) | Bundled 3-node cluster or External NATS |
| Min nodes | 1 | 2+ |
Note: starting `ironflow` with `--nats-url` and SQLite will fail at startup with a clear error message.
Architecture
Each Ironflow node is stateless. All coordination uses:
- NATS KV for cron deduplication (first-wins atomic claim)
- PostgreSQL SKIP LOCKED for distributed scheduler wakeups
- NATS STEPS stream (WorkQueuePolicy) for job dispatch
Quick Start
Uses a single NATS server. Good for testing multi-node behavior locally.
```shell
cd deploy/docker-compose/cluster
cp .env.example .env
docker compose up
```

Three Ironflow nodes at :9123, :9124, :9125. Load balancer at :9000.
Uses a 3-node NATS JetStream cluster. Demonstrates real NATS quorum and failover.
```shell
cd deploy/docker-compose/cluster
cp .env.example .env
docker compose -f docker-compose.cluster.yml up
```

Three Ironflow nodes with a 3-node NATS cluster. Simulates production topology.
For production, run Ironflow processes directly (systemd, Docker, or your preferred runtime) against managed infrastructure. If you are on Kubernetes, the Helm chart can deploy a bundled 3-node NATS cluster with auth — see the Helm Chart Production (bundled) configuration.
```shell
# Node 1
IRONFLOW_DATABASE_URL="postgres://ironflow:pass@db.example.com:5432/ironflow?sslmode=require" \
NATS_URL="nats://nats-1.example.com:4222,nats://nats-2.example.com:4222,nats://nats-3.example.com:4222" \
IRONFLOW_NODE_ID="node-1" \
IRONFLOW_MASTER_KEY="<openssl rand -hex 32>" \
IRONFLOW_STALE_CLAIM_THRESHOLD="2m" \
./ironflow serve
```
```shell
# Node 2 (same config, different node-id)
IRONFLOW_NODE_ID="node-2" \
# ... same env vars ...
./ironflow serve
```

Production checklist:
- Managed PostgreSQL (RDS, Cloud SQL, etc.) with connection pooling
- 3-node NATS JetStream cluster (bundled via Helm chart, or external with TLS and credentials)
- Stable node IDs for log correlation
- Master key for secrets encryption
- Load balancer in front of all nodes
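The master key from the checklist can be generated and sanity-checked like this (assumes `openssl` is installed; the variable name matches the configuration reference below):

```shell
# Generate a 32-byte AES-256 key, hex-encoded (64 hex characters).
IRONFLOW_MASTER_KEY=$(openssl rand -hex 32)

# Sanity check: length must be exactly 64.
echo "${#IRONFLOW_MASTER_KEY}"   # 64
```

Store the key in your secrets manager; losing it makes encrypted secrets unrecoverable.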
Configuration Reference
| Flag | Env Var | Default | Description |
|---|---|---|---|
| `--nats-url` | `NATS_URL` | — | External NATS URL: `nats://host:4222`, `tls://host:4222`, or comma-separated seeds. Required for multi-node. |
| `--nats-creds` | `NATS_CREDS_FILE` | — | Path to NATS `.creds` file for JWT/NKey auth. See NATS Credentials below. |
| `--node-id` | `IRONFLOW_NODE_ID` | random UUID | Stable node identifier. Recommended in production for log correlation. |
| — | `IRONFLOW_DATABASE_URL` | — | PostgreSQL DSN (`postgres://user:pass@host:5432/db`). Required for multi-node. |
| — | `IRONFLOW_MASTER_KEY` | (unencrypted) | Hex-encoded 32-byte AES-256 key for secrets at rest. Generate with `openssl rand -hex 32`. Omit only in local dev. |
| — | `IRONFLOW_STALE_CLAIM_THRESHOLD` | `2m` | How long a scheduler claim can be held before another node reclaims it. Go duration string (e.g. `5m`). |
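Pulling the table together, a per-node environment file might look like the following sketch (hostnames, credentials, and the node id are placeholders, not real endpoints):

```shell
# Example environment for one node, mirroring the reference table.
export IRONFLOW_DATABASE_URL="postgres://ironflow:pass@db.example.com:5432/ironflow?sslmode=require"
export NATS_URL="nats://nats-1.example.com:4222,nats://nats-2.example.com:4222"
export IRONFLOW_NODE_ID="node-1"
export IRONFLOW_STALE_CLAIM_THRESHOLD="2m"
# export IRONFLOW_MASTER_KEY="<openssl rand -hex 32>"   # set in production
```

Give each node the same values except `IRONFLOW_NODE_ID`.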
NATS Credentials
If your NATS server requires authentication, pass a .creds file via --nats-creds. NATS credentials files are generated with the nsc CLI — Ironflow does not issue or manage NATS credentials.
```shell
# Example: generate creds with nsc (NATS operator tooling)
nsc add user --account MyAccount --name ironflow-node
nsc generate creds --account MyAccount --name ironflow-node > /etc/ironflow/node.creds
```
```shell
# Pass to Ironflow
./build/ironflow serve --nats-url nats://nats:4222 --nats-creds /etc/ironflow/node.creds
```

How Coordination Works
Cron deduplication: When a cron expression fires, each node tries to Create a key in the SYS_cron_triggers NATS KV bucket. Only one node succeeds (atomic first-wins). The key expires after 24h. Duplicate runs are prevented without leader election.
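The first-wins semantics can be illustrated locally with an analogy (this is not Ironflow code): `mkdir` is atomic on POSIX filesystems, so racing "nodes" behave like racing `Create` calls on the KV bucket — exactly one succeeds, the rest skip.

```shell
# Analogy only: mkdir(2) is atomic, like a NATS KV Create.
claim="$(mktemp -d)/cron-claim"
for node in node-1 node-2 node-3; do
  if mkdir "$claim" 2>/dev/null; then
    echo "$node claimed the trigger"
  else
    echo "$node lost the race, skipping"
  fi
done
```

Only the first node prints "claimed"; later attempts fail without blocking, which is the property the cron deduplication relies on.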
Scheduler wakeups: When sleeping or waiting steps are ready, each node claims them with an UPDATE whose row selection uses FOR UPDATE SKIP LOCKED. PostgreSQL atomically assigns each row to exactly one node, so no step is processed twice. If a node crashes mid-processing, the stale claim recovery goroutine (runs every 60s) resets claimed rows after a configurable threshold (default: 2 minutes).
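The claim query could look roughly like the following sketch. The table and claim columns (`status`, `claimed_by`, `claimed_at`, `pre_claim_status`) match the recovery SQL later on this page; `id`, `wake_at`, the status values, and the batch size are assumptions for illustration:

```sql
-- Illustrative only: claim up to 100 ready steps for this node.
-- Rows locked by another node's transaction are skipped, not waited on.
UPDATE steps
SET pre_claim_status = status,
    status           = 'waking',
    claimed_by       = 'node-1',
    claimed_at       = NOW()
WHERE id IN (
    SELECT id FROM steps
    WHERE status IN ('sleeping', 'waiting')
      AND wake_at <= NOW()
    LIMIT 100
    FOR UPDATE SKIP LOCKED
);
```

SKIP LOCKED is what makes this safe without a leader: contention never causes waiting, only smaller batches.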
Job dispatch: All pull-mode jobs are published to the NATS STEPS stream (WorkQueuePolicy). Any node’s connected worker can consume them. Jobs survive node crashes — NATS redelivers after AckWait.
Known Limitations
- WebSocket subscriptions are node-local. Clients that reconnect to a different node lose live subscription state. NATS KV-backed subscriptions (Phase 6) will address this.
Production Considerations
NATS cluster quorum: A 3-node NATS cluster requires 2-of-3 nodes to write. Losing 2 nodes simultaneously will stall JetStream writes. Plan your cluster size and replication factor accordingly.
JetStream replication: Set Replicas in your NATS stream config to match your cluster size for durability. The default is Replicas=1 (no replication — fastest but not fault-tolerant).
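For a 3-node cluster, the relevant fragment of the stream configuration would look like this (field names follow the JetStream API; the values are illustrative for the STEPS stream described above):

```json
{
  "name": "STEPS",
  "retention": "workqueue",
  "num_replicas": 3
}
```

With `num_replicas: 3`, each message survives the loss of one NATS node.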
PostgreSQL connection pool: Each Ironflow node uses 25 max connections by default. With 3 nodes, plan for ~75 total connections.
Stale claim threshold: Tune IRONFLOW_STALE_CLAIM_THRESHOLD (default: 2m) based on your acceptable recovery latency after a node failure.
Graceful shutdown: Send SIGTERM to allow in-flight executions to drain before the process exits. The drain timeout is 30s by default.
Troubleshooting
external NATS requires PostgreSQL
You set --nats-url but IRONFLOW_DATABASE_URL is not set. Multi-node requires PostgreSQL.
connect to external NATS: connection refused
NATS is not reachable at the given URL. Check the URL, firewall rules, and NATS health (curl http://nats-host:8222/healthz).
Steps stuck in waking status
A node crashed after claiming steps. Wait for StaleClaimThreshold + 60s for automatic recovery, or run:
```sql
UPDATE steps
SET status = pre_claim_status, claimed_by = NULL, claimed_at = NULL, pre_claim_status = NULL
WHERE status = 'waking' AND claimed_at < NOW() - INTERVAL '5 minutes';
```

Duplicate runs in database
Cron deduplication requires the NATS KV SYS_cron_triggers bucket. Check NATS connectivity:
```shell
nats kv ls   # should show SYS_cron_triggers
```