Failure & Recovery Scenarios
Ironflow pods crashing (CrashLoopBackOff)
Trigger: kubectl get pods shows CrashLoopBackOff. Pods start and immediately exit.
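A quick way to confirm which server pods are affected and how many times they have restarted (the RESTARTS column):

kubectl get pods -n ironflow -l app.kubernetes.io/component=server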
Steps:
# 1. Check pod events for scheduling or pull errors
kubectl describe pod -n ironflow -l app.kubernetes.io/component=server | tail -30

# 2. Check logs from the crashed container (--previous gets logs from the last crash)
kubectl logs -n ironflow -l app.kubernetes.io/component=server --previous --tail=50

Identify the cause from the logs and follow the appropriate branch:
OOM Killed — logs show OOMKilled in pod status or Killed in dmesg:
# Confirm OOM
kubectl get pods -n ironflow -l app.kubernetes.io/component=server -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'

# If "OOMKilled", increase memory limits
ironflow deploy upgrade --template medium --name my-release \
  --set resources.limits.memory=1Gi \
  --set resources.requests.memory=512Mi

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values \
#   --set resources.limits.memory=1Gi \
#   --set resources.requests.memory=512Mi

Bad configuration — logs show config parse errors, missing env vars, or invalid flags:
# Inspect the ConfigMap and environment
kubectl get configmap -n ironflow -l app.kubernetes.io/instance=ironflow -o yaml
kubectl get deployment ironflow -n ironflow -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .

Dependency unavailable — logs show connection refused to NATS or PostgreSQL:
# Check NATS health
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/healthz

# Check PostgreSQL health
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- pg_isready -h localhost

# If the dependencies run in the ironflow-system namespace instead (NATS pods nats-0/1/2,
# CNPG cluster ironflow-db), use:
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/healthz
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- pg_isready -h localhost

If NATS or PostgreSQL is down, resolve the dependency first. See IronflowDown runbook for the full decision tree.
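If the logs don't make the failing dependency obvious, grepping recent server output for connection errors usually narrows it down; the exact log wording varies by Ironflow version, so treat the pattern below as a starting point:

kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=200 \
  | grep -iE "nats|postgres|connection refused|timeout"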
NATS cluster loses quorum
Trigger: 2 of 3 NATS pods down. JetStream operations failing. Workers disconnecting. Applies to Medium template only (Small uses a single NATS pod; Large uses external NATS).
Steps:
# 1. Check NATS pod status
kubectl get pods -n ironflow -l app.kubernetes.io/name=nats

# 2. Check events on failing pods
kubectl describe pod -n ironflow ironflow-nats-1
kubectl describe pod -n ironflow ironflow-nats-2

# 3. Check PVCs — if stuck in Pending, the volume couldn't bind
kubectl get pvc -n ironflow -l app.kubernetes.io/name=nats

# 4. Check logs for raft election or storage errors
kubectl logs -n ironflow ironflow-nats-0 -c nats --tail=30
kubectl logs -n ironflow ironflow-nats-1 -c nats --tail=30

# 5. Delete stuck pods to let the StatefulSet controller recreate them
kubectl delete pod -n ironflow ironflow-nats-1
kubectl delete pod -n ironflow ironflow-nats-2

# 6. Wait for pods to come back
kubectl rollout status statefulset/ironflow-nats -n ironflow --timeout=120s

# 7. Verify cluster health — all 3 nodes should appear in routes
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/routez | python3 -m json.tool
# "num_routes" should equal 2 (each node has routes to the other 2)
# 8. Verify JetStream is operational
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/jsz

If NATS runs in the ironflow-system namespace instead (pods nats-0/1/2, statefulset nats), the equivalent steps are:

# 1. Check NATS pod status
kubectl get pods -n ironflow-system -l app.kubernetes.io/name=nats

# 2. Check events on failing pods
kubectl describe pod -n ironflow-system nats-1
kubectl describe pod -n ironflow-system nats-2

# 3. Check PVCs
kubectl get pvc -n ironflow-system -l app.kubernetes.io/name=nats

# 4. Check logs
kubectl logs -n ironflow-system nats-0 -c nats --tail=30
kubectl logs -n ironflow-system nats-1 -c nats --tail=30

# 5. Delete stuck pods to let the StatefulSet controller recreate them
kubectl delete pod -n ironflow-system nats-1
kubectl delete pod -n ironflow-system nats-2

# 6. Wait for pods to come back
kubectl rollout status statefulset/nats -n ironflow-system --timeout=120s

# 7. Verify cluster health
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/routez | python3 -m json.tool

# 8. Verify JetStream
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/jsz

If pods keep crashing after recreation, check the NATS JetStream data directory for corruption. As a last resort, delete the PVCs (this loses JetStream state — Ironflow recreates streams on startup, but in-flight messages are lost):
# CAUTION: data loss — only if cluster cannot recover
kubectl delete pvc -l app.kubernetes.io/name=nats -n <namespace>
kubectl delete pod -l app.kubernetes.io/name=nats -n <namespace>

See NATSDown runbook for additional diagnostics.
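If you had to delete the PVCs, confirm that Ironflow recreated its streams once the server pods reconnected. The nats CLI already available in the NATS container can list them (stream names depend on your Ironflow version):

kubectl exec -n <namespace> <nats-pod> -c nats -- nats stream ls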
PostgreSQL primary down (unplanned failover)
Trigger: CNPG auto-promotes a replica to primary. Brief 503s from Ironflow during failover (typically < 30 seconds).
Steps:
# 1. Check CNPG cluster status
kubectl get cluster ironflow-postgresql -n ironflow
# STATUS should transition from "Failover in progress" to "Cluster in healthy state"

# 2. Monitor the failover progress
kubectl describe cluster ironflow-postgresql -n ironflow | grep -A 10 "Status:"

# 3. Identify the new primary
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary
# A different pod should now be primary

# 4. Verify the new primary is accepting connections
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- pg_isready -h localhost

# 5. Verify Ironflow has reconnected
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20 | grep -i "database\|postgres\|connect"

# 6. Check that the readiness probe is passing again
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
# All pods should show 1/1 Ready

# If the CNPG cluster is ironflow-db in the ironflow-system namespace, run steps 1-4 there instead:
kubectl get cluster ironflow-db -n ironflow-system
kubectl describe cluster ironflow-db -n ironflow-system | grep -A 10 "Status:"
kubectl get pods -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- pg_isready -h localhost

Post-failover notes:
- CNPG handles promotion automatically. No manual intervention is needed unless the cluster status stays unhealthy.
- The old primary pod will restart as a replica. Check that it rejoins:
  kubectl get pods -l cnpg.io/cluster=<cluster-name> -n <namespace>
- If Ironflow pods do not recover within 60 seconds, restart them:
  kubectl rollout restart deployment/ironflow -n ironflow
See PostgreSQLDown runbook for extended diagnostics and manual promotion procedures.
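As an additional post-failover check, the new primary should list its replica(s) in streaming state in pg_stat_replication (adjust the namespace and cluster labels to your deployment):

kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c \
  "SELECT application_name, state, sync_state FROM pg_stat_replication;"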
Node failure (Hetzner VM dies)
Trigger: kubectl get nodes shows a node NotReady. Pods on that node evicted.
Steps:
# 1. Check node status
kubectl get nodes -o wide
# Look for NotReady status and the affected node name

# 2. Check what was running on the failed node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# 3. Kubernetes reschedules pods automatically (PDB ensures availability)
# Watch pod rescheduling — new pods should appear on healthy nodes
kubectl get pods -n ironflow -w

# 4. Verify all components are healthy after rescheduling
kubectl get pods -n ironflow -l app.kubernetes.io/component=server

Verify infrastructure pods have rescheduled:
# Dependencies in the ironflow namespace:
kubectl get pods -n ironflow -l app.kubernetes.io/name=nats
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql

# Or, if they run in the ironflow-system namespace:
kubectl get pods -n ironflow-system -l app.kubernetes.io/name=nats
kubectl get pods -n ironflow-system -l cnpg.io/cluster=ironflow-db

If the node won't recover:
# 5. Remove the dead node from the cluster
kubectl delete node <node-name>

# 6. Replace the node via Terraform
cd deploy/terraform/hetzner
terraform plan   # Review — should show 1 node to replace
terraform apply  # Creates new VM, joins cluster

# Or via hcloud CLI if not using Terraform
hcloud server delete <server-name>
# Then re-run provisioning to add a replacement node
ironflow provision create --provider hetzner --template medium --name ironflow

StatefulSet pods (NATS, PostgreSQL): These use PVCs backed by networked storage. When the pod reschedules to a new node, Kubernetes reattaches the volume. This may take 1-2 minutes while the volume detaches from the dead node. If the volume stays stuck in Attaching state for more than 5 minutes, force-detach via the Hetzner Cloud Console or hcloud volume detach <volume-id>.
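To find a volume that is still attached to the dead node, the VolumeAttachment objects show which node each CSI volume is currently attached to:

kubectl get volumeattachment
# The NODE column shows where each volume is attached; look for entries still pointing at the dead node
kubectl get volumeattachment | grep <node-name>
# For CSI volumes, the volume handle is typically the cloud volume ID needed for hcloud volume detach
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'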
Persistent volume full (NATS or PostgreSQL)
Trigger: DiskSpaceLow alert fires. JetStream or PostgreSQL can’t write.
Steps:
# 1. Check PVC usage — NATS
kubectl exec -n ironflow ironflow-nats-0 -c nats -- df -h /data
# Look at Use% — over 85% is concerning, over 95% is critical

# 2. Check PVC usage — PostgreSQL
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- df -h /var/lib/postgresql/data

# 3. Check current PVC sizes
kubectl get pvc -n ironflow

# If the dependencies run in the ironflow-system namespace instead:
kubectl exec -n ironflow-system nats-0 -c nats -- df -h /data
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- df -h /var/lib/postgresql/data
kubectl get pvc -n ironflow-system

Option A: Expand the PVC (if the storage class supports volume expansion):
# Check if expansion is allowed
kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}: allowVolumeExpansion={.allowVolumeExpansion}{"\n"}{end}'

# Expand a PVC (example: NATS data volume)
kubectl patch pvc <pvc-name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# The expansion happens online — no pod restart needed for most CSI drivers

Option B: Clean old data — NATS:
# Check JetStream stream sizes
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/jsz?streams=true

# Purge a specific stream (removes all messages — use with caution)
# This is safe for Ironflow because historical data lives in PostgreSQL, not NATS
kubectl exec -n ironflow ironflow-nats-0 -c nats -- nats stream purge <stream-name> --force

# If NATS runs in ironflow-system, use that namespace and pod name instead:
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/jsz?streams=true
kubectl exec -n ironflow-system nats-0 -c nats -- nats stream purge <stream-name> --force

Option C: Clean old data — PostgreSQL:
# Connect to the primary and run vacuum (use ANALYZE, not FULL — FULL needs
# extra disk space equal to table size, which you don't have when the volume is full)
kubectl exec -it -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c "VACUUM ANALYZE;"

# Check for bloat in large tables
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c "
    SELECT schemaname, tablename,
           pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 10;"

# If PostgreSQL runs in ironflow-system (cluster ironflow-db), use those names instead:
kubectl exec -it -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- psql -U ironflow -d ironflow -c "VACUUM ANALYZE;"
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- psql -U ironflow -d ironflow -c "
    SELECT schemaname, tablename,
           pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 10;"

Option D: Reduce NATS JetStream storage so older messages are discarded sooner:
# Reduce JetStream file store PVC size
ironflow deploy upgrade --template medium --name my-release \
  --set nats.config.jetstream.fileStore.pvc.size=5Gi

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set nats.config.jetstream.fileStore.pvc.size=5Gi

See kubectl Operations for more NATS and PostgreSQL diagnostic commands.
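Also, if you expanded a PVC via Option A, confirm that the resize completed: the CAPACITY column in kubectl get pvc shows the new size once the filesystem has been grown, and an in-progress resize appears under the PVC's Conditions:

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace> | grep -A5 Conditions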
Network partition between namespaces
Trigger: Ironflow can’t reach NATS or PostgreSQL across namespaces. Connection timeouts in logs. Applies to Large template only (Small/Medium run everything in the same namespace).
Steps:
# 1. Test connectivity from the ironflow namespace to NATS in ironflow-system
kubectl run nettest --image=busybox --rm -it --restart=Never -n ironflow \
  -- nc -zv nats.ironflow-system.svc.cluster.local 4222
# Expected: "open" — if "Connection timed out", there's a network issue

# 2. Test connectivity to PostgreSQL
kubectl run nettest --image=busybox --rm -it --restart=Never -n ironflow \
  -- nc -zv ironflow-db-rw.ironflow-system.svc.cluster.local 5432

# 3. Verify DNS resolution works
kubectl run dnstest --image=busybox --rm -it --restart=Never -n ironflow \
  -- nslookup nats.ironflow-system.svc.cluster.local

# 4. Check NetworkPolicy — are cross-namespace connections allowed?
kubectl get networkpolicy -n ironflow
kubectl get networkpolicy -n ironflow-system
# Look for rules that allow traffic from the ironflow namespace

# 5. If using Cilium, check Cilium pod health
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium
kubectl exec -n kube-system -l app.kubernetes.io/name=cilium -- cilium status

# 6. Check Cilium network policies (CiliumNetworkPolicy resources)
kubectl get ciliumnetworkpolicy --all-namespaces

# 7. If NetworkPolicy is blocking traffic, update to allow cross-namespace access
ironflow deploy upgrade --template large --name my-release \
  --set networkPolicy.allowNamespaces[0]=ironflow-system \
  --set networkPolicy.allowNamespaces[1]=monitoring

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values \
#   --set networkPolicy.allowNamespaces[0]=ironflow-system \
#   --set networkPolicy.allowNamespaces[1]=monitoring

If DNS is failing but the pods are running, check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20

Stale scheduler claims (multi-node)
Trigger: Runs stuck in “waking” state. One or more Ironflow pods died while holding step claims. Applies to Medium/Large templates (cluster mode with multiple replicas).
How it works: In cluster mode, each Ironflow node claims steps using SKIP LOCKED in PostgreSQL. If a node dies, its claimed steps remain locked until the reclaim goroutine (runs every 60 seconds on all nodes) detects and resets them. The stale threshold defaults to 2 minutes (IRONFLOW_STALE_CLAIM_THRESHOLD).
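The claim query is roughly of the following shape (an illustrative sketch only, not the exact statement Ironflow runs; the column names match the steps table used in the diagnostics below):

-- Sketch of the SKIP LOCKED claim pattern: each node atomically claims a batch of
-- pending steps, skipping rows that another node has already locked.
-- 'node-a' and the batch size are placeholder values.
UPDATE steps
SET status = 'waking', claimed_by = 'node-a', claimed_at = NOW()
WHERE id IN (
  SELECT id FROM steps
  WHERE status = 'pending'
  FOR UPDATE SKIP LOCKED
  LIMIT 10
)
RETURNING id;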
Steps:
# 1. Check that all Ironflow pods are running
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
# If pods are down, the reclaim goroutine can't run — restart them first

# 2. Check logs for reclaim activity
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=50 | grep -i "reclaim\|stale"

# 3. Connect to PostgreSQL to inspect stale claims
kubectl exec -it -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow

# If PostgreSQL runs in ironflow-system (cluster ironflow-db), connect with:
kubectl exec -it -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- psql -U ironflow -d ironflow

# Inside psql:
# Check for steps claimed more than 2 minutes ago (stale)
# SELECT id, run_id, status, claimed_by, claimed_at
# FROM steps
# WHERE status = 'waking' AND claimed_at < NOW() - INTERVAL '2 minutes'
# ORDER BY claimed_at;

# 4. If the automatic reclaim hasn't kicked in yet (wait up to 60s + the 2m threshold),
#    you can manually reset stale claims:
# UPDATE steps SET status = 'pending', claimed_by = NULL, claimed_at = NULL
# WHERE status = 'waking' AND claimed_at < NOW() - INTERVAL '2 minutes';

Post-recovery checks:
# Verify runs are progressing again
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20 | grep -i "step\|run\|complet"

# If runs are still stuck, check that NATS is healthy (pull-mode dispatch goes through NATS)
kubectl exec -n <namespace> <nats-pod> -c nats -- wget -qO- http://localhost:8222/healthz

Tuning the stale threshold: If nodes frequently crash and 2 minutes is too long to wait, lower the threshold:
ironflow deploy upgrade --template medium --name my-release \
  --set cluster.staleClaimThreshold=60s

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set cluster.staleClaimThreshold=60s

See kubectl Operations for PostgreSQL connection instructions and NATS diagnostic commands.
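If you need to see which node names currently hold claims (for example, to confirm they belong to a pod that no longer exists), a quick aggregate over the same steps columns works; adjust the namespace and cluster labels to your deployment:

kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c \
  "SELECT claimed_by, COUNT(*) AS claimed_steps, MIN(claimed_at) AS oldest_claim
   FROM steps WHERE status = 'waking' GROUP BY claimed_by;"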
Tenant hitting ResourceQuota limits (multi-tenant cluster)
Trigger: A tenant’s pods are stuck in Pending with events like exceeded quota or forbidden: exceeded quota.
Diagnose:
# 1. Check quota usage vs limits
kubectl describe resourcequota -n tenant-acme

# Example output showing a CPU limit hit:
# Resource      Used  Hard
# --------      ----  ----
# limits.cpu    4     4     ← at limit
# requests.cpu  2     2     ← at limit

# 2. Check which pods are consuming resources
kubectl top pods -n tenant-acme --sort-by=cpu

# 3. Check for pending pods and their events
kubectl get pods -n tenant-acme --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n tenant-acme | grep -A5 Events

Resolution:
# Option A: Increase the tenant's quota
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set resourceQuota.cpu.limits=8 \
  --set resourceQuota.memory.limits=16Gi

# Option B: Reduce the tenant's resource requests if over-provisioned
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set resources.requests.cpu=50m \
  --set resources.requests.memory=128Mi

Cross-tenant network leak diagnosis (multi-tenant cluster)
Trigger: You suspect a tenant can reach another tenant’s services, or a network policy is not enforcing isolation correctly.
Diagnose:
# 1. Verify default-deny policy exists in both namespaces
kubectl get networkpolicy -n tenant-acme
kubectl get networkpolicy -n tenant-globex
# Both should have: *-default-deny, *-allow-dns, *-allow-intra-namespace, and the ironflow policy

# 2. Test cross-tenant connectivity (should fail)
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://globex-ironflow.tenant-globex:9123/health 2>&1 || echo "Blocked (expected)"

# 3. If it succeeds, check that defaultDeny is enabled
helm get values acme -n tenant-acme | grep -A2 networkPolicy
# networkPolicy.enabled should be true
# networkPolicy.defaultDeny should be true

# 4. Check the CNI supports NetworkPolicy (Calico, Cilium, etc.)
kubectl get pods -n kube-system | grep -E "calico|cilium"

Resolution: If defaultDeny was not enabled, upgrade the tenant:
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com
# values-multi-tenant.yaml already sets defaultDeny: true

CNI requirement
NetworkPolicy enforcement requires a CNI that supports it. Flannel does NOT enforce NetworkPolicies. Use Calico, Cilium, or another policy-aware CNI. Hetzner clusters provisioned with ironflow provision use Cilium by default.
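After enabling defaultDeny or switching to a policy-aware CNI, it is worth re-running the cross-tenant probe from the diagnosis steps and confirming that the tenant's own intra-namespace traffic still works (this assumes the same service naming and port as the probe above):

# Cross-tenant request should now be blocked
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://globex-ironflow.tenant-globex:9123/health 2>&1 || echo "Blocked (expected)"

# Intra-namespace request should still succeed
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://acme-ironflow.tenant-acme:9123/health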