Failure & Recovery Scenarios
Ironflow pods crashing (CrashLoopBackOff)
Trigger: kubectl get pods shows CrashLoopBackOff. Pods start and immediately exit.
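A quick way to confirm which server pods are affected and how many times they have restarted (the RESTARTS column):

kubectl get pods -n ironflow -l app.kubernetes.io/component=server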
Steps:
# 1. Check pod events for scheduling or pull errors
kubectl describe pod -n ironflow -l app.kubernetes.io/component=server | tail -30

# 2. Check logs from the crashed container (--previous gets logs from the last crash)
kubectl logs -n ironflow -l app.kubernetes.io/component=server --previous --tail=50

Identify the cause from the logs and follow the appropriate branch:
OOM Killed — logs show OOMKilled in pod status or Killed in dmesg:
# Confirm OOM
kubectl get pods -n ironflow -l app.kubernetes.io/component=server -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'

# If "OOMKilled", increase memory limits
ironflow deploy upgrade --template medium --name my-release \
  --set resources.limits.memory=1Gi \
  --set resources.requests.memory=512Mi

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values \
#   --set resources.limits.memory=1Gi \
#   --set resources.requests.memory=512Mi

Bad configuration — logs show config parse errors, missing env vars, or invalid flags:
# Inspect the ConfigMap and environment
kubectl get configmap -n ironflow -l app.kubernetes.io/instance=ironflow -o yaml
kubectl get deployment ironflow -n ironflow -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .

Dependency unavailable — logs show connection refused to NATS or PostgreSQL:
# Check NATS health
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/healthz

# Check PostgreSQL health
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- pg_isready -h localhost

# If the dependencies run in the ironflow-system namespace instead (NATS pods nats-0/1/2,
# CNPG cluster ironflow-db), use:
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/healthz
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- pg_isready -h localhost

If NATS or PostgreSQL is down, resolve the dependency first. See IronflowDown runbook for the full decision tree.
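If the logs don't make the failing dependency obvious, grepping recent server output for connection errors usually narrows it down; the exact log wording varies by Ironflow version, so treat the pattern below as a starting point:

kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=200 \
  | grep -iE "nats|postgres|connection refused|timeout"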
NATS cluster loses quorum
Trigger: 2 of 3 NATS pods down. JetStream operations failing. Workers disconnecting. Applies to Medium template only (Small uses a single NATS pod; Large uses external NATS).
Steps:
# 1. Check NATS pod status
kubectl get pods -n ironflow -l app.kubernetes.io/name=nats

# 2. Check events on failing pods
kubectl describe pod -n ironflow ironflow-nats-1
kubectl describe pod -n ironflow ironflow-nats-2

# 3. Check PVCs — if stuck in Pending, the volume couldn't bind
kubectl get pvc -n ironflow -l app.kubernetes.io/name=nats

# 4. Check logs for raft election or storage errors
kubectl logs -n ironflow ironflow-nats-0 -c nats --tail=30
kubectl logs -n ironflow ironflow-nats-1 -c nats --tail=30

# 5. Delete stuck pods to let the StatefulSet controller recreate them
kubectl delete pod -n ironflow ironflow-nats-1
kubectl delete pod -n ironflow ironflow-nats-2

# 6. Wait for pods to come back
kubectl rollout status statefulset/ironflow-nats -n ironflow --timeout=120s

# 7. Verify cluster health — all 3 nodes should appear in routes
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/routez | python3 -m json.tool
# "num_routes" should equal 2 (each node has routes to the other 2)
# 8. Verify JetStream is operational
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/jsz

If NATS runs in the ironflow-system namespace instead (pods nats-0/1/2, statefulset nats), the equivalent steps are:

# 1. Check NATS pod status
kubectl get pods -n ironflow-system -l app.kubernetes.io/name=nats

# 2. Check events on failing pods
kubectl describe pod -n ironflow-system nats-1
kubectl describe pod -n ironflow-system nats-2

# 3. Check PVCs
kubectl get pvc -n ironflow-system -l app.kubernetes.io/name=nats

# 4. Check logs
kubectl logs -n ironflow-system nats-0 -c nats --tail=30
kubectl logs -n ironflow-system nats-1 -c nats --tail=30

# 5. Delete stuck pods to let the StatefulSet controller recreate them
kubectl delete pod -n ironflow-system nats-1
kubectl delete pod -n ironflow-system nats-2

# 6. Wait for pods to come back
kubectl rollout status statefulset/nats -n ironflow-system --timeout=120s

# 7. Verify cluster health
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/routez | python3 -m json.tool

# 8. Verify JetStream
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/jsz

If pods keep crashing after recreation, check the NATS JetStream data directory for corruption. As a last resort, delete the PVCs (this loses JetStream state — Ironflow recreates streams on startup, but in-flight messages are lost):
# CAUTION: data loss — only if cluster cannot recover
kubectl delete pvc -l app.kubernetes.io/name=nats -n <namespace>
kubectl delete pod -l app.kubernetes.io/name=nats -n <namespace>

See NATSDown runbook for additional diagnostics.
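If you had to delete the PVCs, confirm that Ironflow recreated its streams once the server pods reconnected. The nats CLI already available in the NATS container can list them (stream names depend on your Ironflow version):

kubectl exec -n <namespace> <nats-pod> -c nats -- nats stream ls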
PostgreSQL primary down (unplanned failover)
Trigger: CNPG auto-promotes a replica to primary. Brief 503s from Ironflow during failover (typically < 30 seconds).
Steps:
# 1. Check CNPG cluster status
kubectl get cluster ironflow-postgresql -n ironflow
# STATUS should transition from "Failover in progress" to "Cluster in healthy state"

# 2. Monitor the failover progress
kubectl describe cluster ironflow-postgresql -n ironflow | grep -A 10 "Status:"

# 3. Identify the new primary
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary
# A different pod should now be primary

# 4. Verify the new primary is accepting connections
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- pg_isready -h localhost

# 5. Verify Ironflow has reconnected
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20 | grep -i "database\|postgres\|connect"

# 6. Check that the readiness probe is passing again
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
# All pods should show 1/1 Ready

# If the CNPG cluster is ironflow-db in the ironflow-system namespace, run steps 1-4 there instead:
kubectl get cluster ironflow-db -n ironflow-system
kubectl describe cluster ironflow-db -n ironflow-system | grep -A 10 "Status:"
kubectl get pods -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- pg_isready -h localhost

Post-failover notes:
- CNPG handles promotion automatically. No manual intervention is needed unless the cluster status stays unhealthy.
- The old primary pod will restart as a replica. Check that it rejoins:
  kubectl get pods -l cnpg.io/cluster=<cluster-name> -n <namespace>
- If Ironflow pods do not recover within 60 seconds, restart them:
  kubectl rollout restart deployment/ironflow -n ironflow
See PostgreSQLDown runbook for extended diagnostics and manual promotion procedures.
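As an additional post-failover check, the new primary should list its replica(s) in streaming state in pg_stat_replication (adjust the namespace and cluster labels to your deployment):

kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c \
  "SELECT application_name, state, sync_state FROM pg_stat_replication;"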
Node failure (Hetzner VM dies)
Trigger: kubectl get nodes shows a node NotReady. Pods on that node evicted.
Steps:
# 1. Check node status
kubectl get nodes -o wide
# Look for NotReady status and the affected node name

# 2. Check what was running on the failed node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# 3. Kubernetes reschedules pods automatically (PDB ensures availability)
# Watch pod rescheduling — new pods should appear on healthy nodes
kubectl get pods -n ironflow -w

# 4. Verify all components are healthy after rescheduling
kubectl get pods -n ironflow -l app.kubernetes.io/component=server

Verify infrastructure pods have rescheduled:
# Dependencies in the ironflow namespace:
kubectl get pods -n ironflow -l app.kubernetes.io/name=nats
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql

# Or, if they run in the ironflow-system namespace:
kubectl get pods -n ironflow-system -l app.kubernetes.io/name=nats
kubectl get pods -n ironflow-system -l cnpg.io/cluster=ironflow-db

If the node won't recover:
# 5. Remove the dead node from the cluster
kubectl delete node <node-name>

# 6. Replace the node via Terraform
cd deploy/terraform/hetzner
terraform plan   # Review — should show 1 node to replace
terraform apply  # Creates new VM, joins cluster

# Or via hcloud CLI if not using Terraform
hcloud server delete <server-name>
# Then re-run provisioning to add a replacement node
ironflow provision create --provider hetzner --template medium --name ironflow

StatefulSet pods (NATS, PostgreSQL): These use PVCs backed by networked storage. When the pod reschedules to a new node, Kubernetes reattaches the volume. This may take 1-2 minutes while the volume detaches from the dead node. If the volume stays stuck in Attaching state for more than 5 minutes, force-detach via the Hetzner Cloud Console or hcloud volume detach <volume-id>.
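To find a volume that is still attached to the dead node, the VolumeAttachment objects show which node each CSI volume is currently attached to:

kubectl get volumeattachment
# The NODE column shows where each volume is attached; look for entries still pointing at the dead node
kubectl get volumeattachment | grep <node-name>
# For CSI volumes, the volume handle is typically the cloud volume ID needed for hcloud volume detach
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'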
Persistent volume full (NATS or PostgreSQL)
Trigger: DiskSpaceLow alert fires. JetStream or PostgreSQL can’t write.
Steps:
# 1. Check PVC usage — NATS
kubectl exec -n ironflow ironflow-nats-0 -c nats -- df -h /data
# Look at Use% — over 85% is concerning, over 95% is critical

# 2. Check PVC usage — PostgreSQL
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- df -h /var/lib/postgresql/data

# 3. Check current PVC sizes
kubectl get pvc -n ironflow

# If the dependencies run in the ironflow-system namespace instead:
kubectl exec -n ironflow-system nats-0 -c nats -- df -h /data
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- df -h /var/lib/postgresql/data
kubectl get pvc -n ironflow-system

Option A: Expand the PVC (if the storage class supports volume expansion):
# Check if expansion is allowed
kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}: allowVolumeExpansion={.allowVolumeExpansion}{"\n"}{end}'

# Expand a PVC (example: NATS data volume)
kubectl patch pvc <pvc-name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# The expansion happens online — no pod restart needed for most CSI drivers

Option B: Clean old data — NATS:
# Check JetStream stream sizes
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/jsz?streams=true

# Purge a specific stream (removes all messages — use with caution)
# This is safe for Ironflow because historical data lives in PostgreSQL, not NATS
kubectl exec -n ironflow ironflow-nats-0 -c nats -- nats stream purge <stream-name> --force

# If NATS runs in ironflow-system, use that namespace and pod name instead:
kubectl exec -n ironflow-system nats-0 -c nats -- wget -qO- http://localhost:8222/jsz?streams=true
kubectl exec -n ironflow-system nats-0 -c nats -- nats stream purge <stream-name> --force

Option C: Clean old data — PostgreSQL:
# Connect to the primary and run vacuum (use ANALYZE, not FULL — FULL needs
# extra disk space equal to table size, which you don't have when the volume is full)
kubectl exec -it -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c "VACUUM ANALYZE;"

# Check for bloat in large tables
kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c "
    SELECT schemaname, tablename,
           pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 10;"

# If PostgreSQL runs in ironflow-system (cluster ironflow-db), use those names instead:
kubectl exec -it -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- psql -U ironflow -d ironflow -c "VACUUM ANALYZE;"
kubectl exec -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- psql -U ironflow -d ironflow -c "
    SELECT schemaname, tablename,
           pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 10;"

Option D: Reduce NATS JetStream storage so older messages are discarded sooner:
# Reduce JetStream file store PVC size
ironflow deploy upgrade --template medium --name my-release \
  --set nats.config.jetstream.fileStore.pvc.size=5Gi

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set nats.config.jetstream.fileStore.pvc.size=5Gi

See kubectl Operations for more NATS and PostgreSQL diagnostic commands.
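Also, if you expanded a PVC via Option A, confirm that the resize completed: the CAPACITY column in kubectl get pvc shows the new size once the filesystem has been grown, and an in-progress resize appears under the PVC's Conditions:

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace> | grep -A5 Conditions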
Network partition between namespaces
Trigger: Ironflow can’t reach NATS or PostgreSQL across namespaces. Connection timeouts in logs. Applies to Large template only (Small/Medium run everything in the same namespace).
Steps:
# 1. Test connectivity from the ironflow namespace to NATS in ironflow-system
kubectl run nettest --image=busybox --rm -it --restart=Never -n ironflow \
  -- nc -zv nats.ironflow-system.svc.cluster.local 4222
# Expected: "open" — if "Connection timed out", there's a network issue

# 2. Test connectivity to PostgreSQL
kubectl run nettest --image=busybox --rm -it --restart=Never -n ironflow \
  -- nc -zv ironflow-db-rw.ironflow-system.svc.cluster.local 5432

# 3. Verify DNS resolution works
kubectl run dnstest --image=busybox --rm -it --restart=Never -n ironflow \
  -- nslookup nats.ironflow-system.svc.cluster.local

# 4. Check NetworkPolicy — are cross-namespace connections allowed?
kubectl get networkpolicy -n ironflow
kubectl get networkpolicy -n ironflow-system
# Look for rules that allow traffic from the ironflow namespace

# 5. If using Cilium, check Cilium pod health
kubectl get pods -n kube-system -l app.kubernetes.io/name=cilium
kubectl exec -n kube-system -l app.kubernetes.io/name=cilium -- cilium status

# 6. Check Cilium network policies (CiliumNetworkPolicy resources)
kubectl get ciliumnetworkpolicy --all-namespaces

# 7. If NetworkPolicy is blocking traffic, update to allow cross-namespace access
ironflow deploy upgrade --template large --name my-release \
  --set networkPolicy.allowNamespaces[0]=ironflow-system \
  --set networkPolicy.allowNamespaces[1]=monitoring

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values \
#   --set networkPolicy.allowNamespaces[0]=ironflow-system \
#   --set networkPolicy.allowNamespaces[1]=monitoring

If DNS is failing but the pods are running, check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20

Stale scheduler claims (multi-node)
Trigger: Runs stuck in “waking” state. One or more Ironflow pods died while holding step claims. Applies to Medium/Large templates (cluster mode with multiple replicas).
How it works: In cluster mode, each Ironflow node claims steps using SKIP LOCKED in PostgreSQL. If a node dies, its claimed steps remain locked until the reclaim goroutine (runs every 60 seconds on all nodes) detects and resets them. The stale threshold defaults to 2 minutes (IRONFLOW_STALE_CLAIM_THRESHOLD).
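The claim query is roughly of the following shape (an illustrative sketch only, not the exact statement Ironflow runs; the column names match the steps table used in the diagnostics below):

-- Sketch of the SKIP LOCKED claim pattern: each node atomically claims a batch of
-- pending steps, skipping rows that another node has already locked.
-- 'node-a' and the batch size are placeholder values.
UPDATE steps
SET status = 'waking', claimed_by = 'node-a', claimed_at = NOW()
WHERE id IN (
  SELECT id FROM steps
  WHERE status = 'pending'
  FOR UPDATE SKIP LOCKED
  LIMIT 10
)
RETURNING id;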
Steps:
# 1. Check that all Ironflow pods are running
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
# If pods are down, the reclaim goroutine can't run — restart them first

# 2. Check logs for reclaim activity
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=50 | grep -i "reclaim\|stale"

# 3. Connect to PostgreSQL to inspect stale claims
kubectl exec -it -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow

# If PostgreSQL runs in ironflow-system (cluster ironflow-db), connect with:
kubectl exec -it -n ironflow-system -l cnpg.io/cluster=ironflow-db -l role=primary \
  -- psql -U ironflow -d ironflow

# Inside psql:
# Check for steps claimed more than 2 minutes ago (stale)
# SELECT id, run_id, status, claimed_by, claimed_at
# FROM steps
# WHERE status = 'waking' AND claimed_at < NOW() - INTERVAL '2 minutes'
# ORDER BY claimed_at;

# 4. If the automatic reclaim hasn't kicked in yet (wait up to 60s + the 2m threshold),
#    you can manually reset stale claims:
# UPDATE steps SET status = 'pending', claimed_by = NULL, claimed_at = NULL
# WHERE status = 'waking' AND claimed_at < NOW() - INTERVAL '2 minutes';

Post-recovery checks:
# Verify runs are progressing again
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20 | grep -i "step\|run\|complet"

# If runs are still stuck, check that NATS is healthy (pull-mode dispatch goes through NATS)
kubectl exec -n <namespace> <nats-pod> -c nats -- wget -qO- http://localhost:8222/healthz

Tuning the stale threshold: If nodes frequently crash and 2 minutes is too long to wait, lower the threshold:
ironflow deploy upgrade --template medium --name my-release \
  --set cluster.staleClaimThreshold=60s

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set cluster.staleClaimThreshold=60s

See kubectl Operations for PostgreSQL connection instructions and NATS diagnostic commands.
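If you need to see which node names currently hold claims (for example, to confirm they belong to a pod that no longer exists), a quick aggregate over the same steps columns works; adjust the namespace and cluster labels to your deployment:

kubectl exec -n ironflow -l cnpg.io/cluster=ironflow-postgresql -l role=primary \
  -- psql -U ironflow -d ironflow -c \
  "SELECT claimed_by, COUNT(*) AS claimed_steps, MIN(claimed_at) AS oldest_claim
   FROM steps WHERE status = 'waking' GROUP BY claimed_by;"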
Tenant hitting ResourceQuota limits (multi-tenant cluster)
Trigger: A tenant’s pods are stuck in Pending with events like exceeded quota or forbidden: exceeded quota.
Diagnose:
# 1. Check quota usage vs limits
kubectl describe resourcequota -n tenant-acme

# Example output showing a CPU limit hit:
# Resource      Used  Hard
# --------      ----  ----
# limits.cpu    4     4     ← at limit
# requests.cpu  2     2     ← at limit

# 2. Check which pods are consuming resources
kubectl top pods -n tenant-acme --sort-by=cpu

# 3. Check for pending pods and their events
kubectl get pods -n tenant-acme --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n tenant-acme | grep -A5 Events

Resolution:
# Option A: Increase the tenant's quota
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set resourceQuota.cpu.limits=8 \
  --set resourceQuota.memory.limits=16Gi

# Option B: Reduce the tenant's resource requests if over-provisioned
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set resources.requests.cpu=50m \
  --set resources.requests.memory=128Mi

Cross-tenant network leak diagnosis (multi-tenant cluster)
Trigger: You suspect a tenant can reach another tenant’s services, or a network policy is not enforcing isolation correctly.
Diagnose:
# 1. Verify default-deny policy exists in both namespaces
kubectl get networkpolicy -n tenant-acme
kubectl get networkpolicy -n tenant-globex
# Both should have: *-default-deny, *-allow-dns, *-allow-intra-namespace, and the ironflow policy

# 2. Test cross-tenant connectivity (should fail)
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://globex-ironflow.tenant-globex:9123/health 2>&1 || echo "Blocked (expected)"

# 3. If it succeeds, check that defaultDeny is enabled
helm get values acme -n tenant-acme | grep -A2 networkPolicy
# networkPolicy.enabled should be true
# networkPolicy.defaultDeny should be true

# 4. Check the CNI supports NetworkPolicy (Calico, Cilium, etc.)
kubectl get pods -n kube-system | grep -E "calico|cilium"

Resolution: If defaultDeny was not enabled, upgrade the tenant:
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com
# values-multi-tenant.yaml already sets defaultDeny: true

CNI requirement
NetworkPolicy enforcement requires a CNI that supports it. Flannel does NOT enforce NetworkPolicies. Use Calico, Cilium, or another policy-aware CNI. Hetzner clusters provisioned with ironflow provision use Cilium by default.
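After enabling defaultDeny or switching to a policy-aware CNI, it is worth re-running the cross-tenant probe from the diagnosis steps and confirming that the tenant's own intra-namespace traffic still works (this assumes the same service naming and port as the probe above):

# Cross-tenant request should now be blocked
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://globex-ironflow.tenant-globex:9123/health 2>&1 || echo "Blocked (expected)"

# Intra-namespace request should still succeed
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://acme-ironflow.tenant-acme:9123/health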