Overview

Real-world playbooks for situations you will encounter while operating Ironflow on Kubernetes. Each scenario includes a trigger (how you know it's happening), diagnosis steps, and a resolution.

| Category | What it covers |
| --- | --- |
| Demo Clusters | End-to-end single-tenant demo cluster on Hetzner with full SRE: provisioning, secrets, deploy, DNS/TLS, monitoring, Slack alerts, Healthchecks.io, teardown |
| Deployment | Bootstrap, rolling updates, rollbacks, zero-downtime migrations, monitoring stack setup/removal, tenant onboarding/offboarding, cluster teardown |
| Scaling | Replicas, scaling down for cost, worker nodes, NATS cluster, PgBouncer connection pooling, PostgreSQL failover, tenant resource scaling |
| Failure & Recovery | CrashLoopBackOff, NATS quorum loss, PostgreSQL failover, node failure, PVC full, network partitions, stale scheduler claims, tenant quota exhaustion, cross-tenant network isolation |
| Alerts & Monitoring | First-alert triage, broken Slack notifications, dead man's switch, Prometheus storage, Grafana "No Data", alert tuning and silencing |
| Security & Access | Rotating credentials: Slack webhook, Grafana admin, Ironflow master key, database |
| Maintenance | Node drains, Helm value changes, Prometheus backup, PrometheusRule/dashboard cleanup, tenant isolation audits, bulk tenant upgrades, tenant key rotation |
| Disaster Recovery | Full cluster rebuild, PostgreSQL restore from CNPG backup, namespace recovery, single-tenant restore |

For day-to-day kubectl commands, see kubectl Operations. For alert-specific runbooks, see Runbooks.