# Overview

## Operational Scenarios

Real-world playbooks for situations you will encounter operating Ironflow on Kubernetes. Each scenario has a trigger (how you know it’s happening), diagnosis steps, and a resolution.
| Category | What it covers |
|---|---|
| Demo Clusters | End-to-end single-tenant demo cluster on Hetzner with the full SRE lifecycle: provisioning, secrets, deploy, DNS/TLS, monitoring, Slack alerts, Healthchecks.io, teardown |
| Deployment | Bootstrap, rolling updates, rollbacks, zero-downtime migrations, monitoring stack setup/removal, tenant onboarding/offboarding, cluster teardown |
| Scaling | Replicas, scale down for cost, worker nodes, NATS cluster, PgBouncer connection pooling, PostgreSQL failover, tenant resource scaling |
| Failure & Recovery | CrashLoopBackOff, NATS quorum loss, PostgreSQL failover, node failure, PVC full, network partitions, stale scheduler claims, tenant quota exhaustion, cross-tenant network isolation |
| Alerts & Monitoring | First alert triage, broken Slack notifications, dead man’s switch, Prometheus storage, Grafana “No Data”, alert tuning and silencing |
| Security & Access | Credential rotation: Slack webhook, Grafana admin, Ironflow master key, database |
| Maintenance | Node drain, Helm value changes, Prometheus backup, PrometheusRule/dashboard cleanup, tenant isolation audit, bulk tenant upgrades, tenant key rotation |
| Disaster Recovery | Full cluster rebuild, PostgreSQL restore from CNPG backup, namespace recovery, single-tenant restore |
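Most of the diagnosis steps in these scenarios start from the same first-pass triage loop. A minimal sketch of that loop is below; it only *prints* the commands so you can review them before running anything, and the `ironflow` namespace and pod/deployment names are hypothetical placeholders, not values taken from this page.

```shell
# First-pass triage sketch. Prints each kubectl command instead of
# executing it; swap `echo` for real execution against a live cluster.
# NS and POD are hypothetical placeholders for your environment.
NS=ironflow
POD=ironflow-api-0

run() { echo "+ $*"; }

# 1. Trigger: an alert fired or a pod misbehaves. Check pod status first.
run kubectl get pods -n "$NS" -o wide

# 2. Diagnosis: recent events, pod details, and logs from the last crash.
run kubectl get events -n "$NS" --sort-by=.lastTimestamp
run kubectl describe pod "$POD" -n "$NS"
run kubectl logs "$POD" -n "$NS" --previous

# 3. Resolution: follow the matching scenario/runbook; a rolling restart
#    is a common first remediation for a wedged deployment.
run kubectl rollout restart deployment/ironflow-api -n "$NS"
```

The `run` wrapper is just a dry-run convenience for the sketch; the specific diagnosis and resolution for each category live in the scenario pages linked above.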
For day-to-day kubectl commands, see kubectl Operations. For alert-specific runbooks, see Runbooks.