Demo Clusters
Deploy a single-tenant Ironflow demo cluster on Hetzner Cloud with full SRE observability. This playbook produces a production-grade cluster with CEL policies, auth audit logging, platform API, Prometheus, Grafana dashboards, Slack alerting, Healthchecks.io dead-man’s switch, TLS, and automated PostgreSQL backups for approximately €21/month.
When to use this playbook:
- You need a live demo environment to showcase Ironflow
- You want to validate the Small template deployment end-to-end
- You need a staging cluster with full monitoring for development
Architecture
```text
                    HETZNER CLOUD (~€21/month)
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
      1x cpx22 Control               1x cpx32 Worker (8GB)
        (Talos Linux)                    (Talos Linux)
                                              │
         ┌──────────────────────┬─────────────┴────────┐
         ▼                      ▼                      ▼
  Ironflow (512Mi)         NATS (256Mi)        PostgreSQL (512Mi)
  :9123 HTTP/gRPC         JetStream 2Gi        CNPG 5Gi + S3 backups
         │                      │                      │
         └──────────────────────┴──────────────────────┘
                              │
  Internet ──▶ Hetzner LB (€6/mo) ──▶ Traefik ──▶ TLS ──▶ Ironflow

  kube-prometheus-stack (~1.5Gi)
  ├── Prometheus (512Mi, 10Gi PVC, 15d retention)
  ├── Grafana (256Mi, 4 dashboards)
  ├── Alertmanager → Slack + Healthchecks.io
  └── kube-state-metrics (256Mi)
```
Component Sizing
| Component | RAM Request | RAM Limit | Storage | Notes |
|---|---|---|---|---|
| Ironflow | 256Mi | 512Mi | — | Single replica, bundled mode |
| NATS | 128Mi | 256Mi | 2Gi PVC | JetStream, single node |
| PostgreSQL | 256Mi | 512Mi | 5Gi PVC | CloudNativePG, 1 instance |
| Prometheus | 512Mi | 1Gi | 10Gi PVC | 15-day retention, hcloud-volumes |
| Grafana | 128Mi | 256Mi | — | 4 pre-built dashboards |
| Alertmanager | 50Mi | 128Mi | — | Slack + Healthchecks.io routing |
| kube-state-metrics | 128Mi | 256Mi | — | Kubernetes metrics |
| Traefik | 128Mi | 256Mi | — | 2 replicas, TLS termination |
| System pods | — | — | — | Cilium, CCM, cert-manager (~700Mi) |
| Total | — | — | — | ~2.5Gi used on 8Gi worker |
Cost Breakdown
| Component | Monthly Cost |
|---|---|
| 1x cpx22 (control plane, 3 vCPU, 4GB) | ~€4.50 |
| 1x cpx32 (worker, 4 vCPU, 8GB) | ~€8.50 |
| Hetzner LB11 (load balancer) | ~€6.00 |
| Object Storage (PostgreSQL backups) | ~€1.00 |
| Total | ~€20-21/month |
Isolation Model
This is a single-tenant demo cluster. All traffic goes to one Ironflow instance with one project and environment. For multi-app shared clusters, see issue #373.
| Layer | Mechanism | What it provides |
|---|---|---|
| API Key | Bound to one environment | Request authentication |
| Database filtering | WHERE environment_id = $N on all queries | Data isolation |
| NATS subject naming | ironflow.{project}.{env}.events.> | Message isolation |
| TLS | cert-manager + Let’s Encrypt | Encrypted transport |
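For example, a consumer scoped to this cluster's single environment subscribes under the documented subject pattern. A minimal sketch using the natscli client; the in-cluster service DNS name and port are assumptions, and the project/environment names are illustrative:

```bash
# Subscribe to every event for the demo project's production environment.
# The trailing ">" wildcard matches all tokens after "events.".
# Run from inside the cluster (or over a port-forward to the NATS service).
nats sub "ironflow.demo.production.events.>" \
  -s nats://demo-nats.ironflow.svc.cluster.local:4222
```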
Prerequisites
Before starting, ensure you have:
| Requirement | How to get it |
|---|---|
| Hetzner Cloud API token | Hetzner Console → Security → API Tokens |
| GitHub PAT (read:packages scope) | GitHub Settings → Personal Access Tokens |
| Hetzner Object Storage bucket | Hetzner Console → Object Storage → Create Bucket (name: ironflow-backups, private) |
| Domain with DNS access | Any DNS provider (e.g., Cloudflare, Route 53) |
| Slack webhook URL | Slack App → Incoming Webhooks → Create |
| Healthchecks.io ping URL | Healthchecks.io → New Check (5-minute period, free tier) |
Set environment variables (or add them to a .env file and source it). If using a .env file, ensure it is in .gitignore to avoid committing credentials.
```bash
export HCLOUD_TOKEN="your-hetzner-api-token"
export GITHUB_USERNAME="your-github-username"
export GITHUB_PAT="ghp_your-personal-access-token"
export HETZNER_S3_ACCESS_KEY="your-s3-access-key"
export HETZNER_S3_SECRET_KEY="your-s3-secret-key"
export HETZNER_S3_ENDPOINT="https://fsn1.your-objectstorage.com"
export HETZNER_S3_BUCKET="ironflow-backups"
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../xxx"
export HEALTHCHECKS_PING_URL="https://hc-ping.com/your-uuid"
```
Source with `source .env`.
```fish
set -x HCLOUD_TOKEN "your-hetzner-api-token"
set -x GITHUB_USERNAME "your-github-username"
set -x GITHUB_PAT "ghp_your-personal-access-token"
set -x HETZNER_S3_ACCESS_KEY "your-s3-access-key"
set -x HETZNER_S3_SECRET_KEY "your-s3-secret-key"
set -x HETZNER_S3_ENDPOINT "https://fsn1.your-objectstorage.com"
set -x HETZNER_S3_BUCKET "ironflow-backups"
set -x SLACK_WEBHOOK_URL "https://hooks.slack.com/services/T.../B.../xxx"
set -x HEALTHCHECKS_PING_URL "https://hc-ping.com/your-uuid"
```
Fish does not source bash-style .env files. Either keep the variables in ~/.config/fish/conf.d/ironflow.fish or use bass: `bass source .env`.
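Before continuing, a quick sanity check (bash; run after `source .env`) catches any variable that didn't get set:

```bash
# Print any required variable that is unset or empty (uses bash indirect expansion)
for v in HCLOUD_TOKEN GITHUB_USERNAME GITHUB_PAT HETZNER_S3_ACCESS_KEY \
         HETZNER_S3_SECRET_KEY HETZNER_S3_ENDPOINT HETZNER_S3_BUCKET \
         SLACK_WEBHOOK_URL HEALTHCHECKS_PING_URL; do
  [ -n "${!v}" ] || echo "MISSING: $v"
done
```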
Phase 0: Build the Ironflow Binary
Build the Ironflow binary from source. This produces ./build/ironflow, which is used for all subsequent commands.
```bash
make all
```
Verify:
```bash
./build/ironflow version
```
Phase 1: Provision the Cluster
Provision a two-node Hetzner cluster using the Small template (1 control plane + 1 worker). The worker uses cpx32 (8GB RAM) to accommodate the full monitoring stack.
```bash
# Provision cluster (~5-8 minutes)
# Installs: Talos Linux, Cilium CNI, cert-manager, Hetzner CCM + CSI
./build/ironflow provision create --provider hetzner --template small --name demo
```
The provisioning command saves the kubeconfig to ~/.kube/clusters/hetzner-demo.yaml. Set it for all subsequent commands.
```bash
export KUBECONFIG=~/.kube/clusters/hetzner-demo.yaml
```
```fish
set -x KUBECONFIG $HOME/.kube/clusters/hetzner-demo.yaml
```
Fish + tilde gotcha
`export KUBECONFIG=~/.kube/...` does not work in fish. Tilde is only expanded at the start of a word, not after `=`, so `KUBECONFIG` ends up as the literal string `~/.kube/...`. kubectl and helm then fail to open the file and silently fall back to localhost:8080, producing errors like `kubernetes cluster unreachable: Get "http://localhost:8080/version"`. Use `$HOME` or an absolute path.
Verify the variable resolved to a real path:
```bash
echo $KUBECONFIG     # must show /Users/.../hetzner-demo.yaml, not ~/...
kubectl cluster-info # must hit the Hetzner API server, not localhost:8080
```
Verify:
```bash
kubectl get nodes
```
Expected output (both nodes Ready):
```text
NAME               STATUS   ROLES           AGE   VERSION
control-demo-...   Ready    control-plane   5m    v1.31.x
worker-demo-...    Ready    <none>          4m    v1.31.x
```
Troubleshooting
If nodes show NotReady, wait 2-3 minutes for Talos to finish bootstrapping. The provisioning command retries up to 30 times (5 minutes total). If nodes remain NotReady after 10 minutes, check Hetzner Console for node status and run talosctl health for diagnostics.
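Instead of polling by hand, you can let kubectl block until both nodes are Ready, using the same 10-minute budget:

```bash
# Exits 0 once every node reports Ready; exits non-zero on timeout
kubectl wait --for=condition=Ready node --all --timeout=600s
```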
Phase 2: Create Secrets
Create the namespaces and secrets that the deployment and monitoring stack depend on. These must exist before running ironflow deploy.
Ironflow Secrets
```bash
# Create namespace
kubectl create namespace ironflow

# Container registry credentials (for pulling Ironflow images from GHCR)
kubectl create secret docker-registry ghcr-pull-secret -n ironflow \
  --docker-server=ghcr.io \
  --docker-username=$GITHUB_USERNAME \
  --docker-password=$GITHUB_PAT

# S3 backup credentials (for PostgreSQL WAL archiving and daily backups)
kubectl create secret generic ironflow-s3-creds -n ironflow \
  --from-literal=ACCESS_KEY_ID="$HETZNER_S3_ACCESS_KEY" \
  --from-literal=SECRET_ACCESS_KEY="$HETZNER_S3_SECRET_KEY"
```
Monitoring Secrets
```bash
# Create namespace
kubectl create namespace monitoring

# Slack webhook for alert notifications
kubectl create secret generic alertmanager-slack -n monitoring \
  --from-literal=webhook-url="$SLACK_WEBHOOK_URL"

# Healthchecks.io dead-man's switch (Watchdog alert pings this URL every minute;
# if it stops, Healthchecks.io alerts you that the monitoring stack itself is down)
kubectl create secret generic healthchecks-io -n monitoring \
  --from-literal=ping-url="$HEALTHCHECKS_PING_URL"

# Grafana admin credentials
kubectl create secret generic grafana-admin -n monitoring \
  --from-literal=admin-user=admin \
  --from-literal=admin-password="$(openssl rand -base64 16)"
```
Retrieve Grafana password later
```bash
kubectl get secret grafana-admin -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d && echo
```
Verify:
```bash
kubectl get secrets -n ironflow
kubectl get secrets -n monitoring
```
Expected: ghcr-pull-secret, ironflow-s3-creds in the ironflow namespace; alertmanager-slack, healthchecks-io, grafana-admin in the monitoring namespace.
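To make the check scriptable, the loop below asserts each expected secret exists (names exactly as created above):

```bash
# Fail loudly for any missing secret
for entry in ironflow/ghcr-pull-secret ironflow/ironflow-s3-creds \
             monitoring/alertmanager-slack monitoring/healthchecks-io \
             monitoring/grafana-admin; do
  ns=${entry%/*} name=${entry#*/}
  kubectl get secret "$name" -n "$ns" >/dev/null 2>&1 || echo "MISSING: $entry"
done
```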
Phase 3: Deploy Ironflow
Deploy Ironflow with the Small template and Hetzner load balancer support. This command automatically installs all prerequisites:
- CloudNativePG (PostgreSQL operator)
- Barman Cloud Plugin (S3-compatible backups)
- kube-prometheus-stack (Prometheus, Grafana, Alertmanager with full Hetzner config)
- Traefik (ingress controller with Hetzner LB annotations)
Build and push the image
Build a Docker image from the current code and push it to GHCR. The cluster will pull this image during deployment.
```bash
# Build and push the image (linux/amd64 for Hetzner)
docker build -t ghcr.io/sahina/ironflow:latest \
  --build-arg VERSION="$(git rev-parse --short HEAD)" \
  --platform linux/amd64 .
echo $GITHUB_PAT | docker login ghcr.io -u $GITHUB_USERNAME --password-stdin
docker push ghcr.io/sahina/ironflow:latest
```
Deploy
Generate the master key first
The masterKey encrypts secrets stored in NATS KV (AES-256). If you lose it, encrypted secrets become unreadable. Generate and save it before running deploy. You must pass the same key on every `ironflow deploy upgrade` command, or Helm will overwrite the stored Secret with the default value.
```bash
export IRONFLOW_MASTER_KEY=$(openssl rand -hex 32)
echo "SAVE THIS KEY: $IRONFLOW_MASTER_KEY"
```
```fish
set -x IRONFLOW_MASTER_KEY (openssl rand -hex 32)
echo "SAVE THIS KEY: $IRONFLOW_MASTER_KEY"
```
```bash
# Deploy Ironflow with full SRE stack
# The --hetzner-location flag enables Traefik + load balancer + full monitoring config
./build/ironflow deploy --template small --name demo \
  --hetzner-location fsn1 \
  --set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
  --set image.tag=latest
```
Wait for all pods (3-5 minutes; CNPG bootstrap is the slowest):
```bash
# Watch Ironflow pods
kubectl get pods -n ironflow -w

# Watch monitoring pods
kubectl get pods -n monitoring -w

# Verify PostgreSQL cluster is healthy
kubectl get cluster -n ironflow
# Expected: Phase = "Cluster in healthy state"

# Verify daily backups are scheduled
kubectl get scheduledbackups -n ironflow
# Expected: 1 ScheduledBackup (daily at 2 AM)
```
What gets deployed
```text
Namespace: ironflow
├── demo-ironflow (Deployment, 1 replica)
│   ├── ironflow container (:9123)
│   ├── CEL policy evaluator, auth audit recorder, platform API
│   └── init containers (wait-for-pg, wait-for-nats)
├── demo-ironflow-pg (CNPG Cluster, 1 instance)
│   ├── PostgreSQL 16
│   ├── 5Gi PVC (hcloud-volumes)
│   └── S3 WAL archiving (Barman Cloud Plugin)
├── demo-nats (StatefulSet, 1 replica)
│   ├── NATS with JetStream
│   ├── 2Gi PVC (hcloud-volumes)
│   └── Prometheus exporter sidecar (:7777)
└── demo-ironflow-pg-backup (ScheduledBackup, daily 2 AM)

Namespace: monitoring
├── kube-prometheus-stack-prometheus (StatefulSet)
│   ├── 10Gi PVC (hcloud-volumes, 15-day retention)
│   └── ServiceMonitor + PodMonitor selectors (all namespaces)
├── kube-prometheus-stack-grafana (Deployment)
│   ├── Dashboard sidecar (auto-discovers ConfigMaps)
│   └── Admin from grafana-admin secret
├── kube-prometheus-stack-alertmanager (StatefulSet)
│   ├── Slack receiver (#ironflow-alerts, 4h repeat)
│   ├── Healthchecks.io receiver (Watchdog, 1m repeat)
│   └── Inhibition: critical suppresses warning
└── kube-state-metrics (Deployment)

Namespace: traefik
└── traefik (Deployment, 2 replicas)
    ├── Hetzner LB annotations (private network, proxy protocol)
    └── Prometheus ServiceMonitor

Namespace: cnpg-system
├── cloudnative-pg (Deployment, CNPG operator)
└── barman-cloud-instance-manager (Deployment)
```
Phase 4: DNS and Ingress
Ingress is enabled in a two-phase process to avoid burning Let’s Encrypt rate limits. cert-manager requests a TLS certificate via HTTP-01 challenge, which requires DNS to resolve to the load balancer IP first.
Step 1: Get the Load Balancer IP
```bash
# Wait for Hetzner to provision the load balancer (1-2 minutes)
kubectl get svc -n traefik traefik -w
```
Wait until EXTERNAL-IP shows an IP address (not `<pending>`). Note this IP.
Step 2: Create DNS Record
At your DNS provider, create an A record:
```text
demo.ironflow.dev → A <EXTERNAL-IP>
```
Step 3: Verify DNS Propagation
Do NOT skip this step
Enabling ingress before DNS propagates triggers a Let’s Encrypt HTTP-01 challenge that fails. Let’s Encrypt has rate limits (5 failures per hour per domain). Failed attempts are not recoverable within that window.
```bash
dig demo.ironflow.dev +short
# MUST return the load balancer IP. If empty, wait and retry.
```
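To wait unattended, a polling loop along these lines works; LB_IP is a placeholder for the address from Step 1, and querying a public resolver avoids stale local caches:

```bash
LB_IP="<EXTERNAL-IP>"   # the load balancer IP from Step 1
until [ "$(dig +short demo.ironflow.dev @1.1.1.1 | head -n1)" = "$LB_IP" ]; do
  echo "DNS not propagated yet, retrying in 30s..."
  sleep 30
done
echo "DNS OK"
```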
Step 4: Enable Ingress with TLS
Only run this after dig returns the correct IP:
```bash
./build/ironflow deploy upgrade --template small --name demo \
  --set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
  --set image.tag=latest \
  --set ingress.enabled=true \
  --set ingress.host=demo.ironflow.dev \
  --set ingress.className=traefik
```
Verify TLS certificate:
```bash
# Check certificate status (may take 1-2 minutes)
kubectl get certificate -n ironflow
# Expected: READY = True
```
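To inspect the certificate actually being served, independent of the Kubernetes resource status, standard openssl tooling works:

```bash
# Show the issuer and validity window of the live certificate
echo | openssl s_client -connect demo.ironflow.dev:443 \
  -servername demo.ironflow.dev 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```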
```bash
# Test HTTPS
curl -I https://demo.ironflow.dev/health
# Expected: HTTP/2 200
```
```bash
curl https://demo.ironflow.dev/ready
# Expected: {"status":"ok"}
```
Traffic Flow
```text
Client HTTPS request
        │
        ▼
Hetzner LB (Layer 4, TCP, proxy protocol)
        │  Private network
        ▼
Traefik (TLS termination, hostname routing)
        │  Proxy headers (X-Forwarded-For, X-Real-IP)
        ▼
Ironflow Service (ClusterIP :9123)
        │
        ├──▶ HTTP API (REST + ConnectRPC)
        ├──▶ gRPC API (pull-mode workers)
        └──▶ Dashboard (embedded React app)
```
Phase 5: Configure Ironflow
Use the CLI to create a project, environment, and API key for the demo. The CLI targets localhost:9123 by default, so use port-forwarding to access the remote cluster.
```bash
# Port-forward Ironflow service to localhost
kubectl port-forward svc/demo-ironflow -n ironflow 9123:9123 &
PF_PID=$!

# Extract admin credentials from startup logs
kubectl logs -n ironflow $(kubectl get pods -n ironflow \
  -l app.kubernetes.io/component=server -o name | head -1) \
  | grep -A12 "Admin API Key"
# This shows:
#   Admin API Key: ifkey_...   (for CLI and SDK authentication)
#   Dashboard Admin:
#     Email:    admin@ironflow.local
#     Password: <random>       (for dashboard login at /login)
# Save both — they are only shown on first boot.

# Create a demo project and environment
./build/ironflow project create demo
./build/ironflow env create production --project demo

# Create an API key for the demo environment
./build/ironflow apikey create "demo-prod" --env env_demo_production
# Save the returned API key (ifkey_...)

# Stop port-forward
kill $PF_PID
```
Verify Capabilities
```bash
# Confirm the server is healthy and capabilities are reported
curl -s https://demo.ironflow.dev/api/v1/capabilities | jq
```
What this demo cluster includes
- CEL policies — Custom role-based access control with Common Expression Language conditions
- Auth audit logging — All authorization decisions, API key lifecycle, and role/policy changes recorded
- Platform API — Multi-tenant control plane at `/api/v1/platform/*` (users, roles, policies, tenants, audit)
- Custom roles — Create roles beyond the built-in `admin`/`developer`/`viewer`
Connecting Applications
Any application can now connect to this Ironflow instance:
```bash
# Application environment variables
IRONFLOW_URL=https://demo.ironflow.dev
IRONFLOW_API_KEY=ifkey_...   # from apikey create output
IRONFLOW_ENV=production
```
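As a connectivity smoke test from any machine, a raw authenticated request against the documented capabilities endpoint looks roughly like this; the `Authorization: Bearer` scheme is an assumption, so adjust if Ironflow expects a different header for `ifkey_` tokens:

```bash
# Hypothetical client call using the variables above
curl -s -H "Authorization: Bearer $IRONFLOW_API_KEY" \
  "$IRONFLOW_URL/api/v1/capabilities" | jq
```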
Phase 6: Verify the SRE Stack
The full monitoring stack includes 4 Grafana dashboards, 29 infrastructure alert rules, Slack notification routing, and a Healthchecks.io dead-man's switch.
Grafana Dashboards
```bash
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3030:80
```
Open http://localhost:3030. Log in as admin with the password from the grafana-admin secret.
| Dashboard | What it shows |
|---|---|
| Ironflow Performance | Request latency (p50/p95/p99), throughput (req/s), error rates by status code, run completion rates |
| PostgreSQL CNPG | Cluster health, replication lag, connection count vs max, transaction rates, disk usage |
| NATS Monitoring | JetStream storage utilization, consumer lag, slow consumers, message rates |
| Kubernetes Infrastructure | Node CPU/memory/disk, pod scheduling, container restarts, PVC usage |
Prometheus Targets
```bash
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
```
Open http://localhost:9090/targets. All targets should show UP:
- `ironflow` (ServiceMonitor, :9123/metrics)
- `nats` (exporter sidecar, :7777/metrics)
- `cnpg` (PodMonitor)
- `kube-state-metrics`
- `traefik` (ServiceMonitor)
Alert Rules
```bash
kubectl get prometheusrules -n ironflow
```
| PrometheusRule | Rules | Coverage |
|---|---|---|
| ironflow-alerts | 17 | IronflowDown, HighErrorRate, HighLatency, NATSDown, PostgreSQLDown, DiskSpaceLow, ProbeFailure, HighRunFailureRate, WorkerDisconnected, MemoryPressure, AlertmanagerFailedNotifications, NATSPublishCircuitOpen, DLQBacklogGrowing, ProjectionLagHigh, SubscriptionDropsHigh, CodecEncodeErrorsHigh, CodecDecodeErrorsHigh |
| pg-alerts | 8 | CNPGClusterDown, CNPGReplicationLagHigh, CNPGDiskSpaceCritical, CNPGConnectionSaturation, CNPGDiskSpaceWarning, CNPGBackupStale, CNPGHighRollbackRate, CNPGLongRunningTransaction |
| nats-alerts | 4 | NATSJetStreamStorageCritical, NATSSlowConsumers, NATSHighPendingMessages, NATSServerHighMemory |
Alert Routing
```text
Alert fired
    │
    ▼
Alertmanager groups by alertname + namespace
    │
    ├── Watchdog ──▶ Healthchecks.io (1-minute ping)
    │     If pings stop → HC alerts you that monitoring is down
    │
    ├── InfoInhibitor ──▶ /dev/null (suppressed)
    │
    └── All others ──▶ Slack #ironflow-alerts (4-hour repeat)
          ├── Critical severity suppresses matching Warning alerts
          └── Format: severity, summary, description, runbook URL
```
Verify Alertmanager routing:
```bash
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
```
Open http://localhost:9093. The Slack and Healthchecks.io routes should be visible under Status → Config.
Public Metrics Endpoint
The Ironflow metrics endpoint is accessible via the public URL:
```bash
curl https://demo.ironflow.dev/metrics | head -20
```
This exposes Prometheus-format metrics including ironflow_runs_total, ironflow_http_requests_total, ironflow_http_request_duration_seconds_bucket, and NATS/scheduler internals.
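To spot-check the specific series named above rather than eyeballing the full dump:

```bash
# Print the first few samples of each documented metric family
curl -s https://demo.ironflow.dev/metrics \
  | grep -E '^ironflow_(runs_total|http_requests_total|http_request_duration_seconds_bucket)' \
  | head -10
```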
Security Configuration
| Aspect | Setting | Details |
|---|---|---|
| API authentication | devMode: false | All requests require API key (enforced by Small template) |
| Dashboard auth | Password (cookie-based) | Set on first login |
| Secrets encryption | masterKey (AES-256) | Encrypts secrets in NATS KV |
| TLS | cert-manager + Let’s Encrypt | Automatic certificate issuance and renewal |
| Internal services | ClusterIP only | PostgreSQL and NATS are not exposed externally |
| Network policy | Not enabled | Acceptable for dedicated single-tenant cluster |
| Container security | Non-root, read-only filesystem | runAsUser: 100, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true |
| CEL policies | Custom RBAC with CEL expressions | Deny-always-wins, database-driven, decision caching |
| Auth audit | All auth events logged | Authorization decisions, key lifecycle, role/policy changes |
| Platform API | Multi-tenant control plane | /api/v1/platform/* (requires platform API key or JWT) |
Common Failure Modes
| Failure | Symptom | Diagnosis | Resolution |
|---|---|---|---|
| HCLOUD_TOKEN expired | terraform apply fails with 401 | Check token in Hetzner Console | Regenerate token, re-export HCLOUD_TOKEN |
| S3 credentials wrong | CNPG pod in CrashLoopBackOff | kubectl describe pod -n ironflow <pg-pod> | Delete and recreate ironflow-s3-creds secret, delete CNPG pod |
| DNS not propagated | cert-manager challenge fails | kubectl describe certificate -n ironflow | Wait for DNS, then delete the failed Certificate to retry |
| Let’s Encrypt rate limited | Certificate stuck in False state | kubectl describe order -n ironflow | Wait 1 hour. Use staging issuer for testing: --set ingress.annotations.cert-manager\.io/cluster-issuer=letsencrypt-staging |
| Slack secret missing | Alerts fire but no notifications | kubectl logs -n monitoring alertmanager-... | Create the alertmanager-slack secret in monitoring namespace |
| Healthchecks.io secret missing | Alertmanager pod fails to start | kubectl get pods -n monitoring | Create the healthchecks-io secret in monitoring namespace |
| masterKey not set | Secrets stored unencrypted | Warning in Ironflow startup logs | Redeploy with --set ironflow.masterKey=... |
| Worker OOM | Pods evicted, node pressure | kubectl describe node <worker> | Verify cpx32 worker type. If cpx22, reprovision with cpx32 |
Ongoing Maintenance
Upgrade Ironflow
To a specific release:
```bash
# clean build
make all

# IMPORTANT: Always include masterKey and ingress flags on upgrades. Helm
# overwrites all values on each upgrade — omitting any causes them to revert
# to defaults (e.g., ingress gets deleted, encrypted secrets become unreadable).
./build/ironflow deploy upgrade --template small --name demo \
  --set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
  --set image.tag=v0.17.0 \
  --set ingress.enabled=true \
  --set ingress.host=demo.ironflow.dev \
  --set ingress.className=traefik
```
To latest unreleased code (between releases):
If you want to deploy the latest code from main without cutting a formal release, rebuild and push the latest image tag:
```bash
# 1. Build and push a new :latest image from current code
make all
docker build -t ghcr.io/sahina/ironflow:latest \
  --build-arg VERSION="$(git rev-parse --short HEAD)" \
  --platform linux/amd64 .
docker push ghcr.io/sahina/ironflow:latest

# 2. Force the cluster to pull the new image (since the tag hasn't changed,
# Kubernetes won't re-pull unless the pod is deleted)
./build/ironflow deploy upgrade --template small --name demo \
  --set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
  --set image.tag=latest \
  --set ingress.enabled=true \
  --set ingress.host=demo.ironflow.dev \
  --set ingress.className=traefik
kubectl rollout restart deployment/demo-ironflow -n ironflow
```
Why rollout restart?
When the image tag is latest, Kubernetes sees no spec change on upgrade and won’t pull the new image. rollout restart forces a new pod with a fresh pull. For versioned tags (e.g., v0.17.0) this isn’t needed.
Rolling update replaces the pod. With a single replica this means brief unavailability during the restart, which is acceptable for a demo cluster.
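To confirm the restart actually picked up the new push, compare the running image digest against what you pushed (label selector as used earlier in this playbook):

```bash
# The imageID includes the digest of the image the pod is actually running
kubectl get pod -n ironflow -l app.kubernetes.io/component=server \
  -o jsonpath='{.items[0].status.containerStatuses[0].imageID}{"\n"}'
```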
Check Backup Health
```bash
# Verify latest backup
kubectl get backups -n ironflow --sort-by=.status.startedAt
# Most recent should be < 24 hours old
```
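A scriptable freshness check, sketched with jq and assuming the Backup resources expose `.status.startedAt` as queried above:

```bash
# Print the timestamp of the newest backup; investigate if it is older than a day
kubectl get backups -n ironflow -o json \
  | jq -r '.items | sort_by(.status.startedAt) | last | .status.startedAt'
```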
```bash
# Verify backup schedule
kubectl get scheduledbackups -n ironflow
```
Access Monitoring
```bash
# Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3030:80

# Prometheus
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090

# Alertmanager
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
```
Renew TLS Certificate
cert-manager automatically renews the Let’s Encrypt certificate 30 days before expiry. No manual action required. To check status:
```bash
kubectl get certificate -n ironflow
# READY should be True, EXPIRATION shows the current cert expiry
```
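If you ever need to force an early re-issuance (for example after changing the issuer), cert-manager's companion CLI can trigger one, assuming `cmctl` is installed; the certificate name below is illustrative, so use the name shown by the command above:

```bash
# Force an immediate renewal of the named Certificate resource
cmctl renew -n ironflow demo-ironflow-tls
```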
Teardown
To completely remove the demo cluster:
```bash
# 1. Remove Ironflow Helm release (deletes pods, services, PVCs)
./build/ironflow deploy delete --name demo

# 2. Destroy Hetzner infrastructure (nodes, LB, volumes)
./build/ironflow provision destroy --provider hetzner --name demo

# 3. Clean up DNS
# Manually remove the A record for demo.ironflow.dev at your DNS provider

# 4. Clean up local kubeconfig
rm ~/.kube/clusters/hetzner-demo.yaml
```
Partial teardown
To remove just Ironflow but keep the cluster: run only step 1. To remove monitoring: helm uninstall kube-prometheus-stack -n monitoring.
Verification Checklist
Use this checklist after completing all phases:
- Cluster: 2 nodes Ready (`kubectl get nodes`)
- Pods: All Running in `ironflow`, `monitoring`, `cnpg-system`, `traefik` namespaces
- PostgreSQL: Cluster healthy (`kubectl get cluster -n ironflow`)
- Backups: ScheduledBackup configured (`kubectl get scheduledbackups -n ironflow`)
- Health: `https://demo.ironflow.dev/health` returns 200
- Readiness: `https://demo.ironflow.dev/ready` returns `{"status":"ok"}`
- Metrics: `https://demo.ironflow.dev/metrics` returns Prometheus metrics
- Grafana: 4 dashboards load with data at `localhost:3030`
- Prometheus: All targets show UP at `localhost:9090/targets`
- Capabilities: `curl .../api/v1/capabilities` returns a valid JSON payload
- Alerts: All rules loaded (`kubectl get prometheusrules -n ironflow`)
- Slack: Alert routing configured (check Alertmanager UI)
- Healthchecks.io: Watchdog pings arriving (check Healthchecks.io dashboard)
- TLS: Valid certificate (`kubectl get certificate -n ironflow`, READY=True)
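The externally checkable items condense into a small smoke-test script (paths and resource names exactly as used throughout this playbook):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Cluster-side checks
kubectl get nodes
kubectl get cluster -n ironflow
kubectl get scheduledbackups -n ironflow
kubectl get prometheusrules -n ironflow
kubectl get certificate -n ironflow

# Public endpoint checks (curl -f exits non-zero on HTTP errors)
curl -fsS https://demo.ironflow.dev/health >/dev/null && echo "health: OK"
curl -fsS https://demo.ironflow.dev/ready && echo
curl -fsS https://demo.ironflow.dev/metrics | head -n1
```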
Upgrade Path
When you outgrow the Small template:
| Trigger | Action |
|---|---|
| Need zero-downtime deploys | Upgrade to Medium (3 replicas, NATS cluster) |
| >1000 concurrent active runs | Upgrade to Medium |
| Need PostgreSQL HA | Upgrade to Medium (2 PG instances, PgBouncer) |
| Need auto-scaling | Upgrade to Large (HPA, external deps) |
Small to Medium requires a full redeploy (NATS topology changes from 1 to 3 nodes). Data survives via S3 backups + CNPG restore. Medium cost: ~€44/month.
See Scaling Scenarios for detailed upgrade procedures.