Demo Clusters

Deploy a single-tenant Ironflow demo cluster on Hetzner Cloud with full SRE observability. This playbook produces a production-grade cluster with CEL policies, auth audit logging, platform API, Prometheus, Grafana dashboards, Slack alerting, Healthchecks.io dead-man’s switch, TLS, and automated PostgreSQL backups for approximately €21/month.

When to use this playbook:

  • You need a live demo environment to showcase Ironflow
  • You want to validate the Small template deployment end-to-end
  • You need a staging cluster with full monitoring for development

Architecture

HETZNER CLOUD (~€21/month)
            ┌────────────────┴────────────────┐
            ▼                                 ▼
    1x cpx22 Control                1x cpx32 Worker (8GB)
      (Talos Linux)                    (Talos Linux)
         ┌─────────────────────────┬──────────┴──────────────┐
         ▼                         ▼                         ▼
 Ironflow (512Mi)            NATS (256Mi)           PostgreSQL (512Mi)
 :9123 HTTP/gRPC            JetStream 2Gi          CNPG 5Gi + S3 backups
         │                         │                         │
         └─────────────────────────┴─────────────────────────┘

Internet ──▶ Hetzner LB (€6/mo) ──▶ Traefik ──▶ TLS ──▶ Ironflow

kube-prometheus-stack (~1.5Gi)
├── Prometheus (512Mi, 10Gi PVC, 15d retention)
├── Grafana (256Mi, 4 dashboards)
├── Alertmanager → Slack + Healthchecks.io
└── kube-state-metrics (256Mi)

Component Sizing

| Component | RAM Request | RAM Limit | Storage | Notes |
|---|---|---|---|---|
| Ironflow | 256Mi | 512Mi | | Single replica, bundled mode |
| NATS | 128Mi | 256Mi | 2Gi PVC | JetStream, single node |
| PostgreSQL | 256Mi | 512Mi | 5Gi PVC | CloudNativePG, 1 instance |
| Prometheus | 512Mi | 1Gi | 10Gi PVC | 15-day retention, hcloud-volumes |
| Grafana | 128Mi | 256Mi | | 4 pre-built dashboards |
| Alertmanager | 50Mi | 128Mi | | Slack + Healthchecks.io routing |
| kube-state-metrics | 128Mi | 256Mi | | Kubernetes metrics |
| Traefik | 128Mi | 256Mi | | 2 replicas, TLS termination |
| System pods | | | | Cilium, CCM, cert-manager (~700Mi) |
| Total | | | | ~2.5Gi used on 8Gi worker |
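
Once the cluster is up (Phase 1 onward), you can sanity-check this table against what is actually scheduled on the worker. A quick sketch; the label selector assumes Talos workers carry no role label, matching the kubectl get nodes output later in this playbook:

Terminal window
# Requests/limits scheduled on the worker node vs. the table above
kubectl describe node -l '!node-role.kubernetes.io/control-plane' \
  | grep -A 8 'Allocated resources'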

Cost Breakdown

| Component | Monthly Cost |
|---|---|
| 1x cpx22 (control plane, 3 vCPU, 4GB) | ~€4.50 |
| 1x cpx32 (worker, 4 vCPU, 8GB) | ~€8.50 |
| Hetzner LB11 (load balancer) | ~€6.00 |
| Object Storage (PostgreSQL backups) | ~€1.00 |
| Total | ~€20-21/month |

Isolation Model

This is a single-tenant demo cluster. All traffic goes to one Ironflow instance with one project and environment. For multi-app shared clusters, see issue #373.

| Layer | Mechanism | What it provides |
|---|---|---|
| API Key | Bound to one environment | Request authentication |
| Database filtering | WHERE environment_id = $N on all queries | Data isolation |
| NATS subject naming | ironflow.{project}.{env}.events.> | Message isolation |
| TLS | cert-manager + Let's Encrypt | Encrypted transport |
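
Concretely, the last three layers look like this from a client's perspective. A hedged sketch; the nats CLI subscription and the SQL shape are illustrative, not the server's exact internals:

Terminal window
# Message isolation: a consumer for this environment sees only its own subject prefix
nats sub 'ironflow.demo.production.events.>'
# Data isolation: every API query is scoped server-side, equivalent to
#   SELECT * FROM runs WHERE environment_id = $1
# so an API key bound to one environment can never read another environment's rows.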

Prerequisites

Before starting, ensure you have:

| Requirement | How to get it |
|---|---|
| Hetzner Cloud API token | Hetzner Console → Security → API Tokens |
| GitHub PAT (read:packages scope) | GitHub Settings → Personal Access Tokens |
| Hetzner Object Storage bucket | Hetzner Console → Object Storage → Create Bucket (name: ironflow-backups, private) |
| Domain with DNS access | Any DNS provider (e.g., Cloudflare, Route 53) |
| Slack webhook URL | Slack App → Incoming Webhooks → Create |
| Healthchecks.io ping URL | Healthchecks.io → New Check (5-minute period, free tier) |

Set the following environment variables in your shell, or put them in a .env file and load it with source .env. If you use a .env file, make sure it is listed in .gitignore so credentials are never committed.

Terminal window
export HCLOUD_TOKEN="your-hetzner-api-token"
export GITHUB_USERNAME="your-github-username"
export GITHUB_PAT="ghp_your-personal-access-token"
export HETZNER_S3_ACCESS_KEY="your-s3-access-key"
export HETZNER_S3_SECRET_KEY="your-s3-secret-key"
export HETZNER_S3_ENDPOINT="https://fsn1.your-objectstorage.com"
export HETZNER_S3_BUCKET="ironflow-backups"
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../xxx"
export HEALTHCHECKS_PING_URL="https://hc-ping.com/your-uuid"



Phase 0: Build the Ironflow Binary

Build the Ironflow binary from source. This produces ./build/ironflow, which is used for all subsequent commands.

Terminal window
make all

Verify:

Terminal window
./build/ironflow version

Phase 1: Provision the Cluster

Provision a two-node Hetzner cluster using the Small template (1 control plane + 1 worker). The worker uses cpx32 (8GB RAM) to accommodate the full monitoring stack.

Terminal window
# Provision cluster (~5-8 minutes)
# Installs: Talos Linux, Cilium CNI, cert-manager, Hetzner CCM + CSI
./build/ironflow provision create --provider hetzner --template small --name demo

The provisioning command saves the kubeconfig to ~/.kube/clusters/hetzner-demo.yaml. Set it for all subsequent commands.

Terminal window
export KUBECONFIG=~/.kube/clusters/hetzner-demo.yaml

Verify the variable resolved to a real path:

Terminal window
echo $KUBECONFIG     # must show the expanded absolute path to hetzner-demo.yaml, not a literal ~/...
kubectl cluster-info # must hit the Hetzner API server, not localhost:8080

Verify:

Terminal window
kubectl get nodes

Expected output (both nodes Ready):

NAME               STATUS   ROLES           AGE   VERSION
control-demo-...   Ready    control-plane   5m    v1.31.x
worker-demo-...    Ready    <none>          4m    v1.31.x

Troubleshooting

If nodes show NotReady, wait 2-3 minutes for Talos to finish bootstrapping. The provisioning command retries up to 30 times (5 minutes total). If nodes remain NotReady after 10 minutes, check Hetzner Console for node status and run talosctl health for diagnostics.
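
A minimal diagnostic sketch, assuming talosctl is installed and a talosconfig is available at the default location (adjust --talosconfig if the provisioner wrote it elsewhere; the control plane IP is visible in Hetzner Console):

Terminal window
# Cluster-level health from the Talos side
talosctl health
# Per-node service status
talosctl -n <control-plane-ip> services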


Phase 2: Create Secrets

Create the namespaces and secrets that the deployment and monitoring stack depend on. These must exist before running ironflow deploy.

Ironflow Secrets

Terminal window
# Create namespace
kubectl create namespace ironflow
# Container registry credentials (for pulling Ironflow images from GHCR)
kubectl create secret docker-registry ghcr-pull-secret -n ironflow \
--docker-server=ghcr.io \
--docker-username=$GITHUB_USERNAME \
--docker-password=$GITHUB_PAT
# S3 backup credentials (for PostgreSQL WAL archiving and daily backups)
kubectl create secret generic ironflow-s3-creds -n ironflow \
--from-literal=ACCESS_KEY_ID="$HETZNER_S3_ACCESS_KEY" \
--from-literal=SECRET_ACCESS_KEY="$HETZNER_S3_SECRET_KEY"

Monitoring Secrets

Terminal window
# Create namespace
kubectl create namespace monitoring
# Slack webhook for alert notifications
kubectl create secret generic alertmanager-slack -n monitoring \
--from-literal=webhook-url="$SLACK_WEBHOOK_URL"
# Healthchecks.io dead-man's switch (Watchdog alert pings this URL every minute;
# if it stops, Healthchecks.io alerts you that the monitoring stack itself is down)
kubectl create secret generic healthchecks-io -n monitoring \
--from-literal=ping-url="$HEALTHCHECKS_PING_URL"
# Grafana admin credentials
kubectl create secret generic grafana-admin -n monitoring \
--from-literal=admin-user=admin \
--from-literal=admin-password="$(openssl rand -base64 16)"

Retrieve Grafana password later

Terminal window
kubectl get secret grafana-admin -n monitoring \
-o jsonpath='{.data.admin-password}' | base64 -d && echo

Verify:

Terminal window
kubectl get secrets -n ironflow
kubectl get secrets -n monitoring

Expected: ghcr-pull-secret, ironflow-s3-creds in ironflow namespace. alertmanager-slack, healthchecks-io, grafana-admin in monitoring namespace.


Phase 3: Deploy Ironflow

Deploy Ironflow with the Small template and Hetzner load balancer support. This command automatically installs all prerequisites:

  • CloudNativePG (PostgreSQL operator)
  • Barman Cloud Plugin (S3-compatible backups)
  • kube-prometheus-stack (Prometheus, Grafana, Alertmanager with full Hetzner config)
  • Traefik (ingress controller with Hetzner LB annotations)

Build and push the image

Build a Docker image from the current code and push it to GHCR. The cluster will pull this image during deployment.

Terminal window
# Build and push the image (linux/amd64 for Hetzner)
docker build -t ghcr.io/sahina/ironflow:latest \
--build-arg VERSION="$(git rev-parse --short HEAD)" \
--platform linux/amd64 .
echo $GITHUB_PAT | docker login ghcr.io -u $GITHUB_USERNAME --password-stdin
docker push ghcr.io/sahina/ironflow:latest

Deploy

Generate the master key first

The masterKey encrypts secrets stored in NATS KV (AES-256). If you lose it, encrypted secrets become unreadable. Generate and save it before running deploy, and pass the same key on every ironflow deploy upgrade command; Helm re-renders all values on each upgrade, so omitting the key reverts it to the chart default and existing secrets become unreadable.

Terminal window
export IRONFLOW_MASTER_KEY=$(openssl rand -hex 32)
echo "SAVE THIS KEY: $IRONFLOW_MASTER_KEY"

Terminal window
# Deploy Ironflow with full SRE stack
# The --hetzner-location flag enables Traefik + load balancer + full monitoring config
./build/ironflow deploy --template small --name demo \
--hetzner-location fsn1 \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=latest

Wait for all pods (3-5 minutes, CNPG bootstrap is the slowest):

Terminal window
# Watch Ironflow pods
kubectl get pods -n ironflow -w
# Watch monitoring pods
kubectl get pods -n monitoring -w
# Verify PostgreSQL cluster is healthy
kubectl get cluster -n ironflow
# Expected: Phase = "Cluster in healthy state"
# Verify daily backups are scheduled
kubectl get scheduledbackups -n ironflow
# Expected: 1 ScheduledBackup (daily at 2 AM)

What gets deployed

Namespace: ironflow
├── demo-ironflow (Deployment, 1 replica)
│   ├── ironflow container (:9123)
│   ├── CEL policy evaluator, auth audit recorder, platform API
│   └── init containers (wait-for-pg, wait-for-nats)
├── demo-ironflow-pg (CNPG Cluster, 1 instance)
│   ├── PostgreSQL 16
│   ├── 5Gi PVC (hcloud-volumes)
│   └── S3 WAL archiving (Barman Cloud Plugin)
├── demo-nats (StatefulSet, 1 replica)
│   ├── NATS with JetStream
│   ├── 2Gi PVC (hcloud-volumes)
│   └── Prometheus exporter sidecar (:7777)
└── demo-ironflow-pg-backup (ScheduledBackup, daily 2 AM)

Namespace: monitoring
├── kube-prometheus-stack-prometheus (StatefulSet)
│   ├── 10Gi PVC (hcloud-volumes, 15-day retention)
│   └── ServiceMonitor + PodMonitor selectors (all namespaces)
├── kube-prometheus-stack-grafana (Deployment)
│   ├── Dashboard sidecar (auto-discovers ConfigMaps)
│   └── Admin from grafana-admin secret
├── kube-prometheus-stack-alertmanager (StatefulSet)
│   ├── Slack receiver (#ironflow-alerts, 4h repeat)
│   ├── Healthchecks.io receiver (Watchdog, 1m repeat)
│   └── Inhibition: critical suppresses warning
└── kube-state-metrics (Deployment)

Namespace: traefik
└── traefik (Deployment, 2 replicas)
    ├── Hetzner LB annotations (private network, proxy protocol)
    └── Prometheus ServiceMonitor

Namespace: cnpg-system
├── cloudnative-pg (Deployment, CNPG operator)
└── barman-cloud-instance-manager (Deployment)

Phase 4: DNS and Ingress

Ingress is enabled in two phases to avoid burning Let's Encrypt rate limits: cert-manager requests a TLS certificate via an HTTP-01 challenge, and the challenge can only succeed once DNS resolves to the load balancer IP.

Step 1: Get the Load Balancer IP

Terminal window
# Wait for Hetzner to provision the load balancer (1-2 minutes)
kubectl get svc -n traefik traefik -w

Wait until EXTERNAL-IP shows an IP address (not <pending>). Note this IP.

Step 2: Create DNS Record

At your DNS provider, create an A record:

demo.ironflow.dev → A <EXTERNAL-IP>

Step 3: Verify DNS Propagation

Do NOT skip this step

Enabling ingress before DNS propagates triggers a Let’s Encrypt HTTP-01 challenge that fails. Let’s Encrypt has rate limits (5 failures per hour per domain). Failed attempts are not recoverable within that window.

Terminal window
dig demo.ironflow.dev +short
# MUST return the load balancer IP. If empty, wait and retry.

Step 4: Enable Ingress with TLS

Only run this after dig returns the correct IP:

Terminal window
./build/ironflow deploy upgrade --template small --name demo \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=latest \
--set ingress.enabled=true \
--set ingress.host=demo.ironflow.dev \
--set ingress.className=traefik

Verify TLS certificate:

Terminal window
# Check certificate status (may take 1-2 minutes)
kubectl get certificate -n ironflow
# Expected: READY = True
# Test HTTPS
curl -I https://demo.ironflow.dev/health
# Expected: HTTP/2 200
curl https://demo.ironflow.dev/ready
# Expected: {"status":"ok"}

Traffic Flow

Client HTTPS request
   │
   ▼
Hetzner LB (Layer 4, TCP, proxy protocol)
   │  Private network
   ▼
Traefik (TLS termination, hostname routing)
   │  Proxy headers (X-Forwarded-For, X-Real-IP)
   ▼
Ironflow Service (ClusterIP :9123)
   ├──▶ HTTP API (REST + ConnectRPC)
   ├──▶ gRPC API (pull-mode workers)
   └──▶ Dashboard (embedded React app)

Phase 5: Configure Ironflow

Use the CLI to create a project, environment, and API key for the demo. The CLI targets localhost:9123 by default, so use port-forwarding to access the remote cluster.

Terminal window
# Port-forward Ironflow service to localhost
kubectl port-forward svc/demo-ironflow -n ironflow 9123:9123 &
PF_PID=$!
# Extract admin credentials from startup logs
kubectl logs -n ironflow $(kubectl get pods -n ironflow \
-l app.kubernetes.io/component=server -o name | head -1) \
| grep -A12 "Admin API Key"
# This shows:
# Admin API Key: ifkey_... (for CLI and SDK authentication)
# Dashboard Admin:
# Email: admin@ironflow.local
# Password: <random> (for dashboard login at /login)
# Save both — they are only shown on first boot.
# Create a demo project and environment
./build/ironflow project create demo
./build/ironflow env create production --project demo
# Create an API key for the demo environment
./build/ironflow apikey create "demo-prod" --env env_demo_production
# Save the returned API key (ifkey_...)
# Stop port-forward
kill $PF_PID

Verify Capabilities

Terminal window
# Confirm the server is healthy and capabilities are reported
curl -s https://demo.ironflow.dev/api/v1/capabilities | jq

What this demo cluster includes

  • CEL policies — Custom role-based access control with Common Expression Language conditions
  • Auth audit logging — All authorization decisions, API key lifecycle, and role/policy changes recorded
  • Platform API — Multi-tenant control plane at /api/v1/platform/* (users, roles, policies, tenants, audit)
  • Custom roles — Create roles beyond the built-in admin/developer/viewer
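
As a quick probe of the platform API surface, you can list one of the resources named above. A hedged sketch: the /api/v1/platform/roles sub-path and the Bearer auth scheme are assumptions based on the prefix and resource list above, and ADMIN_API_KEY is a hypothetical variable holding the ifkey_... admin key from Phase 5:

Terminal window
# Hypothetical probe: sub-path and auth header are assumptions, not a confirmed API
curl -s -H "Authorization: Bearer $ADMIN_API_KEY" \
  https://demo.ironflow.dev/api/v1/platform/roles | jq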

Connecting Applications

Any application can now connect to this Ironflow instance:

Terminal window
# Application environment variables
IRONFLOW_URL=https://demo.ironflow.dev
IRONFLOW_API_KEY=ifkey_... # from apikey create output
IRONFLOW_ENV=production
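
A quick way to validate these values from the application side is an authenticated request. A sketch assuming the key is accepted as a Bearer token (check your SDK's auth scheme if this returns 401) and that the variables above are exported:

Terminal window
curl -s -H "Authorization: Bearer $IRONFLOW_API_KEY" \
  "$IRONFLOW_URL/api/v1/capabilities" | jq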

Phase 6: Verify the SRE Stack

The full monitoring stack includes 4 Grafana dashboards, 29 infrastructure alert rules, Slack notification routing, and a Healthchecks.io dead-man’s switch.

Grafana Dashboards

Terminal window
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3030:80

Open http://localhost:3030 and log in as admin with the password from the grafana-admin secret (see Phase 2 for the retrieval command).

| Dashboard | What it shows |
|---|---|
| Ironflow Performance | Request latency (p50/p95/p99), throughput (req/s), error rates by status code, run completion rates |
| PostgreSQL CNPG | Cluster health, replication lag, connection count vs max, transaction rates, disk usage |
| NATS Monitoring | JetStream storage utilization, consumer lag, slow consumers, message rates |
| Kubernetes Infrastructure | Node CPU/memory/disk, pod scheduling, container restarts, PVC usage |

Prometheus Targets

Terminal window
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090

Open http://localhost:9090/targets. All targets should show UP:

  • ironflow (ServiceMonitor, :9123/metrics)
  • nats (exporter sidecar, :7777/metrics)
  • cnpg (PodMonitor)
  • kube-state-metrics
  • traefik (ServiceMonitor)

Alert Rules

Terminal window
kubectl get prometheusrules -n ironflow

| PrometheusRule | Rules | Coverage |
|---|---|---|
| ironflow-alerts | 17 | IronflowDown, HighErrorRate, HighLatency, NATSDown, PostgreSQLDown, DiskSpaceLow, ProbeFailure, HighRunFailureRate, WorkerDisconnected, MemoryPressure, AlertmanagerFailedNotifications, NATSPublishCircuitOpen, DLQBacklogGrowing, ProjectionLagHigh, SubscriptionDropsHigh, CodecEncodeErrorsHigh, CodecDecodeErrorsHigh |
| pg-alerts | 8 | CNPGClusterDown, CNPGReplicationLagHigh, CNPGDiskSpaceCritical, CNPGConnectionSaturation, CNPGDiskSpaceWarning, CNPGBackupStale, CNPGHighRollbackRate, CNPGLongRunningTransaction |
| nats-alerts | 4 | NATSJetStreamStorageCritical, NATSSlowConsumers, NATSHighPendingMessages, NATSServerHighMemory |

Alert Routing

Alert fired
   │
   ▼
Alertmanager groups by alertname + namespace
   ├── Watchdog ──▶ Healthchecks.io (1-minute ping)
   │      If pings stop → HC alerts you that monitoring is down
   ├── InfoInhibitor ──▶ /dev/null (suppressed)
   └── All others ──▶ Slack #ironflow-alerts (4-hour repeat)
          ├── Critical severity suppresses matching Warning alerts
          └── Format: severity, summary, description, runbook URL

Verify Alertmanager routing:

Terminal window
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093

Open http://localhost:9093. The Slack and Healthchecks.io routes should be visible under Status → Config.
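
To verify delivery end to end rather than just the config, post a synthetic alert to Alertmanager's v2 API while the port-forward is running; the alert name and labels below are arbitrary test values:

Terminal window
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RoutingTest","severity":"warning","namespace":"ironflow"},"annotations":{"summary":"Synthetic alert to exercise Slack routing"}}]'
# A RoutingTest notification should reach #ironflow-alerts shortly afterward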

Public Metrics Endpoint

The Ironflow metrics endpoint is accessible via the public URL:

Terminal window
curl https://demo.ironflow.dev/metrics | head -20

This exposes Prometheus-format metrics including ironflow_runs_total, ironflow_http_requests_total, ironflow_http_request_duration_seconds_bucket, and NATS/scheduler internals.
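
For a quick spot check of the counters named above:

Terminal window
curl -s https://demo.ironflow.dev/metrics \
  | grep -E '^ironflow_(runs_total|http_requests_total)'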


Security Configuration

| Aspect | Setting | Details |
|---|---|---|
| API authentication | devMode: false | All requests require API key (enforced by Small template) |
| Dashboard auth | Password (cookie-based) | Set on first login |
| Secrets encryption | masterKey (AES-256) | Encrypts secrets in NATS KV |
| TLS | cert-manager + Let's Encrypt | Automatic certificate issuance and renewal |
| Internal services | ClusterIP only | PostgreSQL and NATS are not exposed externally |
| Network policy | Not enabled | Acceptable for dedicated single-tenant cluster |
| Container security | Non-root, read-only filesystem | runAsUser: 100, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true |
| CEL policies | Custom RBAC with CEL expressions | Deny-always-wins, database-driven, decision caching |
| Auth audit | All auth events logged | Authorization decisions, key lifecycle, role/policy changes |
| Platform API | Multi-tenant control plane | /api/v1/platform/* (requires platform API key or JWT) |

Common Failure Modes

| Failure | Symptom | Diagnosis | Resolution |
|---|---|---|---|
| HCLOUD_TOKEN expired | terraform apply fails with 401 | Check token in Hetzner Console | Regenerate token, re-export HCLOUD_TOKEN |
| S3 credentials wrong | CNPG pod in CrashLoopBackOff | kubectl describe pod -n ironflow <pg-pod> | Delete and recreate the ironflow-s3-creds secret, then delete the CNPG pod |
| DNS not propagated | cert-manager challenge fails | kubectl describe certificate -n ironflow | Wait for DNS, then delete the failed Certificate to retry |
| Let's Encrypt rate limited | Certificate stuck in False state | kubectl describe order -n ironflow | Wait 1 hour. Use the staging issuer for testing: --set ingress.annotations.cert-manager\.io/cluster-issuer=letsencrypt-staging |
| Slack secret missing | Alerts fire but no notifications | kubectl logs -n monitoring alertmanager-... | Create the alertmanager-slack secret in the monitoring namespace |
| Healthchecks.io secret missing | Alertmanager pod fails to start | kubectl get pods -n monitoring | Create the healthchecks-io secret in the monitoring namespace |
| masterKey not set | Secrets stored unencrypted | Warning in Ironflow startup logs | Redeploy with --set ironflow.masterKey=... |
| Worker OOM | Pods evicted, node pressure | kubectl describe node <worker> | Verify the cpx32 worker type; if cpx22, reprovision with cpx32 |
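
When a symptom doesn't match a row above, recent cluster events are usually the fastest first pass:

Terminal window
# Most recent events across all namespaces, oldest first
kubectl get events -A --sort-by=.lastTimestamp | tail -20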

Ongoing Maintenance

Upgrade Ironflow

To a specific release:

Terminal window
# clean build
make all
# IMPORTANT: Always include masterKey and ingress flags on upgrades. Helm
# overwrites all values on each upgrade — omitting any causes them to revert
# to defaults (e.g., ingress gets deleted, encrypted secrets become unreadable).
./build/ironflow deploy upgrade --template small --name demo \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=v0.17.0 \
--set ingress.enabled=true \
--set ingress.host=demo.ironflow.dev \
--set ingress.className=traefik

To latest unreleased code (between releases):

If you want to deploy the latest code from main without cutting a formal release, rebuild and push the latest image tag:

Terminal window
# 1. Build and push a new :latest image from current code
make all
docker build -t ghcr.io/sahina/ironflow:latest \
--build-arg VERSION="$(git rev-parse --short HEAD)" \
--platform linux/amd64 .
docker push ghcr.io/sahina/ironflow:latest
# 2. Force the cluster to pull the new image (since the tag hasn't changed,
# Kubernetes won't re-pull unless the pod is deleted)
./build/ironflow deploy upgrade --template small --name demo \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=latest \
--set ingress.enabled=true \
--set ingress.host=demo.ironflow.dev \
--set ingress.className=traefik
kubectl rollout restart deployment/demo-ironflow -n ironflow

Why rollout restart?

When the image tag is latest, Kubernetes sees no spec change on upgrade and won’t pull the new image. rollout restart forces a new pod with a fresh pull. For versioned tags (e.g., v0.17.0) this isn’t needed.

The rolling update replaces the pod. With a single replica this means a brief window of unavailability during the restart, which is acceptable for a demo cluster.
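
To confirm the restart actually pulled fresh code, compare the running container's image digest before and after the restart (this reuses the component label from Phase 5):

Terminal window
kubectl get pods -n ironflow -l app.kubernetes.io/component=server \
  -o jsonpath='{.items[0].status.containerStatuses[0].imageID}' && echo
# The sha256 digest should change if a new :latest was pushed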

Check Backup Health

Terminal window
# Verify latest backup
kubectl get backups -n ironflow --sort-by=.status.startedAt
# Most recent should be < 24 hours old
# Verify backup schedule
kubectl get scheduledbackups -n ironflow

Access Monitoring

Terminal window
# Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3030:80
# Prometheus
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Alertmanager
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093

Renew TLS Certificate

cert-manager automatically renews the Let’s Encrypt certificate 30 days before expiry. No manual action required. To check status:

Terminal window
kubectl get certificate -n ironflow
# READY should be True, EXPIRATION shows the current cert expiry

Teardown

To completely remove the demo cluster:

Terminal window
# 1. Remove Ironflow Helm release (deletes pods, services, PVCs)
./build/ironflow deploy delete --name demo
# 2. Destroy Hetzner infrastructure (nodes, LB, volumes)
./build/ironflow provision destroy --provider hetzner --name demo
# 3. Clean up DNS
# Manually remove the A record for demo.ironflow.dev at your DNS provider
# 4. Clean up local kubeconfig
rm ~/.kube/clusters/hetzner-demo.yaml

Partial teardown

To remove just Ironflow but keep the cluster: run only step 1. To remove monitoring: helm uninstall kube-prometheus-stack -n monitoring.


Verification Checklist

Use this checklist after completing all phases:

  • Cluster: 2 nodes Ready (kubectl get nodes)
  • Pods: All Running in ironflow, monitoring, cnpg-system, traefik namespaces
  • PostgreSQL: Cluster healthy (kubectl get cluster -n ironflow)
  • Backups: ScheduledBackup configured (kubectl get scheduledbackups -n ironflow)
  • Health: https://demo.ironflow.dev/health returns 200
  • Readiness: https://demo.ironflow.dev/ready returns {"status":"ok"}
  • Metrics: https://demo.ironflow.dev/metrics returns Prometheus metrics
  • Grafana: 4 dashboards load with data at localhost:3030
  • Prometheus: All targets show UP at localhost:9090/targets
  • Capabilities: curl .../api/v1/capabilities returns a valid JSON payload
  • Alerts: All rules loaded (kubectl get prometheusrules -n ironflow)
  • Slack: Alert routing configured (check Alertmanager UI)
  • Healthchecks.io: Watchdog pings arriving (check Healthchecks.io dashboard)
  • TLS: Valid certificate (kubectl get certificate -n ironflow, READY=True)

Upgrade Path

When you outgrow the Small template:

| Trigger | Action |
|---|---|
| Need zero-downtime deploys | Upgrade to Medium (3 replicas, NATS cluster) |
| >1000 concurrent active runs | Upgrade to Medium |
| Need PostgreSQL HA | Upgrade to Medium (2 PG instances, PgBouncer) |
| Need auto-scaling | Upgrade to Large (HPA, external deps) |

Small to Medium requires a full redeploy (NATS topology changes from 1 to 3 nodes). Data survives via S3 backups + CNPG restore. Medium cost: ~€44/month.
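
Before starting that redeploy, it is worth confirming the backups you plan to restore from actually exist in the bucket. A sketch using the AWS CLI against Hetzner's S3-compatible endpoint; assumes the aws CLI is installed and the Prerequisites environment variables are still set:

Terminal window
AWS_ACCESS_KEY_ID="$HETZNER_S3_ACCESS_KEY" \
AWS_SECRET_ACCESS_KEY="$HETZNER_S3_SECRET_KEY" \
aws s3 ls "s3://$HETZNER_S3_BUCKET/" --recursive \
  --endpoint-url "$HETZNER_S3_ENDPOINT" | tail -20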

See Scaling Scenarios for detailed upgrade procedures.