Demo Clusters

Deploy a single-tenant Ironflow demo cluster on Hetzner Cloud with full SRE observability. This playbook produces a production-grade cluster with CEL policies, auth audit logging, platform API, Prometheus, Grafana dashboards, Slack alerting, Healthchecks.io dead-man’s switch, TLS, and automated PostgreSQL backups for approximately €21/month.

When to use this playbook:

  • You need a live demo environment to showcase Ironflow
  • You want to validate the Small template deployment end-to-end
  • You need a staging cluster with full monitoring for development

Architecture

HETZNER CLOUD (~€21/month)
            ┌────────────────┴────────────────┐
            ▼                                 ▼
    1x cpx22 Control                1x cpx32 Worker (8GB)
      (Talos Linux)                    (Talos Linux)
         ┌─────────────────────────┬──────────┴──────────────┐
         ▼                         ▼                         ▼
 Ironflow (512Mi)            NATS (256Mi)           PostgreSQL (512Mi)
 :9123 HTTP/gRPC            JetStream 2Gi          CNPG 5Gi + S3 backups
         │                         │                         │
         └─────────────────────────┴─────────────────────────┘

Internet ──▶ Hetzner LB (€6/mo) ──▶ Traefik ──▶ TLS ──▶ Ironflow

kube-prometheus-stack (~1.5Gi)
├── Prometheus (512Mi, 10Gi PVC, 15d retention)
├── Grafana (256Mi, 4 dashboards)
├── Alertmanager → Slack + Healthchecks.io
└── kube-state-metrics (256Mi)

Component Sizing

| Component | RAM Request | RAM Limit | Storage | Notes |
|---|---|---|---|---|
| Ironflow | 256Mi | 512Mi | | Single replica, bundled mode |
| NATS | 128Mi | 256Mi | 2Gi PVC | JetStream, single node |
| PostgreSQL | 256Mi | 512Mi | 5Gi PVC | CloudNativePG, 1 instance |
| Prometheus | 512Mi | 1Gi | 10Gi PVC | 15-day retention, hcloud-volumes |
| Grafana | 128Mi | 256Mi | | 4 pre-built dashboards |
| Alertmanager | 50Mi | 128Mi | | Slack + Healthchecks.io routing |
| kube-state-metrics | 128Mi | 256Mi | | Kubernetes metrics |
| Traefik | 128Mi | 256Mi | | 2 replicas, TLS termination |
| System pods | | | | Cilium, CCM, cert-manager (~700Mi) |
| Total | | | | ~2.5Gi used on 8Gi worker |
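
Once the cluster is up (Phase 1 onward), you can sanity-check this table against what is actually scheduled on the worker. A quick sketch; the label selector assumes Talos workers carry no role label, matching the kubectl get nodes output later in this playbook:

Terminal window
# Requests/limits scheduled on the worker node vs. the table above
kubectl describe node -l '!node-role.kubernetes.io/control-plane' \
  | grep -A 8 'Allocated resources'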

Cost Breakdown

| Component | Monthly Cost |
|---|---|
| 1x cpx22 (control plane, 3 vCPU, 4GB) | ~€4.50 |
| 1x cpx32 (worker, 4 vCPU, 8GB) | ~€8.50 |
| Hetzner LB11 (load balancer) | ~€6.00 |
| Object Storage (PostgreSQL backups) | ~€1.00 |
| Total | ~€20-21/month |

Isolation Model

This is a single-tenant demo cluster. All traffic goes to one Ironflow instance with one project and environment. For multi-app shared clusters, see issue #373.

| Layer | Mechanism | What it provides |
|---|---|---|
| API Key | Bound to one environment | Request authentication |
| Database filtering | WHERE environment_id = $N on all queries | Data isolation |
| NATS subject naming | ironflow.{project}.{env}.events.> | Message isolation |
| TLS | cert-manager + Let's Encrypt | Encrypted transport |
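
Concretely, the last three layers look like this from a client's perspective. A hedged sketch; the nats CLI subscription and the SQL shape are illustrative, not the server's exact internals:

Terminal window
# Message isolation: a consumer for this environment sees only its own subject prefix
nats sub 'ironflow.demo.production.events.>'
# Data isolation: every API query is scoped server-side, equivalent to
#   SELECT * FROM runs WHERE environment_id = $1
# so an API key bound to one environment can never read another environment's rows.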

Prerequisites

Before starting, ensure you have:

| Requirement | How to get it |
|---|---|
| Hetzner Cloud API token | Hetzner Console → Security → API Tokens |
| GitHub PAT (read:packages scope) | GitHub Settings → Personal Access Tokens |
| Hetzner Object Storage bucket | Hetzner Console → Object Storage → Create Bucket (name: ironflow-backups, private) |
| Domain with DNS access | Any DNS provider (e.g., Cloudflare, Route 53) |
| Slack webhook URL | Slack App → Incoming Webhooks → Create |
| Healthchecks.io ping URL | Healthchecks.io → New Check (5-minute period, free tier) |

Set the following environment variables in your shell, or put them in a .env file and load it with source .env. If you use a .env file, make sure it is listed in .gitignore so credentials are never committed.

Terminal window
export HCLOUD_TOKEN="your-hetzner-api-token"
export GITHUB_USERNAME="your-github-username"
export GITHUB_PAT="ghp_your-personal-access-token"
export HETZNER_S3_ACCESS_KEY="your-s3-access-key"
export HETZNER_S3_SECRET_KEY="your-s3-secret-key"
export HETZNER_S3_ENDPOINT="https://fsn1.your-objectstorage.com"
export HETZNER_S3_BUCKET="ironflow-backups"
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../xxx"
export HEALTHCHECKS_PING_URL="https://hc-ping.com/your-uuid"



Phase 0: Build the Ironflow Binary

Build the Ironflow binary from source. This produces ./build/ironflow, which is used for all subsequent commands.

Terminal window
make all

Verify:

Terminal window
./build/ironflow version

Phase 1: Provision the Cluster

Provision a two-node Hetzner cluster using the Small template (1 control plane + 1 worker). The worker uses cpx32 (8GB RAM) to accommodate the full monitoring stack.

Terminal window
# Provision cluster (~5-8 minutes)
# Installs: Talos Linux, Cilium CNI, cert-manager, Hetzner CCM + CSI
./build/ironflow provision create --provider hetzner --template small --name demo

The provisioning command saves the kubeconfig to ~/.kube/clusters/hetzner-demo.yaml. Set it for all subsequent commands.

Terminal window
export KUBECONFIG=~/.kube/clusters/hetzner-demo.yaml

Verify the variable resolved to a real path:

Terminal window
echo $KUBECONFIG     # must show the expanded absolute path to hetzner-demo.yaml, not a literal ~/...
kubectl cluster-info # must hit the Hetzner API server, not localhost:8080

Verify:

Terminal window
kubectl get nodes

Expected output (both nodes Ready):

NAME               STATUS   ROLES           AGE   VERSION
control-demo-...   Ready    control-plane   5m    v1.31.x
worker-demo-...    Ready    <none>          4m    v1.31.x

Troubleshooting

If nodes show NotReady, wait 2-3 minutes for Talos to finish bootstrapping. The provisioning command retries up to 30 times (5 minutes total). If nodes remain NotReady after 10 minutes, check Hetzner Console for node status and run talosctl health for diagnostics.
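
A minimal diagnostic sketch, assuming talosctl is installed and a talosconfig is available at the default location (adjust --talosconfig if the provisioner wrote it elsewhere; the control plane IP is visible in Hetzner Console):

Terminal window
# Cluster-level health from the Talos side
talosctl health
# Per-node service status
talosctl -n <control-plane-ip> services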


Phase 2: Create Secrets

Create the namespaces and secrets that the deployment and monitoring stack depend on. These must exist before running ironflow deploy.

Ironflow Secrets

Terminal window
# Create namespace
kubectl create namespace ironflow
# Container registry credentials (for pulling Ironflow images from GHCR)
kubectl create secret docker-registry ghcr-pull-secret -n ironflow \
--docker-server=ghcr.io \
--docker-username=$GITHUB_USERNAME \
--docker-password=$GITHUB_PAT
# S3 backup credentials (for PostgreSQL WAL archiving and daily backups)
kubectl create secret generic ironflow-s3-creds -n ironflow \
--from-literal=ACCESS_KEY_ID="$HETZNER_S3_ACCESS_KEY" \
--from-literal=SECRET_ACCESS_KEY="$HETZNER_S3_SECRET_KEY"

Monitoring Secrets

Terminal window
# Create namespace
kubectl create namespace monitoring
# Slack webhook for alert notifications
kubectl create secret generic alertmanager-slack -n monitoring \
--from-literal=webhook-url="$SLACK_WEBHOOK_URL"
# Healthchecks.io dead-man's switch (Watchdog alert pings this URL every minute;
# if it stops, Healthchecks.io alerts you that the monitoring stack itself is down)
kubectl create secret generic healthchecks-io -n monitoring \
--from-literal=ping-url="$HEALTHCHECKS_PING_URL"
# Grafana admin credentials
kubectl create secret generic grafana-admin -n monitoring \
--from-literal=admin-user=admin \
--from-literal=admin-password="$(openssl rand -base64 16)"

Retrieve Grafana password later

Terminal window
kubectl get secret grafana-admin -n monitoring \
-o jsonpath='{.data.admin-password}' | base64 -d && echo

Verify:

Terminal window
kubectl get secrets -n ironflow
kubectl get secrets -n monitoring

Expected: ghcr-pull-secret, ironflow-s3-creds in ironflow namespace. alertmanager-slack, healthchecks-io, grafana-admin in monitoring namespace.


Phase 3: Deploy Ironflow

Deploy Ironflow with the Small template and Hetzner load balancer support. This command automatically installs all prerequisites:

  • CloudNativePG (PostgreSQL operator)
  • Barman Cloud Plugin (S3-compatible backups)
  • kube-prometheus-stack (Prometheus, Grafana, Alertmanager with full Hetzner config)
  • Traefik (ingress controller with Hetzner LB annotations)

Build and push the image

Build a Docker image from the current code and push it to GHCR. The cluster will pull this image during deployment.

Terminal window
# Build and push the image (linux/amd64 for Hetzner)
docker build -t ghcr.io/sahina/ironflow:latest \
--build-arg VERSION="$(git rev-parse --short HEAD)" \
--platform linux/amd64 .
echo $GITHUB_PAT | docker login ghcr.io -u $GITHUB_USERNAME --password-stdin
docker push ghcr.io/sahina/ironflow:latest

Deploy

Generate the master key first

The masterKey encrypts secrets stored in NATS KV (AES-256). If you lose it, encrypted secrets become unreadable. Generate and save it before running deploy, and pass the same key on every ironflow deploy upgrade command; Helm re-renders all values on each upgrade, so omitting the key reverts it to the chart default and existing secrets become unreadable.

Terminal window
export IRONFLOW_MASTER_KEY=$(openssl rand -hex 32)
echo "SAVE THIS KEY: $IRONFLOW_MASTER_KEY"

Terminal window
# Deploy Ironflow with full SRE stack
# The --hetzner-location flag enables Traefik + load balancer + full monitoring config
./build/ironflow deploy --template small --name demo \
--hetzner-location fsn1 \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=latest

Wait for all pods (3-5 minutes, CNPG bootstrap is the slowest):

Terminal window
# Watch Ironflow pods
kubectl get pods -n ironflow -w
# Watch monitoring pods
kubectl get pods -n monitoring -w
# Verify PostgreSQL cluster is healthy
kubectl get cluster -n ironflow
# Expected: Phase = "Cluster in healthy state"
# Verify daily backups are scheduled
kubectl get scheduledbackups -n ironflow
# Expected: 1 ScheduledBackup (daily at 2 AM)

What gets deployed

Namespace: ironflow
├── demo-ironflow (Deployment, 1 replica)
│   ├── ironflow container (:9123)
│   ├── CEL policy evaluator, auth audit recorder, platform API
│   └── init containers (wait-for-pg, wait-for-nats)
├── demo-ironflow-pg (CNPG Cluster, 1 instance)
│   ├── PostgreSQL 16
│   ├── 5Gi PVC (hcloud-volumes)
│   └── S3 WAL archiving (Barman Cloud Plugin)
├── demo-nats (StatefulSet, 1 replica)
│   ├── NATS with JetStream
│   ├── 2Gi PVC (hcloud-volumes)
│   └── Prometheus exporter sidecar (:7777)
└── demo-ironflow-pg-backup (ScheduledBackup, daily 2 AM)

Namespace: monitoring
├── kube-prometheus-stack-prometheus (StatefulSet)
│   ├── 10Gi PVC (hcloud-volumes, 15-day retention)
│   └── ServiceMonitor + PodMonitor selectors (all namespaces)
├── kube-prometheus-stack-grafana (Deployment)
│   ├── Dashboard sidecar (auto-discovers ConfigMaps)
│   └── Admin from grafana-admin secret
├── kube-prometheus-stack-alertmanager (StatefulSet)
│   ├── Slack receiver (#ironflow-alerts, 4h repeat)
│   ├── Healthchecks.io receiver (Watchdog, 1m repeat)
│   └── Inhibition: critical suppresses warning
└── kube-state-metrics (Deployment)

Namespace: traefik
└── traefik (Deployment, 2 replicas)
    ├── Hetzner LB annotations (private network, proxy protocol)
    └── Prometheus ServiceMonitor

Namespace: cnpg-system
├── cloudnative-pg (Deployment, CNPG operator)
└── barman-cloud-instance-manager (Deployment)

Phase 4: DNS and Ingress

Ingress is enabled in two phases to avoid burning Let's Encrypt rate limits: cert-manager requests a TLS certificate via an HTTP-01 challenge, and the challenge can only succeed once DNS resolves to the load balancer IP.

Step 1: Get the Load Balancer IP

Terminal window
# Wait for Hetzner to provision the load balancer (1-2 minutes)
kubectl get svc -n traefik traefik -w

Wait until EXTERNAL-IP shows an IP address (not <pending>). Note this IP.

Step 2: Create DNS Record

At your DNS provider, create an A record:

demo.ironflow.dev → A <EXTERNAL-IP>

Step 3: Verify DNS Propagation

Do NOT skip this step

Enabling ingress before DNS propagates triggers a Let’s Encrypt HTTP-01 challenge that fails. Let’s Encrypt has rate limits (5 failures per hour per domain). Failed attempts are not recoverable within that window.

Terminal window
dig demo.ironflow.dev +short
# MUST return the load balancer IP. If empty, wait and retry.

Step 4: Enable Ingress with TLS

Only run this after dig returns the correct IP:

Terminal window
./build/ironflow deploy upgrade --template small --name demo \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=latest \
--set ingress.enabled=true \
--set ingress.host=demo.ironflow.dev \
--set ingress.className=traefik

Verify TLS certificate:

Terminal window
# Check certificate status (may take 1-2 minutes)
kubectl get certificate -n ironflow
# Expected: READY = True
# Test HTTPS
curl -I https://demo.ironflow.dev/health
# Expected: HTTP/2 200
curl https://demo.ironflow.dev/ready
# Expected: {"status":"ok"}

Traffic Flow

Client HTTPS request
   │
   ▼
Hetzner LB (Layer 4, TCP, proxy protocol)
   │  Private network
   ▼
Traefik (TLS termination, hostname routing)
   │  Proxy headers (X-Forwarded-For, X-Real-IP)
   ▼
Ironflow Service (ClusterIP :9123)
   ├──▶ HTTP API (REST + ConnectRPC)
   ├──▶ gRPC API (pull-mode workers)
   └──▶ Dashboard (embedded React app)

Phase 5: Configure Ironflow

Use the CLI to create a project, environment, and API key for the demo. The CLI targets localhost:9123 by default, so use port-forwarding to access the remote cluster.

Terminal window
# Port-forward Ironflow service to localhost
kubectl port-forward svc/demo-ironflow -n ironflow 9123:9123 &
PF_PID=$!
# Extract admin credentials from startup logs
kubectl logs -n ironflow $(kubectl get pods -n ironflow \
-l app.kubernetes.io/component=server -o name | head -1) \
| grep -A12 "Admin API Key"
# This shows:
# Admin API Key: ifkey_... (for CLI and SDK authentication)
# Dashboard Admin:
# Email: admin@ironflow.local
# Password: <random> (for dashboard login at /login)
# Save both — they are only shown on first boot.
# Create a demo project and environment
./build/ironflow project create demo
./build/ironflow env create production --project demo
# Create an API key for the demo environment
./build/ironflow apikey create "demo-prod" --env env_demo_production
# Save the returned API key (ifkey_...)
# Stop port-forward
kill $PF_PID

Verify Capabilities

Terminal window
# Confirm the server is healthy and capabilities are reported
curl -s https://demo.ironflow.dev/api/v1/capabilities | jq

What this demo cluster includes

  • CEL policies — Custom role-based access control with Common Expression Language conditions
  • Auth audit logging — All authorization decisions, API key lifecycle, and role/policy changes recorded
  • Platform API — Multi-tenant control plane at /api/v1/platform/* (users, roles, policies, tenants, audit)
  • Custom roles — Create roles beyond the built-in admin/developer/viewer
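
As a quick probe of the platform API surface, you can list one of the resources named above. A hedged sketch: the /api/v1/platform/roles sub-path and the Bearer auth scheme are assumptions based on the prefix and resource list above, and ADMIN_API_KEY is a hypothetical variable holding the ifkey_... admin key from Phase 5:

Terminal window
# Hypothetical probe: sub-path and auth header are assumptions, not a confirmed API
curl -s -H "Authorization: Bearer $ADMIN_API_KEY" \
  https://demo.ironflow.dev/api/v1/platform/roles | jq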

Connecting Applications

Any application can now connect to this Ironflow instance:

Terminal window
# Application environment variables
IRONFLOW_URL=https://demo.ironflow.dev
IRONFLOW_API_KEY=ifkey_... # from apikey create output
IRONFLOW_ENV=production
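
A quick way to validate these values from the application side is an authenticated request. A sketch assuming the key is accepted as a Bearer token (check your SDK's auth scheme if this returns 401) and that the variables above are exported:

Terminal window
curl -s -H "Authorization: Bearer $IRONFLOW_API_KEY" \
  "$IRONFLOW_URL/api/v1/capabilities" | jq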

Phase 6: Verify the SRE Stack

The full monitoring stack includes 4 Grafana dashboards, 29 infrastructure alert rules, Slack notification routing, and a Healthchecks.io dead-man’s switch.

Grafana Dashboards

Terminal window
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3030:80

Open http://localhost:3030 and log in as admin with the password from the grafana-admin secret (see Phase 2 for the retrieval command).

| Dashboard | What it shows |
|---|---|
| Ironflow Performance | Request latency (p50/p95/p99), throughput (req/s), error rates by status code, run completion rates |
| PostgreSQL CNPG | Cluster health, replication lag, connection count vs max, transaction rates, disk usage |
| NATS Monitoring | JetStream storage utilization, consumer lag, slow consumers, message rates |
| Kubernetes Infrastructure | Node CPU/memory/disk, pod scheduling, container restarts, PVC usage |

Prometheus Targets

Terminal window
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090

Open http://localhost:9090/targets. All targets should show UP:

  • ironflow (ServiceMonitor, :9123/metrics)
  • nats (exporter sidecar, :7777/metrics)
  • cnpg (PodMonitor)
  • kube-state-metrics
  • traefik (ServiceMonitor)

Alert Rules

Terminal window
kubectl get prometheusrules -n ironflow

| PrometheusRule | Rules | Coverage |
|---|---|---|
| ironflow-alerts | 17 | IronflowDown, HighErrorRate, HighLatency, NATSDown, PostgreSQLDown, DiskSpaceLow, ProbeFailure, HighRunFailureRate, WorkerDisconnected, MemoryPressure, AlertmanagerFailedNotifications, NATSPublishCircuitOpen, DLQBacklogGrowing, ProjectionLagHigh, SubscriptionDropsHigh, CodecEncodeErrorsHigh, CodecDecodeErrorsHigh |
| pg-alerts | 8 | CNPGClusterDown, CNPGReplicationLagHigh, CNPGDiskSpaceCritical, CNPGConnectionSaturation, CNPGDiskSpaceWarning, CNPGBackupStale, CNPGHighRollbackRate, CNPGLongRunningTransaction |
| nats-alerts | 4 | NATSJetStreamStorageCritical, NATSSlowConsumers, NATSHighPendingMessages, NATSServerHighMemory |

Alert Routing

Alert fired
   │
   ▼
Alertmanager groups by alertname + namespace
   ├── Watchdog ──▶ Healthchecks.io (1-minute ping)
   │      If pings stop → HC alerts you that monitoring is down
   ├── InfoInhibitor ──▶ /dev/null (suppressed)
   └── All others ──▶ Slack #ironflow-alerts (4-hour repeat)
          ├── Critical severity suppresses matching Warning alerts
          └── Format: severity, summary, description, runbook URL

Verify Alertmanager routing:

Terminal window
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093

Open http://localhost:9093. The Slack and Healthchecks.io routes should be visible under Status → Config.
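
To verify delivery end to end rather than just the config, post a synthetic alert to Alertmanager's v2 API while the port-forward is running; the alert name and labels below are arbitrary test values:

Terminal window
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RoutingTest","severity":"warning","namespace":"ironflow"},"annotations":{"summary":"Synthetic alert to exercise Slack routing"}}]'
# A RoutingTest notification should reach #ironflow-alerts shortly afterward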

Public Metrics Endpoint

The Ironflow metrics endpoint is accessible via the public URL:

Terminal window
curl https://demo.ironflow.dev/metrics | head -20

This exposes Prometheus-format metrics including ironflow_runs_total, ironflow_http_requests_total, ironflow_http_request_duration_seconds_bucket, and NATS/scheduler internals.
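
For a quick spot check of the counters named above:

Terminal window
curl -s https://demo.ironflow.dev/metrics \
  | grep -E '^ironflow_(runs_total|http_requests_total)'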


Security Configuration

| Aspect | Setting | Details |
|---|---|---|
| API authentication | devMode: false | All requests require API key (enforced by Small template) |
| Dashboard auth | Password (cookie-based) | Set on first login |
| Secrets encryption | masterKey (AES-256) | Encrypts secrets in NATS KV |
| TLS | cert-manager + Let's Encrypt | Automatic certificate issuance and renewal |
| Internal services | ClusterIP only | PostgreSQL and NATS are not exposed externally |
| Network policy | Not enabled | Acceptable for dedicated single-tenant cluster |
| Container security | Non-root, read-only filesystem | runAsUser: 100, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true |
| CEL policies | Custom RBAC with CEL expressions | Deny-always-wins, database-driven, decision caching |
| Auth audit | All auth events logged | Authorization decisions, key lifecycle, role/policy changes |
| Platform API | Multi-tenant control plane | /api/v1/platform/* (requires platform API key or JWT) |

Common Failure Modes

| Failure | Symptom | Diagnosis | Resolution |
|---|---|---|---|
| HCLOUD_TOKEN expired | terraform apply fails with 401 | Check token in Hetzner Console | Regenerate token, re-export HCLOUD_TOKEN |
| S3 credentials wrong | CNPG pod in CrashLoopBackOff | kubectl describe pod -n ironflow <pg-pod> | Delete and recreate the ironflow-s3-creds secret, then delete the CNPG pod |
| DNS not propagated | cert-manager challenge fails | kubectl describe certificate -n ironflow | Wait for DNS, then delete the failed Certificate to retry |
| Let's Encrypt rate limited | Certificate stuck in False state | kubectl describe order -n ironflow | Wait 1 hour. Use the staging issuer for testing: --set ingress.annotations.cert-manager\.io/cluster-issuer=letsencrypt-staging |
| Slack secret missing | Alerts fire but no notifications | kubectl logs -n monitoring alertmanager-... | Create the alertmanager-slack secret in the monitoring namespace |
| Healthchecks.io secret missing | Alertmanager pod fails to start | kubectl get pods -n monitoring | Create the healthchecks-io secret in the monitoring namespace |
| masterKey not set | Secrets stored unencrypted | Warning in Ironflow startup logs | Redeploy with --set ironflow.masterKey=... |
| Worker OOM | Pods evicted, node pressure | kubectl describe node <worker> | Verify the cpx32 worker type; if cpx22, reprovision with cpx32 |
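
When a symptom doesn't match a row above, recent cluster events are usually the fastest first pass:

Terminal window
# Most recent events across all namespaces, oldest first
kubectl get events -A --sort-by=.lastTimestamp | tail -20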

Ongoing Maintenance

Upgrade Ironflow

To a specific release:

Terminal window
# clean build
make all
# IMPORTANT: Always include masterKey and ingress flags on upgrades. Helm
# overwrites all values on each upgrade — omitting any causes them to revert
# to defaults (e.g., ingress gets deleted, encrypted secrets become unreadable).
./build/ironflow deploy upgrade --template small --name demo \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=v0.17.0 \
--set ingress.enabled=true \
--set ingress.host=demo.ironflow.dev \
--set ingress.className=traefik

To latest unreleased code (between releases):

If you want to deploy the latest code from main without cutting a formal release, rebuild and push the latest image tag:

Terminal window
# 1. Build and push a new :latest image from current code
make all
docker build -t ghcr.io/sahina/ironflow:latest \
--build-arg VERSION="$(git rev-parse --short HEAD)" \
--platform linux/amd64 .
docker push ghcr.io/sahina/ironflow:latest
# 2. Force the cluster to pull the new image (since the tag hasn't changed,
# Kubernetes won't re-pull unless the pod is deleted)
./build/ironflow deploy upgrade --template small --name demo \
--set ironflow.masterKey=$IRONFLOW_MASTER_KEY \
--set image.tag=latest \
--set ingress.enabled=true \
--set ingress.host=demo.ironflow.dev \
--set ingress.className=traefik
kubectl rollout restart deployment/demo-ironflow -n ironflow

Why rollout restart?

When the image tag is latest, Kubernetes sees no spec change on upgrade and won’t pull the new image. rollout restart forces a new pod with a fresh pull. For versioned tags (e.g., v0.17.0) this isn’t needed.

The rolling update replaces the pod. With a single replica this means a brief window of unavailability during the restart, which is acceptable for a demo cluster.
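
To confirm the restart actually pulled fresh code, compare the running container's image digest before and after the restart (this reuses the component label from Phase 5):

Terminal window
kubectl get pods -n ironflow -l app.kubernetes.io/component=server \
  -o jsonpath='{.items[0].status.containerStatuses[0].imageID}' && echo
# The sha256 digest should change if a new :latest was pushed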

Check Backup Health

Terminal window
# Verify latest backup
kubectl get backups -n ironflow --sort-by=.status.startedAt
# Most recent should be < 24 hours old
# Verify backup schedule
kubectl get scheduledbackups -n ironflow

Access Monitoring

Terminal window
# Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3030:80
# Prometheus
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Alertmanager
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093

Renew TLS Certificate

cert-manager automatically renews the Let’s Encrypt certificate 30 days before expiry. No manual action required. To check status:

Terminal window
kubectl get certificate -n ironflow
# READY should be True, EXPIRATION shows the current cert expiry

Teardown

To completely remove the demo cluster:

Terminal window
# 1. Remove Ironflow Helm release (deletes pods, services, PVCs)
./build/ironflow deploy delete --name demo
# 2. Destroy Hetzner infrastructure (nodes, LB, volumes)
./build/ironflow provision destroy --provider hetzner --name demo
# 3. Clean up DNS
# Manually remove the A record for demo.ironflow.dev at your DNS provider
# 4. Clean up local kubeconfig
rm ~/.kube/clusters/hetzner-demo.yaml

Partial teardown

To remove just Ironflow but keep the cluster: run only step 1. To remove monitoring: helm uninstall kube-prometheus-stack -n monitoring.


Verification Checklist

Use this checklist after completing all phases:

  • Cluster: 2 nodes Ready (kubectl get nodes)
  • Pods: All Running in ironflow, monitoring, cnpg-system, traefik namespaces
  • PostgreSQL: Cluster healthy (kubectl get cluster -n ironflow)
  • Backups: ScheduledBackup configured (kubectl get scheduledbackups -n ironflow)
  • Health: https://demo.ironflow.dev/health returns 200
  • Readiness: https://demo.ironflow.dev/ready returns {"status":"ok"}
  • Metrics: https://demo.ironflow.dev/metrics returns Prometheus metrics
  • Grafana: 4 dashboards load with data at localhost:3030
  • Prometheus: All targets show UP at localhost:9090/targets
  • Capabilities: curl .../api/v1/capabilities returns a valid JSON payload
  • Alerts: All rules loaded (kubectl get prometheusrules -n ironflow)
  • Slack: Alert routing configured (check Alertmanager UI)
  • Healthchecks.io: Watchdog pings arriving (check Healthchecks.io dashboard)
  • TLS: Valid certificate (kubectl get certificate -n ironflow, READY=True)

Upgrade Path

When you outgrow the Small template:

| Trigger | Action |
|---|---|
| Need zero-downtime deploys | Upgrade to Medium (3 replicas, NATS cluster) |
| >1000 concurrent active runs | Upgrade to Medium |
| Need PostgreSQL HA | Upgrade to Medium (2 PG instances, PgBouncer) |
| Need auto-scaling | Upgrade to Large (HPA, external deps) |

Small to Medium requires a full redeploy (NATS topology changes from 1 to 3 nodes). Data survives via S3 backups + CNPG restore. Medium cost: ~€44/month.
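
Before starting that redeploy, it is worth confirming the backups you plan to restore from actually exist in the bucket. A sketch using the AWS CLI against Hetzner's S3-compatible endpoint; assumes the aws CLI is installed and the Prerequisites environment variables are still set:

Terminal window
AWS_ACCESS_KEY_ID="$HETZNER_S3_ACCESS_KEY" \
AWS_SECRET_ACCESS_KEY="$HETZNER_S3_SECRET_KEY" \
aws s3 ls "s3://$HETZNER_S3_BUCKET/" --recursive \
  --endpoint-url "$HETZNER_S3_ENDPOINT" | tail -20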

See Scaling Scenarios for detailed upgrade procedures.