Scaling Scenarios
Scale Ironflow horizontally (add replicas)
Trigger: Traffic increasing, latency rising, or preparing for a load test.
Scaling beyond 1 replica requires cluster.enabled: true in Helm values. The Medium and Large templates already include this. If you are on the Small template (single replica, no clustering), upgrade to Medium first before scaling out.
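For reference, the values involved look roughly like this (a minimal sketch; key names are taken from the commands on this page, so verify the exact structure against deploy/helm/ironflow/values.yaml in your release):

```yaml
# Illustrative Helm values for scaling out (verify against your chart version)
cluster:
  enabled: true   # required for more than 1 replica
replicaCount: 3   # Medium-style default; raise this to scale out
```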
Steps:
```bash
# 1. If using Large template with HPA, check whether auto-scaling is already handling it
kubectl get hpa ironflow -n ironflow
kubectl describe hpa ironflow -n ironflow
# Look at TARGETS column — if CPU is below the threshold (70% by default),
# HPA has not triggered yet. You can lower the threshold or add replicas manually.

# 2. For Medium template (or Large with HPA disabled), scale manually
# Example: scale from 3 to 5 replicas
ironflow deploy upgrade --template medium --name my-release --set replicaCount=5

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set replicaCount=5

# 3. Verify pods are running
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
# All replicas should show STATUS=Running within ~30 seconds
```

Post-checks: See kubectl Operations for health checks and log tailing across multiple replicas.
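The HPA decision in step 1 comes down to comparing observed CPU against the threshold. A minimal sketch of that check (the `hpa_would_scale` helper is hypothetical, not part of the ironflow CLI; values are illustrative, not read from a live cluster):

```bash
#!/bin/sh
# Would an HPA with the given CPU threshold (70% by default) scale up?
# Hypothetical helper for reasoning about step 1.
hpa_would_scale() {
  cpu_pct=$1
  threshold=${2:-70}
  if [ "$cpu_pct" -ge "$threshold" ]; then
    echo "yes"   # HPA adds replicas on its own
  else
    echo "no"    # below threshold: scale manually if you need headroom
  fi
}

hpa_would_scale 55      # prints "no"  - HPA idle, add replicas by hand
hpa_would_scale 85 70   # prints "yes" - HPA will handle it
```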
Scale down to save cost (off-hours / evaluation)
Trigger: No traffic expected, want to minimize Hetzner spend.
Steps:
Option A: Scale replicas to 1 (reversible, keeps cluster running)
```bash
# Scale Ironflow down to a single replica
ironflow deploy upgrade --template medium --name my-release --set replicaCount=1

# If using Large template with HPA, also lower minReplicas to prevent auto-scale-up
ironflow deploy upgrade --template large --name my-release \
  --set replicaCount=1 --set autoscaling.minReplicas=1

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set replicaCount=1

# Verify only one pod remains
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
```

To scale back up later, run the same helm upgrade with the desired replica count.
Option B: Destroy the entire cluster to stop billing
```bash
# Destroy all Hetzner infrastructure (VMs, volumes, load balancers)
ironflow provision destroy --provider hetzner --name ironflow
```

Option B destroys everything, including persistent volumes. You must have database backups to restore data. Use this only for evaluation or disposable environments. Option A is always reversible.
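Because the destroy is irreversible, it can be worth putting a small guard in front of it. A sketch under stated assumptions: the `safe_destroy` wrapper and the backup-marker convention are hypothetical, something you would define yourself, not part of the ironflow CLI.

```bash
#!/bin/sh
# Hypothetical guard: refuse to destroy unless a backup marker file exists.
# Touch the marker only after verifying a restorable database backup.
safe_destroy() {
  marker=$1
  if [ ! -f "$marker" ]; then
    echo "refusing to destroy: no backup marker at $marker" >&2
    return 1
  fi
  # Real command (commented out in this sketch):
  # ironflow provision destroy --provider hetzner --name ironflow
  echo "backup marker found; destroy would proceed"
}
```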
Add a worker node to the cluster
Trigger: Pods pending due to insufficient resources, or monitoring shows memory/CPU pressure across nodes. Applies to Hetzner/Terraform-managed clusters only.
Steps:
```bash
# 1. Confirm pods are pending due to resource constraints
kubectl get pods -n ironflow --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n ironflow
# Look for "Insufficient cpu" or "Insufficient memory" in Events

# 2. Check current node capacity
kubectl top nodes
kubectl get nodes -o wide

# 3. Edit Terraform variables to increase worker count (or use a larger server type)
# In deploy/terraform/hetzner/terraform.tfvars:
#   worker_count = 3        # was 2
#   worker_type  = "cpx32"  # or upgrade to "cpx42" for more resources

# 4. Apply the change
cd deploy/terraform/hetzner
terraform plan   # Review what will be created
terraform apply  # Provision the new node

# 5. Wait for the new node to join (~2 minutes for Talos to bootstrap)
kubectl get nodes -w
# The new node appears as NotReady, then transitions to Ready

# 6. Verify the pending pods get scheduled
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
```

Note: Adding a node does not automatically redistribute existing pods. Kubernetes only schedules new or rescheduled pods onto the new node. If you need to rebalance, use `kubectl rollout restart deployment/ironflow -n ironflow`.
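Before applying Terraform, a quick sanity check is whether the pending pod's requests even fit on the new worker size. A back-of-the-envelope sketch in millicores and MiB (the node figures are illustrative, not official Hetzner specs; plug in real numbers from `kubectl describe pod` and `kubectl describe node`):

```bash
#!/bin/sh
# Does a pod request fit on a node's allocatable capacity? Pure arithmetic.
fits_on_node() {
  req_mcpu=$1; req_mib=$2      # pod requests (millicores, MiB)
  node_mcpu=$3; node_mib=$4    # node allocatable (millicores, MiB)
  if [ "$req_mcpu" -le "$node_mcpu" ] && [ "$req_mib" -le "$node_mib" ]; then
    echo "fits"
  else
    echo "does not fit"
  fi
}

# Example: a 500m / 1Gi pod against an illustrative 4 vCPU / 8 GiB worker,
# minus a rough 10% reserved for system daemons
fits_on_node 500 1024 3600 7372
```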
Scale NATS cluster (add/remove replicas)
Trigger: JetStream storage filling up, or reducing from 3-node to 1-node for cost. Applies to Medium template only (bundled NATS cluster).
Never scale a 3-node NATS cluster below 2 replicas without disabling clustering entirely. Losing quorum (majority of nodes) causes JetStream to become read-only, blocking all event processing.
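The quorum rule behind this warning: JetStream's Raft layer needs a strict majority of nodes, i.e. floor(n/2) + 1. In shell arithmetic:

```bash
#!/bin/sh
# Minimum nodes that must stay up for JetStream writes to continue
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # prints 2 - a 3-node cluster tolerates losing exactly 1 node
quorum 5   # prints 3 - a 5-node cluster tolerates losing 2
# Dropping a 3-node cluster to 1 replica leaves 1 node, below the quorum
# of 2, so JetStream writes stop.
```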
Steps:
Scaling up (e.g., 3 to 5 replicas):
```bash
ironflow deploy upgrade --template medium --name my-release \
  --set nats.config.cluster.replicas=5

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set nats.config.cluster.replicas=5

# Verify NATS cluster membership
kubectl get pods -n ironflow -l app.kubernetes.io/name=nats
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/routez
```

Scaling down to a single node (switch from Medium to Small):
To go from a 3-node NATS cluster to a single node, disable clustering entirely. This effectively moves from a Medium to a Small-like NATS configuration:
```bash
ironflow deploy upgrade --template small --name my-release \
  --set nats.config.cluster.enabled=false \
  --set nats.config.cluster.replicas=1

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values \
#   --set nats.config.cluster.enabled=false \
#   --set nats.config.cluster.replicas=1
```

Increasing JetStream PVC size:
If JetStream storage is filling up (check with `kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/jsz`), you can increase the PVC size, provided the storage class supports volume expansion:
```bash
# Check if the storage class allows expansion
kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}: {.allowVolumeExpansion}{"\n"}{end}'

# Patch the PVC (example: expand from 10Gi to 20Gi)
kubectl patch pvc ironflow-nats-js-ironflow-nats-0 -n ironflow \
  -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'
```

If volume expansion is not supported, you need to delete the StatefulSet (keeping pods running) and recreate it with the new size. StatefulSet volumeClaimTemplates are immutable, so a normal helm upgrade changing the PVC size will fail:
```bash
# 1. Delete the StatefulSet without deleting pods (they keep running)
kubectl delete statefulset ironflow-nats -n ironflow --cascade=orphan

# 2. Delete the PVCs (data loss for affected replicas — cluster resyncs from remaining nodes)
kubectl delete pvc -l app.kubernetes.io/name=nats -n ironflow

# 3. Redeploy with the new PVC size
ironflow deploy upgrade --template medium --name my-release \
  --set nats.config.jetstream.fileStore.pvc.size=20Gi

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set nats.config.jetstream.fileStore.pvc.size=20Gi
```

Connection pooling with PgBouncer
When running multiple Ironflow replicas (especially with HPA), each replica opens its own connection pool to PostgreSQL (default: 25 connections per replica). At 10 replicas, that’s 250 connections against PostgreSQL.
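The arithmetic above, plus a headroom check against PostgreSQL's max_connections (100 in a stock PostgreSQL build; your CNPG cluster may configure it higher):

```bash
#!/bin/sh
# Total PostgreSQL connections opened by the app tier without a pooler
total_conns() { echo $(( $1 * $2 )); }   # replicas * pool size per replica

total_conns 10 25   # prints 250 - well past a default max_connections of 100
total_conns 3 25    # prints 75  - fits, but leaves little headroom
```

When the total approaches max_connections, enable the pooler below rather than raising max_connections, since each direct PostgreSQL connection carries per-backend memory overhead.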
Ironflow’s Helm chart includes a CNPG Pooler (PgBouncer) template:
```yaml
postgresql:
  pooler:
    enabled: true
    instances: 2
    type: rw
    poolMode: transaction
    parameters:
      max_client_conn: "1000"
      default_pool_size: "25"
```

This deploys PgBouncer pods that multiplex application connections into a smaller pool of PostgreSQL connections. The transaction pool mode releases the server connection back to the pool after each transaction completes.
When the pooler is enabled, Ironflow automatically routes its database connection through PgBouncer via the CNPG-generated pooler secret.
Check pooler status:
```bash
kubectl get poolers -n <namespace>
kubectl get pods -l cnpg.io/poolerName=<pooler-name> -n <namespace>
```

PostgreSQL failover (planned maintenance)
Trigger: Need to drain a node running the PG primary, or perform PG version upgrade. Requires the kubectl-cnpg plugin (install via kubectl krew install cnpg).
Steps:
```bash
# 1. Check current primary
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql -L role
# The pod with role=primary is the current leader

# 2. Trigger a controlled switchover (promotes a replica to primary)
kubectl cnpg switchover ironflow-postgresql -n ironflow

# 3. Verify the new primary
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql -L role
# A different pod should now show role=primary

# 4. Now it is safe to drain the node that was running the old primary
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

Alternative (automatic failover): If you do not have the kubectl-cnpg plugin, you can simply drain the node. CNPG detects the primary pod termination and automatically promotes a replica. This causes a brief write outage (typically 5-10 seconds) during the failover window. Ironflow reconnects automatically.
Note: The Small template runs a single PG instance with no replicas, so failover is not available. Draining the node running PG on a Small deployment means database downtime until the pod is rescheduled.
Scale a tenant’s resources (multi-tenant cluster)
Trigger: A tenant needs more CPU, memory, storage, or a different resource quota than what values-multi-tenant.yaml provides by default.
Steps:
```bash
# 1. Check the tenant's current resource usage against their quota
kubectl describe resourcequota -n tenant-acme

# 2. Upgrade the tenant with higher limits
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set resourceQuota.cpu.limits=8 \
  --set resourceQuota.memory.limits=16Gi \
  --set resourceQuota.storage=100Gi \
  --set postgresql.persistence.size=20Gi \
  --set resources.requests.cpu=500m \
  --set resources.limits.memory=1Gi

# 3. Verify the quota was updated
kubectl get resourcequota -n tenant-acme -o yaml
```

PVC expansion
Increasing postgresql.persistence.size only works if your StorageClass supports volume expansion (allowVolumeExpansion: true). If it does not, you need to back up and restore the database to a new PVC. Note that CNPG manages its own PVCs: rather than patching a PVC directly, update the storage size on the Cluster CR and let CNPG handle the rolling replacement.
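A sketch of the corresponding Cluster CR change, assuming the CNPG cluster is named ironflow-postgresql as elsewhere on this page (the patch is printed rather than applied, so this runs without cluster access):

```bash
#!/bin/sh
# Build the merge patch for CNPG's .spec.storage.size and print it.
NEW_SIZE="40Gi"
PATCH="{\"spec\":{\"storage\":{\"size\":\"$NEW_SIZE\"}}}"
echo "$PATCH"

# Apply it with (not executed in this sketch):
#   kubectl patch cluster ironflow-postgresql -n ironflow --type merge -p "$PATCH"
```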
To expand NATS JetStream storage for a tenant:
```bash
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set nats.config.jetstream.fileStore.pvc.size=10Gi
```
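As the number of per-tenant --set flags grows, it can be cleaner to keep a per-tenant values file. A sketch (the values-acme.yaml filename is a convention you would choose yourself; the key paths mirror the --set flags used above):

```yaml
# values-acme.yaml (hypothetical per-tenant overrides)
ingress:
  host: acme.ironflow.example.com
resourceQuota:
  cpu:
    limits: 8
  memory:
    limits: 16Gi
  storage: 100Gi
postgresql:
  persistence:
    size: 20Gi
nats:
  config:
    jetstream:
      fileStore:
        pvc:
          size: 10Gi
```

Pass it after the shared file so tenant values win: `helm upgrade acme ./deploy/helm/ironflow -n tenant-acme -f deploy/helm/ironflow/values-multi-tenant.yaml -f values-acme.yaml`.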