Scaling Scenarios
Scale Ironflow horizontally (add replicas)
Trigger: Traffic increasing, latency rising, or preparing for a load test.
Scaling beyond 1 replica requires cluster.enabled: true in Helm values. The Medium and Large templates already include this. If you are on the Small template (single replica, no clustering), upgrade to Medium first before scaling out.
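For reference, the values involved look roughly like this (a minimal sketch; key names are taken from the commands on this page, so verify the exact structure against deploy/helm/ironflow/values.yaml in your release):

```yaml
# Illustrative Helm values for scaling out (verify against your chart version)
cluster:
  enabled: true   # required for more than 1 replica
replicaCount: 3   # Medium-style default; raise this to scale out
```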
Steps:
```bash
# 1. If using Large template with HPA, check whether auto-scaling is already handling it
kubectl get hpa ironflow -n ironflow
kubectl describe hpa ironflow -n ironflow
# Look at TARGETS column — if CPU is below the threshold (70% by default),
# HPA has not triggered yet. You can lower the threshold or add replicas manually.

# 2. For Medium template (or Large with HPA disabled), scale manually
# Example: scale from 3 to 5 replicas
ironflow deploy upgrade --template medium --name my-release --set replicaCount=5

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set replicaCount=5

# 3. Verify pods are running
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
# All replicas should show STATUS=Running within ~30 seconds
```

Post-checks: See kubectl Operations for health checks and log tailing across multiple replicas.
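The HPA decision in step 1 comes down to comparing observed CPU against the threshold. A minimal sketch of that check (the `hpa_would_scale` helper is hypothetical, not part of the ironflow CLI; values are illustrative, not read from a live cluster):

```bash
#!/bin/sh
# Would an HPA with the given CPU threshold (70% by default) scale up?
# Hypothetical helper for reasoning about step 1.
hpa_would_scale() {
  cpu_pct=$1
  threshold=${2:-70}
  if [ "$cpu_pct" -ge "$threshold" ]; then
    echo "yes"   # HPA adds replicas on its own
  else
    echo "no"    # below threshold: scale manually if you need headroom
  fi
}

hpa_would_scale 55      # prints "no"  - HPA idle, add replicas by hand
hpa_would_scale 85 70   # prints "yes" - HPA will handle it
```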
Scale down to save cost (off-hours / evaluation)
Trigger: No traffic expected, want to minimize Hetzner spend.
Steps:
Option A: Scale replicas to 1 (reversible, keeps cluster running)
```bash
# Scale Ironflow down to a single replica
ironflow deploy upgrade --template medium --name my-release --set replicaCount=1

# If using Large template with HPA, also lower minReplicas to prevent auto-scale-up
ironflow deploy upgrade --template large --name my-release \
  --set replicaCount=1 --set autoscaling.minReplicas=1

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set replicaCount=1

# Verify only one pod remains
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
```

To scale back up later, run the same helm upgrade with the desired replica count.
Option B: Destroy the entire cluster to stop billing
```bash
# Destroy all Hetzner infrastructure (VMs, volumes, load balancers)
ironflow provision destroy --provider hetzner --name ironflow
```

Option B destroys everything, including persistent volumes. You must have database backups to restore data. Use this only for evaluation or disposable environments. Option A is always reversible.
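Because the destroy is irreversible, it can be worth putting a small guard in front of it. A sketch under stated assumptions: the `safe_destroy` wrapper and the backup-marker convention are hypothetical, something you would define yourself, not part of the ironflow CLI.

```bash
#!/bin/sh
# Hypothetical guard: refuse to destroy unless a backup marker file exists.
# Touch the marker only after verifying a restorable database backup.
safe_destroy() {
  marker=$1
  if [ ! -f "$marker" ]; then
    echo "refusing to destroy: no backup marker at $marker" >&2
    return 1
  fi
  # Real command (commented out in this sketch):
  # ironflow provision destroy --provider hetzner --name ironflow
  echo "backup marker found; destroy would proceed"
}
```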
Add a worker node to the cluster
Trigger: Pods pending due to insufficient resources, or monitoring shows memory/CPU pressure across nodes. Applies to Hetzner/Terraform-managed clusters only.
Steps:
```bash
# 1. Confirm pods are pending due to resource constraints
kubectl get pods -n ironflow --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n ironflow
# Look for "Insufficient cpu" or "Insufficient memory" in Events

# 2. Check current node capacity
kubectl top nodes
kubectl get nodes -o wide

# 3. Edit Terraform variables to increase worker count (or use a larger server type)
# In deploy/terraform/hetzner/terraform.tfvars:
#   worker_count = 3        # was 2
#   worker_type  = "cpx32"  # or upgrade to "cpx42" for more resources

# 4. Apply the change
cd deploy/terraform/hetzner
terraform plan   # Review what will be created
terraform apply  # Provision the new node

# 5. Wait for the new node to join (~2 minutes for Talos to bootstrap)
kubectl get nodes -w
# The new node appears as NotReady, then transitions to Ready

# 6. Verify the pending pods get scheduled
kubectl get pods -n ironflow -l app.kubernetes.io/component=server
```

Note: Adding a node does not automatically redistribute existing pods. Kubernetes only schedules new or rescheduled pods onto the new node. If you need to rebalance, use `kubectl rollout restart deployment/ironflow -n ironflow`.
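Before applying Terraform, a quick sanity check is whether the pending pod's requests even fit on the new worker size. A back-of-the-envelope sketch in millicores and MiB (the node figures are illustrative, not official Hetzner specs; plug in real numbers from `kubectl describe pod` and `kubectl describe node`):

```bash
#!/bin/sh
# Does a pod request fit on a node's allocatable capacity? Pure arithmetic.
fits_on_node() {
  req_mcpu=$1; req_mib=$2      # pod requests (millicores, MiB)
  node_mcpu=$3; node_mib=$4    # node allocatable (millicores, MiB)
  if [ "$req_mcpu" -le "$node_mcpu" ] && [ "$req_mib" -le "$node_mib" ]; then
    echo "fits"
  else
    echo "does not fit"
  fi
}

# Example: a 500m / 1Gi pod against an illustrative 4 vCPU / 8 GiB worker,
# minus a rough 10% reserved for system daemons
fits_on_node 500 1024 3600 7372
```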
Scale NATS cluster (add/remove replicas)
Trigger: JetStream storage filling up, or reducing from 3-node to 1-node for cost. Applies to Medium template only (bundled NATS cluster).
Never scale a 3-node NATS cluster below 2 replicas without disabling clustering entirely. Losing quorum (majority of nodes) causes JetStream to become read-only, blocking all event processing.
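The quorum rule behind this warning: JetStream's Raft layer needs a strict majority of nodes, i.e. floor(n/2) + 1. In shell arithmetic:

```bash
#!/bin/sh
# Minimum nodes that must stay up for JetStream writes to continue
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # prints 2 - a 3-node cluster tolerates losing exactly 1 node
quorum 5   # prints 3 - a 5-node cluster tolerates losing 2
# Dropping a 3-node cluster to 1 replica leaves 1 node, below the quorum
# of 2, so JetStream writes stop.
```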
Steps:
Scaling up (e.g., 3 to 5 replicas):
```bash
ironflow deploy upgrade --template medium --name my-release \
  --set nats.config.cluster.replicas=5

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set nats.config.cluster.replicas=5

# Verify NATS cluster membership
kubectl get pods -n ironflow -l app.kubernetes.io/name=nats
kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/routez
```

Scaling down to a single node (switch from Medium to Small):
To go from a 3-node NATS cluster to a single node, disable clustering entirely. This effectively moves from a Medium to a Small-like NATS configuration:
```bash
ironflow deploy upgrade --template small --name my-release \
  --set nats.config.cluster.enabled=false \
  --set nats.config.cluster.replicas=1

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values \
#   --set nats.config.cluster.enabled=false \
#   --set nats.config.cluster.replicas=1
```

Increasing JetStream PVC size:
If JetStream storage is filling up (check with `kubectl exec -n ironflow ironflow-nats-0 -c nats -- wget -qO- http://localhost:8222/jsz`), you can increase the PVC size, provided the storage class supports volume expansion:
```bash
# Check if the storage class allows expansion
kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}: {.allowVolumeExpansion}{"\n"}{end}'

# Patch the PVC (example: expand from 10Gi to 20Gi)
kubectl patch pvc ironflow-nats-js-ironflow-nats-0 -n ironflow \
  -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'
```

If volume expansion is not supported, you need to delete the StatefulSet (keeping pods running) and recreate it with the new size. StatefulSet volumeClaimTemplates are immutable, so a normal helm upgrade changing the PVC size will fail:
```bash
# 1. Delete the StatefulSet without deleting pods (they keep running)
kubectl delete statefulset ironflow-nats -n ironflow --cascade=orphan

# 2. Delete the PVCs (data loss for affected replicas — cluster resyncs from remaining nodes)
kubectl delete pvc -l app.kubernetes.io/name=nats -n ironflow

# 3. Redeploy with the new PVC size
ironflow deploy upgrade --template medium --name my-release \
  --set nats.config.jetstream.fileStore.pvc.size=20Gi

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set nats.config.jetstream.fileStore.pvc.size=20Gi
```

Connection pooling with PgBouncer
When running multiple Ironflow replicas (especially with HPA), each replica opens its own connection pool to PostgreSQL (default: 25 connections per replica). At 10 replicas, that’s 250 connections against PostgreSQL.
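The arithmetic above, plus a headroom check against PostgreSQL's max_connections (100 in a stock PostgreSQL build; your CNPG cluster may configure it higher):

```bash
#!/bin/sh
# Total PostgreSQL connections opened by the app tier without a pooler
total_conns() { echo $(( $1 * $2 )); }   # replicas * pool size per replica

total_conns 10 25   # prints 250 - well past a default max_connections of 100
total_conns 3 25    # prints 75  - fits, but leaves little headroom
```

When the total approaches max_connections, enable the pooler below rather than raising max_connections, since each direct PostgreSQL connection carries per-backend memory overhead.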
Ironflow’s Helm chart includes a CNPG Pooler (PgBouncer) template:
```yaml
postgresql:
  pooler:
    enabled: true
    instances: 2
    type: rw
    poolMode: transaction
    parameters:
      max_client_conn: "1000"
      default_pool_size: "25"
```

This deploys PgBouncer pods that multiplex application connections into a smaller pool of PostgreSQL connections. The transaction pool mode releases the server connection back to the pool after each transaction completes.
When the pooler is enabled, Ironflow automatically routes its database connection through PgBouncer via the CNPG-generated pooler secret.
Check pooler status:
```bash
kubectl get poolers -n <namespace>
kubectl get pods -l cnpg.io/poolerName=<pooler-name> -n <namespace>
```

PostgreSQL failover (planned maintenance)
Trigger: Need to drain a node running the PG primary, or perform PG version upgrade. Requires the kubectl-cnpg plugin (install via kubectl krew install cnpg).
Steps:
```bash
# 1. Check current primary
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql -L role
# The pod with role=primary is the current leader

# 2. Trigger a controlled switchover (promotes a replica to primary)
kubectl cnpg switchover ironflow-postgresql -n ironflow

# 3. Verify the new primary
kubectl get pods -n ironflow -l cnpg.io/cluster=ironflow-postgresql -L role
# A different pod should now show role=primary

# 4. Now it is safe to drain the node that was running the old primary
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

Alternative (automatic failover): If you do not have the kubectl-cnpg plugin, you can simply drain the node. CNPG detects the primary pod termination and automatically promotes a replica. This causes a brief write outage (typically 5-10 seconds) during the failover window. Ironflow reconnects automatically.
Note: The Small template runs a single PG instance with no replicas, so failover is not available. Draining the node running PG on a Small deployment means database downtime until the pod is rescheduled.
Scale a tenant’s resources (multi-tenant cluster)
Trigger: A tenant needs more CPU, memory, storage, or a different resource quota than what values-multi-tenant.yaml provides by default.
Steps:
```bash
# 1. Check the tenant's current resource usage against their quota
kubectl describe resourcequota -n tenant-acme

# 2. Upgrade the tenant with higher limits
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set resourceQuota.cpu.limits=8 \
  --set resourceQuota.memory.limits=16Gi \
  --set resourceQuota.storage=100Gi \
  --set postgresql.persistence.size=20Gi \
  --set resources.requests.cpu=500m \
  --set resources.limits.memory=1Gi

# 3. Verify the quota was updated
kubectl get resourcequota -n tenant-acme -o yaml
```

PVC expansion
Increasing postgresql.persistence.size only works if your StorageClass supports volume expansion (allowVolumeExpansion: true). If it does not, you need to back up and restore the database to a new PVC. Note that CNPG manages its own PVCs: rather than patching a PVC directly, update the storage size on the Cluster CR and let CNPG handle the rolling replacement.
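A sketch of the corresponding Cluster CR change, assuming the CNPG cluster is named ironflow-postgresql as elsewhere on this page (the patch is printed rather than applied, so this runs without cluster access):

```bash
#!/bin/sh
# Build the merge patch for CNPG's .spec.storage.size and print it.
NEW_SIZE="40Gi"
PATCH="{\"spec\":{\"storage\":{\"size\":\"$NEW_SIZE\"}}}"
echo "$PATCH"

# Apply it with (not executed in this sketch):
#   kubectl patch cluster ironflow-postgresql -n ironflow --type merge -p "$PATCH"
```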
To expand NATS JetStream storage for a tenant:
```bash
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set nats.config.jetstream.fileStore.pvc.size=10Gi
```
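As the number of per-tenant --set flags grows, it can be cleaner to keep a per-tenant values file. A sketch (the values-acme.yaml filename is a convention you would choose yourself; the key paths mirror the --set flags used above):

```yaml
# values-acme.yaml (hypothetical per-tenant overrides)
ingress:
  host: acme.ironflow.example.com
resourceQuota:
  cpu:
    limits: 8
  memory:
    limits: 16Gi
  storage: 100Gi
postgresql:
  persistence:
    size: 20Gi
nats:
  config:
    jetstream:
      fileStore:
        pvc:
          size: 10Gi
```

Pass it after the shared file so tenant values win: `helm upgrade acme ./deploy/helm/ironflow -n tenant-acme -f deploy/helm/ironflow/values-multi-tenant.yaml -f values-acme.yaml`.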