Maintenance Scenarios

Drain a node for maintenance

Trigger: Hetzner maintenance window, need to resize a VM, or kernel update on a node.

If you are running the Small template with a single worker node, draining that node means downtime for all Ironflow pods until you uncordon or add another node.

Steps:

Terminal window
# 1. Check what is running on the target node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide
Terminal window
# 2. If the node hosts the PostgreSQL primary, perform a planned failover first
# See the "PostgreSQL failover" scenario in the Scaling guide
Terminal window
# 3. Drain the node (evicts all non-DaemonSet pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# --ignore-daemonsets: DaemonSets (e.g., monitoring agents) cannot be evicted
# --delete-emptydir-data: allows eviction of pods using emptyDir volumes

The Large template includes a PodDisruptionBudget (minAvailable: 1), which ensures at least one Ironflow pod stays running during the drain. Kubernetes will not evict a pod if doing so would violate the PDB — the drain command blocks until it is safe to proceed.
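If a drain appears stuck, check how many voluntary disruptions the budget currently permits; the next eviction goes through only while at least one disruption is allowed.

Terminal window
# Inspect the PodDisruptionBudget during a drain
kubectl get pdb -n ironflow
# ALLOWED DISRUPTIONS must be >= 1 for the next eviction to proceed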

Terminal window
# 4. Wait for evicted pods to reschedule on other nodes
kubectl get pods -n ironflow -l app.kubernetes.io/component=server -o wide
# Pods should show Running on a different node
Terminal window
# 5. Perform your maintenance (resize VM, reboot, etc.)
Terminal window
# 6. Make the node schedulable again
kubectl uncordon <node-name>
Terminal window
# 7. Verify the node is Ready and accepting pods
kubectl get nodes
# STATUS should show "Ready" (not "Ready,SchedulingDisabled")

Note: Uncordoning does not move pods back to the node. Existing pods stay where they are. New pods (or pods from a future rollout) may be scheduled onto the uncordoned node.
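If you want the workload spread back across nodes, a rolling restart lets the scheduler place fresh pods on the uncordoned node:

Terminal window
# Optional: rebalance by restarting the deployment
kubectl rollout restart deployment/ironflow -n ironflow
kubectl rollout status deployment/ironflow -n ironflow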


Update Helm chart values without redeploying

Trigger: Need to change a configuration value (log level, replica count, resource limits) without rebuilding or changing the image.

Steps:

The Ironflow Helm chart includes checksum annotations on the deployment template. When a helm upgrade changes the ConfigMap or Secret content, pods automatically restart with the new configuration.
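This is the common Helm checksum-annotation pattern. A minimal sketch of how it typically appears in a deployment template (illustrative; the actual template and file names in the Ironflow chart may differ):

spec:
  template:
    metadata:
      annotations:
        # Any change to the rendered ConfigMap changes this hash, forcing a rollout
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}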

Terminal window
# General pattern using the Ironflow CLI:
ironflow deploy upgrade --template medium --name my-release --set <key>=<value>
# Or with Helm directly:
#   helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#     --reuse-values --set <key>=<value>
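
To confirm which values the release is currently running with (using the release name from the Helm example above):

Terminal window
# Show user-supplied values for the release
helm get values ironflow -n ironflow
# Include chart defaults as well
helm get values ironflow -n ironflow --all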

Common examples:

Terminal window
# Enable debug logging
ironflow deploy upgrade --template medium --name my-release --set ironflow.logLevel=debug
# Change replica count
ironflow deploy upgrade --template medium --name my-release --set replicaCount=5
# Increase memory limit
ironflow deploy upgrade --template medium --name my-release --set resources.limits.memory=1Gi
# Enable Prometheus metrics
ironflow deploy upgrade --template medium --name my-release \
  --set observability.metrics.enabled=true
# Multiple values at once
ironflow deploy upgrade --template medium --name my-release \
  --set ironflow.logLevel=info \
  --set replicaCount=3 \
  --set resources.requests.memory=512Mi
Terminal window
# Verify the rollout completed
kubectl rollout status deployment/ironflow -n ironflow
# Confirm the new config is in effect
kubectl get configmap ironflow-config -n ironflow -o yaml

For details on viewing the full ConfigMap and Secret contents, see the “Configuration” section in kubectl Operations.


Backup and restore Prometheus data

Trigger: Migrating monitoring to a new cluster, or the Prometheus PVC needs replacement.

Steps:

Terminal window
# 1. Identify the Prometheus PVC
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus
# Typically named: prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0

Option A: VolumeSnapshot (if supported by storage class)

Terminal window
# Check if the storage class supports snapshots
kubectl get volumesnapshotclass
# Create a snapshot
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: prometheus-backup
namespace: monitoring
spec:
volumeSnapshotClassName: <your-snapshot-class>
source:
persistentVolumeClaimName: <prometheus-pvc-name>
EOF
# Verify snapshot is ready
kubectl get volumesnapshot prometheus-backup -n monitoring
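
The snapshot is safe to use once its status reports readyToUse: true:

Terminal window
# Should print "true" when the snapshot is complete
kubectl get volumesnapshot prometheus-backup -n monitoring -o jsonpath='{.status.readyToUse}'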

To restore from a snapshot, create a new PVC referencing the snapshot:

Terminal window
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-restored
  namespace: monitoring
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: <your-storage-class>
  resources:
    requests:
      storage: 50Gi
  dataSource:
    name: prometheus-backup
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF
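
Check that the restored claim binds (with a WaitForFirstConsumer storage class it stays Pending until a pod mounts it):

Terminal window
kubectl get pvc prometheus-restored -n monitoring
# STATUS should reach Bound once the volume is provisioned and consumed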

Option B: Manual tar backup (works with any storage class)

Terminal window
# Scale down Prometheus to ensure consistent data
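# Note: the prometheus-operator may reconcile the StatefulSet back to 1 replica.
# If so, set replicas on the Prometheus custom resource instead (CR name assumed
# from a default kube-prometheus-stack install):
#   kubectl patch prometheus kube-prometheus-stack-prometheus -n monitoring \
#     --type merge -p '{"spec":{"replicas":0}}'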
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring --replicas=0
# Run a temporary pod that mounts the PVC (keep it running so you can copy data out)
kubectl run prom-backup --image=alpine -n monitoring \
  --overrides='{
    "spec": {
      "containers": [{
        "name": "prom-backup",
        "image": "alpine",
        "command": ["sleep", "3600"],
        "volumeMounts": [
          {"name": "prom-data", "mountPath": "/data"}
        ]
      }],
      "volumes": [
        {"name": "prom-data", "persistentVolumeClaim": {"claimName": "<prometheus-pvc-name>"}}
      ]
    }
  }'
# Wait for the pod to start
kubectl wait --for=condition=Ready pod/prom-backup -n monitoring --timeout=60s
# Stream the tar directly to your local machine (avoids using pod storage)
kubectl exec -n monitoring prom-backup -- tar cz -C /data . > prometheus-data.tar.gz
# Clean up the temporary pod and scale Prometheus back up
kubectl delete pod prom-backup -n monitoring
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring --replicas=1
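
Restoring from the tarball is the reverse operation: scale Prometheus down again, start the same temporary pod against the target PVC, and stream the archive back in. A sketch, reusing the prom-backup pod pattern above:

Terminal window
# With Prometheus scaled to 0 and the temporary pod mounted on the target PVC:
kubectl exec -i -n monitoring prom-backup -- tar xzf - -C /data < prometheus-data.tar.gz
# Then delete the pod and scale Prometheus back up as in the backup steps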

Prometheus retention is 15 days by default. For most cases, losing Prometheus data is acceptable since metrics regenerate from live scrape targets. Only back up Prometheus if you need historical data for a specific investigation or audit.


Clean up old PrometheusRule or dashboard versions

Trigger: Alert rules have been updated in the repository, but old PrometheusRule or ConfigMap resources are lingering in the cluster from a previous apply.

Steps:

Terminal window
# 1. List current PrometheusRules in the monitoring namespace
kubectl get prometheusrules -n monitoring
# 2. Re-apply alert rules and dashboards via Helm upgrade
# Alert rules: deploy/helm/ironflow/templates/ironflow-alerts.yaml
# Dashboards: deploy/helm/ironflow/dashboards/
helm upgrade ironflow deploy/helm/ironflow/ -n ironflow --reuse-values \
  --set monitoring.alerts.enabled=true
Terminal window
# 3. Grafana dashboards are managed by the Helm chart as ConfigMaps
# Re-deployed automatically with the helm upgrade above
Terminal window
# 4. Verify alert rules are loaded in Prometheus
# Port-forward Prometheus and check the /rules page
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/rules in your browser
# All rules should show "OK" health status
Terminal window
# 5. Verify dashboards are loaded in Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Open http://localhost:3000 and check that dashboards appear under the Ironflow folder
Terminal window
# 6. (Optional) Validate alert rule syntax before applying by rendering the chart and piping it to promtool
helm template ironflow deploy/helm/ironflow/ --show-only templates/ironflow-alerts.yaml \
  | promtool check rules /dev/stdin

Note: Alert rules are packaged with the Ironflow Helm chart. To update them, edit the template (deploy/helm/ironflow/templates/ironflow-alerts.yaml) and re-run helm upgrade. Do not kubectl apply individual rule files directly; the next chart upgrade will overwrite your changes.
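
If you have the helm-diff plugin installed, you can also preview exactly which rule and dashboard resources an upgrade will change before running it:

Terminal window
# Requires the helm-diff plugin
helm diff upgrade ironflow deploy/helm/ironflow/ -n ironflow --reuse-values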

Audit tenant network isolation (multi-tenant cluster)

Trigger: Periodic security audit or before onboarding a new tenant — verify that namespace isolation is working correctly.

Steps:

Terminal window
# 1. List all tenant namespaces
kubectl get ns | grep tenant-
# 2. For each tenant, verify the full set of NetworkPolicies exist
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl get networkpolicy -n "$ns" -o custom-columns=NAME:.metadata.name
  echo ""
done
# Each namespace should have 4 policies:
# *-default-deny, *-allow-dns, *-allow-intra-namespace, *-ironflow
# 3. Verify ResourceQuotas are enforced
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl describe resourcequota -n "$ns" | grep -E "Resource|requests|limits|pods"
  echo ""
done
# 4. Cross-tenant connectivity test (pick any two tenants)
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://globex-ironflow.tenant-globex:9123/health 2>&1 || echo "Blocked (expected)"
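
A blocked request only proves isolation if the target is actually up, so pair the test with a positive control from inside the target's own namespace:

Terminal window
# Positive control: the same endpoint should answer within its own namespace
kubectl exec -n tenant-globex deploy/globex-ironflow -- \
  wget -qO- --timeout=3 http://localhost:9123/health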

Bulk-upgrade all tenants (multi-tenant cluster)

Trigger: A new Ironflow version is released and all tenants need to be updated.

Steps:

Terminal window
# 1. List all tenant Helm releases
helm list --all-namespaces | grep tenant-
# 2. Upgrade each tenant (adjust the loop for your naming convention)
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  release=$(helm list -n "$ns" -q)
  if [ -n "$release" ]; then
    echo "Upgrading $release in $ns..."
    helm upgrade "$release" ./deploy/helm/ironflow \
      -n "$ns" \
      -f deploy/helm/ironflow/values-multi-tenant.yaml \
      --reuse-values \
      --set image.tag=v0.17.0
  fi
done
# 3. Watch all tenant pods roll out
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl rollout status deployment -n "$ns" --timeout=120s 2>/dev/null || echo " (no deployment or timeout)"
done
# 4. Spot-check health on a few tenants
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2 | head -3); do
  pod=$(kubectl get pods -n "$ns" -l app.kubernetes.io/component=server -o name | head -1)
  if [ -n "$pod" ]; then
    echo "$ns: $(kubectl exec -n "$ns" "$pod" -- wget -qO- http://localhost:9123/health 2>/dev/null)"
  fi
done
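
If one tenant's upgrade misbehaves, Helm can roll that release back without touching the others (hypothetical release acme in namespace tenant-acme, matching the examples in this guide):

Terminal window
# Roll back a single tenant to its previous revision
helm history acme -n tenant-acme
helm rollback acme -n tenant-acme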

Rolling upgrades

Upgrade tenants one at a time if you want to catch issues before they affect everyone. The loop above upgrades all tenants in sequence. For large deployments, consider adding a health check between upgrades and aborting on failure.
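
A sketch of that pattern, built from the loops above (same health endpoint and label selector as used earlier in this guide):

Terminal window
# Sketch: upgrade tenants one at a time, aborting on the first failure
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  release=$(helm list -n "$ns" -q)
  [ -z "$release" ] && continue
  helm upgrade "$release" ./deploy/helm/ironflow -n "$ns" \
    -f deploy/helm/ironflow/values-multi-tenant.yaml \
    --reuse-values --set image.tag=v0.17.0
  kubectl rollout status deployment -n "$ns" --timeout=120s \
    || { echo "Rollout failed in $ns, aborting"; exit 1; }
  pod=$(kubectl get pods -n "$ns" -l app.kubernetes.io/component=server -o name | head -1)
  kubectl exec -n "$ns" "$pod" -- wget -qO- http://localhost:9123/health >/dev/null \
    || { echo "Health check failed in $ns, aborting"; exit 1; }
done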

Rotate a tenant’s master key (multi-tenant cluster)

Trigger: A tenant’s master key may be compromised, or periodic key rotation is required.

Secrets re-encryption required

Changing the master key means existing encrypted secrets become unreadable. You must re-encrypt them with the new key.

Steps:

Terminal window
# 1. Generate a new master key
NEW_KEY=$(openssl rand -hex 32)
# 2. Export existing secrets while the old key is still active
kubectl exec -n tenant-acme deploy/acme-ironflow -- ironflow secret list
# Note all secret names, then export their values
# 3. Upgrade with the new master key
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --reuse-values \
  --set ironflow.masterKey="$NEW_KEY"
# 4. Wait for the rollout
kubectl rollout status deployment/acme-ironflow -n tenant-acme
# 5. Re-set all secrets with the new key
# (Ironflow encrypts with the new master key on write)
kubectl exec -n tenant-acme deploy/acme-ironflow -- ironflow secret set MY_SECRET "old-value"
# Repeat for each secret
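
If step 2's export was saved to a file, the re-set can be scripted. A sketch assuming a hypothetical secrets.env with one NAME=value pair per line (the file name and format are illustrative, not part of the Ironflow tooling):

Terminal window
# Re-encrypt every secret under the new master key (secrets.env is hypothetical)
while IFS='=' read -r name value; do
  kubectl exec -n tenant-acme deploy/acme-ironflow -- ironflow secret set "$name" "$value"
done < secrets.env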