Maintenance Scenarios
Drain a node for maintenance
Trigger: Hetzner maintenance window, need to resize a VM, or kernel update on a node.
If you are running the Small template with a single worker node, draining that node means downtime for all Ironflow pods until you uncordon or add another node.
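Before starting, confirm the evicted pods will have somewhere to go. A quick check (the count assumes the default kubectl STATUS column formatting):

kubectl get nodes
# Count nodes that are Ready and schedulable
kubectl get nodes --no-headers | grep -c ' Ready '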
Steps:
# 1. Check what is running on the target node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide

# 2. If the node hosts the PostgreSQL primary, perform a planned failover first
#    See the "PostgreSQL failover" scenario in the Scaling guide

# 3. Drain the node (evicts all non-DaemonSet pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# --ignore-daemonsets: DaemonSet pods (e.g., monitoring agents) cannot be evicted, so drain proceeds past them
# --delete-emptydir-data: allows eviction of pods using emptyDir volumes (their emptyDir contents are lost)

The Large template includes a PodDisruptionBudget (minAvailable: 1), which ensures at least one Ironflow pod stays running during the drain. Kubernetes will not evict a pod if doing so would violate the PDB; the drain command blocks until it is safe to proceed.
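During the drain you can watch the budget directly to see whether evictions are currently allowed; this assumes the chart-managed PDB lives in the ironflow namespace:

kubectl get pdb -n ironflow
# ALLOWED DISRUPTIONS shows how many Ironflow pods can be evicted right now;
# 0 means the drain waits until another replica is Ready.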
# 4. Wait for evicted pods to reschedule on other nodes
kubectl get pods -n ironflow -l app.kubernetes.io/component=server -o wide
# Pods should show Running on a different node

# 5. Perform your maintenance (resize VM, reboot, etc.)

# 6. Make the node schedulable again
kubectl uncordon <node-name>

# 7. Verify the node is Ready and accepting pods
kubectl get nodes
# STATUS should show "Ready" (not "Ready,SchedulingDisabled")

Note: Uncordoning does not move pods back to the node. Existing pods stay where they are. New pods (or pods from a future rollout) may be scheduled onto the uncordoned node.
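If you want Ironflow pods spread back across nodes once maintenance is done, an optional rolling restart lets the scheduler place fresh pods on the uncordoned node; this sketch reuses the deployment name from the rollout commands elsewhere in this guide:

kubectl rollout restart deployment/ironflow -n ironflow
kubectl rollout status deployment/ironflow -n ironflow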
Update Helm chart values without redeploying
Trigger: Need to change a configuration value (log level, replica count, resource limits) without rebuilding or changing the image.
Steps:
The Ironflow Helm chart includes checksum annotations on the deployment template. When a helm upgrade changes the ConfigMap or Secret content, pods automatically restart with the new configuration.
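To see the mechanism at work, you can inspect the pod-template annotations before and after an upgrade; the exact annotation key is chart-specific, so treat the output format as an assumption:

kubectl get deployment ironflow -n ironflow \
  -o jsonpath='{.spec.template.metadata.annotations}{"\n"}'
# The checksum value should change whenever the rendered ConfigMap or Secret
# content changes, which is what triggers the pod restart.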
# General pattern using the Ironflow CLI:
ironflow deploy upgrade --template medium --name my-release --set <key>=<value>
# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set <key>=<value>

Common examples:
# Enable debug logging
ironflow deploy upgrade --template medium --name my-release --set ironflow.logLevel=debug
# Change replica count
ironflow deploy upgrade --template medium --name my-release --set replicaCount=5
# Increase memory limit
ironflow deploy upgrade --template medium --name my-release --set resources.limits.memory=1Gi
# Enable Prometheus metrics
ironflow deploy upgrade --template medium --name my-release \
  --set observability.metrics.enabled=true
# Multiple values at once
ironflow deploy upgrade --template medium --name my-release \
  --set ironflow.logLevel=info \
  --set replicaCount=3 \
  --set resources.requests.memory=512Mi

# Verify the rollout completed
kubectl rollout status deployment/ironflow -n ironflow
# Confirm the new config is in effect
kubectl get configmap ironflow-config -n ironflow -o yaml

For details on viewing the full ConfigMap and Secret contents, see the “Configuration” section in kubectl Operations.
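If you are unsure which overrides are currently applied to the release, Helm can report them; this assumes the release name ironflow used in the Helm command above:

# Values you have explicitly set on this release
helm get values ironflow -n ironflow
# Include chart defaults as well
helm get values ironflow -n ironflow --all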
Backup and restore Prometheus data
Trigger: Migrating monitoring to a new cluster, or the Prometheus PVC needs replacement.
Steps:
# 1. Identify the Prometheus PVC
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus
# Typically named: prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0

Option A: VolumeSnapshot (if supported by storage class)
# Check if the storage class supports snapshots
kubectl get volumesnapshotclass
# Create a snapshot
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: prometheus-backup
  namespace: monitoring
spec:
  volumeSnapshotClassName: <your-snapshot-class>
  source:
    persistentVolumeClaimName: <prometheus-pvc-name>
EOF
# Verify snapshot is ready
kubectl get volumesnapshot prometheus-backup -n monitoring

To restore from a snapshot, create a new PVC referencing the snapshot:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-restored
  namespace: monitoring
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: <your-storage-class>
  resources:
    requests:
      storage: 50Gi
  dataSource:
    name: prometheus-backup
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF

Option B: Manual tar backup (works with any storage class)
# Scale down Prometheus to ensure consistent data
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring --replicas=0
# Run a temporary pod that mounts the PVC (keep it running so you can copy data out)
kubectl run prom-backup --image=alpine -n monitoring \
  --overrides='{
    "spec": {
      "containers": [{
        "name": "prom-backup",
        "image": "alpine",
        "command": ["sleep", "3600"],
        "volumeMounts": [
          {"name": "prom-data", "mountPath": "/data"}
        ]
      }],
      "volumes": [
        {"name": "prom-data", "persistentVolumeClaim": {"claimName": "<prometheus-pvc-name>"}}
      ]
    }
  }'
# Wait for the pod to start
kubectl wait --for=condition=Ready pod/prom-backup -n monitoring --timeout=60s
# Stream the tar directly to your local machine (avoids using pod storage)
kubectl exec -n monitoring prom-backup -- tar cz -C /data . > prometheus-data.tar.gz
# Clean up the temporary pod and scale Prometheus back up
kubectl delete pod prom-backup -n monitoring
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring --replicas=1

Prometheus retention is 15 days by default. For most cases, losing Prometheus data is acceptable since metrics regenerate from live scrape targets. Only back up Prometheus if you need historical data for a specific investigation or audit.
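If you later need to restore an Option B archive, the same temporary-pod approach works in reverse. A sketch, assuming a pod named prom-restore (created like prom-backup above) that mounts the target PVC at /data:

# Stream the archive from your local machine into the mounted PVC
kubectl exec -i -n monitoring prom-restore -- tar xz -C /data < prometheus-data.tar.gz
# Then delete the temporary pod and scale Prometheus back up as in the cleanup step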
Clean up old PrometheusRule or dashboard versions
Trigger: Alert rules have been updated in the repository, but old PrometheusRule or ConfigMap resources are lingering in the cluster from a previous apply.
Steps:
# 1. List current PrometheusRules in the monitoring namespace
kubectl get prometheusrules -n monitoring

# 2. Re-apply alert rules and dashboards via Helm upgrade
#    Dashboards: deploy/helm/ironflow/dashboards/
helm upgrade ironflow deploy/helm/ironflow/ -n ironflow --reuse-values \
  --set monitoring.alerts.enabled=true

# 3. Grafana dashboards are managed by the Helm chart as ConfigMaps
#    They are re-deployed automatically with the helm upgrade above

# 4. Verify alert rules are loaded in Prometheus
#    Port-forward Prometheus and check the /rules page
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/rules in your browser
# All rules should show "OK" health status

# 5. Verify dashboards are loaded in Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Open http://localhost:3000 and check that dashboards appear under the Ironflow folder

# 6. (Optional) Validate alert rule syntax before applying by piping the rendered template into promtool
helm template ironflow deploy/helm/ironflow/ --show-only templates/ironflow-alerts.yaml \
  | promtool check rules /dev/stdin

Note: Alert rules are packaged with the Ironflow Helm chart. To update them, edit the template (deploy/helm/ironflow/templates/ironflow-alerts.yaml) and re-run helm upgrade. Do not kubectl apply individual rule files directly; the next chart upgrade will overwrite them and the cluster will drift from the repository.
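To spot stale resources left behind by a direct kubectl apply, compare each rule's owning Helm release. The label shown is the standard Helm instance label (assumed to be set by the chart), and the rule name below is a placeholder:

kubectl get prometheusrules -n monitoring \
  -o custom-columns='NAME:.metadata.name,RELEASE:.metadata.labels.app\.kubernetes\.io/instance'
# Anything without a release label (or with an unexpected one) was likely
# applied by hand and can be removed once the chart-managed rule is in place
kubectl delete prometheusrule <stale-rule-name> -n monitoring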
Audit tenant network isolation (multi-tenant cluster)
Trigger: Periodic security audit or before onboarding a new tenant — verify that namespace isolation is working correctly.
Steps:
# 1. List all tenant namespaces
kubectl get ns | grep tenant-
# 2. For each tenant, verify the full set of NetworkPolicies exists
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl get networkpolicy -n "$ns" -o custom-columns=NAME:.metadata.name
  echo ""
done
# Each namespace should have 4 policies:
# *-default-deny, *-allow-dns, *-allow-intra-namespace, *-ironflow
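It is also worth opening one tenant's default-deny policy to confirm it really selects all pods and permits nothing; the policy and namespace names here follow the naming pattern above and are assumptions:

kubectl get networkpolicy acme-default-deny -n tenant-acme -o yaml
# Expect an empty spec.podSelector ({}) and policyTypes covering Ingress/Egress
# with no ingress or egress rules defined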
# 3. Verify ResourceQuotas are enforced
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl describe resourcequota -n "$ns" | grep -E "Resource|requests|limits|pods"
  echo ""
done
# 4. Cross-tenant connectivity test (pick any two tenants)
kubectl exec -n tenant-acme deploy/acme-ironflow -- \
  wget -qO- --timeout=3 http://globex-ironflow.tenant-globex:9123/health 2>&1 || echo "Blocked (expected)"

Bulk-upgrade all tenants (multi-tenant cluster)
Trigger: A new Ironflow version is released and all tenants need to be updated.
Steps:
# 1. List all tenant Helm releases
helm list --all-namespaces | grep tenant-
# 2. Upgrade each tenant (adjust the loop for your naming convention)
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  release=$(helm list -n "$ns" -q)
  if [ -n "$release" ]; then
    echo "Upgrading $release in $ns..."
    helm upgrade "$release" ./deploy/helm/ironflow \
      -n "$ns" \
      -f deploy/helm/ironflow/values-multi-tenant.yaml \
      --reuse-values \
      --set image.tag=v0.17.0
  fi
done
# 3. Watch all tenant pods roll out
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  echo "=== $ns ==="
  kubectl rollout status deployment -n "$ns" --timeout=120s 2>/dev/null || echo "  (no deployment or timeout)"
done
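Optionally, summarize where each tenant landed after the loop; this assumes one Helm release per tenant namespace, as above:

for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  release=$(helm list -n "$ns" -q)
  [ -n "$release" ] && echo "=== $ns ===" && helm status "$release" -n "$ns" | grep -E '^(STATUS|REVISION):'
done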
# 4. Spot-check health on a few tenants
for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2 | head -3); do
  pod=$(kubectl get pods -n "$ns" -l app.kubernetes.io/component=server -o name | head -1)
  if [ -n "$pod" ]; then
    echo "$ns: $(kubectl exec -n "$ns" "$pod" -- wget -qO- http://localhost:9123/health 2>/dev/null)"
  fi
done

Rolling upgrades
The loop above already upgrades tenants one at a time, in sequence. To catch problems before they reach every tenant, add a health check between upgrades and abort on failure, as sketched below; this matters most for large deployments.
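A sketch of that safer variant, reusing the health endpoint and labels from the spot-check above; adjust the abort behavior to your needs:

for ns in $(kubectl get ns -o name | grep tenant- | cut -d/ -f2); do
  release=$(helm list -n "$ns" -q)
  [ -z "$release" ] && continue
  echo "Upgrading $release in $ns..."
  helm upgrade "$release" ./deploy/helm/ironflow \
    -n "$ns" \
    -f deploy/helm/ironflow/values-multi-tenant.yaml \
    --reuse-values \
    --set image.tag=v0.17.0
  # Gate on the rollout and a health probe before touching the next tenant
  kubectl rollout status deployment -n "$ns" --timeout=180s \
    || { echo "Rollout failed in $ns, aborting"; break; }
  pod=$(kubectl get pods -n "$ns" -l app.kubernetes.io/component=server -o name | head -1)
  kubectl exec -n "$ns" "$pod" -- wget -qO- http://localhost:9123/health >/dev/null \
    || { echo "Health check failed in $ns, aborting"; break; }
done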
Rotate a tenant’s master key (multi-tenant cluster)
Trigger: A tenant’s master key may be compromised, or periodic key rotation is required.
Secrets re-encryption required
Changing the master key means existing encrypted secrets become unreadable. You must re-encrypt them with the new key.
Steps:
# 1. Generate a new master key
NEW_KEY=$(openssl rand -hex 32)
# 2. Export existing secrets while the old key is still active
kubectl exec -n tenant-acme deploy/acme-ironflow -- ironflow secret list
# Note all secret names, then export their values
# 3. Upgrade with the new master key
helm upgrade acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  --reuse-values \
  --set ironflow.masterKey=$NEW_KEY
# 4. Wait for the rollout
kubectl rollout status deployment/acme-ironflow -n tenant-acme
# 5. Re-set all secrets with the new key
#    (Ironflow encrypts with the new master key on write)
kubectl exec -n tenant-acme deploy/acme-ironflow -- ironflow secret set MY_SECRET "old-value"
# Repeat for each secret
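With many secrets, a small loop over the values you exported in step 2 saves repetition. The secrets.env file (one NAME=value per line) is hypothetical and stands in for however you captured the values:

while IFS='=' read -r name value; do
  kubectl exec -n tenant-acme deploy/acme-ironflow -- ironflow secret set "$name" "$value"
done < secrets.env
# Delete the plaintext export once rotation is verified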