# Disaster Recovery Scenarios

## Full cluster rebuild from scratch

**Trigger:** Cluster irrecoverably broken. Need to rebuild everything from the infrastructure up.

Steps:
```bash
# 1. Destroy the old cluster (if using Hetzner provisioning)
ironflow provision destroy --provider hetzner --name ironflow
# This removes all VMs, load balancers, and volumes. Data is gone.
```
```bash
# 2. Provision new infrastructure
ironflow provision create --provider hetzner --template medium --name ironflow
# ~5-8 minutes. Creates VMs, installs Talos Linux, configures networking.
# Kubeconfig saved to ~/.kube/clusters/hetzner-ironflow.yaml

export KUBECONFIG=~/.kube/clusters/hetzner-ironflow.yaml
```
```bash
# 3. Install the CNPG operator and Barman Cloud Plugin
kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.2.yaml
kubectl wait --for=condition=Available deployment/cnpg-controller-manager \
  -n cnpg-system --timeout=120s
```
```bash
# Install the Barman Cloud Plugin (required for object store backups)
kubectl apply -f \
  https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.11.0/manifest.yaml
```
```bash
# 4. Deploy Ironflow with your template
ironflow deploy --template medium --name my-release
# Or: ironflow deploy --template small --name dev
# Or: ironflow deploy --template large --name prod \
#       --set externalDatabase.url=postgres://... \
#       --set externalNats.url=nats://...
```
```bash
# 5. Verify the deployment
ironflow deploy status --name my-release
kubectl get pods -n ironflow
```

Restoring data from a PostgreSQL backup (if available):
```bash
# 6. If you have a CNPG backup (VolumeSnapshot or object store), restore it
#    before Ironflow starts writing new data.
#    See "Restore PostgreSQL from CNPG backup" below for detailed steps.
```
```bash
# 7. After the restore, restart Ironflow to pick up the restored data
kubectl rollout restart deployment/ironflow -n ironflow
```

What recovers automatically vs. what needs restoring:
| Component | Recovery |
|---|---|
| NATS JetStream streams | Auto-created by Ironflow on startup. No restore needed. |
| NATS KV buckets (config, secrets, cron) | Auto-created on first use. Secrets and config values are lost unless backed up. |
| PostgreSQL schema | Auto-migrated by Ironflow on startup. |
| PostgreSQL data (runs, events, functions, projections) | Requires backup restore. Without a backup, historical data is lost. |
| Ironflow master encryption key | Must be recreated as a Kubernetes secret before deploying. |
| S3 backup credentials | Must be recreated as a Kubernetes secret before deploying. |
Recreate secrets before deploying:
```bash
kubectl create namespace ironflow

# Master encryption key (required for secrets encryption)
kubectl create secret generic ironflow-master-key -n ironflow \
  --from-literal=master-key="$(openssl rand -hex 32)"
# If you saved the original master key, use that value instead — otherwise
# previously encrypted secrets cannot be decrypted.
```
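If you are generating a fresh key, keep an offline copy at the same moment you create it. A minimal sketch (the file path is illustrative; move the value into a password manager or external secret store afterwards):

```shell
#!/bin/sh
# Generate the master key once, keep an offline copy, and reuse the same
# value for the Kubernetes secret. openssl rand -hex 32 yields 64 hex chars.
MASTER_KEY=$(openssl rand -hex 32)
umask 077
printf '%s\n' "$MASTER_KEY" > ./ironflow-master-key.txt
echo "saved key of length ${#MASTER_KEY} to ./ironflow-master-key.txt"
```

Then pass the saved value via `--from-literal=master-key="..."` instead of generating a new one on each rebuild.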
```bash
# S3 backup credentials (required for PostgreSQL backups)
kubectl create secret generic ironflow-s3-creds -n ironflow \
  --from-literal=ACCESS_KEY_ID="your-s3-access-key" \
  --from-literal=SECRET_ACCESS_KEY="your-s3-secret-key"
```

## Restore PostgreSQL from CNPG backup
**Trigger:** Data loss or corruption. Need to restore from a CNPG backup.
**Backup configuration required.** Small and Medium templates include daily S3 backups by default (`method: plugin`). If you disabled backups or use custom values without backup configuration, there is no backup to restore from. See "Configuring CNPG backups for production" below for how to configure backups.
The recommended approach uses Helm with a recovery values override file. This preserves all your Cluster CR settings (resources, pooler, monitoring, parameters) during recovery.
Steps:
```bash
# 1. Check that the ObjectStore and backups exist
kubectl get objectstores.barmancloud.cnpg.io -n ironflow
kubectl get backups -n ironflow
```
```bash
# 2. Uninstall the Helm release (keeps namespace and secrets)
RELEASE_NAME=dev  # your Helm release name
helm uninstall $RELEASE_NAME -n ironflow
```
```bash
# 3. Wait for pods to terminate, then delete leftover PVCs
kubectl get pods -n ironflow -w
kubectl delete pvc -n ironflow -l cnpg.io/cluster=$RELEASE_NAME-ironflow-postgresql
```
```bash
# 4. Create a recovery values override
cat <<EOF > /tmp/recovery.yaml
postgresql:
  recovery:
    source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: ${RELEASE_NAME}-ironflow-objectstore
          serverName: ${RELEASE_NAME}-ironflow-postgresql
EOF
```
```bash
# 5. Reinstall with recovery
helm install $RELEASE_NAME ./deploy/helm/ironflow \
  -n ironflow \
  -f deploy/helm/ironflow/values-medium.yaml \
  -f /tmp/recovery.yaml
```
```bash
# 6. Wait for recovery to complete
kubectl get cluster -n ironflow -w
# Wait for STATUS = "Cluster in healthy state"
```
```bash
# 7. Verify and clean up
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20
rm /tmp/recovery.yaml
```

**Point-in-time recovery (PITR):** Add `recoveryTarget` to restore to a specific timestamp:
```yaml
postgresql:
  recovery:
    source: origin
    recoveryTarget:
      targetTime: "2026-04-05T10:00:00Z"
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: dev-ironflow-objectstore
          serverName: dev-ironflow-postgresql
```

Restoring from a VolumeSnapshot instead of the object store:

```bash
# 1. Check if VolumeSnapshots exist
kubectl get volumesnapshots -n ironflow -l cnpg.io/cluster=dev-ironflow-postgresql
```
```bash
# 2. Uninstall the Helm release
RELEASE_NAME=dev
helm uninstall $RELEASE_NAME -n ironflow
kubectl get pods -n ironflow -w
kubectl delete pvc -n ironflow -l cnpg.io/cluster=$RELEASE_NAME-ironflow-postgresql
```
```bash
# 3. Create a recovery values override
SNAPSHOT_NAME=<snapshot-name>  # from step 1
cat <<EOF > /tmp/recovery.yaml
postgresql:
  recovery:
    volumeSnapshots:
      storage:
        name: $SNAPSHOT_NAME
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
EOF
```
```bash
# 4. Reinstall with recovery
helm install $RELEASE_NAME ./deploy/helm/ironflow \
  -n ironflow \
  -f deploy/helm/ironflow/values-medium.yaml \
  -f /tmp/recovery.yaml
```
```bash
# 5. Wait for recovery and verify
kubectl get cluster -n ironflow -w
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20
rm /tmp/recovery.yaml
```

## Configuring CNPG backups for production (do this before you need a restore)
Small and Medium templates include daily S3 backups by default. The S3 destination path is auto-derived from the Helm release name (s3://ironflow-backups/<release-name>). You only need to create the S3 credentials secret and set the endpoint URL.
For VolumeSnapshot-based backups instead of S3:
```yaml
# In your values file:
postgresql:
  scheduledBackups:
    enabled: true
    schedule: "0 0 2 * * *"  # Daily at 2 AM (6-field cron with seconds)
    method: volumeSnapshot
    volumeSnapshotClassName: "hcloud-volumes"  # Your storage provider's class
  snapshotRetention:
    enabled: true
    days: 7
```

For S3 object store backups with continuous WAL archiving (enables PITR):
```yaml
postgresql:
  objectStore:
    enabled: true
    # destinationPath auto-derived: s3://ironflow-backups/<release-name>
    # Override: --set postgresql.objectStore.destinationPath=s3://my-bucket/custom-path
    endpointURL: "https://fsn1.your-objectstorage.com"
    retentionPolicy: "7d"
    s3Credentials:
      accessKeyId:
        name: ironflow-s3-creds
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: ironflow-s3-creds
        key: SECRET_ACCESS_KEY
```

Verify backups are running:
```bash
kubectl get scheduledbackups -n <namespace>
kubectl get backups -n <namespace> --sort-by=.metadata.creationTimestamp
```

### Backup retention
All Helm value templates (default, small, medium, multi-tenant) use a 7-day retention policy for S3 object store backups. CNPG’s Barman Cloud Plugin automatically prunes base backups and WAL segments older than the retention window.
Combined with the default daily ScheduledBackup (runs at 2 AM UTC), this means approximately 7 daily backups exist in your S3 bucket at any point. PITR is available to any point within the 7-day window thanks to continuous WAL archiving.
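Before attempting a PITR, it can help to sanity-check that the candidate `targetTime` actually falls inside the retention window. A sketch (assumes GNU date, as on Linux; `RETENTION_DAYS` should match your `retentionPolicy`, and the computed `TARGET` stands in for your real timestamp):

```shell
#!/bin/sh
# Check that a PITR target timestamp lies within the backup retention window:
# not older than RETENTION_DAYS, and not in the future.
RETENTION_DAYS=7
TARGET=$(date -u -d "2 days ago" +%Y-%m-%dT%H:%M:%SZ)  # replace with your targetTime

target_epoch=$(date -u -d "$TARGET" +%s)
cutoff_epoch=$(date -u -d "$RETENTION_DAYS days ago" +%s)
now_epoch=$(date -u +%s)

if [ "$target_epoch" -ge "$cutoff_epoch" ] && [ "$target_epoch" -le "$now_epoch" ]; then
  echo "OK: $TARGET is inside the ${RETENTION_DAYS}-day PITR window"
else
  echo "ERROR: $TARGET is outside the PITR window" >&2
  exit 1
fi
```

A target outside the window cannot be reached because the needed base backup and WAL segments have already been pruned.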
To change the retention period, override postgresql.objectStore.retentionPolicy in your values file or at install/upgrade time:
```bash
# Keep 30 days of backups
helm upgrade <release> deploy/helm/ironflow \
  -n ironflow \
  --reuse-values \
  --set postgresql.objectStore.retentionPolicy="30d"
```

Or in a values file:
```yaml
postgresql:
  objectStore:
    retentionPolicy: "30d"
```

For VolumeSnapshot-based backups, retention is controlled separately by the cleanup CronJob:
```yaml
postgresql:
  snapshotRetention:
    enabled: true
    days: 7                # Delete snapshots older than 7 days
    schedule: "0 3 * * *"  # Run cleanup daily at 3 AM
```

### Choosing a retention period
Longer retention increases S3 storage costs but gives you a wider PITR window. For demo clusters, 7 days is sufficient. For production, consider 30 days or longer depending on compliance requirements.
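When adjusting schedules alongside retention, note that the two cron formats above differ: CNPG's ScheduledBackup takes a 6-field cron (leading seconds field), while the snapshot-cleanup CronJob takes a standard 5-field Kubernetes cron. A quick field-count check (hypothetical helper) catches the common copy-paste mistake:

```shell
#!/bin/sh
# Count whitespace-separated cron fields. CNPG ScheduledBackup expects 6
# (with seconds); a Kubernetes CronJob expects 5.
cron_fields() { printf '%s\n' "$1" | wc -w; }

[ "$(cron_fields '0 0 2 * * *')" -eq 6 ] && echo "ok for ScheduledBackup"
[ "$(cron_fields '0 3 * * *')" -eq 5 ] && echo "ok for CronJob"
```

Pasting a 5-field schedule into a ScheduledBackup (or vice versa) is rejected by the controller, so it is worth checking before applying.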
## Recover from accidental namespace deletion
**Trigger:** Someone ran `kubectl delete namespace ironflow`. All Ironflow resources in that namespace are gone.
Steps:
```bash
# 1. Recreate the namespace
kubectl create namespace ironflow
```
```bash
# 2. Recreate secrets that were in the namespace
# Image pull secret (if using private registry)
kubectl create secret docker-registry ghcr-pull-secret \
  --namespace ironflow \
  --docker-server=ghcr.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_TOKEN
```
```bash
# Master encryption key (for secrets management)
# Use your saved key — if lost, previously encrypted secrets are unrecoverable
kubectl create secret generic ironflow-master-key -n ironflow \
  --from-literal=master-key="YOUR_SAVED_MASTER_KEY"
```
```bash
# 3. Redeploy Ironflow
ironflow deploy --template medium --name my-release
# Or with Helm: helm install ironflow deploy/helm/ironflow/ -n ironflow -f your-values.yaml
```
```bash
# 4. Verify pods are starting
ironflow deploy status --name my-release
kubectl get pods -n ironflow -w
```
```bash
# 5. Restore PostgreSQL data from backup (if available)
#    See "Restore PostgreSQL from CNPG backup" above for detailed steps.
#    If using bundled PostgreSQL (Small/Medium), it was in the ironflow namespace
#    and is now gone — you need a backup to recover data.
```
```bash
# 6. Verify NATS streams are recreated
# Ironflow auto-creates JetStream streams on startup — check logs:
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=30 | grep -i "stream\|jetstream\|nats"
```

If ironflow-system was also deleted (Large template or Hetzner bootstrap):
```bash
# Infrastructure namespace is gone — NATS and PostgreSQL need reinstalling

# 1. Recreate the infrastructure namespace
kubectl create namespace ironflow-system
```
```bash
# 2. Reinstall NATS
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm install nats nats/nats -n ironflow-system -f your-nats-values.yaml
```
```bash
# 3. Reinstall the CNPG operator (if it was in cnpg-system, it should still be there)
kubectl get deployment cnpg-controller-manager -n cnpg-system
# If missing:
kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.2.yaml
# Also install the Barman Cloud Plugin if missing:
kubectl apply -f \
  https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.11.0/manifest.yaml
```
```bash
# 4. Recreate the PostgreSQL cluster
kubectl apply -f your-cnpg-cluster.yaml -n ironflow-system
kubectl get cluster -n ironflow-system -w
# Wait for "Cluster in healthy state"
```
```bash
# 5. Then follow steps 1-6 above to recreate the ironflow namespace and deploy
```

What is recoverable vs. lost:
| Scenario | PostgreSQL data | NATS streams | Secrets/Config |
|---|---|---|---|
| ironflow namespace deleted, ironflow-system intact | Intact (PG is in ironflow-system) | Intact (NATS is in ironflow-system) | KV values intact |
| ironflow namespace deleted, bundled (Small/Medium) | Lost without backup | Lost (recreated empty on startup) | Lost (recreated empty) |
| Both namespaces deleted | Lost without backup | Lost (recreated empty on startup) | Lost (recreated empty) |
**Prevention:** Use RBAC to restrict `kubectl delete namespace` permissions. Consider namespace finalizers or admission webhooks to block accidental deletion in production.
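As a concrete example of the admission-control approach, a ValidatingAdmissionPolicy can reject deletion of the Ironflow namespaces outright. This is a sketch, not a tested policy: it assumes Kubernetes v1.30+ (where ValidatingAdmissionPolicy is GA), and the policy and binding names are illustrative.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: block-ironflow-namespace-deletion
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["namespaces"]
  validations:
    # On DELETE requests the object being removed is exposed as oldObject.
    - expression: "!(oldObject.metadata.name in ['ironflow', 'ironflow-system'])"
      message: "Deletion of Ironflow namespaces is blocked by policy."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: block-ironflow-namespace-deletion
spec:
  policyName: block-ironflow-namespace-deletion
  validationActions: ["Deny"]
```

Pair this with RBAC so that only cluster admins can modify or delete the policy itself.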
## Restore a single tenant from backup (multi-tenant cluster)
**Trigger:** One tenant’s data is lost or corrupted and needs to be restored from a CNPG backup without affecting other tenants.
Two recovery methods are available:
**Method 1: Restore from the object store.** Use this method when backups are stored in S3 via the Barman Cloud Plugin (the default for multi-tenant deployments). Object store backups survive cluster deletion, so you can restore even after helm uninstall.
```bash
# 1. Identify the tenant's CNPG cluster and ObjectStore
kubectl get cluster -n tenant-acme
kubectl get objectstores.barmancloud.cnpg.io -n tenant-acme
```
```bash
# 2. Uninstall the tenant's Helm release (keeps the namespace and secrets)
helm uninstall acme -n tenant-acme
```
```bash
# 3. Wait for all pods to terminate
kubectl get pods -n tenant-acme -w
```
```bash
# 4. Delete any leftover PVCs from the old CNPG cluster
kubectl delete pvc -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
```
```bash
# 5. Reinstall with recovery from the object store
# Create a temporary values override for plugin-based recovery.
# The serverName must match the original CNPG Cluster name.
cat <<'EOF' > /tmp/tenant-acme-recovery.yaml
postgresql:
  recovery:
    source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: acme-ironflow-objectstore
          serverName: acme-ironflow-postgresql
EOF
```
```bash
helm install acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  -f /tmp/tenant-acme-recovery.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set ironflow.masterKey=$ACME_MASTER_KEY  # Use the same master key as before
```
```bash
# 6. Wait for recovery to complete
kubectl get cluster -n tenant-acme -w
# Wait for STATUS = "Cluster in healthy state"
```
```bash
# 7. Verify tenant health
kubectl exec -n tenant-acme deploy/acme-ironflow -- wget -qO- http://localhost:9123/health
```
```bash
# 8. Clean up the temporary file
rm /tmp/tenant-acme-recovery.yaml
```

**Point-in-time recovery (PITR):** To restore to a specific timestamp (e.g., just before data corruption), add a `recoveryTarget` to the recovery values:
```yaml
postgresql:
  recovery:
    source: origin
    recoveryTarget:
      targetTime: "2026-04-05T10:00:00Z"  # Restore to this point in time
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: acme-ironflow-objectstore
          serverName: acme-ironflow-postgresql
```

**Method 2: Restore from a VolumeSnapshot.** Use this method when backups are VolumeSnapshots (available if scheduledBackups.method=volumeSnapshot).
```bash
# 1. Check available VolumeSnapshots
kubectl get volumesnapshots -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
```
```bash
# 2. Uninstall the tenant's Helm release
helm uninstall acme -n tenant-acme
kubectl get pods -n tenant-acme -w
```
```bash
# 3. Delete leftover PVCs
kubectl delete pvc -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
```
```bash
# 4. Reinstall with recovery from VolumeSnapshot
cat <<'EOF' > /tmp/tenant-acme-recovery.yaml
postgresql:
  recovery:
    volumeSnapshots:
      storage:
        name: acme-ironflow-postgresql-20260405020000  # Replace with actual snapshot name
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
EOF
```
```bash
helm install acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  -f /tmp/tenant-acme-recovery.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set ironflow.masterKey=$ACME_MASTER_KEY
```
```bash
# 5. Wait for recovery and verify
kubectl get cluster -n tenant-acme -w
kubectl exec -n tenant-acme deploy/acme-ironflow -- wget -qO- http://localhost:9123/health
```
```bash
# 6. Clean up
rm /tmp/tenant-acme-recovery.yaml
```

Other tenants are completely unaffected — each tenant’s CNPG cluster and backups are scoped to their own namespace.
**Master key.** You must use the same master key the tenant was originally deployed with. If you lost it, any encrypted secrets in the database will be unrecoverable. Store master keys securely outside the cluster (e.g., in a password manager or external secret store).
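One way to keep an offline copy safely is to encrypt it with a passphrase before it leaves your machine. A sketch using openssl (the key and passphrase values here are illustrative; in practice, prompt for the passphrase rather than hard-coding it):

```shell
#!/bin/sh
# Encrypt a master-key backup with a passphrase before storing it anywhere,
# then decrypt it on demand. AES-256-CBC with PBKDF2 key derivation.
MASTER_KEY="f3a1c0ffee00112233445566778899aa"  # illustrative; use the real key
PASSPHRASE="use-a-real-passphrase-here"        # illustrative; prompt in practice

printf '%s' "$MASTER_KEY" | openssl enc -aes-256-cbc -pbkdf2 \
  -pass pass:"$PASSPHRASE" -out master-key.enc

# Decrypt when you need the key back (prints it to stdout):
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$PASSPHRASE" -in master-key.enc
echo
```

Only master-key.enc needs to be stored; the passphrase goes in your password manager.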
**Finding the right names.** The barmanObjectName and serverName in the recovery values must match the original deployment. For a Helm release named acme with default chart settings:
- barmanObjectName: acme-ironflow-objectstore
- serverName: acme-ironflow-postgresql
You can verify these by checking: `helm get values acme -n tenant-acme`
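The naming convention can be captured in a small helper (hypothetical; assumes the chart's default `<release>-ironflow-*` naming with no fullnameOverride):

```shell
#!/bin/sh
# Print the CNPG recovery identifiers for a Helm release, following the
# chart's default <release>-ironflow-<component> naming convention.
derive_recovery_names() {
  release="$1"
  echo "barmanObjectName: ${release}-ironflow-objectstore"
  echo "serverName: ${release}-ironflow-postgresql"
}

derive_recovery_names acme
# barmanObjectName: acme-ironflow-objectstore
# serverName: acme-ironflow-postgresql
```

Paste the two printed lines directly into the recovery values override for the tenant.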