# Disaster Recovery Scenarios

## Full cluster rebuild from scratch

**Trigger:** Cluster irrecoverably broken. Need to rebuild everything from the infrastructure up.

Steps:
```bash
# 1. Destroy the old cluster (if using Hetzner provisioning)
ironflow provision destroy --provider hetzner --name ironflow
# This removes all VMs, load balancers, and volumes. Data is gone.
```
```bash
# 2. Provision new infrastructure
ironflow provision create --provider hetzner --template medium --name ironflow
# ~5-8 minutes. Creates VMs, installs Talos Linux, configures networking.
# Kubeconfig saved to ~/.kube/clusters/hetzner-ironflow.yaml

export KUBECONFIG=~/.kube/clusters/hetzner-ironflow.yaml
```
```bash
# 3. Install the CNPG operator and Barman Cloud Plugin
kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.2.yaml
kubectl wait --for=condition=Available deployment/cnpg-controller-manager \
  -n cnpg-system --timeout=120s
```
```bash
# Install the Barman Cloud Plugin (required for object store backups)
kubectl apply -f \
  https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.11.0/manifest.yaml
```
```bash
# 4. Deploy Ironflow with your template
ironflow deploy --template medium --name my-release
# Or: ironflow deploy --template small --name dev
# Or: ironflow deploy --template large --name prod \
#       --set externalDatabase.url=postgres://... \
#       --set externalNats.url=nats://...
```
```bash
# 5. Verify the deployment
ironflow deploy status --name my-release
kubectl get pods -n ironflow
```

Restoring data from a PostgreSQL backup (if available):
```bash
# 6. If you have a CNPG backup (VolumeSnapshot or object store), restore it
#    before Ironflow starts writing new data.
#    See "Restore PostgreSQL from CNPG backup" below for detailed steps.
```
```bash
# 7. After the restore, restart Ironflow to pick up the restored data
kubectl rollout restart deployment/ironflow -n ironflow
```

What recovers automatically vs. what needs restoring:
| Component | Recovery |
|---|---|
| NATS JetStream streams | Auto-created by Ironflow on startup. No restore needed. |
| NATS KV buckets (config, secrets, cron) | Auto-created on first use. Secrets and config values are lost unless backed up. |
| PostgreSQL schema | Auto-migrated by Ironflow on startup. |
| PostgreSQL data (runs, events, functions, projections) | Requires backup restore. Without a backup, historical data is lost. |
| Ironflow master encryption key | Must be recreated as a Kubernetes secret before deploying. |
| S3 backup credentials | Must be recreated as a Kubernetes secret before deploying. |
Recreate secrets before deploying:
```bash
kubectl create namespace ironflow

# Master encryption key (required for secrets encryption)
kubectl create secret generic ironflow-master-key -n ironflow \
  --from-literal=master-key="$(openssl rand -hex 32)"
# If you saved the original master key, use that value instead — otherwise
# previously encrypted secrets cannot be decrypted.
```
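If you are generating a fresh key, keep an offline copy at the same moment you create it. A minimal sketch (the file path is illustrative; move the value into a password manager or external secret store afterwards):

```shell
#!/bin/sh
# Generate the master key once, keep an offline copy, and reuse the same
# value for the Kubernetes secret. openssl rand -hex 32 yields 64 hex chars.
MASTER_KEY=$(openssl rand -hex 32)
umask 077
printf '%s\n' "$MASTER_KEY" > ./ironflow-master-key.txt
echo "saved key of length ${#MASTER_KEY} to ./ironflow-master-key.txt"
```

Then pass the saved value via `--from-literal=master-key="..."` instead of generating a new one on each rebuild.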
```bash
# S3 backup credentials (required for PostgreSQL backups)
kubectl create secret generic ironflow-s3-creds -n ironflow \
  --from-literal=ACCESS_KEY_ID="your-s3-access-key" \
  --from-literal=SECRET_ACCESS_KEY="your-s3-secret-key"
```

## Restore PostgreSQL from CNPG backup
**Trigger:** Data loss or corruption. Need to restore from a CNPG backup.
**Backup configuration required.** Small and Medium templates include daily S3 backups by default (`method: plugin`). If you disabled backups or use custom values without backup configuration, there is no backup to restore from. See "Configuring CNPG backups for production" below for how to configure backups.
The recommended approach uses Helm with a recovery values override file. This preserves all your Cluster CR settings (resources, pooler, monitoring, parameters) during recovery.
Steps:
```bash
# 1. Check that the ObjectStore and backups exist
kubectl get objectstores.barmancloud.cnpg.io -n ironflow
kubectl get backups -n ironflow
```
```bash
# 2. Uninstall the Helm release (keeps namespace and secrets)
RELEASE_NAME=dev  # your Helm release name
helm uninstall $RELEASE_NAME -n ironflow
```
```bash
# 3. Wait for pods to terminate, then delete leftover PVCs
kubectl get pods -n ironflow -w
kubectl delete pvc -n ironflow -l cnpg.io/cluster=$RELEASE_NAME-ironflow-postgresql
```
```bash
# 4. Create a recovery values override
cat <<EOF > /tmp/recovery.yaml
postgresql:
  recovery:
    source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: ${RELEASE_NAME}-ironflow-objectstore
          serverName: ${RELEASE_NAME}-ironflow-postgresql
EOF
```
```bash
# 5. Reinstall with recovery
helm install $RELEASE_NAME ./deploy/helm/ironflow \
  -n ironflow \
  -f deploy/helm/ironflow/values-medium.yaml \
  -f /tmp/recovery.yaml
```
```bash
# 6. Wait for recovery to complete
kubectl get cluster -n ironflow -w
# Wait for STATUS = "Cluster in healthy state"
```
```bash
# 7. Verify and clean up
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20
rm /tmp/recovery.yaml
```

**Point-in-time recovery (PITR):** Add `recoveryTarget` to restore to a specific timestamp:
```yaml
postgresql:
  recovery:
    source: origin
    recoveryTarget:
      targetTime: "2026-04-05T10:00:00Z"
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: dev-ironflow-objectstore
          serverName: dev-ironflow-postgresql
```

Restoring from a VolumeSnapshot instead of the object store:

```bash
# 1. Check if VolumeSnapshots exist
kubectl get volumesnapshots -n ironflow -l cnpg.io/cluster=dev-ironflow-postgresql
```
```bash
# 2. Uninstall the Helm release
RELEASE_NAME=dev
helm uninstall $RELEASE_NAME -n ironflow
kubectl get pods -n ironflow -w
kubectl delete pvc -n ironflow -l cnpg.io/cluster=$RELEASE_NAME-ironflow-postgresql
```
```bash
# 3. Create a recovery values override
SNAPSHOT_NAME=<snapshot-name>  # from step 1
cat <<EOF > /tmp/recovery.yaml
postgresql:
  recovery:
    volumeSnapshots:
      storage:
        name: $SNAPSHOT_NAME
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
EOF
```
```bash
# 4. Reinstall with recovery
helm install $RELEASE_NAME ./deploy/helm/ironflow \
  -n ironflow \
  -f deploy/helm/ironflow/values-medium.yaml \
  -f /tmp/recovery.yaml
```
```bash
# 5. Wait for recovery and verify
kubectl get cluster -n ironflow -w
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20
rm /tmp/recovery.yaml
```

## Configuring CNPG backups for production (do this before you need a restore)
Small and Medium templates include daily S3 backups by default. The S3 destination path is auto-derived from the Helm release name (s3://ironflow-backups/<release-name>). You only need to create the S3 credentials secret and set the endpoint URL.
For VolumeSnapshot-based backups instead of S3:
```yaml
# In your values file:
postgresql:
  scheduledBackups:
    enabled: true
    schedule: "0 0 2 * * *"  # Daily at 2 AM (6-field cron with seconds)
    method: volumeSnapshot
    volumeSnapshotClassName: "hcloud-volumes"  # Your storage provider's class
  snapshotRetention:
    enabled: true
    days: 7
```

For S3 object store backups with continuous WAL archiving (enables PITR):
```yaml
postgresql:
  objectStore:
    enabled: true
    # destinationPath auto-derived: s3://ironflow-backups/<release-name>
    # Override: --set postgresql.objectStore.destinationPath=s3://my-bucket/custom-path
    endpointURL: "https://fsn1.your-objectstorage.com"
    retentionPolicy: "7d"
    s3Credentials:
      accessKeyId:
        name: ironflow-s3-creds
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: ironflow-s3-creds
        key: SECRET_ACCESS_KEY
```

Verify backups are running:
```bash
kubectl get scheduledbackups -n <namespace>
kubectl get backups -n <namespace> --sort-by=.metadata.creationTimestamp
```

### Backup retention
All Helm value templates (default, small, medium, multi-tenant) use a 7-day retention policy for S3 object store backups. CNPG’s Barman Cloud Plugin automatically prunes base backups and WAL segments older than the retention window.
Combined with the default daily ScheduledBackup (runs at 2 AM UTC), this means approximately 7 daily backups exist in your S3 bucket at any point. PITR is available to any point within the 7-day window thanks to continuous WAL archiving.
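Before attempting a PITR, it can help to sanity-check that the candidate `targetTime` actually falls inside the retention window. A sketch (assumes GNU date, as on Linux; `RETENTION_DAYS` should match your `retentionPolicy`, and the computed `TARGET` stands in for your real timestamp):

```shell
#!/bin/sh
# Check that a PITR target timestamp lies within the backup retention window:
# not older than RETENTION_DAYS, and not in the future.
RETENTION_DAYS=7
TARGET=$(date -u -d "2 days ago" +%Y-%m-%dT%H:%M:%SZ)  # replace with your targetTime

target_epoch=$(date -u -d "$TARGET" +%s)
cutoff_epoch=$(date -u -d "$RETENTION_DAYS days ago" +%s)
now_epoch=$(date -u +%s)

if [ "$target_epoch" -ge "$cutoff_epoch" ] && [ "$target_epoch" -le "$now_epoch" ]; then
  echo "OK: $TARGET is inside the ${RETENTION_DAYS}-day PITR window"
else
  echo "ERROR: $TARGET is outside the PITR window" >&2
  exit 1
fi
```

A target outside the window cannot be reached because the needed base backup and WAL segments have already been pruned.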
To change the retention period, override postgresql.objectStore.retentionPolicy in your values file or at install/upgrade time:
```bash
# Keep 30 days of backups
helm upgrade <release> deploy/helm/ironflow \
  -n ironflow \
  --reuse-values \
  --set postgresql.objectStore.retentionPolicy="30d"
```

Or in a values file:
```yaml
postgresql:
  objectStore:
    retentionPolicy: "30d"
```

For VolumeSnapshot-based backups, retention is controlled separately by the cleanup CronJob:
```yaml
postgresql:
  snapshotRetention:
    enabled: true
    days: 7                # Delete snapshots older than 7 days
    schedule: "0 3 * * *"  # Run cleanup daily at 3 AM
```

### Choosing a retention period
Longer retention increases S3 storage costs but gives you a wider PITR window. For demo clusters, 7 days is sufficient. For production, consider 30 days or longer depending on compliance requirements.
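When adjusting schedules alongside retention, note that the two cron formats above differ: CNPG's ScheduledBackup takes a 6-field cron (leading seconds field), while the snapshot-cleanup CronJob takes a standard 5-field Kubernetes cron. A quick field-count check (hypothetical helper) catches the common copy-paste mistake:

```shell
#!/bin/sh
# Count whitespace-separated cron fields. CNPG ScheduledBackup expects 6
# (with seconds); a Kubernetes CronJob expects 5.
cron_fields() { printf '%s\n' "$1" | wc -w; }

[ "$(cron_fields '0 0 2 * * *')" -eq 6 ] && echo "ok for ScheduledBackup"
[ "$(cron_fields '0 3 * * *')" -eq 5 ] && echo "ok for CronJob"
```

Pasting a 5-field schedule into a ScheduledBackup (or vice versa) is rejected by the controller, so it is worth checking before applying.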
## Recover from accidental namespace deletion
**Trigger:** Someone ran `kubectl delete namespace ironflow`. All Ironflow resources in that namespace are gone.
Steps:
```bash
# 1. Recreate the namespace
kubectl create namespace ironflow
```
```bash
# 2. Recreate secrets that were in the namespace
# Image pull secret (if using private registry)
kubectl create secret docker-registry ghcr-pull-secret \
  --namespace ironflow \
  --docker-server=ghcr.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_TOKEN
```
```bash
# Master encryption key (for secrets management)
# Use your saved key — if lost, previously encrypted secrets are unrecoverable
kubectl create secret generic ironflow-master-key -n ironflow \
  --from-literal=master-key="YOUR_SAVED_MASTER_KEY"
```
```bash
# 3. Redeploy Ironflow
ironflow deploy --template medium --name my-release
# Or with Helm: helm install ironflow deploy/helm/ironflow/ -n ironflow -f your-values.yaml
```
```bash
# 4. Verify pods are starting
ironflow deploy status --name my-release
kubectl get pods -n ironflow -w
```
```bash
# 5. Restore PostgreSQL data from backup (if available)
#    See "Restore PostgreSQL from CNPG backup" above for detailed steps.
#    If using bundled PostgreSQL (Small/Medium), it was in the ironflow namespace
#    and is now gone — you need a backup to recover data.
```
```bash
# 6. Verify NATS streams are recreated
# Ironflow auto-creates JetStream streams on startup — check logs:
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=30 | grep -i "stream\|jetstream\|nats"
```

If ironflow-system was also deleted (Large template or Hetzner bootstrap):
```bash
# Infrastructure namespace is gone — NATS and PostgreSQL need reinstalling

# 1. Recreate the infrastructure namespace
kubectl create namespace ironflow-system
```
```bash
# 2. Reinstall NATS
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm install nats nats/nats -n ironflow-system -f your-nats-values.yaml
```
```bash
# 3. Reinstall the CNPG operator (if it was in cnpg-system, it should still be there)
kubectl get deployment cnpg-controller-manager -n cnpg-system
# If missing:
kubectl apply --server-side -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.2.yaml
# Also install the Barman Cloud Plugin if missing:
kubectl apply -f \
  https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.11.0/manifest.yaml
```
```bash
# 4. Recreate the PostgreSQL cluster
kubectl apply -f your-cnpg-cluster.yaml -n ironflow-system
kubectl get cluster -n ironflow-system -w
# Wait for "Cluster in healthy state"
```
```bash
# 5. Then follow steps 1-6 above to recreate the ironflow namespace and deploy
```

What is recoverable vs. lost:
| Scenario | PostgreSQL data | NATS streams | Secrets/Config |
|---|---|---|---|
| ironflow namespace deleted, ironflow-system intact | Intact (PG is in ironflow-system) | Intact (NATS is in ironflow-system) | KV values intact |
| ironflow namespace deleted, bundled (Small/Medium) | Lost without backup | Lost (recreated empty on startup) | Lost (recreated empty) |
| Both namespaces deleted | Lost without backup | Lost (recreated empty on startup) | Lost (recreated empty) |
**Prevention:** Use RBAC to restrict `kubectl delete namespace` permissions. Consider namespace finalizers or admission webhooks to block accidental deletion in production.
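As a concrete example of the admission-control approach, a ValidatingAdmissionPolicy can reject deletion of the Ironflow namespaces outright. This is a sketch, not a tested policy: it assumes Kubernetes v1.30+ (where ValidatingAdmissionPolicy is GA), and the policy and binding names are illustrative.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: block-ironflow-namespace-deletion
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["namespaces"]
  validations:
    # On DELETE requests the object being removed is exposed as oldObject.
    - expression: "!(oldObject.metadata.name in ['ironflow', 'ironflow-system'])"
      message: "Deletion of Ironflow namespaces is blocked by policy."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: block-ironflow-namespace-deletion
spec:
  policyName: block-ironflow-namespace-deletion
  validationActions: ["Deny"]
```

Pair this with RBAC so that only cluster admins can modify or delete the policy itself.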
## Restore a single tenant from backup (multi-tenant cluster)
**Trigger:** One tenant’s data is lost or corrupted and needs to be restored from a CNPG backup without affecting other tenants.
Two recovery methods are available:
**Method 1: Restore from the object store.** Use this method when backups are stored in S3 via the Barman Cloud Plugin (the default for multi-tenant deployments). Object store backups survive cluster deletion, so you can restore even after helm uninstall.
```bash
# 1. Identify the tenant's CNPG cluster and ObjectStore
kubectl get cluster -n tenant-acme
kubectl get objectstores.barmancloud.cnpg.io -n tenant-acme
```
```bash
# 2. Uninstall the tenant's Helm release (keeps the namespace and secrets)
helm uninstall acme -n tenant-acme
```
```bash
# 3. Wait for all pods to terminate
kubectl get pods -n tenant-acme -w
```
```bash
# 4. Delete any leftover PVCs from the old CNPG cluster
kubectl delete pvc -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
```
```bash
# 5. Reinstall with recovery from the object store
# Create a temporary values override for plugin-based recovery.
# The serverName must match the original CNPG Cluster name.
cat <<'EOF' > /tmp/tenant-acme-recovery.yaml
postgresql:
  recovery:
    source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: acme-ironflow-objectstore
          serverName: acme-ironflow-postgresql
EOF
```
```bash
helm install acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  -f /tmp/tenant-acme-recovery.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set ironflow.masterKey=$ACME_MASTER_KEY  # Use the same master key as before
```
```bash
# 6. Wait for recovery to complete
kubectl get cluster -n tenant-acme -w
# Wait for STATUS = "Cluster in healthy state"
```
```bash
# 7. Verify tenant health
kubectl exec -n tenant-acme deploy/acme-ironflow -- wget -qO- http://localhost:9123/health
```
```bash
# 8. Clean up the temporary file
rm /tmp/tenant-acme-recovery.yaml
```

**Point-in-time recovery (PITR):** To restore to a specific timestamp (e.g., just before data corruption), add a `recoveryTarget` to the recovery values:
```yaml
postgresql:
  recovery:
    source: origin
    recoveryTarget:
      targetTime: "2026-04-05T10:00:00Z"  # Restore to this point in time
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: acme-ironflow-objectstore
          serverName: acme-ironflow-postgresql
```

**Method 2: Restore from a VolumeSnapshot.** Use this method when backups are VolumeSnapshots (available if scheduledBackups.method=volumeSnapshot).
```bash
# 1. Check available VolumeSnapshots
kubectl get volumesnapshots -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
```
```bash
# 2. Uninstall the tenant's Helm release
helm uninstall acme -n tenant-acme
kubectl get pods -n tenant-acme -w
```
```bash
# 3. Delete leftover PVCs
kubectl delete pvc -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
```
```bash
# 4. Reinstall with recovery from VolumeSnapshot
cat <<'EOF' > /tmp/tenant-acme-recovery.yaml
postgresql:
  recovery:
    volumeSnapshots:
      storage:
        name: acme-ironflow-postgresql-20260405020000  # Replace with actual snapshot name
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
EOF
```
```bash
helm install acme ./deploy/helm/ironflow \
  -n tenant-acme \
  -f deploy/helm/ironflow/values-multi-tenant.yaml \
  -f /tmp/tenant-acme-recovery.yaml \
  --set ingress.host=acme.ironflow.example.com \
  --set ironflow.masterKey=$ACME_MASTER_KEY
```
```bash
# 5. Wait for recovery and verify
kubectl get cluster -n tenant-acme -w
kubectl exec -n tenant-acme deploy/acme-ironflow -- wget -qO- http://localhost:9123/health
```
```bash
# 6. Clean up
rm /tmp/tenant-acme-recovery.yaml
```

Other tenants are completely unaffected — each tenant’s CNPG cluster and backups are scoped to their own namespace.
**Master key.** You must use the same master key the tenant was originally deployed with. If you lost it, any encrypted secrets in the database will be unrecoverable. Store master keys securely outside the cluster (e.g., in a password manager or external secret store).
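One way to keep an offline copy safely is to encrypt it with a passphrase before it leaves your machine. A sketch using openssl (the key and passphrase values here are illustrative; in practice, prompt for the passphrase rather than hard-coding it):

```shell
#!/bin/sh
# Encrypt a master-key backup with a passphrase before storing it anywhere,
# then decrypt it on demand. AES-256-CBC with PBKDF2 key derivation.
MASTER_KEY="f3a1c0ffee00112233445566778899aa"  # illustrative; use the real key
PASSPHRASE="use-a-real-passphrase-here"        # illustrative; prompt in practice

printf '%s' "$MASTER_KEY" | openssl enc -aes-256-cbc -pbkdf2 \
  -pass pass:"$PASSPHRASE" -out master-key.enc

# Decrypt when you need the key back (prints it to stdout):
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$PASSPHRASE" -in master-key.enc
echo
```

Only master-key.enc needs to be stored; the passphrase goes in your password manager.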
**Finding the right names.** The barmanObjectName and serverName in the recovery values must match the original deployment. For a Helm release named acme with default chart settings:
- barmanObjectName: acme-ironflow-objectstore
- serverName: acme-ironflow-postgresql
You can verify these by checking: `helm get values acme -n tenant-acme`
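The naming convention can be captured in a small helper (hypothetical; assumes the chart's default `<release>-ironflow-*` naming with no fullnameOverride):

```shell
#!/bin/sh
# Print the CNPG recovery identifiers for a Helm release, following the
# chart's default <release>-ironflow-<component> naming convention.
derive_recovery_names() {
  release="$1"
  echo "barmanObjectName: ${release}-ironflow-objectstore"
  echo "serverName: ${release}-ironflow-postgresql"
}

derive_recovery_names acme
# barmanObjectName: acme-ironflow-objectstore
# serverName: acme-ironflow-postgresql
```

Paste the two printed lines directly into the recovery values override for the tenant.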