Disaster Recovery Scenarios

Full cluster rebuild from scratch

Trigger: Cluster irrecoverably broken. Need to rebuild everything from infrastructure up.

Steps:

# 1. Destroy the old cluster (if using Hetzner provisioning)
ironflow provision destroy --provider hetzner --name ironflow
# This removes all VMs, load balancers, and volumes. Data is gone.
# 2. Provision new infrastructure
ironflow provision create --provider hetzner --template medium --name ironflow
# ~5-8 minutes. Creates VMs, installs Talos Linux, configures networking.
# Kubeconfig saved to ~/.kube/clusters/hetzner-ironflow.yaml
export KUBECONFIG=~/.kube/clusters/hetzner-ironflow.yaml
# 3. Install the CNPG operator and Barman Cloud Plugin
kubectl apply --server-side -f \
https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.2.yaml
kubectl wait --for=condition=Available deployment/cnpg-controller-manager \
-n cnpg-system --timeout=120s
# Install the Barman Cloud Plugin (required for object store backups)
kubectl apply -f \
https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.11.0/manifest.yaml
# 4. Deploy Ironflow with your template
ironflow deploy --template medium --name my-release
# Or: ironflow deploy --template small --name dev
# Or: ironflow deploy --template large --name prod \
# --set externalDatabase.url=postgres://... \
# --set externalNats.url=nats://...
# 5. Verify the deployment
ironflow deploy status --name my-release
kubectl get pods -n ironflow

Restoring data from a PostgreSQL backup (if available):

# 6. If you have a CNPG backup (VolumeSnapshot or object store), restore it
# before Ironflow starts writing new data.
# See "Restore PostgreSQL from CNPG backup" below for detailed steps.
# 7. After restore, restart Ironflow to pick up the restored data
kubectl rollout restart deployment/ironflow -n ironflow

What recovers automatically vs. what needs restoring:

  • NATS JetStream streams: Auto-created by Ironflow on startup. No restore needed.
  • NATS KV buckets (config, secrets, cron): Auto-created on first use. Secrets and config values are lost unless backed up.
  • PostgreSQL schema: Auto-migrated by Ironflow on startup.
  • PostgreSQL data (runs, events, functions, projections): Requires backup restore. Without a backup, historical data is lost.
  • Ironflow master encryption key: Must be recreated as a Kubernetes secret before deploying.
  • S3 backup credentials: Must be recreated as a Kubernetes secret before deploying.

Recreate secrets before deploying:

kubectl create namespace ironflow
# Master encryption key (required for secrets encryption)
kubectl create secret generic ironflow-master-key -n ironflow \
--from-literal=master-key="$(openssl rand -hex 32)"
# If you saved the original master key, use that value instead — otherwise
# previously encrypted secrets cannot be decrypted.
# S3 backup credentials (required for PostgreSQL backups)
kubectl create secret generic ironflow-s3-creds -n ironflow \
--from-literal=ACCESS_KEY_ID="your-s3-access-key" \
--from-literal=SECRET_ACCESS_KEY="your-s3-secret-key"
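As a sanity check before storing the key: `openssl rand -hex 32` produces 32 random bytes hex-encoded, i.e. a 64-character string. A quick sketch:

```shell
# A 32-byte master key hex-encodes to exactly 64 characters.
KEY="$(openssl rand -hex 32)"
echo "${#KEY}"
# → 64
```

If you see a different length, the key was truncated somewhere (e.g. by a shell quoting mistake) and should be regenerated.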

Restore PostgreSQL from CNPG backup

Trigger: Data loss or corruption. Need to restore from a CNPG backup.

Backup configuration required

Small and Medium templates include daily S3 backups by default (method: plugin). If you disabled backups or use custom values without backup configuration, there is no backup to restore from. See the post-recovery note below for how to configure backups.

The recommended approach uses Helm with a recovery values override file. This preserves all your Cluster CR settings (resources, pooler, monitoring, parameters) during recovery.

Steps:

# 1. Check if the ObjectStore and backups exist
kubectl get objectstores.barmancloud.cnpg.io -n ironflow
kubectl get backups -n ironflow
# 2. Uninstall the Helm release (keeps namespace and secrets)
RELEASE_NAME=dev # your Helm release name
helm uninstall $RELEASE_NAME -n ironflow
# 3. Wait for pods to terminate, then delete leftover PVCs
kubectl get pods -n ironflow -w
kubectl delete pvc -n ironflow -l cnpg.io/cluster=$RELEASE_NAME-ironflow-postgresql
# 4. Create a recovery values override
cat <<EOF > /tmp/recovery.yaml
postgresql:
  recovery:
    source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: ${RELEASE_NAME}-ironflow-objectstore
          serverName: ${RELEASE_NAME}-ironflow-postgresql
EOF
# 5. Reinstall with recovery
helm install $RELEASE_NAME ./deploy/helm/ironflow \
-n ironflow \
-f deploy/helm/ironflow/values-medium.yaml \
-f /tmp/recovery.yaml
# 6. Wait for recovery to complete
kubectl get cluster -n ironflow -w
# Wait for STATUS = "Cluster in healthy state"
# 7. Verify and clean up
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=20
rm /tmp/recovery.yaml

Point-in-time recovery (PITR): Add recoveryTarget to restore to a specific timestamp:

postgresql:
  recovery:
    source: origin
    recoveryTarget:
      targetTime: "2026-04-05T10:00:00Z"
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: dev-ironflow-objectstore
          serverName: dev-ironflow-postgresql

Configuring CNPG backups for production (do this before you need a restore):

Small and Medium templates include daily S3 backups by default. The S3 destination path is auto-derived from the Helm release name (s3://ironflow-backups/<release-name>). You only need to create the S3 credentials secret and set the endpoint URL.
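For example, under that convention a release named `dev` backs up to the following path (a sketch of the derivation, assuming the default `ironflow-backups` bucket; this is not chart code):

```shell
# Destination path derived from the Helm release name (default bucket assumed).
RELEASE_NAME=dev
echo "s3://ironflow-backups/${RELEASE_NAME}"
# → s3://ironflow-backups/dev
```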

For VolumeSnapshot-based backups instead of S3:

# In your values file:
postgresql:
  scheduledBackups:
    enabled: true
    schedule: "0 0 2 * * *"   # Daily at 2 AM (6-field cron with seconds)
    method: volumeSnapshot
    volumeSnapshotClassName: "hcloud-volumes"   # Your storage provider's class
  snapshotRetention:
    enabled: true
    days: 7

For S3 object store backups with continuous WAL archiving (enables PITR):

postgresql:
  objectStore:
    enabled: true
    # destinationPath auto-derived: s3://ironflow-backups/<release-name>
    # Override: --set postgresql.objectStore.destinationPath=s3://my-bucket/custom-path
    endpointURL: "https://fsn1.your-objectstorage.com"
    retentionPolicy: "7d"
    s3Credentials:
      accessKeyId:
        name: ironflow-s3-creds
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: ironflow-s3-creds
        key: SECRET_ACCESS_KEY

Verify backups are running:

kubectl get scheduledbackups -n <namespace>
kubectl get backups -n <namespace> --sort-by=.metadata.creationTimestamp

Backup retention

All Helm value templates (default, small, medium, multi-tenant) use a 7-day retention policy for S3 object store backups. CNPG’s Barman Cloud Plugin automatically prunes base backups and WAL segments older than the retention window.

Combined with the default daily ScheduledBackup (runs at 2 AM UTC), this means approximately 7 daily backups exist in your S3 bucket at any point. PITR is available to any point within the 7-day window thanks to continuous WAL archiving.

To change the retention period, override postgresql.objectStore.retentionPolicy in your values file or at install/upgrade time:

# Keep 30 days of backups
helm upgrade <release> deploy/helm/ironflow \
-n ironflow \
--reuse-values \
--set postgresql.objectStore.retentionPolicy="30d"

Or in a values file:

postgresql:
  objectStore:
    retentionPolicy: "30d"

For VolumeSnapshot-based backups, retention is controlled separately by the cleanup CronJob:

postgresql:
  snapshotRetention:
    enabled: true
    days: 7               # Delete snapshots older than 7 days
    schedule: "0 3 * * *" # Run cleanup daily at 3 AM

Choosing a retention period

Longer retention increases S3 storage costs but gives you a wider PITR window. For demo clusters, 7 days is sufficient. For production, consider 30 days or longer depending on compliance requirements.


Recover from accidental namespace deletion

Trigger: Someone ran kubectl delete namespace ironflow. All Ironflow resources in that namespace are gone.

Steps:

# 1. Recreate the namespace
kubectl create namespace ironflow
# 2. Recreate secrets that were in the namespace
# Image pull secret (if using private registry)
kubectl create secret docker-registry ghcr-pull-secret \
--namespace ironflow \
--docker-server=ghcr.io \
--docker-username=YOUR_USERNAME \
--docker-password=YOUR_TOKEN
# Master encryption key (for secrets management)
# Use your saved key — if lost, previously encrypted secrets are unrecoverable
kubectl create secret generic ironflow-master-key -n ironflow \
--from-literal=master-key="YOUR_SAVED_MASTER_KEY"
# 3. Redeploy Ironflow
ironflow deploy --template medium --name my-release
# Or with Helm: helm install ironflow deploy/helm/ironflow/ -n ironflow -f your-values.yaml
# 4. Verify pods are starting
ironflow deploy status --name my-release
kubectl get pods -n ironflow -w
# 5. Restore PostgreSQL data from backup (if available)
# See "Restore PostgreSQL from CNPG backup" above for detailed steps.
# If using bundled PostgreSQL (Small/Medium), it was in the ironflow namespace
# and is now gone — you need a backup to recover data.
# 6. Verify NATS streams are recreated
# Ironflow auto-creates JetStream streams on startup — check logs:
kubectl logs -n ironflow -l app.kubernetes.io/component=server --tail=30 | grep -i "stream\|jetstream\|nats"

If ironflow-system was also deleted (Large template or Hetzner bootstrap):

# Infrastructure namespace is gone — NATS and PostgreSQL need reinstalling
# 1. Recreate the infrastructure namespace
kubectl create namespace ironflow-system
# 2. Reinstall NATS
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm install nats nats/nats -n ironflow-system -f your-nats-values.yaml
# 3. Check the CNPG operator (it lives in cnpg-system, which was not deleted, so it should still be there)
kubectl get deployment cnpg-controller-manager -n cnpg-system
# If missing:
kubectl apply --server-side -f \
https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.28/releases/cnpg-1.28.2.yaml
# Also install the Barman Cloud Plugin if missing:
kubectl apply -f \
https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.11.0/manifest.yaml
# 4. Recreate the PostgreSQL cluster
kubectl apply -f your-cnpg-cluster.yaml -n ironflow-system
kubectl get cluster -n ironflow-system -w
# Wait for "Cluster in healthy state"
# 5. Then follow steps 1-6 above to recreate the ironflow namespace and deploy

What is recoverable vs. lost:

  • ironflow namespace deleted, ironflow-system intact: PostgreSQL data intact (PG is in ironflow-system); NATS streams intact (NATS is in ironflow-system); secrets/config KV values intact.
  • ironflow namespace deleted, bundled (Small/Medium): PostgreSQL data lost without backup; NATS streams lost (recreated empty on startup); secrets/config lost (recreated empty).
  • Both namespaces deleted: PostgreSQL data lost without backup; NATS streams lost (recreated empty on startup); secrets/config lost (recreated empty).

Prevention: Use RBAC to restrict kubectl delete namespace permissions. Consider namespace finalizers or admission webhooks to block accidental deletion in production.
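One concrete guard: on Kubernetes 1.30+, a ValidatingAdmissionPolicy can deny DELETE requests against the namespace. A hedged sketch (the resource names are illustrative and this is not part of the Ironflow chart):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: block-ironflow-namespace-delete
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["namespaces"]
  validations:
    # On DELETE requests the existing object is available as oldObject.
    - expression: "oldObject.metadata.name != 'ironflow'"
      message: "Deletion of the ironflow namespace is blocked by policy."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: block-ironflow-namespace-delete
spec:
  policyName: block-ironflow-namespace-delete
  validationActions: ["Deny"]
```

An operator with permission to delete the policy can still remove the guard first, so this complements RBAC rather than replacing it.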

Restore a single tenant from backup (multi-tenant cluster)

Trigger: One tenant’s data is lost or corrupted and needs to be restored from a CNPG backup without affecting other tenants.

Two recovery methods are available: restoring from the S3 object store (Barman Cloud Plugin) or restoring from a VolumeSnapshot.

Use the object store method, shown below, when backups are stored in S3 via the Barman Cloud Plugin (the default for multi-tenant deployments). Object store backups survive cluster deletion, so you can restore even after helm uninstall.

# 1. Identify the tenant's CNPG cluster and ObjectStore
kubectl get cluster -n tenant-acme
kubectl get objectstores.barmancloud.cnpg.io -n tenant-acme
# 2. Uninstall the tenant's Helm release (keeps the namespace and secrets)
helm uninstall acme -n tenant-acme
# 3. Wait for all pods to terminate
kubectl get pods -n tenant-acme -w
# 4. Delete any leftover PVCs from the old CNPG cluster
kubectl delete pvc -n tenant-acme -l cnpg.io/cluster=acme-ironflow-postgresql
# 5. Reinstall with recovery from the object store
# Create a temporary values override for plugin-based recovery.
# The serverName must match the original CNPG Cluster name.
cat <<'EOF' > /tmp/tenant-acme-recovery.yaml
postgresql:
  recovery:
    source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: acme-ironflow-objectstore
          serverName: acme-ironflow-postgresql
EOF
helm install acme ./deploy/helm/ironflow \
-n tenant-acme \
-f deploy/helm/ironflow/values-multi-tenant.yaml \
-f /tmp/tenant-acme-recovery.yaml \
--set ingress.host=acme.ironflow.example.com \
--set ironflow.masterKey=$ACME_MASTER_KEY # Use the same master key as before
# 6. Wait for recovery to complete
kubectl get cluster -n tenant-acme -w
# Wait for STATUS = "Cluster in healthy state"
# 7. Verify tenant health
kubectl exec -n tenant-acme deploy/acme-ironflow -- wget -qO- http://localhost:9123/health
# 8. Clean up temporary file
rm /tmp/tenant-acme-recovery.yaml

Point-in-time recovery (PITR): To restore to a specific timestamp (e.g., just before data corruption), add a recoveryTarget to the recovery values:

postgresql:
  recovery:
    source: origin
    recoveryTarget:
      targetTime: "2026-04-05T10:00:00Z" # Restore to this point in time
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: acme-ironflow-objectstore
          serverName: acme-ironflow-postgresql

Other tenants are completely unaffected — each tenant’s CNPG cluster and backups are scoped to their own namespace.

Master key

You must use the same master key the tenant was originally deployed with. If you lost it, any encrypted secrets in the database will be unrecoverable. Store master keys securely outside the cluster (e.g., in a password manager or external secret store).
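If no external secret store is available, one hedged option is an encrypted file kept offline. The file name, key value, and passphrase below are placeholders, not part of Ironflow:

```shell
# Encrypt the master key for offline storage, then decrypt to verify the
# round trip. Replace the key value and passphrase with your own.
MASTER_KEY="example-master-key"
printf '%s' "$MASTER_KEY" \
  | openssl enc -aes-256-cbc -pbkdf2 -salt -pass pass:changeme -out master-key.enc
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:changeme -in master-key.enc
# → example-master-key
```

In real use, prefer `-pass file:...` or an interactive prompt over an inline passphrase, which would otherwise land in shell history.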

Finding the right names

The barmanObjectName and serverName in the recovery values must match the original deployment. For a Helm release named acme with default chart settings:

  • barmanObjectName: acme-ironflow-objectstore
  • serverName: acme-ironflow-postgresql

You can verify these by checking: helm get values acme -n tenant-acme
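The convention can be sketched as simple string concatenation on the release name (assuming the default chart naming shown above):

```shell
# Default chart naming: <release>-ironflow-<component>
RELEASE=acme
echo "barmanObjectName: ${RELEASE}-ironflow-objectstore"
echo "serverName: ${RELEASE}-ironflow-postgresql"
```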