Alerts & Monitoring Scenarios
First alert fires in Slack — now what?
Trigger: First-ever Slack notification from Alertmanager. Not sure what to do.
Steps:
```bash
# 1. Read the Slack message carefully. It contains:
#    - Status: FIRING or RESOLVED
#    - Severity: CRITICAL or WARNING
#    - Summary: one-line description
#    - Description: details about what's wrong
#    - Runbook: link to step-by-step response (if available)
```
```bash
# 2. Port-forward Alertmanager to see ALL active alerts (not just the one in Slack)
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093 — shows every firing alert grouped by alertname + namespace
```
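If you'd rather stay in the terminal, the Alertmanager v2 API returns the same list the UI shows. A minimal sketch, assuming the port-forward above is running and jq is installed:

```bash
# One line per active alert: name, state (active/suppressed), namespace
curl -s http://localhost:9093/api/v2/alerts \
  | jq -r '.[] | "\(.labels.alertname)\t\(.status.state)\t\(.labels.namespace // "-")"'
```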
```bash
# 3. Port-forward Grafana to check dashboards for visual context
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Open http://localhost:3000
# Default credentials: admin / (from grafana-admin secret)
# Look at the Ironflow Performance dashboard for error rate, latency, run failures
```
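If you need the Grafana admin password, it can be read from the secret mentioned above. The secret name comes from the note in step 3; the admin-password key is an assumption (the usual kube-prometheus-stack convention), so check `kubectl describe secret grafana-admin -n monitoring` if yours differs:

```bash
# Decode the Grafana admin password (key name "admin-password" is assumed)
kubectl get secret grafana-admin -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```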
```bash
# 4. If the alert includes a runbook_url, follow that link.
# Runbooks exist for: IronflowDown, HighErrorRate, NATSDown, PostgreSQLDown
```
```bash
# 5. Port-forward Prometheus to query metrics directly
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090
# Try queries like:
#   up{job=~".*ironflow.*"} — is Ironflow scraped?
#   rate(ironflow_http_requests_total{status_code=~"5.."}[5m]) — current error rate
#   ironflow_runs_total{status="failed"} — failed runs
```
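The same queries can be run against the Prometheus HTTP API if curl is more convenient than the UI; this assumes the port-forward above and jq:

```bash
# Is anything with "ironflow" in the job name being scraped? A value of "1" means up.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~".*ironflow.*"}' \
  | jq '.data.result[] | {job: .metric.job, instance: .metric.instance, up: .value[1]}'
```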
Severity hierarchy:

| Severity | Meaning | Response time |
|---|---|---|
| critical | Service impact. Immediate investigation required. | Now — page on-call if after hours |
| warning | Degraded or trending toward failure. Investigate during business hours. | Next business day |
Critical alerts suppress matching warning alerts via the inhibition rule in Alertmanager config, so you will not see both for the same issue.
See Runbooks for alert-specific response procedures. See kubectl Operations for general cluster diagnostics.
Alertmanager not sending to Slack
Trigger: AlertmanagerFailedNotifications warning fires, or you expected an alert but Slack is silent.
Steps:
```bash
# 1. Check Alertmanager pod logs for send errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50
# Look for: "notify" errors, "webhook" errors, HTTP status codes from Slack
```
```bash
# 2. Verify the Slack webhook secret exists and has a value
kubectl get secret alertmanager-slack -n monitoring
kubectl get secret alertmanager-slack -n monitoring -o jsonpath='{.data.webhook-url}' | base64 -d
# Should output a URL like https://hooks.slack.com/services/T.../B.../...
```
```bash
# 3. Test the webhook URL directly with curl
WEBHOOK_URL=$(kubectl get secret alertmanager-slack -n monitoring \
  -o jsonpath='{.data.webhook-url}' | base64 -d)
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text":"Test from Ironflow ops"}' \
  "$WEBHOOK_URL"
# Should return "ok". If it returns "invalid_payload" or "channel_not_found",
# the webhook is misconfigured in Slack.
```
```bash
# 4. Verify Alertmanager loaded the config correctly
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093/#/status
# Check that the "slack" receiver shows the correct channel (#ironflow-alerts)
# and that api_url_file points to the mounted secret path
```
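The loaded config can also be dumped over the API instead of eyeballing the status page; a sketch, assuming the port-forward above and jq:

```bash
# Print the Slack-related portion of the config Alertmanager is actually running with
curl -s http://localhost:9093/api/v2/status \
  | jq -r '.config.original' \
  | grep -B2 -A6 -i 'slack'
```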
```bash
# 5. If the secret was rotated or recreated, restart Alertmanager to pick it up
kubectl rollout restart statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
kubectl rollout status statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
```

Common causes:
- Slack webhook URL expired or was revoked (Slack admin deleted the app)
- Secret was deleted and recreated but Alertmanager pod was not restarted
- Slack channel was archived or renamed (webhook returns channel_not_found)
- Network policy blocking egress from the monitoring namespace (quick check below)
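For the network-policy cause, a fast way to rule it in or out is to see whether any NetworkPolicies exist in the namespace at all:

```bash
kubectl get networkpolicy -n monitoring
# "No resources found" means nothing at the NetworkPolicy layer is blocking
# Alertmanager's egress to hooks.slack.com
```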
See Rotate Slack webhook URL for the full rotation procedure.
Healthchecks.io reports “down” (dead man’s switch)
Trigger: Email/SMS from Healthchecks.io saying the Watchdog heartbeat stopped.
The Watchdog alert is a special “always-firing” alert built into kube-prometheus-stack. Alertmanager routes it to Healthchecks.io once a minute. If Healthchecks.io (HC) stops receiving pings, something in the monitoring pipeline is broken, anywhere from Prometheus through Alertmanager to outbound delivery.
Steps:
```bash
# 1. Check Prometheus is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Should show 1/1 Running. If not, the Watchdog alert cannot fire.
```
```bash
# 2. Check Alertmanager is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Should show 1/1 Running. If not, alerts cannot be routed.
```
```bash
# 3. Verify the healthchecks-io secret exists
kubectl get secret healthchecks-io -n monitoring
kubectl get secret healthchecks-io -n monitoring -o jsonpath='{.data.ping-url}' | base64 -d
# Should output a URL like https://hc-ping.com/YOUR-UUID
```
```bash
# 4. Check Alertmanager logs for errors sending to Healthchecks.io
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 | grep -i "healthchecks\|webhook\|error"
```
```bash
# 5. Verify the Watchdog alert is actually firing in Prometheus
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/alerts
# Look for "Watchdog" under the alerts list — it should show as FIRING
# If Watchdog is not listed, Prometheus rules are not loaded correctly
```
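The Prometheus API gives the same answer without the UI; an empty result means Watchdog is not firing (assumes the port-forward above and jq):

```bash
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | select(.labels.alertname == "Watchdog") | {state, activeAt}'
```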
```bash
# 6. If Prometheus and Alertmanager are both healthy, test the HC URL manually
HC_URL=$(kubectl get secret healthchecks-io -n monitoring \
  -o jsonpath='{.data.ping-url}' | base64 -d)
curl -fsS "$HC_URL"
# Should return "ok". If it fails, the HC check URL may have been deleted/recreated.
```

Common causes:
| Cause | Effect |
|---|---|
| Prometheus pod is down | No alerts fire at all, including Watchdog |
| Alertmanager pod is down | Alerts fire but cannot be routed |
| healthchecks-io secret deleted | Alertmanager cannot read the ping URL |
| Network egress blocked | Alertmanager cannot reach hc-ping.com |
Prometheus running out of storage
Trigger: DiskSpaceLow alert for a Prometheus PVC, or Prometheus stops ingesting metrics and logs show “no space left on device.”
Steps:
```bash
# 1. Check PVC usage
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus
# Note the NAME and current CAPACITY
```
```bash
# Check actual usage from inside the pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- df -h /prometheus
# Look at Use% — above 85% is the alert threshold
```

Expanding the PVC works if the storage class supports volume expansion (hcloud-volumes does).
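Before patching, you can confirm the storage class really allows expansion (class name taken from the note above):

```bash
kubectl get storageclass hcloud-volumes -o jsonpath='{.allowVolumeExpansion}'; echo
# Must print "true" for the PVC patch below to grow the volume
```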
```bash
# Patch the PVC to a larger size
kubectl patch pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 \
  -n monitoring \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```
```bash
# Wait for the resize to complete
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus -w
# CAPACITY should update to the new size; the FileSystemResizePending condition
# clears once the filesystem has been expanded
```
```bash
# Also update the Helm values so future upgrades don't revert the size
# Edit deploy/monitoring/kube-prometheus-stack-values.yaml:
#   storageSpec.volumeClaimTemplate.spec.resources.requests.storage: 20Gi
```

Alternatively, reduce the retention window so Prometheus drops older data.
```bash
# Default retention is 15 days. Reduce to 7 days:
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f deploy/monitoring/kube-prometheus-stack-values.yaml \
  --set prometheus.prometheusSpec.retention=7d \
  --wait --timeout 300s
```
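To confirm the new retention took effect, read it back from the Prometheus custom resource managed by the operator, or from the running server's flags (the flags query assumes the port-forward from earlier steps and jq):

```bash
# Retention as set on the Prometheus CR
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.retention}'; echo
# Retention the running server was started with
curl -s http://localhost:9090/api/v1/status/flags \
  | jq -r '.data["storage.tsdb.retention.time"]'
```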
```bash
# Prometheus will compact and drop data outside the new window
# Also update deploy/monitoring/kube-prometheus-stack-values.yaml to persist:
#   retention: 7d
```

A third option: restarting Prometheus triggers a WAL replay and compaction, which can reclaim space from deleted/compacted blocks.
```bash
kubectl rollout restart statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring
kubectl rollout status statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring
```
```bash
# Verify usage dropped
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- df -h /prometheus
```

This does not lose data within the retention window. Prometheus replays its write-ahead log on startup.
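If you suspect series cardinality rather than retention is what fills the disk, the TSDB status endpoint gives a quick read on the head block (again assuming the port-forward and jq):

```bash
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
# A numSeries count that keeps climbing usually points at high-cardinality labels
# rather than the retention window
```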
Grafana dashboard shows “No Data”
Trigger: Dashboard panels empty after deploy. Prometheus is running but no Ironflow metrics appear.
Steps:
```bash
# 1. Verify Prometheus is running and scraping Ironflow
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/targets
# Look for an "ironflow" target. It should show State = UP.
# If the target is missing entirely, Prometheus doesn't know about Ironflow.
```
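The same target list is available from the Prometheus API, which is handy when you only want job names and health in the terminal (assumes jq):

```bash
# One line per scrape target: job, health, last scrape error (empty if healthy)
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)\t\(.lastError)"'
```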
```bash
# 2. Check that the ServiceMonitor exists
kubectl get servicemonitor -n ironflow
# Should list an ironflow ServiceMonitor. If missing, metrics are not enabled
# in the Helm values.
```
```bash
# 3. Verify the ServiceMonitor has the correct release label
kubectl get servicemonitor -n ironflow -o yaml | grep -A2 "labels:"
# The ServiceMonitor MUST have a label matching the kube-prometheus-stack
# release name. The Ironflow chart sets: release: {{ .Release.Name }}
# If kube-prometheus-stack was installed as "kube-prometheus-stack",
# the ServiceMonitor label must be: release: kube-prometheus-stack
#
# The default kube-prometheus-stack-values.yaml sets:
#   serviceMonitorSelector: {} (matches all)
# so label mismatch is only a problem if you customized the selector.
```
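To see which selector the operator is actually applying, rather than inferring it from the values file, read it off the Prometheus custom resource (field names follow the Prometheus Operator CRD):

```bash
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.serviceMonitorSelector}'; echo
# {} means every ServiceMonitor is selected; a matchLabels block means your
# ServiceMonitor must carry exactly those labels
```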
```bash
# 4. Check Ironflow has metrics enabled
kubectl get configmap -n ironflow -l app.kubernetes.io/name=ironflow -o yaml | grep -i metrics
# Or check the environment variable:
kubectl get deployment ironflow -n ironflow \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | python3 -m json.tool | grep -A1 METRICS
# IRONFLOW_METRICS_ENABLED should be "true"
# If not, enable it:
ironflow deploy upgrade --template medium --name my-release \
  --set observability.metrics.enabled=true \
  --set serviceMonitor.enabled=true

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set observability.metrics.enabled=true --set serviceMonitor.enabled=true
```
```bash
# 5. Try a direct PromQL query to see if any Ironflow metrics exist
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# In Prometheus UI, try: {job=~".*ironflow.*"}
# If results appear, the issue is the dashboard query, not scraping.
# If no results, scraping is the problem — go back to steps 1-4.
```

Common causes:
| Symptom | Cause | Fix |
|---|---|---|
| No ironflow target in /targets | ServiceMonitor missing | Set serviceMonitor.enabled=true in Helm values |
| Target shows “DOWN” | Ironflow metrics endpoint not responding | Set observability.metrics.enabled=true |
| Target shows “UP” but no data | Dashboard queries wrong metric names | Check dashboard JSON matches actual metric names |
| ServiceMonitor exists but no target | Label selector mismatch | Verify serviceMonitorSelector in kube-prometheus-stack or add correct release label |
See kubectl Operations for more Prometheus and Grafana troubleshooting commands.
False alert — how to tune or silence
Trigger: Alert keeps firing but the system is healthy. Threshold too sensitive or condition is expected.
Steps:
```bash
# Option 1: Silence temporarily (e.g., during maintenance)
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093/#/silences
# Click "New Silence", fill in:
#   - Matchers: alertname = HighRunFailureRate (or whichever alert)
#   - Duration: 2h (or however long you need)
#   - Comment: "Maintenance window — expected elevated failure rate"
```
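If amtool is installed, the same silence can be created from the terminal; a sketch assuming the port-forward above, using the example alert name from the comment:

```bash
amtool silence add alertname=HighRunFailureRate \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --author="$(whoami)" \
  --comment="Maintenance window, expected elevated failure rate"
# Prints the silence ID; end it early with:
#   amtool silence expire <ID> --alertmanager.url=http://localhost:9093
```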
```bash
# Option 2: Adjust the alert threshold
# Edit deploy/helm/ironflow/templates/ironflow-alerts.yaml

# Example: raise HighErrorRate threshold from 5% to 10%
# Change:  ) > 0.05
# To:      ) > 0.10

# Example: extend the "for" duration from 5m to 15m (less sensitive)
# Change:  for: 5m
# To:      for: 15m
```
```bash
# Apply the updated rules via Helm upgrade
helm upgrade ironflow deploy/helm/ironflow/ -n ironflow --reuse-values
```
```bash
# Verify Prometheus loaded the new rules
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/rules — check the rule shows the new threshold

# Option 3: Add an inhibition rule (suppress one alert when another fires)
# Edit deploy/monitoring/kube-prometheus-stack-values.yaml
# Under alertmanager.config.inhibit_rules, add a new rule. Example:
#
#   - source_matchers:
#       - alertname = IronflowDown
#     target_matchers:
#       - alertname = HighErrorRate
#     equal: ['namespace']
#
# This suppresses HighErrorRate when IronflowDown is already firing
# (since 100% errors are expected when the service is down).
```
```bash
# Apply via Helm upgrade
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f deploy/monitoring/kube-prometheus-stack-values.yaml \
  --wait --timeout 300s
```

Common tuning examples:
| Alert | Default | When to tune |
|---|---|---|
| HighErrorRate | > 5% for 5m | Raise to 10% if background jobs cause occasional 5xx |
| HighRunFailureRate | > 5% for 15m | Raise threshold if user functions are expected to fail often |
| MemoryPressure | > 85% for 10m | Raise to 90% if pods run fine at high memory usage |
| WorkerDisconnected | 0 workers for 5m | Extend to 15m if workers cycle during deploys |
| HighLatency | P99 > 2s for 5m | Raise to 5s for batch-heavy workloads |