Alerts & Monitoring Scenarios
First alert fires in Slack — now what?
Trigger: First-ever Slack notification from Alertmanager. Not sure what to do.
Steps:
```bash
# 1. Read the Slack message carefully. It contains:
#    - Status: FIRING or RESOLVED
#    - Severity: CRITICAL or WARNING
#    - Summary: one-line description
#    - Description: details about what's wrong
#    - Runbook: link to step-by-step response (if available)
```
```bash
# 2. Port-forward Alertmanager to see ALL active alerts (not just the one in Slack)
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093 — shows every firing alert grouped by alertname + namespace
```
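If you'd rather stay in the terminal, the Alertmanager v2 API returns the same list the UI shows. A minimal sketch, assuming the port-forward above is running and jq is installed:

```bash
# One line per active alert: name, state (active/suppressed), namespace
curl -s http://localhost:9093/api/v2/alerts \
  | jq -r '.[] | "\(.labels.alertname)\t\(.status.state)\t\(.labels.namespace // "-")"'
```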
```bash
# 3. Port-forward Grafana to check dashboards for visual context
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Open http://localhost:3000
# Default credentials: admin / (from grafana-admin secret)
# Look at the Ironflow Performance dashboard for error rate, latency, run failures
```
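If you need the Grafana admin password, it can be read from the secret mentioned above. The secret name comes from the note in step 3; the admin-password key is an assumption (the usual kube-prometheus-stack convention), so check `kubectl describe secret grafana-admin -n monitoring` if yours differs:

```bash
# Decode the Grafana admin password (key name "admin-password" is assumed)
kubectl get secret grafana-admin -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```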
```bash
# 4. If the alert includes a runbook_url, follow that link.
# Runbooks exist for: IronflowDown, HighErrorRate, NATSDown, PostgreSQLDown
```
```bash
# 5. Port-forward Prometheus to query metrics directly
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090
# Try queries like:
#   up{job=~".*ironflow.*"} — is Ironflow scraped?
#   rate(ironflow_http_requests_total{status_code=~"5.."}[5m]) — current error rate
#   ironflow_runs_total{status="failed"} — failed runs
```
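The same queries can be run against the Prometheus HTTP API if curl is more convenient than the UI; this assumes the port-forward above and jq:

```bash
# Is anything with "ironflow" in the job name being scraped? A value of "1" means up.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~".*ironflow.*"}' \
  | jq '.data.result[] | {job: .metric.job, instance: .metric.instance, up: .value[1]}'
```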
Severity hierarchy:

| Severity | Meaning | Response time |
|---|---|---|
| critical | Service impact. Immediate investigation required. | Now — page on-call if after hours |
| warning | Degraded or trending toward failure. Investigate during business hours. | Next business day |
Critical alerts suppress matching warning alerts via the inhibition rule in Alertmanager config, so you will not see both for the same issue.
See Runbooks for alert-specific response procedures. See kubectl Operations for general cluster diagnostics.
Alertmanager not sending to Slack
Trigger: AlertmanagerFailedNotifications warning fires, or you expected an alert but Slack is silent.
Steps:
```bash
# 1. Check Alertmanager pod logs for send errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50
# Look for: "notify" errors, "webhook" errors, HTTP status codes from Slack
```
```bash
# 2. Verify the Slack webhook secret exists and has a value
kubectl get secret alertmanager-slack -n monitoring
kubectl get secret alertmanager-slack -n monitoring -o jsonpath='{.data.webhook-url}' | base64 -d
# Should output a URL like https://hooks.slack.com/services/T.../B.../...
```
```bash
# 3. Test the webhook URL directly with curl
WEBHOOK_URL=$(kubectl get secret alertmanager-slack -n monitoring \
  -o jsonpath='{.data.webhook-url}' | base64 -d)
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text":"Test from Ironflow ops"}' \
  "$WEBHOOK_URL"
# Should return "ok". If it returns "invalid_payload" or "channel_not_found",
# the webhook is misconfigured in Slack.
```
```bash
# 4. Verify Alertmanager loaded the config correctly
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093/#/status
# Check that the "slack" receiver shows the correct channel (#ironflow-alerts)
# and that api_url_file points to the mounted secret path
```
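The loaded config can also be dumped over the API instead of eyeballing the status page; a sketch, assuming the port-forward above and jq:

```bash
# Print the Slack-related portion of the config Alertmanager is actually running with
curl -s http://localhost:9093/api/v2/status \
  | jq -r '.config.original' \
  | grep -B2 -A6 -i 'slack'
```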
```bash
# 5. If the secret was rotated or recreated, restart Alertmanager to pick it up
kubectl rollout restart statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
kubectl rollout status statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
```

Common causes:
- Slack webhook URL expired or was revoked (Slack admin deleted the app)
- Secret was deleted and recreated but Alertmanager pod was not restarted
- Slack channel was archived or renamed (webhook returns channel_not_found)
- Network policy blocking egress from the monitoring namespace (quick check below)
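For the network-policy cause, a fast way to rule it in or out is to see whether any NetworkPolicies exist in the namespace at all:

```bash
kubectl get networkpolicy -n monitoring
# "No resources found" means nothing at the NetworkPolicy layer is blocking
# Alertmanager's egress to hooks.slack.com
```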
See Rotate Slack webhook URL for the full rotation procedure.
Healthchecks.io reports “down” (dead man’s switch)
Trigger: Email/SMS from Healthchecks.io saying the Watchdog heartbeat stopped.
The Watchdog alert is a special “always-firing” alert built into kube-prometheus-stack. Alertmanager routes it to Healthchecks.io once a minute. If Healthchecks.io (HC) stops receiving pings, something in the monitoring pipeline is broken, anywhere from Prometheus through Alertmanager to outbound delivery.
Steps:
```bash
# 1. Check Prometheus is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Should show 1/1 Running. If not, the Watchdog alert cannot fire.
```
```bash
# 2. Check Alertmanager is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Should show 1/1 Running. If not, alerts cannot be routed.
```
```bash
# 3. Verify the healthchecks-io secret exists
kubectl get secret healthchecks-io -n monitoring
kubectl get secret healthchecks-io -n monitoring -o jsonpath='{.data.ping-url}' | base64 -d
# Should output a URL like https://hc-ping.com/YOUR-UUID
```
```bash
# 4. Check Alertmanager logs for errors sending to Healthchecks.io
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 | grep -i "healthchecks\|webhook\|error"
```
```bash
# 5. Verify the Watchdog alert is actually firing in Prometheus
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/alerts
# Look for "Watchdog" under the alerts list — it should show as FIRING
# If Watchdog is not listed, Prometheus rules are not loaded correctly
```
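The Prometheus API gives the same answer without the UI; an empty result means Watchdog is not firing (assumes the port-forward above and jq):

```bash
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | select(.labels.alertname == "Watchdog") | {state, activeAt}'
```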
```bash
# 6. If Prometheus and Alertmanager are both healthy, test the HC URL manually
HC_URL=$(kubectl get secret healthchecks-io -n monitoring \
  -o jsonpath='{.data.ping-url}' | base64 -d)
curl -fsS "$HC_URL"
# Should return "ok". If it fails, the HC check URL may have been deleted/recreated.
```

Common causes:
| Cause | Effect |
|---|---|
| Prometheus pod is down | No alerts fire at all, including Watchdog |
| Alertmanager pod is down | Alerts fire but cannot be routed |
| healthchecks-io secret deleted | Alertmanager cannot read the ping URL |
| Network egress blocked | Alertmanager cannot reach hc-ping.com |
Prometheus running out of storage
Trigger: DiskSpaceLow alert for a Prometheus PVC, or Prometheus stops ingesting metrics and logs show “no space left on device.”
Steps:
```bash
# 1. Check PVC usage
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus
# Note the NAME and current CAPACITY
```
```bash
# Check actual usage from inside the pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- df -h /prometheus
# Look at Use% — above 85% is the alert threshold
```

Expanding the PVC works if the storage class supports volume expansion (hcloud-volumes does).
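Before patching, you can confirm the storage class really allows expansion (class name taken from the note above):

```bash
kubectl get storageclass hcloud-volumes -o jsonpath='{.allowVolumeExpansion}'; echo
# Must print "true" for the PVC patch below to grow the volume
```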
```bash
# Patch the PVC to a larger size
kubectl patch pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 \
  -n monitoring \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```
```bash
# Wait for the resize to complete
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus -w
# CAPACITY should update to the new size; the FileSystemResizePending condition
# clears once the filesystem has been expanded
```
```bash
# Also update the Helm values so future upgrades don't revert the size
# Edit deploy/monitoring/kube-prometheus-stack-values.yaml:
#   storageSpec.volumeClaimTemplate.spec.resources.requests.storage: 20Gi
```

Alternatively, reduce the retention window so Prometheus drops older data.
```bash
# Default retention is 15 days. Reduce to 7 days:
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f deploy/monitoring/kube-prometheus-stack-values.yaml \
  --set prometheus.prometheusSpec.retention=7d \
  --wait --timeout 300s
```
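To confirm the new retention took effect, read it back from the Prometheus custom resource managed by the operator, or from the running server's flags (the flags query assumes the port-forward from earlier steps and jq):

```bash
# Retention as set on the Prometheus CR
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.retention}'; echo
# Retention the running server was started with
curl -s http://localhost:9090/api/v1/status/flags \
  | jq -r '.data["storage.tsdb.retention.time"]'
```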
```bash
# Prometheus will compact and drop data outside the new window
# Also update deploy/monitoring/kube-prometheus-stack-values.yaml to persist:
#   retention: 7d
```

A third option: restarting Prometheus triggers a WAL replay and compaction, which can reclaim space from deleted/compacted blocks.
```bash
kubectl rollout restart statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring
kubectl rollout status statefulset prometheus-kube-prometheus-stack-prometheus -n monitoring
```
```bash
# Verify usage dropped
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
  -c prometheus -- df -h /prometheus
```

This does not lose data within the retention window. Prometheus replays its write-ahead log on startup.
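If you suspect series cardinality rather than retention is what fills the disk, the TSDB status endpoint gives a quick read on the head block (again assuming the port-forward and jq):

```bash
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
# A numSeries count that keeps climbing usually points at high-cardinality labels
# rather than the retention window
```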
Grafana dashboard shows “No Data”
Trigger: Dashboard panels empty after deploy. Prometheus is running but no Ironflow metrics appear.
Steps:
```bash
# 1. Verify Prometheus is running and scraping Ironflow
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/targets
# Look for an "ironflow" target. It should show State = UP.
# If the target is missing entirely, Prometheus doesn't know about Ironflow.
```
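The same target list is available from the Prometheus API, which is handy when you only want job names and health in the terminal (assumes jq):

```bash
# One line per scrape target: job, health, last scrape error (empty if healthy)
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)\t\(.lastError)"'
```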
```bash
# 2. Check that the ServiceMonitor exists
kubectl get servicemonitor -n ironflow
# Should list an ironflow ServiceMonitor. If missing, metrics are not enabled
# in the Helm values.
```
```bash
# 3. Verify the ServiceMonitor has the correct release label
kubectl get servicemonitor -n ironflow -o yaml | grep -A2 "labels:"
# The ServiceMonitor MUST have a label matching the kube-prometheus-stack
# release name. The Ironflow chart sets: release: {{ .Release.Name }}
# If kube-prometheus-stack was installed as "kube-prometheus-stack",
# the ServiceMonitor label must be: release: kube-prometheus-stack
#
# The default kube-prometheus-stack-values.yaml sets:
#   serviceMonitorSelector: {} (matches all)
# so label mismatch is only a problem if you customized the selector.
```
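To see which selector the operator is actually applying, rather than inferring it from the values file, read it off the Prometheus custom resource (field names follow the Prometheus Operator CRD):

```bash
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.serviceMonitorSelector}'; echo
# {} means every ServiceMonitor is selected; a matchLabels block means your
# ServiceMonitor must carry exactly those labels
```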
```bash
# 4. Check Ironflow has metrics enabled
kubectl get configmap -n ironflow -l app.kubernetes.io/name=ironflow -o yaml | grep -i metrics
# Or check the environment variable:
kubectl get deployment ironflow -n ironflow \
  -o jsonpath='{.spec.template.spec.containers[0].env}' | python3 -m json.tool | grep -A1 METRICS
# IRONFLOW_METRICS_ENABLED should be "true"
# If not, enable it:
ironflow deploy upgrade --template medium --name my-release \
  --set observability.metrics.enabled=true \
  --set serviceMonitor.enabled=true

# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
#   --reuse-values --set observability.metrics.enabled=true --set serviceMonitor.enabled=true
```
```bash
# 5. Try a direct PromQL query to see if any Ironflow metrics exist
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# In Prometheus UI, try: {job=~".*ironflow.*"}
# If results appear, the issue is the dashboard query, not scraping.
# If no results, scraping is the problem — go back to steps 1-4.
```

Common causes:
| Symptom | Cause | Fix |
|---|---|---|
| No ironflow target in /targets | ServiceMonitor missing | Set serviceMonitor.enabled=true in Helm values |
| Target shows “DOWN” | Ironflow metrics endpoint not responding | Set observability.metrics.enabled=true |
| Target shows “UP” but no data | Dashboard queries wrong metric names | Check dashboard JSON matches actual metric names |
| ServiceMonitor exists but no target | Label selector mismatch | Verify serviceMonitorSelector in kube-prometheus-stack or add correct release label |
See kubectl Operations for more Prometheus and Grafana troubleshooting commands.
False alert — how to tune or silence
Trigger: Alert keeps firing but the system is healthy. Threshold too sensitive or condition is expected.
Steps:
```bash
# Option 1: Silence temporarily (e.g., during maintenance)
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093/#/silences
# Click "New Silence", fill in:
#   - Matchers: alertname = HighRunFailureRate (or whichever alert)
#   - Duration: 2h (or however long you need)
#   - Comment: "Maintenance window — expected elevated failure rate"
```
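If amtool is installed, the same silence can be created from the terminal; a sketch assuming the port-forward above, using the example alert name from the comment:

```bash
amtool silence add alertname=HighRunFailureRate \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --author="$(whoami)" \
  --comment="Maintenance window, expected elevated failure rate"
# Prints the silence ID; end it early with:
#   amtool silence expire <ID> --alertmanager.url=http://localhost:9093
```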
```bash
# Option 2: Adjust the alert threshold
# Edit deploy/helm/ironflow/templates/ironflow-alerts.yaml

# Example: raise HighErrorRate threshold from 5% to 10%
# Change:  ) > 0.05
# To:      ) > 0.10

# Example: extend the "for" duration from 5m to 15m (less sensitive)
# Change:  for: 5m
# To:      for: 15m
```
```bash
# Apply the updated rules via Helm upgrade
helm upgrade ironflow deploy/helm/ironflow/ -n ironflow --reuse-values
```
```bash
# Verify Prometheus loaded the new rules
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/rules — check the rule shows the new threshold

# Option 3: Add an inhibition rule (suppress one alert when another fires)
# Edit deploy/monitoring/kube-prometheus-stack-values.yaml
# Under alertmanager.config.inhibit_rules, add a new rule. Example:
#
#   - source_matchers:
#       - alertname = IronflowDown
#     target_matchers:
#       - alertname = HighErrorRate
#     equal: ['namespace']
#
# This suppresses HighErrorRate when IronflowDown is already firing
# (since 100% errors are expected when the service is down).
```
```bash
# Apply via Helm upgrade
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f deploy/monitoring/kube-prometheus-stack-values.yaml \
  --wait --timeout 300s
```

Common tuning examples:
| Alert | Default | When to tune |
|---|---|---|
| HighErrorRate | > 5% for 5m | Raise to 10% if background jobs cause occasional 5xx |
| HighRunFailureRate | > 5% for 15m | Raise threshold if user functions are expected to fail often |
| MemoryPressure | > 85% for 10m | Raise to 90% if pods run fine at high memory usage |
| WorkerDisconnected | 0 workers for 5m | Extend to 15m if workers cycle during deploys |
| HighLatency | P99 > 2s for 5m | Raise to 5s for batch-heavy workloads |