
Alerts & Monitoring Scenarios

First alert fires in Slack — now what?

Trigger: First-ever Slack notification from Alertmanager. Not sure what to do.

Steps:

Terminal window
# 1. Read the Slack message carefully. It contains:
# - Status: FIRING or RESOLVED
# - Severity: CRITICAL or WARNING
# - Summary: one-line description
# - Description: details about what's wrong
# - Runbook: link to step-by-step response (if available)
# 2. Port-forward Alertmanager to see ALL active alerts (not just the one in Slack)
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093 — shows every firing alert grouped by alertname + namespace
# 3. Port-forward Grafana to check dashboards for visual context
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
# Open http://localhost:3000
# Default credentials: admin / (from grafana-admin secret)
# Look at the Ironflow Performance dashboard for error rate, latency, run failures
# 4. If the alert includes a runbook_url, follow that link.
# Runbooks exist for: IronflowDown, HighErrorRate, NATSDown, PostgreSQLDown
# 5. Port-forward Prometheus to query metrics directly
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090
# Try queries like:
# up{job=~".*ironflow.*"} — is Ironflow scraped?
# rate(ironflow_http_requests_total{status_code=~"5.."}[5m]) — current error rate
# ironflow_runs_total{status="failed"} — failed runs
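
If you prefer the command line to the UIs, the same information is available over HTTP. A minimal sketch using the Alertmanager v2 and Prometheus v1 APIs, assuming the port-forwards from steps 2 and 5 are still running:

Terminal window
# List every active alert Alertmanager currently knows about
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool
# List alerts Prometheus itself considers pending or firing
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool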

Severity hierarchy:

Severity | Meaning | Response time
critical | Service impact. Immediate investigation required. | Now — page on-call if after hours
warning | Degraded or trending toward failure. Investigate during business hours. | Next business day

Critical alerts suppress matching warning alerts via the inhibition rule in Alertmanager config, so you will not see both for the same issue.
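
For reference, that severity-based inhibition rule typically looks like the sketch below under alertmanager.config.inhibit_rules in deploy/monitoring/kube-prometheus-stack-values.yaml. The exact matchers and equal labels in this cluster's values file may differ:

Terminal window
# Sketch of a severity-based inhibition rule (illustrative, not a verbatim copy):
#
# - source_matchers:
#     - severity = critical
#   target_matchers:
#     - severity = warning
#   equal: ['alertname', 'namespace']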

See Runbooks for alert-specific response procedures. See kubectl Operations for general cluster diagnostics.


Alertmanager not sending to Slack

Trigger: AlertmanagerFailedNotifications warning fires, or you expected an alert but Slack is silent.

Steps:

Terminal window
# 1. Check Alertmanager pod logs for send errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50
# Look for: "notify" errors, "webhook" errors, HTTP status codes from Slack
# 2. Verify the Slack webhook secret exists and has a value
kubectl get secret alertmanager-slack -n monitoring
kubectl get secret alertmanager-slack -n monitoring -o jsonpath='{.data.webhook-url}' | base64 -d
# Should output a URL like https://hooks.slack.com/services/T.../B.../...
# 3. Test the webhook URL directly with curl
WEBHOOK_URL=$(kubectl get secret alertmanager-slack -n monitoring \
-o jsonpath='{.data.webhook-url}' | base64 -d)
curl -X POST -H 'Content-Type: application/json' \
-d '{"text":"Test from Ironflow ops"}' \
"$WEBHOOK_URL"
# Should return "ok". If it returns "invalid_payload" or "channel_not_found",
# the webhook is misconfigured in Slack.
# 4. Verify Alertmanager loaded the config correctly
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093/#/status
# Check that the "slack" receiver shows the correct channel (#ironflow-alerts)
# and that api_url_file points to the mounted secret path
# 5. If the secret was rotated or recreated, restart Alertmanager to pick it up
kubectl rollout restart statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
kubectl rollout status statefulset alertmanager-kube-prometheus-stack-alertmanager -n monitoring
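
To exercise the full Alertmanager-to-Slack path (rather than just the webhook, as in step 3), you can push a synthetic alert through the Alertmanager v2 API. A sketch, assuming the port-forward from step 4 is running and the default route sends warnings to Slack; the alertname OpsTestAlert is just a placeholder:

Terminal window
# Push a synthetic alert through Alertmanager itself and watch for it in Slack
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"OpsTestAlert","severity":"warning"},"annotations":{"summary":"Synthetic test alert from the ops runbook"}}]'
# A message should appear in #ironflow-alerts after the route's group_wait elapses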

Common causes:

  • Slack webhook URL expired or was revoked (Slack admin deleted the app)
  • Secret was deleted and recreated but Alertmanager pod was not restarted
  • Slack channel was archived or renamed (webhook returns channel_not_found)
  • Network policy blocking egress from the monitoring namespace

See Rotate Slack webhook URL for the full rotation procedure.


Healthchecks.io reports “down” (dead man’s switch)

Trigger: Email/SMS from Healthchecks.io saying the Watchdog heartbeat stopped.

The Watchdog alert is a special “always-firing” alert built into kube-prometheus-stack. Alertmanager routes it to Healthchecks.io every minute. If Healthchecks.io stops receiving pings, your entire monitoring pipeline is broken.
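
The wiring for this lives under alertmanager.config in deploy/monitoring/kube-prometheus-stack-values.yaml: a route that matches the Watchdog alert and a webhook receiver that reads the ping URL from the mounted healthchecks-io secret. A sketch of what that typically looks like; the receiver name, repeat interval, and the exact way the URL is referenced may differ in this cluster:

Terminal window
# Sketch of the Watchdog routing (illustrative values):
#
# route:
#   routes:
#     - matchers:
#         - alertname = Watchdog
#       receiver: healthchecks-io
#       repeat_interval: 1m
# receivers:
#   - name: healthchecks-io
#     webhook_configs:
#       - url_file: /etc/alertmanager/secrets/healthchecks-io/ping-url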

Steps:

Terminal window
# 1. Check Prometheus is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Should show 1/1 Running. If not, the Watchdog alert cannot fire.
# 2. Check Alertmanager is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Should show 1/1 Running. If not, alerts cannot be routed.
# 3. Verify the healthchecks-io secret exists
kubectl get secret healthchecks-io -n monitoring
kubectl get secret healthchecks-io -n monitoring -o jsonpath='{.data.ping-url}' | base64 -d
# Should output a URL like https://hc-ping.com/YOUR-UUID
# 4. Check Alertmanager logs for errors sending to Healthchecks.io
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 | grep -i "healthchecks\|webhook\|error"
# 5. Verify the Watchdog alert is actually firing in Prometheus
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/alerts
# Look for "Watchdog" under the alerts list — it should show as FIRING (green)
# If Watchdog is not listed, Prometheus rules are not loaded correctly
# 6. If Prometheus and Alertmanager are both healthy, test the HC URL manually
HC_URL=$(kubectl get secret healthchecks-io -n monitoring \
-o jsonpath='{.data.ping-url}' | base64 -d)
curl -fsS "$HC_URL"
# Should return "ok". If it fails, the HC check URL may have been deleted/recreated.
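
It is also worth confirming that the Healthchecks.io receiver actually made it into the configuration Alertmanager has loaded. A sketch using the Alertmanager v2 status API; run the curl in a second terminal while the port-forward is active:

Terminal window
# Inspect the loaded Alertmanager config for the Watchdog route and its webhook receiver
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
curl -s http://localhost:9093/api/v2/status | python3 -m json.tool
# Look for the Watchdog route and a webhook receiver in the "original" config string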

Common causes:

Cause | Effect
Prometheus pod is down | No alerts fire at all, including Watchdog
Alertmanager pod is down | Alerts fire but cannot be routed
healthchecks-io secret deleted | Alertmanager cannot read the ping URL
Network egress blocked | Alertmanager cannot reach hc-ping.com

Prometheus running out of storage

Trigger: DiskSpaceLow alert for a Prometheus PVC, or Prometheus stops ingesting metrics and logs show “no space left on device.”

Steps:

Terminal window
# 1. Check PVC usage
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus
# Note the NAME and current CAPACITY
# Check actual usage from inside the pod
kubectl exec -n monitoring prometheus-kube-prometheus-stack-prometheus-0 \
-c prometheus -- df -h /prometheus
# Look at Use% — above 85% is the alert threshold
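
The same usage check can be done from Prometheus itself via the standard kubelet volume metrics, which kube-prometheus-stack typically scrapes by default. A sketch, assuming those metrics are present in this cluster:

Terminal window
# Same check via PromQL, without exec'ing into the pod
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# In the Prometheus UI, try:
#   kubelet_volume_stats_used_bytes{namespace="monitoring"}
#     / kubelet_volume_stats_capacity_bytes{namespace="monitoring"}
# A ratio above 0.85 matches the 85% alert threshold from step 1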

Next, expand the PVC. This works only if the storage class supports volume expansion (hcloud-volumes does).

Terminal window
# Patch the PVC to a larger size
kubectl patch pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 \
-n monitoring \
-p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# Wait for the resize to complete
kubectl get pvc -n monitoring -l app.kubernetes.io/name=prometheus -w
# STATUS should change from "FileSystemResizePending" to "Bound"
# Also update the Helm values so future upgrades don't revert the size
# Edit deploy/monitoring/kube-prometheus-stack-values.yaml:
# storageSpec.volumeClaimTemplate.spec.resources.requests.storage: 20Gi
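
The values key mentioned above nests several levels deep. A sketch of the relevant fragment of kube-prometheus-stack-values.yaml, assuming the surrounding keys already exist in this file:

Terminal window
# Fragment of deploy/monitoring/kube-prometheus-stack-values.yaml:
#
# prometheus:
#   prometheusSpec:
#     storageSpec:
#       volumeClaimTemplate:
#         spec:
#           resources:
#             requests:
#               storage: 20Gi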

Grafana dashboard shows “No Data”

Trigger: Dashboard panels empty after deploy. Prometheus is running but no Ironflow metrics appear.

Steps:

Terminal window
# 1. Verify Prometheus is running and scraping Ironflow
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/targets
# Look for an "ironflow" target. It should show State = UP.
# If the target is missing entirely, Prometheus doesn't know about Ironflow.
# 2. Check that the ServiceMonitor exists
kubectl get servicemonitor -n ironflow
# Should list an ironflow ServiceMonitor. If missing, metrics are not enabled
# in the Helm values.
# 3. Verify the ServiceMonitor has the correct release label
kubectl get servicemonitor -n ironflow -o yaml | grep -A2 "labels:"
# The ServiceMonitor MUST have a label matching the kube-prometheus-stack
# release name. The Ironflow chart sets: release: {{ .Release.Name }}
# If kube-prometheus-stack was installed as "kube-prometheus-stack",
# the ServiceMonitor label must be: release: kube-prometheus-stack
#
# The default kube-prometheus-stack-values.yaml sets:
# serviceMonitorSelector: {} (matches all)
# so label mismatch is only a problem if you customized the selector.
# 4. Check Ironflow has metrics enabled
kubectl get configmap -n ironflow -l app.kubernetes.io/name=ironflow -o yaml | grep -i metrics
# Or check the environment variable:
kubectl get deployment ironflow -n ironflow \
-o jsonpath='{.spec.template.spec.containers[0].env}' | python3 -m json.tool | grep -A1 METRICS
# IRONFLOW_METRICS_ENABLED should be "true"
# If not, enable it:
ironflow deploy upgrade --template medium --name my-release \
--set observability.metrics.enabled=true \
--set serviceMonitor.enabled=true
# Or with Helm directly:
# helm upgrade ironflow deploy/helm/ironflow/ -n ironflow \
# --reuse-values --set observability.metrics.enabled=true --set serviceMonitor.enabled=true
# 5. Try a direct PromQL query to see if any Ironflow metrics exist
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# In Prometheus UI, try: {job=~".*ironflow.*"}
# If results appear, the issue is the dashboard query, not scraping.
# If no results, scraping is the problem — go back to steps 1-4.
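
If you need to compare against a known-good object, the general shape of a ServiceMonitor that kube-prometheus-stack will pick up is sketched below. The names, labels, selector, and port are illustrative; the real object is rendered by the Ironflow chart:

Terminal window
# Illustrative ServiceMonitor shape (not the chart's exact output):
#
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
#   name: ironflow
#   namespace: ironflow
#   labels:
#     release: kube-prometheus-stack   # only matters if serviceMonitorSelector is customized
# spec:
#   selector:
#     matchLabels:
#       app.kubernetes.io/name: ironflow
#   endpoints:
#     - port: metrics
#       interval: 30s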

Common causes:

Symptom | Cause | Fix
No ironflow target in /targets | ServiceMonitor missing | Set serviceMonitor.enabled=true in Helm values
Target shows “DOWN” | Ironflow metrics endpoint not responding | Set observability.metrics.enabled=true
Target shows “UP” but no data | Dashboard queries wrong metric names | Check dashboard JSON matches actual metric names
ServiceMonitor exists but no target | Label selector mismatch | Verify serviceMonitorSelector in kube-prometheus-stack or add correct release label

See kubectl Operations for more Prometheus and Grafana troubleshooting commands.


False alert — how to tune or silence

Trigger: Alert keeps firing but the system is healthy. Threshold too sensitive or condition is expected.

Steps:

Terminal window
# Option 1: Silence temporarily (e.g., during maintenance)
kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093/#/silences
# Click "New Silence", fill in:
# - Matchers: alertname = HighRunFailureRate (or whichever alert)
# - Duration: 2h (or however long you need)
# - Comment: "Maintenance window — expected elevated failure rate"
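
The same silence can be created from the command line with amtool, which ships in the Alertmanager release tarball and container image. A sketch, with the alert name and duration as placeholders and the port-forward above still running:

Terminal window
# Create, list, and remove silences with amtool
amtool silence add alertname=HighRunFailureRate \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --comment="Maintenance window: expected elevated failure rate"
amtool silence query --alertmanager.url=http://localhost:9093
amtool silence expire <SILENCE_ID> --alertmanager.url=http://localhost:9093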
Terminal window
# Option 2: Adjust the alert threshold
# Edit deploy/helm/ironflow/templates/ironflow-alerts.yaml
# Example: raise HighErrorRate threshold from 5% to 10%
# Change: ) > 0.05
# To: ) > 0.10
# Example: extend the "for" duration from 5m to 15m (less sensitive)
# Change: for: 5m
# To: for: 15m
# Apply the updated rules via Helm upgrade
helm upgrade ironflow deploy/helm/ironflow/ -n ironflow --reuse-values
# Verify Prometheus loaded the new rules
kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
# Open http://localhost:9090/rules — check the rule shows the new threshold
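
You can also confirm the change reached the cluster before checking the UI. A sketch, assuming the chart renders ironflow-alerts.yaml as a PrometheusRule object in the ironflow namespace; the grep pattern matches the example change above:

Terminal window
# Confirm the PrometheusRule object carries the new threshold after the upgrade
kubectl get prometheusrules -n ironflow
kubectl get prometheusrules -n ironflow -o yaml | grep -B2 -A2 "> 0.10\|for: 15m"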
Terminal window
# Option 3: Add an inhibition rule (suppress one alert when another fires)
# Edit deploy/monitoring/kube-prometheus-stack-values.yaml
# Under alertmanager.config.inhibit_rules, add a new rule. Example:
#
# - source_matchers:
#     - alertname = IronflowDown
#   target_matchers:
#     - alertname = HighErrorRate
#   equal: ['namespace']
#
# This suppresses HighErrorRate when IronflowDown is already firing
# (since 100% errors are expected when the service is down).
# Apply via Helm upgrade
helm upgrade kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f deploy/monitoring/kube-prometheus-stack-values.yaml \
--wait --timeout 300s

Common tuning examples:

Alert | Default | When to tune
HighErrorRate | > 5% for 5m | Raise to 10% if background jobs cause occasional 5xx
HighRunFailureRate | > 5% for 15m | Raise threshold if user functions are expected to fail often
MemoryPressure | > 85% for 10m | Raise to 90% if pods run fine at high memory usage
WorkerDisconnected | 0 workers for 5m | Extend to 15m if workers cycle during deploys
HighLatency | P99 > 2s for 5m | Raise to 5s for batch-heavy workloads