# Observability
Ironflow provides built-in observability through OpenTelemetry distributed tracing and Prometheus metrics. Both are optional — zero overhead when disabled.
## Key Concepts
| Feature | Description |
|---|---|
| Prometheus Metrics | Counter, histogram, and gauge metrics for runs, steps, events, and HTTP requests |
| OpenTelemetry Tracing | Distributed trace spans for workflow runs, steps, and HTTP requests |
| W3C Trace Context | Automatic propagation of traceparent/tracestate headers |
| Zero Overhead | No performance impact when observability is disabled |
## Configuration
Configure observability via environment variables:
| Variable | Description | Default |
|---|---|---|
| `IRONFLOW_METRICS_ENABLED` | Enable Prometheus metrics at `/metrics` | `false` |
| `IRONFLOW_OTEL_ENDPOINT` | OTLP gRPC endpoint for tracing (empty = disabled) | (empty) |
| `IRONFLOW_OTEL_SAMPLE_RATE` | Trace sampling rate (0.0 to 1.0) | `1.0` |
| `IRONFLOW_OTEL_SERVICE_NAME` | Service name in trace data | `ironflow` |
| `IRONFLOW_OTEL_INSECURE` | Use plaintext gRPC for OTLP export (set `false` for TLS) | `true` |
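For example, a production-style setup might enable metrics and export a 10% sample of traces to a TLS collector (the endpoint and values here are illustrative, not defaults):

```bash
# Illustrative production config: metrics on, 10% sampling, TLS OTLP export
export IRONFLOW_METRICS_ENABLED=true
export IRONFLOW_OTEL_ENDPOINT=otel-collector.example.com:4317
export IRONFLOW_OTEL_SAMPLE_RATE=0.1
export IRONFLOW_OTEL_SERVICE_NAME=ironflow-prod
export IRONFLOW_OTEL_INSECURE=false  # collector terminates TLS
./ironflow serve
```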
## Prometheus Metrics
When IRONFLOW_METRICS_ENABLED=true, the /metrics endpoint serves Prometheus exposition format on the main API port.
### Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `ironflow_runs_total` | Counter | `function_id`, `status`, `environment`, `failure_cause` | Total workflow runs. `failure_cause` is `platform` or `user` for failed runs, empty for completed. |
| `ironflow_run_duration_seconds` | Histogram | `function_id`, `environment` | Run execution duration |
| `ironflow_steps_total` | Counter | `function_id`, `step_name`, `status` | Total step executions |
| `ironflow_step_duration_seconds` | Histogram | `function_id`, `step_name` | Step execution duration |
| `ironflow_events_emitted_total` | Counter | `event_name`, `environment` | Events emitted |
| `ironflow_active_runs` | Gauge | `function_id`, `environment` | Currently active runs |
| `ironflow_http_requests_total` | Counter | `method`, `path`, `status_code` | HTTP requests |
| `ironflow_http_request_duration_seconds` | Histogram | `method`, `path` | HTTP request duration |
| `ironflow_workers_connected` | Gauge | — | Currently connected pull-mode workers |
| `ironflow_worker_disconnects_total` | Counter | `reason` | Worker disconnections by reason |
| `ironflow_worker_active_jobs` | Gauge | — | Active jobs across all workers |
| `ironflow_outbox_dead_letter_count` | Gauge | `env` | Current rows in the outbox dead-letter table per environment (computed at scrape time — zero drift across nodes and restarts). See the Outbox DLQ runbook. |
| `ironflow_outbox_dlq_collector_errors_total` | Counter | — | Scrape-time failures of the DLQ count query. Non-zero means the gauge above is stale; distinguishes “DLQ is empty” from “collector can’t query the DB”. |
| `ironflow_dlq_writes_total` | Counter | `source` | DLQ moves since process start. `source="outbox"` fires every time the outbox worker exhausts the retry budget (default 10) and moves a row to `outbox_dead_letter`. |
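As a quick sanity check, the DLQ metrics can be read straight off the metrics endpoint without Prometheus (port as in the quick start below):

```bash
# Spot-check the outbox DLQ gauge and counters
curl -s http://localhost:9123/metrics | grep -E '^ironflow_(outbox|dlq)'
```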
### Quick Start
```bash
# Enable metrics
IRONFLOW_METRICS_ENABLED=true ./ironflow serve

# Verify metrics endpoint
curl http://localhost:9123/metrics
```

The /metrics endpoint does not require authentication.
### Docker Compose with Prometheus
Use the monitoring profile to start Prometheus alongside Ironflow:
```bash
docker compose --profile monitoring up
```

Prometheus will be available at http://localhost:9090 and automatically scrapes Ironflow metrics.
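To confirm Prometheus actually picked up the Ironflow scrape target, you can query its HTTP API (ports assume the compose defaults above):

```bash
# Each "health" entry should read "up" once scraping succeeds
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
```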
### Grafana Integration
Point Grafana to your Prometheus instance and use these example queries:
```promql
# Request rate (per second)
rate(ironflow_http_requests_total[5m])

# Run completion rate by function
rate(ironflow_runs_total{status="completed"}[5m])

# P95 run duration
histogram_quantile(0.95, rate(ironflow_run_duration_seconds_bucket[5m]))

# Active runs
ironflow_active_runs

# Error rate
rate(ironflow_runs_total{status="failed"}[5m]) / rate(ironflow_runs_total[5m])
```

## OpenTelemetry Tracing
When IRONFLOW_OTEL_ENDPOINT is set, Ironflow exports trace data via OTLP gRPC.
### Span Hierarchy
```
HTTP Request (server span)
└── Run: {functionID} (internal span)
    ├── Step: {stepName} (internal span)
    ├── Step: {stepName} (internal span)
    └── Step: {stepName} (internal span)
```

Each span includes attributes:
- Run spans: `run.id`, `function.id`, `function.name`
- Step spans: `step.id`, `step.name`, `step.type`, `run.id`
- HTTP spans: `http.request.method`, `url.path`, `http.response.status_code`
### Quick Start with Jaeger
```bash
# Start Jaeger (all-in-one)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/jaeger:latest

# Start Ironflow with tracing
IRONFLOW_OTEL_ENDPOINT=localhost:4317 ./ironflow serve
```

View traces at http://localhost:16686.
### Sampling
Control trace sampling with IRONFLOW_OTEL_SAMPLE_RATE:
- `1.0` — Sample all traces (development)
- `0.1` — Sample 10% of traces (production)
- `0.01` — Sample 1% of traces (high-traffic production)
Parent-based sampling is used: if an incoming request carries a sampled trace context, it will always be sampled regardless of the rate.
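A useful consequence: even at a low sample rate, you can force a trace for one request by sending an already-sampled parent context. A minimal sketch, hitting /health purely for demonstration (the trace and span IDs are arbitrary hex):

```bash
# Flags byte 01 marks the parent as sampled, so this request is always traced
curl -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
  http://localhost:9123/health
```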
## Instrumenting Your Functions
Ironflow automatically propagates trace context to your functions. Here’s how to use it in your code.
### Trace Context Propagation
When tracing is enabled, Ironflow injects W3C trace context headers (traceparent, tracestate) into every HTTP request sent to your push-mode functions. This means your function’s spans automatically appear as children of Ironflow’s run span — no manual correlation needed.
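For reference, the injected traceparent value follows the W3C format version-traceID-parentSpanID-flags (IDs below are the W3C spec's example values):

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             version - trace ID (32 hex) - parent span ID (16 hex) - flags (01 = sampled)
```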
```
Ironflow Server                           Your Function
───────────────                           ─────────────
[Run: order-processor] ──HTTP POST──▶ [handle-order]
 ├── [Step: validate]                  ├── [query-database]
 ├── [Step: charge]                    └── [send-notification]
 └── [Step: fulfill]
```

Both sides share the same trace ID, giving you a single end-to-end trace view in Jaeger, Grafana Tempo, or any OTLP-compatible backend.
### Push Mode (Next.js, Express, Lambda)
Your function receives traceparent and tracestate as standard HTTP headers alongside Ironflow headers:
| Header | Description |
|---|---|
| `traceparent` | W3C trace context (trace ID, span ID, flags) |
| `tracestate` | Vendor-specific trace data |
| `X-Ironflow-Run-ID` | The current run ID |
| `X-Ironflow-Function-ID` | The function being executed |
| `X-Ironflow-Attempt` | Current retry attempt number |
To create child spans in your function, initialize the OTel SDK with W3C propagation and extract the context from incoming headers. Your framework’s OTel instrumentation typically handles this automatically.
Set up OTel instrumentation (for Next.js, using @vercel/otel):
```ts
import { registerOTel } from "@vercel/otel";

registerOTel({
  serviceName: "my-nextjs-app",
});
```

Create spans inside your function — they automatically become children of the Ironflow run span:
```ts
import { serve, createFunction } from "@ironflow/node";
import { trace } from "@opentelemetry/api";

const myFunction = createFunction(
  {
    id: "order-processor",
  },
  async ({ step }) => {
    const tracer = trace.getTracer("my-app");
    const result = await tracer.startActiveSpan("process-order", async (span) => {
      try {
        const order = await step.run("validate", async () => {
          // your logic
        });
        return order;
      } finally {
        span.end();
      }
    });
    return result;
  },
);

export const POST = serve({ functions: [myFunction] });
```

For Express, initialize the OTel SDK before any other imports:
```ts
// tracing.ts (import before anything else)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  serviceName: "my-express-app",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4317" }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

The Express HTTP instrumentation automatically extracts the traceparent header, so any spans you create inside your function handler are correlated with the Ironflow trace.
In Go, extract the trace context from Ironflow’s headers and create child spans:
```go
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Extract trace context from Ironflow's headers
    propagator := otel.GetTextMapPropagator()
    ctx := propagator.Extract(r.Context(), propagation.HeaderCarrier(r.Header))

    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "handle-order")
    defer span.End()

    // Your logic — this span is a child of the Ironflow run span
}
```

### Pull Mode (Workers)
Pull-mode workers use gRPC streaming. Trace context propagation for pull mode is planned for a future release. In the meantime, you can manually correlate traces using the run ID and function ID from the job assignment.
### Without OTel
If you don’t use OpenTelemetry, the trace headers are harmless — they’re standard HTTP headers that your framework will ignore. You can still use the X-Ironflow-* headers for logging and correlation:
```ts
export const POST = serve({
  functions: [
    createFunction(
      { id: "my-function" },
      async ({ event, step, run }) => {
        console.log(`[run=${run.id}] Processing ${event.name}`);
        // ...
      },
    ),
  ],
});
```

```go
func handler(w http.ResponseWriter, r *http.Request) {
    runID := r.Header.Get("X-Ironflow-Run-ID")
    functionID := r.Header.Get("X-Ironflow-Function-ID")
    log.Printf("[run=%s fn=%s] Processing request", runID, functionID)
    // ...
}
```

## How It Works
- Startup: Ironflow reads observability config from environment variables
- Metrics: When enabled, a dedicated Prometheus registry collects metrics from the engine, step manager, event publisher, and HTTP middleware
- Tracing: When configured, the OTel SDK creates a `TracerProvider` with OTLP export and W3C context propagation
- Middleware: Every HTTP request gets an automatic trace span and metrics recording
- Engine: Run execution creates parent spans; step execution creates child spans
- Shutdown: The tracer provider flushes pending spans on graceful shutdown
## CPU and Memory Profiling
Ironflow supports Go’s pprof profiling via a separate debug listener. Enable it with the --pprof flag:
```bash
ironflow serve --pprof
```

This starts pprof handlers on 127.0.0.1:6060 (localhost only), isolated from the main server’s auth middleware. Capture profiles during load testing:
```bash
# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# CPU profile (30 seconds)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Goroutine dump
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Compare heap before/after load
go tool pprof -diff_base heap-before.prof heap-after.prof
```

The `make loadtest` command captures heap and goroutine profiles automatically before and after each load test run. See Benchmarks for the full workflow.
## Health Endpoints
Ironflow provides two health endpoints for Kubernetes probes:
| Endpoint | Purpose | Checks | Auth |
|---|---|---|---|
| `/health` | Liveness probe | PostgreSQL connectivity | No |
| `/ready` | Readiness probe | PostgreSQL + NATS connectivity | No |
The /ready endpoint uses a 2-second timeout on database checks and skips the NATS check when running in dev mode (embedded NATS). In production Helm deployments, the readiness probe points to /ready and the liveness probe points to /health. This prevents NATS transient blips from cascading into pod restarts.
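You can exercise both probes by hand (assuming the default API port from the quick start):

```bash
# Both should return 200 when PostgreSQL (and NATS, for /ready) are reachable
curl -i http://localhost:9123/health
curl -i http://localhost:9123/ready
```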
## Structured Logging
The serve command outputs JSON-formatted logs by default for production log pipeline compatibility (Loki, CloudWatch, ELK). All other CLI commands use human-readable console output.
| Variable | Description | Default |
|---|---|---|
| `LOG_FORMAT` | Set to `text` to force human-readable output for serve | JSON for serve, text for CLI |
| `LOG_LEVEL` | `trace`, `debug`, `info`, `warn`, `error` | `info` for serve, `warn` for CLI |
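For example, to get human-readable debug output from a local serve session:

```bash
# Override the serve defaults (JSON format, info level)
LOG_FORMAT=text LOG_LEVEL=debug ./ironflow serve
```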
## Grafana Dashboards
Four pre-built Grafana dashboards are included in the Helm chart at deploy/helm/ironflow/dashboards/:
- ironflow-performance.json — Run throughput, success/failure rates, latency histograms, worker metrics
- k8s-infrastructure.json — Pod CPU/memory, node status, volume usage, restart rates
- nats-monitoring.json — JetStream storage, throughput, consumer lag, slow consumers
- postgres-cnpg.json — Connection count, query performance, cache hit ratio, replication lag
Dashboards are deployed as Kubernetes ConfigMaps when monitoring.dashboards.enabled=true. Grafana auto-imports them via sidecar (label grafana_dashboard: "1"). For standalone Grafana, import the raw JSON files directly.
Requires IRONFLOW_METRICS_ENABLED=true.
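On an existing Helm release, the flag can be flipped in place; the release name below is illustrative, and the chart path matches the repository layout mentioned above:

```bash
helm upgrade ironflow deploy/helm/ironflow \
  --set monitoring.dashboards.enabled=true
```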
## Production Monitoring Stack
Ironflow’s monitoring stack is deployed via the Helm chart and CLI prerequisites. Set monitoring.dashboards.enabled=true and monitoring.alerts.enabled=true in your values file (enabled by default in small/medium/large/multi-tenant templates). See _internal/plans/simplified-sre-observability-stack.md for the full architecture.
Components:
- kube-prometheus-stack (CLI prerequisite) — Prometheus, Grafana, Alertmanager, kube-state-metrics
- BlackBox Exporter (bootstrap.sh only) — Synthetic HTTP/TCP probes for /health, NATS, and PostgreSQL
- Healthchecks.io — External dead man’s switch (detects total cluster failure)
Alerts: 17 PrometheusRule alerts across three groups — ironflow.critical (7 rules), ironflow.warning (4 rules), and ironflow.sre (6 rules with mixed severities) — covering pod health, error rates, latency, NATS/PG connectivity, disk space, memory pressure, worker status, circuit breakers, DLQ backlog, projection lag, and subscription drops. Deployed by the Helm chart via templates/ironflow-alerts.yaml.
Additional alerts: NATS-specific alerts (templates/nats-alerts.yaml, 4 rules) and PostgreSQL alerts (templates/pg-alerts.yaml, 8 rules) are also in the Helm chart.
Runbooks: Available at _internal/runbooks/ for critical alerts.
Deployment: ironflow deploy --template medium --name staging installs kube-prometheus-stack as a prerequisite and deploys dashboards + alerts via Helm. The Hetzner bootstrap script (deploy/helm/bootstrap.sh) also supports the full stack. Use --skip-monitoring to deploy without it.
## What’s Next?
- Benchmarks — run and interpret performance benchmarks
- Workflows — learn about functions, steps, and execution modes
- Architecture — system design and component overview
- Configuration — full environment variable reference