Observability

Ironflow provides built-in observability through OpenTelemetry distributed tracing and Prometheus metrics. Both are optional — zero overhead when disabled.

Key Concepts

| Feature | Description |
| --- | --- |
| Prometheus Metrics | Counter, histogram, and gauge metrics for runs, steps, events, and HTTP requests |
| OpenTelemetry Tracing | Distributed trace spans for workflow runs, steps, and HTTP requests |
| W3C Trace Context | Automatic propagation of traceparent/tracestate headers |
| Zero Overhead | No performance impact when observability is disabled |

Configuration

Configure observability via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| IRONFLOW_METRICS_ENABLED | Enable Prometheus metrics at /metrics | false |
| IRONFLOW_OTEL_ENDPOINT | OTLP gRPC endpoint for tracing (empty = disabled) | (empty) |
| IRONFLOW_OTEL_SAMPLE_RATE | Trace sampling rate (0.0 to 1.0) | 1.0 |
| IRONFLOW_OTEL_SERVICE_NAME | Service name in trace data | ironflow |
| IRONFLOW_OTEL_INSECURE | Use plaintext gRPC for OTLP export (set false for TLS) | true |
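
For example, a production deployment exporting traces through a collector might combine these settings (the endpoint, sample rate, and TLS choice are illustrative; adjust for your environment):

```sh
# Illustrative production settings; adjust for your environment
export IRONFLOW_METRICS_ENABLED=true
export IRONFLOW_OTEL_ENDPOINT=otel-collector.monitoring.svc:4317  # hypothetical collector address
export IRONFLOW_OTEL_SAMPLE_RATE=0.1
export IRONFLOW_OTEL_INSECURE=false  # TLS to the collector
./ironflow serve
```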

Prometheus Metrics

When IRONFLOW_METRICS_ENABLED=true, the /metrics endpoint on the main API port serves metrics in the Prometheus exposition format.

Available Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ironflow_runs_total | Counter | function_id, status, environment, failure_cause | Total workflow runs. failure_cause is platform or user for failed runs, empty for completed. |
| ironflow_run_duration_seconds | Histogram | function_id, environment | Run execution duration |
| ironflow_steps_total | Counter | function_id, step_name, status | Total step executions |
| ironflow_step_duration_seconds | Histogram | function_id, step_name | Step execution duration |
| ironflow_events_emitted_total | Counter | event_name, environment | Events emitted |
| ironflow_active_runs | Gauge | function_id, environment | Currently active runs |
| ironflow_http_requests_total | Counter | method, path, status_code | HTTP requests |
| ironflow_http_request_duration_seconds | Histogram | method, path | HTTP request duration |
| ironflow_workers_connected | Gauge | (none) | Currently connected pull-mode workers |
| ironflow_worker_disconnects_total | Counter | reason | Worker disconnections by reason |
| ironflow_worker_active_jobs | Gauge | (none) | Active jobs across all workers |
| ironflow_outbox_dead_letter_count | Gauge | env | Current rows in the outbox dead-letter table per environment, computed at scrape time (zero drift across nodes and restarts). See the Outbox DLQ runbook. |
| ironflow_outbox_dlq_collector_errors_total | Counter | (none) | Scrape-time failures of the DLQ count query. Non-zero means the gauge above is stale; distinguishes “DLQ is empty” from “collector can’t query the DB”. |
| ironflow_dlq_writes_total | Counter | source | DLQ moves since process start. source="outbox" fires every time the outbox worker exhausts the retry budget (default 10) and moves a row to outbox_dead_letter. |
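
The two outbox DLQ collector metrics are designed to be read together; for example:

```promql
# Rows currently parked in the outbox dead-letter table, per environment
ironflow_outbox_dead_letter_count

# Non-zero here means the gauge above may be stale
increase(ironflow_outbox_dlq_collector_errors_total[10m]) > 0
```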

Quick Start

```sh
# Enable metrics
IRONFLOW_METRICS_ENABLED=true ./ironflow serve

# Verify the metrics endpoint
curl http://localhost:9123/metrics
```

The /metrics endpoint does not require authentication.
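
If you run your own Prometheus instead of the Compose profile below, a minimal scrape job might look like this sketch (the job name and target are assumptions; point the target at your API port):

```yaml
# prometheus.yml sketch; target assumes a local Ironflow on the default port
scrape_configs:
  - job_name: ironflow
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:9123"]
```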

Docker Compose with Prometheus

Use the monitoring profile to start Prometheus alongside Ironflow:

```sh
docker compose --profile monitoring up
```

Prometheus is then available at http://localhost:9090 and automatically scrapes Ironflow metrics.

Grafana Integration

Point Grafana to your Prometheus instance and use these example queries:

```promql
# Request rate (per second)
rate(ironflow_http_requests_total[5m])

# Run completion rate by function
rate(ironflow_runs_total{status="completed"}[5m])

# P95 run duration
histogram_quantile(0.95, rate(ironflow_run_duration_seconds_bucket[5m]))

# Active runs
ironflow_active_runs

# Error rate
rate(ironflow_runs_total{status="failed"}[5m]) / rate(ironflow_runs_total[5m])
```
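
If you manage your own Prometheus rules (the Helm chart ships its own; see Production Monitoring Stack below), the error-rate query translates into an alert along these lines (group name, threshold, and duration are assumptions):

```yaml
# Sketch of a self-managed alerting rule; tune the threshold for your traffic
groups:
  - name: ironflow.example
    rules:
      - alert: IronflowHighRunFailureRate
        expr: |
          sum(rate(ironflow_runs_total{status="failed"}[5m]))
            / sum(rate(ironflow_runs_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
```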

OpenTelemetry Tracing

When IRONFLOW_OTEL_ENDPOINT is set, Ironflow exports trace data via OTLP gRPC.

Span Hierarchy

```text
HTTP Request (server span)
└── Run: {functionID} (internal span)
    ├── Step: {stepName} (internal span)
    ├── Step: {stepName} (internal span)
    └── Step: {stepName} (internal span)
```

Each span includes attributes:

  • Run spans: run.id, function.id, function.name
  • Step spans: step.id, step.name, step.type, run.id
  • HTTP spans: http.request.method, url.path, http.response.status_code

Quick Start with Jaeger

```sh
# Start Jaeger (all-in-one)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/jaeger:latest

# Start Ironflow with tracing
IRONFLOW_OTEL_ENDPOINT=localhost:4317 ./ironflow serve
```

View traces at http://localhost:16686.

Sampling

Control trace sampling with IRONFLOW_OTEL_SAMPLE_RATE:

  • 1.0 — Sample all traces (development)
  • 0.1 — Sample 10% of traces (production)
  • 0.01 — Sample 1% of traces (high-traffic production)

Parent-based sampling is used: if an incoming request carries a sampled trace context, it will always be sampled regardless of the rate.
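
A typical production setup, using the settings described above:

```sh
# Export 10% of new traces; incoming sampled contexts are always honored
IRONFLOW_OTEL_ENDPOINT=localhost:4317 \
IRONFLOW_OTEL_SAMPLE_RATE=0.1 \
./ironflow serve
```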


Instrumenting Your Functions

Ironflow automatically propagates trace context to your functions. Here’s how to use it in your code.

Trace Context Propagation

When tracing is enabled, Ironflow injects W3C trace context headers (traceparent, tracestate) into every HTTP request sent to your push-mode functions. This means your function’s spans automatically appear as children of Ironflow’s run span — no manual correlation needed.

```text
Ironflow Server                        Your Function
───────────────                        ─────────────
[Run: order-processor] ──HTTP POST──▶  [handle-order]
├── [Step: validate]                   ├── [query-database]
├── [Step: charge]                     └── [send-notification]
└── [Step: fulfill]
```

Both sides share the same trace ID, giving you a single end-to-end trace view in Jaeger, Grafana Tempo, or any OTLP-compatible backend.

Push Mode (Next.js, Express, Lambda)

Your function receives traceparent and tracestate as standard HTTP headers alongside Ironflow headers:

| Header | Description |
| --- | --- |
| traceparent | W3C trace context (trace ID, span ID, flags) |
| tracestate | Vendor-specific trace data |
| X-Ironflow-Run-ID | The current run ID |
| X-Ironflow-Function-ID | The function being executed |
| X-Ironflow-Attempt | Current retry attempt number |

To create child spans in your function, initialize the OTel SDK with W3C propagation and extract the context from incoming headers. Your framework’s OTel instrumentation typically handles this automatically.

Set up OTel instrumentation:

instrumentation.ts

```ts
import { registerOTel } from "@vercel/otel";

registerOTel({
  serviceName: "my-nextjs-app",
});
```

Create spans inside your function — they automatically become children of the Ironflow run span:

app/api/ironflow/route.ts

```ts
import { serve, createFunction } from "@ironflow/node";
import { trace } from "@opentelemetry/api";

const myFunction = createFunction(
  {
    id: "order-processor",
  },
  async ({ step }) => {
    const tracer = trace.getTracer("my-app");
    const result = await tracer.startActiveSpan("process-order", async (span) => {
      try {
        const order = await step.run("validate", async () => {
          // your logic
        });
        return order;
      } finally {
        span.end();
      }
    });
    return result;
  },
);

export const POST = serve({ functions: [myFunction] });
```

Pull Mode (Workers)

Pull-mode workers use gRPC streaming. Trace context propagation for pull mode is planned for a future release. In the meantime, you can manually correlate traces using the run ID and function ID from the job assignment.
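
A minimal sketch of that manual correlation, assuming your worker handler receives the run and function IDs with the job assignment (the handler shape and field names here are hypothetical):

```ts
import { trace } from "@opentelemetry/api";

// Hypothetical job shape; adapt to how your worker receives assignments
async function handleJob(job: { runId: string; functionId: string }) {
  const tracer = trace.getTracer("my-worker");
  await tracer.startActiveSpan("handle-job", async (span) => {
    // Record Ironflow identifiers so traces can be joined on run.id later
    span.setAttribute("run.id", job.runId);
    span.setAttribute("function.id", job.functionId);
    try {
      // ... your job logic
    } finally {
      span.end();
    }
  });
}
```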

Without OTel

If you don’t use OpenTelemetry, the trace headers are harmless — they’re standard HTTP headers that your framework will ignore. You can still use the X-Ironflow-* headers for logging and correlation:

```ts
export const POST = serve({
  functions: [
    createFunction(
      { id: "my-function" },
      async ({ event, step, run }) => {
        console.log(`[run=${run.id}] Processing ${event.name}`);
        // ...
      },
    ),
  ],
});
```

How It Works

  1. Startup: Ironflow reads observability config from environment variables
  2. Metrics: When enabled, a dedicated Prometheus registry collects metrics from the engine, step manager, event publisher, and HTTP middleware
  3. Tracing: When configured, the OTel SDK creates a TracerProvider with OTLP export and W3C context propagation
  4. Middleware: Every HTTP request gets an automatic trace span and metrics recording
  5. Engine: Run execution creates parent spans; step execution creates child spans
  6. Shutdown: The tracer provider flushes pending spans on graceful shutdown

CPU and Memory Profiling

Ironflow supports Go’s pprof profiling via a separate debug listener. Enable it with the --pprof flag:

```sh
ironflow serve --pprof
```

This starts pprof handlers on 127.0.0.1:6060 (localhost only), isolated from the main server’s auth middleware. Capture profiles during load testing:

```sh
# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# CPU profile (30 seconds)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Goroutine dump
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Compare heap before/after load
go tool pprof -diff_base heap-before.prof heap-after.prof
```

The make loadtest command captures heap and goroutine profiles automatically before and after each load test run. See Benchmarks for the full workflow.

Health Endpoints

Ironflow provides two health endpoints for Kubernetes probes:

| Endpoint | Purpose | Checks | Auth |
| --- | --- | --- | --- |
| /health | Liveness probe | PostgreSQL connectivity | No |
| /ready | Readiness probe | PostgreSQL + NATS connectivity | No |

The /ready endpoint uses a 2-second timeout on database checks and skips the NATS check when running in dev mode (embedded NATS). In production Helm deployments, the readiness probe points to /ready and the liveness probe points to /health. This keeps transient NATS blips from cascading into pod restarts.
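
Outside the Helm chart, the probes might be wired like this sketch (the port is an assumption; match your API port):

```yaml
# Kubernetes probe sketch for an Ironflow container
livenessProbe:
  httpGet:
    path: /health
    port: 9123
readinessProbe:
  httpGet:
    path: /ready
    port: 9123
```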

Structured Logging

The serve command outputs JSON-formatted logs by default for production log pipeline compatibility (Loki, CloudWatch, ELK). All other CLI commands use human-readable console output.

| Variable | Description | Default |
| --- | --- | --- |
| LOG_FORMAT | Set to text to force human-readable output for serve | JSON for serve, text for CLI |
| LOG_LEVEL | trace, debug, info, warn, error | info for serve, warn for CLI |
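
For example, to get human-readable debug logs from a local serve run:

```sh
LOG_FORMAT=text LOG_LEVEL=debug ./ironflow serve
```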

Grafana Dashboards

Four pre-built Grafana dashboards are included in the Helm chart at deploy/helm/ironflow/dashboards/:

  • ironflow-performance.json — Run throughput, success/failure rates, latency histograms, worker metrics
  • k8s-infrastructure.json — Pod CPU/memory, node status, volume usage, restart rates
  • nats-monitoring.json — JetStream storage, throughput, consumer lag, slow consumers
  • postgres-cnpg.json — Connection count, query performance, cache hit ratio, replication lag

Dashboards are deployed as Kubernetes ConfigMaps when monitoring.dashboards.enabled=true. Grafana auto-imports them via sidecar (label grafana_dashboard: "1"). For standalone Grafana, import the raw JSON files directly.

Requires IRONFLOW_METRICS_ENABLED=true.

Production Monitoring Stack

Ironflow’s monitoring stack is deployed via the Helm chart and CLI prerequisites. Set monitoring.dashboards.enabled=true and monitoring.alerts.enabled=true in your values file (enabled by default in small/medium/large/multi-tenant templates). See _internal/plans/simplified-sre-observability-stack.md for the full architecture.
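
In values-file form, the two settings from the templates look like this sketch:

```yaml
# Helm values sketch; both flags are enabled by default in the templates
monitoring:
  dashboards:
    enabled: true
  alerts:
    enabled: true
```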

Components:

  • kube-prometheus-stack (CLI prerequisite) — Prometheus, Grafana, Alertmanager, kube-state-metrics
  • BlackBox Exporter (bootstrap.sh only) — Synthetic HTTP/TCP probes for /health, NATS, and PostgreSQL
  • Healthchecks.io — External dead man’s switch (detects total cluster failure)

Alerts: 17 PrometheusRule alerts across three groups — ironflow.critical (7 rules), ironflow.warning (4 rules), and ironflow.sre (6 rules with mixed severities) — covering pod health, error rates, latency, NATS/PG connectivity, disk space, memory pressure, worker status, circuit breakers, DLQ backlog, projection lag, and subscription drops. Deployed by the Helm chart via templates/ironflow-alerts.yaml.

Additional alerts: NATS-specific alerts (templates/nats-alerts.yaml, 4 rules) and PostgreSQL alerts (templates/pg-alerts.yaml, 8 rules) are also in the Helm chart.

Runbooks: Available at _internal/runbooks/ for critical alerts.

Deployment: ironflow deploy --template medium --name staging installs kube-prometheus-stack as a prerequisite and deploys dashboards + alerts via Helm. The Hetzner bootstrap script (deploy/helm/bootstrap.sh) also supports the full stack. Use --skip-monitoring to deploy without it.
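
For example, using the flags documented above:

```sh
# Full stack: kube-prometheus-stack prerequisite, dashboards, and alerts
ironflow deploy --template medium --name staging

# Same deployment without the monitoring stack
ironflow deploy --template medium --name staging --skip-monitoring
```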


What’s Next?

  • Benchmarks — run and interpret performance benchmarks
  • Workflows — learn about functions, steps, and execution modes
  • Architecture — system design and component overview
  • Configuration — full environment variable reference