Observability

Ironflow provides built-in observability through OpenTelemetry distributed tracing and Prometheus metrics. Both are optional — zero overhead when disabled.

Key Concepts

| Feature | Description |
| --- | --- |
| Prometheus Metrics | Counter, histogram, and gauge metrics for runs, steps, events, and HTTP requests |
| OpenTelemetry Tracing | Distributed trace spans for workflow runs, steps, and HTTP requests |
| W3C Trace Context | Automatic propagation of traceparent/tracestate headers |
| Zero Overhead | No performance impact when observability is disabled |

Configuration

Configure observability via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| IRONFLOW_METRICS_ENABLED | Enable Prometheus metrics at /metrics | false |
| IRONFLOW_OTEL_ENDPOINT | OTLP gRPC endpoint for tracing (empty = disabled) | (empty) |
| IRONFLOW_OTEL_SAMPLE_RATE | Trace sampling rate (0.0 to 1.0) | 1.0 |
| IRONFLOW_OTEL_SERVICE_NAME | Service name in trace data | ironflow |
| IRONFLOW_OTEL_INSECURE | Use plaintext gRPC for OTLP export (set false for TLS) | true |
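
For example, a production deployment exporting traces through a collector might combine these settings (the endpoint, sample rate, and TLS choice are illustrative; adjust for your environment):

```sh
# Illustrative production settings; adjust for your environment
export IRONFLOW_METRICS_ENABLED=true
export IRONFLOW_OTEL_ENDPOINT=otel-collector.monitoring.svc:4317  # hypothetical collector address
export IRONFLOW_OTEL_SAMPLE_RATE=0.1
export IRONFLOW_OTEL_INSECURE=false  # TLS to the collector
./ironflow serve
```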

Prometheus Metrics

When IRONFLOW_METRICS_ENABLED=true, the /metrics endpoint on the main API port serves metrics in the Prometheus exposition format.

Available Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ironflow_runs_total | Counter | function_id, status, environment, failure_cause | Total workflow runs. failure_cause is platform or user for failed runs, empty for completed. |
| ironflow_run_duration_seconds | Histogram | function_id, environment | Run execution duration |
| ironflow_steps_total | Counter | function_id, step_name, status | Total step executions |
| ironflow_step_duration_seconds | Histogram | function_id, step_name | Step execution duration |
| ironflow_events_emitted_total | Counter | event_name, environment | Events emitted |
| ironflow_active_runs | Gauge | function_id, environment | Currently active runs |
| ironflow_http_requests_total | Counter | method, path, status_code | HTTP requests |
| ironflow_http_request_duration_seconds | Histogram | method, path | HTTP request duration |
| ironflow_workers_connected | Gauge | (none) | Currently connected pull-mode workers |
| ironflow_worker_disconnects_total | Counter | reason | Worker disconnections by reason |
| ironflow_worker_active_jobs | Gauge | (none) | Active jobs across all workers |
| ironflow_outbox_dead_letter_count | Gauge | env | Current rows in the outbox dead-letter table per environment, computed at scrape time (zero drift across nodes and restarts). See the Outbox DLQ runbook. |
| ironflow_outbox_dlq_collector_errors_total | Counter | (none) | Scrape-time failures of the DLQ count query. Non-zero means the gauge above is stale; distinguishes “DLQ is empty” from “collector can’t query the DB”. |
| ironflow_dlq_writes_total | Counter | source | DLQ moves since process start. source="outbox" fires every time the outbox worker exhausts the retry budget (default 10) and moves a row to outbox_dead_letter. |
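
The two outbox DLQ collector metrics are designed to be read together; for example:

```promql
# Rows currently parked in the outbox dead-letter table, per environment
ironflow_outbox_dead_letter_count

# Non-zero here means the gauge above may be stale
increase(ironflow_outbox_dlq_collector_errors_total[10m]) > 0
```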

Quick Start

```sh
# Enable metrics
IRONFLOW_METRICS_ENABLED=true ./ironflow serve

# Verify the metrics endpoint
curl http://localhost:9123/metrics
```

The /metrics endpoint does not require authentication.
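
If you run your own Prometheus instead of the Compose profile below, a minimal scrape job might look like this sketch (the job name and target are assumptions; point the target at your API port):

```yaml
# prometheus.yml sketch; target assumes a local Ironflow on the default port
scrape_configs:
  - job_name: ironflow
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:9123"]
```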

Docker Compose with Prometheus

Use the monitoring profile to start Prometheus alongside Ironflow:

```sh
docker compose --profile monitoring up
```

Prometheus is then available at http://localhost:9090 and automatically scrapes Ironflow metrics.

Grafana Integration

Point Grafana to your Prometheus instance and use these example queries:

```promql
# Request rate (per second)
rate(ironflow_http_requests_total[5m])

# Run completion rate by function
rate(ironflow_runs_total{status="completed"}[5m])

# P95 run duration
histogram_quantile(0.95, rate(ironflow_run_duration_seconds_bucket[5m]))

# Active runs
ironflow_active_runs

# Error rate
rate(ironflow_runs_total{status="failed"}[5m]) / rate(ironflow_runs_total[5m])
```
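
If you manage your own Prometheus rules (the Helm chart ships its own; see Production Monitoring Stack below), the error-rate query translates into an alert along these lines (group name, threshold, and duration are assumptions):

```yaml
# Sketch of a self-managed alerting rule; tune the threshold for your traffic
groups:
  - name: ironflow.example
    rules:
      - alert: IronflowHighRunFailureRate
        expr: |
          sum(rate(ironflow_runs_total{status="failed"}[5m]))
            / sum(rate(ironflow_runs_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
```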

OpenTelemetry Tracing

When IRONFLOW_OTEL_ENDPOINT is set, Ironflow exports trace data via OTLP gRPC.

Span Hierarchy

```text
HTTP Request (server span)
└── Run: {functionID} (internal span)
    ├── Step: {stepName} (internal span)
    ├── Step: {stepName} (internal span)
    └── Step: {stepName} (internal span)
```

Each span includes attributes:

  • Run spans: run.id, function.id, function.name
  • Step spans: step.id, step.name, step.type, run.id
  • HTTP spans: http.request.method, url.path, http.response.status_code

Quick Start with Jaeger

```sh
# Start Jaeger (all-in-one)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/jaeger:latest

# Start Ironflow with tracing
IRONFLOW_OTEL_ENDPOINT=localhost:4317 ./ironflow serve
```

View traces at http://localhost:16686.

Sampling

Control trace sampling with IRONFLOW_OTEL_SAMPLE_RATE:

  • 1.0 — Sample all traces (development)
  • 0.1 — Sample 10% of traces (production)
  • 0.01 — Sample 1% of traces (high-traffic production)

Parent-based sampling is used: if an incoming request carries a sampled trace context, it will always be sampled regardless of the rate.
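
A typical production setup, using the settings described above:

```sh
# Export 10% of new traces; incoming sampled contexts are always honored
IRONFLOW_OTEL_ENDPOINT=localhost:4317 \
IRONFLOW_OTEL_SAMPLE_RATE=0.1 \
./ironflow serve
```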


Instrumenting Your Functions

Ironflow automatically propagates trace context to your functions. Here’s how to use it in your code.

Trace Context Propagation

When tracing is enabled, Ironflow injects W3C trace context headers (traceparent, tracestate) into every HTTP request sent to your push-mode functions. This means your function’s spans automatically appear as children of Ironflow’s run span — no manual correlation needed.

```text
Ironflow Server                        Your Function
───────────────                        ─────────────
[Run: order-processor] ──HTTP POST──▶  [handle-order]
├── [Step: validate]                   ├── [query-database]
├── [Step: charge]                     └── [send-notification]
└── [Step: fulfill]
```

Both sides share the same trace ID, giving you a single end-to-end trace view in Jaeger, Grafana Tempo, or any OTLP-compatible backend.

Push Mode (Next.js, Express, Lambda)

Your function receives traceparent and tracestate as standard HTTP headers alongside Ironflow headers:

| Header | Description |
| --- | --- |
| traceparent | W3C trace context (trace ID, span ID, flags) |
| tracestate | Vendor-specific trace data |
| X-Ironflow-Run-ID | The current run ID |
| X-Ironflow-Function-ID | The function being executed |
| X-Ironflow-Attempt | Current retry attempt number |

To create child spans in your function, initialize the OTel SDK with W3C propagation and extract the context from incoming headers. Your framework’s OTel instrumentation typically handles this automatically.

Set up OTel instrumentation:

instrumentation.ts

```ts
import { registerOTel } from "@vercel/otel";

registerOTel({
  serviceName: "my-nextjs-app",
});
```

Create spans inside your function — they automatically become children of the Ironflow run span:

app/api/ironflow/route.ts

```ts
import { serve, createFunction } from "@ironflow/node";
import { trace } from "@opentelemetry/api";

const myFunction = createFunction(
  {
    id: "order-processor",
  },
  async ({ step }) => {
    const tracer = trace.getTracer("my-app");
    const result = await tracer.startActiveSpan("process-order", async (span) => {
      try {
        const order = await step.run("validate", async () => {
          // your logic
        });
        return order;
      } finally {
        span.end();
      }
    });
    return result;
  },
);

export const POST = serve({ functions: [myFunction] });
```

Pull Mode (Workers)

Pull-mode workers use gRPC streaming. Trace context propagation for pull mode is planned for a future release. In the meantime, you can manually correlate traces using the run ID and function ID from the job assignment.
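
A minimal sketch of that manual correlation, assuming your worker handler receives the run and function IDs with the job assignment (the handler shape and field names here are hypothetical):

```ts
import { trace } from "@opentelemetry/api";

// Hypothetical job shape; adapt to how your worker receives assignments
async function handleJob(job: { runId: string; functionId: string }) {
  const tracer = trace.getTracer("my-worker");
  await tracer.startActiveSpan("handle-job", async (span) => {
    // Record Ironflow identifiers so traces can be joined on run.id later
    span.setAttribute("run.id", job.runId);
    span.setAttribute("function.id", job.functionId);
    try {
      // ... your job logic
    } finally {
      span.end();
    }
  });
}
```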

Without OTel

If you don’t use OpenTelemetry, the trace headers are harmless — they’re standard HTTP headers that your framework will ignore. You can still use the X-Ironflow-* headers for logging and correlation:

```ts
export const POST = serve({
  functions: [
    createFunction(
      { id: "my-function" },
      async ({ event, step, run }) => {
        console.log(`[run=${run.id}] Processing ${event.name}`);
        // ...
      },
    ),
  ],
});
```

How It Works

  1. Startup: Ironflow reads observability config from environment variables
  2. Metrics: When enabled, a dedicated Prometheus registry collects metrics from the engine, step manager, event publisher, and HTTP middleware
  3. Tracing: When configured, the OTel SDK creates a TracerProvider with OTLP export and W3C context propagation
  4. Middleware: Every HTTP request gets an automatic trace span and metrics recording
  5. Engine: Run execution creates parent spans; step execution creates child spans
  6. Shutdown: The tracer provider flushes pending spans on graceful shutdown

CPU and Memory Profiling

Ironflow supports Go’s pprof profiling via a separate debug listener. Enable it with the --pprof flag:

```sh
ironflow serve --pprof
```

This starts pprof handlers on 127.0.0.1:6060 (localhost only), isolated from the main server’s auth middleware. Capture profiles during load testing:

```sh
# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# CPU profile (30 seconds)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Goroutine dump
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Compare heap before/after load
go tool pprof -diff_base heap-before.prof heap-after.prof
```

The make loadtest command captures heap and goroutine profiles automatically before and after each load test run. See Benchmarks for the full workflow.

Health Endpoints

Ironflow provides two health endpoints for Kubernetes probes:

| Endpoint | Purpose | Checks | Auth |
| --- | --- | --- | --- |
| /health | Liveness probe | PostgreSQL connectivity | No |
| /ready | Readiness probe | PostgreSQL + NATS connectivity | No |

The /ready endpoint uses a 2-second timeout on database checks and skips the NATS check when running in dev mode (embedded NATS). In production Helm deployments, the readiness probe points to /ready and the liveness probe points to /health. This keeps transient NATS blips from cascading into pod restarts.
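
Outside the Helm chart, the probes might be wired like this sketch (the port is an assumption; match your API port):

```yaml
# Kubernetes probe sketch for an Ironflow container
livenessProbe:
  httpGet:
    path: /health
    port: 9123
readinessProbe:
  httpGet:
    path: /ready
    port: 9123
```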

Structured Logging

The serve command outputs JSON-formatted logs by default for production log pipeline compatibility (Loki, CloudWatch, ELK). All other CLI commands use human-readable console output.

| Variable | Description | Default |
| --- | --- | --- |
| LOG_FORMAT | Set to text to force human-readable output for serve | JSON for serve, text for CLI |
| LOG_LEVEL | trace, debug, info, warn, error | info for serve, warn for CLI |
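
For example, to get human-readable debug logs from a local serve run:

```sh
LOG_FORMAT=text LOG_LEVEL=debug ./ironflow serve
```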

Grafana Dashboards

Four pre-built Grafana dashboards are included in the Helm chart at deploy/helm/ironflow/dashboards/:

  • ironflow-performance.json — Run throughput, success/failure rates, latency histograms, worker metrics
  • k8s-infrastructure.json — Pod CPU/memory, node status, volume usage, restart rates
  • nats-monitoring.json — JetStream storage, throughput, consumer lag, slow consumers
  • postgres-cnpg.json — Connection count, query performance, cache hit ratio, replication lag

Dashboards are deployed as Kubernetes ConfigMaps when monitoring.dashboards.enabled=true. Grafana auto-imports them via sidecar (label grafana_dashboard: "1"). For standalone Grafana, import the raw JSON files directly.

Requires IRONFLOW_METRICS_ENABLED=true.

Production Monitoring Stack

Ironflow’s monitoring stack is deployed via the Helm chart and CLI prerequisites. Set monitoring.dashboards.enabled=true and monitoring.alerts.enabled=true in your values file (enabled by default in small/medium/large/multi-tenant templates). See _internal/plans/simplified-sre-observability-stack.md for the full architecture.
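
In values-file form, the two settings from the templates look like this sketch:

```yaml
# Helm values sketch; both flags are enabled by default in the templates
monitoring:
  dashboards:
    enabled: true
  alerts:
    enabled: true
```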

Components:

  • kube-prometheus-stack (CLI prerequisite) — Prometheus, Grafana, Alertmanager, kube-state-metrics
  • BlackBox Exporter (bootstrap.sh only) — Synthetic HTTP/TCP probes for /health, NATS, and PostgreSQL
  • Healthchecks.io — External dead man’s switch (detects total cluster failure)

Alerts: 17 PrometheusRule alerts across three groups — ironflow.critical (7 rules), ironflow.warning (4 rules), and ironflow.sre (6 rules with mixed severities) — covering pod health, error rates, latency, NATS/PG connectivity, disk space, memory pressure, worker status, circuit breakers, DLQ backlog, projection lag, and subscription drops. Deployed by the Helm chart via templates/ironflow-alerts.yaml.

Additional alerts: NATS-specific alerts (templates/nats-alerts.yaml, 4 rules) and PostgreSQL alerts (templates/pg-alerts.yaml, 8 rules) are also in the Helm chart.

Runbooks: Available at _internal/runbooks/ for critical alerts.

Deployment: ironflow deploy --template medium --name staging installs kube-prometheus-stack as a prerequisite and deploys dashboards + alerts via Helm. The Hetzner bootstrap script (deploy/helm/bootstrap.sh) also supports the full stack. Use --skip-monitoring to deploy without it.
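
For example, using the flags documented above:

```sh
# Full stack: kube-prometheus-stack prerequisite, dashboards, and alerts
ironflow deploy --template medium --name staging

# Same deployment without the monitoring stack
ironflow deploy --template medium --name staging --skip-monitoring
```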


What’s Next?

  • Benchmarks — run and interpret performance benchmarks
  • Workflows — learn about functions, steps, and execution modes
  • Architecture — system design and component overview
  • Configuration — full environment variable reference