
Survive a Crash

Ironflow agents are durable — every tool call is memoized as a recorded step, every memory write goes through an event-sourced entity stream. Crash the worker mid-run, restart it, and the agent picks up from the last completed step instead of starting over.

This tutorial walks through the literal demo we run for make demo-agent-crash-resume. Build a 3-step document-processing agent, kill it mid-pipeline, watch it resume.

doc.received ──► OCR ──► classify ──► publish ──► done
                       ▲
                       │
                    kill -9
                  right here
                       │
                       ▼
               worker restarts
                       │
                       ▼
    OCR (cached) ──► classify ──► publish
                         ▲
                         │
                 resumes from here
| Agent state           | Where it lives                        | After kill -9?                      |
| --------------------- | ------------------------------------- | ----------------------------------- |
| Completed step output | Step memoization table (runs.steps)   | ✅ replayed from cache              |
| Memory writes         | Entity stream (agent-memory)          | ✅ projection re-derives            |
| In-flight tool call   | Process memory only                   | ❌ re-runs from start               |
| LLM tokens spent      | Provider-side                         | ❌ already paid for the partial call |

You only re-run the step that was in flight when the worker died. Everything before it is replayed from the cache.
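The replay rule can be sketched as a tiny in-memory memoizer. This is plain TypeScript with illustrative names (StepCache, runStep), not the Ironflow API:

```typescript
// Standalone sketch of step memoization: completed steps are keyed by
// stepId in a cache (standing in for the runs.steps table). A step with
// a cached row is replayed instantly; a step without one re-runs.
type StepCache = Map<string, unknown>;

async function runStep<T>(
  cache: StepCache,
  stepId: string,
  fn: () => Promise<T>,
): Promise<T> {
  if (cache.has(stepId)) {
    return cache.get(stepId) as T; // row exists: replay from cache
  }
  const result = await fn(); // no row: execute for real
  cache.set(stepId, result); // "row written" (durable in a real DB)
  return result;
}

async function main() {
  const cache: StepCache = new Map();
  let ocrRuns = 0;
  const doOcr = () =>
    runStep(cache, "tool.ocr", async () => {
      ocrRuns += 1;
      return { text: "Extracted text" };
    });

  // Run 1: OCR executes, then the "worker" crashes before classify.
  await doOcr();

  // Run 2 (after restart, same durable cache): OCR replays from cache.
  const { text } = await doOcr();
  await runStep(cache, "tool.classify", async () => ({ category: "other" }));

  console.log(`ocr executed ${ocrRuns} time(s), text = ${text}`);
  // prints: ocr executed 1 time(s), text = Extracted text
}

main();
```

The real runtime persists this cache as rows in runs.steps, which is what lets a freshly started process resume a run it has never seen.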

  • Ironflow server running locally — ironflow serve --dev at http://localhost:9123
  • Node.js 20+, pnpm
  • Two terminal windows
  • If running the bundled demo via make demo-agent-crash-resume: make build first so build/ironflow exists.

See Installation if the server is not yet running.

The full code is in examples/agents/doc-processor-agent/. To follow along inline:

Terminal window
mkdir -p crash-demo/src && cd crash-demo

package.json:

{
  "name": "crash-demo",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "tsx watch src/worker.ts",
    "start": "tsx src/worker.ts"
  },
  "dependencies": {
    "@ironflow/node": "*",
    "zod": "^4.3.5"
  },
  "devDependencies": {
    "tsx": "^4.19.0",
    "typescript": "^5.6.0",
    "@types/node": "^22.0.0"
  }
}
Terminal window
pnpm install

src/tools.ts:

import { defineTool } from "@ironflow/node/agent";
import { z } from "zod";

// Slow OCR — simulates a real API. The 3-second sleep is the kill window.
export const ocr = defineTool({
  name: "ocr",
  description: "Extract text from a document image",
  input: z.object({ imageUrl: z.string().url() }),
  handler: async ({ imageUrl }) => {
    console.log(`[ocr] extracting text from ${imageUrl} (3s)...`);
    await new Promise((r) => setTimeout(r, 3000));
    return { text: `Extracted text from ${imageUrl}`, pages: 1 };
  },
});

export const classify = defineTool({
  name: "classify",
  input: z.object({ text: z.string() }),
  handler: async ({ text }) => {
    console.log(`[classify] categorizing ${text.length} chars...`);
    await new Promise((r) => setTimeout(r, 200));
    return { category: text.includes("invoice") ? "invoice" : "other" };
  },
});

export const publish = defineTool({
  name: "publish",
  input: z.object({ docId: z.string(), category: z.string() }),
  handler: async ({ docId, category }) => {
    console.log(`[publish] doc ${docId} → ${category} channel`);
    return { published: true };
  },
});

src/memory.ts:

import { createProjection } from "@ironflow/node";

interface DocState {
  docId: string;
  status: "ocr" | "classified" | "published";
  category?: string;
}

export const docMemory = createProjection({
  name: "doc-processor-memory",
  events: ["doc.processed"],
  initialState: (): Record<string, DocState> => ({}),
  handler: (state, event) => {
    const data = event.data as DocState;
    return { ...state, [data.docId]: data };
  },
});
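This projection is a pure fold over the event stream, which is what makes rebuild-after-restart safe: replaying the same events always yields the same state. A standalone sketch of that property (plain TypeScript, not the Ironflow runtime; the shapes mirror DocState above):

```typescript
// Standalone sketch: an event-sourced projection is a pure fold, so two
// independent rebuilds from the same event log agree exactly.
interface DocState {
  docId: string;
  status: "ocr" | "classified" | "published";
  category?: string;
}

type MemoryState = Record<string, DocState>;

// Same shape as the projection handler: last write per docId wins.
function applyEvent(state: MemoryState, data: DocState): MemoryState {
  return { ...state, [data.docId]: data };
}

function rebuild(events: DocState[]): MemoryState {
  return events.reduce(applyEvent, {} as MemoryState);
}

const eventLog: DocState[] = [
  { docId: "doc-1", status: "ocr" },
  { docId: "doc-1", status: "published", category: "other" },
];

// Rebuild twice, as if before and after a node restart: identical state.
const before = rebuild(eventLog);
const after = rebuild(eventLog);
console.log(JSON.stringify(before) === JSON.stringify(after)); // prints: true
console.log(before["doc-1"].status); // prints: published
```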

src/agent.ts:

import { agent } from "@ironflow/node/agent";
import { ocr, classify, publish } from "./tools.js";

interface DocReceived {
  docId: string;
  imageUrl: string;
}

export const docProcessor = agent(
  {
    id: "doc-processor",
    description: "Durable OCR → classify → publish pipeline",
    triggers: [{ event: "doc.received" }],
    memory: {
      streamId: "agent-memory",
      projection: "doc-processor-memory",
    },
    recording: true,
  },
  async ({ event, tool, memory, logger }) => {
    const { docId, imageUrl } = event.data as DocReceived;
    logger.info(`processing ${docId}`);

    // Each tool() call is a memoized step. Crash here and the
    // worker resumes from the step that was in flight.
    const { text } = await tool(ocr, { imageUrl });
    const { category } = await tool(classify, { text });
    await tool(publish, { docId, category });

    // Memory write is also durable — appends to an entity stream
    // and waits for the projection to catch up before returning.
    await memory.append("doc.processed", {
      docId,
      status: "published",
      category,
    });

    return { docId, category };
  },
);

src/worker.ts:

import { createWorker, type IronflowProjection } from "@ironflow/node";
import { docProcessor } from "./agent.js";
import { docMemory } from "./memory.js";

// createWorker honors an explicit `serverUrl`. Read env so the demo
// script can target a non-default port without editing this file.
// IRONFLOW_URL is the canonical name; IRONFLOW_SERVER_URL is accepted as a fallback.
const serverUrl = process.env.IRONFLOW_URL ?? process.env.IRONFLOW_SERVER_URL;

const worker = createWorker({
  functions: [docProcessor],
  projections: [docMemory as IronflowProjection],
  ...(serverUrl ? { serverUrl } : {}),
});

worker.start().then(() => {
  console.log(`doc-processor worker ready (pid ${process.pid})`);
});

In terminal A start the worker:

Terminal window
pnpm dev

Output:

doc-processor worker ready (pid 41203)

In terminal B trigger a run:

Terminal window
ironflow emit doc.received \
--data '{"docId":"doc-1","imageUrl":"https://example.com/invoice.png"}'

Terminal A logs:

[ocr] extracting text from https://example.com/invoice.png (3s)...
[classify] categorizing 51 chars...
[publish] doc doc-1 → invoice channel

Run completes in about 3.2 seconds. Confirm with ironflow run list or curl http://localhost:9123/api/v1/runs | jq '.[0]'.

Trigger a fresh run and kill the worker during the 3-second OCR sleep:

Terminal window
# Terminal B
ironflow emit doc.received \
--data '{"docId":"doc-2","imageUrl":"https://example.com/receipt.png"}'

Terminal A logs the start:

[ocr] extracting text from https://example.com/receipt.png (3s)...

Now in a third terminal — use kill -9 (SIGKILL), not Ctrl+C. SIGINT triggers a graceful drain that completes in-flight steps; only an ungraceful kill leaves the run orphaned, which is what we want to demonstrate:

Terminal window
# Terminal C
kill -9 $(pgrep -f "tsx.*worker.ts")

Terminal A:

[1] + 41203 killed pnpm dev

The OCR call was in flight. There is no completed step recorded for it. ironflow run list shows doc-2 as running (orphaned).

Terminal window
# Terminal A
pnpm dev

Output (after the redelivery wait — see callout below):

doc-processor worker ready (pid 41789)
[ocr] extracting text from https://example.com/receipt.png (3s)...
[classify] categorizing 51 chars...
[publish] doc doc-2 → other channel

The worker pulled the orphaned run, replayed any cached steps, and resumed at the in-flight one. OCR re-ran because it never completed. If we had killed the worker between OCR and classify, OCR would replay from cache and only classify + publish would re-run.
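The SIGINT-versus-SIGKILL distinction from the kill step is ordinary Unix signal semantics. A minimal Node sketch (illustrative only; the real drain logic lives inside the Ironflow worker):

```typescript
// Illustrative only: a typical worker drains in-flight steps on SIGINT
// (Ctrl+C) before exiting, which is why Ctrl+C cannot produce an
// orphaned run. SIGKILL is uncatchable by design, so kill -9 always
// leaves whatever was in flight unrecorded.
process.on("SIGINT", () => {
  console.log("SIGINT: draining in-flight steps, then exiting...");
  // await inFlightSteps;   // (a real worker finishes current steps here)
  process.exit(0);
});

// Node refuses to install a SIGKILL handler at all:
try {
  process.on("SIGKILL", () => {});
} catch (err) {
  console.log("cannot handle SIGKILL:", (err as Error).message);
}
```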

Verify:

Terminal window
ironflow inspect $(ironflow run list --limit 1 --json | jq -r '.[0].id')

You’ll see steps tool.ocr, tool.classify, tool.publish, memory.append, memory.append.wait all marked complete.

Query the memory projection. The HTTP response wraps user state in a ProjectionState envelope, so use .state.state to unwrap:

Terminal window
curl -s http://localhost:9123/api/v1/projections/doc-processor-memory | jq '.state.state'
{
  "doc-1": { "docId": "doc-1", "status": "published", "category": "invoice" },
  "doc-2": { "docId": "doc-2", "status": "published", "category": "other" }
}

Both docs are present, including the one the worker died on. (The SDK’s client.projections.get() peels the envelope automatically; raw curl does not.)
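From code, the same unwrap is a small helper. This sketch assumes only the endpoint and envelope shape shown in the curl output above; it is not the SDK:

```typescript
// Sketch: peel the ProjectionState envelope by hand when talking to the
// HTTP API directly, mirroring `jq '.state.state'`.
interface ProjectionEnvelope<S> {
  state: { state: S };
}

function unwrapProjection<S>(body: ProjectionEnvelope<S>): S {
  return body.state.state;
}

// Raw-HTTP variant of client.projections.get(), using the endpoint above.
async function getProjectionState<S>(name: string): Promise<S> {
  const res = await fetch(`http://localhost:9123/api/v1/projections/${name}`);
  if (!res.ok) throw new Error(`projection fetch failed: ${res.status}`);
  return unwrapProjection((await res.json()) as ProjectionEnvelope<S>);
}

// Usage (requires a running server):
//   type DocMemory = Record<string, { docId: string; status: string; category?: string }>;
//   const memory = await getProjectionState<DocMemory>("doc-processor-memory");
```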

handler runs           step.run records the result
     │                          │
     ▼                          ▼
tool(ocr, …) ────► step.run("tool.ocr") ────► runs.steps row
                                              (input + output + ts)
        crash here?
        ├──► YES: row never written.
        │         resume re-runs the step.
        └──► NO:  row written.
                  resume replays from cache.

Two pieces do the work:

  1. Step memoization: tool(), llm(), approve(), spawn(), and memory.append() all wrap step.run(). Each call records (runId, stepId, input, output). On replay the runtime returns cached output for any step whose row exists, and only re-executes the first step without a row.

  2. Entity-stream memory: memory.append() writes a doc.processed event to the agent’s entity stream (agent-memory), then waits for the named projection to catch up. The projection partitions state by docId and is rebuilt deterministically from events, so its state survives any node restart and rebuilds identically.

No new server primitives were introduced for any of this. Tools, LLM calls, approvals, spawns, and memory writes all compose over the same step.run + entity-stream surface that powers Ironflow workflows. Agents inherit crash-resume for free.

The Makefile target reproduces the full demo:

Terminal window
make demo-agent-crash-resume

The script:

  1. Starts ironflow serve --dev in the background
  2. Builds the SDK and starts the doc-processor worker
  3. Emits doc.received for doc-2
  4. kill -9s the worker after 1.5s (mid-OCR)
  5. Restarts the worker
  6. Waits for the run to finish, asserts memory state contains doc-2
  7. Tears down

make demo-agent-crash-resume is the literal CI gate behind the agent-survives-crash narrative. It runs in .github/workflows/demo-crash-resume.yml whenever the example, this tutorial, or the Makefile changes.