
Survive a Crash

Ironflow agents are durable — every tool call is memoized as a recorded step, every memory write goes through an event-sourced entity stream. Crash the worker mid-run, restart it, and the agent picks up from the last completed step instead of starting over.

This tutorial walks through the literal demo we run for make demo-agent-crash-resume. Build a 3-step document-processing agent, kill it mid-pipeline, watch it resume.

doc.received ──► OCR ──► classify ──► publish ──► done
                       ▲
                       │
                    kill -9
                  right here
                       │
                       ▼
               worker restarts
                       │
                       ▼
    OCR (cached) ──► classify ──► publish
                         ▲
                         │
                 resumes from here
| Agent state           | Where it lives                        | After kill -9?                      |
| --------------------- | ------------------------------------- | ----------------------------------- |
| Completed step output | Step memoization table (runs.steps)   | ✅ replayed from cache              |
| Memory writes         | Entity stream (agent-memory)          | ✅ projection re-derives            |
| In-flight tool call   | Process memory only                   | ❌ re-runs from start               |
| LLM tokens spent      | Provider-side                         | ❌ already paid for the partial call |

You only re-run the step that was in flight when the worker died. Everything before it is replayed from the cache.
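The replay rule can be sketched as a tiny in-memory memoizer. This is plain TypeScript with illustrative names (StepCache, runStep), not the Ironflow API:

```typescript
// Standalone sketch of step memoization: completed steps are keyed by
// stepId in a cache (standing in for the runs.steps table). A step with
// a cached row is replayed instantly; a step without one re-runs.
type StepCache = Map<string, unknown>;

async function runStep<T>(
  cache: StepCache,
  stepId: string,
  fn: () => Promise<T>,
): Promise<T> {
  if (cache.has(stepId)) {
    return cache.get(stepId) as T; // row exists: replay from cache
  }
  const result = await fn(); // no row: execute for real
  cache.set(stepId, result); // "row written" (durable in a real DB)
  return result;
}

async function main() {
  const cache: StepCache = new Map();
  let ocrRuns = 0;
  const doOcr = () =>
    runStep(cache, "tool.ocr", async () => {
      ocrRuns += 1;
      return { text: "Extracted text" };
    });

  // Run 1: OCR executes, then the "worker" crashes before classify.
  await doOcr();

  // Run 2 (after restart, same durable cache): OCR replays from cache.
  const { text } = await doOcr();
  await runStep(cache, "tool.classify", async () => ({ category: "other" }));

  console.log(`ocr executed ${ocrRuns} time(s), text = ${text}`);
  // prints: ocr executed 1 time(s), text = Extracted text
}

main();
```

The real runtime persists this cache as rows in runs.steps, which is what lets a freshly started process resume a run it has never seen.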

  • Ironflow server running locally — ironflow serve --dev at http://localhost:9123
  • Node.js 20+, pnpm
  • Two terminal windows
  • If running the bundled demo via make demo-agent-crash-resume: make build first so build/ironflow exists.

See Installation if the server is not yet running.

The full code is in examples/agents/doc-processor-agent/. To follow along inline:

Terminal window
mkdir -p crash-demo/src && cd crash-demo

package.json:

{
  "name": "crash-demo",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "tsx watch src/worker.ts",
    "start": "tsx src/worker.ts"
  },
  "dependencies": {
    "@ironflow/node": "*",
    "zod": "^4.3.5"
  },
  "devDependencies": {
    "tsx": "^4.19.0",
    "typescript": "^5.6.0",
    "@types/node": "^22.0.0"
  }
}
Terminal window
pnpm install

src/tools.ts:

import { defineTool } from "@ironflow/node/agent";
import { z } from "zod";

// Slow OCR — simulates a real API. The 3-second sleep is the kill window.
export const ocr = defineTool({
  name: "ocr",
  description: "Extract text from a document image",
  input: z.object({ imageUrl: z.string().url() }),
  handler: async ({ imageUrl }) => {
    console.log(`[ocr] extracting text from ${imageUrl} (3s)...`);
    await new Promise((r) => setTimeout(r, 3000));
    return { text: `Extracted text from ${imageUrl}`, pages: 1 };
  },
});

export const classify = defineTool({
  name: "classify",
  input: z.object({ text: z.string() }),
  handler: async ({ text }) => {
    console.log(`[classify] categorizing ${text.length} chars...`);
    await new Promise((r) => setTimeout(r, 200));
    return { category: text.includes("invoice") ? "invoice" : "other" };
  },
});

export const publish = defineTool({
  name: "publish",
  input: z.object({ docId: z.string(), category: z.string() }),
  handler: async ({ docId, category }) => {
    console.log(`[publish] doc ${docId} → ${category} channel`);
    return { published: true };
  },
});

src/memory.ts:

import { createProjection } from "@ironflow/node";

interface DocState {
  docId: string;
  status: "ocr" | "classified" | "published";
  category?: string;
}

export const docMemory = createProjection({
  name: "doc-processor-memory",
  events: ["doc.processed"],
  initialState: (): Record<string, DocState> => ({}),
  handler: (state, event) => {
    const data = event.data as DocState;
    return { ...state, [data.docId]: data };
  },
});
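This projection is a pure fold over the event stream, which is what makes rebuild-after-restart safe: replaying the same events always yields the same state. A standalone sketch of that property (plain TypeScript, not the Ironflow runtime; the shapes mirror DocState above):

```typescript
// Standalone sketch: an event-sourced projection is a pure fold, so two
// independent rebuilds from the same event log agree exactly.
interface DocState {
  docId: string;
  status: "ocr" | "classified" | "published";
  category?: string;
}

type MemoryState = Record<string, DocState>;

// Same shape as the projection handler: last write per docId wins.
function applyEvent(state: MemoryState, data: DocState): MemoryState {
  return { ...state, [data.docId]: data };
}

function rebuild(events: DocState[]): MemoryState {
  return events.reduce(applyEvent, {} as MemoryState);
}

const eventLog: DocState[] = [
  { docId: "doc-1", status: "ocr" },
  { docId: "doc-1", status: "published", category: "other" },
];

// Rebuild twice, as if before and after a node restart: identical state.
const before = rebuild(eventLog);
const after = rebuild(eventLog);
console.log(JSON.stringify(before) === JSON.stringify(after)); // prints: true
console.log(before["doc-1"].status); // prints: published
```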

src/agent.ts:

import { agent } from "@ironflow/node/agent";
import { ocr, classify, publish } from "./tools.js";

interface DocReceived {
  docId: string;
  imageUrl: string;
}

export const docProcessor = agent(
  {
    id: "doc-processor",
    description: "Durable OCR → classify → publish pipeline",
    triggers: [{ event: "doc.received" }],
    memory: {
      streamId: "agent-memory",
      projection: "doc-processor-memory",
    },
    recording: true,
  },
  async ({ event, tool, memory, logger }) => {
    const { docId, imageUrl } = event.data as DocReceived;
    logger.info(`processing ${docId}`);

    // Each tool() call is a memoized step. Crash here and the
    // worker resumes from the step that was in flight.
    const { text } = await tool(ocr, { imageUrl });
    const { category } = await tool(classify, { text });
    await tool(publish, { docId, category });

    // Memory write is also durable — appends to an entity stream
    // and waits for the projection to catch up before returning.
    await memory.append("doc.processed", {
      docId,
      status: "published",
      category,
    });

    return { docId, category };
  },
);

src/worker.ts:

import { createWorker, type IronflowProjection } from "@ironflow/node";
import { docProcessor } from "./agent.js";
import { docMemory } from "./memory.js";

// createWorker honors an explicit `serverUrl`. Read env so the demo
// script can target a non-default port without editing this file.
// IRONFLOW_URL is the canonical name; IRONFLOW_SERVER_URL is accepted as a fallback.
const serverUrl = process.env.IRONFLOW_URL ?? process.env.IRONFLOW_SERVER_URL;

const worker = createWorker({
  functions: [docProcessor],
  projections: [docMemory as IronflowProjection],
  ...(serverUrl ? { serverUrl } : {}),
});

worker.start().then(() => {
  console.log(`doc-processor worker ready (pid ${process.pid})`);
});

In terminal A start the worker:

Terminal window
pnpm dev

Output:

doc-processor worker ready (pid 41203)

In terminal B trigger a run:

Terminal window
ironflow emit doc.received \
--data '{"docId":"doc-1","imageUrl":"https://example.com/invoice.png"}'

Terminal A logs:

[ocr] extracting text from https://example.com/invoice.png (3s)...
[classify] categorizing 51 chars...
[publish] doc doc-1 → invoice channel

Run completes in about 3.2 seconds. Confirm with ironflow run list or curl http://localhost:9123/api/v1/runs | jq '.[0]'.

Trigger a fresh run and kill the worker during the 3-second OCR sleep:

Terminal window
# Terminal B
ironflow emit doc.received \
--data '{"docId":"doc-2","imageUrl":"https://example.com/receipt.png"}'

Terminal A logs the start:

[ocr] extracting text from https://example.com/receipt.png (3s)...

Now in a third terminal — use kill -9 (SIGKILL), not Ctrl+C. SIGINT triggers a graceful drain that completes in-flight steps; only an ungraceful kill leaves the run orphaned, which is what we want to demonstrate:

Terminal window
# Terminal C
kill -9 $(pgrep -f "tsx.*worker.ts")

Terminal A:

[1] + 41203 killed pnpm dev

The OCR call was in flight. There is no completed step recorded for it. ironflow run list shows doc-2 as running (orphaned).

Terminal window
# Terminal A
pnpm dev

Output (after the redelivery wait — see callout below):

doc-processor worker ready (pid 41789)
[ocr] extracting text from https://example.com/receipt.png (3s)...
[classify] categorizing 51 chars...
[publish] doc doc-2 → other channel

The worker pulled the orphaned run, replayed any cached steps, and resumed at the in-flight one. OCR re-ran because it never completed. If we had killed the worker between OCR and classify, OCR would replay from cache and only classify + publish would re-run.
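The SIGINT-versus-SIGKILL distinction from the kill step is ordinary Unix signal semantics. A minimal Node sketch (illustrative only; the real drain logic lives inside the Ironflow worker):

```typescript
// Illustrative only: a typical worker drains in-flight steps on SIGINT
// (Ctrl+C) before exiting, which is why Ctrl+C cannot produce an
// orphaned run. SIGKILL is uncatchable by design, so kill -9 always
// leaves whatever was in flight unrecorded.
process.on("SIGINT", () => {
  console.log("SIGINT: draining in-flight steps, then exiting...");
  // await inFlightSteps;   // (a real worker finishes current steps here)
  process.exit(0);
});

// Node refuses to install a SIGKILL handler at all:
try {
  process.on("SIGKILL", () => {});
} catch (err) {
  console.log("cannot handle SIGKILL:", (err as Error).message);
}
```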

Verify:

Terminal window
ironflow inspect $(ironflow run list --limit 1 --json | jq -r '.[0].id')

You’ll see steps tool.ocr, tool.classify, tool.publish, memory.append, memory.append.wait all marked complete.

Query the memory projection. The HTTP response wraps user state in a ProjectionState envelope, so use .state.state to unwrap:

Terminal window
curl -s http://localhost:9123/api/v1/projections/doc-processor-memory | jq '.state.state'
{
  "doc-1": { "docId": "doc-1", "status": "published", "category": "invoice" },
  "doc-2": { "docId": "doc-2", "status": "published", "category": "other" }
}

Both docs are present, including the one the worker died on. (The SDK’s client.projections.get() peels the envelope automatically; raw curl does not.)
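From code, the same unwrap is a small helper. This sketch assumes only the endpoint and envelope shape shown in the curl output above; it is not the SDK:

```typescript
// Sketch: peel the ProjectionState envelope by hand when talking to the
// HTTP API directly, mirroring `jq '.state.state'`.
interface ProjectionEnvelope<S> {
  state: { state: S };
}

function unwrapProjection<S>(body: ProjectionEnvelope<S>): S {
  return body.state.state;
}

// Raw-HTTP variant of client.projections.get(), using the endpoint above.
async function getProjectionState<S>(name: string): Promise<S> {
  const res = await fetch(`http://localhost:9123/api/v1/projections/${name}`);
  if (!res.ok) throw new Error(`projection fetch failed: ${res.status}`);
  return unwrapProjection((await res.json()) as ProjectionEnvelope<S>);
}

// Usage (requires a running server):
//   type DocMemory = Record<string, { docId: string; status: string; category?: string }>;
//   const memory = await getProjectionState<DocMemory>("doc-processor-memory");
```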

handler runs           step.run records the result
     │                          │
     ▼                          ▼
tool(ocr, …) ────► step.run("tool.ocr") ────► runs.steps row
                                              (input + output + ts)
        crash here?
        ├──► YES: row never written.
        │         resume re-runs the step.
        └──► NO:  row written.
                  resume replays from cache.

Two pieces do the work:

  1. Step memoization: tool(), llm(), approve(), spawn(), and memory.append() all wrap step.run(). Each call records (runId, stepId, input, output). On replay the runtime returns cached output for any step whose row exists, and only re-executes the first step without a row.

  2. Entity-stream memory: memory.append() writes a doc.processed event to the agent’s entity stream (agent-memory), then waits for the named projection to catch up. The projection partitions state by docId and is rebuilt deterministically from events, so its state survives any node restart and rebuilds identically.

No new server primitives were introduced for any of this. Tools, LLM calls, approvals, spawns, and memory writes all compose over the same step.run + entity-stream surface that powers Ironflow workflows. Agents inherit crash-resume for free.

The Makefile target reproduces the full demo:

Terminal window
make demo-agent-crash-resume

The script:

  1. Starts ironflow serve --dev in the background
  2. Builds the SDK and starts the doc-processor worker
  3. Emits doc.received for doc-2
  4. kill -9s the worker after 1.5s (mid-OCR)
  5. Restarts the worker
  6. Waits for the run to finish, asserts memory state contains doc-2
  7. Tears down

make demo-agent-crash-resume is the literal CI gate behind the agent-survives-crash narrative. It runs in .github/workflows/demo-crash-resume.yml whenever the example, this tutorial, or the Makefile changes.