Survive a Crash
Ironflow agents are durable — every tool call is memoized as a recorded step, and every memory write goes through an event-sourced entity stream. Crash the worker mid-run, restart it, and the agent picks up from the last completed step instead of starting over.
This tutorial walks through the exact demo behind `make demo-agent-crash-resume`: build a three-step document-processing agent, kill it mid-pipeline, and watch it resume.
```
doc.received ──► OCR ──► classify ──► publish ──► done
                            ▲
                            │
                         kill -9
                        right here
                            │
                            ▼
                    worker restarts
                            │
                            ▼
      OCR (cached) ──► classify ──► publish
                           ▲
                           │
                   resumes from here
```
What survives the crash
| Agent state | Where it lives | After `kill -9`? |
|---|---|---|
| Completed step output | Step memoization table (`runs.steps`) | ✅ replayed from cache |
| Memory writes | Entity stream (`agent-memory`) | ✅ projection re-derives |
| In-flight tool call | Process memory only | ❌ re-runs from start |
| LLM tokens spent | Provider-side | ❌ already paid for the partial call |
You only re-run the step that was in flight when the worker died. Everything before it is replayed from the cache.
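The replay rule behind that table can be shown in a few lines. This is a hypothetical sketch, not the Ironflow runtime: a `StepCache` class standing in for the memoization table, keyed by `(runId, stepId)`.

```ts
// Hypothetical sketch of step memoization — not Ironflow internals.
// A completed step leaves a row in the cache; on replay, a cached row
// short-circuits execution and only the missing step actually runs.
class StepCache {
  private rows = new Map<string, unknown>();

  async run<T>(runId: string, stepId: string, fn: () => Promise<T>): Promise<T> {
    const key = `${runId}/${stepId}`;
    if (this.rows.has(key)) {
      // Row exists: the step completed before the crash — replay from cache.
      return this.rows.get(key) as T;
    }
    // No row: the step was in flight (or never started). Execute it,
    // then persist the result so the next replay skips it.
    const output = await fn();
    this.rows.set(key, output);
    return output;
  }
}

// Simulate a crash/restart: the handler runs twice against the same
// cache, but the expensive step only executes once.
async function demo() {
  const cache = new StepCache();
  let ocrCalls = 0;
  const runPipeline = () =>
    cache.run("run-1", "tool.ocr", async () => {
      ocrCalls++;
      return "extracted text";
    });

  await runPipeline(); // first attempt — executes, records the row
  await runPipeline(); // "restart" — replays from the recorded row
  console.log(ocrCalls); // 1
}

void demo();
```

The same rule explains the ❌ rows in the table: an in-flight call has no row yet, so a restart re-runs it from the top.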
Prerequisites
- Ironflow server running locally — `ironflow serve --dev` at `http://localhost:9123`
- Node.js 20+, pnpm
- Two terminal windows
- If running the bundled demo via `make demo-agent-crash-resume`: `make build` first so `build/ironflow` exists.
See Installation if the server is not yet running.
Step 1 — Scaffold the agent
The full code is in `examples/agents/doc-processor-agent/`. To follow along inline:
```sh
mkdir -p crash-demo/src && cd crash-demo
```

`package.json`:
```json
{
  "name": "crash-demo",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "tsx watch src/worker.ts",
    "start": "tsx src/worker.ts"
  },
  "dependencies": {
    "@ironflow/node": "*",
    "zod": "^4.3.5"
  },
  "devDependencies": {
    "tsx": "^4.19.0",
    "typescript": "^5.6.0",
    "@types/node": "^22.0.0"
  }
}
```

```sh
pnpm install
```

Step 2 — Define the tools
`src/tools.ts`:
```ts
import { defineTool } from "@ironflow/node/agent";
import { z } from "zod";

// Slow OCR — simulates a real API. The 3-second sleep is the kill window.
export const ocr = defineTool({
  name: "ocr",
  description: "Extract text from a document image",
  input: z.object({ imageUrl: z.string().url() }),
  handler: async ({ imageUrl }) => {
    console.log(`[ocr] extracting text from ${imageUrl} (3s)...`);
    await new Promise((r) => setTimeout(r, 3000));
    return { text: `Extracted text from ${imageUrl}`, pages: 1 };
  },
});

export const classify = defineTool({
  name: "classify",
  input: z.object({ text: z.string() }),
  handler: async ({ text }) => {
    console.log(`[classify] categorizing ${text.length} chars...`);
    await new Promise((r) => setTimeout(r, 200));
    return { category: text.includes("invoice") ? "invoice" : "other" };
  },
});

export const publish = defineTool({
  name: "publish",
  input: z.object({ docId: z.string(), category: z.string() }),
  handler: async ({ docId, category }) => {
    console.log(`[publish] doc ${docId} → ${category} channel`);
    return { published: true };
  },
});
```

Step 3 — Memory projection
`src/memory.ts`:
```ts
import { createProjection } from "@ironflow/node";

interface DocState {
  docId: string;
  status: "ocr" | "classified" | "published";
  category?: string;
}

export const docMemory = createProjection({
  name: "doc-processor-memory",
  events: ["doc.processed"],
  initialState: (): Record<string, DocState> => ({}),
  handler: (state, event) => {
    const data = event.data as DocState;
    return { ...state, [data.docId]: data };
  },
});
```

Step 4 — Wire the agent
`src/agent.ts`:
```ts
import { agent } from "@ironflow/node/agent";
import { ocr, classify, publish } from "./tools.js";

interface DocReceived {
  docId: string;
  imageUrl: string;
}

export const docProcessor = agent(
  {
    id: "doc-processor",
    description: "Durable OCR → classify → publish pipeline",
    triggers: [{ event: "doc.received" }],
    memory: {
      streamId: "agent-memory",
      projection: "doc-processor-memory",
    },
    recording: true,
  },
  async ({ event, tool, memory, logger }) => {
    const { docId, imageUrl } = event.data as DocReceived;
    logger.info(`processing ${docId}`);

    // Each tool() call is a memoized step. Crash here and the
    // worker resumes from the step that was in flight.
    const { text } = await tool(ocr, { imageUrl });
    const { category } = await tool(classify, { text });
    await tool(publish, { docId, category });

    // Memory write is also durable — appends to an entity stream
    // and waits for the projection to catch up before returning.
    await memory.append("doc.processed", {
      docId,
      status: "published",
      category,
    });

    return { docId, category };
  },
);
```

Step 5 — Worker entrypoint
`src/worker.ts`:
```ts
import { createWorker, type IronflowProjection } from "@ironflow/node";
import { docProcessor } from "./agent.js";
import { docMemory } from "./memory.js";

// createWorker honors an explicit `serverUrl`. Read env so the demo
// script can target a non-default port without editing this file.
// IRONFLOW_URL is the canonical name; IRONFLOW_SERVER_URL is accepted as a fallback.
const serverUrl = process.env.IRONFLOW_URL ?? process.env.IRONFLOW_SERVER_URL;

const worker = createWorker({
  functions: [docProcessor],
  projections: [docMemory as IronflowProjection],
  ...(serverUrl ? { serverUrl } : {}),
});

worker.start().then(() => {
  console.log("doc-processor worker ready (pid", process.pid + ")");
});
```

Step 6 — Run the happy path
In terminal A, start the worker:
```sh
pnpm dev
```

Output:

```
doc-processor worker ready (pid 41203)
```

In terminal B, trigger a run:

```sh
ironflow emit doc.received \
  --data '{"docId":"doc-1","imageUrl":"https://example.com/invoice.png"}'
```

Terminal A logs:

```
[ocr] extracting text from https://example.com/invoice.png (3s)...
[classify] categorizing 38 chars...
[publish] doc doc-1 → other channel
```

The run completes in about 3.2 seconds. Confirm with `ironflow run list` or `curl http://localhost:9123/api/v1/runs | jq '.[0]'`.
Step 7 — Crash mid-run
Trigger a fresh run and kill the worker during the 3-second OCR sleep:
```sh
# Terminal B
ironflow emit doc.received \
  --data '{"docId":"doc-2","imageUrl":"https://example.com/receipt.png"}'
```

Terminal A logs the start:

```
[ocr] extracting text from https://example.com/receipt.png (3s)...
```

Now in a third terminal — use `kill -9` (SIGKILL), not Ctrl+C. SIGINT triggers a graceful drain that completes in-flight steps; only an ungraceful kill leaves the run orphaned, which is what we want to demonstrate:

```sh
# Terminal C
kill -9 $(pgrep -f "tsx.*worker.ts")
```

Terminal A:

```
[1]  + 41203 killed     pnpm dev
```

The OCR call was in flight, so no completed step was recorded for it. `ironflow run list` shows doc-2 as running (orphaned).
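How the server decides a run is orphaned isn't shown in this tutorial. A common design for this class of system — an assumption here, not confirmed Ironflow internals — is a heartbeat lease: the worker renews a lease while the run is in flight, a SIGKILL stops the renewals, and the run becomes eligible for redelivery once the lease expires. A minimal sketch of that check:

```ts
// Assumed pattern, not confirmed Ironflow behavior: orphan detection
// via an expired heartbeat lease. All names here are hypothetical.
interface RunLease {
  runId: string;
  status: "running" | "complete";
  leaseExpiresAt: number; // epoch millis: last heartbeat + lease TTL
}

function isOrphaned(run: RunLease, now: number): boolean {
  // Only a still-"running" run with a stale lease is orphaned;
  // completed runs never need redelivery.
  return run.status === "running" && now > run.leaseExpiresAt;
}

// A killed worker stops heartbeating, so the run looks healthy only
// until its last lease runs out.
const run: RunLease = { runId: "doc-2", status: "running", leaseExpiresAt: 6000 };
console.log(isOrphaned(run, 3000)); // false — lease still valid
console.log(isOrphaned(run, 7000)); // true — eligible for redelivery
```

This kind of TTL is also why the restarted worker picks the run up only after a short wait rather than instantly.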
Step 8 — Restart and resume
```sh
# Terminal A
pnpm dev
```

Output (after the redelivery wait — see callout below):

```
doc-processor worker ready (pid 41789)
[ocr] extracting text from https://example.com/receipt.png (3s)...
[classify] categorizing 39 chars...
[publish] doc doc-2 → other channel
```

The worker pulled the orphaned run, replayed any cached steps, and resumed at the one that was in flight. OCR re-ran because it never completed. If we had killed the worker between OCR and classify, OCR would have replayed from cache and only classify + publish would have re-run.
Verify:
```sh
ironflow inspect $(ironflow run list --limit 1 --json | jq -r '.[0].id')
```

You'll see the steps `tool.ocr`, `tool.classify`, `tool.publish`, `memory.append`, and `memory.append.wait`, all marked complete.
Query the memory projection. The HTTP response wraps user state in a `ProjectionState` envelope, so use `.state.state` to unwrap:
```sh
curl -s http://localhost:9123/api/v1/projections/doc-processor-memory | jq '.state.state'
```

```json
{
  "doc-1": { "docId": "doc-1", "status": "published", "category": "other" },
  "doc-2": { "docId": "doc-2", "status": "published", "category": "other" }
}
```

Both docs show up, including the one the worker died on. (The SDK's `client.projections.get()` peels the envelope automatically; raw curl does not.)
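If you're scripting against the raw HTTP API instead of using the SDK, the unwrapping can live in a small helper. The `ProjectionEnvelope` interface below is an assumed shape inferred from the `jq '.state.state'` path, not a published Ironflow type:

```ts
// Assumed envelope shape, inferred from the `.state.state` jq path —
// not a published type from @ironflow/node.
interface ProjectionEnvelope<S> {
  state: {
    state: S; // user-defined projection state lives two levels deep
  };
}

function unwrapProjection<S>(envelope: ProjectionEnvelope<S>): S {
  return envelope.state.state;
}

// Example with the doc-processor state shown in the curl output:
const raw = {
  state: {
    state: {
      "doc-1": { docId: "doc-1", status: "published", category: "other" },
    },
  },
};

console.log(unwrapProjection(raw)["doc-1"].status); // published
```

Forgetting the double `.state` is the usual stumbling block here; a top-level `.state` alone returns the envelope's inner wrapper, not your data.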
Why it works
```
 handler runs               step.run records the result
      │                                │
      ▼                                ▼
 tool(ocr, …) ────► step.run("tool.ocr") ────► runs.steps row (input + output + ts)
                           │
                           ▼
                      crash here?
                           │
                           ├──► YES: row never written.
                           │         resume re-runs the step.
                           │
                           └──► NO:  row written.
                                     resume replays from cache.
```

Two pieces do the work:
- Step memoization — `tool()`, `llm()`, `approve()`, `spawn()`, and `memory.append()` all wrap `step.run()`. Each call records `(runId, stepId, input, output)`. On replay the runtime returns cached output for any step whose row exists, and only re-executes the first step without a row.
- Entity-stream memory — `memory.append()` writes a `doc.processed` event to the agent's entity stream (`agent-memory`), then waits for the named projection to catch up. The projection partitions state by `docId` and is rebuilt from events deterministically, so its state survives any node restart and rebuilds identically.
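The determinism claim in the second point is easy to demonstrate with a plain fold. This is a standalone sketch of the same reduce the `doc-processor-memory` projection performs, without the projection runtime around it:

```ts
// Standalone sketch of the projection fold — the same reduce the
// doc-processor-memory handler performs, minus the runtime. Replaying
// the same events in the same order always rebuilds the same state.
interface DocEvent {
  name: "doc.processed";
  data: { docId: string; status: string; category?: string };
}

type DocMemoryState = Record<string, DocEvent["data"]>;

function rebuild(events: DocEvent[]): DocMemoryState {
  return events.reduce<DocMemoryState>(
    // Last write per docId wins — exactly the handler's spread-merge.
    (state, event) => ({ ...state, [event.data.docId]: event.data }),
    {},
  );
}

const log: DocEvent[] = [
  { name: "doc.processed", data: { docId: "doc-1", status: "published", category: "other" } },
  { name: "doc.processed", data: { docId: "doc-2", status: "published", category: "other" } },
];

// Rebuilding twice from the same log yields identical state — which is
// why a node restart cannot lose or corrupt projection state.
const a = rebuild(log);
const b = rebuild(log);
console.log(JSON.stringify(a) === JSON.stringify(b)); // true
```

Because the state is a pure function of the event log, the only durable thing the server has to protect is the log itself.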
No new server primitives were introduced for any of this. Tools, LLM calls, approvals, spawns, and memory writes all compose over the same `step.run()` + entity-stream surface that powers Ironflow workflows. Agents inherit crash-resume for free.
Run it with one command
The Makefile target reproduces the full demo:
```sh
make demo-agent-crash-resume
```

The script:
- Starts `ironflow serve --dev` in the background
- Builds the SDK and starts the doc-processor worker
- Emits `doc.received` for doc-2
- `kill -9`s the worker after 1.5s (mid-OCR)
- Restarts the worker
- Waits for the run to finish, asserts memory state contains doc-2
- Tears down
`make demo-agent-crash-resume` is the literal CI gate behind the agent-survives-crash narrative. It runs in `.github/workflows/demo-crash-resume.yml` whenever the example, this tutorial, or the Makefile changes.
Next steps
- Code-review agent — adds `llm()` + `approve()` for human-in-the-loop gates.
- Doc-processor agent — the full version of this tutorial, with real tests.
- AI Research agent — multi-step research with parallel search and summarize.
- Comparison — where Ironflow sits vs. LangGraph, Claude Agent SDK, Temporal, and others.
- `@ironflow/langgraph` — same crash-resume guarantees for LangGraph graphs.