Burr vs LangGraph vs XState: picking a claims agent core

It's 21:13 on a Thursday in Mechelen. The claims agent is asking for a policy number that arrived three turns ago. Which framework you picked six months ago decides what you fix next.

Jacob Molkenboer · Founder · A Brand New Company · 11 Jun 2026 · 8 min

It's 21:13 on a Thursday in Mechelen. The duty phone of a 17-person insurance broker buzzes once. Their claims-intake agent has just asked a policyholder, politely and in Dutch, for a policy number that was sitting in the customer's original email three turns ago. The junior on call has a laptop, a coffee, and roughly forty-five minutes before the next batch of overnight email lands and the compounding begins.

What that junior is allowed to do in the next forty-five minutes depends almost entirely on which orchestration framework someone picked six months earlier. So when the broker asked us to rebuild their intake pipeline, we ran a serious three-way bake-off: Burr, LangGraph, and a hand-rolled XState v5 machine running on Bun. Same agent. Same prompts. Same model. Three orchestrators.

The decision wasn't "which framework is best." That question is unanswerable. The decision was: which one survives a Thursday at 21:13 with a junior at the keyboard.

What claims-intake actually has to do

The agent isn't exotic. It reads an inbound email or web-form submission, identifies the policy, asks for any missing information (photos of damage, third-party details, a sketch of the accident if it's a car claim), drafts a structured dossier, and parks it in the adjuster's queue. About twelve steps, four tools (policy lookup, OCR for the schadeformulier, a Belfius-flavoured IBAN validator, the dossier writer), and three branching points. A claim takes between two minutes and three days of clock time, depending on how fast the customer replies.

So the agent is long-running, stateful, conversational, and audited. FSMA and GDPR are watching. If we lose state, we either re-pester a customer who already gave us the data or, worse, we silently drop a piece of evidence. Both end up in a complaints file.

That's the actual job. Now the three contenders.

Burr: replay as a first-class citizen

Burr has been a quietly mature project from DAGWorks for over a year. It models the agent as a state machine where each action is a Python function that reads from and writes to a typed state object. The Burr Tracker, a small local web app, shows every step, every state delta, every tool call, and lets you fork a run from any prior step.

Forking from step three to test a fix looks like this:

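from burr.core import ApplicationBuilder, expr, default
from burr.tracking import LocalTrackingClient

# The local tracker both records runs and loads prior state for forking.
tracker = LocalTrackingClient(project="claims-intake")

# Fork from step 3 of yesterday's run. PARENT_APP_ID is a placeholder for
# that run's app_id, as listed in the Tracker.
forked = (
    ApplicationBuilder()
    .with_actions(
        extract_policy_number,
        fetch_policy,
        request_missing_info,
        compile_dossier,
    )
    .with_transitions(
        ("extract_policy_number", "fetch_policy", expr("policy_number is not None")),
        ("extract_policy_number", "request_missing_info", default),
    )
    .initialize_from(
        tracker,
        resume_at_next_action=True,
        # Used only when no prior state is found for the fork point
        default_state={"claim_id": "CL-8821", "policy_number": None, "photos": []},
        default_entrypoint="extract_policy_number",
        fork_from_app_id=PARENT_APP_ID,
        fork_from_partition_key="CL-8821",
        fork_from_sequence_id=3,
    )
    .with_tracker(tracker)
    .with_identifiers(partition_key="CL-8821")
    .build()
)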

The good: replay is the product, not a feature bolted on. The Tracker UI is the single best agent-debugging interface we tested. Schema is implicit but legible, because the state object is a typed dict and you read its shape by reading the action signatures.

The bad: it's Python. For a Belgian broker whose existing stack is TypeScript across the front, the back, and the Outlook add-in, adding a Python service means a second deployable, a second dependency manager (uv, in our case), and a second on-call rotation. None of that is dealbreaking. It does cost a junior an extra hour of context-switch every time they touch it.

LangGraph: checkpoints, threads, and a paid observability tier

LangGraph models the agent as a directed graph. Nodes are functions that take state and return a partial update. Edges, conditional or fixed, decide where to go next. State is a TypedDict with reducer annotations on fields that need to merge rather than overwrite. Persistence is via a checkpointer (SQLite, Postgres, or custom), keyed by a thread_id.

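import sqlite3

from langgraph.checkpoint.sqlite import SqliteSaver

# from_conn_string is a context manager in recent releases, so for a
# long-lived checkpointer we pass an open connection instead.
conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)

# `builder` is the StateGraph for the intake flow, defined elsewhere.
graph = builder.compile(checkpointer=checkpointer)

config = {"configurable": {"thread_id": "CL-8821"}}

# Walk every checkpoint in this thread (most recent first)
for cp in graph.get_state_history(config):
    print(cp.config["configurable"]["checkpoint_id"], cp.values, cp.next)

# Resume from a specific checkpoint
graph.invoke(
    None,
    config={"configurable": {
        "thread_id": "CL-8821",
        "checkpoint_id": "1ef..."
    }},
)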
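
To make the reducer idea concrete, here is a minimal sketch in plain TypeScript of how a partial update can be merged field by field. It is not LangGraph's API: `IntakeState`, `keepExisting`, `append`, `applyUpdate` and the policy number "P-1" are all hypothetical names chosen for illustration.

```typescript
// Sketch only: per-field reducers applied to a node's partial update.
// This is not LangGraph's API; every name here is hypothetical.
type IntakeState = {
  policyNumber: string | null;
  photos: string[];
};

type Reducer<T> = (current: T, incoming: T) => T;

// Keep an already-known value; accept the incoming one only if none exists.
const keepExisting: Reducer<string | null> = (current, incoming) =>
  current ?? incoming;

// Merge lists instead of overwriting them.
const append: Reducer<string[]> = (current, incoming) => [...current, ...incoming];

function applyUpdate(state: IntakeState, update: Partial<IntakeState>): IntakeState {
  const next: IntakeState = { ...state };
  if (update.policyNumber !== undefined) {
    next.policyNumber = keepExisting(state.policyNumber, update.policyNumber);
  }
  if (update.photos !== undefined) {
    next.photos = append(state.photos, update.photos);
  }
  return next;
}

// Turn 1 finds the policy number; turn 3 returns null for it.
const turn1 = applyUpdate({ policyNumber: null, photos: [] }, { policyNumber: "P-1" });
const turn3 = applyUpdate(turn1, { policyNumber: null, photos: ["a.jpg"] });
console.log(turn3);
```

With a plain overwrite instead of `keepExisting`, the third turn would reset the policy number to null, which is the shape of bug the opening scene describes.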

The good: thread-level checkpointing is reliable, well-documented, and survives a process restart cleanly. The TypeScript port is real and current. LangSmith, the paid observability layer, is genuinely good once you wire it up.

The not-as-good: the moment your state TypedDict changes shape, old checkpoints become a foot-gun. There is no first-class migration story. You either version your state, version your thread_ids, or accept that yesterday's in-flight claims may explode on the next deploy.

Warning

If you change a LangGraph state field's type or rename a key, every existing in-flight thread becomes a small landmine. Add a schema_version field to your state from day one and branch your reducer on it. We learned this the slow way.
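
One way to act on that advice, sketched in plain TypeScript rather than in any framework's API: tag persisted state with a version and upgrade it before use. The two state shapes and the `migrate` function are assumptions made up for this example.

```typescript
// Sketch only: upgrade persisted state by schema_version before using it.
// Both state shapes and the function name are hypothetical.
type StateV1 = { schema_version: 1; policy_number: string | null };
type StateV2 = { schema_version: 2; policy: { number: string | null } };
type PersistedState = StateV1 | StateV2;

function migrate(state: PersistedState): StateV2 {
  switch (state.schema_version) {
    case 1:
      // v1 stored the number as a flat field; v2 nests it under `policy`.
      return { schema_version: 2, policy: { number: state.policy_number } };
    case 2:
      // Already current: return unchanged.
      return state;
  }
}

// A checkpoint written before the deploy, read after it.
const oldCheckpoint: PersistedState = { schema_version: 1, policy_number: "P-1" };
const upgraded = migrate(oldCheckpoint);
console.log(upgraded);
```

Because the union is discriminated on `schema_version`, the compiler flags any version the switch forgets to handle.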

The other thing: at 21:13, a junior staring at LangSmith looking for which node turned policy_number back into None is doing forensic work in a separate web app. Burr's Tracker keeps you closer to the code.

XState on Bun: types all the way down

The third option was to skip the agent frameworks entirely and treat orchestration as what it actually is: a statechart. XState v5 is a mature TypeScript statecharts library. Bun gives us a fast TS runtime with bun:sqlite built in. Persistence is an append-only event log; replay is just feeding the events back to the actor.

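import { createActor } from 'xstate';
import { Database } from 'bun:sqlite';
import { claimsIntakeMachine } from './machine';

const db = new Database('events.db');

// Rebuild an actor by replaying its stored events, optionally stopping at upToSeq.
// Note: replaying through a live actor re-runs any actions attached to the transitions.
function rehydrate(claimId: string, upToSeq?: number) {
  // Compare against undefined so a sequence number of 0 is still honoured.
  const query = db.query(
    `SELECT event FROM intake_events
     WHERE claim_id = ?1 ${upToSeq !== undefined ? 'AND seq <= ?2' : ''}
     ORDER BY seq`
  );
  // Bind only the parameters the statement actually declares.
  const rows = (
    upToSeq !== undefined ? query.all(claimId, upToSeq) : query.all(claimId)
  ) as { event: string }[];

  const actor = createActor(claimsIntakeMachine);
  actor.start();
  for (const row of rows) actor.send(JSON.parse(row.event));
  return actor;
}

// Inspect state at step 3, then run forward
const actor = rehydrate('CL-8821', 3);
const snapshot = actor.getSnapshot();
console.log(snapshot.value, snapshot.context);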

The good: one language across the entire stack. The machine definition is the schema. Stately Studio, free, renders the machine as a diagram a non-engineer can read and approve, which mattered more than expected when the broker's compliance officer wanted to sign off on intake flow. A junior on call edits one TypeScript file, runs bun test, and ships.

The cost: you write the persistence layer, the replay UI, the checkpoint-and-resume semantics, and the cancellation logic yourself. For a twelve-step state machine with four tools, that's about three hundred lines, but it's three hundred lines you own and have to keep correct. Burr gives that to you for free.
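
For a sense of what the core of that persistence layer involves, here is a dependency-free sketch: an in-memory append-only log and a replay that folds events into context. It is not the code we shipped. The event names, the `Context` shape and the `EventLog` class are hypothetical, and a real version would write to a database.

```typescript
// Sketch only: an in-memory append-only log plus replay-by-fold.
// Event names, Context and EventLog are hypothetical.
type IntakeEvent =
  | { type: "EMAIL_RECEIVED"; body: string }
  | { type: "POLICY_FOUND"; policyNumber: string }
  | { type: "PHOTO_ADDED"; url: string };

type Context = { policyNumber: string | null; photos: string[] };

const initial: Context = { policyNumber: null, photos: [] };

// Pure transition: no I/O, so replaying it is deterministic.
function apply(ctx: Context, event: IntakeEvent): Context {
  switch (event.type) {
    case "EMAIL_RECEIVED":
      return ctx;
    case "POLICY_FOUND":
      return { ...ctx, policyNumber: event.policyNumber };
    case "PHOTO_ADDED":
      return { ...ctx, photos: [...ctx.photos, event.url] };
  }
}

class EventLog {
  private rows: { seq: number; event: IntakeEvent }[] = [];

  // Append only: existing rows are never updated or deleted.
  append(event: IntakeEvent): number {
    const seq = this.rows.length + 1;
    this.rows.push({ seq, event });
    return seq;
  }

  // Rebuild context from the first event up to and including upToSeq.
  replay(upToSeq?: number): Context {
    const limit = upToSeq ?? this.rows.length;
    return this.rows
      .filter((row) => row.seq <= limit)
      .reduce((ctx, row) => apply(ctx, row.event), initial);
  }
}

const log = new EventLog();
log.append({ type: "EMAIL_RECEIVED", body: "..." });
log.append({ type: "POLICY_FOUND", policyNumber: "P-1" });
log.append({ type: "PHOTO_ADDED", url: "a.jpg" });

const atStep2 = log.replay(2); // state as it was after the second event
const latest = log.replay();   // state after every event
console.log(atStep2, latest);
```

Most of the remaining lines in a real implementation go to what this sketch leaves out: durable storage, concurrency, and keeping tool calls from firing again during replay.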

Scoring against Thursday night

With all three running the same agent against a fixture of one hundred replayed claims, the scoring shook out like this.

Replay-from-step debugging

Burr wins outright. The Tracker shows you the run as a list of steps, each click reveals the state delta and tool I/O, and you fork from any step with one button. LangGraph is close behind if you've paid for LangSmith, painful if you haven't. XState is whatever you build. Our hand-rolled replay needed a small terminal viewer before it was as usable as Burr's Tracker out of the box, and we still don't have a way to diff state across runs the way Burr does.
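
A state diff of the kind mentioned above does not have to be elaborate. The following is a minimal sketch of a shallow diff between two context snapshots, assuming the context is plain, JSON-serialisable data. `diffSnapshots` and the sample snapshots are hypothetical and not part of our replay tooling.

```typescript
// Sketch only: shallow diff of two context snapshots.
// Assumes plain, JSON-serialisable values; all names are hypothetical.
type Snapshot = Record<string, unknown>;
type Change = { key: string; before: unknown; after: unknown };

function diffSnapshots(before: Snapshot, after: Snapshot): Change[] {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const changes: Change[] = [];
  for (const key of Array.from(keys).sort()) {
    // JSON comparison is crude (it is sensitive to key order in nested
    // objects) but adequate for small, flat contexts.
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes.push({ key, before: before[key], after: after[key] });
    }
  }
  return changes;
}

// Two consecutive steps of a run where the policy number disappears.
const step2: Snapshot = { policyNumber: "P-1", photos: [] };
const step3: Snapshot = { policyNumber: null, photos: [] };
const changes = diffSnapshots(step2, step3);
console.log(changes);
```

Run over consecutive snapshots of a replayed claim, this points at the step where a field changed unexpectedly.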

Who owns the state schema migration

XState wins, by virtue of the schema being the machine and the events being immutable facts. When the schema changes, you either deploy a new version of the machine and route new claims to it, or you write a one-shot event-rewriting migration. Both are boring and obvious. Burr is honest about it: state is a typed Python object, migrations are your code, persistence is your call. LangGraph quietly hopes you won't change the shape. You will.
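
The one-shot event-rewriting migration can be sketched as an upcaster: a pure function that turns old event shapes into the current one. The event names and fields below are hypothetical, and writing the result to a new log instead of editing rows in place is an assumption of this sketch, not something the project prescribes.

```typescript
// Sketch only: a one-shot migration that upcasts v1 events to v2.
// Event names and fields are hypothetical.
type EventV1 = { v: 1; type: "POLICY_FOUND"; policy_number: string };
type EventV2 = { v: 2; type: "POLICY_IDENTIFIED"; policy: { number: string } };
type StoredEvent = EventV1 | EventV2;

function upcast(event: StoredEvent): EventV2 {
  if (event.v === 2) return event; // already current
  return { v: 2, type: "POLICY_IDENTIFIED", policy: { number: event.policy_number } };
}

// Returns a new array; the input log is left untouched.
function migrateLog(events: StoredEvent[]): EventV2[] {
  return events.map(upcast);
}

const oldLog: StoredEvent[] = [
  { v: 1, type: "POLICY_FOUND", policy_number: "P-1" },
  { v: 2, type: "POLICY_IDENTIFIED", policy: { number: "P-2" } },
];
const newLog = migrateLog(oldLog);
console.log(newLog);
```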

What a junior can hot-fix at 21:00

This was the criterion that decided the project. We sat down the broker's most junior dev, gave them a seeded version of the 21:13 bug, and timed three runs.

Burr: 34 minutes. Read the Tracker, found the action returning the wrong state, edited extract_policy_number, redeployed the Python service, replayed the affected partition.
LangGraph: 58 minutes. The bug was a reducer that overwrote policy_number when it should have kept the existing value. Finding it in LangSmith was fine. Convincing themselves the fix wouldn't poison existing checkpoints took the rest of the hour.
XState on Bun: 22 minutes. The bug was a guard that returned the wrong branch. Edit one file, bun test showed two failing transitions, fix, deploy.
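
The XState run went quickly largely because a guard is just a predicate over context, so it can be tested without any framework. Here is a minimal sketch in plain TypeScript. It is not the broker's machine or the seeded bug: `hasPolicyNumber`, `nextAfterExtract` and the state names are hypothetical.

```typescript
// Sketch only: a guard as a pure predicate, testable without a framework.
// Function and state names are hypothetical.
type Ctx = { policyNumber: string | null };

// Guard: do we already hold a usable policy number?
const hasPolicyNumber = (ctx: Ctx): boolean =>
  ctx.policyNumber !== null && ctx.policyNumber.trim() !== "";

// Pick the next state after extraction, based on the guard.
function nextAfterExtract(ctx: Ctx): "fetchPolicy" | "requestMissingInfo" {
  return hasPolicyNumber(ctx) ? "fetchPolicy" : "requestMissingInfo";
}

const known = nextAfterExtract({ policyNumber: "P-1" });
const missing = nextAfterExtract({ policyNumber: null });
const blank = nextAfterExtract({ policyNumber: "  " });
console.log(known, missing, blank);
```

An inverted condition in `hasPolicyNumber` would flip the first two results, which is the kind of failure a pair of transition tests surfaces immediately.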

The XState advantage wasn't framework magic. It was the absence of a framework. A junior staring at a state machine in their own language, with their own persistence model, owes no one an apology when they change something. They just change it.

What we shipped

For this client we shipped XState v5 on Bun, with an append-only event log in Postgres (not SQLite; the broker already runs Postgres and didn't want a second store), a small replay endpoint, and Stately Studio diagrams checked in next to the machine file. We carved out roughly four days of work to build the persistence and replay UI we'd otherwise have got for free with Burr. We recovered those four days within the first month in saved hot-fix time, and we got something Burr can't offer: a diagram the compliance officer can read.

If the same client had been running a Python-first stack, we would have picked Burr without hesitation. Replay-as-product is genuinely a category-defining feature, and the Tracker UI is worth real money. We would not have picked LangGraph for an agent that needs to survive schema changes mid-flight. We'd happily pick it for a chat assistant that resets at the end of every session.

Takeaway

Don't pick an agent framework on benchmarks. Pick on what a junior dev can safely change at 21:00 on a Thursday. That's the day the agent goes to production.

The hot-fix from the opening of this post took 22 minutes in the simulated run. The fix that mattered, though, was the architectural one six months earlier: the decision to model intake as a statechart, persist events, and stay in one language. While building the claims-intake agent for the Mechelen broker, we kept running into the same thing: the orchestrator's debugging story decides your on-call quality of life. We solved it by making the orchestrator something a junior could read before lunch.

If you want to know whether your current agent stack passes the 21:13 test, the cheapest audit you can do tonight: pick a step three turns into a real run, and ask out loud how you would replay from there, what state would be wrong if you changed the schema next week, and who on your team could fix a guard expression unsupervised. If two of those three answers make you wince, you have a Thursday-night problem waiting to happen.

FAQ

Why not just use the latest agent framework from a big vendor?

Vendor frameworks bias toward chat assistants that reset at session end. Long-running, audited agents like claims-intake punish that assumption when state needs to survive a process restart, a schema change, or a hot-fix at 21:00.

Does Bun matter, or is Node fine?

Node is fine. Bun gave us native TypeScript, built-in SQLite, and a faster cold-start for the replay endpoint. The architecture works on any modern JS runtime; we wouldn't rewrite a working Node service just to use Bun.

How big does a team need to be before this comparison stops mattering?

At around fifty engineers, frameworks earn their keep because more people benefit from shared infrastructure. Below twenty, the cost of one extra deployable and one extra language is the deciding factor, not the feature list.

Would you pick LangGraph for anything?

Yes. Short-lived chat assistants, RAG copilots, internal tools where state resets every session. What hurts is the schema-migration story, and it hurts in production for long-running agents whose runs span days.
