KYC agent orchestration: LangGraph, Mastra, or hand-rolled

At 23:07 the iDIN consent endpoint moves. A pipeline that handles 3,400 checks a week starts to back up. Whoever owns the orchestration layer owns the pager. We compared three options.

Jacob Molkenboer · Founder · A Brand New Company · 17 Jun 2026 · 9 min

It is 23:07 on a Tuesday in Tilburg. The on-call engineer at a 25-person fintech gets paged. iDIN, the Dutch bank-issued identity service, has quietly rotated its consent endpoint as part of a scheduled change window. The KYC agent is throwing 422s on every new aanvraag (application). By morning the queue will hold 480 unprocessed identity checks, and a customer-success lead will be asking why onboarding is frozen.

The question that night is not which framework has the best DX. It is who can put a patched workflow into production before 06:30, and whether the decision trail will still hold up six months from now, when DNB asks how that aanvraag was scored.

We have built KYC pipelines on three different orchestration layers in the last twelve months. Here is what each one costs, breaks, and survives.

The scoring grid

Three axes. Anything else is a distraction.

Cost per aanvraag at steady state. The Tilburg shop runs 3,400 checks per week, roughly 177,000 per year.
Wwft-defensible step replay. If a transaction-monitoring officer asks why an aanvraag was flagged for EDD, the system has to reconstruct the full decision path: model inputs, tool outputs, prompt versions, timestamps. Seven years of retention.
Time-to-patch when an upstream API drifts. iDIN, Veriff, Onfido, the KvK handelsregister: they all change shape. The 23:00 incident is the real test.

We benchmarked LangGraph 0.4, Mastra 0.10, and a hand-rolled stack on the Claude Agent SDK plus Postgres. Same prompts, same tools, same models (Sonnet for orchestration, Haiku for the cheap classifier calls).

LangGraph trial

LangGraph models the pipeline as a stateful graph: nodes are tools or LLM calls, edges are conditional transitions, state is checkpointed automatically. For Wwft replay you can rehydrate any run from the persisted checkpoint store. That part is real and it works.

The catch is the abstraction tax. A KYC flow has maybe nine nodes: parse the BSN, hit the iDIN consent endpoint, fetch UBO data from KvK, run the sanctions screen, classify the risk, route to EDD or pass, write audit log, notify the customer, close the case. In LangGraph that came out as about 600 lines of Python including the State TypedDict, conditional edges, and the LangSmith tracing wiring.

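from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph

# Shared state passed between the nodes of the graph.
class KYCState(TypedDict):
    bsn: str
    idin_token: str | None
    ubo: list[dict] | None
    sanctions_hits: list[dict]
    risk_band: Literal["pass", "edd", "deny"]
    audit: list[dict]

# Node functions (call_idin, fetch_ubo, ...) are defined elsewhere.
graph = StateGraph(KYCState)
graph.add_node("idin", call_idin)
graph.add_node("kvk_ubo", fetch_ubo)
graph.add_node("sanctions", screen_sanctions)
graph.add_node("classify", llm_classify_risk)

# Linear path up to the classifier.
graph.add_edge(START, "idin")
graph.add_edge("idin", "kvk_ubo")
graph.add_edge("kvk_ubo", "sanctions")
graph.add_edge("sanctions", "classify")

# route_by_band is expected to return one of the keys below. The three
# audit_* nodes and their edges to END are registered the same way (omitted).
graph.add_conditional_edges("classify", route_by_band, {
    "pass": "audit_pass",
    "edd": "audit_edd",
    "deny": "audit_deny",
})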

When iDIN rotated its endpoint, the patch was a one-line URL change. Easy. But two weeks earlier, a LangChain 0.3.x point release had quietly changed how nested tool calls resolved inside MessagesPlaceholder, which broke our classifier prompt cache for six hours before anyone noticed. The pager event was framework-internal, not vendor-internal. That is the LangGraph trade: the framework gives you a lot, and you inherit its release cadence.

Cost per aanvraag in our run: about €0.011 in model spend, plus a LangSmith subscription if you want the replay UI without self-hosting. We self-hosted, so we paid no LangSmith fee. Engineering overhead settled at roughly 0.4 FTE just to track upstream changes.

Mastra trial

Mastra is the cleanest TypeScript option in this space right now. Workflows are composable steps with typed inputs and outputs, agents are first-class, and the local dev UI shows step-by-step traces out of the box. For a Dutch shop already writing Next.js on Vercel, the friction is almost zero.

import { Workflow, Step } from "@mastra/core";

// Excerpt: idin and kvk are API client wrappers defined elsewhere,
// as is the "classify" step referenced at the end of the chain.
const kyc = new Workflow({ name: "kyc-aanvraag" })
  .step(new Step({
    id: "idin",
    execute: async ({ context }) => idin.exchange(context.bsn),
  }))
  .step(new Step({
    id: "ubo",
    execute: async ({ context }) => kvk.ubo(context.kvkNummer),
  }))
  .then("classify")
  .commit();

The Wwft story is weaker out of the box. Mastra persists workflow runs, but the schema is theirs, the replay UI is theirs, and there is no documented seven-year retention pattern. We ended up writing our own append-only audit table in Postgres and emitting structured events from every step. That worked, but it meant we were doing half the work the framework was supposed to absorb.
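The pattern of emitting one structured event per step does not depend on Mastra and can be sketched on its own. What follows is a minimal illustration, not our production code: `AuditEvent`, `AuditSink`, and `withAudit` are hypothetical names, and the in-memory array stands in for the append-only Postgres table.

```typescript
// Hypothetical event shape; field names are illustrative, not a Mastra API.
interface AuditEvent {
  caseId: string;
  stepName: string;
  input: unknown;
  output: unknown; // null when the step threw
  error: string | null;
  startedAt: string; // ISO 8601
  finishedAt: string;
}

// Anything that can append an event; in practice a Postgres insert.
type AuditSink = (event: AuditEvent) => void;

// Wrap a step so that every run, successful or not, appends exactly one event.
function withAudit<I, O>(
  stepName: string,
  step: (input: I) => Promise<O>,
  sink: AuditSink,
): (caseId: string, input: I) => Promise<O> {
  return async (caseId: string, input: I): Promise<O> => {
    const startedAt = new Date().toISOString();
    const emit = (output: unknown, error: string | null): void => {
      sink({
        caseId,
        stepName,
        input,
        output,
        error,
        startedAt,
        finishedAt: new Date().toISOString(),
      });
    };
    try {
      const output = await step(input);
      emit(output, null);
      return output;
    } catch (err) {
      emit(null, String(err));
      throw err; // the failure is recorded, then surfaced to the caller
    }
  };
}

// Usage with an in-memory sink standing in for the audit table.
const events: AuditEvent[] = [];
const screenSanctions = withAudit(
  "sanctions",
  async (name: string) => ({ hits: name === "ACME BV" ? 1 : 0 }),
  (e) => {
    events.push(e);
  },
);
```

Because the wrapper records failures as well as successes, a step that dies on an upstream 422 still leaves a row behind, which is usually the row an auditor asks about.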

The 23:00 patch was straightforward: hot reload, then a redeploy on Vercel in under three minutes. The catch is that Mastra is young. The 0.10 release in our window shipped two breaking changes from 0.9. If you adopt today, you are committing to tracking a fast-moving project, and their own docs are honest about that.

Cost per aanvraag: about €0.011 in model spend. Engineering overhead: 0.3 FTE, lower because the surface area is smaller.

Hand-rolled Claude Agent SDK plus Postgres

This is what we shipped to production. The orchestrator is about 900 lines of TypeScript. Each KYC step is a function that takes a case_id, reads the previous step from a kyc_steps table in Postgres, calls the next tool or model, and writes the result back as an append-only row.

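-- Append-only log: one row per executed step.
create table kyc_steps (
  id           bigserial primary key,
  case_id      uuid not null,
  step_name    text not null,
  step_input   jsonb not null,
  step_output  jsonb,                 -- tool or model result
  model_id     text,                  -- set for LLM steps
  prompt_hash  text,                  -- identifies the prompt template version
  started_at   timestamptz not null default now(),
  finished_at  timestamptz,
  cost_eur     numeric(10,6)
);
-- Replay reads one case in chronological order.
create index on kyc_steps (case_id, started_at);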
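The step function described above can be sketched roughly as follows. This is an illustration under assumptions, not the production orchestrator: `StepRow` mirrors only a subset of the `kyc_steps` columns, `StepStore` and `runStep` are hypothetical names, and an in-memory array stands in for Postgres (a real store would be asynchronous).

```typescript
// Subset of the kyc_steps columns, camel-cased for TypeScript.
interface StepRow {
  id: number;
  caseId: string;
  stepName: string;
  stepInput: unknown;
  stepOutput: unknown;
}

// Hypothetical append-only store: rows are added, never updated or deleted.
class StepStore {
  private rows: StepRow[] = [];

  // Most recent row for a case, or undefined if the case has no steps yet.
  latest(caseId: string): StepRow | undefined {
    const own = this.history(caseId);
    return own[own.length - 1];
  }

  append(row: Omit<StepRow, "id">): StepRow {
    const saved: StepRow = { ...row, id: this.rows.length + 1 };
    this.rows.push(saved);
    return saved;
  }

  // All rows for a case, in insertion order.
  history(caseId: string): StepRow[] {
    return this.rows.filter((r) => r.caseId === caseId);
  }
}

// Run one step: its input is the previous step's output, or the seed
// value when the case has no earlier rows.
function runStep(
  store: StepStore,
  caseId: string,
  stepName: string,
  fn: (input: unknown) => unknown,
  seed?: unknown,
): StepRow {
  const previous = store.latest(caseId);
  const stepInput = previous ? previous.stepOutput : seed;
  return store.append({ caseId, stepName, stepInput, stepOutput: fn(stepInput) });
}

// Usage: two chained steps for one case.
const store = new StepStore();
runStep(store, "c1", "parse", (raw) => ({ parsed: raw }), "raw-form-data");
runStep(store, "c1", "classify", (prev) => ({ band: "pass", basedOn: prev }));
```

The point of the design is that each step's input is whatever the previous row recorded, so the table is both the working state and the audit trail.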

Replay is a SQL query. Given a case_id, select all steps in order and you have the full decision path, including the prompt hash so you can prove which version of the template ran. A Wwft auditor walks in, you hand them a CSV. We never had to argue with the supervisor about what counts as reconstructible.
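To show how small that replay can be, here is a sketch. The query text targets the `kyc_steps` table above; `REPLAY_SQL`, `ReplayRow`, `csvField`, and `toCsv` are hypothetical names, and the CSV escaping follows the common double-quote convention rather than anything specific to our stack.

```typescript
// Parameterised replay query for the kyc_steps table; $1 is the case_id.
const REPLAY_SQL = `
  select step_name, step_input, step_output, model_id, prompt_hash,
         started_at, finished_at
  from kyc_steps
  where case_id = $1
  order by started_at, id`;

// One result row, keyed by column name.
type ReplayRow = Record<string, unknown>;

// Render one value as a quoted CSV field, doubling embedded quotes.
function csvField(value: unknown): string {
  let text: string;
  if (value === null || value === undefined) {
    text = "";
  } else if (value instanceof Date) {
    text = value.toISOString();
  } else if (typeof value === "object") {
    text = JSON.stringify(value); // jsonb columns arrive as objects
  } else {
    text = String(value);
  }
  return `"${text.replace(/"/g, '""')}"`;
}

// Turn replay rows into CSV text with a header line.
function toCsv(rows: ReplayRow[], columns: string[]): string {
  const header = columns.map(csvField).join(",");
  const lines = rows.map((row) => columns.map((c) => csvField(row[c])).join(","));
  return [header, ...lines].join("\n");
}
```

With a client such as node-postgres, the rows would come from something like `client.query(REPLAY_SQL, [caseId])`, and the result can be passed to `toCsv` with the column list from the select.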

The 23:00 iDIN patch was a twelve-line diff in the idinConsent function and a redeploy. No framework upgrade, no checkpoint migration, no breaking-change changelog to read at midnight. The on-call engineer had it back up at 23:41.

The catch is real: you own everything. There is no LangSmith dashboard to send to your CISO, no Mastra DevTools UI for the product team. You build the observability you need, or you skip it. For a 25-person shop with one senior backend engineer who likes Postgres, that is fine. For a five-person team that needs everything to work out of the box, it is a bad bet.

Cost per aanvraag: about €0.0094 in model spend, because we strip framework overhead from each call and lean on the Anthropic prompt cache aggressively. Engineering overhead: 0.5 FTE for the build, 0.2 FTE ongoing.

Per-aanvraag economics at 177k per year

Steady-state numbers, with annual figures rounded to whole euros:

                  LangGraph    Mastra         Hand-rolled
Model spend       €1,947       €1,947         €1,665
Hosting           €240         €0 (Vercel)    €180
Eng overhead*     €38,400      €28,800        €19,200 (yr 2+)
Total / year      €40,587      €30,747        €21,045
Per aanvraag      €0.230       €0.174         €0.119
* 0.4 / 0.3 / 0.2 FTE @ €96k loaded
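The totals are plain arithmetic and can be re-derived. The sketch below takes the annual model spend and hosting figures from the table as given, uses 3,400 × 52 = 176,800 checks per year (which the text rounds to 177k), and applies the €96k loaded FTE cost. The type and function names are illustrative.

```typescript
const CHECKS_PER_YEAR = 3400 * 52; // 176,800, quoted as roughly 177k in the text
const LOADED_FTE_EUR = 96000;

interface StackCost {
  modelSpend: number; // EUR per year, taken from the table
  hosting: number; // EUR per year, taken from the table
  fte: number; // ongoing engineering overhead in FTE
}

// Annual total in whole euros: model spend + hosting + engineering overhead.
function yearlyTotal(c: StackCost): number {
  return Math.round(c.modelSpend + c.hosting + c.fte * LOADED_FTE_EUR);
}

// Cost per aanvraag, formatted to three decimals as in the table.
function perAanvraag(c: StackCost): string {
  return (yearlyTotal(c) / CHECKS_PER_YEAR).toFixed(3);
}

const langGraph: StackCost = { modelSpend: 1947, hosting: 240, fte: 0.4 };
const mastra: StackCost = { modelSpend: 1947, hosting: 0, fte: 0.3 };
const handRolled: StackCost = { modelSpend: 1665, hosting: 180, fte: 0.2 };

console.log(yearlyTotal(langGraph), perAanvraag(langGraph)); // 40587 "0.230"
console.log(yearlyTotal(mastra), perAanvraag(mastra)); // 30747 "0.174"
console.log(yearlyTotal(handRolled), perAanvraag(handRolled)); // 21045 "0.119"
```

Engineering overhead accounts for more than 90% of every column, which is why the per-aanvraag figure moves with FTE rather than with model pricing.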

The model-spend difference is small. The engineering-overhead difference is large, and it is the number that dominates past month three. From year two onward, a framework that costs you 0.3 to 0.4 FTE to track is more expensive than a hand-rolled stack at 0.2 FTE, at any plausible scale below a million aanvragen per year.

The Wwft step-replay test

We ran the same fictional EDD escalation through all three stacks and asked a single question: can a compliance officer with read-only Postgres access reconstruct, in under five minutes, why aanvraag #A-44219 was flagged?

LangGraph: yes, via the LangSmith replay UI or by reading the checkpoint store directly. Requires the officer to understand the framework's state shape.
Mastra: partial. The workflow run is there, but our custom audit table did the heavy lifting.
Hand-rolled: yes, via a single SQL query. The schema is the schema we designed for compliance, not for the framework.

Takeaway

If the audit table is the product, do not let the orchestration framework own its shape. Compliance schemas outlive frameworks by a decade.

The 23:00 patch question

This is the one nobody asks during the proof-of-concept and everybody asks at the post-mortem. When iDIN rotates its consent endpoint at 23:00 on a Tuesday, who patches the workflow?

LangGraph: your senior Python engineer, after first checking whether the underlying issue is in LangChain core, LangGraph, or your code. Median time-to-fix in our trial: 47 minutes.

Mastra: your senior TypeScript engineer. Mastra has a thinner abstraction so the suspect surface is smaller. Median time-to-fix: 18 minutes.

Hand-rolled: whoever is on call, because the code reads like the rest of the codebase. Median time-to-fix: 11 minutes.

The usual advice is to pick the orchestration layer you can fork. The sharper version: pick the one your on-call engineer can fix at 23:00 without reading a changelog.

What we shipped

The Tilburg fintech ended up on the hand-rolled stack. Three months in, per-aanvraag cost is €0.12, the Wwft audit passed without findings, and the pager has fired twice. Both times the fix was a pull request, not a framework upgrade.

When we built the KYC agents for this client, the problem we kept running into was that every "look how easy this is" framework demo assumes a workflow that does not change. We solved it by treating the audit table as the product and the orchestration code as disposable scaffolding around it.

If you are running anything regulated and you cannot answer the 23:00 question in one sentence, open your audit table tonight and write the SQL you would hand to a DNB inspector tomorrow. If that query is more than five lines, the framework is in your way.

Key takeaway

Pick the orchestration layer your on-call engineer can patch at 23:00 without reading a changelog. Compliance schemas outlive frameworks.

FAQ

Is LangGraph a bad choice for KYC?

No. It works and the checkpoint replay is real. The cost shows up in framework drift: you spend roughly 0.4 FTE per year tracking LangChain releases that you did not ask for.

Is Mastra production-ready for Dutch regulated work?

For TypeScript shops that own the audit story themselves, yes. For Wwft-bound work, plan on writing your own append-only audit table. Do not rely on the built-in run history for seven-year retention.

What does Wwft step replay actually require?

A reconstructible decision path with inputs, outputs, model id, prompt version, and timestamps, retained seven years. Most teams under-spec the prompt-versioning part.

How long did the hand-rolled build take?

Two senior engineers, six weeks, including the Postgres audit schema, prompt versioning, a minimal internal replay UI, and integration tests against an iDIN sandbox.
