
Customs classification agents: LangGraph, CrewAI or DIY

Three orchestrators, one customs-classification agent, 3,200 weekly lookups. We scored LangGraph, CrewAI and a hand-rolled Claude loop on cost, replay and Sunday-night patches.

Jacob Molkenboer · Founder · A Brand New Company · 19 Jun 2026 · 10 min


Sunday, 21:14. The Dordrecht office of a 19-person expeditiebureau (freight-forwarding agency) is dark. Two declarants are at home, a third is in Antwerp finishing a feeder vessel. The EU has just published a TARIC kwartaalwijziging (quarterly amendment) in the Official Journal: 312 codes changed, six of them in chapters this client touches every day. By Monday 06:30 the first 47 aangiftes (customs declarations) need an HS code that will not bounce at the Douane portal. Whoever owns the customs-classification agent owns this Sunday evening.

That sentence — who owns the Sunday evening — turned out to be the entire framework comparison.

The aangifte we were trying to automate

Our client imports polymer compounds, technical films and a long tail of finished plastic goods. A declarant takes a supplier invoice, finds the right 10-digit TARIC code, confirms preferential origin against the trade agreement, calculates duty and BTW (VAT), and files the aangifte through AGS. They do this for roughly 3,200 line items a week. About 70% are repeats of the same materials; the remaining 30% is where mistakes get expensive.

The agent we were asked to build had a small, ugly job: parse the invoice PDF, propose an HS code with a confidence band, surface two or three plausible alternatives with reasoning, and write everything to a place a Douane inspector can read in 2033. The orchestration layer is what the rest of this post is about.

LangGraph: the graph that knew too much

We prototyped on LangGraph first. The directed-graph mental model maps cleanly to the work: parse → propose → cross-check chapter → confirm origin → draft aangifte. LangGraph ships a Postgres checkpointer, replays are a documented feature, and the surrounding ecosystem is mature. We had a working prototype in four days. The graph compiled, the checkpointer wrote rows, replays roundtripped on the happy path. On paper this looked like the answer.

Three things bit us.

First, the state object belongs to the framework. The checkpoint table stores serialised state — readable today, decodable in 2033 only if the deserialiser still exists in the form we have now. When a Douane inspector asks “what did the agent see, and why did it pick 3920.10.81 over 3921.19.00 on aangifte 2026-04-17-0231,” we want one SQL query, not a Python virtualenv frozen at the version we used. We tested this concretely: we loaded a checkpoint written by a graph definition two minor versions back. It read, but a node we had since renamed silently dropped its tool-result on replay. Recoverable in development. Not the kind of footnote that belongs in a customs file.

Second, the people who need to change the graph cannot. Editing a node is editing a Python StateGraph. Our client has two declarants who are comfortable with a YAML file and a small admin UI. Neither is going to redeploy a service on a Sunday evening. We could have built a thin admin layer that mutated the live graph, but at that point we were arguing with LangGraph's mental model rather than using it.

Third, retries on tool-call rate limits gave us non-deterministic step counts in the trace. That can be pinned down, but it took more code than we wanted to write to defend a property we wanted built in from the start. Determinism is cheap to design in and expensive to bolt on.

CrewAI: roles in search of a workflow

We then tried CrewAI with three roles — classifier, auditor, scribe — letting them collaborate. For an open-ended research task this style is genuinely elegant. For a customs declaration with a regulatory paper trail, the autonomy is a tax we did not want to pay.

In the first three runs we could not point at a single agent and say “this one decided.” The trace was a conversation, not a decision tree. In one early run, the classifier and auditor negotiated the same tariff heading across three turns before one yielded; the answer was right, but the path read like the minutes of a small committee. We tightened the tasks and tools until the “crew” was effectively a sequential pipeline, at which point the framework was overhead. We would have had to bolt Postgres persistence on anyway. CrewAI is a sharp tool. This was not the shape it was sharpened for.

The hand-rolled loop with Postgres as source of truth

We shipped a hand-rolled Claude tool-use loop. About 600 lines of Python. The state machine is one table:

create table agent_run (
  run_id          uuid        not null,  -- one run spans many step rows
  aangifte_ref    text        not null,
  step_no         int         not null,
  step_kind       text        not null,  -- 'plan' | 'tool_call' | 'tool_result' | 'final'
  model           text        not null,  -- e.g. 'claude-sonnet-4-5-20250929'
  prompt_hash     text        not null,
  prompt          jsonb       not null,
  tool_name       text,
  tool_args       jsonb,
  tool_result     jsonb,
  output          jsonb,
  taric_snapshot  text        not null,  -- e.g. 'TARIC-2026-Q2-v3'
  created_at      timestamptz not null default now(),
  primary key (run_id, step_no)          -- one row per step within a run
);
create index on agent_run (aangifte_ref);
create index on agent_run (taric_snapshot);

Each turn of the loop writes one row before the next call goes out. The TARIC snapshot ID is pinned per run, and the chapter rules the agent reads come from a versioned YAML file the declarants own. Nothing about the run lives in framework memory; if the process crashes, the run resumes from the last row.
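The write-then-call discipline is easier to see in code. What follows is a minimal sketch, not our production loop: sqlite3 stands in for Postgres so it runs anywhere, the table is trimmed to a few columns, and the helper names append_step and resume_point are made up for illustration.

```python
import json
import sqlite3
import uuid

# sqlite3 stands in for Postgres here; columns are a trimmed-down version
# of the agent_run table, with all JSON folded into one 'payload' column.
SCHEMA = """
create table agent_run (
    run_id         text    not null,
    aangifte_ref   text    not null,
    step_no        integer not null,
    step_kind      text    not null,
    payload        text    not null,   -- JSON, stored verbatim
    taric_snapshot text    not null,
    primary key (run_id, step_no)
)
"""

def append_step(db, run_id, aangifte_ref, step_kind, payload, taric_snapshot):
    """Persist one step and commit before the caller makes its next call."""
    (last,) = db.execute(
        "select coalesce(max(step_no), 0) from agent_run where run_id = ?",
        (run_id,),
    ).fetchone()
    db.execute(
        "insert into agent_run values (?, ?, ?, ?, ?, ?)",
        (run_id, aangifte_ref, last + 1, step_kind,
         json.dumps(payload, sort_keys=True), taric_snapshot),
    )
    db.commit()
    return last + 1

def resume_point(db, run_id):
    """Return (step_no, step_kind) of the last persisted row, or None."""
    return db.execute(
        "select step_no, step_kind from agent_run "
        "where run_id = ? order by step_no desc limit 1",
        (run_id,),
    ).fetchone()

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
run_id = str(uuid.uuid4())
append_step(db, run_id, "demo-ref-1", "plan", {"note": "parse invoice"}, "TARIC-2026-Q2-v3")
append_step(db, run_id, "demo-ref-1", "tool_call", {"tool": "lookup"}, "TARIC-2026-Q2-v3")

# A restarted process asks where the run stopped and continues from there.
last_step = resume_point(db, run_id)
```

Because the commit happens before the next model or tool call goes out, the worst a crash can do is lose a call that was never recorded as made.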

Per-aangifte cost at 3,200 lookups a week

Numbers from production traffic, not a benchmark deck. Average 4.1 turns per classification, ~6.5k input tokens and ~900 output tokens per turn on Claude Sonnet 4.5, with prompt caching on the stable system prefix. Marginal cost per lookup landed at €0.018. At 3,200 lookups a week that is €57.60/week, ~€3,000/year in inference.
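For anyone who wants to plug in their own volumes, the arithmetic is short. The per-lookup cost and weekly volume are the figures quoted above; the 52-week year with no seasonal dip is an assumption.

```python
from decimal import Decimal

COST_PER_LOOKUP_EUR = Decimal("0.018")  # figure quoted above
LOOKUPS_PER_WEEK = 3200                 # figure quoted above
WEEKS_PER_YEAR = 52                     # assumption: no seasonal dip

weekly_eur = COST_PER_LOOKUP_EUR * LOOKUPS_PER_WEEK
annual_eur = weekly_eur * WEEKS_PER_YEAR

print(f"weekly: EUR {weekly_eur:.2f}")  # weekly: EUR 57.60
print(f"annual: EUR {annual_eur:.2f}")  # annual: EUR 2995.20
```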

LangGraph and CrewAI ran the same inference bill within noise — the orchestrator does not meaningfully change the prompt. Where the cost diverged was storage and ops.

LangGraph checkpoint table at this load: ~80 MB/year of serialised state.
Our agent_run table at this load: ~420 MB/year, because we store prompts and tool args verbatim in jsonb.

That extra 340 MB/year is the audit trail. On a managed Postgres tier we already had, the marginal cost is a rounding error. We pay it.

Methodology note: the 80 MB figure is from LangGraph's reference Postgres schema with default settings on our traffic shape; with periodic checkpoint pruning you can keep it lower, but pruning is what we did not want. The 420 MB is unpruned, uncompressed JSONB. We considered enabling Postgres column compression on the prompt column and decided the operational simplicity of “what you see in the row is what the agent saw” was worth the disk.

Step history the Douane will accept

Dutch expeditiebureaus generally retain customs records for seven years to line up with the fiscale bewaarplicht, even though the Union Customs Code baseline is three. We design for ten, on the principle that the next regulator to ask a hard question about an AI-classified aangifte will not be the one we planned for.

Replay-defensible, for us, means five concrete properties:

One row per agent step, joinable by aangifte_ref.
Prompt and tool arguments stored verbatim, not summarised.
TARIC snapshot pinned per run, with a separate table of snapshot manifests.
Model name and exact version pinned per row.
The final aangifte joinable to the run that produced it.
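Most of these properties can be checked mechanically against the rows of a single run. The sketch below is an illustration under assumptions, not our production check: audit_gaps is a made-up name, the rows are plain dicts using the agent_run column names, and the sample values are invented.

```python
from typing import Iterable, Mapping

REQUIRED_FIELDS = ("prompt", "model", "taric_snapshot")

def audit_gaps(rows: Iterable[Mapping]) -> list[str]:
    """Return human-readable problems for one run's rows; empty means clean."""
    rows = sorted(rows, key=lambda r: r["step_no"])
    if not rows:
        return ["run has no rows"]
    problems = []
    # One row per step: step numbers must run 1..n with no holes.
    if [r["step_no"] for r in rows] != list(range(1, len(rows) + 1)):
        problems.append("step numbers are not contiguous from 1")
    # Snapshot pinned per run: every row names the same snapshot.
    if len({r.get("taric_snapshot") for r in rows}) != 1:
        problems.append("more than one TARIC snapshot in a single run")
    # Prompt and exact model version present on every row.
    for r in rows:
        for field in REQUIRED_FIELDS:
            if not r.get(field):
                problems.append(f"step {r['step_no']}: missing {field}")
    # The run must end in a 'final' step that the aangifte can join to.
    if rows[-1]["step_kind"] != "final":
        problems.append("last step is not 'final'")
    return problems

# Invented sample rows for illustration.
clean_run = [
    {"step_no": 1, "step_kind": "plan", "prompt": {"text": "..."},
     "model": "model-x", "taric_snapshot": "SNAP-A"},
    {"step_no": 2, "step_kind": "final", "prompt": {"text": "..."},
     "model": "model-x", "taric_snapshot": "SNAP-A"},
]
broken_run = [
    clean_run[0],
    {**clean_run[1], "step_no": 3, "taric_snapshot": "SNAP-B"},
]
```

Calling audit_gaps(clean_run) returns an empty list, while broken_run reports both the hole in the step numbers and the mixed snapshots.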

LangGraph can satisfy four of those today, with effort. The remaining one comes down to the “will the deserialiser still work in 2033” bet, and that is the one we did not want to take. A flat Postgres row with JSON columns is the longest-lived data structure we know.

Warning

If your audit trail depends on your orchestration framework still parsing its own checkpoint format in seven years, you are not really keeping records. You are keeping a hostage.

Two bugs the audit trail caught

In the first month after go-live we found two bugs we would not have caught without prompt and tool args stored verbatim.

The first was a chapter-rule regression. A declarant edited the YAML for chapter 39, dropped a comma, and the loader parsed the file but truncated a guidance string mid-sentence. The agent kept classifying. Its confidence band on heading 3920 quietly collapsed by about eight points. We noticed when the nightly aggregate confidence dropped, and proved within an hour that it had nothing to do with the model by joining run rows to the YAML version they had pinned.
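A nightly check of this kind can be very small. This sketch assumes confidences on a 0 to 1 scale grouped by heading; the function name, the 0.05 threshold and the sample scores are invented for illustration.

```python
from statistics import mean

def confidence_drops(baseline, tonight, max_drop=0.05):
    """Return {heading: drop} where mean confidence fell by more than max_drop."""
    flagged = {}
    for heading, scores in tonight.items():
        if not scores or not baseline.get(heading):
            continue  # nothing to compare against
        drop = mean(baseline[heading]) - mean(scores)
        if drop > max_drop:
            flagged[heading] = round(drop, 3)
    return flagged

# Invented scores: heading 3920 loses about eight points, 3921 is unchanged.
baseline = {"3920": [0.90, 0.92], "3921": [0.85, 0.87]}
tonight = {"3920": [0.82, 0.84], "3921": [0.85, 0.87]}
flagged = confidence_drops(baseline, tonight)
```

The check only says that something moved. Telling a model change apart from a rules change still takes the join against the pinned rules version.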

The second was quieter. A supplier swapped to a new invoice template that pushed the product description into a sidebar. Our PDF parser still returned a value: the first text run on the page, which happened to be a logo caption. The agent obediently classified “BrandCo Solutions” as something near enough to surface a code. The trail showed the bad parse sitting in tool_args verbatim; we added a sanity check (length and detected language), then replayed three weeks of affected aangiftes to flag the seven that needed re-filing.
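A guard of this kind might look like the sketch below. It covers only the length half of the check, plus a crude word count; language detection is left out because it needs a third-party library. Both thresholds and the function name are hypothetical.

```python
import re

MIN_CHARS = 20  # hypothetical threshold
MIN_WORDS = 4   # hypothetical threshold

def description_looks_plausible(text: str) -> bool:
    """Cheap guard on the parser output before it reaches the model."""
    cleaned = " ".join(text.split())              # collapse whitespace
    words = re.findall(r"[^\W\d_]{2,}", cleaned)  # alphabetic runs, 2+ letters
    return len(cleaned) >= MIN_CHARS and len(words) >= MIN_WORDS

logo_caption = "BrandCo Solutions"
real_description = "Polyethylene film, transparent, 50 micron, rolls of 500 m"
```

With these thresholds the logo caption is rejected and the product description passes. A rejected parse is better routed to human review than retried, since the parser will keep returning the same text.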

Neither bug is exotic. The point is we found both by joining four tables and reading text. No serialiser, no framework version pin, no time spent reconstructing what the model “must have” seen.

Patching the graph on a Sunday evening

This is the decision the comparison actually turned on. TARIC publishes quarterly delta updates, sometimes on weekends. The client did not want a vendor — us — on the critical path for every kwartaalwijziging. They wanted the decision to stay in the building.

With LangGraph, editing a node means editing the graph definition, redeploying, replaying tests. The declarants do not do that. With CrewAI, the autonomy meant a config change had non-obvious runtime effects; not what you want at 22:00 on a Sunday. With the hand-rolled loop, a chapter rule lives in YAML:

chapter: "39"
label: "Plastics and articles thereof"
taric_snapshot_min: "TARIC-2026-Q2-v1"
guidance: |
  Always require preferential-origin evidence (ATR or EUR.1) when
  the supplier country is TR, MA or any EU FTA partner.
  Reject confidences below 0.78 for headings 3920 and 3921;
  surface 3 alternatives instead.
fallback_human_review: true

When the new TARIC drops, a diff script flags which chapters moved. The declarants approve the new rules from a small admin page. We get a Slack ping only if the diff touches a rule the declarants do not recognise.
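The core of such a diff script is a set comparison grouped by chapter. The sketch below assumes each snapshot can be reduced to a set of 10-digit code strings and only looks at codes added or removed; changes to duty rates or measures on an unchanged code would need a richer comparison. The function name and the sample codes are invented.

```python
def chapters_moved(old_codes: set[str], new_codes: set[str]) -> dict[str, list[str]]:
    """Map chapter -> sorted codes added or removed between two snapshots."""
    changed = old_codes ^ new_codes  # symmetric difference
    by_chapter: dict[str, list[str]] = {}
    for code in sorted(changed):
        # The chapter is the first two digits of the code.
        by_chapter.setdefault(code[:2], []).append(code)
    return by_chapter

# Invented codes for illustration only.
old_snapshot = {"3920100000", "3921190000", "8501100000"}
new_snapshot = {"3920100000", "3921190010", "8501100000"}
moved = chapters_moved(old_snapshot, new_snapshot)

# Only chapters that have a rules file need a declarant's attention.
owned_chapters = {"39", "40"}  # hypothetical
needs_review = sorted(set(moved) & owned_chapters)
```

Here moved contains only chapter 39, so that is the one rules file a declarant is asked to review.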

In the eighteen months the agent has been live, we have been paged once. The declarants have approved seven kwartaalwijzigingen on their own. The Q3 2025 update was the noisy one: 41 codes in chapter 39 moved, including three the client used several times a week. The diff script flagged it Sunday at 18:00; by 20:30 the senior declarant had approved a new ruleset, the YAML version bumped, and Monday's first 47 aangiftes used the updated guidance. We heard about it at the standup.

Takeaway

Choose the orchestrator your client can still operate, on a Sunday evening, in 2030.

What we shipped

LangGraph and CrewAI are both good at what they are for. Neither was for a 19-person expeditiebureau that needs to patch agent behaviour at 21:30 on a Sunday without an engineer in the loop, and needs to defend the decision in front of a Douane inspector seven years later. The hand-rolled Claude loop with a Postgres step table cost about the same to run, gave us an audit trail we can read with one SQL query, and put the patch authority back where it belonged.

When we built this customs-classification stack at A Brand New Company (ABN), the thing that surprised us was how much the architecture choice was really an HR choice: who edits what, on which evening, with which deploy key. We ended up solving it with a YAML rules file and an admin page so the declarants own the moving parts. If you are weighing the same trade-off, our notes on shipping AI agents are where we keep the longer playbook.

One thing you can do today: open the audit table your current agent writes to and ask whether you could re-run a single classification from 2024 — same prompt, same model version, same rule snapshot — and get within a token of the same answer. If not, you have a decision to make before the next regulator does.

Key takeaway

The right agent orchestrator is the one your client can still operate, audit and patch on a Sunday evening seven years from now — not the one with the prettiest graph.

FAQ

Why not just use LangGraph's Postgres checkpointer?

It works today. Our concern was 2033: the checkpoint is serialised state owned by the framework. A flat row with JSON columns is the longest-lived audit format we know, and the Douane wants one SQL query, not a frozen virtualenv.

Is CrewAI ever the right choice for compliance work?

Rarely. Its strength is autonomous role collaboration, which is a tax when every step needs a regulatory paper trail. For open-ended research and drafting tasks it is a sharp tool. For customs declarations it was overhead.

What did per-aangifte inference cost land at?

About €0.018 per HS-code lookup on Claude Sonnet 4.5 with prompt caching, averaging 4.1 turns per classification. At 3,200 lookups a week that is roughly €3,000 a year in inference, before storage and ops.

How do you handle TARIC quarterly updates without redeploying?

Chapter rules live in a versioned YAML file the declarants own. A diff script flags chapters that moved in the new TARIC, declarants approve changes via an admin page, and we are only paged when the diff touches an unfamiliar rule.

