Track-and-trace agent stack: how a Dutch 3PL should choose

A method for choosing between Vercel AI SDK, self-hosted Dify, and a custom Claude Agent SDK build for the 18,000 weekly statusvragen (shipment-status questions) a sub-€20M Dutch 3PL gets.

Jacob Molkenboer · Founder · A Brand New Company · 16 Oct 2025 · 6 min

It is 16:47 on a Friday in Veghel. The customer service team at a sub-€20M Dutch 3PL has 312 unanswered emails, most of them four words long: waar is mijn pakket ("where is my parcel"). The dispatcher has stopped picking up because every call is the same question. The operations lead has a Slack DM from the CEO that says, in full, "can we just put a bot on this." By Monday they want to know which stack to build on. Not which is best in the abstract. Which is best for them.

This is the method we use to get them an answer in a single afternoon. It scores three things only: per-zending (per-shipment) cost at their real volume, AVG-defensible logging (the AVG being the Dutch name for the GDPR), and who pages whom at 03:00 when DHL Parcel rotates a SOAP endpoint.

The three stacks on the table

For a sub-€20M Dutch logistics SME handling between 12,000 and 25,000 statusvragen a week, there are three realistic options. We list them as they are usually pitched, not as they perform.

Vercel AI SDK on Postgres

One Next.js app, the Vercel AI SDK for model calls, a Postgres row per zending, a webhook from the FMS. Hosted on Vercel, deployed on every git push. The default for any team that has shipped a website in the last three years.

Self-hosted Dify with Qdrant

Dify orchestrates the agent, Qdrant stores embeddings of the standard answers, and everything runs under Docker Compose on a Hetzner box in Falkenstein. No vendor lock-in. Two senior engineers, a maintenance budget, and a Sunday evening sometimes spent on the box.

Custom Claude Agent SDK wired into Transics

A purpose-built agent on the Claude Agent SDK, talking directly to the Transics FMS, with a thin SOAP bridge to DHL Parcel and PostNL for carrier handoffs. Lives on the client's existing infra. Reads the FMS in near real time. Two weeks of build, one ongoing retainer.

The three things that actually decide it

Every stack will technically work, so whether it works is not the question. We score on three measurable lines instead.

Per-zending cost at 18,000 weekly statusvragen

The vendor pitch is per-message or per-seat. The reality is per-zending: how many euros a single shipment's worth of customer questions costs you across model calls, hosting, retrieval, and the engineer time to keep it running. A statusvraag usually triggers two to four model turns once you account for clarifications and carrier lookups. At 18,000 weekly statusvragen, that is between 36,000 and 72,000 model turns per week, or roughly 470,000 to 940,000 per quarter. Numbers that size sound smaller than they are. Anthropic's published API rates give you a concrete starting figure for the raw model spend. Real cost lands at two to three times that raw spend once retrieval, logs, and human review get added.
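The arithmetic fits in a few lines. A minimal sketch, assuming a placeholder price per model turn (`eur_per_turn` is an illustrative number, not a published rate) and assuming one statusvraag per zending until your own traffic says otherwise. All names are ours, invented for the example.

```python
from dataclasses import dataclass

WEEKS_PER_QUARTER = 13


@dataclass(frozen=True)
class CostInputs:
    weekly_questions: int         # statusvragen per week
    turns_per_question: float     # model turns per statusvraag (2 to 4 in the text)
    eur_per_turn: float           # ASSUMPTION: placeholder, not a published rate
    overhead_multiplier: float    # retrieval, logs, human review (2x to 3x in the text)
    questions_per_zending: float  # ASSUMPTION: measure this from your own traffic


def weekly_turns(c: CostInputs) -> float:
    """Model turns per week at the given volume."""
    return c.weekly_questions * c.turns_per_question


def quarterly_turns(c: CostInputs) -> float:
    """Model turns per 13-week quarter."""
    return weekly_turns(c) * WEEKS_PER_QUARTER


def per_zending_eur(c: CostInputs) -> float:
    """Loaded cost of one shipment's worth of questions, in euros."""
    raw_per_question = c.turns_per_question * c.eur_per_turn
    return raw_per_question * c.overhead_multiplier * c.questions_per_zending


# Low and high ends of the ranges quoted in the text, same placeholder price.
low = CostInputs(18_000, 2, 0.005, 2.0, 1.0)
high = CostInputs(18_000, 4, 0.005, 3.0, 1.0)

for label, c in (("low", low), ("high", high)):
    print(f"{label}: {weekly_turns(c):,.0f} turns/week, "
          f"{quarterly_turns(c):,.0f} turns/quarter, "
          f"EUR {per_zending_eur(c):.3f} per zending")
```

Engineer hours are not modelled separately here; either fold them into the overhead multiplier or add a fixed monthly term. Swap in the per-turn figure from your provider's price list before trusting any output.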

AVG-defensible logging

The Autoriteit Persoonsgegevens (AP, the Dutch data protection authority) does not care about your stack. It cares whether you can show, on demand, who saw which adresgegevens (address data) for how long, and whether the model provider had a verwerkersovereenkomst (data processing agreement) in place when it did. The AP's guidance on cloud processing is the document any 3PL DPO will quote back at you. The question to score is not "is this stack secure." It is whether your operations lead can pull a customer's full conversation history, the names of everyone who answered, and a list of which data left the EU, inside five working days, without help from a vendor.
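One way to make that five-day question answerable is to log every access as a row that already contains the answer. The sketch below is our illustration, not a schema the AP prescribes. The field names, the coarse region labels, and the `audit_report` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AccessEvent:
    customer_id: str
    conversation_id: str
    actor: str                       # person who answered, or "agent" for the model
    fields_seen: Tuple[str, ...]     # e.g. ("adresgegevens",)
    started_at: datetime
    ended_at: datetime
    processor: Optional[str]         # model provider that received the data, if any
    processor_region: Optional[str]  # ASSUMPTION: coarse label such as "EU" or "US"
    dpa_signed: bool                 # verwerkersovereenkomst in place at that moment


def audit_report(events: List[AccessEvent], customer_id: str) -> Dict[str, object]:
    """Collect what a DPO would ask for about one customer."""
    mine = sorted(
        (e for e in events if e.customer_id == customer_id),
        key=lambda e: e.started_at,
    )
    # Total time each actor had the customer's data in view.
    seconds_by_actor: Dict[str, float] = {}
    for e in mine:
        duration = (e.ended_at - e.started_at).total_seconds()
        seconds_by_actor[e.actor] = seconds_by_actor.get(e.actor, 0.0) + duration
    # Only events where data was handed to an external processor.
    sent_out = [e for e in mine if e.processor is not None]
    return {
        "conversations": sorted({e.conversation_id for e in mine}),
        "seconds_by_actor": seconds_by_actor,
        "left_eu": [e for e in sent_out if e.processor_region != "EU"],
        "missing_dpa": [e for e in sent_out if not e.dpa_signed],
    }
```

If the operations lead can run something like `audit_report(events, "customer-123")` without calling a vendor, the AVG row of the scoring sheet has a defensible number in it.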

Who patches the SOAP bridge at 03:00

This one decides more deployments than any benchmark. DHL Parcel rotates its tracking endpoint roughly twice a year, usually overnight on a Sunday, with the deprecation notice buried in a portal nobody on your team has the login for. PostNL does the same on a different cadence. The question is not whether your stack supports SOAP. The question is whose phone rings when the bridge starts returning 503s at 03:14 on a Monday morning and the operations lead wakes up to 800 angry tracking emails.
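Whose phone rings can be made explicit in code. A minimal sketch, assuming the bridge reports an HTTP status per carrier call and that you have some pager hook to call. `BridgeWatchdog`, its threshold, and the `page` callback are hypothetical names for illustration, not part of any carrier or vendor API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BridgeWatchdog:
    owner: str                         # the one named person or rota that gets paged
    page: Callable[[str, str], None]   # hypothetical pager hook: (owner, message)
    failure_threshold: int = 3         # ASSUMPTION: tune to your own traffic
    _consecutive_failures: int = 0
    _paged: bool = False

    def record(self, status_code: int) -> None:
        """Feed in the status of each carrier call; page once per outage."""
        if 200 <= status_code < 300:
            # Recovery resets the counter so the next outage pages again.
            self._consecutive_failures = 0
            self._paged = False
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold and not self._paged:
            self.page(
                self.owner,
                f"carrier bridge: {self._consecutive_failures} consecutive "
                f"failures, last status {status_code}",
            )
            self._paged = True


if __name__ == "__main__":
    watchdog = BridgeWatchdog(owner="on-call engineer", page=lambda o, m: print(o, "<-", m))
    for status in (200, 503, 503, 503, 503):
        watchdog.record(status)
```

The design choice that matters is the single `owner` field. If you cannot fill it in with one name or one rota, the scoring line is already answered.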

How the three stacks score

Run the same client brief through all three lenses and the picture sharpens.

Vercel AI SDK on Postgres wins on day-one cost. One engineer ships the MVP in a week. The hosting bill is negligible at this volume. It collapses on the AVG line if the team has not separately signed a verwerkersovereenkomst with each model provider, and most teams have not. It collapses on the SOAP line because the 03:00 call goes to the freelancer who built it, who may not work for you anymore.

Dify on Qdrant wins on data control. Everything stays on your Hetzner box. The AVG line is straightforward to defend. The per-zending cost is higher than it looks because two senior engineers cost more per year than the entire Vercel stack will. The SOAP line goes to those same two engineers, who now own every dependency in the compose file.

A custom Claude Agent SDK build wired into the FMS wins on the SOAP line, because the on-call rotation is a line item in the retainer, and the bridge sits in a known place with a known owner. The per-zending cost lands between the other two. The AVG line is defensible if the build uses EU model routing and the logging schema is designed for verwerkersovereenkomst questions from the start, which it should be.

The scoring sheet we actually send

The version we hand to the operations lead is a single page. Three rows, three columns, with a real number in each cell. Not a star rating. A number with a unit and a footnote that says where it came from.

                                Vercel AI SDK   Dify + Qdrant   Custom Agent SDK
Per-zending cost (€)            0.04 to 0.07    0.11 to 0.18    0.06 to 0.09
AVG audit pull (working days)   5 to 10         1 to 2          2 to 3
SOAP on-call owner              your eng        your eng        retainer line

The numbers shift per client. The method does not. Anchor each cell in something the operations lead can verify before lunch.
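The "number with a unit and a footnote" rule is easy to enforce mechanically. A small sketch, assuming the sheet is kept as nested dictionaries. The `Cell` type and the `unverifiable_cells` helper are hypothetical, and the source strings in the usage example are placeholders for your own footnotes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Cell:
    value: str   # e.g. "0.04 to 0.07"
    unit: str    # e.g. "EUR per zending"
    source: str  # footnote: where the number came from


def unverifiable_cells(sheet: Dict[str, Dict[str, Cell]]) -> List[str]:
    """Return 'row / stack' labels for cells missing a value, unit, or source."""
    problems: List[str] = []
    for row, columns in sheet.items():
        for stack, cell in columns.items():
            if not (cell.value.strip() and cell.unit.strip() and cell.source.strip()):
                problems.append(f"{row} / {stack}")
    return problems


if __name__ == "__main__":
    sheet = {
        "Per-zending cost": {
            "Vercel AI SDK": Cell("0.04 to 0.07", "EUR per zending", "placeholder: last month's invoices"),
            "Dify + Qdrant": Cell("0.11 to 0.18", "EUR per zending", ""),  # no footnote yet
        },
    }
    print(unverifiable_cells(sheet))
```

A cell that shows up in the returned list is a star rating in disguise, and the operations lead should send it back.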

Warning

"Self-hosted" does not mean "AVG-defensible." If the model calls go to a US endpoint, the verwerkersovereenkomst question still applies, and your DPO still has to answer it. Hosting Dify in Falkenstein does not change where the inference happens.

The HN thread you have probably already read

This week's Hacker News front page had a 1,000-comment Ask HN about replacing Claude or GPT with a local model for daily coding. It is a useful read and a misleading one for this decision. The honest answer in the thread is that local models are getting close enough for an IDE assistant. The honest answer for a customer-facing logistics chat agent is that a statusvraag has a much smaller error budget than your editor does, and the carrier-side SOAP rotations have nothing to do with which model you picked. Pick the model conversation last. Pick the on-call and the AVG schema first.

What this gets you on Monday morning

The reason we do the scoring on a single page is that the operations lead has to take it into a Monday standup where the CEO will ask one question: "can we just do it." The page lets them answer "yes, on this stack, for this reason, and here is who answers the phone at 03:00." Anything longer gets re-litigated.

When we built the track-and-trace agent for a Brabant 3PL last winter, the model choice was the easy part: the existing SOAP bridge to DHL Parcel had three different timeout values across three services and no clear owner, so the first endpoint rotation took the agent down for six hours. We rebuilt the bridge as a single service with one timeout and one on-call engineer, which is the shape of work we ship when we build AI agents for Dutch operations teams.

The smallest thing you can do today: open a shared doc, write the three scoring rows above, and try to fill in the third column for your own volume. If you cannot put a real euro number in the per-zending cell by lunch, you do not yet know enough about your own traffic to choose a stack.

Key takeaway

Score a chat agent stack on per-zending cost, AVG defense, and who answers the 03:00 SOAP bridge alarm. The model choice comes last.

FAQ

What does per-zending cost actually include?

Model turns, retrieval, hosting, logging, plus the engineer hours to keep the bridge alive. A statusvraag is two to four turns, not one, once clarifications and carrier lookups are counted.

Is a self-hosted Dify deployment automatically AVG-safe?

No. The host location does not decide the verwerkersovereenkomst question. What your DPO has to defend is where the model inference happens. EU-only routing has to be configured, not assumed.
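The distinction between where the box sits and where the inference runs can be written down as a pre-deployment check. This is a sketch under our own assumptions: the `Deployment` fields and the coarse "EU" and "US" region labels are hypothetical, and a real check would read them from your actual provider configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Deployment:
    name: str
    host_region: str       # where the orchestration box runs, e.g. "EU"
    inference_region: str  # where the model calls are actually served
    dpa_signed: bool       # verwerkersovereenkomst with the model provider


def open_avg_questions(d: Deployment) -> List[str]:
    """List the questions a DPO would still have to answer for this deployment."""
    questions: List[str] = []
    # The host region is deliberately ignored: it does not settle anything.
    if d.inference_region != "EU":
        questions.append(f"{d.name}: inference runs in {d.inference_region}, not the EU")
    if not d.dpa_signed:
        questions.append(f"{d.name}: no verwerkersovereenkomst with the model provider")
    return questions


if __name__ == "__main__":
    dify = Deployment("dify-falkenstein", host_region="EU", inference_region="US", dpa_signed=False)
    for question in open_avg_questions(dify):
        print(question)
```

Note that `host_region` never appears in the check. That is the point of the FAQ answer: a box in Falkenstein with a US inference endpoint still produces two open questions.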

Why is the SOAP bridge a separate scoring line?

Because DHL Parcel and PostNL rotate endpoints on their own schedule, usually overnight, and the on-call owner of that bridge decides whether your agent survives Monday morning.

When does the Vercel AI SDK stack actually win?

When the team has an in-house engineer who will own the on-call, the volume stays under about 8,000 statusvragen a week, and the customer data never leaves the EU through the model provider.
