WhatsApp chat agent for trucking: a 45-second handoff

A driver in a turnout outside Venray sends one WhatsApp line at 04:53. By 04:54:38 the load is logged, the customer is notified, and dispatch is asleep. Here's how.

Jacob Molkenboer · Founder · A Brand New Company · 29 Jul 2024 · 9 min

The driver pulls into a gravel turnout north of Venray, kills the engine, and types one line into WhatsApp: "load 88421 collected, scan coming." The timestamp is 04:53. By 04:54:38 the broker's TMS shows the pickup logged, the customer has been notified, the next leg is queued in the driver's chat, and the CMR photo is OCR'd and filed against consignment 88421. No dispatcher has woken up yet. The first one on shift starts at 05:30.

The broker runs 60 tractors out of Venlo, mostly NL-to-NRW lanes with a handful of runs into Bavaria and back-haul work into Greater Poland. Before we got the call, fourteen dispatchers handled inbound driver traffic across WhatsApp, phone, and email. Two had quit because their evening shifts ran past midnight just clearing the queue. The owner described it as "the WhatsApp problem." He was right to call it that. Phone is dying in road freight. Email never caught on. Drivers of every nationality on the floor send everything through WhatsApp.

This is what we built, why the 45-second number is real, and the parts that almost broke it.

The 04:53 problem

A broker's job is to match loads to trucks and keep both sides happy. The dispatcher's day is a thousand micro-decisions. A driver says he's running 40 minutes late, and dispatch has to know whether that breaks the slot at Aldi Geseke or only nudges it. A driver collects a load and the customer needs a confirmation in the portal within an hour. A driver opens his trailer at the consignee and the seal number doesn't match the CMR. Each of those starts as a WhatsApp message.

The shape of the inbox is bimodal. About 80% of messages are routine state updates that map cleanly to a handful of TMS actions. The other 20% is anything from a wrong-document scan to a driver telling dispatch the load on the trailer is not the load on the paperwork. The first bucket should never touch a human. The second bucket should reach a human in seconds, not in the 14 minutes the broker was averaging.

That's the shape of the agent we shipped.

Five intents, mapped to five TMS actions

We spent the first week reading 2,300 archived messages from the broker (with consent, anonymised at ingest) and clustering them by intent. The long tail is real, but the head is short:

  • Load acceptance. "loaded 88421", "collected", "got it", with or without a reference.
  • ETA update. "traffic A2, +40", "vertraging brücke leverkusen" (delay at the Leverkusen bridge), "stuck at consignee".
  • Document upload. A photo of a CMR, POD, or weigh-ticket, sometimes with a one-word caption.
  • New-load response. The agent offered a back-haul, the driver says yes, no, or asks a clarifying question.
  • Dispute or anomaly. Anything that doesn't match the above, weighted heavily toward "this load is wrong".

The first four are deterministic round-trips with the broker's TMS, a custom Symfony stack with a thin REST layer we'd already mapped. The fifth is the one with teeth.

What's actually under the hood

The agent is small, on purpose. We did not build a general agent with a sandbox and a planning loop. Trucking is too narrow and too unforgiving for that.

Inbound: Meta's WhatsApp Cloud API hits a webhook on our Hetzner box, which writes the message to Postgres and pushes a job onto a Redis queue. A worker pulls the job, hydrates the driver's context (last 20 messages, current load, truck assignment, language preference, dialect notes), and calls a classifier. The classifier is an LLM with a system prompt of about 700 tokens and a JSON-schema response. No tools, no recursion. It returns one of the five intents plus the extracted slots.
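A minimal sketch of what validating the classifier's JSON reply could look like, written in Python for brevity. The intent labels, the slot keys, and the `parse_classification` helper are illustrative assumptions, not the production schema.

```python
import json
from dataclasses import dataclass, field

# Hypothetical labels for the five intents described earlier.
INTENTS = {
    "load_acceptance",
    "eta_update",
    "document_upload",
    "new_load_response",
    "dispute",
}


@dataclass
class Classification:
    intent: str
    confidence: float
    slots: dict = field(default_factory=dict)


def parse_classification(raw: str) -> Classification:
    """Parse the model's JSON reply and reject anything off-schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    intent = data.get("intent")
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    slots = data.get("slots", {})
    if not isinstance(slots, dict):
        raise ValueError("slots must be an object")
    return Classification(intent, float(confidence), slots)
```

Anything that fails this check never reaches a handler, so a malformed model reply becomes a routing problem rather than a TMS write.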

From there it's boring code. A Symfony controller per intent. The load-acceptance handler hits the TMS, marks the consignment loaded, fires the customer-facing webhook, and queues the outbound WhatsApp template. The document handler sends the image to a vision model, runs OCR for the load reference, and files it against the right consignment. Boring is the point.
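The "one handler per intent" shape can be sketched as a plain lookup table. This is Python rather than the Symfony code that runs in production, and the handler names, slot keys, and returned action strings are hypothetical stand-ins for real TMS and WhatsApp calls.

```python
from typing import Callable


# Each handler takes the extracted slots and returns a description of the
# action it would take. Real handlers would call the TMS and WhatsApp APIs.
def handle_load_acceptance(slots: dict) -> str:
    return f"tms:mark_loaded:{slots['load_ref']}"


def handle_eta_update(slots: dict) -> str:
    return f"tms:update_eta:{slots['load_ref']}:+{slots['delay_minutes']}"


def handle_document_upload(slots: dict) -> str:
    return f"ocr:file_document:{slots['media_id']}"


def handle_new_load_response(slots: dict) -> str:
    return f"tms:offer_response:{slots['offer_id']}:{slots['answer']}"


# The dispute intent is deliberately absent: it goes to a human, not a handler.
HANDLERS: dict[str, Callable[[dict], str]] = {
    "load_acceptance": handle_load_acceptance,
    "eta_update": handle_eta_update,
    "document_upload": handle_document_upload,
    "new_load_response": handle_new_load_response,
}


def dispatch(intent: str, slots: dict) -> str:
    """Route a classified message to its handler, or to a human queue."""
    handler = HANDLERS.get(intent)
    if handler is None:
        return "human_queue"
    return handler(slots)
```

The point of the table is that adding an intent means adding one function and one dictionary entry, with no branching logic to reason about.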

Warning

WhatsApp's 24-hour customer service window will silently break your replies if you treat the API like SMS. Outbound messages to a driver who hasn't messaged you in 24 hours must use a pre-approved template, not free-form text. We learned this on a Sunday at 11pm when the queue choked on 180 driver follow-ups.
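One way to guard against this is to decide the message type before anything is queued. A minimal Python sketch, assuming you store the timestamp of each driver's last inbound message; the function name and return labels are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SERVICE_WINDOW = timedelta(hours=24)


def outbound_kind(last_inbound_at: Optional[datetime],
                  now: Optional[datetime] = None) -> str:
    """Return "free_form" inside the service window, "template" outside it."""
    now = now or datetime.now(timezone.utc)
    # No inbound message on record, or the window has lapsed: template only.
    # The boundary itself is treated as closed, to stay on the safe side.
    if last_inbound_at is None or now - last_inbound_at >= SERVICE_WINDOW:
        return "template"
    return "free_form"
```

Running this check at enqueue time, rather than at send time, means a lapsed window surfaces as a template choice instead of a failed send.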

How a dispute lands on dispatch in 45 seconds

This is the part the owner cared about. The other intents are nice to automate. Disputes are where money leaks.

When the classifier returns intent: dispute with confidence above 0.6, the agent does four things in parallel:

  1. Replies to the driver: "Begrepen. Eric van dispatch komt erbij, max 5 min." ("Understood. Eric from dispatch is joining, 5 minutes max.") The driver knows a human is on it. They stop sending the same message in three different forms.
  2. Builds a structured handoff packet: driver name, truck, load ref, the dispute summary in one sentence, the last six messages verbatim, any photos, the customer contact, the consignee opening hours, and the broker's margin on the load.
  3. Posts the packet to the dispatch room in Microsoft Teams, with an @-mention of whoever is on the rota for that lane in that hour.
  4. Opens a TMS case with the same packet stapled to the consignment, so the dispatcher's next click goes to the right screen.
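The fan-out above can be sketched with `asyncio.gather`. This is a simplified Python illustration: the packet carries only a subset of the fields listed, the three coroutines are stubs that return a label instead of calling WhatsApp, Teams, or the TMS, and the packet is built first because the other steps need it. All names are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field

DISPUTE_THRESHOLD = 0.6  # the confidence cut-off quoted in the text


@dataclass
class HandoffPacket:
    driver: str
    truck: str
    load_ref: str
    summary: str
    last_messages: list[str]
    photos: list[str] = field(default_factory=list)


def build_packet(driver: str, truck: str, load_ref: str,
                 summary: str, messages: list[str]) -> HandoffPacket:
    # Keep only the last six messages, verbatim.
    return HandoffPacket(driver, truck, load_ref, summary, messages[-6:])


async def reply_to_driver(packet: HandoffPacket) -> str:
    return f"whatsapp:ack:{packet.driver}"


async def post_to_teams(packet: HandoffPacket) -> str:
    return f"teams:post:{packet.load_ref}"


async def open_tms_case(packet: HandoffPacket) -> str:
    return f"tms:case:{packet.load_ref}"


async def hand_off(packet: HandoffPacket, confidence: float) -> list[str]:
    """Run the three outbound steps concurrently for a confident dispute."""
    if confidence <= DISPUTE_THRESHOLD:
        return []  # the caller falls back to the human queue
    results = await asyncio.gather(
        reply_to_driver(packet),
        post_to_teams(packet),
        open_tms_case(packet),
    )
    return list(results)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the caller can log each step's outcome without extra bookkeeping.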

The on-call dispatcher reads one Teams post and has the whole situation. They don't scroll WhatsApp. They don't ask the driver to repeat himself. They open the TMS, see the packet, and decide. Median time from a driver's first dispute message to a dispatcher replying on WhatsApp is 41 seconds. The 45 is the 75th percentile.

The 45-second budget, line by line

Time is the thing the owner watches. Here is how we account for it:

webhook ingest          0.4 s
queue + hydrate context 1.2 s
classifier round-trip   2.8 s
TMS write or read       1.5 s
outbound WhatsApp       0.9 s
----------------------------
routine intent total    6.8 s p50

dispute branch adds:
structured packet build 1.1 s
Teams post + @-mention  0.7 s
TMS case open           1.4 s
dispatcher first reply  ~30 s (human)
----------------------------
dispute total           41 s p50 / 45 s p75

The human still does the work. The agent just deletes the 13 minutes of latency that used to come from a dispatcher discovering the message, reading three days of context, switching tabs, and finding the right consignment.
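The arithmetic behind the table, as a short Python check. The figures are copied from the table and held in tenths of a second to avoid floating-point noise; the variable names are ours. Note that summing per-stage figures only approximates a measured percentile, and that the dispute steps run in parallel in practice, so the back-to-back sum is an upper bound on the machine portion.

```python
# Stage timings in tenths of a second, copied from the budget table.
ROUTINE = {
    "webhook ingest": 4,
    "queue + hydrate context": 12,
    "classifier round-trip": 28,
    "TMS write or read": 15,
    "outbound WhatsApp": 9,
}
DISPUTE_EXTRA = {
    "structured packet build": 11,
    "Teams post + @-mention": 7,
    "TMS case open": 14,
}
HUMAN_REPLY = 300  # "~30 s (human)"

routine_total = sum(ROUTINE.values()) / 10
machine_extra = sum(DISPUTE_EXTRA.values()) / 10
dispute_total = (sum(ROUTINE.values()) + sum(DISPUTE_EXTRA.values()) + HUMAN_REPLY) / 10

print(f"routine:        {routine_total} s")
print(f"dispute extras: {machine_extra} s")
print(f"dispute total:  {dispute_total} s")
```

The stages add up to 6.8 s for a routine intent and roughly 40 s for a dispute, which sits just under the measured 41 s median. About three quarters of the dispute budget is the human reading and replying.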

The parts that almost broke it

None of this was as clean as the architecture diagram. Three things tripped us in production.

Polish drivers mixing languages. The classifier was tuned on the broker's archive, which leans Dutch and German. A Polish driver wrote "loaded ale brak dokumentu" ("loaded, but the document is missing"), mixing Polish with English logistics jargon, and the classifier punted to dispute. False positives waste the dispatcher's time exactly as badly as false negatives waste the driver's. We added a few-shot block of code-switched multilingual examples, retuned the confidence threshold per language, and the false-dispute rate dropped from 7% to under 1%.
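A per-language threshold can be as small as a lookup with a default. In this Python sketch the language codes and every threshold value other than the 0.6 default are invented for illustration; real values would come from measuring false-dispute rates per language.

```python
DEFAULT_THRESHOLD = 0.6

# Hypothetical per-language thresholds. Only the 0.6 default comes from the
# text; the Polish value is an invented example of a stricter cut-off.
THRESHOLDS = {"nl": 0.6, "de": 0.6, "pl": 0.75}


def route(intent: str, confidence: float, language: str) -> str:
    """Accept the classified intent only above the language's threshold."""
    threshold = THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    if confidence > threshold:
        return intent
    return "human_queue"
```

With these example values, a dispute classified at 0.7 confidence is accepted for a Dutch message but goes to the human queue for a Polish one.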

Photo-only messages with no caption. Drivers send a photo of a CMR and nothing else. The agent has to infer the consignment from the photo, not from the message. We initially asked the LLM to do that in one shot. It hallucinated load references that looked plausible but didn't exist. We switched to a two-step pipeline: OCR the visible text, then look up the load reference against the driver's currently active consignments. If there's no match, the agent replies "welk loadnummer?" ("which load number?") instead of guessing.
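The second step of that pipeline is a lookup, not a guess. A Python sketch, assuming load references are five-digit numbers like 88421 and that the OCR text and the driver's active consignments are already in hand; the regex and function names are illustrative.

```python
import re
from typing import Optional

# Assumption: load references are standalone five-digit numbers, e.g. 88421.
LOAD_REF = re.compile(r"\b\d{5}\b")


def match_consignment(ocr_text: str, active_refs: list[str]) -> Optional[str]:
    """Return the one active load reference found in the OCR text, if any."""
    candidates = set(LOAD_REF.findall(ocr_text))
    matches = candidates & set(active_refs)
    if len(matches) == 1:
        return matches.pop()
    return None  # no match, or more than one: do not guess


def reply_for_photo(ocr_text: str, active_refs: list[str]) -> str:
    ref = match_consignment(ocr_text, active_refs)
    if ref is None:
        return "welk loadnummer?"
    return f"filed:{ref}"
```

Because a candidate only counts if it is already among the driver's active consignments, a number that merely looks like a load reference (a weight, say) cannot be filed by accident.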

The 24-hour service window. See the callout above. Templates are a tax you pay for the privilege of running on WhatsApp's rails. Plan your template inventory before you ship, not after. We now keep eleven approved templates covering the broker's common outbound patterns, including the one Eric sends from his Teams sidebar when he wants to take a dispute private.

Why a small agent beats a big one here

There is a fashion right now for general-purpose agents that plan, browse, and self-correct. It works well in some places. In a logistics broker's WhatsApp inbox, it would be a liability. The cost of a wrong booking is real money. The cost of a hallucinated load reference is a driver standing at the wrong warehouse at 03:00.

Any model is sometimes wrong. The job of the system around it is to make sure a wrong answer doesn't reach production unchecked. For us that means short prompts, JSON schemas, deterministic handlers, and a human in the loop the moment confidence drops. The model is a component. If the model is the system, the system is only as good as the model's worst day.

Numbers, four weeks in

The broker measures three things and only three things.

Dispatcher time spent on WhatsApp per day, across the team, dropped from 38 hours to 9. The rota has two fewer night shifts. Driver complaint volume against dispatch, measured by the broker's quarterly driver survey, fell by 31% (a sample of 47 drivers, so weigh it accordingly). Average time from a driver's first dispute message to a dispatcher's first reply on WhatsApp went from 14 minutes 22 seconds to 41 seconds.

One number that didn't move: booking accuracy. We were watching for any drift in load assignments. There is none. The agent doesn't book loads it isn't certain about. It asks.

What this means if you are a broker reading this at 11pm

You probably don't need a general agent. You need a small, narrow one that handles the head of your inbox cleanly and hands the tail to a human with all the context already loaded. That is a four-week build, not a four-quarter one, if you have a TMS with an API and an owner who can sit with us for an afternoon and label messages.

When we built this agent for the Venlo broker, the part that almost killed the project wasn't the model. It was the TMS returning a 200 with a Dutch error string in the body for half the failure modes, which we solved by writing a thin adapter that parses the body and treats those as the failures they are. The longer version of how we approach AI agents like this one lives on the services page.
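The shape of such an adapter, sketched in Python. The Dutch marker strings here are guesses at what an error body might contain, not the TMS's actual messages, and `check_tms_response` is a hypothetical name. Plain substring matching is also cruder than a real adapter would want.

```python
class TmsError(Exception):
    """Raised when the TMS reports a failure, whatever the status code."""


# Hypothetical markers: Dutch for "error", "failed", "not found", "invalid".
ERROR_MARKERS = ("fout", "mislukt", "niet gevonden", "ongeldig")


def check_tms_response(status_code: int, body: str) -> str:
    """Return the body on success; raise TmsError on any kind of failure."""
    if status_code != 200:
        raise TmsError(f"HTTP {status_code}")
    lowered = body.lower()
    for marker in ERROR_MARKERS:
        if marker in lowered:
            # A 200 with an error string in the body is still a failure.
            raise TmsError(f"TMS error in body: {body[:80]}")
    return body
```

Every handler then calls the TMS through this one function, so a disguised failure raises an exception instead of being logged as a successful write.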

The smallest thing you can do today: open your dispatch WhatsApp, scroll back 24 hours, and count how many distinct intents you see. If the top five cover more than 75% of the volume, you have a chat-agent-shaped problem.
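If you label those messages in a spreadsheet, the count is a few lines of Python. The label strings are whatever you choose; `top_n_coverage` is a hypothetical helper name.

```python
from collections import Counter


def top_n_coverage(labels: list[str], n: int = 5) -> float:
    """Fraction of messages covered by the n most common intent labels."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(labels)
```

Pass it the list of labels from your 24-hour sample and compare the result against 0.75.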

Key takeaway

Small narrow agents beat big general ones in logistics. Hand the messy 20% to a human with all the context already loaded, and you delete minutes, not seconds.

FAQ

Why WhatsApp and not a custom driver app?

Drivers already have WhatsApp. Forcing them onto a new app means training, support tickets, and refusal. Brokers that try this almost always end up running both channels and doubling their inbox.

What happens if the classifier misreads a message?

Low-confidence classifications fall through to a human queue with the full context attached, so a dispatcher sees the message and can correct the routing. False positives are logged and fed back into the few-shot examples weekly.

How long did the build take end to end?

Four weeks from kickoff to production for the routine intents, plus two more weeks of hardening for the dispute branch and the multilingual edge cases. The TMS adapter was the longest single piece of work.

Does the agent ever book a load on its own?

Yes, for the new-load response intent against the driver's already-assigned shortlist. It will not book anything the dispatcher has not pre-cleared for that driver and that day. The boundary is enforced in the TMS, not in the prompt.
