CMR + ADR automation: a 45-second escalation playbook
Tuesday 06:47: 614 unread emails, one buried ADR class 3 leak, and an ILT clock already ticking. Here is the agent we built so it never happens again.

Tuesday, 06:47. The wagenparkbeheerder (fleet manager) at a 25-truck transportbedrijf (haulage company) just outside Amersfoort opens his Outlook and counts 614 unread items. Eleven are CMR-vrachtbrieven (CMR consignment notes) scanned by drivers who finished a shift between 22:00 and 04:00. Three are from the night planner asking him to confirm which trailer is parked where. One, at position 387, is a forwarded WhatsApp message from a driver reporting a slow drip from an IBC of ethanol-based cleaner under their truck on the A27. ADR class 3. The legal clock to notify the Inspectie Leefomgeving en Transport (ILT) started ticking the moment the driver noticed it.
He has, technically, until early afternoon to file it.
He sees the email at 09:18.
This post is the playbook we used to stop that from happening. The client is a real Amersfoort transportbedrijf that runs 25 trucks across Benelux and the Ruhrgebiet, processes around 3,200 CMR-vrachtbrieven a week, and is bound by the Wet vervoer gevaarlijke stoffen (Wvgs) every time an ADR consignment touches a Dutch road. We will not name them. Everything else is from the build log.
The constraint that defines everything
Under Wvgs and the underlying ADR Agreement, a transport company has to report a significant escape of a dangerous substance to ILT promptly. The clock is short. The operator has hours, not days, and the ILT inspector who calls back will ask when the company first knew. "Buried in the wagenparkbeheerder's inbox until lunchtime" is not an answer that ends well.
That single regulatory clock is what made the project worth doing. Everything else (the CMR scanning backlog, the planner-rooster (planning roster) mess, the fact that one fitter spends every Friday reconciling Transics data with a printed week schedule) is solvable with patience. The reporting clock is solvable with software, or not at all.
Two systems, neither talking to each other
The fleet runs on Transics, the Belgian telematics product now owned by ZF. The install is from 2013. It tracks position, drive/rest hours, fuel, and a thin layer of trip metadata. It does not know what is on the truck. The shipping documents (CMR-vrachtbrieven, ADR-bijlagen (ADR annexes), weigh tickets) live on paper, then as PDF scans in a shared mailbox.
The planner-rooster is a homegrown Outlook + Exchange 2016 setup. Each truck is a calendar; each shift is a meeting; the dispatcher drags meetings between calendars to reassign work. It is the kind of system you build in a weekend in 2014 and run for twelve years because it works. We were not going to replace it. The point of the project was to read it.
So the agent had to live in the gap. Inputs: the shared Exchange mailbox, the Transics REST feed, and the calendar API. Outputs: a Slack-like escalation queue for the wagenparkbeheerder, a quiet "this CMR is filed and matched" log for the office, and a structured row in a Postgres ledger for the ILT auditor who shows up six months later.
The agent shape
We did not build one big agent. We built four small ones and a router.
- Inbox-reader polls Exchange every 30 seconds, pulls new messages, runs OCR over PDF attachments, and hands a normalised JSON envelope to the router.
- Classifier decides what the message is: routine CMR scan, planner question, breakdown, ADR-relevant melding. Two-stage: deterministic keyword sieve first, LLM second.
- Transics-correlator takes a truck plate and a timestamp, fetches the last known position, current driver, and active trip from the Transics API.
- Escalator takes anything the classifier marks as ADR-3 leakage, opens a ticket in a dedicated queue with a 45-second SLA, and triggers a pager call to whoever is on duty.
The router is fifty lines of Python. It is not an LLM. We learned, painfully, on an earlier project that "let the model decide which tool to call" is a great demo and a fragile production pattern. The principle that survives in production is the one experienced builders keep landing on: deterministic edges, generative nodes. Use the model where ambiguity is genuine. Use code where it is not.
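A router in that spirit can be sketched as a lookup table from classifier label to handler. This is a minimal illustration, not the production code: the `Envelope` fields, the label names, and the handler functions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Envelope:
    """Normalised message from the inbox-reader (field names are assumed)."""
    message_id: str
    body: str
    label: str = "unclassified"  # filled in by the classifier


def handle_adr_leak(env: Envelope) -> str:
    return f"escalate:{env.message_id}"


def handle_cmr_scan(env: Envelope) -> str:
    return f"file:{env.message_id}"


def handle_for_human(env: Envelope) -> str:
    # Unknown labels are never dropped; a person looks at them.
    return f"manual_review:{env.message_id}"


# The edges are plain data: no model chooses which handler runs.
ROUTES: Dict[str, Callable[[Envelope], str]] = {
    "adr3_leak": handle_adr_leak,
    "cmr_scan": handle_cmr_scan,
}


def route(env: Envelope) -> str:
    """Dispatch on the label; anything unrecognised goes to manual review."""
    handler = ROUTES.get(env.label, handle_for_human)
    return handler(env)
```

The model only ever produces the label. Which code runs for that label is fixed in the table, so a strange model output can at worst land in manual review.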
Classifying ADR-3 from the noise
Around one in four hundred inbound messages turns out to be ADR-relevant in any given week, which at roughly 3,200 messages is about eight. Of those, maybe one in eight is a class-3 leak candidate rather than a paperwork question. So we are looking for roughly one needle per week in a 3,200-item haystack.
The cheap sieve runs first. It is a Dutch keyword classifier that does not need to be smart, only paranoid.
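    # adr_triage.py - first-pass keyword sieve, runs before the LLM

    # Common ADR class 3 (flammable liquid) names and UN numbers.
    # Everything is stored lowercase because the input is lowercased
    # before matching; uppercase "UN1170" would never match.
    ADR3_SUBSTANCES = {
        "ethanol", "benzine", "diesel", "aceton", "wasbenzine",
        "spiritus", "thinner", "white spirit", "petroleum",
        "un1170", "un1202", "un1090", "un1219", "un1263",
    }
    LEAK_VERBS = {"lekt", "lekken", "lekkage", "druipt", "drupt",
                  "loopt leeg", "spuit", "drip", "leak", "leaking"}


    def looks_like_adr3_leak(text: str) -> bool:
        """Return True when the text names a class 3 substance and a leak term."""
        t = text.lower()
        # Plain substring checks: paranoid on purpose, false positives are fine.
        substance = any(s in t for s in ADR3_SUBSTANCES)
        leak = any(v in t for v in LEAK_VERBS)
        return substance and leak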
Anything that trips the sieve goes to the LLM with the full message body, the OCR'd attachment, the driver's recent messages, and the truck's last GPS ping. The model returns a strict JSON object: { adr_class, event_type, substance, confidence, recommended_action }. Any result with confidence below 0.7 escalates anyway, on the assumption that under-escalation is the failure mode that ends careers.
Do not let the model decide the threshold. The threshold is a regulatory question. Put it in code, behind a review, and date-stamp every change. The ILT auditor will ask.
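One way to keep that threshold in code is a small decision function that parses the model's JSON and fails towards escalation. This is a sketch under assumptions: the 0.7 value and the field names come from the description above, while the constant name, the function name, and the handling of malformed output are illustrative choices.

```python
import json

# Regulatory threshold: lives in code and changes only through review.
# 0.7 is the value described in the text; the constant name is illustrative.
ESCALATE_BELOW_CONFIDENCE = 0.7

REQUIRED_KEYS = {"adr_class", "event_type", "substance",
                 "confidence", "recommended_action"}


def should_escalate(raw_model_output: str) -> bool:
    """Decide escalation from the model's JSON, failing towards escalation."""
    try:
        result = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return True  # unparseable output is treated as an alarm
    if not isinstance(result, dict) or not REQUIRED_KEYS.issubset(result):
        return True  # missing fields: escalate rather than guess
    confidence = result["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True  # a non-numeric confidence cannot be trusted
    if confidence < ESCALATE_BELOW_CONFIDENCE:
        return True  # the model is unsure, so a human decides
    return result["adr_class"] == 3 and result["event_type"] == "leak"
```

Every branch that cannot be evaluated cleanly returns True. A confident non-leak is the only path that stays quiet.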
The 45-second escalation queue
Forty-five seconds is not a model latency target. It is the wall-clock budget from "Exchange surfaces the new mail" to "the wagenparkbeheerder's phone is ringing." Inside that budget sit the polling interval, the OCR pass, two classifier calls, the Transics lookup, the ticket write, and PagerDuty's own dispatch.
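The budget arithmetic is worth writing down, because the polling interval alone can consume most of it. In the sketch below only the 45-second budget and the 30-second polling interval come from this post; the per-stage timings are placeholder assumptions for illustration, not measurements from the build.

```python
SLA_SECONDS = 45            # wall-clock budget from the text
POLL_INTERVAL_SECONDS = 30  # from the text: worst-case wait before pickup

# Placeholder worst-case timings for the remaining stages. NOT measurements.
ASSUMED_STAGE_SECONDS = {
    "ocr": 4.0,
    "keyword_sieve": 0.1,
    "llm_classifier": 5.0,
    "transics_lookup": 1.5,
    "ticket_write": 0.4,
    "pager_dispatch": 3.0,
}


def headroom(sla: float, poll_interval: float, stages: dict) -> float:
    """Seconds left after the worst-case path: full poll wait plus every stage."""
    return sla - (poll_interval + sum(stages.values()))


if __name__ == "__main__":
    left = headroom(SLA_SECONDS, POLL_INTERVAL_SECONDS, ASSUMED_STAGE_SECONDS)
    print(f"worst-case headroom: {left:.1f} s")
```

With a 30-second poll, a message that arrives just after a poll waits almost the full interval, leaving about 15 seconds for everything else in the worst case. A negative headroom means the budget is broken before any stage has misbehaved.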
The routing is boring on purpose:
    # routes.yaml
    queues:
      wagenpark_urgent:
        sla_seconds: 45
        pager: pagerduty:wagenparkbeheerder
        requires_ack: true
        fallback_after_seconds: 120
        fallback_to: pagerduty:operations_lead
      planning_routine:
        sla_seconds: 900
        pager: null
    rules:
      # Field names follow the classifier's JSON output (adr_class, event_type).
      - when:
          classifier.adr_class: 3
          classifier.event_type: leak
        then:
          enqueue: wagenpark_urgent
          with_context:
            - transics.last_position
            - transics.current_driver
            - cmr.shipper
            - cmr.consignee
            - attachments.ocr_text
      - when:
          classifier.event_type: cmr_scan
        then:
          enqueue: planning_routine
The requires_ack: true line is the one that matters. If the wagenparkbeheerder does not tap "acknowledged" within 120 seconds, the ticket escalates to the operations lead and then to the company's external safety advisor. The system assumes the on-duty person is dead, asleep, or driving. It defaults to noisy.
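The acknowledge-or-fall-through behaviour can be sketched as a pure function of elapsed time. The escalation order and the 120-second window come from the description above; the identifiers are illustrative, and reusing the same 120-second window for the second hop is an assumption.

```python
from typing import Optional, Sequence

# Escalation order from the text; the identifiers themselves are illustrative.
CHAIN = ("wagenparkbeheerder", "operations_lead", "external_safety_advisor")
FALLBACK_AFTER_SECONDS = 120  # assumed to apply to every hop, not just the first


def who_to_page(seconds_since_first_page: float,
                acknowledged: bool,
                chain: Sequence[str] = CHAIN,
                step: float = FALLBACK_AFTER_SECONDS) -> Optional[str]:
    """Return who should be ringing right now, or None once someone has acked."""
    if acknowledged:
        return None
    # Each unacknowledged window moves one step down the chain, then
    # stays on the last person until a human responds.
    index = min(int(seconds_since_first_page // step), len(chain) - 1)
    return chain[index]
```

Because the function never returns None without an acknowledgement, the ticket cannot go quiet by running out of people to page.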
Three things that broke in week three
In order of severity.
First, the Transics API key rotated. Nobody told us. Transics rotates keys on a schedule that lives in a PDF in a sales engineer's inbox in Ieper. The correlator started failing silently because we had wrapped its errors in a try/except that logged a warning and proceeded. The escalator still fired, but without the GPS context the wagenparkbeheerder had to call the driver to ask where they were. We fixed it with a health check that pings Transics every five minutes and pages if the call has been failing for ten.
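A health check of that shape is small. The sketch below assumes a `probe` callable that raises on failure (for example a cheap authenticated Transics request, whose exact form is not shown here) and a scheduler that calls `tick()` every five minutes; the class and method names are hypothetical.

```python
import time
from typing import Callable, Optional


class HealthCheck:
    """Tracks how long a probe has been failing and says when to page."""

    def __init__(self, probe: Callable[[], None],
                 page_after_seconds: float = 600):
        self.probe = probe
        self.page_after_seconds = page_after_seconds  # ten minutes, per the text
        self.failing_since: Optional[float] = None

    def tick(self, now: Optional[float] = None) -> bool:
        """Run the probe once; return True if a human should be paged."""
        now = time.monotonic() if now is None else now
        try:
            self.probe()
        except Exception:
            # Remember when the failure streak started instead of only logging.
            if self.failing_since is None:
                self.failing_since = now
            return now - self.failing_since >= self.page_after_seconds
        self.failing_since = None  # any success resets the streak
        return False
```

The difference from the original try/except is that the failure is remembered across calls, so a dependency that stays broken turns into a page instead of a growing pile of warnings.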
Second, the Exchange mailbox filled up. Exchange 2016 on-prem has a mailbox quota the IT contractor had set at 50 GB in 2017. With the OCR pipeline now pulling every attachment, the mailbox was hitting cap and rejecting inbound mail on Tuesdays. The fix was to move the shared mailbox to a 200 GB archive and to tier processed attachments to S3 with a thirty-day local cache.
Third, and most embarrassing, the LLM hallucinated a UN number. A driver wrote "klein lek bij de IBC, denk dat het schoonmaakmiddel is" ("small leak at the IBC, I think it is cleaning product"). The model returned UN1170 ethanol, confidence 0.62. The substance was actually a non-ADR alkaline cleaner. The escalator fired correctly, the wagenparkbeheerder reached the driver in seventy seconds, and no real harm was done. But it was a reminder. We added a rule: if the substance is not visually identifiable from the CMR or a photo attachment, the model is required to set substance: "unknown" and let the human classify.
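A prompt rule like that is easier to trust when code checks it afterwards. The guard below is a sketch of one possible check, not the rule as deployed: it keeps the model's substance only when the same string literally appears in the documentary evidence (assumed here to be OCR text from the CMR or a photo), and otherwise overwrites it with "unknown". The function name is hypothetical.

```python
from typing import Iterable


def ground_substance(claimed: str, evidence_texts: Iterable[str]) -> str:
    """Keep the model's substance only if it literally appears in the evidence."""
    needle = claimed.strip().lower()
    if not needle or needle == "unknown":
        return "unknown"
    for text in evidence_texts:
        if needle in text.lower():
            return claimed
    # Nothing on paper or in a photo supports the claim: a human classifies.
    return "unknown"
```

The driver's free-text message is deliberately not passed in as evidence. In the week-three incident the message named no substance at all, so a UN1170 claim would have been reduced to "unknown" while the escalation still fired.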
Numbers after six weeks
Over the first six weeks of production we processed 19,114 inbound messages. The classifier flagged 47 ADR-relevant items and escalated 6 of them as class-3 leakage candidates. Four were real leaks (two minor, two requiring ILT notification). Two were false positives: one a cracked windscreen wash bottle, the other a driver describing a previous incident in the past tense. Median time from inbound mail to wagenparkbeheerder acknowledgement on the real ones: 38 seconds. The longest was 71 seconds, on a Sunday at 03:14, when the pager had to fall through to the operations lead.
The boring metric matters more. The CMR scan reconciliation that used to take the Friday fitter four hours now takes seventeen minutes of his time. The agent matches the scan to the planner-rooster entry and the Transics trip, and surfaces only the ones that do not match cleanly. The shared mailbox at 09:18 on a Tuesday now has 23 unread items, not 614.
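The matching step can be pictured as two interval lookups keyed on the truck plate. This is a sketch under assumptions: it supposes that scans, planner-rooster entries, and Transics trips can all be reduced to a plate and a time window, and the two-hour slack and every name in it are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Interval:
    """A rooster shift or a Transics trip, reduced to plate and time window."""
    plate: str
    start: datetime
    end: datetime


def find_match(plate: str, scanned_at: datetime, intervals: List[Interval],
               slack: timedelta = timedelta(hours=2)) -> Optional[Interval]:
    """Return the single interval covering the scan, or None if 0 or 2+ fit."""
    hits = [i for i in intervals
            if i.plate == plate
            and i.start - slack <= scanned_at <= i.end + slack]
    return hits[0] if len(hits) == 1 else None


def needs_human(plate: str, scanned_at: datetime,
                rooster: List[Interval], trips: List[Interval]) -> bool:
    """A scan is clean only if it matches exactly one shift and one trip."""
    return (find_match(plate, scanned_at, rooster) is None
            or find_match(plate, scanned_at, trips) is None)
```

Ambiguous matches count as failures on purpose: a scan that fits two shifts is exactly the kind of item a person should see.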
Identity and audit, briefly
The audit story matters as much as the escalation story. For a Dutch transport client who needs a clean trail to show ILT, every model call should be tied to a named human owner, and every external API call should use a short-lived, scoped credential rather than a long-lived service key. We rotate the Transics correlator's credentials on a four-hour window now. Had that pattern been in place at the start, the silent key-rotation failure in week three would have surfaced as a credential-issuance error within hours, not as a slow degradation we noticed days later.
The smallest thing you can copy this week
If you run an operations inbox that legally has to escalate certain messages within a window, do this on Monday. Write a two-column list. Left column, the message types that must escalate. Right column, the actual wall-clock budget for each. If the right column has a number you currently miss more than once a quarter, you have a process-automation problem that pays for itself. The rest is plumbing.
The thing we ran into while building this agent was that the regulator's clock and the model's confidence threshold are the same kind of variable, and both belong in version-controlled code rather than in a prompt. We ended up moving every threshold into a small YAML file that the safety advisor reviews and signs off on each quarter.
Key takeaway
Build the agent around the regulator's clock, not the user's convenience. The ILT reporting window is the one variable you cannot negotiate away.
FAQ
Why a 45-second SLA and not 30 or 60?
Forty-five seconds is the wall-clock budget that fits polling, OCR, two classifier calls, the Transics lookup and a PagerDuty dispatch with margin. Tighter forces skipping the second LLM check; looser starts to feel routine to the on-duty operator.
Does the agent file the ILT report automatically?
No. The agent escalates, packages the context, and drafts a report. A named human at the company submits it. Regulatory filings should not be the model's responsibility, and ILT will want a named human signatory.
Why not replace the Exchange 2016 planner-rooster?
Because it works, the team knows it, and replacing a twelve-year-old planning system is a separate twelve-month project. The agent reads the calendars instead. Migrations earn their cost when the old system blocks the change, not before.
What stops the LLM from hallucinating a UN number again?
If the substance is not visible on the CMR or in a photo attachment, the model must return substance unknown and route the item to a human classifier. We treat hallucinated certainty as a failure mode and design the prompt to default to ignorance.