Email automation
Dutch inbox triage agent: a 1,400-email-a-day playbook
A 19-person municipal advisory firm in Zwolle was drowning in citizen email. Here is the architecture, the schema, and the three things we got wrong along the way.

The intern with the spreadsheet
Tuesday, 08:47. The shared inbox at our client's Zwolle office shows 643 unread messages. An intern has a spreadsheet open with a column labelled soort. She reads each email, types something like WMO-bezwaar or bijzondere bijstand into the cell, then forwards the message to whichever caseworker handles that bucket today. By Friday her spreadsheet becomes the routing schedule.
The firm employs nineteen people. They advise citizens facing municipal decisions: rejected care applications, social-assistance disputes, WOZ objections, debt counselling. The inbox averages 1,400 emails per working day, peaks near 2,100 around quarterly tax cycles, and the intern's spreadsheet is the only thing keeping it sorted. When she is sick, the queue grows.
The firm asked us for one thing. Get the spreadsheet out of the loop. They did not want a chatbot. They did not want auto-replies. They wanted the same routing decisions she was making, but in under ten seconds and without a human in the middle.
This is what we shipped, and what we got wrong.
What 31 case types actually means
Before you touch any model, you have to decide what the categories are. This sounds trivial. It is not.
The intern's spreadsheet had 47 distinct labels at the start of the project. Some were genuine case types (WMO-bezwaar). Some were status notes (wachten op stukken). Some were typos (WMO-bezwaaar). Two of them, overig and weet niet, accounted for 18% of the rows.
We spent the first week not writing code. We sat with the senior caseworker and the office manager and walked the list. We collapsed seven WMO variants into three. We split bezwaarschrift into four sub-types because the routing actually differed. We deleted overig entirely, because the point of the system is to refuse to use it.
We landed on 31. The schema looked like this:
case_types:
  - id: WMO_INDICATIE_BEZWAAR
    label_nl: "Bezwaar tegen WMO-indicatie"
    description_nl: |
      Burger maakt formeel bezwaar tegen een door de gemeente
      afgegeven WMO-indicatie. Vaak gaat het om afgewezen
      huishoudelijke hulp, verlaagde uren, of geweigerde dagbesteding.
    keywords: [WMO, indicatie, bezwaar, huishoudelijke hulp, afwijzing]
    negative_examples:
      - "Vraag over WMO-aanvraag"
      - "Wanneer komt de thuiszorg langs?"
    sla_hours: 48
    requires: [WMO_specialist]
Every case type carries an SLA, a required-specialist tag, and a small set of negative examples. The negative examples matter as much as the positive ones. "Vraag over WMO" (a question about WMO) is not a bezwaar (a formal objection). "Mijn moeder belde mij gisteren" ("my mother called me yesterday") is not a case at all.
If you only take one thing from this section: the schema is the product. A model running against a sloppy schema produces sloppy decisions, no matter how big the model is.
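One way to keep the schema honest is to validate it when it is loaded. The sketch below is illustrative, not the shipped code: `CaseType`, `validate_case_types`, and the `FORBIDDEN_IDS` set are names assumed here, and the function takes already-parsed dictionaries (for example from a YAML loader) so that it needs only the standard library.

```python
from dataclasses import dataclass

# Catch-all buckets the schema refuses to carry. These ids are assumptions,
# mirroring the spreadsheet labels "overig" and "weet niet".
FORBIDDEN_IDS = {"OVERIG", "WEET_NIET"}


@dataclass(frozen=True)
class CaseType:
    id: str
    label_nl: str
    description_nl: str
    sla_hours: int
    requires: tuple[str, ...]
    keywords: tuple[str, ...] = ()
    negative_examples: tuple[str, ...] = ()


def validate_case_types(raw_entries: list[dict]) -> dict[str, CaseType]:
    """Turn parsed schema entries into CaseType objects, failing loudly on sloppy ones."""
    case_types: dict[str, CaseType] = {}
    for raw in raw_entries:
        case_id = raw["id"]
        if case_id in case_types:
            raise ValueError(f"duplicate case type: {case_id}")
        if case_id.upper() in FORBIDDEN_IDS:
            raise ValueError(f"catch-all case type not allowed: {case_id}")
        if raw["sla_hours"] <= 0:
            raise ValueError(f"{case_id}: sla_hours must be positive")
        if not raw["requires"]:
            raise ValueError(f"{case_id}: needs a required-specialist tag")
        if not raw.get("negative_examples"):
            raise ValueError(f"{case_id}: needs at least one negative example")
        case_types[case_id] = CaseType(
            id=case_id,
            label_nl=raw["label_nl"],
            description_nl=raw["description_nl"].strip(),
            sla_hours=raw["sla_hours"],
            requires=tuple(raw["requires"]),
            keywords=tuple(raw.get("keywords", [])),
            negative_examples=tuple(raw["negative_examples"]),
        )
    return case_types


CASE_TYPES = validate_case_types([
    {
        "id": "WMO_INDICATIE_BEZWAAR",
        "label_nl": "Bezwaar tegen WMO-indicatie",
        "description_nl": "Burger maakt formeel bezwaar tegen een WMO-indicatie.",
        "keywords": ["WMO", "indicatie", "bezwaar"],
        "negative_examples": ["Vraag over WMO-aanvraag"],
        "sla_hours": 48,
        "requires": ["WMO_specialist"],
    }
])
```

Code that needs the number of case types can then use `len(CASE_TYPES)` instead of a literal, so adding an entry to the schema does not require touching anything else.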
Why embeddings beat a fine-tuned classifier here
Our first instinct was to fine-tune a Dutch BERT classifier on a few thousand labelled emails. It would have worked. It would also have rotted within four months, because municipal language shifts (new policies, new ministerial letters, new acronyms) and the firm did not want to retrain quarterly.
We went with a two-stage retrieval-and-judge pipeline instead. Stage one: embed every incoming email and find the five closest historical cases. Stage two: hand those five cases plus the new email to an LLM and ask which case type fits.
The advantages are not subtle. Adding a new case type is editing a YAML file and labelling three example emails. The training data is just the firm's own history. And because every decision cites the historical cases it relied on, the senior caseworker can audit a wrong call in about fifteen seconds.
The disadvantage is latency. Two model calls per email cost us roughly four and a half seconds at the median (about 0.6 s for the embedding and 3.8 s for the judge). We will come back to that.
The two-stage pipeline
The pipeline runs on a small Hetzner VM in Falkenstein. EU data residency was non-negotiable for citizen email. Postgres for state, pgvector for embeddings, a Python worker that listens to Microsoft Graph change notifications.
The classifier looks like this, stripped to the bones:
import json

def classify_email(email: ParsedEmail) -> Decision:
    # Stage 1: retrieve nearest neighbours.
    # <=> is pgvector's cosine-distance operator. (<#> is negative inner
    # product, so a "< 0.35" cut-off on it would filter almost nothing.)
    # Assumes the 0.35 threshold is meant as a cosine distance.
    query_vec = embed(f"{email.subject}\n\n{email.body_text[:4000]}")
    neighbours = db.execute("""
        SELECT id, case_type, subject, body_snippet, decided_by
        FROM cases
        WHERE embedding <=> %s < 0.35
        ORDER BY embedding <=> %s
        LIMIT 5
    """, (query_vec, query_vec)).fetchall()
    if not neighbours:
        # Nothing similar in the history: refuse rather than guess.
        return Decision(case_type=None, confidence=0.0, reason="no_neighbours")

    # Stage 2: judge against the schema.
    prompt = build_judge_prompt(email, neighbours, CASE_TYPES)
    verdict = llm.complete(
        prompt,
        temperature=0,
        response_format={"type": "json_object"},
    )
    parsed = json.loads(verdict)
    return Decision(
        case_type=parsed.get("case_type"),  # None when the judge refuses
        confidence=parsed["confidence"],
        reason=parsed["reason"],
        cited_cases=[n.id for n in neighbours],
    )
The judge prompt is in Dutch. The model sees the email, the five neighbours, the full schema, and one instruction: pick a case type or refuse. The refusal path is the most important line of code in the whole system.
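The refusal path is easier to protect if the judge's JSON is validated before it becomes a decision. A minimal sketch, assuming the judge returns `case_type` (a schema id, or `null` to refuse), `confidence`, and `reason`; the names `Verdict`, `parse_verdict`, and the `invalid_verdict` reason string are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    case_type: Optional[str]  # None means "refused"
    confidence: float
    reason: str


def parse_verdict(raw: str, known_ids: set[str]) -> Verdict:
    """Parse the judge's JSON; anything malformed or off-schema becomes a refusal."""
    refusal = Verdict(case_type=None, confidence=0.0, reason="invalid_verdict")
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return refusal
    if not isinstance(data, dict):
        return refusal

    case_type = data.get("case_type")
    if case_type is None:
        # Explicit refusal by the judge: keep the reason it gave.
        return Verdict(None, 0.0, str(data.get("reason", "refused")))
    if not isinstance(case_type, str) or case_type not in known_ids:
        # A case type outside the schema is not a decision.
        return refusal

    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return refusal
    if not 0.0 <= confidence <= 1.0:
        return refusal
    return Verdict(case_type, float(confidence), str(data.get("reason", "")))
```

The design choice here is that every failure mode collapses into the same outcome as an honest "I don't know", so a malformed response can never be routed to a caseworker.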
Routing is harder than classifying
Classifying is the easy half. Once you know an email is a Bezwaar tegen WMO-indicatie, you still need to answer: who handles it today?
A firm of nineteen has, in practice, about six caseworkers who can credibly handle a WMO appeal. Two of them are also senior, which means they take complex cases. One is part-time. One is on holiday this week. One mentors a junior who needs harder work this month.
We modelled routing as a filter-and-sort over a small state table, which was all the constraint solving the firm needed:
CREATE TABLE routing_state (
    caseworker_id   uuid    PRIMARY KEY,
    active_today    boolean NOT NULL,
    specialisms     text[]  NOT NULL,
    weekly_load     int     NOT NULL,
    capacity_weekly int     NOT NULL,
    prefer_complex  boolean NOT NULL DEFAULT false
);

-- Pick the least-loaded available caseworker who has the required
-- specialism ($1) and still has capacity this week. Ties on load go to
-- caseworkers flagged prefer_complex.
SELECT caseworker_id
FROM routing_state
WHERE active_today
  AND $1 = ANY(specialisms)
  AND weekly_load < capacity_weekly
ORDER BY weekly_load ASC, prefer_complex DESC
LIMIT 1;
The office manager updates active_today and weekly_load from a small admin page each morning. That five-minute ritual replaces the intern's spreadsheet.
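For unit tests it can help to have the same selection rule in plain Python. This sketch mirrors the routing query under the assumption that its parameter is the specialist tag a case type requires; `Caseworker` and `pick_caseworker` are illustrative names, not part of the shipped system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Caseworker:
    caseworker_id: str
    active_today: bool
    specialisms: frozenset[str]
    weekly_load: int
    capacity_weekly: int
    prefer_complex: bool = False


def pick_caseworker(required: str, staff: list[Caseworker]) -> Optional[Caseworker]:
    """Return the least-loaded available caseworker with the required specialism."""
    eligible = [
        w for w in staff
        if w.active_today
        and required in w.specialisms
        and w.weekly_load < w.capacity_weekly
    ]
    if not eligible:
        return None  # nobody can take it: leave the assignment to a human
    # Same ordering as the SQL: weekly_load ASC, then prefer_complex DESC.
    return min(eligible, key=lambda w: (w.weekly_load, not w.prefer_complex))


staff = [
    Caseworker("anna", True, frozenset({"WMO_specialist"}), 12, 20),
    Caseworker("bram", True, frozenset({"WMO_specialist"}), 9, 20, prefer_complex=True),
    Caseworker("chris", False, frozenset({"WMO_specialist"}), 2, 20),
    Caseworker("dewi", True, frozenset({"WOZ_specialist"}), 1, 20),
]
chosen = pick_caseworker("WMO_specialist", staff)
```

In this made-up roster `chosen` is the caseworker with id `bram`: `chris` has a lower load but is not active today, and `dewi` lacks the specialism.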
We thought about making the agent infer availability from Outlook calendars. We tried it. It was too clever and too brittle. The office manager preferred a button she could press.
The escape hatch nobody talks about
Every triage agent we have ever shipped has one feature that the demo never shows: a queue for “I don't know.”
If the classifier's confidence is below 0.75, or if no historical neighbour is within distance 0.35, or if two top case types are within 0.05 of each other, the email goes to a needs_human queue instead of being routed. The senior caseworker clears that queue twice a day. Every clear is itself a labelled example that flows back into the embedding store.
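Those three rules fit in one small gate function. The sketch below assumes the judge reports a score for its runner-up case type as well as its top pick, which the classifier excerpt earlier does not show; `needs_human` and the reason strings are illustrative names.

```python
from typing import Optional

CONFIDENCE_FLOOR = 0.75        # below this, a human decides
MAX_NEIGHBOUR_DISTANCE = 0.35  # nearest historical case must be closer than this
MIN_MARGIN = 0.05              # top two case types must differ by more than this


def needs_human(
    top_confidence: float,
    runner_up_confidence: Optional[float],
    nearest_distance: Optional[float],
) -> Optional[str]:
    """Return why an email must go to the needs_human queue, or None if it may be routed."""
    if nearest_distance is None or nearest_distance >= MAX_NEIGHBOUR_DISTANCE:
        return "no_neighbours"
    if top_confidence < CONFIDENCE_FLOOR:
        return "low_confidence"
    if (
        runner_up_confidence is not None
        and top_confidence - runner_up_confidence <= MIN_MARGIN
    ):
        return "ambiguous"
    return None
```

Returning a reason rather than a bare boolean lets the queue show the senior caseworker why each email was held back.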
In month one, 22% of emails landed in the queue. In month three, it was 6%. The model never got smarter. The schema and the example bank did.
If your triage agent does not have a “send to human” outcome, it is not a triage agent. It is an auto-deleter with a confidence interval.
Connecting to Microsoft 365
The firm's inbox lives in Exchange Online. The clean way to react to new mail is the Microsoft Graph change-notifications API, which posts a webhook every time something changes in the watched mailbox.
POST https://graph.microsoft.com/v1.0/subscriptions
Content-Type: application/json

{
  "changeType": "created",
  "notificationUrl": "https://triage.client.nl/graph/notify",
  "resource": "users/info@client.nl/mailFolders('Inbox')/messages",
  "expirationDateTime": "2026-06-12T18:00:00Z",
  "clientState": "shared-secret-here"
}
Subscriptions expire and need to be renewed. Graph caps the lifetime per resource type, and for mail that cap is a matter of days; ours ran for roughly three. We run a cron that renews at 50% of the lifetime, with a fallback poll every five minutes in case a webhook drops. The Graph docs describe the change-notification lifecycle in more detail than is fun to read, but the renewal pattern is non-negotiable.
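The renew-at-half-life rule is a few lines of date arithmetic. A minimal sketch: `next_renewal_time` and `renewal_body` are hypothetical helper names, the datetimes are assumed to be timezone-aware, and the body is what would be sent in a `PATCH` to the subscription's URL (Graph renews a subscription by updating its `expirationDateTime`).

```python
from datetime import datetime, timedelta, timezone


def next_renewal_time(
    created_at: datetime, expires_at: datetime, fraction: float = 0.5
) -> datetime:
    """Moment at which the renewal job should extend the subscription."""
    if expires_at <= created_at:
        raise ValueError("expires_at must be after created_at")
    return created_at + (expires_at - created_at) * fraction


def renewal_body(now: datetime, lifetime: timedelta) -> dict:
    """JSON body for the renewal PATCH: only expirationDateTime changes."""
    new_expiry = (now + lifetime).astimezone(timezone.utc)
    return {"expirationDateTime": new_expiry.strftime("%Y-%m-%dT%H:%M:%SZ")}


created = datetime(2026, 6, 9, 18, 0, tzinfo=timezone.utc)
expires = datetime(2026, 6, 12, 18, 0, tzinfo=timezone.utc)
renew_at = next_renewal_time(created, expires)  # halfway: 2026-06-11 06:00 UTC
body = renewal_body(renew_at, expires - created)
```

Renewing at the halfway point leaves the other half of the lifetime as slack for a failed cron run before the subscription actually lapses.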
One gotcha worth flagging: Graph notifications do not include the email body. They give you a message ID. You fetch the body with a second call. If you want lower latency, prefetch the body inside the webhook handler before queuing the classification job.
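The webhook endpoint has two jobs. When a subscription is created, Graph calls the endpoint with a `validationToken` query parameter and expects it echoed back as plain text. After that, each notification carries the `clientState` from the subscription, which the handler should check. The sketch below is framework-free and illustrative: `handle_graph_post` and `EXPECTED_CLIENT_STATE` are assumed names, and the caller is expected to map the returned tuple onto its own HTTP response.

```python
import hmac
import json

EXPECTED_CLIENT_STATE = "shared-secret-here"  # placeholder, as in the subscription request


def handle_graph_post(query: dict[str, str], body: bytes) -> tuple[int, str, list[str]]:
    """Return (status, response_text, message_ids) for one call to the webhook."""
    token = query.get("validationToken")
    if token is not None:
        # Subscription handshake: echo the token back as text/plain.
        return 200, token, []

    try:
        payload = json.loads(body)
    except ValueError:
        return 400, "", []
    if not isinstance(payload, dict):
        return 400, "", []

    message_ids: list[str] = []
    for item in payload.get("value", []):
        if not isinstance(item, dict):
            continue
        state = str(item.get("clientState", ""))
        if not hmac.compare_digest(state.encode(), EXPECTED_CLIENT_STATE.encode()):
            continue  # not from our subscription: ignore it
        resource_data = item.get("resourceData")
        message_id = resource_data.get("id") if isinstance(resource_data, dict) else None
        if message_id:
            message_ids.append(message_id)
    # 202 acknowledges receipt; the caller fetches bodies and queues jobs.
    return 202, "", message_ids
```

The function returns only message IDs because, as noted above, the notification itself never contains the email body.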
Eight seconds, end to end
The SLA we wrote into the contract was ten seconds from “email lands in inbox” to “email is categorised, assigned, and a Teams notification is sent to the caseworker.” In practice we run at a median of 7.4 seconds with p95 at 11.2, so at least one email in twenty still misses the ten-second target.
The budget breaks down like this:
- Webhook delivery from Graph: 1.5 s median (you cannot fix this)
- Body fetch: 0.4 s
- Embedding call: 0.6 s
- Postgres neighbour query: 80 ms
- LLM judge call: 3.8 s
- Routing query and assignment write: 50 ms
- Teams notification post: 0.9 s
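Adding up the line items is a quick sanity check on the budget. The figures below are the ones listed above, converted to seconds; the dictionary keys are just labels chosen for this sketch. Per-stage medians do not have to sum exactly to the end-to-end median, so a small gap from the measured 7.4 s is expected.

```python
# Median latency per stage, in seconds, taken from the list above.
LATENCY_BUDGET_S = {
    "graph_webhook_delivery": 1.5,
    "body_fetch": 0.4,
    "embedding_call": 0.6,
    "postgres_neighbour_query": 0.08,
    "llm_judge_call": 3.8,
    "routing_and_assignment": 0.05,
    "teams_notification": 0.9,
}

total_s = sum(LATENCY_BUDGET_S.values())
llm_share = LATENCY_BUDGET_S["llm_judge_call"] / total_s

print(f"sum of stage medians: {total_s:.2f} s")   # 7.33 s
print(f"LLM judge share:      {llm_share:.0%}")   # 52%
```

The judge call alone is about half of the total, which is why it is the stage worth optimising first.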
The LLM call is the biggest line item and the one place we keep iterating. Smaller, faster, EU-hosted Dutch-capable models are catching up; swapping in a newer one in March cut about a second off our median. There has been a chorus on Hacker News this week arguing that AI progress is slowing down. From the frontier it might be. From where we sit, deploying narrow agents into operations teams, the boring middle of the curve is still delivering measurable gains every quarter.
What we got wrong
Three things, in order of how badly they bit.
First, we treated the schema as fixed too early. After six weeks the senior caseworker quietly asked us to add a 32nd case type for a new policy line. We had hard-coded 31 in two places. We do not do that anymore.
Second, we under-weighted the office manager. She is the only person who knows on Tuesday morning that two caseworkers will be at a court hearing on Thursday. Building her admin page well (three fields, no scrollbar, saves on blur) made the routing accurate. Building it badly made the routing wrong in a way that looked like a model failure.
Third, we let the firm see confidence scores too soon. They started second-guessing the model on cases where the confidence was 0.81 but they “had a feeling.” We hid the score from the assignment screen and only kept it on the auditor view. Trust went up.
When we built this Dutch email automation for the Zwolle firm, the thing we kept underestimating was that the model was the small part. The schema, the routing table, and the human escape hatch carried the system; the LLM was a component we could swap.
If you are about to ship something like this, the smallest useful thing you can do today is open your shared inbox, scroll back two weeks, and label every message by hand into the case types you think you have. You will be wrong about a third of them. That is the project, before any code.
Key takeaway
The schema and the routing table carry an inbox-triage agent. The model is a component you can swap.
FAQ
Why not fine-tune a Dutch classifier instead?
Fine-tuned classifiers rot when policy language shifts. Embedding retrieval plus an LLM judge lets you add a case type by labelling three emails, and every decision cites the historical cases it relied on.
How do you keep citizen data inside the EU?
We host the worker on a Hetzner VM in Falkenstein, route to an EU-only LLM provider, and keep embeddings and case bodies in Postgres on the same VM. No citizen content leaves the EU.
What happens when the agent is not confident?
Anything below 0.75 confidence, with no historical neighbour within distance 0.35, or where the two top case types are within 0.05 of each other, goes to a needs_human queue. The senior caseworker clears it twice a day, and each cleared case becomes a labelled example.
How long did the build take end to end?
About ten weeks. The first three were schema work with no code. Six weeks of build and parallel-running alongside the intern's spreadsheet. One final week of shadow mode before the spreadsheet was retired.