Email automation for youth care: a Utrecht case study
On a Tuesday at 08:12 the operations lead opened her inbox to 312 unread referrer emails from the weekend. Twelve months later, the agent handles 1,420 in a week.

On a Tuesday morning at 08:12 the operations lead at a 27-person jeugdzorginstelling in Utrecht opens her inbox and finds 312 referrer emails from the weekend. Some are from huisartsen. Two are from a wijkteam. One, buried fourteen messages deep, is a melding from a school internal counsellor describing a sixteen-year-old who showed up with bruises she could not explain. The team rule is that a message like this must be in front of a gedragswetenschapper (behavioural scientist) within thirty minutes. It has been sitting unread for forty-one hours.
That was the situation when we started. Twelve months later, the same operations lead opens her Tuesday inbox and finds zero unread items. Over the week, the agent has triaged 1,420 emails, drafted 1,118 ontvangstbevestigingen, parked 23 messages in the gedragswetenschapper-queue, and asked a human for a decision on 39 edge cases. The message about the bruises would now reach a behavioural scientist in under four minutes.
This is the engineering story behind that change. Some of it is boring. Most of the value came from the boring parts.
The shape of the problem
Jeugdzorg in the Netherlands runs on referrer correspondence. A huisarts emails a vraag, a wijkteam forwards a melding, a school sends a zorgsignaal, a Veilig Thuis officer asks for an update on an open dossier. Every one of those emails has a different intake path, a different urgency, and a different legal clock attached to it. Under Jeugdwet artikel 7.3.8 a jeugdhulpaanbieder must register the cliënt's information correctly and within reasonable time of receipt, which the inspectorate reads in practice as the same working day.
The instelling we worked with has 27 employees: eleven gedragswetenschappers and jeugdhulpverleners, six ambulante medewerkers, four schoolzorgcoaches, three office staff, two managers, and one IT-coördinator who is also the controller. Inbox triage was manual. The two intake-coördinatoren shared a generic mailbox, retyped fields into Nedap Ons by hand, and every evening one of them was still answering ontvangstbevestigingen from home at 21:00.
The stack was as legacy as it sounds:
- Nedap Ons, the dossiersysteem, in production since 2013. SOAP API underneath, with a REST shim added by Nedap in 2021 that covers about 60% of the surface we needed.
- An on-prem Exchange Server 2013 archive. Seven years of correspondence lived there, indexed by From: and almost nothing else.
- Outlook 2016 on every desk, still binding to that Exchange server over RPC-over-HTTP.
- A two-step approval workflow in a SharePoint 2013 list that nobody under 35 understood.
The constraint that mattered most was not technical. The Jeugdwet does not let you delegate the eerste beoordeling (first assessment) of a melding that carries an acute safety signal to anything that is not a gedragswetenschapper. Whatever we built had to be allergic to that category of email. The agent's first job, before drafting a single reply, was to refuse to draft.
The safety-signal classifier comes first
We started with the smallest possible model and the smallest possible promise: never miss an acuut signaal. We were willing to over-flag. We were not willing to under-flag.
The classifier runs in two passes. The first is a rule layer: dumb keyword and structural checks against a list of phrases that the gedragswetenschappers themselves wrote down during a two-hour workshop. The list is shorter than you would expect: "ik weet niet meer wat ik moet doen" ("I don't know what to do anymore"), "suïcidale uitingen" ("suicidal statements"), "fysiek geweld vanavond" ("physical violence tonight"), and fourteen more entries. The second pass is a Claude Sonnet call with a system prompt the team drafted, a few-shot set of seventeen real anonymised meldingen, and an output constrained to one of three strings: acuut, spoed, or regulier.
```python
def triage(email: ParsedEmail) -> Triage:
    # Pass 1: the gedragswetenschappers' phrase list. A hit is final.
    if rule_layer_hits_acuut(email):
        return Triage(level="acuut", reason="rule", confidence=1.0)

    # Pass 2: Claude Sonnet, constrained to "acuut", "spoed" or "regulier".
    verdict = claude_classify(
        system=SAFETY_SYSTEM_PROMPT,
        few_shots=ANON_SHOTS,
        user=email.plain_text,
    )

    # We bias toward escalation on disagreement.
    if verdict.level == "regulier" and rule_layer_hints_spoed(email):
        return Triage(level="spoed", reason="rule_override")
    return verdict
```
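The helpers that triage() calls first are not shown in the article. A minimal sketch of what they could look like, assuming the rule layer is a plain substring match over normalised text and that the model's reply is validated before anything trusts it. The function takes the plain-text body rather than a ParsedEmail to stay self-contained, only the three phrases quoted above are included, and parse_level is a hypothetical name. Failing closed to acuut on malformed output is our assumption, in the same spirit as the escalation bias above.

```python
import re
import unicodedata

# Hypothetical excerpt: only the three phrases quoted above. The real list was
# written by the gedragswetenschappers and has seventeen entries.
ACUUT_PHRASES = (
    "ik weet niet meer wat ik moet doen",
    "suïcidale uitingen",
    "fysiek geweld vanavond",
)
ALLOWED_LEVELS = ("acuut", "spoed", "regulier")


def normalise(text: str) -> str:
    """NFC-normalise, case-fold and collapse whitespace, so that a line break
    or double space inside a phrase does not defeat the match."""
    text = unicodedata.normalize("NFC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def rule_layer_hits_acuut(plain_text: str) -> bool:
    """Pass 1: plain substring match against the normalised phrase list."""
    body = normalise(plain_text)
    return any(normalise(phrase) in body for phrase in ACUUT_PHRASES)


def parse_level(model_output: str) -> str:
    """Pass 2 guard: accept exactly one of the three labels (ignoring case,
    surrounding whitespace and a trailing full stop). Anything else, such as
    an explanation, a refusal or an empty reply, fails closed to "acuut"."""
    label = model_output.strip().rstrip(".").casefold()
    return label if label in ALLOWED_LEVELS else "acuut"
```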
Anything tagged acuut gets pushed into a dedicated queue in Microsoft Teams, pings the gedragswetenschapper on duty, and stops. No autoreply. No draft. No "we hebben uw bericht ontvangen". The agent's silence is the feature. An automated ontvangstbevestiging on a message that says a child is in danger reads, to a referrer, as a system that did not understand the urgency.
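One way to make that silence structural, rather than an instruction in a prompt, is to decide per triage level whether the drafting step may run at all. A sketch, with hypothetical queue names and the Teams notification reduced to a boolean flag:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actions:
    queue: str            # where the message is parked
    notify_on_duty: bool  # ping the gedragswetenschapper on duty
    draft_reply: bool     # may the drafting step run at all?


# Hypothetical queue names. Every draft goes to human review; nothing is
# sent automatically, and the acuut branch never reaches the drafter.
ROUTES = {
    "acuut": Actions("gedragswetenschapper-queue", notify_on_duty=True, draft_reply=False),
    "spoed": Actions("review-queue-top", notify_on_duty=False, draft_reply=True),
    "regulier": Actions("review-queue", notify_on_duty=False, draft_reply=True),
}


def route(level: str) -> Actions:
    """Unknown levels take the acuut branch: the safe default is to do less."""
    return ROUTES.get(level, ROUTES["acuut"])
```

If the drafter only ever receives messages whose route has draft_reply=True, an acuut message cannot produce a draft even when a prompt elsewhere misbehaves.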
If you are automating inbound mail in any care, legal, or safety-of-life domain, build the "do nothing" path first and ship it before the "draft a reply" path. The agent has to earn the right to write.
Talking to a thirteen-year-old dossier system
Nedap Ons is not a hostile API. It is a system designed for human screens and then partially exposed for machines. The REST endpoints that exist are well-documented. The ones that don't, aren't. We needed to do four things per intake email: find an existing cliënt by BSN fragment, name, or geboortedatum (date of birth); attach the email as a correspondentie-item; write a verrichting for the ontvangstbevestiging; and read back the trajectstatus so the draft could reference where the cliënt was in their hulpverleningsplan.
Three of those four had REST endpoints. The fourth, reading trajectstatus, was SOAP only. We wrapped both behind a thin Python adapter that the agent treats as one tool surface.
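```python
from __future__ import annotations
from typing import Optional

class OnsClient:
    """One tool surface: REST shim for three calls, SOAP for trajectstatus."""
    def find_client(self, hint: ClientHint) -> Optional[Client]: ...  # REST
    def attach_correspondence(self, client_id: str, email: ParsedEmail) -> str: ...  # REST
    def write_verrichting(self, client_id: str, kind: str, body: str) -> str: ...  # REST
    def get_traject_status(self, client_id: str) -> TrajectStatus: ...  # SOAP only
```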
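The adapter's job is to hide which transport answers which call. A minimal sketch of that split, assuming two hypothetical transport wrappers (OnsRest and OnsSoap) and showing only two of the four methods; the signatures above take a ParsedEmail, simplified here to a subject and a body:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class OnsRest(Protocol):
    """Hypothetical wrapper over the REST shim (lookup, attach, verrichting)."""
    def post_correspondence(self, client_id: str, subject: str, body: str) -> str: ...


class OnsSoap(Protocol):
    """Hypothetical wrapper over the SOAP service (trajectstatus only)."""
    def traject_status(self, client_id: str) -> str: ...


@dataclass
class OnsClient:
    """The single tool surface the agent sees; it never learns which
    transport answered a call."""
    rest: OnsRest
    soap: OnsSoap

    def attach_correspondence(self, client_id: str, subject: str, body: str) -> str:
        return self.rest.post_correspondence(client_id, subject, body)

    def get_traject_status(self, client_id: str) -> str:
        return self.soap.traject_status(client_id)


# In-memory test doubles, so the adapter can be exercised without Nedap Ons.
@dataclass
class FakeRest:
    posted: list[tuple[str, str, str]] = field(default_factory=list)

    def post_correspondence(self, client_id: str, subject: str, body: str) -> str:
        self.posted.append((client_id, subject, body))
        return f"corr-{len(self.posted)}"


class FakeSoap:
    def traject_status(self, client_id: str) -> str:
        return "uitvoering"  # invented status value for the fake
```

The fakes show one benefit of the split: the agent's tool surface can be tested without a Nedap Ons test environment.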
The find_client step is where almost all errors live. Verwijzers misspell surnames, give a roepnaam instead of a voornaam, use an old address. We chose to never auto-create a cliënt; if the agent could not find a match with high confidence, the email went to a "needs human match" queue. That queue runs at about nine emails per day. The cost of a wrong attachment is a privacy incident under the AVG, so the trade is not close.
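A sketch of a match-or-refuse lookup in that spirit. The weights, the 0.85 threshold and the required margin over the runner-up are hypothetical; difflib's similarity ratio stands in for whatever fuzzy matching the team actually used, and listing the roepnaam among the first names is one way to absorb the roepnaam/voornaam mismatch:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from typing import Optional, Sequence

# Hypothetical weights and thresholds, for illustration only.
W_SURNAME, W_FIRST, W_BIRTH = 0.5, 0.2, 0.3
MIN_SCORE = 0.85   # below this, a human matches the email
MIN_MARGIN = 0.10  # best candidate must beat the runner-up by this much


@dataclass(frozen=True)
class Client:
    client_id: str
    surname: str
    first_names: tuple[str, ...]  # voornamen plus roepnaam
    birth_date: date


@dataclass(frozen=True)
class ClientHint:
    surname: str
    first_name: Optional[str] = None
    birth_date: Optional[date] = None


def _sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def score(hint: ClientHint, c: Client) -> float:
    s = W_SURNAME * _sim(hint.surname, c.surname)  # tolerates small misspellings
    if hint.first_name:
        # Compare against every known name, so a roepnaam still matches.
        s += W_FIRST * max((_sim(hint.first_name, n) for n in c.first_names), default=0.0)
    if hint.birth_date is not None and hint.birth_date == c.birth_date:
        s += W_BIRTH
    return s


def find_client(hint: ClientHint, candidates: Sequence[Client]) -> Optional[Client]:
    """Return a match only when it is both strong and unambiguous.
    None means: route to the "needs human match" queue. Never auto-create."""
    ranked = sorted(candidates, key=lambda c: score(hint, c), reverse=True)
    if not ranked:
        return None
    best = score(hint, ranked[0])
    runner_up = score(hint, ranked[1]) if len(ranked) > 1 else 0.0
    if best >= MIN_SCORE and best - runner_up >= MIN_MARGIN:
        return ranked[0]
    return None
```

Returning None for a weak or ambiguous best candidate, rather than the top hit, is what keeps a wrong attachment off the table.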
Reading seven years of Exchange 2013
The archive was the surprise. The first version of the agent ignored it. We shipped two weeks in, and the gedragswetenschappers started flagging that the drafts read as context-free. A schoolzorgcoach would write "naar aanleiding van ons gesprek vrijdag" ("following our conversation on Friday") and the agent had no idea which conversation. The archive held the previous correspondence (eighteen months of it for an active dossier), but Exchange 2013 EWS does not love being scraped.
We did the unglamorous thing. We wrote an EWS-pulling worker that runs at 03:00, drops new items into a Postgres table with a tsvector column, and gives the agent a search_archive(thread_id | sender, max_results) tool. The tool returns at most six prior messages, plain text, with PII reduction applied. The agent never touches Exchange directly. The archive worker is one file and 240 lines of code. It has not been changed in eight months.
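A sketch of the tool's last mile, assuming a psycopg-style query over a hypothetical archive_messages table and a regex pass for structured identifiers (e-mail addresses, Dutch phone numbers, nine-digit BSN-like numbers). The article does not describe its PII-reduction method, and regexes alone do not catch names, so treat this as one layer rather than the whole step:

```python
from __future__ import annotations

import re

MAX_RESULTS = 6  # hard cap on prior messages returned to the model

# Postgres query the tool could run against the worker's table (psycopg-style
# named parameters). Table and column names are hypothetical.
SEARCH_SQL = """
SELECT body_text
FROM archive_messages
WHERE thread_id = %(thread_id)s OR sender = %(sender)s
ORDER BY received_at DESC
LIMIT %(limit)s
"""

# Regexes only catch structured identifiers; names and addresses need a
# separate step. Order matters: e-mail first, then phone, then BSN.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"(?:\+31[\s-]?|\b0)\d(?:[\s-]?\d){8}\b"), "[TEL]"),
    (re.compile(r"\b\d{9}\b"), "[BSN]"),
]


def redact(text: str) -> str:
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text


def to_tool_result(bodies: list[str], max_results: int = MAX_RESULTS) -> list[str]:
    """Cap at six no matter what the model asks for, then redact."""
    limit = max(0, min(max_results, MAX_RESULTS))
    return [redact(body) for body in bodies[:limit]]
```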
Exchange 2013 has been out of extended support since April 2023. The instelling knew. The migration to Exchange Online was already on the 2026 roadmap. We made a deliberate decision to build the worker against the version that existed, not the version that would exist, because shipping value in 2025 was worth more than future-proofing against a migration the IT-coördinator had already scoped.
Drafting the ontvangstbevestiging
The drafting step is the part most case studies start with, and we are putting it near the end on purpose. By the time the agent gets to write a reply, the hard work is done: the email is classified, the cliënt is matched, the correspondence is attached, the archive context is fetched, and the trajectstatus is read. The draft itself is a Claude call whose system prompt spells out the Jeugdwet 7.3.8 verplichtingen (obligations) explicitly and supplies the cliënt's first name, the verwijzer's organisation, the timeline the team commits to, and the previous three replies from the same intake-coördinator so the tone matches.
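A sketch of how that context could be assembled into one system prompt. The field names are hypothetical, the legal wording is left as a placeholder rather than paraphrased, and only the three most recent replies from the coördinator are kept:

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Placeholder: the team's own wording of the Jeugdwet 7.3.8 obligations.
JEUGDWET_NOTE = "<team's wording of the Jeugdwet 7.3.8 obligations>"


@dataclass
class DraftContext:
    client_first_name: str
    referrer_org: str
    committed_timeline: str   # the response time the team commits to
    traject_status: str       # read from Nedap Ons
    archive_snippets: list[str] = field(default_factory=list)  # already redacted
    previous_replies: list[str] = field(default_factory=list)  # same coördinator, newest first


def build_system_prompt(ctx: DraftContext) -> str:
    sections = [
        "You draft an ontvangstbevestiging for a referrer. A human reviews every draft.",
        f"Obligations:\n{JEUGDWET_NOTE}",
        f"Cliënt first name: {ctx.client_first_name}",
        f"Referrer organisation: {ctx.referrer_org}",
        f"Timeline we commit to: {ctx.committed_timeline}",
        f"Current trajectstatus: {ctx.traject_status}",
    ]
    sections += [f"Earlier correspondence:\n{s}" for s in ctx.archive_snippets]
    # Only the three most recent replies, so the tone matches the coördinator.
    sections += [f"Example of the coördinator's tone:\n{r}" for r in ctx.previous_replies[:3]]
    return "\n\n".join(sections)
```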
Every draft lands in a review queue. The intake-coördinator either presses Send, edits and sends, or rejects. We log the rate of each. At month one the send-without-edit rate was 38%. At month nine it is 84%. The difference is not the model. The difference is that we kept adding the rejected drafts back to a few-shot file the agent reads from.
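The loop is small enough to sketch. Assuming each review ends in one of three outcomes, and assuming rejected drafts are stored as JSON Lines next to the reviewer's reason (the article does not give the file format), the metric and the feedback file could look like this:

```python
from __future__ import annotations

import json
from collections import Counter
from enum import Enum
from pathlib import Path


class Outcome(str, Enum):
    SENT = "sent"          # sent without edits
    EDITED = "edited"      # edited, then sent
    REJECTED = "rejected"  # discarded by the intake-coördinator


def send_without_edit_rate(outcomes: list[Outcome]) -> float:
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts[Outcome.SENT] / total if total else 0.0


def rejection_record(email_text: str, draft: str, reviewer_note: str) -> str:
    """One JSONL line for the few-shot file: the rejected draft is kept as a
    negative example next to the reviewer's reason (an assumed format)."""
    return json.dumps(
        {"email": email_text, "rejected_draft": draft, "why": reviewer_note},
        ensure_ascii=False,
    )


def append_rejection(few_shot_file: Path, line: str) -> None:
    with few_shot_file.open("a", encoding="utf-8") as f:
        f.write(line + "\n")
```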
The drafting prompt is the smallest part of a working email agent. The triage layer, the integrations, and the human review loop are where 90% of the build lives.
What the numbers look like at month twelve
Over the week ending 14 June 2026, the agent processed 1,420 inbound referrer emails. Of those:
- 23 were flagged acuut and parked in the gedragswetenschapper-queue with no draft. Median time to human pickup: 3 minutes 41 seconds.
- 118 were flagged spoed, drafted, and put at the top of the review queue with a two-hour SLA.
- 1,118 were routed as regulier, drafted, and sent within the same working day.
- 39 went to the "needs human match" queue because the agent could not confidently link them to a cliënt.
- 122 were spam, newsletters, or out-of-scope and silently archived.
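As a check on the arithmetic, the five buckets account for every message in the week, and their shares follow directly from the counts above:

```python
from __future__ import annotations

# Week ending 14 June 2026, counts as reported above.
WEEK = {
    "acuut": 23,
    "spoed": 118,
    "regulier": 1118,
    "needs_human_match": 39,
    "out_of_scope": 122,
}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of the week's inbound mail per route, to one decimal."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}
```

Roughly four in five messages take the regulier path; about one in sixty is acuut.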
The two intake-coördinatoren now spend roughly 90 minutes a day on email, down from a self-reported six hours each. They have used the freed time to take over schoolzorgcoach scheduling, which had been the IT-coördinator's least favourite recurring meeting. Nobody on the team has left in the last twelve months. We do not claim the agent caused that. We do claim it stopped contributing to the case for leaving.
What we would do differently
Three things, plainly.
One: we underestimated how much of the build was data-cleanup on Nedap Ons. About a fifth of the cliënt records had stale verwijzer-velden, which broke our matching heuristics the first time we ran on real volume. We should have asked for a one-day data audit before writing a line of agent code.
Two: the rule layer in the safety classifier should have been version-controlled from day one, not maintained in a shared document. We migrated it to a Git repo in month five. The gedragswetenschappers learned to open a pull request. That sentence still surprises me.
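Once the list lives in Git, each pull request can run a check before it is merged. A minimal sketch, assuming the phrases sit in the repo as a list and that the team keeps anonymised example meldingen that must always trigger the rule layer (the fixture text here is invented):

```python
import re
import unicodedata

# In the repo these would live in a versioned data file; inlined here.
ACUUT_PHRASES = [
    "ik weet niet meer wat ik moet doen",
    "suïcidale uitingen",
    "fysiek geweld vanavond",
]
# Invented fixture for illustration; real ones would be anonymised meldingen.
MUST_HIT = ["Moeder schrijft: ik weet niet meer wat ik moet doen."]


def normalise(text: str) -> str:
    return re.sub(r"\s+", " ", unicodedata.normalize("NFC", text).casefold()).strip()


def hits(text: str) -> bool:
    body = normalise(text)
    return any(normalise(p) in body for p in ACUUT_PHRASES)


def lint_phrases(phrases: list) -> list:
    """Problems a reviewer should see in the pull request."""
    problems, seen = [], set()
    for p in phrases:
        n = normalise(p)
        if not n:
            problems.append("empty phrase")
        elif n in seen:
            problems.append(f"duplicate: {p!r}")
        seen.add(n)
    return problems


def regression_failures() -> list:
    """Every known acute example must still trip the rule layer."""
    return [f"missed: {t!r}" for t in MUST_HIT if not hits(t)]
```

There is deliberately no must-not-hit list: the classifier is allowed to over-flag, so the check only guards against losing or duplicating a phrase.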
Three: the Exchange 2013 worker should have shipped in week one, not week three. The drafts without archive context were worse than no drafts. They read confident and wrong, which is the worst failure mode for a system that touches care.
The thing we kept running into on this build was that the safety-signal path and the draft path had to be two separate systems, not two prompts on the same chain. We solved it by treating the triage classifier as a single-purpose microservice with its own escalation queue, and by letting the drafting agent read only the classifier's output, never its input. If you are looking at a similar build, that is the structural decision to make first.
The five-minute audit you can run tomorrow
Open your shared inbox. Pick the last fifty inbound emails. For each one, write down the latency between arrival and first human action. Plot the distribution. If the long tail crosses any hard legal or safety threshold (yours will be different from the Jeugdwet, but you have one), that is the email you automate first. Not the highest-volume one. The highest-stakes one.
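If the timestamps are already in a spreadsheet, the distribution takes a few lines. A sketch, assuming you have (arrival, first human action) pairs and a threshold of your own choosing; a nearest-rank percentile is used so a fifty-message sample needs no interpolation:

```python
from __future__ import annotations

import math
from datetime import datetime, timedelta
from statistics import median


def nearest_rank(sorted_vals: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    k = max(0, math.ceil(pct / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]


def audit(pairs: list[tuple[datetime, datetime]], threshold: timedelta) -> dict:
    """pairs = (arrived, first_human_action) for the last fifty inbound emails."""
    hours = sorted((acted - arrived).total_seconds() / 3600 for arrived, acted in pairs)
    limit_h = threshold.total_seconds() / 3600
    return {
        "median_h": median(hours),
        "p90_h": nearest_rank(hours, 90),
        "max_h": hours[-1],
        "breaches": sum(h > limit_h for h in hours),  # emails past your threshold
    }
```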
Key takeaway
In a regulated inbox, build the do-nothing path before the draft path. The agent has to earn the right to write, and staying silent on a safety-critical message is a feature.
FAQ
Why doesn't the agent autoreply to messages flagged as an acute safety signal?
Because under the Jeugdwet the first assessment of an acute signal has to be made by a gedragswetenschapper. A machine-generated acknowledgement on that kind of message reads as a system that didn't understand the urgency, which is worse than silence.
How does the agent handle privacy under the AVG?
Email bodies are processed in-region, archive search results are PII-reduced before they reach the model, and the agent never auto-creates a cliënt record. Low-confidence matches go to a human queue instead of being attached to the wrong file.
Did you have to migrate off Nedap Ons or Exchange 2013 to make this work?
No. We built thin adapters against both systems as they were. The Exchange Online migration was already on the 2026 roadmap, but shipping the agent didn't have to wait for it.
How long did the build take end-to-end?
About twelve weeks to first production traffic on a single intake mailbox, then another four months of iterating on the rule layer, the matching heuristics, and the few-shot drafts before the team trusted it on the full inbox.