
Email agent for a letselschade firm: a 1,180/week case study

On a Tuesday morning in Almelo, the intake coordinator stares at 287 unread medisch-adviseur emails. Forty-one are flagged urgent. Three of them actually are. We built her an agent.

Jacob Molkenboer · Founder · A Brand New Company · 21 Jun 2026 · 8 min


09:14 on a Tuesday

The intake coordinator at a 22-person letselschade firm in Almelo opens Outlook and sees 287 new medisch-adviseur emails since Friday. Forty-one are flagged urgent. Three actually are.

She has been doing the job for nine years. She knows which IME-arts uses the urgent flag like a comma and which AOV-behandelaar only flags when something is genuinely on fire. She also knows that by the time she has sorted the noise from the red flags, the firm's two senior letselschade-advocaten have each been pinged on their phones with "even kijken" messages, and she has lost another forty-five minutes before she can route a single dossier.

This is the email pipeline of a Dutch personal-injury practice in 2026: high-volume, regulated, dependent on two pieces of software that nobody has truly loved since 2014, and bleeding senior attention through the cracks.

The volume problem

The firm runs roughly 1,180 medisch-adviseur correspondenties per week. That is medical advisors replying to questions, AOV insurers responding to aansprakelijkstellingen, hospital records arriving in dribs and drabs, IME reports landing with a soft thud at 23:47 because the IME-arts has only just finished. Each thread needs one of four things to happen next:

A dossier action by the casemanager (the bulk).
An ontvangstbevestiging within the PIV term.
An escalation to a senior letselschade-advocaat (rare, expensive when missed).
Nothing. File the message and move on.

The interesting bucket is the third one. Under the firm's internal rule, any thread that touches an AOV-aanspraak with a vermoedelijke schade above €250,000 must be parked in a senior queue before anybody, agent or junior, sends so much as an acknowledgement. Get that wrong and the firm is one careless line away from a kostenveroordeling argument three years down the road.

Cicero and Exchange 2013

The two systems the agent had to live inside are Cicero, the twelve-year-old advocatensoftware that runs the dossier layer, and a homegrown medisch-dossier-archief built on Exchange 2013 that one of the senior partners' nephews stood up sometime in 2017. Exchange 2013 went out of Microsoft support in April 2023; see the Microsoft Exchange lifecycle notes. The archive is held together at this point with a service account, a scheduled task, and the personal goodwill of one external sysadmin who answers his phone on Saturdays.

Neither system has anything resembling a modern API. Cicero exposes a SOAP-ish endpoint that returns XML if you flatter it. The Exchange archive responds to EWS calls if you authenticate as the right legacy user. There is no webhook anywhere in the building. The first six weeks of the project were not about LLMs; they were about reverse-engineering two closed systems without breaking either one.

Warning

If a vendor tells you the agent will integrate cleanly with twelve-year-old legal software, ask them to show you the Postman collection. Old Dutch verticals (letselschade, notariaat, accountancy) run on closed Windows desktops with no public docs and no vendor incentive to help.

What the agent actually does

The pipeline ended up looking like this. An IMAP listener on the firm's central postvak watches incoming mail. Every message gets pulled through a classifier that does four things in sequence: it extracts the dossier number (regex first, then an LLM fallback for the 14% of messages where the human typed it wrong), it looks up the dossier in Cicero over the SOAP interface, it pulls the most recent medical correspondence for that dossier from the Exchange archive over EWS, and only then does it decide what kind of message this is.
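The regex-first, LLM-second extraction step can be sketched roughly as follows. This is a minimal illustration, not the firm's code: the dossier-number format (`2025-01234`) and the `llm_fallback` callable are assumptions made for the example. The one design point worth copying is that the fallback's answer has to pass the same format check as the regex, so an invented number is dropped instead of routed.

```python
import re
from typing import Callable, Optional

# Hypothetical dossier-number format: four-digit year, dash, five digits.
DOSSIER_RE = re.compile(r"\b20\d{2}-\d{5}\b")


def extract_dossier_number(
    text: str,
    llm_fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a dossier number, trying the cheap regex before the LLM."""
    match = DOSSIER_RE.search(text)
    if match:
        return match.group(0)

    # Fallback for mistyped numbers. The model's proposal must still match
    # the expected format; anything else is treated as "not found".
    candidate = llm_fallback(text)
    if candidate and DOSSIER_RE.fullmatch(candidate.strip()):
        return candidate.strip()
    return None
```

In a real pipeline the returned number would still be checked against the dossier system before anything is routed on it; a well-formed number is not necessarily an existing one.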

The classification step is a small Claude call with the dossier metadata and the last three messages in the thread as context. It returns a JSON object with the category, a confidence score, and a flag for whether the thread mentions or implies an AOV-aanspraak above the €250k threshold.

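import json

import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment.
claude = anthropic.Anthropic()

# build_prompt, queue, draft_ack and PIV_SYSTEM_PROMPT live elsewhere in the pipeline.
def classify(msg, dossier, history):
    prompt = build_prompt(msg, dossier, history)
    result = claude.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=400,
        system=PIV_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply is expected to be a single JSON object.
    decision = json.loads(result.content[0].text)

    # Threshold first: a claim at or above EUR 250k goes to a human before anything is drafted.
    if decision["aov_value_eur"] and decision["aov_value_eur"] >= 250_000:
        return queue("senior_letselschade", msg, decision)

    # Acknowledgements are drafted for the casemanager, never sent.
    if decision["category"] == "ontvangstbevestiging":
        return draft_ack(msg, dossier, decision)

    # Everything else goes to the queue named after its category.
    return queue(decision["category"], msg, decision)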
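Because every routing decision hangs on that JSON object, it can be worth validating its shape before acting on it. The sketch below is one hedged way to do that. The field names `category`, `confidence` and `aov_value_eur` follow the article; the set of allowed category names is an assumption, since only `ontvangstbevestiging` and `senior_letselschade` appear in the code above.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Illustrative bucket names; the firm's real list is not published.
ALLOWED_CATEGORIES = {"dossier_action", "ontvangstbevestiging", "senior_letselschade", "file"}


@dataclass(frozen=True)
class Decision:
    category: str
    confidence: float
    aov_value_eur: Optional[float]


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_decision(raw: str) -> Decision:
    """Parse the classifier reply; raise ValueError on anything unexpected."""
    data = json.loads(raw)  # json.JSONDecodeError is itself a ValueError
    if not isinstance(data, dict):
        raise ValueError("classifier reply is not a JSON object")

    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")

    confidence = data.get("confidence")
    if not _is_number(confidence) or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"bad confidence: {confidence!r}")

    value = data.get("aov_value_eur")
    if value is not None and not _is_number(value):
        raise ValueError(f"bad aov_value_eur: {value!r}")

    return Decision(category, float(confidence), None if value is None else float(value))
```

What to do with a `ValueError` is a policy choice. For a regulated inbox, treating an unparseable reply as "needs a human" is the conservative default.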

We picked a Sonnet-class model because the work is mostly reading context and producing a small structured output. Bigger models would be wasteful at 1,180 messages a week; smaller models started inventing dossier numbers when the inbound mail was sloppy. The cost-per-message lands at about €0.004, which means the whole classification layer runs for less than the price of one paralegal hour per week.
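The arithmetic behind that cost claim is short enough to write out. Both inputs come from the figures above; the 52-week extrapolation is our own addition for scale.

```python
MESSAGES_PER_WEEK = 1_180      # weekly volume quoted in the article
COST_PER_MESSAGE_EUR = 0.004   # approximate classification cost per message

weekly_cost = MESSAGES_PER_WEEK * COST_PER_MESSAGE_EUR
yearly_cost = weekly_cost * 52  # simple extrapolation, ignores holidays and growth

print(f"weekly: EUR {weekly_cost:.2f}, yearly: EUR {yearly_cost:.2f}")
# prints "weekly: EUR 4.72, yearly: EUR 245.44"
```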

The €250k threshold is not a vibe. The agent will not draft, send, or auto-acknowledge anything it thinks touches a claim above that line. It parks the thread, attaches the classifier's reasoning as a note, and pings the senior queue inside Cicero. A human reads it. Always.

Drafting under the PIV gedragscode

The PIV-gedragscode, Stichting PIV's code of conduct for the handling of personal-injury claims, is specific about acknowledgements. A reply within two weeks. Language that does not prejudge. No promises about timelines the firm cannot keep. No medical opinion in the body of an administrative reply.

An LLM that has read a few thousand letselschade emails will happily violate all four of those rules on the first try. So the drafter is not "Claude, write a reply." The drafter is a constrained generator: it gets the dossier metadata, the most recent inbound message, a list of forbidden phrases harvested from a year of partner-rewritten drafts, and a strict template skeleton with named slots.
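A constrained generator of this kind can be sketched as a fixed skeleton plus a check on every slot fill. Everything specific here is hypothetical: the template wording, the slot names and the three forbidden phrases are stand-ins for illustration, not the firm's real list.

```python
import string

# Hypothetical skeleton; in practice the slot fills come from the model,
# the fixed text from the firm.
TEMPLATE = string.Template(
    "Dear $recipient,\n\n"
    "We confirm receipt of your message of $received_on "
    "regarding dossier $dossier_number.\n\n"
    "$next_step\n\n"
    "Kind regards,\n$sender"
)
REQUIRED_SLOTS = ("recipient", "received_on", "dossier_number", "next_step", "sender")

# Hypothetical examples of phrases that promise timelines or give medical opinion.
FORBIDDEN_PHRASES = ("we guarantee", "at the latest by", "in our medical opinion")


def render_ack(slots: dict[str, str]) -> str:
    """Fill the skeleton, refusing missing slots and forbidden phrases."""
    missing = [name for name in REQUIRED_SLOTS if not slots.get(name)]
    if missing:
        raise ValueError(f"missing slots: {missing}")

    for name in REQUIRED_SLOTS:
        lowered = slots[name].lower()
        for phrase in FORBIDDEN_PHRASES:
            if phrase in lowered:
                raise ValueError(f"forbidden phrase {phrase!r} in slot {name!r}")

    # substitute() raises KeyError on an unfilled placeholder, so a changed
    # template cannot silently ship with a literal "$slot" in it.
    return TEMPLATE.substitute({name: slots[name] for name in REQUIRED_SLOTS})
```

A phrase list only catches known wording, so a check like this narrows what the reviewer has to read for; it does not replace the review.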

The output is never sent automatically. It lands in the casemanager's outbox marked "concept" with the slot fills highlighted. The casemanager reads, presses send, or rewrites. The agent's job is to make the first 80% of the draft disappear, not to take responsibility for the last 20%.

Takeaway

For regulated work the right autonomy level is "drafts, never sends." The agent should remove typing, not remove judgement. The instant you cross that line you have a compliance problem, not an automation problem.

What broke in the first month

Three things, all instructive.

The Exchange 2013 EWS connection started silently throttling on day eleven. The agent kept polling, Exchange kept returning empty result sets, and the classifier started routing everything as "no prior context," which under the firm's rules pushes threads toward the senior queue. Two senior lawyers had a Wednesday morning of 140 false-positive escalations before we noticed. The fix was three things at once: an explicit empty-result-set alarm, a sanity check that compares EWS message counts hour-over-hour, and a hard cap on senior-queue volume that pages an engineer before it pages a partner.
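Those three guards fit in one small function. The sketch below assumes hourly counters are already being collected somewhere; the 50% drop ratio and the cap of 20 senior-queue threads per hour are placeholder thresholds, not the values used at the firm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GuardConfig:
    max_drop_ratio: float = 0.5   # assumed: alarm below 50% of the previous hour
    senior_queue_cap: int = 20    # assumed: hourly cap before an engineer is paged


def check_ews_health(
    prev_hour_count: int,
    this_hour_count: int,
    senior_queued_this_hour: int,
    cfg: Optional[GuardConfig] = None,
) -> list[str]:
    """Return the names of the alarms that should fire for this hour."""
    cfg = cfg or GuardConfig()
    alarms: list[str] = []

    if prev_hour_count > 0 and this_hour_count == 0:
        # Traffic last hour, nothing now: likely throttling, not a quiet inbox.
        alarms.append("empty_result_set")
    elif prev_hour_count > 0 and this_hour_count < prev_hour_count * cfg.max_drop_ratio:
        alarms.append("count_drop")

    if senior_queued_this_hour > cfg.senior_queue_cap:
        # Page an engineer before the partners see a flood of escalations.
        alarms.append("senior_queue_cap")

    return alarms
```

An hour-over-hour comparison will misfire around the natural end-of-day drop, so a real deployment would probably compare against the same hour on previous days instead.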

The €250k extractor was too literal. It looked for euro values in the message body. But AOV-aanspraken are often described as "verlies arbeidsvermogen tot pensioenleeftijd" without ever stating a number, and the actual value lives in an attached actuariële berekening as a PDF. We added an attachment reader and re-prompted the classifier with twelve months of internal "did this turn out to be above €250k" labels as few-shot examples. Accuracy on the threshold flag moved from 71% to 94%.

The PIV term clock was wrong. Exchange 2013 stamps message dates in the server's local time, which has not always been Amsterdam time since the server was rebuilt in 2019. Threads arriving Sunday night were being counted as Monday morning, so the clock started a day late and the firm had one day less to respond than the queue showed. We pinned the timezone in the IMAP-to-archive bridge and rebuilt the queue priorities. Two hours of work; two months of low-grade compliance risk we had been unknowingly carrying.
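Pinning the timezone amounts to interpreting every naive server timestamp in the server's actual zone and then doing all term arithmetic in Amsterdam time. A minimal sketch with the standard-library `zoneinfo` module follows. The server zone in the usage example (`Asia/Singapore`) is purely hypothetical, chosen because its offset reproduces the Sunday-night-becomes-Monday effect; the article does not say which zone the server was on.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

AMSTERDAM = ZoneInfo("Europe/Amsterdam")
ACK_TERM = timedelta(days=14)  # the two-week acknowledgement term


def to_amsterdam(naive_server_stamp: datetime, server_tz: str) -> datetime:
    """Interpret a naive server timestamp in the server's zone, return Amsterdam time."""
    if naive_server_stamp.tzinfo is not None:
        raise ValueError("expected a naive timestamp as stored by the server")
    return naive_server_stamp.replace(tzinfo=ZoneInfo(server_tz)).astimezone(AMSTERDAM)


def ack_deadline(naive_server_stamp: datetime, server_tz: str) -> datetime:
    """Deadline for the acknowledgement, counted from the Amsterdam receipt time."""
    return to_amsterdam(naive_server_stamp, server_tz) + ACK_TERM


# The server wrote "Monday 2 March 2026, 06:30" in its own local time.
stamp = datetime(2026, 3, 2, 6, 30)
received = to_amsterdam(stamp, "Asia/Singapore")
print(received.strftime("%A %d %B %H:%M"))  # Sunday 01 March 23:30
```

Read naively, that stamp says Monday; in Amsterdam the message arrived on Sunday evening, and the term runs from there.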

The numbers after four months

Routing accuracy on the classification step settled at 96.4% measured against casemanager re-routes. The senior queue receives roughly 38 threads a week, down from the firm's pre-agent estimate of "we'd see maybe a hundred true escalations a week if anyone had time to look." Two senior partners now spend their queue time on threads that genuinely belong there.

The casemanager team's email-handling time dropped from a self-reported 2.6 hours per person per day to 0.9 hours. That is not 2.6 hours of work disappearing; it is 2.6 hours of typing-and-routing replaced by 0.9 hours of reading-and-approving. The reclaimed hours moved to dossier work and client calls. The firm has not made its planned fifth-casemanager hire and currently has no plan to.

One number we deliberately do not chase: percentage of acknowledgements sent without human review. That number is zero, and we have written into the contract that it will stay zero.

Where the line sits

There is a useful conversation happening this month about where AI belongs. Norway has just put a near-ban on AI tools in elementary schools, on the grounds that children need to learn the underlying skill before they delegate it. The framing travels. Drafting an ontvangstbevestiging under a known code, against a known dossier, with a human pressing send: that is delegation in the boring, healthy sense. Letting an agent decide whether a €400,000 claim deserves a senior lawyer's eye is not delegation. It is abdication. The €250k threshold exists so the agent never has to make that call.

If you run a regulated inbox

The smallest useful thing you can do this afternoon: open the last 200 inbound messages in your team's main mailbox and label each one with the four-bucket scheme. Action, acknowledge, escalate, file. Count the escalations. If your gut-count was off by more than 30%, your team already has a routing problem that no amount of typing-fast can fix.
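If the labels end up in a spreadsheet column, the gut-check takes a few lines. This sketch assumes the four bucket names above as labels and measures "off by more than 30%" relative to the labelled count, which is one reasonable reading of that rule.

```python
from collections import Counter

BUCKETS = ("action", "acknowledge", "escalate", "file")


def escalation_gap(labels: list[str], gut_count: int) -> float:
    """Relative error of the gut estimate against the labelled escalations."""
    unknown = set(labels) - set(BUCKETS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    actual = Counter(labels)["escalate"]
    if actual == 0:
        # No escalations found: any non-zero guess is infinitely off.
        return 0.0 if gut_count == 0 else float("inf")
    return abs(gut_count - actual) / actual


# Example: 200 labelled messages, 10 of them escalations, and a gut feeling of 6.
labels = ["action"] * 120 + ["acknowledge"] * 40 + ["escalate"] * 10 + ["file"] * 30
gap = escalation_gap(labels, gut_count=6)
print(f"off by {gap:.0%}")  # off by 40%
```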

When we built this email agent for the Almelo letselschade firm, the surprise was not the LLM; the classifier was the easy part. The hard part was making an Exchange 2013 archive and a twelve-year-old Cicero install behave well enough that a senior partner would trust the queue. That kind of AI agent work is mostly plumbing, contract design, and disciplined scope. The model is the last 10%.

Key takeaway

For regulated email work, set the agent to draft and never send. The right metric is hours reclaimed for judgement, not messages auto-replied to.

FAQ

Can the agent send acknowledgements without human review?

No. Every outbound message lands in a casemanager's outbox marked as a draft. The agent reduces typing, not responsibility. Auto-send was contractually excluded from the scope on day one.

How does it integrate with Cicero and Exchange 2013?

SOAP for Cicero, EWS for the Exchange archive, IMAP on the central postvak. No public APIs, no webhooks, no vendor cooperation. The plumbing is most of the work; the LLM is the last ten percent.

What happens when a thread mentions a large AOV claim?

Anything implying a claim above €250,000 is parked in a senior-letselschade queue before any reply is drafted. A human reads it first. The threshold exists so the agent never has to judge the case itself.

Did the firm cut headcount after rolling this out?

No. They cancelled a planned fifth-casemanager hire and reallocated the reclaimed hours to dossier work and client calls. The team is the same size, doing more billable work per day.
