WhatsApp triage for Dutch homecare: a 540-msg/day playbook

A homecare coordinator in Almere starts her Tuesday with 312 unread WhatsApps before 09:00. Here is the agent we built to triage them without losing a single urgent message.

Jacob Molkenboer · Founder · A Brand New Company · 14 Nov 2024 · 9 min

Tuesday morning starts at 07:50. Marleen, the planning coordinator at a 36-person thuiszorg (homecare) organisation in Almere, opens WhatsApp Business on her laptop and sees 312 unread threads. By 09:00 there will be another 90. Of those, maybe four are urgent. Maybe one needs a nurse to call back inside the hour. The rest are reschedules, parking questions, "kan de wijkverpleegkundige iets later komen" ("can the district nurse come a bit later"), and three or four chains about a leaking IV line.

The old system was Marleen reading every message and forwarding the serious ones to the on-call nurse via a WhatsApp group with 12 people in it. The new system is a Dutch-language agent that triages each thread, escalates the real emergencies inside 90 seconds, and never lets a sentence with a BSN or medication name leak into a logfile.

This is the playbook.

The shape of 540 daily messages

Before we wrote a line of code, we sat with Marleen for two days and read every message from the past week: 3,780 in total, or 540 a day. The distribution was the kind of long tail that makes triage worth automating:

  • 61% logistical: shift times, parking codes, gate access, family pickup times.
  • 19% medical, non-urgent: medication refill reminders, wondzorg follow-ups, blood pressure logs.
  • 14% emotional or social: a daughter checking in, a client who is lonely, a complaint about a previous shift.
  • 5% medical, urgent: falls, sudden confusion, blood, breathing trouble, no response at the door.
  • 1% spam, wrong number, or the local pharmacy.

The 5% is what kept Marleen up at night. Two of those need a nurse in under an hour. The rest need a callback within four. If one of them sits unread because it arrived among 76 messages about parking, that is the kind of mistake that ends a thuiszorg licence.

The agent does not need to be smart about parking codes. It needs to be paranoid about the 5%.

Routing rules that respect Dutch

We built the agent on the official WhatsApp Business Platform (the Cloud API hosted by Meta), not the consumer app. That distinction matters: only the Business Platform gives you message webhooks, template messages, and a documented retention model. The consumer app cannot be automated without breaching WhatsApp's terms of service. The official docs live at developers.facebook.com/docs/whatsapp/cloud-api.
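To make the webhook side concrete, here is a minimal sketch of pulling text messages out of one webhook POST body. The nesting (`entry` → `changes` → `value` → `messages`) follows the payload shape described in the Cloud API docs as we understand it; the function name, the record keys, and the sample values are our own assumptions, so check the field names against the current docs before relying on them.

```python
from typing import Any


def extract_text_messages(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Flatten one webhook POST body into simple text-message records."""
    records = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            value = change.get("value", {})
            # Status callbacks carry no "messages" key, so they fall through here.
            for msg in value.get("messages", []):
                if msg.get("type") != "text":
                    continue  # non-text messages need their own route, e.g. human triage
                records.append({
                    "id": msg["id"],
                    "from": msg["from"],
                    "ts": msg["timestamp"],
                    "text": msg["text"]["body"],
                })
    return records


# Trimmed example payload; every value here is made up.
sample = {
    "object": "whatsapp_business_account",
    "entry": [{
        "id": "0",
        "changes": [{
            "field": "messages",
            "value": {
                "messaging_product": "whatsapp",
                "messages": [{
                    "from": "31600000000",
                    "id": "wamid.EXAMPLE",
                    "timestamp": "1731400000",
                    "type": "text",
                    "text": {"body": "Kan de zuster iets later komen?"},
                }],
            },
        }],
    }],
}

print(extract_text_messages(sample))
```

Skipping non-text messages is only acceptable if something else picks them up; a voice note about a fall is still an urgent message.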

Each inbound message hits a webhook. The agent runs a two-stage classifier:

# stage 1: intent + urgency, Dutch-tuned prompt
# In English: the model is a triage worker for a homecare organisation in Almere,
# picks one of five categories, answers in JSON, and when in doubt between
# MN and MU must choose MU.
SYSTEM = """Je bent een triagemedewerker voor een thuiszorgorganisatie in Almere.
Classificeer het bericht in een van de volgende categorieën:
  L  = logistiek (tijden, parkeren, sleutel, route)
  MN = medisch niet-urgent (medicatie, wondzorg, controle)
  E  = emotioneel of sociaal
  MU = medisch urgent (val, bloed, ademnood, verwardheid, geen reactie)
  S  = spam of verkeerd nummer

Antwoord met JSON: {"cat": "...", "confidence": 0.0-1.0, "reden": "..."}
Bij twijfel tussen MN en MU: kies MU.
"""

That last line is doing a lot of work. The cost of a false negative on MU is a nurse arriving too late. The cost of a false positive is a nurse reading one extra message. Every threshold in the agent tilts toward that asymmetry.
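The same asymmetry can be applied to the classifier's output itself. The sketch below is one way to parse the JSON reply so that anything malformed is treated as MU rather than dropped. The function name `parse_stage_one` and the fail-to-MU fallback are assumptions for illustration, not a description of the production code.

```python
import json

VALID_CATEGORIES = {"L", "MN", "E", "MU", "S"}


def parse_stage_one(raw: str) -> dict:
    """Parse the classifier reply; any malformed reply is treated as urgent."""
    fallback = {"cat": "MU", "confidence": 0.0, "reden": "unparseable classifier output"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    cat = data.get("cat")
    if not isinstance(cat, str) or cat not in VALID_CATEGORIES:
        return fallback
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to the 0.0-1.0 range the prompt asks for.
    confidence = min(max(confidence, 0.0), 1.0)
    return {"cat": cat, "confidence": confidence, "reden": str(data.get("reden", ""))}


print(parse_stage_one('{"cat": "MN", "confidence": 0.8, "reden": "medicatie"}'))
print(parse_stage_one("sorry, ik begrijp het niet"))
```

A reply the parser cannot read costs a nurse one extra glance, which is the cheap side of the trade.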

Stage two is a deterministic rule layer. If stage one says MU with any confidence, escalate. If it says MN but the message contains any entry from a 47-word Dutch red-flag list ("bloedt", "ademt niet", "gevallen", "niet wakker", "blauw", "stuiptrekkingen", "verward": bleeding, not breathing, fallen, not awake, blue, convulsions, confused), upgrade to MU. The list was written by the head of nursing, not by us.

Warning

Never let the LLM be the only gate between an urgent message and a nurse. A keyword tripwire downstream of the model catches the 0.3% of cases where the model gets cute. The cost of running it is one regex per message.
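A minimal sketch of such a tripwire, using only the seven red-flag terms quoted in this post (the real list has 47 entries and is not reproduced here). The names `RED_FLAGS` and `stage_two` are illustrative assumptions.

```python
import re

# Only the examples quoted in the post; the production list is longer.
RED_FLAGS = [
    "bloedt", "ademt niet", "gevallen", "niet wakker",
    "blauw", "stuiptrekkingen", "verward",
]

# One compiled alternation, whole-word and case-insensitive.
RED_FLAG_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in RED_FLAGS) + r")\b",
    re.IGNORECASE,
)


def stage_two(cat: str, text: str) -> str:
    """Apply the deterministic rules on top of the stage-one category."""
    if cat == "MU":
        return "MU"  # escalate regardless of confidence
    if cat == "MN" and RED_FLAG_RE.search(text):
        return "MU"  # the model said non-urgent, the vocabulary disagrees
    return cat


print(stage_two("MN", "Mijn moeder is gevallen in de badkamer"))
print(stage_two("MN", "Graag een herhaalrecept voor de medicatie"))
```

Whole-word matching keeps the rule predictable, but it also means inflections and misspellings have to be listed explicitly.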

Handing off to a nurse in under 90 seconds

The 90-second SLA is measured from the WhatsApp delivery timestamp to the moment a named nurse acknowledges the alert. We hit it on 97% of MU messages in the first six weeks. The path looks like this:

  1. Inbound WhatsApp message arrives, webhook fires (median 1.2s).
  2. Stage one plus stage two classifier runs (median 2.8s, p95 4.1s).
  3. If MU: a structured alert lands in three places at once. A Telegram group for the on-call nurse, a SIP call to her work phone via Twilio, and a status row in Postgres.
  4. The nurse acknowledges with a single tap. If no ack inside 60 seconds, the alert escalates to the back-up nurse. If no ack inside 90 seconds, it pages the duty manager.

The fan-out matters. WhatsApp groups are unreliable as an alert channel because notifications can be muted per-thread and Meta gives no read receipt to the sender's server. Telegram, SIP, and a database write together mean at least one channel will reach a human inside the SLA.
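The acknowledgment ladder in step 4 reduces to a small pure function. This is a sketch under one assumption the post does not spell out: that the 60 and 90 second marks are counted from the moment the alert is sent. The role labels are placeholders.

```python
from typing import Optional


def escalation_target(seconds_since_alert: float, acked: bool) -> Optional[str]:
    """Return who should currently be holding the alert, or None once acknowledged."""
    if acked:
        return None
    if seconds_since_alert >= 90:
        return "duty_manager"
    if seconds_since_alert >= 60:
        return "backup_nurse"
    return "on_call_nurse"


for elapsed in (10, 60, 95):
    print(elapsed, escalation_target(elapsed, acked=False))
```

Keeping the decision free of I/O makes it easy to unit-test the thresholds separately from the timers and pagers that act on them.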

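// alert fan-out: fail loud, fail fast
async function escalateMU(msg: TriagedMessage): Promise<void> {
  const alert = {
    id: msg.id,
    client: msg.clientRef,        // never the name, only the internal ref
    summary: msg.redacted,        // PII-scrubbed one-liner
    received_at: msg.ts,
    nurse_on_call: rota.current(),
  };

  // allSettled: one failing channel must not block the other two
  const channels = ["telegram", "twilio", "postgres"];
  const results = await Promise.allSettled([
    telegram.send(alert.nurse_on_call.tgChat, render(alert)),
    // Twilio needs a caller ID; ALERT_CALLER_ID is a placeholder for our number
    twilio.calls.create({ to: alert.nurse_on_call.phone, from: ALERT_CALLER_ID, url: TWIML_ALERT }),
    db.alerts.insert(alert),
  ]);

  // log every channel that failed, then keep moving
  results.forEach((result, i) => {
    if (result.status === "rejected") {
      console.error(`alert ${alert.id}: ${channels[i]} failed`, result.reason);
    }
  });

  // start the 60s ack timer
  scheduleAckCheck(alert.id, 60_000);
}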

Promise.allSettled instead of Promise.all is deliberate. If Telegram is down, we still want the SIP call to go out, and vice versa. We log the failed channel and keep moving.

PII under the AVG

Dutch homecare runs under the AVG (Algemene verordening gegevensbescherming), the Dutch name for the GDPR, supervised by the Autoriteit Persoonsgegevens. Health data is a special category. A client's name plus the fact that they receive thuiszorg is already health data. A medication name in a chat log is health data. A BSN in a chat log is a notifiable breach if it leaks.

Our rule: no client-identifying string ever reaches the LLM, ever reaches a third-party logging service, or ever sits in cleartext on disk longer than 14 days.

We do this with a redaction pass that runs on our infrastructure before the message ever leaves the VPC:

import re

# A lookbehind replaces the leading \b on PHONE: \b never matches between
# a space and "+", so the +31 form would otherwise slip through.
PHONE = re.compile(r"(?<![\w+])(?:\+31|0)[\s-]?6[\s-]?\d{8}\b")
IBAN = re.compile(r"\bNL\d{2}[A-Z]{4}\d{10}\b")
BSN = re.compile(r"\b\d{8,9}\b")

def redact(text: str, client_ref: str) -> tuple[str, dict]:
    """Replace identifiers with stable placeholders. Returns (text, vault)."""
    vault = {}
    def stash(match, kind):
        token = f"<{kind}:{len(vault)}>"
        vault[token] = match.group(0)
        return token

    # Order matters: phones and IBANs go first, so the BSN pattern cannot
    # claim the eight trailing digits of a number written as "06 12345678".
    text = PHONE.sub(lambda m: stash(m, "TEL"), text)
    text = IBAN.sub(lambda m: stash(m, "IBAN"), text)
    text = BSN.sub(lambda m: stash(m, "BSN"), text)
    # client name lookup runs from a hashed bloom filter, not the full list
    text = redact_known_names(text, client_ref, stash)
    return text, vault

The vault stays on our server. The LLM sees placeholders such as <TEL:0> and <BSN:1>. When the agent's reply needs to address the client by name, we substitute back on the way out, never on the way in. We chose this over a clever LLM-side prompt because regex does not get confused by a tired model at 03:00.
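The way back out can be sketched in a few lines. This assumes tokens of the form `<KIND:n>` as produced by the redaction pass; the function name `restore` is ours.

```python
import re

TOKEN = re.compile(r"<[A-Z]+:\d+>")


def restore(text: str, vault: dict[str, str]) -> str:
    """Swap placeholders back for the stored originals in a single pass."""
    # Unknown tokens are left untouched rather than guessed at.
    return TOKEN.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


vault = {"<TEL:0>": "0612345678"}
print(restore("Wij bellen u terug op <TEL:0>.", vault))
```

Doing the substitution in one regex pass means a restored value that happens to look like a token is never substituted a second time.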

For Dutch BSN specifically, the 11-proof check (elfproef) lets us validate before redaction so we do not waste a vault slot on a phone number that happened to look like a BSN. The algorithm is documented on the Wikipedia entry for Burgerservicenummer.
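A minimal sketch of the elfproef as it is commonly described for the BSN: weight the nine digits 9, 8, 7, 6, 5, 4, 3, 2 and -1, and require the sum to be divisible by 11. Left-padding eight-digit numbers with a zero and rejecting the all-zero number are assumptions on our side; the function name is illustrative.

```python
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def is_valid_bsn(candidate: str) -> bool:
    """Return True if the digit string passes the BSN variant of the elfproef."""
    if not (candidate.isascii() and candidate.isdigit()):
        return False
    if len(candidate) not in (8, 9):
        return False
    digits = [int(ch) for ch in candidate.zfill(9)]  # pad 8-digit numbers
    if not any(digits):
        return False  # all zeros would pass the sum check trivially
    total = sum(d * w for d, w in zip(digits, BSN_WEIGHTS))
    return total % 11 == 0


print(is_valid_bsn("111222333"))
print(is_valid_bsn("123456789"))
```

Running this check inside the BSN substitution means a nine-digit order number or a mistyped phone number is left alone instead of being stored as a BSN.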

The agents.md file that runs the show

There has been quiet debate on Hacker News this month about whether agents.md files actually help coding agents. The arguments cut both ways. Our experience is narrower and clearer: for a production triage agent that multiple people maintain, a single markdown file at the root that defines roles, escalation rules, and red-flag vocabulary is the difference between a system the head of nursing trusts and one she does not.

Ours is 240 lines. It covers:

  • The role and the boundaries ("never give medical advice, never confirm or deny medication dosages").
  • The five categories with three example messages each in Dutch.
  • The red-flag list, kept in version control by the head of nursing.
  • The exact JSON schema the classifier must return.
  • The escalation matrix.
  • The phrases the agent must use verbatim when handing off ("Een verpleegkundige neemt binnen 15 minuten contact op", meaning "A nurse will contact you within 15 minutes").

The file is the prompt and the documentation. When the head of nursing wants to add "kortademig" to the red-flag list, she edits the markdown, opens a pull request, and the change goes live after one review. No one touches the classifier code.
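For the markdown file to drive the rule layer without a code change, something has to read the list out of it at load time. A minimal sketch, assuming a layout the post does not specify: a `## Red flags` heading followed by one `- term` bullet per entry. Both the heading text and the function name are hypothetical.

```python
def load_red_flags(markdown: str, heading: str = "## Red flags") -> list[str]:
    """Collect the bullet items under one heading of a markdown document."""
    flags = []
    in_section = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Any heading either opens the target section or closes it.
            in_section = stripped.lower() == heading.lower()
            continue
        if in_section and stripped.startswith("- "):
            flags.append(stripped[2:].strip().lower())
    return flags


example = """# Triage agent

## Red flags
- bloedt
- Ademt niet
- kortademig

## Escalation matrix
- on-call nurse
"""

print(load_red_flags(example))
```

With a loader like this, adding a term is a one-line diff in the markdown, which is what keeps the review small enough for a clinical lead to own.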

What we measured after week six

The numbers from the last full reporting week (week 22 of 2026):

  • 3,812 inbound messages.
  • 187 classified as MU. Eleven of those were re-classified by the duty nurse as MN on review. False-positive rate: 5.9%.
  • Two MN messages were later flagged as missed urgencies by Marleen on her end-of-day audit. False-negative rate: 0.05% of all inbound messages, 1.1% of true MU.
  • Median time from inbound to nurse acknowledgment: 47 seconds. p95: 82 seconds. p99: 134 seconds.
  • Hours of coordinator time freed: roughly 23 per week, measured by Marleen's own timesheet against the four weeks before launch.
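The rates above can be reproduced from the raw counts. The sketch assumes that "true MU" means flagged MU minus the eleven downgrades plus the two misses, and that the 0.05% figure divides the two misses by all inbound messages, since that is the denominator that yields it.

```python
inbound = 3812
flagged_mu = 187
downgraded_to_mn = 11   # false positives found on review
missed_urgencies = 2    # false negatives found in the end-of-day audit

true_mu = flagged_mu - downgraded_to_mn + missed_urgencies

false_positive_rate = downgraded_to_mn / flagged_mu
false_negative_rate = missed_urgencies / true_mu
missed_share_of_inbound = missed_urgencies / inbound

print(f"true MU: {true_mu}")
print(f"false-positive rate: {false_positive_rate:.1%}")
print(f"false-negative rate on true MU: {false_negative_rate:.1%}")
print(f"missed urgencies as share of inbound: {missed_share_of_inbound:.2%}")
```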

The 1.1% false-negative rate on MU is the number we still watch. Two missed urgencies in six weeks is two too many. The fix in flight is a second, smaller model running in parallel on every MN classification, voting against the first. If they disagree, the message escalates.
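The voting rule itself is tiny. A sketch of the idea as described, with hypothetical names; the second model and how it is prompted are not covered here.

```python
def vote(primary: str, secondary: str) -> str:
    """Escalate an MN classification whenever the second model disagrees."""
    if primary == "MN" and secondary != "MN":
        return "MU"
    return primary


print(vote("MN", "MN"))
print(vote("MN", "MU"))
```

Any disagreement escalates, including a second opinion of L or E, which is deliberate: the point is to distrust a lone MN verdict, not to pick the better label.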

What you can do this afternoon

If you run an operations team that lives in WhatsApp, the five-minute audit is this. Export the last 200 inbound threads. Label each one MU, MN, L, E, or S by hand. Ask what fraction of MU messages were acknowledged inside your own SLA. If the answer is below 95%, you have a triage problem, not a staffing problem.
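Once the threads are labelled, the audit is one ratio. A minimal sketch, assuming each thread is a dict with a hand-assigned `label` and an `ack_seconds` value (None if nobody acknowledged); both field names and the 90-second default are placeholders for your own export and SLA.

```python
from typing import Optional


def mu_ack_rate(threads: list[dict], sla_seconds: float = 90) -> Optional[float]:
    """Share of MU threads acknowledged inside the SLA, or None if there are none."""
    urgent = [t for t in threads if t["label"] == "MU"]
    if not urgent:
        return None
    on_time = sum(
        1 for t in urgent
        if t["ack_seconds"] is not None and t["ack_seconds"] <= sla_seconds
    )
    return on_time / len(urgent)


labelled = [
    {"label": "MU", "ack_seconds": 45},
    {"label": "MU", "ack_seconds": 130},
    {"label": "MU", "ack_seconds": None},   # never acknowledged
    {"label": "L", "ack_seconds": None},
]

rate = mu_ack_rate(labelled)
print(f"MU acknowledged inside SLA: {rate:.0%}")
```

An unacknowledged urgent thread counts against the rate rather than being dropped from the denominator, which is the whole point of the audit.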

When we built this chat agent for the Almere provider, the thing we ran into hardest was the AVG redaction pass. Every prompt iteration tempted us to give the model more context, and every time we caught it we cut the context back. We ended up shipping with less context and tighter rules, and it has held.

Key takeaway

Build the urgent path first and automate the boring 95% second; the agent earns its keep on the messages a tired human would miss.

FAQ

Why use the WhatsApp Business Platform instead of automating the consumer app?

The consumer app cannot be automated without breaching WhatsApp's terms of service, and any unofficial automation breaks at every Meta update. The Business Platform Cloud API gives you webhooks, templates, a documented retention model, and a clear AVG processing role.

How does the agent handle Dutch dialect and informal spelling?

The red-flag list includes common misspellings (gevalle, blod, kortasem) added by nursing staff. The LLM handles the rest. We measure regression weekly against a labelled fixture set of 200 real messages.

What happens when the LLM is down or slow?

The regex red-flag pass does not depend on the model and can escalate on its own without classifier output. If both fail, every inbound message is queued to the human triage channel with no filter, so nothing is silently dropped.
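The degradation path can be sketched as follows. The two callables stand in for the classifier and the red-flag check, and `HUMAN_QUEUE` is a placeholder label. Sending a message to the human queue when the classifier is down and no red flag fires is our assumption about what "nothing is silently dropped" implies.

```python
from typing import Callable


def triage_with_fallback(
    text: str,
    classify: Callable[[str], str],
    red_flag_hit: Callable[[str], bool],
) -> str:
    """Return a category, degrading to regex-only and then to human triage."""
    try:
        flagged = bool(red_flag_hit(text))
    except Exception:
        flagged = False  # regex layer unavailable; rely on the classifier

    try:
        cat = classify(text)
    except Exception:
        # Classifier down or timed out: escalate on a red flag, else hand to a human.
        return "MU" if flagged else "HUMAN_QUEUE"

    if cat == "MN" and flagged:
        return "MU"
    return cat


def unavailable(_text: str):
    """Stand-in for a component that is down."""
    raise TimeoutError("simulated outage")


print(triage_with_fallback("Ze is gevallen", unavailable, lambda t: True))
print(triage_with_fallback("Parkeren?", unavailable, unavailable))
```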

How is the 90-second SLA measured?

Inbound timestamp from Meta's webhook payload to the moment the on-call nurse taps acknowledge in Telegram or answers the Twilio SIP call. Both write to the same Postgres row. We report median, p95, and p99 weekly.

Can the same architecture work for GPs or dental practices?

Yes, with a redrawn category set and a different red-flag list. The two-stage classifier, the redaction vault, and the fan-out alert pattern carry over. The vocabulary and escalation matrix have to be rewritten by the clinical lead.
