Voice agent pre-launch checklist: what we audit for a Dutch GP

Before a voice agent picks up the line at a huisarts, we run a checklist that has nothing to do with the model and everything to do with what calls actually sound like at 8:02 on a Monday.

Jacob Molkenboer · Founder · A Brand New Company · 4 Jun 2026 · 8 min

It is 08:02 on a Monday in a huisartsenpraktijk somewhere in Utrecht. The phone lines opened at 08:00. Three doktersassistentes are already on calls. Eleven more callers are queued. The fourth assistant is the voice agent we built, and she has been answering for two minutes.

This is the moment the pre-launch checklist either pays off or it doesn't. The work you do before that minute is the only work that matters. After that minute, you are reacting.

Why a model upgrade is not a launch criterion

A voice agent at a GP clinic does not fail because the language model is wrong. It fails because the call sounded like a stroke and the agent booked an appointment for Thursday. It fails because the agent paused for 1.8 seconds and the caller hung up. It fails because the agent dictated a paracetamol dose that the patient remembered for ten years.

The hard part of a healthcare voice agent has almost nothing to do with which model picks up the phone. The model matters, but containment matters more, and containment is mostly the work you do on either side of the model: what it is allowed to hear, what it is allowed to say, and where it is forced to hand off. Almost nothing on our checklist is about model choice. Most of it is about those rails.

Red-flag triage detection

The first thing we test is what we call the no-book list. There are calls that the agent must never resolve on its own, no matter how confident it sounds. The NHG-Triagewijzer, the Dutch GP triage standard, sorts complaints into urgency levels, from U1 (life-threatening, ambulance now) down to U5 (advice), with U0 reserved for resuscitation. For our agents the working rule is simpler: anything that could be U1 or U2 leaves the agent's hands inside ten seconds.

We seed the system prompt with the actual red-flag inventory the assistant team uses. Chest pain. One-sided weakness. Slurred speech. A baby under three months with a fever. Stridor. Suicidal ideation. Heavy bleeding. We then test each one cold, in Dutch, in a clinical voice and in a panicked voice. We also test them under realistic conditions: a caller who mumbles, a caller who is breathing irregularly, a TV running in the background.

// No-book list: Dutch phrases that must always escalate to a human.
// Grouped by category; entries are lowercase so they match the lowered transcript.
const RED_FLAGS: Record<string, string[]> = {
  cardiac: ["pijn op de borst", "drukkend gevoel", "uitstraling naar arm"],
  stroke:  ["scheef gezicht", "kan niet uit woorden komen", "plotseling raar"],
  paeds:   ["baby onder de drie maanden", "koorts bij baby"],
  mental:  ["uit het leven stappen", "doe mezelf wat aan"],
};

// Runs every turn on a rolling 20s transcript window.
// Returns true as soon as any red-flag phrase appears as a substring.
function shouldEscalate(transcript: string): boolean {
  const lowered = transcript.toLowerCase();
  return Object.values(RED_FLAGS)
    .flat()
    .some(phrase => lowered.includes(phrase));
}

The grandparent test

Then we run what we call the grandparent test. A large share of calls into a Dutch huisarts come from people who do not describe symptoms the way the NHG describes them. A 78-year-old woman does not say "I have unilateral facial droop." She says "ik voel me niet lekker en mijn man kijkt raar." We hand the agent a hundred of those phrasings, sampled from anonymised call notes, and we check whether the red-flag layer still fires. If it does not, we widen the patterns and we test again.
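A minimal sketch of how a grandparent test run could be scored, assuming each sampled phrasing carries a label from the assistant team. The names `GrandparentCase`, `runGrandparentTest`, and `toyDetector` are hypothetical, and the toy detector stands in for whatever red-flag layer is actually under test.

```typescript
// Hypothetical detector signature: returns true when the red-flag layer fires.
type Detector = (transcript: string) => boolean;

interface GrandparentCase {
  phrasing: string;      // how the caller actually said it
  mustEscalate: boolean; // label supplied by the assistant team (assumption)
}

// Splits the failures into missed red flags and false alarms.
function runGrandparentTest(cases: GrandparentCase[], detect: Detector) {
  const missed = cases.filter(c => c.mustEscalate && !detect(c.phrasing));
  const falseAlarms = cases.filter(c => !c.mustEscalate && detect(c.phrasing));
  return { missed, falseAlarms };
}

// Toy detector for this example only: literal substring match on two phrases.
const toyDetector: Detector = t =>
  ["pijn op de borst", "scheef gezicht"].some(p => t.toLowerCase().includes(p));

const sampleCases: GrandparentCase[] = [
  { phrasing: "Ik heb pijn op de borst", mustEscalate: true },
  { phrasing: "ik voel me niet lekker en mijn man kijkt raar", mustEscalate: true },
  { phrasing: "ik wil een herhaalrecept", mustEscalate: false },
];

const report = runGrandparentTest(sampleCases, toyDetector);
// Every phrasing in report.missed is a pattern to widen before the next run.
console.log(report.missed.map(c => c.phrasing));
```

Here the literal matcher catches the textbook phrasing and misses the grandparent one, which is exactly the kind of gap the test is meant to surface.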

AVG, NEN 7510, and the things we never write down

A Dutch GP clinic is subject to the AVG (the GDPR, under its Dutch name) and, in practice, also to NEN 7510, the national standard for information security in healthcare. The Autoriteit Persoonsgegevens, the Dutch data protection authority, has been increasingly active in healthcare data. Before we ship, we walk through three things.

First, recording. We default to no audio retention. The transcript is enough for almost every operational need. If the clinic insists on audio (some do, for assistant training), it is encrypted at rest, stored in an EU region, and deleted on a 30-day rolling window. No exceptions.

Second, transcripts. We strip the BSN, the citizen service number, before it hits any long-term store. The agent can hear a BSN, validate it against the HIS (the GP information system) in real time, and then drop it. We test this by reading thirty fake BSNs into the line and checking the database afterwards. Anything that leaks goes back to the redactor.
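One way a redactor and its leak check might look, as a sketch rather than a description of the production redactor. It assumes the BSN reaches the transcript as a run of 8 or 9 digits, possibly separated by spaces, dots, or hyphens. The pattern, the `[BSN]` placeholder, and the function names are all illustrative.

```typescript
// Matches 8 or 9 digits, optionally separated by single spaces, dots or hyphens.
// Deliberately broad: it will also mask other digit runs of that length
// (a date like 04-06-2026, for example). Over-redaction is the safer failure.
const BSN_LIKE = /\b\d(?:[ .-]?\d){7,8}\b/g;
const BSN_PLACEHOLDER = "[BSN]"; // hypothetical placeholder token

// Masks every BSN-shaped digit run before the transcript is stored.
function redactBsn(transcript: string): string {
  return transcript.replace(BSN_LIKE, BSN_PLACEHOLDER);
}

// Leak check for the test pass: returns stored rows that still contain
// something BSN-shaped, i.e. rows the redactor would still change.
function findLeaks(storedRows: string[]): string[] {
  return storedRows.filter(row => redactBsn(row) !== row);
}

console.log(redactBsn("mijn bsn is 111222333 dank u"));
console.log(findLeaks(["afspraak gemaakt", "nummer 123456782 genoteerd"]));
```

A ten-digit mobile number is left alone by this pattern, so the redactor does not destroy callback numbers. Whether that trade-off is right depends on what else the transcript store has to keep.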

Third, the verwerkersovereenkomst. This data processing agreement is signed before the line goes live, not after. If you launch and then chase paperwork, you are the news item.

The hand-off

Every checklist item above can fail. The hand-off is what keeps a failure recoverable.

We give the agent three triggers to hand off to a human:

  • The caller asks for one ("ik wil een mens", "kan ik iemand spreken", anything that maps to that intent).
  • The agent's confidence on intent classification drops below a threshold for two consecutive turns.
  • The red-flag layer fires.

The hand-off itself is a warm transfer to a free assistant, with a one-line spoken handover ("mevrouw De Vries, 64, klaagt over pijn op de borst sinds twintig minuten") and the full transcript pushed to the assistant's screen before the call connects. If no assistant is free within twelve seconds, we route to the doctor's direct line.
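The three triggers and the twelve-second fallback can be sketched as two small pure functions. The 0.6 confidence threshold is a made-up value (the threshold is not specified here), and `TurnSignal`, `handOffReason`, and `transferTarget` are hypothetical names.

```typescript
interface TurnSignal {
  callerAskedForHuman: boolean; // e.g. "ik wil een mens" mapped to this intent
  intentConfidence: number;     // 0..1, from the intent classifier
  redFlagFired: boolean;
}

const CONFIDENCE_THRESHOLD = 0.6; // assumption: illustrative value only
const LOW_CONFIDENCE_TURNS = 2;   // two consecutive turns below the threshold
const ASSISTANT_WAIT_MS = 12000;  // twelve seconds before the fallback route

type HandOffReason = "red_flag" | "caller_request" | "low_confidence";

// Returns why the call must be handed off after the latest turn, or null.
function handOffReason(turns: TurnSignal[]): HandOffReason | null {
  const latest = turns[turns.length - 1];
  if (!latest) return null;
  if (latest.redFlagFired) return "red_flag";
  if (latest.callerAskedForHuman) return "caller_request";
  const recent = turns.slice(-LOW_CONFIDENCE_TURNS);
  const allLow =
    recent.length === LOW_CONFIDENCE_TURNS &&
    recent.every(t => t.intentConfidence < CONFIDENCE_THRESHOLD);
  return allLow ? "low_confidence" : null;
}

type TransferTarget = "assistant" | "doctor_direct_line" | "keep_waiting";

// Picks the transfer target; voicemail is deliberately not an option.
function transferTarget(assistantFree: boolean, msWaited: number): TransferTarget {
  if (assistantFree) return "assistant";
  return msWaited >= ASSISTANT_WAIT_MS ? "doctor_direct_line" : "keep_waiting";
}
```

Checking the red flag first means an emergency is labelled as such in the handover even when the caller also asked for a human.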

Warning

Voicemail is not a hand-off. It is a hang-up dressed differently. If your fallback is "leave a message and we'll call back," you do not have a fallback.

Hours, the HAP, and 112

A clinic that closes at 17:00 does not stop having patients at 17:00. Every voice agent we ship has a clock, a calendar, and an opinion about the difference between regular hours, lunch break, public holidays (Koningsdag is the classic miss), and the moment calls should be forwarded to the regional huisartsenpost (HAP).
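A sketch of the routing decision, under stated assumptions: opening hours of 08:00 to 17:00 on weekdays, a 12:30 to 13:30 lunch break, and a hand-maintained holiday list. The lunch window, the holiday entries, and all names are illustrative. The caller's local date and time are passed in explicitly so the function needs no time-zone handling.

```typescript
type Route = "clinic" | "lunch" | "huisartsenpost";

interface LocalMoment {
  isoDate: string; // clinic-local date, e.g. "2026-04-27"
  weekday: number; // 0 = Sunday ... 6 = Saturday
  minutes: number; // minutes since local midnight
}

// Assumed schedule, for illustration only.
const OPEN = 8 * 60;
const CLOSE = 17 * 60;
const LUNCH_START = 12 * 60 + 30;
const LUNCH_END = 13 * 60 + 30;

// Assumed holiday list; Koningsdag 2026 falls on Monday 27 April.
const HOLIDAYS = new Set<string>(["2026-04-27", "2026-12-25"]);

function routeCall(m: LocalMoment): Route {
  const weekend = m.weekday === 0 || m.weekday === 6;
  if (weekend || HOLIDAYS.has(m.isoDate)) return "huisartsenpost";
  if (m.minutes < OPEN || m.minutes >= CLOSE) return "huisartsenpost";
  if (m.minutes >= LUNCH_START && m.minutes < LUNCH_END) return "lunch";
  return "clinic";
}

// A Monday at 09:00 looks like regular hours until the holiday list is checked.
console.log(routeCall({ isoDate: "2026-04-27", weekday: 1, minutes: 9 * 60 }));
```

Holidays are checked before hours on purpose: a holiday that falls on a weekday is the case a plain hours check gets wrong.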

The 112 reminder is separate, and it sits at the top of the prompt. If anything in the caller's words or tone suggests an emergency the agent cannot judge, the closing line is always the same: "Bij twijfel, bel 112." We do not let the model rephrase that line. We pin it as a literal string and we test that pin every release.
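Pinning the line as a literal could look roughly like this. It is a sketch that assumes the final reply is assembled in code after the model has produced its text; `withEmergencyClose` and `pinHolds` are hypothetical names.

```typescript
// The closing line as a literal; the model never generates or rephrases it.
const EMERGENCY_LINE = "Bij twijfel, bel 112.";

// Appends the pinned line to the model's reply unless it is already the close.
function withEmergencyClose(modelReply: string): string {
  const body = modelReply.trim();
  if (body.endsWith(EMERGENCY_LINE)) return body;
  return body.length > 0 ? `${body} ${EMERGENCY_LINE}` : EMERGENCY_LINE;
}

// Release check: the final spoken text must end with the exact literal.
function pinHolds(finalReply: string): boolean {
  return finalReply.endsWith(EMERGENCY_LINE);
}

console.log(withEmergencyClose("Ik verbind u door met de assistente."));
```

A paraphrase such as "Bel bij twijfel 112" fails `pinHolds`, which is the point: the release test checks the string, not the sentiment.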

Hallucination guards on appointments and medication

This is the section that goes wrong in demos and embarrasses people in production.

Two rules. The agent never books an appointment slot it has not just read back from the HIS calendar in the same turn. If the HIS is unreachable, the agent says so and offers a callback. It does not "tentatively pencil in" anything. We have seen demos where the model invents a Thursday at 10:15 that does not exist in any system. That is a clinic-level incident waiting to happen.
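The first rule can be sketched as a guard between the model's proposal and the booking call. The shape of the HIS read (`HisRead`) and the slot identifiers are assumptions for illustration, since the real HIS interface is not shown here.

```typescript
// Result of reading the calendar from the HIS during the current turn.
type HisRead = { ok: true; slotIds: string[] } | { ok: false };

type BookingDecision =
  | { kind: "book"; slotId: string }
  | { kind: "refuse_unknown_slot" }
  | { kind: "offer_callback" };

// Only slots the HIS returned in this same turn may be booked.
function decideBooking(proposedSlotId: string, readThisTurn: HisRead): BookingDecision {
  // HIS unreachable: say so and offer a callback, never a tentative slot.
  if (!readThisTurn.ok) return { kind: "offer_callback" };
  return readThisTurn.slotIds.includes(proposedSlotId)
    ? { kind: "book", slotId: proposedSlotId }
    : { kind: "refuse_unknown_slot" };
}

// The model proposes a Thursday 10:15 that the calendar never returned.
console.log(decideBooking("thu-1015", { ok: true, slotIds: ["wed-0900", "fri-1430"] }).kind);
```

Because the guard takes the same-turn read as an argument, a slot list cached from an earlier turn cannot reach it by accident.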

The agent also never says a number out loud about medication. Not a dose, not a frequency, not a strength. If the caller asks ("hoeveel paracetamol mag mijn dochter?"), the agent acknowledges and routes to an assistant. We tested this with thirty variations and the model wanted to be helpful on twenty-eight of them. We had to engineer the helpfulness out, with a hard string-match guard on top of the prompt instruction. The prompt alone was not enough.
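A hard string-match guard of this kind might look like the sketch below. The pattern list, the fallback sentence, and the `topicIsMedication` flag are all assumptions; a real list would be longer and tuned against the clinic's own transcripts.

```typescript
// Patterns that suggest a dose, strength, or frequency in the draft reply.
// Illustrative and deliberately over-eager; blocking too much is the safe side.
const MEDICATION_NUMBER_PATTERNS: RegExp[] = [
  /\d/, // any digit at all
  /\b(mg|ml|milligram|milliliter|gram|tablet|tabletten|druppels?)\b/i,
  /\b(twee|drie|vier|vijf|zes|zeven|acht|negen|tien|half|halve)\b/i,
];

// Hypothetical fixed reply that routes the caller to an assistant.
const ROUTE_TO_ASSISTANT =
  "Daar kan ik u geen advies over geven. Ik verbind u door met de assistente.";

// Runs on the model's draft after generation, on top of the prompt instruction.
function guardMedicationReply(draft: string, topicIsMedication: boolean): string {
  if (!topicIsMedication) return draft;
  const leaks = MEDICATION_NUMBER_PATTERNS.some(p => p.test(draft));
  return leaks ? ROUTE_TO_ASSISTANT : draft;
}

console.log(guardMedicationReply("Uw dochter mag 250 mg nemen.", true));
```

The guard replaces the whole reply rather than trimming the number out, so a half-sentence of dosing advice never reaches the caller.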

The audit log

One thing nearly everyone forgets: the audit log. Every action the agent takes (every booking, every cancellation, every prescription request routed to the doctor) is written to a per-call ledger with a timestamp, a caller hash, the action taken, and the model's stated reason. We make this readable by the doktersassistente, not just by a developer.
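A per-call ledger with those four fields could be as small as this. It is a sketch: the action names, the Dutch labels, and the assumption that the caller hash is computed elsewhere are all illustrative.

```typescript
type AuditAction = "booking" | "cancellation" | "prescription_request_routed";

interface AuditEntry {
  timestamp: string;  // ISO 8601, UTC
  callerHash: string; // hash of the caller identity, computed upstream (assumption)
  action: AuditAction;
  reason: string;     // the model's stated reason, verbatim
}

// Hypothetical labels so the line reads naturally for a doktersassistente.
const ACTION_LABELS: Record<AuditAction, string> = {
  booking: "Afspraak gemaakt",
  cancellation: "Afspraak geannuleerd",
  prescription_request_routed: "Receptverzoek doorgezet naar arts",
};

// Appends one entry to the call's ledger and returns it.
function recordAction(
  ledger: AuditEntry[],
  callerHash: string,
  action: AuditAction,
  reason: string,
  at: Date = new Date(),
): AuditEntry {
  const entry: AuditEntry = { timestamp: at.toISOString(), callerHash, action, reason };
  ledger.push(entry);
  return entry;
}

// One human-readable line per action; HH:MM is taken from the UTC timestamp.
function toReadableLine(entry: AuditEntry): string {
  return `${entry.timestamp.slice(11, 16)} ${ACTION_LABELS[entry.action]}: ${entry.reason}`;
}
```

Keeping the stored entry structured and the readable line derived means the developer view and the assistant view cannot drift apart.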

This is the cheapest insurance you will buy. Two hours of work to build, and the first time a caller says "the receptionist promised me an appointment at 10:00," you have the actual transcript and the actual database write. Or you do not, and you apologise.

The stress test

The last thing we do before a single real call lands is a 90-minute session with three of us on the line, taking turns being the worst version of a caller. Background TV. Toddler crying. Mumbled Dutch. Frisian accent. Mid-sentence hang-up and call back. Trying to barge in over the agent's greeting. Reading a fake BSN with the last digit wrong on purpose. Hanging up at three seconds to see whether the agent treats it as a deliberate short call or a network blip.

We log every failure with a timestamp and we do not ship until the failure list is empty or every remaining item has a written acknowledgement from the clinic owner. Not the assistant team. The owner.
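The ship rule is simple enough to encode as a gate. In this sketch, `StressFailure` and its fields are hypothetical names, and a non-empty acknowledgement string stands in for the owner's written sign-off.

```typescript
interface StressFailure {
  timestamp: string;             // when it happened in the session
  description: string;           // what went wrong
  ownerAcknowledgement?: string; // written sign-off from the clinic owner, if any
}

// Failures that still block the launch: no acknowledgement, or an empty one.
function blockingFailures(failures: StressFailure[]): StressFailure[] {
  return failures.filter(
    f => f.ownerAcknowledgement === undefined || f.ownerAcknowledgement.trim() === "",
  );
}

// Ship only when the list is empty or every remaining item is acknowledged.
function canShip(failures: StressFailure[]): boolean {
  return blockingFailures(failures).length === 0;
}

const sessionLog: StressFailure[] = [
  { timestamp: "00:14:05", description: "barge-in ignored during greeting" },
  {
    timestamp: "00:41:30",
    description: "Frisian accent misheard once",
    ownerAcknowledgement: "Accepted for launch, owner",
  },
];
console.log(canShip(sessionLog), blockingFailures(sessionLog).map(f => f.description));
```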

The shadow week

When the checklist passes, the agent does not go live to the public yet. It shadows. The phone line stays exactly as it was the week before. The agent listens to every call and produces what it would have said, into a side channel only we and the clinic read. We compare its actions against what the human did, for five working days. We tune the gaps. Then, and only then, we put it on the line.
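The shadow comparison can be sketched as a plain diff of actions per call. This assumes both the human's action and the agent's proposed action are reduced to the same short labels; `ShadowRecord` and the label values are illustrative.

```typescript
interface ShadowRecord {
  callId: string;
  humanAction: string; // what the assistant actually did
  agentAction: string; // what the agent would have done, from the side channel
}

// Returns the share of calls where agent and human agreed, plus the gaps to tune.
function compareShadowWeek(records: ShadowRecord[]) {
  const gaps = records.filter(r => r.humanAction !== r.agentAction);
  const agreement =
    records.length === 0 ? 0 : (records.length - gaps.length) / records.length;
  return { agreement, gaps };
}

const week: ShadowRecord[] = [
  { callId: "c1", humanAction: "book", agentAction: "book" },
  { callId: "c2", humanAction: "escalate", agentAction: "book" },
  { callId: "c3", humanAction: "route_hap", agentAction: "route_hap" },
  { callId: "c4", humanAction: "callback", agentAction: "callback" },
];

const result = compareShadowWeek(week);
console.log(result.agreement, result.gaps.map(g => g.callId));
```

The agreement rate is the less interesting output. A single gap where the human escalated and the agent booked matters more than the percentage.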

The smallest thing to do today

If you run a clinic and you are thinking about voice automation, do not start with a vendor demo. Start with one hour of listening to your own incoming calls, with permission, and writing down every single thing a human did in the first ten seconds. The list will surprise you. The voice agent has to do all of that, and it must resolve none of the calls on the no-book list. Everything else is engineering.

When we built the voice agent for a huisartsenpraktijk in the Randstad earlier this year, the thing we kept running into was the gap between how the NHG describes a call and how a Monday-morning call actually sounds. We ended up training the red-flag layer almost entirely on transcripts from the clinic's own assistants, not on the published protocol.

Key takeaway

A voice agent at a GP clinic does not fail on the model. It fails on the rails on either side, and the pre-launch checklist is mostly about those rails.

FAQ

Is it legal to use a voice AI agent at a Dutch GP clinic?

Yes, under the AVG and NEN 7510, provided a signed verwerkersovereenkomst is in place, data stays in the EU, the BSN is stripped from long-term storage, and the caller is informed that an automated system answers the line.

Does the voice agent record patient calls?

By default we keep only the transcript, not the audio. If the clinic needs audio for assistant training, it is encrypted, stored in the EU, and deleted on a 30-day rolling window.

What happens if the GP information system is unreachable during a call?

The agent does not invent appointment slots. It tells the caller the system is down, offers a callback inside a stated window, and writes the request to a fallback queue the assistants work off.

How long does the full pre-launch checklist take?

For a single-location clinic, about three weeks: one for setup and HIS integration, one for test passes including the live stress session, and one shadow week before the agent answers a public call.
