Voice agent for a Dutch GP: shipping 380 calls a day
It is 08:07 on a Tuesday, the prescription line at a Dutch GP practice has 41 calls in queue, and the voice agent picks up call 42 in 800ms.

The 08:07 problem
It is 08:07 on a Tuesday and the prescription line at a Dutch huisartsenpraktijk has 41 calls in queue. The doktersassistente has already triaged eleven walk-ins, called a stressed mother back about her son's wheezing, and the espresso machine has not been switched on yet. By 09:30 the queue will peak at 67. By noon she will have typed 38 herhaalrecepten into HiX by hand. Somewhere inside that 380-call day is the volume a voice agent should be carrying.
Multiply the morning out. Three out of four calls are repeat-prescription requests where the patient already knows which medication, the practice already knows the patient, and HiX already knows the last issue date. It is the most boring, most necessary work in the building. It is also exactly where a voice agent earns its keep.
What follows is the playbook we used to ship one. No diagrams of "the future of healthcare". Just the parts that hurt during the build.
What the agent actually does
It picks up within one ring. It greets the patient in Dutch. It asks for the BSN, accepts it via DTMF or spoken digits, validates it with the elfproef, and reads it back grouped as four-three-two. It confirms date of birth as a second factor. It then handles one of four intents:
- Repeat prescription. Patient names the medication. The agent finds the matching active MedicationRequest in HiX via FHIR, confirms the dose, and queues a herhaalrecept-aanvraag for the GP to sign off before lunch.
- Triage slot. The agent runs a short complaint dialogue against the Nederlandse Triage Standaard, books a slot in HiX if green or yellow (U3 to U5), and escalates red (U1 and U2) to the doktersassistente.
- Test result callback. The agent says no, those go through the GP in person, and offers a 9-minute consultatie slot instead.
- Anything else. The agent says it cannot help with this and transfers to a human within four seconds.
That last intent is the most important one. The agent's job is not to be impressive. Its job is to handle the 73% of calls that are mechanical and hand the rest over fast.
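The routing above can be sketched in a few lines. This is a minimal illustration, not the production code: the intent labels and handler names are hypothetical, and the only point it makes is that every unrecognised label falls through to a human transfer.

```python
from enum import Enum


class Intent(Enum):
    REPEAT_PRESCRIPTION = "repeat_prescription"
    TRIAGE_SLOT = "triage_slot"
    TEST_RESULT = "test_result"
    OTHER = "other"


# Hypothetical handler names; each would be implemented elsewhere in the agent.
HANDLERS = {
    Intent.REPEAT_PRESCRIPTION: "handle_repeat_prescription",
    Intent.TRIAGE_SLOT: "handle_triage_slot",
    Intent.TEST_RESULT: "offer_consult_slot",
}


def route(label: str) -> str:
    """Map a classifier label to a handler; anything else transfers to a human."""
    try:
        intent = Intent(label)
    except ValueError:
        # Labels the classifier was never meant to emit count as "anything else".
        intent = Intent.OTHER
    return HANDLERS.get(intent, "transfer_to_human")
```

Keeping the transfer as the default branch, rather than as one more handler, means a new or misclassified intent can never land in a prescription or booking flow by accident.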
The telephony layer
Every Dutch GP practice we have worked with runs either KPN ÉÉN, Voys, or a local SIP setup with a hosted PBX. All three terminate as SIP. We point a single DID at a LiveKit room running on a Hetzner box in Falkenstein. LiveKit handles the real-time audio bridge between the caller and the agent loop. The whole stack stays inside the EU.
The room is short-lived. One call, one room, no recording, no transcript persisted beyond the dialogue turn.
```yaml
# livekit-sip-trunk.yaml
sip_trunk:
  name: gp-prescription-line
  numbers: ["+31201234567"]
  inbound_addresses: ["sip.voys.nl"]
  inbound_auth_username: ${VOYS_SIP_USER}
  inbound_auth_password: ${VOYS_SIP_PASS}
dispatch:
  rule:
    name: route-to-agent
    trunk_ids: [gp-prescription-line]
  room_config:
    agents:
      - agent_name: huisarts-agent
        metadata: '{"practice_id":"prk_0421","language":"nl-NL"}'
```
Falkenstein is not arbitrary. Verwerkersovereenkomsten with Dutch GP practices increasingly name a specific EU region rather than "the EU" in general, and Hetzner's Falkenstein data centre gives us a concrete contractual location roughly 560km from Amsterdam as the crow flies, with a sub-15ms round trip to the practice's PBX. That latency matters at the telephony edge in a way it does not at the application edge.
ASR, TTS, and the BSN problem
Dutch ASR on noisy 8kHz phone audio is the part that quietly kills most voice agents. We landed on Azure Speech in West Europe for ASR with custom phrases for medication names (the stock model otherwise transcribes "Metoprolol" as "Meta-Prolog" about one time in twelve). For TTS we use Azure Neural Voices, also West Europe, voice nl-NL-FennaNeural.
BSN is the hard case. Patients read out nine digits at speaking pace, the ASR drops or merges one, and now the agent is looking at the wrong patient. Three things made it reliable:
- DTMF first. The agent offers keypad entry as the default and only falls back to spoken digits if the caller is on a non-DTMF handset.
- Elfproef validation. Every BSN gets checked against the official 11-test before any HiX lookup runs. A failing checksum means the agent re-prompts the caller; it does not retry the lookup.
- Grouped readback. The agent reads the BSN back in four-three-two chunks, slowly, with SSML pacing.
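```python
def valid_bsn(bsn: str) -> bool:
    """11-proef. See logius.nl/diensten/burgerservicenummer-bsn"""
    # Exactly nine ASCII digits. str.isdigit() alone also accepts characters
    # such as "²", which int() cannot parse.
    if len(bsn) != 9 or not (bsn.isascii() and bsn.isdigit()):
        return False
    # The last digit carries weight -1; the weighted sum must be divisible by 11.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(int(d) * w for d, w in zip(bsn, weights)) % 11 == 0


def speak_bsn(bsn: str) -> str:
    """Build the SSML readback: groups of four, three, two at 80% rate."""
    # Explicit 220ms break between groups instead of relying on comma pauses.
    pause = '<break time="220ms"/>'
    grouped = pause.join([bsn[:4], bsn[4:7], bsn[7:]])
    return (
        f'<speak><prosody rate="80%">'
        f'Uw burgerservicenummer is {grouped}.'
        f'</prosody></speak>'
    )
```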
The single biggest accuracy win was telling the LLM that any BSN it received from ASR was provisional, and that the only legitimate confirmation was a DTMF entry or a verbal "ja" after the grouped readback. Without that rule the model would happily proceed with a transcribed nine-digit string that failed the elfproef on the next turn.
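The provisional-BSN rule is easier to enforce in code than in a prompt alone. The sketch below is one way to express it, assuming a hypothetical `BsnCapture` record that the dialogue layer fills in; the names are illustrative, and the elfproef is repeated here only so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class BsnCapture:
    digits: str
    source: str                       # "dtmf" or "asr"
    readback_confirmed: bool = False  # True after a verbal "ja" to the grouped readback


def elfproef_ok(bsn: str) -> bool:
    """Nine ASCII digits whose weighted sum is divisible by 11."""
    if len(bsn) != 9 or not (bsn.isascii() and bsn.isdigit()):
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(int(d) * w for d, w in zip(bsn, weights)) % 11 == 0


def may_lookup(capture: BsnCapture) -> bool:
    """Allow a HiX lookup only for a checksum-valid, confirmed BSN."""
    if not elfproef_ok(capture.digits):
        return False
    # ASR digits stay provisional until the caller confirms the readback.
    return capture.source == "dtmf" or capture.readback_confirmed
```

With a gate like this in front of the lookup tool, the model cannot proceed on an unconfirmed transcription even if it ignores the instruction in its prompt.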
Patient acoustic conditions vary more than any ASR demo suggests. Patients on a landline at home read digits about twice as fast as patients on speakerphone in a car park, and the grouped readback at default rate becomes unusable for the landline group. We tuned the SSML prosody to 80% rate with a 220ms pause between groups, which costs roughly 1.4 seconds of wall time per call and removes a category of "sorry, can you say that again" reprompts. The car park group needs a different fix: the agent listens for vehicle noise on the first turn and, if detected, opens with a one-line warning that it will only accept DTMF for the BSN.
HiX integration via FHIR
ChipSoft's HiX exposes an HL7 FHIR R4 endpoint inside the practice network. The agent never calls it directly. A small mediation service inside the practice VLAN translates tool calls from the agent into FHIR searches and writes, scoped to one OAuth client with read access to Patient and MedicationRequest and write access only to a herhaalrecept-aanvragen worklist.
Two FHIR tools plus one scheduling tool are all the agent needs. The two FHIR tools look like this:
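```python
# `fhir` is the mediation service's async FHIR client (not shown here).
# FHIR identifier systems are matched as exact strings; the Dutch BSN
# naming system uses the http scheme.
BSN_SYSTEM = "http://fhir.nl/fhir/NamingSystem/bsn"


async def find_active_prescription(bsn: str, drug_query: str) -> dict | None:
    patient = await fhir.search("Patient", identifier=f"{BSN_SYSTEM}|{bsn}")
    if not patient.entry:
        return None
    pid = patient.entry[0].resource.id
    meds = await fhir.search(
        "MedicationRequest", subject=f"Patient/{pid}", status="active"
    )
    q = drug_query.lower()
    for m in meds.entry or []:
        # Requests that reference a Medication resource carry no inline text.
        concept = getattr(m.resource, "medicationCodeableConcept", None)
        text = (getattr(concept, "text", None) or "").lower()
        # An empty query would match every prescription, so require one.
        if q and q in text:
            dosage = getattr(m.resource, "dosageInstruction", None) or []
            return {
                "id": m.resource.id,
                "drug": text,
                "dose": dosage[0].text if dosage else None,
                "last_issued": m.resource.authoredOn,
            }
    return None


async def queue_repeat_request(mr_id: str, patient_ref: str, note: str) -> str:
    # Creates a worklist Task only; the GP still signs off the prescription.
    task = await fhir.create("Task", {
        "status": "requested",
        "intent": "order",
        "focus": {"reference": f"MedicationRequest/{mr_id}"},
        "for": {"reference": patient_ref},
        "description": "Herhaalrecept via spraakagent",
        "note": [{"text": note}],
    })
    return task.id
```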
The GP sees the queued tasks in HiX exactly the way she sees any other repeat-prescription request, and signs them off in batch. No new screen, no new login, no shadow worklist.
Triage that knows when to give up
For the triage slot intent, we use the Nederlandse Triage Standaard as the rubric. The agent walks the caller through a short complaint dialogue and assigns U1 to U5. U1 and U2 transfer immediately. U3 books a same-day slot. U4 and U5 book the next available slot or offer a teleconsult.
The system prompt does not contain medical reasoning. It contains a decision tree, the script for each branch, and a hard rule: if anything in the patient's words trips a red-flag list (chest pain, sudden numbness, suicidal thoughts, paediatric high fever), the agent transfers in its next utterance. No "let me check one thing first". No reassurance. The next thing the caller hears is the doktersassistente.
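The decision tree and the red-flag override can be sketched as a single function. The Dutch phrases below are illustrative placeholders, not the practice's actual red-flag list, and the action names are hypothetical. Substring matching is the simplest possible trigger and would need a clinically reviewed list in practice.

```python
from typing import Optional

# Illustrative phrases only; a real list is owned by the practice's clinical lead.
RED_FLAGS = ("pijn op de borst", "gevoelloos", "zelfmoord", "hoge koorts")


def triage_action(utterance: str, urgency: Optional[str]) -> str:
    """Return the next action; red flags override whatever urgency was assigned."""
    text = utterance.lower()
    if any(flag in text for flag in RED_FLAGS):
        return "transfer"
    if urgency in ("U1", "U2"):
        return "transfer"
    if urgency == "U3":
        return "book_same_day"
    if urgency in ("U4", "U5"):
        return "book_next_available"
    # Missing or unknown urgency: give up and hand over.
    return "transfer"
```

The red-flag check runs before the urgency lookup on purpose, so a caller who mentions chest pain halfway through a low-urgency dialogue is still transferred.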
The agent never diagnoses, never reassures, never says "that sounds fine". Reassurance from an LLM in a triage context is a liability the practice does not need. A Canadian tribunal held Air Canada liable for what its chatbot told a customer, dismissing the company's argument that the bot was a separate legal entity. Healthcare is where that doctrine lands next, and the burden of proof sits with whoever shipped the model.
Keeping audio inside the AVG perimeter
The AVG and the Wet aanvullende bepalingen verwerking persoonsgegevens in de zorg both treat patient voice in a clinical context as bijzondere persoonsgegevens. NEN 7510 layers operational requirements on top. Three rules came out of that and shaped the build:
- No audio leaves the EU. SIP, ASR, TTS, and the LLM all run in West Europe or Falkenstein, and the FHIR mediation service runs inside the practice VLAN. Azure resources are provisioned with the customer-managed key option and the West Europe region pinned in the deployment manifest, not just the portal.
- Audio is not stored. LiveKit streams audio frames to ASR and discards them. We keep no recording. The only artefact is a structured turn log (intent, slot values, hashed BSN, timestamps) retained for 30 days for incident review, then deleted.
- BSN never lives in a log. The mediation service sees the BSN once, looks up the patient, and replaces it with the HiX patient id everywhere downstream. The turn log stores a SHA-256 of BSN plus a per-practice pepper, only so we can correlate a complaint to a call.
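A minimal sketch of the log key follows. It uses HMAC-SHA-256 with the pepper as the key, which is one common way to combine a secret with a hash; the text above only says "SHA-256 of BSN plus a per-practice pepper", so treat the HMAC construction and the function name as assumptions.

```python
import hashlib
import hmac


def bsn_log_key(bsn: str, pepper: bytes) -> str:
    """Keyed SHA-256 of a BSN for the turn log; the raw BSN is never written."""
    return hmac.new(pepper, bsn.encode("ascii"), hashlib.sha256).hexdigest()
```

The pepper matters more than usual here. A nine-digit number with a checksum has a small enough search space that an unkeyed hash could be reversed by enumeration, so the pepper has to live outside the log store.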
The practice's DPIA documents the data flow at this level of detail. We start from the Autoriteit Persoonsgegevens DPIA template and bolt on the NEN 7510 control mapping. The verwerkersovereenkomst between the practice, the implementation partner, and Microsoft names the West Europe region as a contractual obligation, not just a deployment detail. For the FHIR side, the HL7 FHIR security guidance covers the OAuth scoping and audit-event logging we use against HiX.
What we measure
The agent is judged by four numbers. None of them are "calls deflected".
- Time to first word. Median 740ms from "hallo" to the agent's first syllable. Above 1200ms patients start saying "hallo?" again and the conversation desynchronises.
- Transfer rate. 22% of calls hand off to the doktersassistente. We want this number to stay between 18% and 28%. Lower means the agent is over-reaching. Higher means the BSN flow or intent classifier is broken.
- Wrong-patient rate. Zero across 41,000 calls. Elfproef plus DOB second factor plus grouped readback is what makes that hold. We audit it weekly against the herhaalrecept queue in HiX.
- Assistant hours returned. 5.8 per day, measured against the prior six-week baseline. Over a five-day week that is 29 hours, roughly three quarters of an FTE the practice no longer has to recruit for.
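The first two numbers fall straight out of the structured turn log. The sketch below assumes a hypothetical per-call record with `first_word_ms` and `transferred` fields; the field names are illustrative, not the actual log schema.

```python
from statistics import median


def summarise(calls: list[dict]) -> dict:
    """Median time to first word, share of slow pickups, and transfer rate."""
    first_word = [c["first_word_ms"] for c in calls]
    return {
        "median_first_word_ms": median(first_word),
        # Above 1200ms callers tend to say "hallo?" again.
        "slow_share": sum(ms > 1200 for ms in first_word) / len(calls),
        "transfer_rate": sum(c["transferred"] for c in calls) / len(calls),
    }
```

Tracking the share of calls above 1200ms alongside the median is a design choice: a healthy median can hide a slow tail, and the tail is where conversations desynchronise.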
How we rolled it out
You do not ship a voice agent to a GP practice on a Monday morning. We ran two weeks of shadow mode first. The agent answered nothing. It listened, transcribed, and produced a parallel decision tree for every call the doktersassistente handled live. At the end of her shift she saw a small dashboard with the agent's would-be outcome for each call. Disagreements went into a review queue.
After two weeks of shadow mode and four iterations of the system prompt, we moved to canary. The agent handled 10% of incoming calls, time-boxed to the calmest hour of the day (11:00 to 12:00). Every transferred call generated an automatic post-mortem entry. Those weeks killed three classes of bug we would not have predicted: the medication-name homophone (Lyrica vs Lyric), the wheezing-child false-green where the agent booked a U4 instead of escalating, and the Limburgs accent that defeated the digit recogniser on roughly one BSN in three.
Full rollout came after the canary's transfer rate held inside 18-28% for ten consecutive working days. The first morning at 100% we had a senior engineer on the dial-out line and the doktersassistente had a single keystroke to drop the agent back to 0% if anything looked wrong. She used it twice in the first week. Both times for the right reason: an unfamiliar caller pattern that turned out to be a pharmacy calling through the patient line.
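The rollout gate itself is simple enough to write down. A minimal sketch, assuming one transfer-rate value per working day in chronological order; the function name and defaults are illustrative.

```python
def ready_for_full_rollout(
    daily_rates: list[float],
    low: float = 0.18,
    high: float = 0.28,
    days: int = 10,
) -> bool:
    """True when the most recent `days` working days all sit inside the band."""
    if len(daily_rates) < days:
        return False
    return all(low <= r <= high for r in daily_rates[-days:])
```

A single day outside the band resets the count, which is what "consecutive" buys: the gate only opens once the agent has been boring for two full working weeks.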
The smallest thing you can do today
When we built the voice agent for this practice, the thing that nearly killed the project was not the LLM, the ASR, or the FHIR integration. It was the BSN readback. Patients trust nine digits to a voice on the line about as much as you trust your bank card to a stranger. Until the agent could read those digits back grouped, slow, and right every single time, nothing else mattered. We solved it by treating ASR-captured BSN as provisional and only accepting DTMF or a post-readback verbal "ja" as ground truth.
Pick one phone call your team handles forty times a day. Listen to five of them back to back. Write down the four intents. That is your spec.
Key takeaway
A healthcare voice agent earns its keep by getting BSN readback and transfer-on-doubt right before anything fancier ever runs.
FAQ
Does the voice agent diagnose patients?
No. It walks the Nederlandse Triage Standaard rubric, books slots, and transfers anything red or ambiguous to the doktersassistente within four seconds.
Where does the audio go?
Nowhere persistent. LiveKit streams audio frames to Azure Speech in West Europe and discards them. No recording, no transcript kept beyond the structured turn log.
How does the agent verify the patient?
BSN entered by DTMF or spoken digits, validated by the elfproef, grouped readback for confirmation, then date of birth as a second factor before the agent acts on anything in HiX.
Can the agent write into HiX directly?
Only through a mediation service in the practice VLAN, with OAuth scopes limited to read access on Patient and MedicationRequest plus write access to the herhaalrecept-aanvragen Task worklist.
What happens when the agent gets confused?
It says it cannot help and transfers to a human in the next utterance. Around 22% of calls hand off. The system is designed for that, not against it.