Voice agents for clinics: shipping against a 13-year-old EPD
The agenda phone rang 1,260 times last week at a 34-person mental-health practice in Hasselt. A voice agent picked up every one, in two languages, against an EPD older than the receptionist's tenure.

The agenda phone at the Hasselt group practice rang 1,260 times last week. A voice agent picked up every one of them. Three hundred of those calls were in French. Eleven were crises. None of them sat in queue longer than 45 seconds.
That sentence took fourteen months and a 13-year-old Daktari install to earn. Here is the playbook, in the order we wrote it.
The Monday-morning phone queue
The practice has 34 clinicians, one full-time receptionist, and a part-time afternoon backup. On a busy Monday before we shipped the agent, the queue would stack twelve deep by 08:25. Most of those calls were variations of "ik moet woensdag verzetten" ("I need to move Wednesday"): schedule changes. Some were French. A handful were people in crisis. A few were patients asking why this consultation cost €78 and the previous one cost €0, a recurring question because mental-health consultations by recognised clinical psychologists (klinisch psychologen) in Belgium are BTW-vrijgesteld (VAT-exempt) but session fees vary by contract type.
The receptionist could not triage that queue by ear in under three minutes. So a person in crisis was, statistically, waiting behind someone calling to push their Wednesday appointment to Thursday. That was the real problem to solve, and it is the problem most "let us add a voice bot" pitches do not solve. A voice bot that picks up faster but routes by FIFO does not help the patient who needs to be picked up first.
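The difference between FIFO and triage-first ordering fits in a few lines. The following is a minimal Python sketch, not the practice's code: the category names and their priority numbers are assumptions made for illustration, and in a real deployment the category would come from a classifier rather than being passed in by hand.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical categories; a lower number is handed to a human first.
PRIORITY = {"crisis": 0, "payment": 1, "reschedule": 2}


@dataclass(order=True)
class QueuedCall:
    priority: int
    seq: int                              # arrival order, keeps ties FIFO
    caller: str = field(compare=False)
    category: str = field(compare=False)


class TriageQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def add(self, caller, category):
        # Unknown categories fall to the lowest priority.
        prio = PRIORITY.get(category, max(PRIORITY.values()))
        heapq.heappush(self._heap, QueuedCall(prio, next(self._seq), caller, category))

    def next_call(self):
        return heapq.heappop(self._heap)


q = TriageQueue()
q.add("caller-1", "reschedule")
q.add("caller-2", "reschedule")
q.add("caller-3", "crisis")
order = [q.next_call().caller for _ in range(3)]
print(order)  # ['caller-3', 'caller-1', 'caller-2']
```

Under plain FIFO the crisis caller would be third. With the priority key they are first, and callers in the same category keep their arrival order.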
What shipping against Daktari actually means
Daktari is a Belgian EPD that has been in continuous use at this practice since 2013. It runs on a Windows Server in a closet. There is no REST API. There is an XML-over-HTTP endpoint for partner integrators, but the partner programme cost more than the project budget and required a six-month onboarding. The vendor documentation we did have was a 38-page PDF, mostly screenshots, mostly Dutch, partly from 2017.
What we built against, in the end, was a thin internal service we called daktari-bridge. It runs on the same Windows box, watches a queue of write requests, and uses the same UI automation pattern Daktari's own desk staff already use to enter appointments, except scripted, audited, and idempotent. We did not patch Daktari. We did not jailbreak Daktari. We wrote an honest, ugly adapter that did exactly what a typist does, eight times a second.
What "eight times a second" means in practice. The adapter binds to the Daktari main window through UI Automation, locates the appointment grid by accessible-id once at startup, then drives a deterministic keyboard sequence per write. We cache the field-coordinate map in memory and rebuild it on a window-resize event. The whole bridge is about 1,400 lines of C# and one operations runbook that says, in essence: if it hangs, kill the service, the queue will replay on restart. Two screens, one keyboard layout, no surprises.
The day you patch the EPD is the day the vendor stops returning emails. Adapter, not surgery. Every minute you spend trying to be clever inside the EPD database is a minute you will lose at the next vendor update.
Language detection before the first sentence
Hasselt sits in Limburg, Dutch-speaking, but the practice draws patients from across the linguistic border. About 24% of inbound calls are in French. We tested three approaches.
- Ask the caller to pick a language. Slow. People hang up.
- Detect language from the first sentence. Decent, but the first sentence in a mental-health call is often "euh, bonjour, c'est Madame X." Two words. Ambiguous tone.
- Detect language from the phone number prefix and the time-of-day prior, then refine after first audio.
We shipped the third. A number prefixed +32 4 (Liège area) gets a French-first opening. A number prefixed +32 11 (Hasselt area) gets a Dutch-first opening. Unknown numbers default to Dutch with a French follow-up if the first response misses. The hit rate on first-greeting language match is 91%. The 9% we get wrong, the agent corrects within the second turn without making a scene.
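A minimal sketch of the prefix rule, under stated assumptions. The function names are hypothetical, the time-of-day prior is left out, and one detail is added that the text does not describe: Belgian mobile numbers also begin with +32 4, so the sketch only treats a number as a Liège landline when the national part is 8 digits long, and handles mobiles as unknown.

```python
def opening_language(caller_number: str) -> str:
    """Pick 'nl' or 'fr' for the first greeting from the caller's number alone."""
    digits = "".join(ch for ch in caller_number if ch.isdigit() or ch == "+")
    # Normalise 0032... and national 0... formats to +32...
    if digits.startswith("00"):
        digits = "+" + digits[2:]
    elif digits.startswith("0"):
        digits = "+32" + digits[1:]
    if not digits.startswith("+32"):
        return "nl"                      # foreign or withheld: default to Dutch
    national = digits[3:]
    if national.startswith("11"):
        return "nl"                      # Hasselt area
    if national.startswith("4") and len(national) == 8:
        return "fr"                      # Liège landline (assumed length check)
    return "nl"                          # mobiles and everything else


def second_turn_language(opening: str, understood: bool) -> str:
    """Switch language on the second turn if the first response missed."""
    if understood:
        return opening
    return "fr" if opening == "nl" else "nl"


print(opening_language("+32 4 222 33 44"))   # fr
print(opening_language("+32 11 22 33 44"))   # nl
print(opening_language("+32 470 12 34 56"))  # nl
```

The example numbers are made up. The point of keeping this as a plain function is that the receptionist can read it and tell you where it is wrong.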
The hand-rolled signal is small and stupid and works. The thing not to do is fine-tune a bilingual model on your call recordings. You do not have enough data, your data is medical, and your data is not yours to fine-tune on.
The 45-second crisis budget
The hard constraint from the practice's clinical lead was this. A caller in crisis reaches a clinician within 45 seconds of the call connecting. Not 45 seconds of queue. 45 seconds of dial-to-human-voice.
That number was not negotiable. Everything in the agent's architecture sits downstream of it.
We split the audio path two ways. The first stream runs the conversational LLM at normal latency. The second stream runs a small, fast classifier on a 4-second sliding window, looking for crisis indicators in both languages. The classifier is not the brains of the operation. It is the smoke detector. False positives are fine, because the cost of a false positive is the praktijkmanager (practice manager) picking up a non-crisis call. The cost of a false negative is a person in crisis who does not get a clinician in time.
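To make the sliding window concrete, here is a deliberately crude Python sketch. It is a stand-in, not the classifier described above: it matches a hypothetical phrase list against the words heard in the last 4 seconds, where the real thing is a small model. The phrases, class name, and word-level timestamps are all assumptions.

```python
from collections import deque

# Hypothetical indicator phrases; a real list would come from the clinicians.
CRISIS_PHRASES = ("ik kan niet meer", "je n'en peux plus")
WINDOW_SECONDS = 4.0


class SmokeDetector:
    def __init__(self, window: float = WINDOW_SECONDS):
        self.window = window
        self.words = deque()             # (timestamp_seconds, lowercased word)

    def feed(self, timestamp: float, word: str) -> bool:
        """Add one transcribed word; return True if the window looks like a crisis."""
        self.words.append((timestamp, word.lower()))
        # Drop words that have slid out of the window.
        while self.words and timestamp - self.words[0][0] > self.window:
            self.words.popleft()
        text = " ".join(w for _, w in self.words)
        return any(phrase in text for phrase in CRISIS_PHRASES)


detector = SmokeDetector()
stream = [(0.0, "ik"), (0.4, "kan"), (0.8, "niet"), (1.2, "meer")]
fired = [detector.feed(t, w) for t, w in stream]
print(fired)  # [False, False, False, True]
```

The detector runs on every word and never waits for the conversational stream, which is the property that matters for the latency budget.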
When the classifier fires, three things happen in the same tick.
- The conversational agent shifts to a steady, low-affect script: "Ik blijf bij u aan de lijn. Ik haal een collega erbij." ("I am staying on the line with you. I am bringing in a colleague.")
- A SIP transfer to the on-duty GZ-psycholoog rotation kicks off in parallel with (not after) the verbal handoff.
- The daktari-bridge records nothing yet. Calendar mutations on a crisis call are forbidden until a human clinician has confirmed the disposition.
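The three steps above can be sketched as a small state object plus a write gate. This is an illustrative Python outline with hypothetical names; it records the actions rather than performing them, so the real parallelism between the script change and the SIP transfer is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class CallState:
    call_id: str
    crisis_flagged: bool = False
    clinician_confirmed: bool = False
    actions: list = field(default_factory=list)


def on_crisis_detected(call: CallState) -> None:
    """Flag the call and start both handoff steps in the same tick."""
    call.crisis_flagged = True
    call.actions.append("switch_to_holding_script")
    call.actions.append("start_sip_transfer")    # not queued behind the script


def confirm_disposition(call: CallState) -> None:
    """Called only after a human clinician has decided what happens next."""
    call.clinician_confirmed = True


def may_write_agenda(call: CallState) -> bool:
    """Calendar mutations stay blocked on a flagged call until confirmation."""
    return (not call.crisis_flagged) or call.clinician_confirmed


call = CallState("call-42")
on_crisis_detected(call)
print(may_write_agenda(call))   # False
confirm_disposition(call)
print(may_write_agenda(call))   # True
```

Putting the check in one function means every code path that wants to touch the agenda has to ask the same question.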
End-to-end median from connect to clinician-on-line: 31 seconds. p95: 42 seconds. We have not missed the 45-second SLA in 19 weeks.
Why payments hit a human before the calendar commits
The other class of call we refuse to write to the agenda for is anything that touches money. In Belgium, a psychotherapy consultation by a recognised clinical psychologist is BTW-vrijgesteld under article 44, §1, 1° of the BTW-wetboek. But BTW-vrijgesteld does not mean free. It means no VAT on the invoice. Patients confuse this constantly, and so do half the GPs who refer them.
If the caller asks any payment question, the agent does three things. It answers the factual layer it is allowed to answer (yes, the consultation is BTW-vrijgesteld; the practice does not do third-party payer billing). It does not commit any agenda write. And it schedules a callback from the praktijkmanager within four working hours, written into a separate intent queue, not into Daktari.
Why no agenda write on a payment-flagged call? Because in eight months of pilot data, calls with a payment question had a 31% no-show rate versus 6% for clean booking calls. People who do not understand what they will be charged do not show up. So we do not book them. The slot stays open for someone who will.
Do not let the voice agent commit a write on any call where the caller does not yet know what they will pay. Defer the calendar mutation. Always.
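The payment rule can be sketched as a gate that runs before any agenda intent is created. This is a hedged Python outline: the keyword list is a hypothetical stand-in for whatever intent detection the agent really uses, and the returned dictionary shape is invented for illustration.

```python
# Hypothetical Dutch and French hints that the caller is asking about money.
PAYMENT_HINTS = ("kost", "betal", "factuur", "prix", "payer", "rembours", "facture")
CALLBACK_WORKING_HOURS = 4


def mentions_payment(transcript: str) -> bool:
    text = transcript.lower()
    return any(hint in text for hint in PAYMENT_HINTS)


def route_call(transcript: str, requested_slot: str) -> dict:
    """Decide whether the call may produce an agenda write or only a callback."""
    if mentions_payment(transcript):
        return {
            "agenda_write": None,            # never commit on a payment-flagged call
            "callback": {
                "owner": "praktijkmanager",
                "within_working_hours": CALLBACK_WORKING_HOURS,
            },
        }
    return {"agenda_write": requested_slot, "callback": None}


print(route_call("Waarom kost dit consult 78 euro?", "2030-01-09T10:00"))
print(route_call("Ik moet woensdag verzetten.", "2030-01-09T10:00"))
```

The first call yields a callback entry and no agenda write; the second yields the requested slot. The callback lives in its own queue, so nothing about a payment question ever reaches Daktari.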
The two-phase agenda write
Even on a clean rebooking call, we do not write to Daktari live. We write to a staging table. A second process, running on the same Windows box and reading the same queue, replays the write into Daktari every six seconds. The replay is idempotent, keyed on a UUID we generate at intent time.
-- One row per intent captured on a call; replayed into Daktari later.
create table agenda_intent (
    id              uuid primary key default gen_random_uuid(),
    call_id         uuid not null,
    intent_kind     text not null check (intent_kind in ('book','reschedule','cancel')),
    patient_ref     text not null,
    slot_start      timestamptz not null,
    clinician_id    text not null,
    created_at      timestamptz default now(),
    applied_at      timestamptz,  -- set when the replay succeeds
    conflict_at     timestamptz,  -- set when the replay finds a clash
    conflict_reason text
);
-- A retried intent for the same call, kind, and slot cannot be staged twice.
create unique index agenda_intent_idem
    on agenda_intent (call_id, intent_kind, slot_start);
This buys us four things.
- If Daktari is down (it has been, four times in fourteen months, each time during the vendor's quarterly patch window), the queue waits and replays. The patient never knows.
- If a clinician changes the slot manually inside Daktari while the queue is waiting, we detect the conflict on replay and surface it to the receptionist instead of overwriting.
- We can audit. Every write has a recording, a transcript, an intent payload, and a final Daktari state. Five thousand calls in, we have a clean log.
- We can roll back. A misclassified intent can be reversed before it ever touches the EPD.
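A compressed Python sketch of the stage-then-replay idea, with several assumptions stated up front. SQLite stands in for the real staging database, the table is cut down to the columns the example needs, and a plain dictionary plays the part of Daktari. The function names are hypothetical, and only 'book' intents are handled.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""
    create table agenda_intent (
        id text primary key,
        call_id text not null,
        intent_kind text not null,
        slot_start text not null,
        applied_at text,
        conflict_reason text
    )""")
db.execute("""create unique index agenda_intent_idem
              on agenda_intent (call_id, intent_kind, slot_start)""")


def stage_intent(call_id, intent_kind, slot_start):
    """Stage an intent once; a retry with the same key is ignored."""
    db.execute(
        "insert or ignore into agenda_intent (id, call_id, intent_kind, slot_start) "
        "values (?, ?, ?, ?)",
        (str(uuid.uuid4()), call_id, intent_kind, slot_start),
    )


def replay(epd_slots):
    """Apply pending intents to epd_slots (slot_start -> holder), never overwriting."""
    pending = db.execute(
        "select id, call_id, slot_start from agenda_intent "
        "where applied_at is null and conflict_reason is null order by rowid"
    ).fetchall()
    for intent_id, call_id, slot in pending:
        holder = epd_slots.get(slot)
        if holder is not None and holder != call_id:
            # Someone changed the slot in the EPD first: surface it, do not overwrite.
            db.execute("update agenda_intent set conflict_reason = ? where id = ?",
                       ("slot taken in EPD", intent_id))
        else:
            epd_slots[slot] = call_id
            db.execute("update agenda_intent set applied_at = datetime('now') where id = ?",
                       (intent_id,))


stage_intent("call-1", "book", "2030-01-09T10:00")
stage_intent("call-1", "book", "2030-01-09T10:00")   # retry, ignored
stage_intent("call-2", "book", "2030-01-09T11:00")
epd = {"2030-01-09T11:00": "manual-entry"}            # a clinician got there first
replay(epd)
print(epd)
```

After the replay, call-1 holds the 10:00 slot, the manual 11:00 entry is untouched, and call-2's intent carries a conflict reason for the receptionist. Running `replay` again changes nothing, because applied and conflicted rows are no longer pending.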
There is a fifth thing the staging table buys us that we did not anticipate. It gives the receptionist a queue she can supervise. Every morning she opens a small dashboard and sees yesterday's intents, the applied count, the conflict count, and the average time-to-apply. The dashboard is the first thing she looks at before email. She catches edge cases there that no eval suite would have flagged, because she knows which clinicians take long lunches and which patients never confirm by text.
The receptionist used to spend roughly 70% of her morning answering the phone. She now spends about 30% of it, on the calls the agent escalates, which are also the calls she actually wants to be on.
The guardrails the receptionists wrote
The most useful tests in the suite were not written by us. They were written by the receptionist and the two clinicians who volunteered for the pilot. They wrote 84 example calls in Dutch, in French, and code-switched between the two, with mumbling, with a barking dog, with kids screaming in the background, and a one-line expected outcome for each. We replay those 84 against every model and prompt change before it leaves staging.
That suite has caught more regressions than any synthetic eval we wrote. The reason is straightforward. The receptionist has a model of what a Tuesday-evening rescheduling call sounds like that we, as engineers, do not. She wrote tests against that model. When we changed the conversational prompt last March and the new version began over-asking for the patient's national register number on French calls, her suite caught it the same afternoon.
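The shape of such a suite is simple enough to sketch. In this Python outline the example calls, outcome labels, and the toy agent are all invented for illustration; the real suite replays recorded-style calls against the actual agent.

```python
from dataclasses import dataclass


@dataclass
class ExampleCall:
    name: str
    transcript: str
    expected: str        # the one-line expected outcome


def run_suite(cases, agent):
    """Return the names of cases whose outcome differs from the expectation."""
    return [c.name for c in cases if agent(c.transcript) != c.expected]


def toy_agent(transcript: str) -> str:
    # Stand-in for the real agent: knows rescheduling, escalates everything else.
    text = transcript.lower()
    if "verzetten" in text or "déplacer" in text:
        return "reschedule"
    return "escalate"


cases = [
    ExampleCall("nl-reschedule", "Ik moet woensdag verzetten.", "reschedule"),
    ExampleCall("fr-reschedule", "Euh, bonjour, je voudrais déplacer mon rendez-vous.", "reschedule"),
    ExampleCall("fr-payment", "Pourquoi la consultation coûte 78 euros ?", "callback"),
]
failures = run_suite(cases, toy_agent)
print(failures)  # ['fr-payment']
```

A non-empty failure list blocks the change from leaving staging. The value is in who writes the `expected` column, not in the harness.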
The question we keep getting from peers is whether a local, self-hosted model can replace the hosted one for a workload like this. Six months ago, on the practice's own hardware, we tested two local Whisper variants for transcription. They were good enough for the smoke-detector classifier layer and not good enough, yet, for the conversational one. We are re-running that comparison next quarter, because the gap is closing fast.
What you can do today
If you run a clinic with a legacy EPD and a phone queue, the smallest useful thing is this. Instrument the queue. Log every call's wait time, its language, its outcome category, and whether the patient showed up. Two weeks of that data will tell you whether a voice agent is the right shape of solution, or whether you need an extra half-day of receptionist hours instead.
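As a starting point, the four logged fields reduce to a handful of numbers. This is a small Python sketch with hypothetical field names and made-up sample rows; `showed_up` is `None` where no appointment resulted from the call.

```python
from collections import defaultdict
from statistics import median


def summarise(calls):
    """Summarise wait time, language mix, and no-show rate per outcome category."""
    shows = defaultdict(list)
    for c in calls:
        if c["showed_up"] is not None:
            shows[c["outcome"]].append(c["showed_up"])
    return {
        "calls": len(calls),
        "median_wait_seconds": median(c["wait_seconds"] for c in calls),
        "share_french": sum(c["language"] == "fr" for c in calls) / len(calls),
        "no_show_rate": {k: 1 - sum(v) / len(v) for k, v in shows.items()},
    }


sample = [
    {"wait_seconds": 30,  "language": "nl", "outcome": "booking", "showed_up": True},
    {"wait_seconds": 200, "language": "fr", "outcome": "booking", "showed_up": False},
    {"wait_seconds": 95,  "language": "nl", "outcome": "payment", "showed_up": False},
    {"wait_seconds": 60,  "language": "nl", "outcome": "crisis",  "showed_up": None},
]
summary = summarise(sample)
print(summary)
```

If the no-show rate barely differs between categories and the median wait is short, the data is telling you the problem is staffing hours, not triage.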
When we built the voice agent for the Hasselt practice, the part we underestimated was how much of the work was not the model. It was the queue, the staging table, and the receptionist sitting next to us writing tests. We shipped a smaller agent than we planned, and a much bigger adapter, and the 45-second SLA held.
Key takeaway
Triage before transcription, queue every write through a staging table, and refuse to book any call where the caller does not yet know what they will pay.
FAQ
Why not use Daktari's partner XML endpoint instead of UI automation?
The partner programme cost more than the project budget and required a six-month onboarding. UI automation against the same screens the desk staff already use was honest, auditable, and shippable in weeks.
What stops the voice agent from booking a crisis caller into the wrong slot?
Calendar writes are blocked entirely on any call the crisis classifier flags. The SIP transfer goes through, the agent stays on the line, and no agenda mutation happens until a human clinician confirms the disposition.
How do you handle French callers from Wallonia who call the Dutch-speaking practice?
Phone number prefix gives the first guess. +32 4 numbers open in French; +32 11 in Dutch. Unknown numbers default to Dutch with a French follow-up if the first response misses. First-greeting hit rate is 91%.
Why defer payment questions to the praktijkmanager instead of answering them?
Calls with a payment question had a 31% no-show rate in pilot data versus 6% for clean booking calls. Booking a confused patient burns a slot. A four-hour callback from a human resolves the confusion before the slot is committed.
Could a local model replace the conversational LLM here?
Six months ago, no. Local Whisper variants were good enough for the smoke-detector classifier but not for the conversational layer. We re-test every quarter. The gap is closing.