
Voice agent rollout: 14 physio clinics in two languages

A voice agent that handles a 47-person physio chain's phone line in two languages sounds simple, until you map the receptionist's actual Monday morning.

Jacob Molkenboer · Founder · A Brand New Company · 20 Nov 2024 · 9 min


Monday morning, 8:42. Three phone lines are blinking at the central reception of a 47-person physiotherapy chain in Utrecht. The receptionist has a tab open for each of the 14 clinic calendars, an Excel sheet of family-discount rules pinned to the wall, and a colleague in the Overvecht clinic asking over Teams whether the new Arabic-speaking patient was booked into Lange Nieuwstraat or Kanaleneiland. By 9:15 the queue is fourteen people deep and somebody has hung up. This is the morning that decides whether a voice agent is going to earn its keep.

This is the call we picked up last winter. The chain wanted one number, one promise to the caller, and zero lost family discounts. They did not want a chatbot pretending to be a voice agent, or a voice agent pretending to be a receptionist. They wanted the phone line to actually work.

What follows is the playbook we used. It is opinionated. The stack will look different in 18 months. The order will not.

Two weeks of call shadowing before anything else

Before we touched a single SDK we sat with two receptionists for ten working days and tagged every inbound call. The tagging schema was crude on purpose: new intake, return visit booking, reschedule, cancel, billing question, "is therapist X in today", and a catch-all "other". We also tagged the spoken language and whether the caller switched mid-call.

The number that mattered: 71% of calls fell into three categories (new intake, reschedule, cancel). The other 29% was a long tail that no voice agent should touch on day one. We wrote that line on the whiteboard and left it there for the rest of the project.
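A tally like that is small enough to sketch. The version below is a minimal illustration, assuming tag identifiers that mirror the schema above; the names are hypothetical and not the labels from the actual shadowing sheet.

```typescript
// Hypothetical tag names mirroring the shadowing schema described above.
const CALL_TYPES = [
  'new_intake', 'return_visit', 'reschedule', 'cancel',
  'billing', 'therapist_availability', 'other',
] as const
type CallType = typeof CALL_TYPES[number]

type TaggedCall = {
  type: CallType
  language: 'nl' | 'ar'
  switchedMidCall: boolean
}

// Count calls per type; every type starts at zero so the result is complete.
function countByType(calls: TaggedCall[]): Record<CallType, number> {
  const counts = {} as Record<CallType, number>
  for (const type of CALL_TYPES) counts[type] = 0
  for (const call of calls) counts[call.type] += 1
  return counts
}

// Fraction (0..1) of calls that fall into the types you plan to automate first.
function automatableShare(calls: TaggedCall[], automate: CallType[]): number {
  if (calls.length === 0) return 0
  const counts = countByType(calls)
  let covered = 0
  for (const type of CALL_TYPES) {
    if (automate.indexOf(type) !== -1) covered += counts[type]
  }
  return covered / calls.length
}
```

Calling `automatableShare(calls, ['new_intake', 'reschedule', 'cancel'])` on the tagged log gives the single number worth writing on the whiteboard.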

A second pattern surfaced that no one had quantified before. Roughly one in six callers spoke Dutch but listed Arabic as their preferred written language in the EHR, and about half of those callers code-switched the moment the conversation hit anything medical. That data point shaped the language strategy more than any model benchmark.

Takeaway

Audit the calls before you pick the model. The shape of your inbound volume tells you which 70% to automate and which 30% to ringfence for a human.

Calendar source of truth, picked first, no exceptions

The chain ran on a Dutch physiotherapy EHR with a clinical calendar (one per practitioner) and a separate "block" calendar for rooms. Fourteen clinics, roughly 60 practitioners, four room types. We mapped the booking model on paper before writing a line of code.

The rule we landed on: the EHR is the only writer. The voice agent reads availability through the EHR's appointment API and writes new bookings through the same endpoint. No caching of slots beyond 60 seconds, no parallel Google Calendar, no "sync layer". Sync layers between two calendars are how you double-book a Friday at half past four.

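// Single read path, single write path.
// `ehr`, `DateRange` and `BookingInput` are assumed to be defined in the EHR integration layer.
async function findSlots(clinicId: string, specialty: string, window: DateRange) {
  // Availability always comes straight from the EHR's appointment API.
  return ehr.appointments.searchAvailability({
    clinic: clinicId,
    specialty,
    from: window.start,
    to: window.end,
    durationMin: 25, // appointment length in minutes
  })
}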

async function bookSlot(input: BookingInput) {
  // EHR is the lock holder. If the slot was just taken,
  // the EHR returns 409 and we re-quote to the caller.
  return ehr.appointments.create(input)
}

The 409 path is the one nobody talks about and the one that matters. When two callers ask for the same Wednesday at 14:00 across two clinics, the EHR rejects the second booking. The agent has to apologise, re-read availability, and offer the next nearest slot in the caller's language, without sounding like it is reading from a script. We tested this exact path more than any other.
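The re-quote path can be sketched as follows. This is an illustration under stated assumptions, not the EHR vendor's SDK: the `EhrClient` interface, the `Slot` shape, and the convention that a conflict surfaces as a thrown object with `status: 409` are all hypothetical.

```typescript
type Slot = { clinicId: string; practitionerId: string; start: string } // ISO 8601 start time

type BookingResult =
  | { status: 'booked'; slot: Slot }
  | { status: 'requote'; alternatives: Slot[] }
  | { status: 'handoff'; reason: 'ehr_error' | 'no_alternatives' }

// Hypothetical client surface: one write call, one read call.
interface EhrClient {
  create(slot: Slot): Promise<void>
  searchAvailability(clinicId: string, from: string): Promise<Slot[]>
}

// Assumption: the client throws an object carrying the HTTP status on failure.
function isConflict(err: unknown): boolean {
  return typeof err === 'object' && err !== null &&
    (err as { status?: unknown }).status === 409
}

async function bookWithRequote(
  ehr: EhrClient,
  wanted: Slot,
  maxAlternatives = 3,
): Promise<BookingResult> {
  try {
    await ehr.create(wanted)
    return { status: 'booked', slot: wanted }
  } catch (err) {
    // Anything other than a slot conflict goes to a human.
    if (!isConflict(err)) return { status: 'handoff', reason: 'ehr_error' }
  }
  // Slot was just taken: re-read live availability instead of reusing the earlier quote.
  const fresh = await ehr.searchAvailability(wanted.clinicId, wanted.start)
  const wantedMs = Date.parse(wanted.start)
  const alternatives = fresh
    .slice()
    .sort((a, b) =>
      Math.abs(Date.parse(a.start) - wantedMs) - Math.abs(Date.parse(b.start) - wantedMs))
    .slice(0, maxAlternatives)
  return alternatives.length > 0
    ? { status: 'requote', alternatives }
    : { status: 'handoff', reason: 'no_alternatives' }
}
```

Returning a typed result rather than a sentence keeps the apology and the phrasing in the conversation layer, where the caller's language is known.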

The voice stack we settled on

We tested three pipelines before picking one. The shortlist:

Speech-to-text in Dutch and Arabic with code-switching tolerance. Deepgram's Dutch model held up on telephony audio; for Arabic we fell back to a self-hosted Whisper-large endpoint because Modern Standard Arabic plus Levantine dialect over 8 kHz telephone lines was too uneven on the off-the-shelf options.
An LLM in the middle, prompted as a booking agent with a narrow tool list (search availability, propose slot, confirm booking, hand off to human). The model is replaceable. The tool surface is not.
Neural TTS that does not sound like a 2019 IVR. We landed on ElevenLabs multilingual for both languages with a single voice ID per language and the same persona name (Jasmijn) across both. The caller hears one identity, not two.
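One way to keep the tool surface stable while the model changes is to pin it down as a type. A minimal sketch: only the four capabilities come from the list above, while the tool names and argument shapes are hypothetical.

```typescript
// Hypothetical names and argument shapes for the four tools.
type ToolCall =
  | { name: 'search_availability'; args: { clinicId: string; specialty: string; from: string; to: string } }
  | { name: 'propose_slot'; args: { slotId: string } }
  | { name: 'confirm_booking'; args: { slotId: string; patientId: string } }
  | { name: 'hand_off_to_human'; args: { reason: string } }

type ToolName = ToolCall['name']

const ALLOWED_TOOLS: readonly ToolName[] = [
  'search_availability',
  'propose_slot',
  'confirm_booking',
  'hand_off_to_human',
]

// Type guard: is the name the model produced one of the four tools?
function isAllowedTool(name: string): name is ToolName {
  return (ALLOWED_TOOLS as readonly string[]).indexOf(name) !== -1
}

// Anything outside the list is treated as a request for a human, never as a new capability.
function resolveToolName(requested: string): ToolName {
  return isAllowedTool(requested) ? requested : 'hand_off_to_human'
}
```

Swapping the LLM then means re-pointing the model at the same four names; nothing downstream of `resolveToolName` has to change.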

The orchestration runs on LiveKit Agents, sitting between the SIP trunk (KPN business line, routed via a Twilio Programmable Voice number for the international leg) and our tool layer. We chose LiveKit over rolling our own because barge-in handling and turn detection on a multilingual line are the kind of work where you do not want to be the first person to discover the edge case.

Language detection without the awkward pause

The first prototype answered in Dutch, listened for two seconds, then switched to Arabic if it heard Arabic phonemes. Callers hated it. The pause sounded like a broken line and the elderly patients hung up.

The fix was small and worth its own paragraph. We answer with a bilingual opener: "Praktijk Utrecht, met Jasmijn. Goedemorgen. اهلا و سهلا." The Arabic greeting is short, recognisable, and gives any Arabic-first caller permission to reply in Arabic. The STT layer runs both language models in parallel for the first utterance and picks whichever returns higher confidence. From turn two onward we lock to one language unless the caller switches mid-sentence, which happens more than you'd expect in Utrecht.
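The first-utterance pick reduces to a comparison. The sketch below assumes each STT engine returns a transcript with a confidence score between 0 and 1; the `Transcript` shape is hypothetical, and scores from two different engines are not guaranteed to be comparable without calibration.

```typescript
type Lang = 'nl' | 'ar'

// Hypothetical result shape; real STT responses carry more fields.
type Transcript = { lang: Lang; text: string; confidence: number }

// Pick the language of the most confident non-empty transcript.
// Falls back to Dutch (an assumption) when no engine heard anything.
function pickFirstTurnLanguage(candidates: Transcript[], fallback: Lang = 'nl'): Lang {
  let best: Transcript | null = null
  for (const candidate of candidates) {
    if (candidate.text.trim() === '') continue // ignore silence or noise-only results
    if (best === null || candidate.confidence > best.confidence) best = candidate
  }
  return best === null ? fallback : best.lang
}
```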

The lock-and-release logic is the part that took the longest to tune. A patient describing a hamstring injury will drop the Dutch word hamstring into an Arabic sentence. The agent must not interpret that single token as a language switch. We added a heuristic: at least three consecutive content tokens in the new language, or one full sentence boundary, before we flip the TTS voice.
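A sketch of that heuristic, under two assumptions: the STT layer tags each token with a language and a content/function-word flag (the `Token` shape is hypothetical), and "one full sentence boundary" is read as a completed sentence whose content tokens are all in the other language.

```typescript
type Lang = 'nl' | 'ar'

// Hypothetical per-token annotation from the STT layer.
type Token = { text: string; lang: Lang; isContent: boolean; endsSentence: boolean }

function otherLang(lang: Lang): Lang {
  return lang === 'nl' ? 'ar' : 'nl'
}

// Returns the language to lock to after this utterance.
function nextLockedLanguage(locked: Lang, tokens: Token[], minRun = 3): Lang {
  let run = 0                 // consecutive content tokens in the other language
  let sentenceAllOther = true // no locked-language content token in the current sentence
  let sentenceHasContent = false
  for (const token of tokens) {
    if (token.isContent) {
      sentenceHasContent = true
      if (token.lang !== locked) {
        run += 1
        if (run >= minRun) return otherLang(locked)
      } else {
        run = 0
        sentenceAllOther = false
      }
    }
    if (token.endsSentence) {
      // A whole sentence in the other language also counts as a switch.
      if (sentenceHasContent && sentenceAllOther) return otherLang(locked)
      sentenceAllOther = true
      sentenceHasContent = false
    }
  }
  return locked
}
```

A single borrowed word such as hamstring produces a run of one and never completes a sentence on its own, so the lock holds.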

The family-discount logic, written as code, not as a prompt

This was the part the chain was most nervous about. Their discount rules were not written down anywhere in the EHR. Two parents and one child get a 12% discount on a shared monthly invoice if they all book at the same clinic. Add a second child and the discount becomes 15%, but only if at least one parent's appointment is on the same day. Foster placements count. Grandparents living at the same address count. A patient who moved out last year does not, but their last six bookings still appear in the family record.

We did not put any of this in the prompt. The prompt is for tone and turn-taking. Discount eligibility is a deterministic function the agent calls before confirming any booking:

type FamilyContext = {
  householdId: string
  members: Array<{ patientId: string; role: 'parent' | 'child' | 'other' }>
  sharedClinicId: string | null
  sameDayParentBooking: boolean
}

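function familyDiscountTier(ctx: FamilyContext): 0 | 12 | 15 {
  // No shared clinic, no discount.
  if (!ctx.sharedClinicId) return 0
  const children = ctx.members.filter(m => m.role === 'child').length
  // The discount needs at least one child in the household.
  if (children < 1) return 0
  // One child: 12%.
  if (children === 1) return 12
  // Two or more children: 15% only with a same-day parent booking, otherwise 12%.
  return ctx.sameDayParentBooking ? 15 : 12
}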

The agent reads household state from the EHR, computes the tier, and tells the caller what they will be charged before the booking is confirmed. If the computation is uncertain (a foster placement whose start date is in the future, a household record that was merged last week and not yet reconciled), the agent hands off. We logged 41 such handoffs in the first month. The receptionists were grateful for every single one.
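The "uncertain means handoff" rule can sit in front of the tier function as a guard. The sketch below repeats `FamilyContext` and `familyDiscountTier` from above so it stands alone; the `HouseholdFlags` fields are hypothetical, and a real EHR will expose record quality differently.

```typescript
type FamilyContext = {
  householdId: string
  members: Array<{ patientId: string; role: 'parent' | 'child' | 'other' }>
  sharedClinicId: string | null
  sameDayParentBooking: boolean
}

function familyDiscountTier(ctx: FamilyContext): 0 | 12 | 15 {
  if (!ctx.sharedClinicId) return 0
  const children = ctx.members.filter(m => m.role === 'child').length
  if (children < 1) return 0
  if (children === 1) return 12
  return ctx.sameDayParentBooking ? 15 : 12
}

// Hypothetical record-quality flags, all ISO 8601 dates.
type HouseholdFlags = {
  fosterPlacementStart?: string
  mergedAt?: string
  reconciledAt?: string
}

type DiscountDecision =
  | { kind: 'quote'; percent: 0 | 12 | 15 }
  | { kind: 'handoff'; reason: 'foster_placement_pending' | 'merge_unreconciled' }

function decideDiscount(ctx: FamilyContext, flags: HouseholdFlags, now: Date): DiscountDecision {
  // A foster placement that has not started yet: do not guess, hand off.
  if (flags.fosterPlacementStart !== undefined &&
      Date.parse(flags.fosterPlacementStart) > now.getTime()) {
    return { kind: 'handoff', reason: 'foster_placement_pending' }
  }
  // A merged household that has not been reconciled since the merge: hand off.
  if (flags.mergedAt !== undefined) {
    const reconciled = flags.reconciledAt !== undefined &&
      Date.parse(flags.reconciledAt) >= Date.parse(flags.mergedAt)
    if (!reconciled) return { kind: 'handoff', reason: 'merge_unreconciled' }
  }
  return { kind: 'quote', percent: familyDiscountTier(ctx) }
}
```

Passing `now` in as an argument keeps the function deterministic, which is what makes it testable and reviewable by finance.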

Warning

Do not encode pricing rules in the LLM prompt. Prompts drift; pricing must not. Keep the model on tone and turn-taking and put every cent in deterministic code your finance team can read.

Handoff to a human, designed before launch

Every voice agent project has the same failure mode: the agent will not hand off when it should. We pre-wired three handoff triggers:

The caller says any variant of "I want to speak to someone", in either language, including phrases we collected from the call-shadowing logs ("mag ik iemand spreken", "اريد ان اتحدث مع شخص", and the dozen polite Dutch hedges that mean the same thing). The trigger is a string match, not a model call. Latency matters.
The discount or insurance logic returns uncertain. This was the foster-placement case and a handful of out-of-pocket combinations we missed in the first audit.
The model itself raises an internal need_human tool call. We prompted it to err on the side of escalation in the first month and tightened the threshold as the call logs accumulated.
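The first trigger, the string match, might look like the sketch below. The phrase list holds only the two phrases quoted above; a production list would come from the shadowing logs. The normalisation steps (folding Arabic alef variants, stripping diacritics and punctuation) are an assumption about what the STT output needs, not a description of the deployed matcher.

```typescript
// Only the two phrases quoted in the text; extend from your own call logs.
const HANDOFF_PHRASES = [
  'mag ik iemand spreken',
  'اريد ان اتحدث مع شخص',
]

// Lowercase, strip Arabic diacritics, fold alef variants, drop punctuation, collapse spaces.
function normalise(text: string): string {
  return text
    .toLowerCase()
    .replace(/[\u064B-\u0652]/g, '')            // Arabic diacritics (tashkeel)
    .replace(/[\u0622\u0623\u0625]/g, '\u0627') // alef with madda/hamza -> bare alef
    .replace(/[.,!?;:\u061F\u060C]/g, ' ')      // Latin and Arabic punctuation
    .replace(/\s+/g, ' ')
    .trim()
}

// Plain substring match: no model call, so the check adds no network latency.
function wantsHuman(utterance: string): boolean {
  const normalised = normalise(utterance)
  return HANDOFF_PHRASES.some(phrase => normalised.indexOf(normalise(phrase)) !== -1)
}
```

Folding the alef variants matters because a transcript may spell the same phrase with or without hamza, and a raw string comparison would miss one of them.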

Handoff means a warm transfer to the chain's existing receptionist desk with a one-line summary screen-popped into the EHR: caller name, what they wanted, what the agent already booked or didn't book, why it bailed. The receptionist picks up mid-conversation rather than starting from zero.
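The summary is four fields and one formatter. A minimal sketch, with hypothetical field names and separator; the three reasons map to the three triggers listed above.

```typescript
type HandoffReason = 'caller_asked' | 'pricing_uncertain' | 'model_escalated'

type HandoffSummary = {
  callerName: string
  wanted: string        // what the caller asked for
  booked: string | null // what the agent already booked, or null if nothing
  reason: HandoffReason // why the agent bailed
}

const REASON_TEXT: Record<HandoffReason, string> = {
  caller_asked: 'caller asked for a person',
  pricing_uncertain: 'discount or insurance uncertain',
  model_escalated: 'agent escalated',
}

// One line, fixed field order, so the receptionist can scan it while picking up.
function formatScreenPop(summary: HandoffSummary): string {
  const booked = summary.booked === null ? 'nothing booked' : `booked: ${summary.booked}`
  return [
    summary.callerName,
    `wants: ${summary.wanted}`,
    booked,
    `handoff: ${REASON_TEXT[summary.reason]}`,
  ].join(' | ')
}
```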

Six numbers we watched in the first month

We do not believe in vanity dashboards. The chain's operations lead got a weekly email with six numbers, and only six:

Calls answered by the agent within two rings.
Calls fully resolved without a human (target: 60% by end of month one).
Average handle time on resolved calls (target: under 90 seconds).
Booking conflicts caught by the EHR's 409 response (target: zero that reached the caller as a double-booking).
Discount tier mismatches caught by the weekly reconciliation script (target: zero).
Caller complaints in the EHR's notes field that mention "de robot" or "het systeem" (target: trending down).
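Two of the six numbers are plain arithmetic over the call log. The sketch below assumes a hypothetical `CallRecord` shape and uses the targets stated above (60% resolved, under 90 seconds); it is an illustration, not the reporting script from the project.

```typescript
// Hypothetical per-call record from the agent's logs.
type CallRecord = {
  resolvedWithoutHuman: boolean
  handleTimeSec: number
}

type WeeklyNumbers = {
  resolutionRate: number          // fraction of all calls resolved without a human
  avgHandleTimeSec: number | null // over resolved calls only; null if there were none
  meetsResolutionTarget: boolean  // target: at least 60%
  meetsHandleTimeTarget: boolean  // target: under 90 seconds
}

function weeklyNumbers(calls: CallRecord[]): WeeklyNumbers {
  const resolved = calls.filter(c => c.resolvedWithoutHuman)
  const resolutionRate = calls.length === 0 ? 0 : resolved.length / calls.length
  let totalSec = 0
  for (const call of resolved) totalSec += call.handleTimeSec
  const avgHandleTimeSec = resolved.length === 0 ? null : totalSec / resolved.length
  return {
    resolutionRate,
    avgHandleTimeSec,
    meetsResolutionTarget: resolutionRate >= 0.6,
    meetsHandleTimeTarget: avgHandleTimeSec !== null && avgHandleTimeSec < 90,
  }
}
```

Averaging handle time over resolved calls only matches the metric as listed; calls that end in a handoff would otherwise drag the number in ways that say nothing about the agent's speed.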

By week four the agent resolved 64% of calls without a human, average handle time landed at 78 seconds, and the receptionists were doing the work that used to fall off their plates after 11am: chasing no-shows, processing insurance forms, walking patients through their first appointment. The chain has not added headcount to the reception team since.

What we'd skip if we started today

Two things, both worth saying out loud. First: do not try to do outbound reminders from the same agent on day one. Outbound voice has different consent rules under the Dutch Telecommunicatiewet, a different success metric, and a different model of "did the call go well". Ship inbound, watch it for a month, then decide.

Second: the model you pick today will be replaced inside 12 months. That is fine. Voice agents that book real appointments against real calendars are doing real work today, on models that already exist. The architecture (one calendar of record, deterministic pricing, fast handoff) is what carries.

When we built this for the Utrecht chain, the thing we kept underestimating was how much of the work was operational, not technical: who owns the receptionist's screen-pop UX, who reviews the weekly logs, who signs off on the prompt diff. We've started bundling that work into our voice agent engagements from week one rather than week six.

If you want to start tomorrow, do the call audit. Two receptionists, ten days, one shared spreadsheet, three columns: call type, language, what would have gone wrong if a robot had picked up. That sheet is the brief.

Key takeaway

Audit your calls before you pick a model. The shape of your inbound volume tells you which 70% to automate and which 30% to keep firmly human.

FAQ

Why not use one multilingual model for both Dutch and Arabic STT?

Off-the-shelf multilingual STT was uneven on telephony-quality Arabic with Levantine dialect. Running a Dutch model and a self-hosted Whisper-large for Arabic in parallel for turn one gave us cleaner confidence scores.

What happens if the EHR API is down during a call?

The agent does not fall back to a cached calendar. It apologises, offers to take a callback number, and transfers to the receptionist desk. Booking against stale availability is worse than not booking at all.
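In code, that rule is a wrapper with no cache branch. A minimal sketch, assuming the availability lookup is passed in as a function that rejects when the EHR is unreachable; the `NextStep` names are hypothetical.

```typescript
type NextStep =
  | { kind: 'offer_slots'; slots: string[] }
  | { kind: 'callback_and_transfer'; reason: 'ehr_unavailable' }

// Try the live lookup once. On any failure, go straight to callback plus transfer.
// There is deliberately no fallback to previously fetched availability.
async function availabilityOrHandoff(search: () => Promise<string[]>): Promise<NextStep> {
  try {
    return { kind: 'offer_slots', slots: await search() }
  } catch {
    return { kind: 'callback_and_transfer', reason: 'ehr_unavailable' }
  }
}
```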

How long did the rollout take from kickoff to live calls?

Ten working days of call shadowing, four weeks of build and integration with the EHR, two weeks of supervised pilot at one clinic, then a staged cutover across the remaining 13 clinics over three weeks.

Did you replace the receptionists?

No. The reception team is the same size. They stopped answering routine bookings and started doing the work that was getting dropped: no-show chases, insurance forms, in-person triage. Headcount has not grown since.

voice agents, ai agents, automation, integrations, workflow, case study
