Voice agents and code-switching: a Utrecht vet case study
A Utrecht vet chain replaced its overnight answering service with a voice agent in three languages. Here is how we kept it from switching languages mid-call, and what it cost per call.

It is 23:14 on a Saturday. A man in Overvecht is holding a cat that swallowed a length of sewing thread. He calls his usual vet. The line picks up on the second ring, in Dutch, calm, asking him to describe what he sees. Two questions later it tells him to drive to the emergency clinic in Nieuwegein, sends the address to his phone, and pings the vet on call so she knows what is coming. The whole exchange takes ninety-two seconds.
The voice on the other end is not a person. It is a voice agent we shipped four months ago for a chain of seven veterinary practices around Utrecht. Their after-hours line used to roll to a paid answering service that would, on a good night, get the right message to the right vet within twenty minutes. On a bad night, the message would land at 06:00 with the words "small dog, breathing funny" and no callback number.
This post is about what we built, what broke first, and why "do not switch languages mid-call" turned out to be the hardest single requirement.
Voice, not chat
We pushed back on voice at the first meeting. Voice agents are more expensive per minute than a chat widget, the latency is harder to hide, and the failure modes are louder. A chat bot that misreads "kat" as "kit" produces an awkward sentence. A voice agent that does the same thing reads the awkward sentence out loud, in a calm robotic voice, while someone's dog is having a seizure.
The clinic owner walked us through the call log. Forty-one percent of after-hours calls came from people who could not type comfortably: elderly owners, people with both hands on an animal, parents holding a child. Another twenty percent came from drivers. The remaining customers tried the website chat and then called anyway, because nobody trusts a chat bot at 02:00 with a sick pet.
So voice it was.
Three languages, one switchboard
Utrecht has a Dutch-speaking majority and a sizeable English-speaking population (a lot of university staff, a lot of knowledge-migrant visa holders), and the practice in Kanaleneiland has a regular Arabic-speaking customer base. Roughly: 78% Dutch, 14% English, 8% Arabic, with a long tail of everything else.
Our first build asked the caller to pick a language. "For Dutch, press 1. For English, press 2. For Arabic, press 3." It tested fine. It shipped. We pulled it out within ten days.
The problem was not the menu. The problem was that people in distress do not listen to menus. They start talking before the prompt finishes, and they talk in the language they think in. A Dutch speaker would press 1 and then say "ja hallo, mijn hond". An English speaker would press nothing and start talking. An Arabic speaker would press 3 and then immediately switch to Dutch, because that is the language she usually speaks with the receptionist.
We rebuilt around language detection from the first utterance.
The detector
Language identification on a single short utterance is a known problem. Whisper does it as a side effect of transcription, the Deepgram language detection endpoint returns a code and a confidence per request, and there are smaller dedicated models if you want to run one yourself.
We landed on a two-stage pipeline. The first 1.2 seconds of audio go to a fast classifier that returns a language code and a confidence score. If confidence is 0.85 or higher, we lock that language for the rest of the call. If it is lower, we play a short clarification prompt in Dutch and English ("hallo, you can answer in Dutch, English, or Arabic"), and re-classify on the response.
The locking part matters, and that is the next section.
# Simplified language-lock state
class CallState:
    def __init__(self):
        self.language = None  # 'nl' | 'en' | 'ar'
        self.locked = False
        self.lock_confidence = 0.0

    def consider(self, detected_lang: str, confidence: float):
        if self.locked:
            return self.language
        if confidence >= 0.85:
            self.language = detected_lang
            self.lock_confidence = confidence
            self.locked = True
        return self.language
That is the whole language-lock. A dozen or so lines. We spent two weeks trying something cleverer, and every version of cleverer made things worse.
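The lock only covers what happens once the classifier is confident. The step before it, where a low-confidence first utterance triggers the clarification prompt and a second classification, could be sketched roughly like this. It is a minimal illustration under stated assumptions: `detect_and_lock`, `classify`, `ask_again`, and `max_clarifications` are hypothetical names, and returning `None` when the classifier never reaches 0.85 is an assumption, since that path needs its own policy (for example, handing off to a person).

```python
from typing import Callable, Optional, Tuple

LOCK_THRESHOLD = 0.85  # threshold quoted in the post

Detection = Tuple[str, float]  # (language code, confidence)


def detect_and_lock(
    classify: Callable[[bytes], Detection],
    first_audio: bytes,
    ask_again: Callable[[], bytes],
    max_clarifications: int = 1,  # assumption: one clarification round
) -> Optional[str]:
    """Return the language to lock, or None if confidence never gets there."""
    audio = first_audio
    for attempt in range(max_clarifications + 1):
        lang, confidence = classify(audio)
        if confidence >= LOCK_THRESHOLD:
            return lang
        if attempt < max_clarifications:
            # Play the clarification prompt and record the caller's reply.
            audio = ask_again()
    return None


# Usage with a fake classifier (made-up audio and scores, for illustration only).
def fake_classify(audio: bytes) -> Detection:
    scores = {b"mm": ("de", 0.40), b"ja hallo mijn hond": ("nl", 0.97)}
    return scores[audio]


locked = detect_and_lock(fake_classify, b"mm", lambda: b"ja hallo mijn hond")
```

Keeping the classifier and the prompt playback as injected callables means the flow can be exercised without any audio hardware or network calls.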
The case for locking
Multilingual voice models are eager to please. If a caller starts in Dutch and slips one English word into a sentence ("mijn dog is ziek"), the underlying STT will sometimes flip its language guess for the next turn, and the TTS will respond in English. The caller, who is Dutch, hears an English sentence, and now we have a problem. She will either switch to English to match the agent (which is what bilingual people do automatically) or she will get confused and hang up. Either way, we have failed.
The default behaviour of every off-the-shelf voice stack we tested was to re-detect language on every turn. That sounds like a feature. In a triage call it is a bug.
Once we lock, the only way to switch is for the caller to ask, explicitly, in any language. We added a tiny intent classifier on every turn that listens for switch-language requests ("kunt u Engels spreken", "can you speak English", and the Arabic equivalent). If it fires with high confidence, we re-prompt, re-detect, and re-lock.
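To make the interaction between the lock and the switch-intent classifier concrete, here is a minimal sketch. It assumes each turn yields a detected language with a confidence plus a separate switch-request score. `LanguageLock`, `on_turn`, and the 0.9 switch threshold are hypothetical; the post only says "high confidence".

```python
from dataclasses import dataclass
from typing import Optional

LOCK_THRESHOLD = 0.85    # from the post
SWITCH_THRESHOLD = 0.90  # assumption: the post only says "high confidence"


@dataclass
class LanguageLock:
    language: Optional[str] = None
    locked: bool = False

    def on_turn(
        self, detected_lang: str, lang_conf: float, switch_conf: float
    ) -> Optional[str]:
        """Return the language to respond in, or None if a re-prompt is due."""
        if self.locked and switch_conf >= SWITCH_THRESHOLD:
            # Explicit request to switch: unlock, then re-prompt and re-detect.
            self.locked = False
            self.language = None
            return None
        if not self.locked and lang_conf >= LOCK_THRESHOLD:
            self.language = detected_lang
            self.locked = True
        # While locked, per-turn detections are ignored.
        return self.language


lock = LanguageLock()
first = lock.on_turn("nl", 0.97, 0.0)     # locks Dutch
second = lock.on_turn("en", 0.95, 0.1)    # "mijn dog is ziek": stays Dutch
third = lock.on_turn("nl", 0.90, 0.96)    # "kunt u Engels spreken": unlock
fourth = lock.on_turn("en", 0.93, 0.0)    # reply to the re-prompt: locks English
```

The point of the design is that a stray detection can never move the lock; only the switch-intent signal can.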
We use one TTS voice per language, all from ElevenLabs, picked to sound like the same calm person across three accents. Same name in the introduction. Same closing line. That cosmetic consistency is what stops callers from feeling like the system is throwing them between three different bots.
The system prompt, mostly
The agent runs on a system prompt that is mostly the triage decision tree, plus a short set of behavioural rules. We keep the prompt under 1,500 tokens. We do not put the full red-flag taxonomy into the prompt; we put the structured decision tree in, and we put one short example dialogue per language at the top.
The most useful single line in the prompt is: "If you do not know the answer, transfer to a person. Do not guess."
Three things stayed out of the prompt deliberately. The vets' phone numbers: the agent does not know who is on call, it calls a webhook that decides. The clinic addresses: same reason, because addresses change and system prompts do not get updated at 03:00 on a Sunday. And the client database: the agent asks for the owner's name and phone, then looks them up via a tool call to the practice-management system.
Separating what the agent knows from what the agent looks up is the part most internal voice-agent attempts get wrong. If the prompt holds the answer, the prompt has to be redeployed every time the answer changes. If the prompt holds the question and a tool returns the answer, the agent stays small.
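A small sketch of that separation, under loud assumptions: the rota dictionary stands in for the on-call webhook, and `get_on_call_vet`, `TOOLS`, `build_system_prompt`, `call_tool`, and the vet IDs are all hypothetical names invented for illustration. The only text taken from the real prompt is the "do not guess" rule.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the data behind the on-call webhook. In a real
# system this lives outside the agent and can change at any time.
ON_CALL_ROTA: Dict[str, str] = {"kanaleneiland": "vet-017"}


def get_on_call_vet(practice: str) -> dict:
    """Tool: look up who is on call right now (hypothetical response shape)."""
    return {"practice": practice, "vet_id": ON_CALL_ROTA[practice]}


# The agent knows tool names; the answers live behind the tools.
TOOLS: Dict[str, Callable[..., dict]] = {"get_on_call_vet": get_on_call_vet}


def build_system_prompt() -> str:
    """The prompt names the tools but holds none of the data behind them."""
    tool_names = ", ".join(sorted(TOOLS))
    return (
        "If you do not know the answer, transfer to a person. Do not guess. "
        f"Tools you may call: {tool_names}."
    )


def call_tool(name: str, **kwargs) -> dict:
    """Dispatch a tool call requested by the model."""
    return TOOLS[name](**kwargs)


prompt_before = build_system_prompt()
ON_CALL_ROTA["kanaleneiland"] = "vet-023"  # the rota changes overnight
prompt_after = build_system_prompt()
current = call_tool("get_on_call_vet", practice="kanaleneiland")
```

The prompt is identical before and after the rota change, while the tool call returns the new vet, which is the property the paragraph above describes.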
Telephony, briefly
The phone side runs on Twilio. The clinic's existing numbers were ported in, the after-hours rules live in a Twilio Studio flow, and the websocket bridge between Twilio Media Streams and our agent loop is a small Node service. Nothing exotic. Media Streams gives you raw audio in both directions, and that is all you actually need.
End-to-end latency from end-of-speech to start-of-response sits at 850 to 1100 ms most of the time. We treat 1.2 s as the threshold below which a caller does not consciously feel they are talking to a machine. Anything above 1.5 s starts to feel like a bad VoIP line.
Triage logic
The vet group gave us a printed red-flag list they had been using for years: signs that mean "send to ER now" versus "book first thing tomorrow" versus "this can wait until Monday". We typed it up, structured it, and turned it into the agent's decision tree.
The agent has three possible actions per call. It can send the caller to the 24-hour emergency clinic in Nieuwegein, with the address sent by SMS during the call. It can page the vet on call at the relevant practice, collecting a callback number and a one-line summary so the vet's phone rings with that summary already on screen. Or it can book a normal appointment for the next morning, into the clinic's existing practice-management system.
{
  "call_id": "01J7K2...",
  "language": "nl",
  "language_confidence": 0.97,
  "duration_s": 92,
  "outcome": "send_to_er",
  "summary_for_vet": "Kat, ~4 jaar, draad ingeslikt ca. 30 cm. Geen braken. Eigenaar op weg.",
  "summary_for_owner_sms": "Spoedkliniek Nieuwegein, Pijlsweerd 12. Bel vooruit: 030-...",
  "red_flags": ["foreign_body_string", "cat"],
  "handoff_at_ms": 88300
}
That JSON is what we hand to the vet's phone. No audio replay, no twenty-minute lag, just the structured turn-by-turn record and the one-line summary. The vets stopped asking us to add features to this part of the system after about week three.
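One way to tie the `outcome` field back to the three actions is a small dispatcher like the sketch below. Only `send_to_er` appears in the record above; the other two outcome values, the `plan_actions` helper, and the step names are hypothetical and chosen for illustration.

```python
from enum import Enum
from typing import List


class Outcome(str, Enum):
    SEND_TO_ER = "send_to_er"              # value taken from the record above
    PAGE_ON_CALL = "page_on_call"          # hypothetical value
    BOOK_APPOINTMENT = "book_appointment"  # hypothetical value


def plan_actions(record: dict) -> List[str]:
    """Map a call record to follow-up steps (step names are illustrative)."""
    outcome = Outcome(record["outcome"])  # raises ValueError if unknown
    if outcome is Outcome.SEND_TO_ER:
        return ["sms_er_address_to_owner", "notify_on_call_vet"]
    if outcome is Outcome.PAGE_ON_CALL:
        return ["page_on_call_vet_with_summary"]
    return ["book_next_morning_appointment"]


record = {"outcome": "send_to_er", "language": "nl", "duration_s": 92}
steps = plan_actions(record)
```

Parsing the outcome through an enum means an unexpected value fails loudly instead of silently falling into the booking path.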
What we measured
The first thirty days after go-live, the numbers we cared about:
- 412 after-hours calls handled.
- Median call length: 1 min 38 sec.
- Calls correctly routed (per next-day vet review): 96.1%.
- Calls where the caller asked for a person and we handed off: 4.4%.
- Calls where the agent code-switched against caller intent: 0. This was the metric the clinic owner cared about most.
- Cost per handled call: 0.27 EUR including telephony, STT, LLM, and TTS. The previous answering service charged 1.85 EUR per call with a 12 EUR monthly minimum per practice.
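For a sense of scale, the per-call prices can be multiplied out over the 412 calls of the first thirty days. This is plain arithmetic on the figures above, done in integer cents to avoid rounding noise. It deliberately ignores the 12 EUR monthly minimum, because the post does not say how calls were spread across the seven practices, and `total_cents` is just an illustrative helper.

```python
CALLS = 412                    # after-hours calls in the first thirty days
AGENT_CENTS_PER_CALL = 27      # 0.27 EUR
SERVICE_CENTS_PER_CALL = 185   # 1.85 EUR


def total_cents(calls: int, cents_per_call: int) -> int:
    """Total cost in cents for a number of calls at a flat per-call price."""
    return calls * cents_per_call


agent_total = total_cents(CALLS, AGENT_CENTS_PER_CALL)
service_total = total_cents(CALLS, SERVICE_CENTS_PER_CALL)
saving = service_total - agent_total

print(f"agent:   {agent_total / 100:.2f} EUR")    # agent:   111.24 EUR
print(f"service: {service_total / 100:.2f} EUR")  # service: 762.20 EUR
print(f"saving:  {saving / 100:.2f} EUR")         # saving:  650.96 EUR
```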
The clinic owner's main feedback after the first month was about a different problem entirely: the vets on call now sleep through fewer false alarms, because the agent triages out the "is this normal" calls before they get paged. One vet told us July was the first month in three years in which she had not been woken up by a non-emergency.
Three things that broke
The Arabic TTS would mispronounce the practice names. Dutch place names are hard for any TTS that has not been trained on them. We fixed this with an SSML phoneme override for every practice name and street. ElevenLabs accepts IPA inside SSML <phoneme> tags on the models that support them. That took an afternoon.
The agent would, once a week or so, say goodbye in the middle of a turn. It turned out we were truncating the LLM response on a punctuation mark that appeared inside an abbreviation ("dr."). We changed the streaming-stop heuristic to require a punctuation mark followed by 600 ms of model silence.
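The revised stop rule can be sketched as a single predicate. This is an illustration of the idea rather than the actual streaming code: `should_stop`, its parameters, and the set of sentence-ending marks are assumptions, and only the 600 ms figure comes from the post.

```python
SENTENCE_END = (".", "!", "?", "؟")  # assumed set; includes the Arabic question mark
SILENCE_MS = 600                     # from the post


def should_stop(text_so_far: str, ms_since_last_token: float) -> bool:
    """True only if the text ends in punctuation AND the model has gone quiet."""
    ends_in_punctuation = text_so_far.rstrip().endswith(SENTENCE_END)
    return ends_in_punctuation and ms_since_last_token >= SILENCE_MS


mid_abbreviation = should_stop("Ga naar dr.", 40)        # more tokens are coming
end_of_response = should_stop("Tot ziens.", 650)         # punctuation plus silence
still_talking = should_stop("Ga naar dr. Jansen", 700)   # no closing punctuation
```

The old heuristic corresponds to checking punctuation alone, which is why "dr." in the middle of a sentence could end the turn early.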
A handful of Dutch callers got language-detected as German on the first utterance, because they answered "ja" and nothing else. We hard-coded "ja", "nee", "hallo", "alstublieft" and a dozen other one-word responses as Dutch overrides before the classifier even runs.
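A sketch of that override, assuming a text transcript of the first utterance is available before classification. The four words are the ones listed above; the rest of the real list is not reproduced here. `detect_language` and the fake classifier are hypothetical. Treating these words as Dutch is a prior about this caller base, not a linguistic fact, since "ja" and "hallo" are also German.

```python
import string
from typing import Callable, Tuple

Detection = Tuple[str, float]  # (language code, confidence)

# Only the words named in the post; the production list is longer.
DUTCH_ONE_WORDERS = frozenset({"ja", "nee", "hallo", "alstublieft"})


def detect_language(
    transcript: str, classify: Callable[[str], Detection]
) -> Detection:
    """Apply the one-word Dutch override, else defer to the classifier."""
    normalized = transcript.strip().lower().strip(string.punctuation + " ")
    if normalized in DUTCH_ONE_WORDERS:
        return ("nl", 1.0)  # treated as certain, so the lock engages
    return classify(transcript)


# Fake classifier that always guesses German, to show the override winning.
def always_german(transcript: str) -> Detection:
    return ("de", 0.60)


one_word = detect_language("Ja.", always_german)
longer = detect_language("guten Abend", always_german)
```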
The handoff
The single best decision we made was to keep the human-handoff path obvious and short. Any caller who says, in any language, anything that maps to "I want to talk to a person" gets transferred. We do not try to handle complaints. We do not try to handle billing disputes. We do not let the agent argue with anyone.
A voice agent's job is to do the boring 90% well and to get out of the way for the other 10%. Build the handoff before you build the cleverness.
What we would do differently
We would build the language-lock before the first deploy, not after. We would also start with one language and add the other two in week three, instead of trying to ship all three at once. Arabic accent coverage took longer than we expected, and we should have benchmarked our STT on Levantine, Egyptian, and Maghrebi samples before committing.
When we built this voice agent for the Utrecht chain, the part we underestimated was how strongly callers in distress want to feel that they are talking to one person who understands them, not a system that flips between modes. Every design decision after the first month came back to that. If you are thinking about a similar build, we wrote up our general approach to voice and chat agents on the services page, and the short version is: pick one voice, pick one language at a time, and lock both.
One thing to try today
If you run an after-hours phone line and you want to know whether a voice agent could help, do this for a week. Log every call's duration, the time-of-day, and a one-line outcome ("booked", "wrong number", "send to ER", "billing question"). Add it up. The shape of that spreadsheet tells you whether you have a voice-agent problem or a staffing problem. They are different problems, and they need different fixes.
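If the log ends up as a CSV, a few lines of Python are enough to see its shape. The column names (`time`, `duration_s`, `outcome`), the `summarise_calls` helper, and the three sample rows are made up for illustration; adapt them to however you actually record the calls.

```python
import csv
import io
from collections import Counter
from typing import Tuple


def summarise_calls(csv_text: str) -> Tuple[Counter, Counter, float]:
    """Count calls per outcome and per hour, and compute mean duration."""
    outcomes: Counter = Counter()
    by_hour: Counter = Counter()
    total_seconds = 0
    calls = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        outcomes[row["outcome"]] += 1
        by_hour[int(row["time"].split(":")[0])] += 1
        total_seconds += int(row["duration_s"])
        calls += 1
    mean_seconds = total_seconds / calls if calls else 0.0
    return outcomes, by_hour, mean_seconds


# Made-up sample rows, one line per call.
SAMPLE = """time,duration_s,outcome
23:14,92,send to ER
02:05,40,wrong number
02:40,130,booked
"""

outcomes, by_hour, mean_seconds = summarise_calls(SAMPLE)
```

A log dominated by a handful of repeatable outcomes at predictable hours points toward a voice agent; one dominated by complaints and billing questions points toward staffing.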
Key takeaway
Lock the caller's language after one detection. Most voice stacks re-detect on every turn, and in a triage call that is a bug, not a feature.
FAQ
Why not let the caller choose their language with a press-1, press-2 menu?
Callers in distress talk over the menu. We tried it and pulled it after ten days. Detecting language from the first utterance and locking it is what actually worked in production.
What did the voice agent cost per call versus the previous answering service?
0.27 EUR per handled call including telephony, STT, LLM, and TTS, versus 1.85 EUR per call from the previous answering service, with a 12 EUR monthly minimum per practice.
How does the agent handle a caller who genuinely wants to switch languages?
A small intent classifier runs on every turn and listens for explicit switch requests in any of the three languages. If it fires with high confidence, the agent re-prompts, re-detects, and re-locks.
What happens if the agent does not know how to handle a call?
It transfers. The rule in the system prompt is one line: if you do not know the answer, transfer to a person, do not guess. The handoff path is built before the clever parts.