Voice agents for dental groups: the Friday-spike playbook

Friday at 16:45. Three hundred and eighty parents call fourteen dental clinics at once. The receptionists are gone. The voice agent answers every line.

Jacob Molkenboer · Founder · A Brand New Company · 2 Oct 2024 · 9 min

Friday at 16:45. Three hundred and eighty parents dial a single number. Fourteen Dutch dental clinics, one shared appointment line, one frantic 25-minute window before the receptionists go home for the weekend.

The old setup answered the first 40 calls and dumped the rest to a voicemail that stayed full all weekend. By Monday the practice managers were triaging callbacks for bookings the parents had already made somewhere else.

This is the kind of problem a voice agent is genuinely good for. Not "an AI receptionist that does everything", but a narrow agent that handles two predictable jobs: book a visit, and answer questions about Dutch dental prestatiecodes (the NZa-managed codes that decide what insurance pays). Here is the playbook we used.

Capacity is a telephony problem, not a model problem

The first instinct is to think about LLM throughput. Wrong layer. At 380 concurrent calls, the constraint is your SIP trunk and your media workers.

A typical Twilio Elastic SIP trunk gives you per-trunk concurrency in the hundreds, but the carrier-side rate limit on new call setups (calls per second, CPS) is what kills you in a spike. If 380 parents dial within the same 30-second window, inbound setup averages about 13 calls per second (380 / 30 ≈ 12.7), and the peak is higher because arrivals bunch up. Most European CPaaS providers default to 5 to 10 CPS unless you ask. Open a ticket with the carrier and get the limit raised before launch, not during.
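The arithmetic above is easy to keep next to the capacity plan. This is a minimal sketch that assumes calls arrive evenly across the window, which understates the real peak; the function names are ours, not part of any carrier API.

```python
import math

def required_cps(calls: int, window_s: float) -> float:
    """Average call setups per second if `calls` arrive evenly over `window_s`."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return calls / window_s

def shortfall(calls: int, window_s: float, carrier_cps: float) -> int:
    """Setups per second above the carrier limit, rounded up (0 if within limit)."""
    return max(0, math.ceil(required_cps(calls, window_s) - carrier_cps))

print(f"needed: {required_cps(380, 30):.1f} CPS")
for limit in (5, 10, 15):
    print(f"carrier limit {limit} CPS -> shortfall {shortfall(380, 30, limit)} CPS")
```

At the 5 to 10 CPS defaults the Friday spike is over the limit before the first worker picks up; at 15 CPS the average fits, with little headroom for bunching.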

For media routing we use LiveKit Agents on a small Kubernetes cluster. Each worker handles 8 to 12 concurrent calls comfortably. For a 400-call ceiling we keep a warm pool of 40 workers plus a hot-standby pool that scales on call_queue_depth, not CPU. CPU lies in voice workloads because most of the wait is for the LLM and TTS, not local compute.
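Scaling on queue depth instead of CPU comes down to a small sizing function. The sketch below is illustrative only: `desired_workers`, the 10-calls-per-worker midpoint, and the 60-worker ceiling are assumptions for the example, not LiveKit or Kubernetes settings.

```python
import math

def desired_workers(active_calls: int, queue_depth: int,
                    calls_per_worker: int = 10,   # assumed midpoint of 8 to 12
                    min_warm: int = 40,           # warm pool from the text
                    max_workers: int = 60) -> int:  # hypothetical ceiling
    """Size the pool from call demand (active + queued), never from CPU."""
    demand = active_calls + queue_depth
    needed = math.ceil(demand / calls_per_worker)
    # Never drop below the warm pool; never exceed the cluster ceiling.
    return max(min_warm, min(needed, max_workers))

print(desired_workers(active_calls=100, queue_depth=0))   # quiet period
print(desired_workers(active_calls=380, queue_depth=60))  # Friday spike
```

The queued calls count toward demand because a caller waiting in the queue needs a worker just as much as one already talking.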

The 800ms latency budget

A phone call feels broken at about 800ms of dead air. The voice pipeline has more hops than people expect:

Caller speech, then VAD detects end of turn (around 180ms).
ASR partial finalised (around 120ms).
LLM first token (around 240ms).
TTS first audio (around 200ms).
Audio reaches the caller (around 60ms network).

That adds up to ~800ms on a good run. There is no margin. Every component was picked for tail latency, not average:

ASR: Deepgram Nova with streaming partials. Whisper is faster on offline benchmarks and slower at p95 streaming.
LLM: a small open-weights model for routing, escalating to a frontier model only when the router is unsure.
TTS: Cartesia Sonic, because time-to-first-audio is the only TTS metric that matters on a phone.
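The hop budget above can be written down as data and checked against measured tail latencies. A minimal sketch: the per-hop allowances come from the list above, while the `measured` p95 values are made up for illustration.

```python
# Per-hop allowances in milliseconds, taken from the budget in the text.
BUDGET_MS = {
    "vad_end_of_turn": 180,
    "asr_final": 120,
    "llm_first_token": 240,
    "tts_first_audio": 200,
    "network": 60,
}
TOTAL_BUDGET_MS = sum(BUDGET_MS.values())

def overruns(measured_p95_ms: dict[str, float]) -> dict[str, float]:
    """Return the hops whose measured p95 exceeds their allowance, with the excess."""
    return {
        hop: measured_p95_ms[hop] - allowed
        for hop, allowed in BUDGET_MS.items()
        if measured_p95_ms.get(hop, 0) > allowed
    }

# Hypothetical p95 measurements, not real data.
measured = {
    "vad_end_of_turn": 180,
    "asr_final": 150,
    "llm_first_token": 300,
    "tts_first_audio": 200,
    "network": 60,
}
print(TOTAL_BUDGET_MS, sum(measured.values()), overruns(measured))
```

Tracking the excess per hop, and not only the total, shows which component to swap when the call starts to feel broken.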

For Dutch specifically, ASR is the weakest link. Limburg, Brabant and Randstad accents differ enough that a 4% word error rate on a benchmark becomes 11% in the wild. Test on real recordings from the actual clinics, not on Common Voice.
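Word error rate on your own recordings is simple to compute once you have human reference transcripts. The sketch below uses the standard word-level edit distance (substitutions, deletions, insertions divided by reference length); the example sentences are invented, and normalisation is limited to lowercasing.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

# Invented example: one misheard word out of eight.
print(wer("ik wil een afspraak maken voor mijn zoon",
          "ik wil een afspraak maken voor mijn zon"))
```

Running this per clinic, and not only in aggregate, is what surfaces the accent gap described above.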

The prestatiecode knowledge base

Dutch dental insurance pays out per prestatiecode. C11 is the periodic check. C22 is the long-form intake. M01 is hygiene. V codes cover fillings and depend on tooth surface count and material.

Parents do not know these. They ask "does my insurance pay if Mees gets a filling on his back tooth?". The agent has to translate that into V21/V22/V31/V32 territory, check the basisverzekering (basic insurance) rules (zero coverage for adult fillings, full coverage under 18), and answer in plain Dutch.

We built this as a retrieval layer, not a fine-tune. The NZa publishes the mondzorg (dental care) tariff list as a CSV every January. We ingest it, embed each entry as code plus description plus common parent phrasings ("voorkant" for front, "kies achterin" for back molar, "spoedvulling" for emergency filling), and store the vectors in pgvector. A retrieval call sits inside the LLM hop and adds about 80ms.
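A sketch of what "embed by code plus description plus phrasings" could look like. The table name `prestatiecode_chunks`, its columns, and the separator are hypothetical; the only pgvector feature relied on is the `<=>` cosine-distance operator. The description is a placeholder, because the real text comes from the NZa list.

```python
def build_embedding_text(code: str, description: str, phrasings: list[str]) -> str:
    """Join code, official description, and parent phrasings into one string to embed."""
    parts = [code.strip(), description.strip()]
    parts.extend(p.strip() for p in phrasings if p.strip())
    return " | ".join(parts)

# Hypothetical schema. `<=>` is pgvector's cosine distance, so 1 - distance
# gives a similarity score that can later be compared against a threshold.
TOP_K_SQL = """
SELECT code, description, 1 - (embedding <=> %(query)s::vector) AS score
FROM prestatiecode_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""

text = build_embedding_text(
    "V21",
    "<description from the NZa list>",  # placeholder, not the official wording
    ["voorkant", "kies achterin", "spoedvulling"],
)
print(text)
```

Embedding the parent phrasings alongside the official description is what lets "kies achterin" land near the right code even though the tariff list never uses those words.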

Warning

The NZa codes update every January and sometimes mid-year. Your KB is wrong on day one unless you have a refresh job. We poll the NZa portal every Monday at 03:00 and diff against the previous snapshot before the agent ever sees a new version.
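The diff step in that refresh job can be small. This sketch assumes a CSV with `code`, `description`, and `tariff` columns, which is an assumption about the file layout, and uses dummy rows instead of real NZa data.

```python
import csv
import io

def load_snapshot(csv_text: str) -> dict[str, tuple[str, str]]:
    """Parse a tariff CSV into {code: (description, tariff)}. Column names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["code"]: (row["description"], row["tariff"]) for row in reader}

def diff_snapshots(old: dict, new: dict) -> dict[str, list[str]]:
    """List codes that were added, removed, or changed between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }

def needs_review(diff: dict[str, list[str]]) -> bool:
    """Any difference means a human looks at it before the agent does."""
    return any(diff.values())

# Dummy snapshots for illustration only.
previous = load_snapshot("code,description,tariff\nA1,first,10.00\nA2,second,20.00\n")
latest = load_snapshot("code,description,tariff\nA1,first,11.00\nA3,third,30.00\n")
changes = diff_snapshots(previous, latest)
print(changes, needs_review(changes))
```

Holding the new snapshot back until someone has read the diff is the point: a silently changed tariff is exactly the kind of confident wrong answer the agent must not give.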

Confidence routing and the silent handoff

The expensive failure mode is not "the agent does not know". It is "the agent confidently says something wrong about insurance coverage". The parent books, arrives, receives a 180 euro bill they did not expect, and the clinic eats the goodwill loss.

We route on confidence in two layers:

Retrieval confidence. If the top-k chunks score below threshold, or come from different codes, the agent does not state a coverage answer. It books the appointment and flags it for the practice manager to confirm by SMS in the morning.
LLM self-confidence. The model is prompted to emit a JSON tool call with certainty: high|medium|low. Anything below high on a coverage question triggers the same flag.

Both paths use the same handoff sentence: "Ik boek de afspraak vast in en de praktijk belt u maandagochtend terug over de vergoeding" ("I will book the appointment now, and the practice will call you back on Monday morning about the reimbursement"). No dead air, no transfer wait, no apology theatre. The parent gets what they came for (the slot) and the human resolves the part that genuinely needs a human.
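The two confidence layers reduce to one gate. The sketch below is one reading of the rules above: it treats "score below threshold" as the best chunk falling under the threshold, and the 0.75 value, the `Chunk` type, and the route names are all hypothetical.

```python
from dataclasses import dataclass

HANDOFF = ("Ik boek de afspraak vast in en de praktijk belt u "
           "maandagochtend terug over de vergoeding")

@dataclass(frozen=True)
class Chunk:
    code: str     # prestatiecode the chunk belongs to
    score: float  # retrieval similarity, higher is better

def route(chunks: list[Chunk], llm_certainty: str, min_score: float = 0.75) -> str:
    """Return 'state_coverage' only when both confidence layers agree."""
    if not chunks:
        return "book_and_flag"
    if max(c.score for c in chunks) < min_score:  # layer 1: weak retrieval
        return "book_and_flag"
    if len({c.code for c in chunks}) > 1:         # layer 1: chunks disagree on code
        return "book_and_flag"
    if llm_certainty != "high":                   # layer 2: model self-rating
        return "book_and_flag"
    return "state_coverage"

print(route([Chunk("V21", 0.91), Chunk("V21", 0.88)], "high"))
print(route([Chunk("V21", 0.91), Chunk("V31", 0.88)], "high"))
```

Every failure branch returns the same route, which is why the caller always hears the same handoff sentence regardless of which layer tripped.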

Testing at 380 concurrent

You cannot stage-test a voice agent the way you stage-test a web app. The phone network is real, the LLM provider is real, the TTS streams real audio. A real load test has to actually place real calls.

We run two layers:

Synthetic call generator. A Python script using pjsua2 places calls against a staging number and plays pre-recorded parent utterances. Cheap, fast, but does not stress the real telephony carrier.
Real-trunk burst. Once a week we run a 5-minute window against the live carrier's test environment, ramping from 50 to 400 calls. This is the test that catches CPS throttles, SIP REGISTER limits, and DNS rate-limits you did not know existed.
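For the real-trunk burst, it helps to fix the ramp in advance so that every weekly run is comparable. A minimal sketch assuming a linear ramp in 30-second steps; the step size and the linear shape are our assumptions, not a carrier requirement.

```python
def ramp_schedule(start: int = 50, end: int = 400,
                  duration_s: int = 300, step_s: int = 30) -> list[tuple[int, int]]:
    """Return (offset_seconds, target_concurrent_calls) pairs for a linear ramp."""
    steps = duration_s // step_s
    if steps < 2:
        raise ValueError("need at least two steps to ramp")
    return [
        (i * step_s, round(start + (end - start) * i / (steps - 1)))
        for i in range(steps)
    ]

for offset, target in ramp_schedule():
    print(f"t+{offset:>3}s -> {target} concurrent calls")
```

The final step reaches the 400-call ceiling and holds it for the last 30 seconds of the window, which is where carrier throttles tend to show up.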

The skeleton of the synthetic generator looks like this. Endpoint and Account initialization (libInit, UDP transport, libStart, Account.create with registration) is the standard pjsua2 boilerplate and is omitted for length; assume ep and acc are live at module import.

import asyncio
import random
import pjsua2 as pj

# ep: pj.Endpoint and acc: pj.Account are assumed initialized.
# See https://docs.pjsip.org/en/latest/pjsua2/intro.html for setup.

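class PlayerCall(pj.Call):
    """Outbound call that plays one WAV file once the audio stream is up."""

    def __init__(self, account, wav_path):
        super().__init__(account)
        self._wav = wav_path
        self._player = None

    def onCallMediaState(self, prm):
        # This callback can fire more than once per call (for example on a
        # re-INVITE), so attach the player only the first time.
        if self._player is not None:
            return
        for mi in self.getInfo().media:
            if (mi.type == pj.PJMEDIA_TYPE_AUDIO
                    and mi.status == pj.PJSUA_CALL_MEDIA_ACTIVE):
                self._player = pj.AudioMediaPlayer()
                self._player.createPlayer(self._wav, pj.PJMEDIA_FILE_NO_LOOP)
                self._player.startTransmit(self.getAudioMedia(mi.index))
                break

async def place_call(target: str, wav_path: str):
    call = PlayerCall(acc, wav_path)
    call.makeCall(target, pj.CallOpParam(True))
    # Hold the call open for a realistic duration before hanging up.
    await asyncio.sleep(random.uniform(45, 90))
    # The far end may already have hung up; hangup() on a dead call raises.
    if call.isActive():
        call.hangup(pj.CallOpParam(True))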

async def burst(target: str, n: int, ramp_s: int):
    async def fire(i):
        await asyncio.sleep(random.uniform(0, ramp_s))
        await place_call(target, f"utterances/{i % 40}.wav")
    await asyncio.gather(*(fire(i) for i in range(n)))

if __name__ == "__main__":
    asyncio.run(burst("sip:staging@agent.example.nl", 400, 30))

Forty pre-recorded utterances are enough to cover the common parent intents: check-up, filling, child first visit, broken bracket, emergency at the back molar, payment question, change appointment, language switch to English. Recording them once with the actual receptionists, not with synthetic TTS, is what makes the load test honest.

Observability is the actual product

A voice agent without recordings is a black box. Every call captures:

Full audio, caller and agent on separate channels, encrypted at rest.
ASR transcript with word-level timestamps.
The retrieved KB chunks for every LLM turn.
A confidence trace: router certainty, retrieval score, LLM self-rated.
The eventual outcome: booking made, callback flagged, human transferred.
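One way to hold the fields above is a single record per call. This is a sketch under our own naming: `CallRecord`, `TurnTrace`, and every field name are hypothetical, and the sample values are invented.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Literal

Outcome = Literal["booking_made", "callback_flagged", "human_transferred"]

@dataclass
class TurnTrace:
    """Confidence trace and retrieved KB chunks for one LLM turn."""
    router_certainty: str
    retrieval_score: float
    llm_self_rated: str
    kb_chunk_ids: list[str] = field(default_factory=list)

@dataclass
class CallRecord:
    call_id: str
    audio_uri: str  # two-channel recording, encrypted at rest
    transcript_words: list[tuple[str, float, float]]  # (word, start_s, end_s)
    turns: list[TurnTrace]
    outcome: Outcome

# Invented sample call.
record = CallRecord(
    call_id="call-0001",
    audio_uri="s3://example-bucket/call-0001.wav",
    transcript_words=[("vulling", 1.20, 1.65)],
    turns=[TurnTrace("high", 0.62, "medium", ["chunk-17"])],
    outcome="callback_flagged",
)
payload = json.dumps(asdict(record))
print(payload)
```

Keeping the retrieved chunk IDs on each turn is what makes a flagged call debuggable later: you can see what the agent was looking at when it decided not to answer.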

The practice manager opens a dashboard every Monday morning and sees something like: "12 calls flagged for callback this weekend, 4 about V codes, 6 about implant aftercare, 2 about an out-of-hours emergency". They work the list in 25 minutes. The agent handled the other 1,420 calls without anyone noticing.

The thing nobody warns you about: GDPR. Call recordings become special-category data once they sit alongside health context. We keep recordings 30 days, transcripts 90 days, and the structured booking data for the period the clinic's processor agreement specifies. The dashboard shows the retention policy on every screen, so nobody pastes a transcript into a group chat the way customer-service teams sometimes do.
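The retention windows are easier to enforce when they live in code and not in a policy document. A minimal sketch using the 30-day and 90-day figures above; the function names are ours, and booking data is left out because its period depends on each clinic's processor agreement.

```python
from datetime import date, timedelta

# Retention windows from the text, in days.
RETENTION_DAYS = {"recording": 30, "transcript": 90}

def purge_after(kind: str, created: date) -> date:
    """First date on which an artifact of this kind must be gone."""
    return created + timedelta(days=RETENTION_DAYS[kind])

def is_expired(kind: str, created: date, today: date) -> bool:
    """True once the retention window for this artifact has elapsed."""
    return today >= purge_after(kind, created)

created = date(2024, 10, 2)
print(purge_after("recording", created), purge_after("transcript", created))
print(is_expired("recording", created, date(2024, 11, 15)))
```

A nightly job that deletes everything for which `is_expired` returns true keeps the policy on the dashboard and the data on disk in agreement.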

The daily review loop

Every morning at 06:00, a job samples 30 calls from the previous day, runs each through a "did this go well?" classifier (prompt plus a small judge model), and posts the bottom 5 into a Slack channel the practice manager and our team share. Nine times out of ten the answer is "yes, the agent handled it". The tenth call is the gold dust. It points at a missing KB entry, a phrasing the ASR mishears, or a confidence threshold that needs adjusting.

After eight weeks the bottom-5 sample stopped surfacing real issues. We dropped the daily review to weekly. The agent now answers around 90% of incoming calls end-to-end, books about 71% of them into the clinic calendar without a human touching the booking, and flags the rest for the morning team.

What this actually costs the clinic to ignore

When we built the voice line for this Dutch dental group, the part that took the longest was not the LLM. It was the SIP trunk capacity negotiation and the prestatiecode KB. We ended up solving the Friday-spike problem by pre-warming the worker pool 20 minutes before the weekly peak and pinning the carrier to a higher CPS ceiling for that window. The same pattern works for any voice agent on a real phone line, whether it answers for a clinic, an MOT garage, or a regional logistics dispatch.

If you are sitting on a line that drops calls at peak and you have not measured what those drops cost, the five-minute audit is this: pull last month's CDR (call detail records) from your PBX, count the calls that hit voicemail between 16:00 and 17:00 on Fridays, and multiply by your average booking value. The number is usually larger than the price of fixing it.
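The audit can be run on a CDR export in a few lines. This sketch assumes a CSV with an ISO-format `start_time` column and a `disposition` column whose value is `voicemail` for missed calls; real PBX exports name these differently, and the booking value is a made-up figure.

```python
import csv
import io
from datetime import datetime

def missed_friday_calls(cdr_csv: str) -> int:
    """Count calls that hit voicemail on Fridays between 16:00 and 17:00."""
    count = 0
    for row in csv.DictReader(io.StringIO(cdr_csv)):
        started = datetime.fromisoformat(row["start_time"])
        in_friday_peak = started.weekday() == 4 and started.hour == 16
        if in_friday_peak and row["disposition"] == "voicemail":
            count += 1
    return count

# Invented CDR rows: 2024-09-27 is a Friday, 2024-09-26 a Thursday.
CDR = """start_time,disposition
2024-09-27T16:45:00,voicemail
2024-09-27T16:10:00,answered
2024-09-27T17:05:00,voicemail
2024-09-26T16:30:00,voicemail
2024-09-27T16:59:00,voicemail
"""
AVERAGE_BOOKING_VALUE_EUR = 85.0  # hypothetical figure

missed = missed_friday_calls(CDR)
print(missed, missed * AVERAGE_BOOKING_VALUE_EUR)
```

The result is an upper bound, since not every voicemail would have become a booking, but it gives the conversation with the practice managers a number to start from.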

Key takeaway

A voice agent at 380 concurrent calls is a telephony, data, and GDPR project before it is an AI project. The model is the easy part.

FAQ

Can a voice agent really handle 380 simultaneous calls?

Yes, but the bottleneck is your SIP trunk and carrier rate limits, not the language model. Plan capacity at the telephony layer first, and pre-warm worker pools before known peaks.

What happens when the agent does not know the answer?

It books the appointment, flags the call for human follow-up in the morning, and tells the caller exactly what to expect next. It never guesses at insurance coverage.

Is recording phone calls legal under GDPR?

Yes, with a lawful basis, caller notice, and short retention. Calls in a health context are special-category data, so keep retention short (we hold audio for 30 days at most) and document the policy where staff see it daily.

How accurate is Dutch speech recognition for dental terminology?

Out of the box, around 11% word error rate in the wild. With per-clinic vocabulary tuning and roughly 30 hours of in-domain audio, we got it under 5% on coverage-relevant words.
