Voice agents at a Dutch dental chain: LiveKit, Vapi, DIY

Friday 18:47, the NZa just changed a tariff, forty patients are in the queue. The cost answer and the audit answer were different stacks. The audit won.

Jacob Molkenboer · Founder · A Brand New Company · 5 Jun 2026 · 9 min

It's Friday 18:47. The NZa has just published a mid-year revision to the C22 consultation tariff. Forty people are calling the front desk of a 20-person dental chain in Haarlem to ask whether their Monday-morning check-up will cost more. There is no front desk. The front desk went home at 17:30. The voice agent picks up call 1, call 2, call 3. Six minutes from now it will hit a question it does not have a confident answer for, and someone is going to need to be in front of a keyboard.

Last winter we built that voice agent. We had three serious candidates for the voice layer: LiveKit Agents, Vapi, and a hand-rolled stack on Twilio Media Streams with a Claude tool-use loop. This post is what we learned scoring them against the only three things the practice owner actually cared about.

The brief, in one paragraph

5,800 conversations a week, almost all of them recurring half-hour appointments: check-ups, hygiene cleanings, kid orthodontics, the occasional reschedule. Hours 08:00 to 20:00 on weekdays, Saturday until 13:00. The agent had to read from the Exquise calendar over a stable API, push tentative bookings back, send the SMS confirmation, and hand off to a real receptionist the second the caller said the word "pijn" (pain) or anything close to it. Every conversation had to survive an audit under Wkkgz and the KNMT praktijkrichtlijn. And the practice owner had one personal number she wanted moved: the average Monday wait time at the human front desk was 4 minutes 11 seconds. She wanted it under 90 seconds.

Per-call cost at 5,800 conversations a week

We ran a representative two-week pilot on each stack. Same Deepgram nova-3 streaming STT in eu-west, same Cartesia Sonic-2 Dutch voice, same Claude Haiku 4.5 model behind the tool loop. The pilot averaged 1 minute 28 seconds per completed call. Here is what landed in the spreadsheet:

Vapi (managed)
  Vapi platform        $0.05  / min
  Twilio SIP           $0.014 / min
  Deepgram STT         $0.0077/ min
  Cartesia TTS         $0.020 / min (egress)
  Claude Haiku 4.5     ~$0.018 / call
  -----------------------------------
  Per call (~1m28s):   ~$0.155
  Per week:             ~$899
  Per year:           ~$46,700

LiveKit Cloud + own LLM
  LiveKit Cloud        $0.005 / min
  Twilio SIP           $0.014 / min
  Deepgram STT         $0.0077/ min
  Cartesia TTS         $0.020 / min
  Claude Haiku 4.5     ~$0.018 / call
  -----------------------------------
  Per call:            ~$0.087
  Per week:             ~$505
  Per year:           ~$26,260

Twilio Media Streams + Claude tool loop
  Twilio Media         $0.0125/ min in + $0.0085 stream
  Deepgram STT         $0.0077/ min
  Cartesia TTS         $0.020 / min
  Claude Haiku 4.5     ~$0.018 / call
  Hetzner CCX23         EUR 30 / month
  -----------------------------------
  Per call:            ~$0.077
  Per week:             ~$447
  Per year:           ~$23,250

Headline: managed Vapi is roughly twice the per-call cost of the hand-rolled loop. At this volume that delta pays for one extra dental hygienist by year two. It is not a rounding error. It is also not the deciding factor, because the deciding factor is the next section.
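The weekly and yearly rows follow mechanically from the per-call figure and the weekly volume. A minimal sketch of that roll-up, assuming the yearly figure is 52 times the rounded weekly figure; the `rollUp` function and its return shape are illustrative and not taken from the pilot spreadsheet:

```typescript
// Illustrative roll-up of a per-call cost to weekly and yearly totals.
// Assumption: 52 weeks per year, weekly total rounded to whole dollars first.
const CALLS_PER_WEEK = 5800;

interface CostRollUp {
  perWeek: number; // whole dollars
  perYear: number; // whole dollars
}

function rollUp(perCallUsd: number, callsPerWeek: number = CALLS_PER_WEEK): CostRollUp {
  const perWeek = Math.round(perCallUsd * callsPerWeek);
  return { perWeek, perYear: perWeek * 52 };
}

// Per-call figures copied from the three tables above.
const vapi = rollUp(0.155);
const liveKit = rollUp(0.087);
const handRolled = rollUp(0.077);

// Ratio behind the "roughly twice" headline.
const vapiVsHandRolled = 0.155 / 0.077;
```

With the per-call figures from the tables this gives 899, 505, and 447 dollars a week, and a Vapi to hand-rolled ratio of just over 2.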

Wkkgz-defensibility under the KNMT praktijkrichtlijn

Wkkgz wants three things from a voice agent: a clear consent moment at the top of the call, a faithful record of what was said on both sides, and a defensible escalation path the moment the caller says something that could be clinical. The KNMT guideline layers on data-minimisation expectations. You do not keep raw audio of "mijn dochter heeft een gat in haar kies" ("my daughter has a cavity in her molar") longer than you need it to do the booking.

Vapi stores transcripts in the US by default, and you have to ask, in writing, for an EU-only retention tier. The audit log is theirs. If you get a Wkkgz complaint six months from now and need to reconstruct exactly which version of the prompt was live on March 14th at 11:08, you are filing a support ticket. We have done this on another project. It works. It is not fast, and it is not in your hands.

LiveKit Cloud has an EU region in Frankfurt and a self-host option, which was the door we wanted left open. The transcript and the audio are yours from minute one. The audit log is whatever you wire up, which is the cost.

The Twilio Media Streams stack we built ran the orchestrator on a Hetzner box in Falkenstein, wrote every turn of the conversation to a Postgres in the same VPC, and never let raw audio leave Europe. Every prompt change was a git commit on a branch that required two approvals. When the practice manager asked, six weeks after go-live, which prompt was live the day Mrs. Jansen called about the implant, the answer was a git log away.
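The same question can also be answered from data if each prompt rollout is recorded next to its commit. A sketch under assumptions: the `PromptDeploy` record shape, the `promptLiveAt` helper, and the commit hashes and timestamps below are all hypothetical, not the project's actual schema or history.

```typescript
// Hypothetical deploy record: one row per prompt rollout.
interface PromptDeploy {
  commit: string;     // git commit hash of the prompt change
  deployedAt: string; // ISO 8601 timestamp (UTC) when it went live
}

// Returns the deploy that was live at `at`, or undefined if `at`
// predates the first recorded deploy. Input order does not matter.
function promptLiveAt(deploys: PromptDeploy[], at: string): PromptDeploy | undefined {
  const t = Date.parse(at);
  let live: PromptDeploy | undefined;
  for (const d of deploys) {
    const deployed = Date.parse(d.deployedAt);
    if (deployed <= t && (live === undefined || deployed > Date.parse(live.deployedAt))) {
      live = d;
    }
  }
  return live;
}

// Made-up example data.
const deploys: PromptDeploy[] = [
  { commit: "a1b2c3d", deployedAt: "2026-03-01T09:00:00Z" },
  { commit: "e4f5a6b", deployedAt: "2026-03-14T10:30:00Z" },
  { commit: "c7d8e9f", deployedAt: "2026-03-20T16:45:00Z" },
];

const liveOnCallDay = promptLiveAt(deploys, "2026-03-14T11:08:00Z");
```

Storing the commit hash with every conversation turn gives the same answer without a lookup, at the cost of one extra column.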

Takeaway

For Wkkgz, the question is not "is this vendor compliant?" It is "can you reconstruct, with version control, what your agent said on a specific Tuesday?" Two of these three stacks make that easy. One does not.

The Friday-night NZa patch

Back to the opening scene. The tariff for code C22 just changed. Forty people are in the queue. Who edits the prompt, and how long does it take to land in production?

On Vapi, the answer is good. Anyone with dashboard access edits the system prompt in a textarea, hits save, and the next call gets the new version. The practice manager can do it from her phone on the train. That is genuinely useful at 19:00 on a Friday. The flip side: she can also break it from her phone, there is no review step in front of her, and the version history is shallow. On another project we watched a real production prompt get clobbered by an accidental paste from a WhatsApp thread.

On LiveKit and on our hand-rolled stack, the prompt is code. Updating it on a Friday night means someone with deploy access has to open a PR, get an approval, watch CI, and roll. We measured this end-to-end at 11 minutes for a one-line change. That is fine for a tariff update. It is also fine for a typo. It is not fine for "the agent is hallucinating a procedure code, take it down now," and the answer there is not to make deploys faster, it is to have a feature flag.

So we built one. A single environment variable, flippable from a Slack slash command by the on-call engineer or by the practice manager herself, that takes the agent from "book the appointment" to "thank you for calling, our team will call you back tomorrow morning." Average time from "this is going wrong" to "the agent is quiet and politely promising a callback": 14 seconds. Average time to ship a real prompt fix: still 11 minutes.
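The switch itself can be very small. A minimal sketch, assuming an in-process flag that a Slack slash-command handler would flip; `AgentMode`, `setMode`, and `replyFor` are illustrative names, and the flag described above was an environment variable rather than this module-level variable.

```typescript
type AgentMode = "booking" | "callback_only";

const CALLBACK_LINE =
  "Thank you for calling, our team will call you back tomorrow morning.";

// In-process flag for illustration only; a production version would read
// shared state so that every worker sees the flip.
let mode: AgentMode = "booking";

function setMode(next: AgentMode): void {
  mode = next;
}

// Checked at the top of every turn, before any model call is made.
function replyFor(transcript: string, runAgent: (t: string) => string): string {
  if (mode === "callback_only") return CALLBACK_LINE;
  return runAgent(transcript);
}

// Stand-in for the real agent loop.
const fakeAgent = (t: string): string => `booking flow for: ${t}`;

const before = replyFor("I need a check-up", fakeAgent);
setMode("callback_only");
const after = replyFor("I need a check-up", fakeAgent);
```

Because the check sits in front of the model call, flipping the flag does not depend on the prompt or the model behaving.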

The handoff is where the safety lives

Wkkgz does not care what the LLM said. It cares what the practice did with what the caller said. The hand-rolled loop ran every caller utterance through two parallel checks before responding: an intent classifier and a single-token clinical-risk flag (pain, swelling, blood, broken, child, kid, medication). Either flag tripped, and the agent stopped trying to book. It said one sentence, transferred the call to the on-call number, and wrote a row to the handoff table that the morning team triaged at 07:45 the next day.
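The control flow of that check can be sketched with a plain keyword matcher swapped in for the single-token model call. This is a stand-in to show the shape, not the classifier described above: the term list is the one from the text plus "pijn", and `classifyRiskByKeyword`, `checkTurn`, and the `HandoffRow` shape are hypothetical.

```typescript
// Keyword stand-in for the clinical-risk flag (assumption: exact word match).
const RISK_TERMS = ["pain", "pijn", "swelling", "blood", "broken", "child", "kid", "medication"];

interface RiskResult {
  flagged: boolean;
  term?: string;
}

function classifyRiskByKeyword(transcript: string): RiskResult {
  // Split on anything that is not a-z so "kidney" does not match "kid".
  const words = transcript.toLowerCase().split(/[^a-z]+/);
  const term = RISK_TERMS.find((t) => words.includes(t));
  return term ? { flagged: true, term } : { flagged: false };
}

// Hypothetical row shape for the handoff table.
interface HandoffRow {
  callId: string;
  utterance: string;
  term: string;
  at: string; // ISO 8601 timestamp
}

// In-memory stand-in for the Postgres handoff table.
const handoffTable: HandoffRow[] = [];

// Runs before the agent responds; a flagged turn is recorded and routed away from booking.
function checkTurn(callId: string, transcript: string, now: Date): "handoff" | "continue" {
  const risk = classifyRiskByKeyword(transcript);
  if (!risk.flagged || risk.term === undefined) return "continue";
  handoffTable.push({ callId, utterance: transcript, term: risk.term, at: now.toISOString() });
  return "handoff";
}
```

Exact-word matching misses inflections such as "children", which is one reason to prefer a model-based flag over a list like this.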

This is the part that an off-the-shelf voice platform can do, but only if you stop treating the system prompt as a single instruction and start treating it as a state machine. The skeleton:

type State = 'greet' | 'intent' | 'slot' | 'confirm' | 'handoff'

async function turn(state: State, transcript: string, ctx: CallCtx) {
  const risk = await classifyRisk(transcript)        // 1 token, ~40ms
  if (risk.flagged) return { next: 'handoff', say: HANDOFF_LINE }

  switch (state) {
    case 'greet':   return askIntent(ctx)
    case 'intent':  return routeIntent(transcript, ctx)
    case 'slot':    return fillSlots(transcript, ctx)  // calls Exquise
    case 'confirm': return readBack(ctx)
    case 'handoff': return transfer(ctx)
  }
}

Each state gets its own tight prompt and a strict tool list. The greet state cannot call book_appointment. The slot state cannot end the call. That separation is what made the agent defensible on paper and reliable on the phone.
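The per-state restriction can be enforced in code rather than left to the prompt. A sketch, with assumptions: `book_appointment` is the only tool name given above, so the other tool names and the mapping of tools to states are hypothetical.

```typescript
type State = "greet" | "intent" | "slot" | "confirm" | "handoff";

// Only book_appointment is named in the text; the rest are placeholder names.
type Tool = "read_calendar" | "book_appointment" | "send_sms" | "transfer_call" | "end_call";

// Assumed allow-list: each state sees only the tools it needs.
const ALLOWED_TOOLS: Record<State, readonly Tool[]> = {
  greet: [],
  intent: [],
  slot: ["read_calendar"],
  confirm: ["book_appointment", "send_sms", "end_call"],
  handoff: ["transfer_call"],
};

function isToolAllowed(state: State, tool: Tool): boolean {
  return ALLOWED_TOOLS[state].includes(tool);
}

// Guard to run before executing any tool call the model requests.
function assertToolAllowed(state: State, tool: Tool): void {
  if (!isToolAllowed(state, tool)) {
    throw new Error(`Tool ${tool} is not allowed in state ${state}`);
  }
}
```

Passing only `ALLOWED_TOOLS[state]` to the model for that turn and running the guard on its response means a wrong tool call fails loudly instead of reaching the calendar.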

What we picked

We shipped the hand-rolled Twilio Media Streams + Claude tool-use loop. Three reasons, in order of weight:

1. Wkkgz reconstruction. Git-versioned prompts, audit log in our Postgres, audio that never crossed the Atlantic.
2. Cost at scale. At 5,800 calls a week, the saving from paying half of Vapi's per-call price pays for the engineer who maintains it, with change left over.
3. Tool surface. The agent talks to Exquise, the SMS confirmation system, a soft-booking holding table, and a Slack escalation channel. Wiring four tools into a Claude loop is about a hundred lines of TypeScript. Wiring four tools into a managed platform is doable, but the orchestration logic ends up split across the vendor dashboard and your code, which is the worst of both worlds.

On the practice owner's number, the Monday-morning wait time came in under the ninety-second target by week three and held there through the spring. That, more than the audit story, was the metric the partners quoted back to her in the quarterly meeting. The audit story is what made the project defensible. The wait time is what made it a yes.

This is not an argument against Vapi. For a five-person plumbing company that needs an after-hours intake agent next week, Vapi is the right answer and we recommend it monthly. The break-even is somewhere around 1,000 weekly calls and an industry where the audit story matters. Below that, you are paying an engineer to save fifty euros a month.

The two things we got wrong

First: barge-in. Dutch callers interrupt. A polite English-speaking agent will keep going for another full sentence, and a Dutch caller hears that as the agent ignoring them and hangs up. We needed the VAD to cut TTS playback inside 180 ms. Our first pass was at 600 ms. That 420 ms gap is the difference between "this works" and "this is broken." The fix was a server-side endpointing model on the inbound stream, not the model the TTS vendor ships with.
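The barge-in path is easier to keep inside a latency budget when it is measured on every interruption. A minimal sketch, assuming the endpointing model calls a hook when it detects caller speech; `BargeInController`, the `Playback` interface, and `withinBudget` are hypothetical names, and only the 180 ms figure comes from the text above.

```typescript
const BARGE_IN_BUDGET_MS = 180;

// Hypothetical handle on the outbound TTS stream.
interface Playback {
  stop(): void; // cancels audio that is playing or queued
}

class BargeInController {
  private readonly playback: Playback;
  private readonly now: () => number; // injectable clock, in milliseconds
  private speaking = false;

  constructor(playback: Playback, now: () => number) {
    this.playback = playback;
    this.now = now;
  }

  startSpeaking(): void {
    this.speaking = true;
  }

  // Called when caller speech is detected on the inbound stream.
  // `detectedAtMs` is when the triggering audio frame arrived.
  // Returns the cut-off latency in ms, or null if the agent was silent.
  onCallerSpeech(detectedAtMs: number): number | null {
    if (!this.speaking) return null;
    this.playback.stop();
    this.speaking = false;
    return this.now() - detectedAtMs;
  }
}

function withinBudget(latencyMs: number): boolean {
  return latencyMs <= BARGE_IN_BUDGET_MS;
}
```

Logging the returned latency per call turns a regression toward 600 ms into a number on a dashboard rather than a rise in hang-ups.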

Second: the model is not the agent. We started with a 1,200-word system prompt and a single Claude call per turn. Reliability was bad and we could not tell why. We rebuilt it as the state machine above, with a separate Claude call per state, each with a tight prompt and a strict tool list. Hallucinations dropped to near zero. This matches the pattern in Anthropic's writing on building effective agents: the reliability win is in the orchestration, not in the model.

The smallest next step

If you are running a voice agent today and you cannot answer the question "which prompt was live at 14:33 last Thursday?" with a command, that is the audit you can run in five minutes. Open your repo, your dashboard, your wiki, whatever the prompt actually lives in, and try to reconstruct yesterday's version. If you can't, that is the first thing to fix, whatever stack you are on.

When we built the voice agent for that Haarlem dental chain, the thing we ran into was that the cost answer and the audit answer pointed at different stacks. We ended up solving it with a git-versioned prompt, a Postgres audit log in Falkenstein, and a Slack kill-switch the practice manager controls from her own phone.

Key takeaway

For voice agents in regulated work, the cost answer and the audit answer usually point at different stacks. Pick the audit answer.

FAQ

Is Vapi compliant with Wkkgz and the KNMT praktijkrichtlijn?

Vapi is usable for Dutch healthcare, but the audit and data-residency story requires an EU retention tier and a support-ticket workflow to reconstruct historical prompts. That works; it is not fast and it is not in your hands.

When does a hand-rolled Twilio + Claude voice agent beat a managed platform?

Roughly above 1,000 weekly calls, or in any regulated context where you need git-versioned prompts and EU-resident audio. Below that, the engineering cost outweighs the per-call savings and Vapi is the better answer.

How fast can you patch a voice agent's prompt when a tariff changes on a Friday night?

On Vapi, seconds via the dashboard, with no review gate. On LiveKit or a hand-rolled stack with a normal deploy pipeline, around 11 minutes per PR, plus a Slack-controlled kill-switch for the genuine emergencies.

What was the one architectural change that made the agent reliable?

Splitting the single 1,200-word system prompt into a state machine with a separate Claude call per state, each with a tight prompt and a strict tool list. Hallucinations dropped to near zero once the model could no longer call the wrong tool.
