Voice agent stacks: Vapi vs Retell vs Twilio at 4,800 calls

A 24-person dental software vendor in Antwerp needed 4,800 reminder calls a day in Dutch and French, with under one second barge-in. We tested three stacks.

Jacob Molkenboer · Founder · A Brand New Company · 5 Dec 2024 · 10 min

On a Tuesday in March, the operations lead at a 24-person dental software vendor in Antwerp opened her Vapi billing console and read the number twice. The prior month's outbound reminder calls had cost €31,400. Her CFO had budgeted €12,000. The booking rate on reminder calls had improved by 9 percent year on year. That mattered less to her than the variance.

She emailed us the next morning. The question on the table was straightforward: was there a voice stack that hit the same recall rate at a third of the spend, without losing the sub-second barge-in her practices in Flanders had quietly come to expect?

Over six weeks we tested three options against the same 1,000-call replay set: Vapi, Retell, and a custom Twilio plus LiveKit plus Deepgram pipeline. The honest answer is that all three can deliver this workload. The useful answer is what they hide in their demos.

What 4,800 calls a day actually looks like

The Antwerp vendor sells appointment software to roughly 1,100 Belgian and Dutch dental practices. Each practice opts in to 24-hour and 2-hour reminders. About 4,800 of those reminders go out as outbound voice calls every weekday, split roughly 65 percent Flemish Dutch and 35 percent French.

A typical reminder has three branches. The patient confirms ("ja, ik kom"), reschedules ("kan ik verzetten naar volgende week?"), or cancels. Cancel and reschedule both hand off to a calendar lookup, then the agent proposes two slots and books one. The successful confirm is the cheap path. The reschedule is the long one. The cancellation is where you find out whether your stack handles a barge-in mid-greeting.

Weighted across pickups and voicemails, the call mix on this account lands at roughly 22 seconds for a confirm, 71 seconds for a reschedule, 14 seconds for a voicemail drop, and 8 seconds for a no-answer (mostly ring time billed by the carrier). That works out to about 32,500 voice minutes a month. Multiply by the all-in rate any vendor actually charges, including their default LLM and TTS pass-through, and the monthly bill writes itself. The CFO's €12,000 budget assumed €0.08 per minute. The Vapi bill assumed €0.18. Neither was wrong on its own terms. They were answering different questions.
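
The weighting behind a monthly-minutes figure like this can be sketched in a few lines of Python. The per-outcome durations are the ones quoted above. The outcome shares and the 21 working days are hypothetical placeholders, since the exact mix is not published here, so the output is illustrative and is not a reconstruction of the 32,500-minute figure.

```python
# Durations in seconds come from the text above.
DURATIONS_S = {"confirm": 22, "reschedule": 71, "voicemail": 14, "no_answer": 8}


def monthly_minutes(shares, calls_per_day, working_days):
    """Weighted-average call length times monthly call count, in minutes."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("outcome shares must sum to 1")
    avg_seconds = sum(DURATIONS_S[name] * share for name, share in shares.items())
    return avg_seconds * calls_per_day * working_days / 60


# HYPOTHETICAL shares, chosen only to show the mechanics.
example_shares = {"confirm": 0.5, "reschedule": 0.1, "voicemail": 0.2, "no_answer": 0.2}

minutes = monthly_minutes(example_shares, calls_per_day=4800, working_days=21)
print(f"{minutes:,.0f} voice minutes per month")
```

Swapping in the real outcome shares from a call log is the only change needed. The reschedule share moves the total most, because that branch is more than three times as long as a confirm.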

The 1,000-call replay set

Before we tested any vendor, we had to capture a representative sample of real calls. The Antwerp vendor had six months of recorded calls from their Vapi flow, with consent banners played at the start of each. We pulled 1,000 of them, stratified to approximate the production mix: 540 Flemish Dutch, 360 French, and 100 mixed-language conversations (mostly Brussels practices where the patient and the agent end up speaking different languages by the second turn).

The replay harness was a small Twilio app that placed an outbound call to a number routed to each vendor in turn, then played the recorded patient side as the answering audio over the media stream. Each vendor's agent heard the same 1,000 patient utterances in the same order. We recorded both sides of every call and computed three metrics: audible barge-in latency, end-to-end task success (did the confirm or reschedule resolve correctly with the right slot booked), and total billable minutes by line item, scraped from each vendor's usage export and the carrier's CDRs.
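
As a rough illustration of how the first metric can be scored from the recordings, the sketch below assumes each replayed call has already been reduced to two timestamps: when the patient audio starts and when the agent's audio falls silent. The `CallTrace` record, its field names, the sample values, and the nearest-rank percentile are all assumptions made for this example, not the harness itself.

```python
import math
from dataclasses import dataclass


@dataclass
class CallTrace:
    patient_speech_start_ms: float  # patient audio first exceeds the noise floor
    agent_audio_stop_ms: float      # agent side of the recording goes silent


def barge_in_latency_ms(trace):
    """Audible barge-in: what the patient hears, not what the backend logs."""
    return trace.agent_audio_stop_ms - trace.patient_speech_start_ms


def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative traces only.
traces = [CallTrace(1000, 1400), CallTrace(2500, 3010), CallTrace(800, 1580)]
latencies = [barge_in_latency_ms(t) for t in traces]
print("p50:", percentile(latencies, 50), "ms")
print("p95:", percentile(latencies, 95), "ms")
```

Reporting a high percentile alongside the median matters here, because the calls that feel broken to a patient are the slow tail, not the typical case.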

One thing the replay rig made obvious: vendor demos almost never use real recorded patient audio. They use a human speaking cleanly into a headset. Real patient audio includes a barking dog at 3 in the afternoon, a car radio on FM, and a grandmother who starts answering before the agent finishes its second word. Two of the three vendors had a noticeable drop in confirm recall on the mixed-language and noisy subsets. The custom stack, tuned with Deepgram's multi-language model and a slightly more aggressive VAD, held steady. None of that shows up in a polished demo.

Vapi as the default starting point

Vapi is what you reach for when you need a working voice agent in an afternoon. The web console gives you function calling, a structured system prompt, and a phone number on the same screen. For the Antwerp pilot, the in-house team had a Dutch-language confirm-only flow live within a day, and a French version forked from it the day after. That speed is genuinely the product.

What surprised them, six months in, was the cost composition. Vapi's published platform fee is the cheapest line on the invoice. The expensive lines are the model and the voice. By default Vapi pipes through OpenAI plus ElevenLabs at list rate. On a 22-second confirm call, ElevenLabs alone accounts for roughly 40 percent of the per-minute cost. Swap in Cartesia or PlayHT and the platform fee becomes meaningful again, but you have lost the original one-console, one-bill appeal that brought you to Vapi in the first place.

Barge-in latency, measured as the time between the user starting to speak and the agent's audio audibly cutting out, sat at 780 milliseconds on our 1,000-call replay. Acceptable for the use case. Not state of the art. The bottleneck was the default VAD threshold, which Vapi exposes but not transparently.

Retell and the orchestration tax

Retell is the option founders short-list when Vapi feels too opinionated. It treats the LLM, the STT, and the TTS as first-class swappable plugins, and exposes a cleaner state machine for multi-turn flows. Their documentation is candid about the moving parts and the responsibility split.

That separation is the feature, and the cost. The Antwerp team's Vapi flow had a single system prompt and a function-call schema. Porting it to Retell required modelling the confirm, reschedule, and cancel branches as explicit nodes, with transition conditions on each. Production-grade, yes. Two engineering weeks, also yes.

Barge-in landed at 640 milliseconds with Retell's default endpointing, and at 510 milliseconds when we shortened the silence threshold to 110 milliseconds. Cost per minute came in about 12 percent below Vapi with the same ElevenLabs voice (€0.161 against €0.182), and about 38 percent below Vapi's default configuration once we accepted Cartesia in place of ElevenLabs (€0.112). The confirm recall rate held within a percentage point of Vapi. The honest summary on Retell: faster on barge-in, cheaper on the bill, more work in the build.

Rolling Twilio plus LiveKit plus Deepgram

The third option is what you build when the per-minute bill, multiplied by your volume, exceeds your platform engineer's salary. At 32,500 minutes a month, the Antwerp vendor was past that threshold and knew it.

The architecture is unglamorous and well documented. Twilio handles the carrier leg with their Media Streams WebSocket, streaming μ-law audio frames from the PSTN call to your backend. LiveKit's Agents framework brokers the room and pipes audio between Deepgram for streaming STT, your chosen LLM for reasoning, and a TTS provider of your choice. A small Silero VAD model sits in front for endpointing, deciding when the user has actually stopped speaking versus paused mid-sentence.
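
To make the carrier leg concrete, here is a minimal sketch of what a backend does with one Twilio Media Streams message: pull the base64 payload out of the JSON `media` event and expand each 8-bit μ-law byte into a 16-bit PCM sample, the format VAD and STT models generally work with. The function names are invented for illustration, and the decoder is the standard G.711 μ-law expansion, not anything specific to this deployment.

```python
import base64
import json

ULAW_BIAS = 0x84  # 132, the bias used by G.711 mu-law


def ulaw_byte_to_pcm16(byte):
    """Expand one 8-bit mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF            # mu-law bytes are transmitted bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def decode_media_message(raw_message):
    """Return PCM16 samples from a 'media' event, or [] for any other event."""
    message = json.loads(raw_message)
    if message.get("event") != "media":
        return []
    payload = base64.b64decode(message["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in payload]


# A hand-built frame with three mu-law bytes: silence, then both extremes.
frame = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(bytes([0xFF, 0x00, 0x80])).decode()},
})
print(decode_media_message(frame))
```

In a real pipeline this step usually lives inside the framework's SIP or telephony integration. Writing it out mainly shows how little audio bandwidth the PSTN leg gives the STT model to work with.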

A minimal LiveKit agent looks like this:

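from livekit import agents
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, cartesia, silero

async def entrypoint(ctx: agents.JobContext):
    agent = VoicePipelineAgent(
        # 120 ms of silence before the VAD decides the speaker has stopped
        vad=silero.VAD.load(min_silence_duration=0.12),
        stt=deepgram.STT(model="nova-3", language="multi"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(voice="nl-be-female-1", model="sonic-2"),
        allow_interruptions=True,
        # seconds of detected speech required before the agent is interrupted
        interrupt_speech_duration=0.10,
    )
    await ctx.connect()
    agent.start(ctx.room)

if __name__ == "__main__":
    # Register the entrypoint with the LiveKit worker CLI
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))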

The custom stack delivered the lowest barge-in number of the three: 420 milliseconds at the 95th percentile, after tuning Silero's silence window down to 120 milliseconds and shortening Cartesia's first-chunk buffer. Cost per minute, including the Twilio Belgium outbound rate, LiveKit Cloud participant minutes, Deepgram Nova-3 streaming, GPT-4o-mini, and Cartesia Sonic-2, settled at €0.061. Roughly a third of the Vapi all-in figure.

That number comes with engineering debt. A platform like Vapi quietly handles SIP retries, codec negotiation, jitter buffer tuning, and number warm-up. The day you stop paying for them, you start handling them. Plan for two engineers full time for the first six weeks and one engineer at twenty percent for the year after.

The barge-in number the demos hide

Every vendor demo measures latency from the moment their backend acknowledges the user's audio. That number is not what the patient on the line actually hears. The patient hears the time between starting to speak and the agent's voice going quiet.

Three things sit between those two events: the carrier media path, the VAD silence threshold, and the TTS interrupt logic. Vendors love to advertise the second one. They are quieter about the third. A TTS engine that has already streamed 800 milliseconds of audio into the patient's ear has 800 milliseconds of audio still to flush before silence returns, no matter how fast the VAD fires.

Warning

If your vendor advertises "200 ms barge-in" without specifying whether that is server-side detection or audible interrupt, assume the worst. The number patients actually experience is typically 2 to 3 times the marketing figure.

Deepgram's published streaming latency for their Nova family sits in the 150 to 300 millisecond band, which is real. Take the low end of that band, add the 120 millisecond VAD window and a 150 millisecond TTS flush, and you arrive at the 420 millisecond figure we measured. The marketing tagline would have been "150 ms". The patient experience is 420. Both are true.
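
The additive budget can be written out explicitly. This is a sketch under simplifying assumptions: it treats the stages as strictly sequential, takes the low end of Deepgram's band (150 ms), the 120 ms Silero window from the tuned pipeline, and a 150 ms TTS flush, and it ignores carrier jitter. `BargeInBudget` is a name made up for this example.

```python
from dataclasses import dataclass


@dataclass
class BargeInBudget:
    stt_latency_ms: float  # streaming STT latency, the number vendors advertise
    vad_window_ms: float   # how long the VAD waits before it commits
    tts_flush_ms: float    # audio already sent that still has to play out

    def audible_ms(self):
        """Time from the patient speaking to the agent going quiet."""
        return self.stt_latency_ms + self.vad_window_ms + self.tts_flush_ms

    def ratio_to_advertised(self):
        """How far the audible figure sits above the advertised one."""
        return self.audible_ms() / self.stt_latency_ms


tuned = BargeInBudget(stt_latency_ms=150, vad_window_ms=120, tts_flush_ms=150)
print(tuned.audible_ms(), "ms audible")
print(round(tuned.ratio_to_advertised(), 1), "x the advertised figure")
```

Filling in the three fields from a vendor's own documentation is a quick way to sanity-check a latency claim before running a single test call.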

Per-minute math at this volume

The all-in rates we landed on, across the same 1,000-call replay set with identical flows in both languages:

  • Vapi with OpenAI plus ElevenLabs defaults: €0.182 per minute
  • Vapi with Cartesia swapped in: €0.131 per minute
  • Retell with OpenAI plus ElevenLabs: €0.161 per minute
  • Retell with Cartesia: €0.112 per minute
  • Twilio plus LiveKit plus Deepgram plus GPT-4o-mini plus Cartesia: €0.061 per minute

At 32,500 minutes a month, the spread between the most expensive and the cheapest configuration is €3,930. Per year, €47,160. That is not a rounding error on a 24-person company. It also is not the only number. The custom stack pushed two engineers into voice infrastructure for six weeks. Loaded cost at Belgian rates, roughly €40,000. Year one breaks roughly even on the savings. Year two is where the surplus shows up, and year three is where you have a real moat against the next competitor still paying €0.18 a minute to a managed platform.
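
For readers who want to rerun the arithmetic with their own volume, here is a small sketch using the rates and the build cost quoted above. The payback calculation is a simplification: it counts only the one-off €40,000 build and leaves out the ongoing part-time engineer, so treat it as a lower bound. The dictionary keys and function names are illustrative.

```python
MONTHLY_MINUTES = 32_500
BUILD_COST_EUR = 40_000  # two engineers for six weeks, as quoted above

# All-in rates per minute from the replay set.
RATES_EUR = {
    "vapi_elevenlabs": 0.182,
    "vapi_cartesia": 0.131,
    "retell_elevenlabs": 0.161,
    "retell_cartesia": 0.112,
    "custom_cartesia": 0.061,
}


def monthly_cost(rate, minutes=MONTHLY_MINUTES):
    return rate * minutes


def payback_months(managed_rate, custom_rate, build_cost=BUILD_COST_EUR):
    """Months of savings needed to cover the one-off build cost."""
    saving = monthly_cost(managed_rate) - monthly_cost(custom_rate)
    return build_cost / saving


spread = monthly_cost(max(RATES_EUR.values())) - monthly_cost(min(RATES_EUR.values()))
months = payback_months(RATES_EUR["vapi_elevenlabs"], RATES_EUR["custom_cartesia"])
print(f"monthly spread: EUR {spread:,.2f}")
print(f"payback: {months:.1f} months")
```

The unrounded monthly spread comes out a few euros above the rounded €3,930 used in the text. The payback lands a little past ten months, which is what "roughly even" in year one means in practice.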

What we shipped

The Antwerp vendor went live in May on the custom Twilio plus LiveKit plus Deepgram stack, with Retell kept as a hot standby behind a feature flag for the French-language flow in case of regression. The problem nobody warned us about was the codec mismatch between LiveKit Cloud (Opus at 48 kHz) and Twilio's PSTN leg (μ-law at 8 kHz). It introduced a 60 millisecond transcode penalty that we only caught by capturing raw RTP on both sides. We ended up running our own LiveKit egress node in Frankfurt and pinning the SIP transcoder to that region, which cut the round trip by 90 milliseconds and resolved a strange "the agent sounds like it is underwater" complaint from three Walloon practices.

The smallest thing you can do today, before any vendor comparison matters, is open your current voice provider's billing export and split the per-minute cost into platform fee, LLM, STT, and TTS. If you cannot, that is your first answer.
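
A minimal sketch of that split, assuming the billing export can be reduced to a CSV with a `component` column and a `cost_eur` column. Both column names, the component labels, and the sample figures are hypothetical. Real exports differ by vendor and may need a mapping step first.

```python
import csv
import io
from collections import defaultdict

COMPONENTS = ("platform", "llm", "stt", "tts")


def per_minute_breakdown(csv_text, billed_minutes):
    """Sum cost per component and divide by billed minutes."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        component = row["component"].strip().lower()
        # Anything outside the four buckets (carrier, storage) goes to "other".
        key = component if component in COMPONENTS else "other"
        totals[key] += float(row["cost_eur"])
    return {name: cost / billed_minutes for name, cost in totals.items()}


# Made-up export rows, for illustration only.
sample_export = """component,cost_eur
platform,50.00
llm,20.00
stt,10.00
tts,70.00
carrier,30.00
"""

breakdown = per_minute_breakdown(sample_export, billed_minutes=1000)
for name, rate in sorted(breakdown.items()):
    print(f"{name}: EUR {rate:.3f} per minute")
```

If the export only offers a single blended per-minute line, the function has nothing to group by, and that absence is itself the finding.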

Key takeaway

Build your own voice stack once the managed platform's per-minute bill, times your annual volume, exceeds the cost of two engineers for six weeks. Below that line, stay on a managed platform.

FAQ

What is the barge-in latency a patient actually hears?

It is the time between the patient starting to speak and the agent's audio going silent, not the time the vendor backend takes to detect speech. Expect 2 to 3 times the marketing figure.

When does rolling your own voice stack beat using Vapi or Retell?

When your annual voice spend on a managed platform exceeds the cost of two engineers for six weeks. Below that line, use a managed platform. Above it, build and own the pipeline.

Does the TTS provider really change the per-minute bill that much?

Yes. On short reminder calls, the TTS provider can be 30 to 40 percent of the per-minute cost. In our tests, swapping ElevenLabs for Cartesia cut the all-in rate by roughly 28 to 30 percent (€0.182 to €0.131 on Vapi, €0.161 to €0.112 on Retell) without losing intelligibility.

Can you run more than one voice stack at once for redundancy?

Yes. The Antwerp vendor runs the custom stack as primary and a Retell flow behind a feature flag as hot standby for the French path. Failover is one config change at the router.
