Voice agent stack pick: Retell, ElevenLabs, or LiveKit
An agency owner in Maastricht forwards a Loom video at midnight. The voice agent works in Standard Dutch and stumbles on the Limburg accent. Here is the scoring sheet we use.

It is a Tuesday night in May. An agency owner in Maastricht forwards us a Loom video. Her voice-agent demo, built on a managed platform during a hackathon weekend, works fine when she speaks Standard Dutch. The client (a tile-and-bathroom chain with shops in Sittard, Heerlen, and Roermond) just tested it in their own accent. Every fourth utterance comes back as a transcription of something the customer did not say. The agency partner deck promised "natural Dutch conversation". The procurement lead at the tile chain wants a callback by Friday.
This is the situation where we get pulled in. The question we get asked is always some version of: should we use Retell, ElevenLabs Conversational, or build it ourselves? The honest answer is that none of those are wrong by default. The right one depends on three numbers we ask for before anything else.
The three numbers that pick the stack
We do not start with a vendor comparison. We start with three constraints from the actual business.
First, weekly call volume. Below 200 weekly calls the choice barely matters. Above 5,000 the math changes. The interesting band is 800 to 3,000, which is where most sub-€10M Dutch agencies and their clients live. For the Maastricht agency we are working at 1,400 calls per week, mean call length 2 minutes 40, peak Wednesday afternoon.
Second, the accent. There is a real difference between "Dutch" and what people actually speak in Limburg or West Flanders. Whisper-large-v3, the open-weights baseline from OpenAI, handles Standard Dutch well. On Limburgs and West-Vlaams its accuracy still drops, but less than on most managed platforms. Managed platforms rarely advertise which STT they are running, and most still default to a Dutch model trained on Polderlands media.
Third, who gets the call at 03:00 when something breaks. KPN reroutes 088 traffic during maintenance windows roughly twice a quarter. Voxbone (now part of Bandwidth) and Belgacom (now Proximus) have their own maintenance windows. When the SIP trunk renegotiates and the agent stops picking up, somebody on your side has to know to look at the trunk before they blame the LLM.
Once we have those three numbers, the stack pretty much picks itself.
Per-minute cost at 1,400 weekly calls
Let us do the math out loud. 1,400 calls × 2.67 minutes average × 4.33 weeks is around 16,200 minutes per month. Round to 16k for sanity.
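The same arithmetic as a small helper, in case you want to plug in your own volume. This is a minimal sketch: the function name and the 52/12 weeks-per-month factor (about 4.33) are our own choices for illustration.

```python
WEEKS_PER_MONTH = 52 / 12  # about 4.33, the factor used in the text


def monthly_minutes(weekly_calls: int, mean_call_seconds: float) -> float:
    """Estimated call-minutes per month from weekly volume and mean call length."""
    return weekly_calls * (mean_call_seconds / 60) * WEEKS_PER_MONTH


# 1,400 calls a week at 2 minutes 40 seconds (160 seconds) each.
print(round(monthly_minutes(1400, 160)))
```

The result lands just under the 16,200 quoted above, because 2 minutes 40 is slightly less than the rounded 2.67 minutes.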
Retell's published pricing at the time of writing sits around $0.07 per minute for the base agent (their TTS + STT + LLM passthrough), plus the underlying LLM cost. With GPT-4o-mini on top, you land at roughly $0.10 to $0.13 per minute all-in. At 16k minutes that is $1,600 to $2,100 a month, before SIP trunk and number rental.
ElevenLabs Conversational AI lands in a similar band, $0.08 to $0.15 per minute depending on tier and voice clone usage. Cheaper if you commit to a higher Business plan.
Self-hosted on LiveKit Agents + Whisper-large-v3 (via Groq or your own GPU) + Cartesia Sonic for TTS + Claude or 4o-mini as the brain: the marginal per-minute is roughly $0.03 to $0.05 once you account for STT, TTS, LLM, and LiveKit Cloud overhead. But you eat a fixed engineering cost: two weeks of build, then a few hours a month from someone who can read a SIP log.
The honest version of the table looks like this.
Stack                              €/min    Cost at 16k min/mo   Engineering cost
Retell + GPT-4o-mini               €0.10    €1,600               ~1 day setup
ElevenLabs Conversational          €0.12    €1,900               ~1 day setup
LiveKit + Whisper-v3 + Cartesia    €0.04    €640                 ~2 weeks build, ~€600/mo SIP retainer
The crossover point in our experience is around 12,000 minutes per month. Below that, the engineering time saved by Retell or ElevenLabs is worth more than the per-minute markup. Above 15,000 minutes, the hand-rolled stack starts paying for the engineer it requires.
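The Maastricht client sits at 16k, just past that upper mark, but not by much. On the table's numbers, self-hosting saves about €360 a month over Retell once the €600 retainer is counted, and that takes a long time to repay a two-week build. Either answer is defensible. Cost is not the deciding number.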
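To make the comparison reproducible, here is a rough sketch that takes the per-minute rates and the €600 retainer from the table and computes monthly cost and a raw break-even volume. The `Stack` class and `break_even_minutes` helper are hypothetical names of ours. The sketch deliberately ignores the one-off two-week build, which is one plausible reason the crossover we see in practice sits higher than the raw break-even it prints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    name: str
    per_minute: float     # EUR per call-minute, taken from the table
    fixed_monthly: float  # EUR per month of recurring engineering cost

    def monthly_cost(self, minutes: float) -> float:
        return self.per_minute * minutes + self.fixed_monthly


def break_even_minutes(managed: Stack, self_hosted: Stack) -> float:
    """Monthly minutes at which both stacks cost the same (build cost excluded)."""
    rate_gap = managed.per_minute - self_hosted.per_minute
    if rate_gap <= 0:
        raise ValueError("self-hosted stack has no per-minute advantage")
    return (self_hosted.fixed_monthly - managed.fixed_monthly) / rate_gap


RETELL = Stack("Retell + GPT-4o-mini", 0.10, 0.0)
ELEVENLABS = Stack("ElevenLabs Conversational", 0.12, 0.0)
LIVEKIT = Stack("LiveKit + Whisper-v3 + Cartesia", 0.04, 600.0)

for stack in (RETELL, ELEVENLABS, LIVEKIT):
    print(f"{stack.name}: EUR {stack.monthly_cost(16_000):,.0f}/month")
print(f"Raw break-even vs Retell: {break_even_minutes(RETELL, LIVEKIT):,.0f} min/month")
```

Adding an amortised build cost to `fixed_monthly` pushes the break-even up, so treat the printed figure as a floor rather than a recommendation.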
Barge-in on a Limburg accent
Barge-in is the moment the customer interrupts the agent. It is the single feature that separates a voice agent that feels human from one that feels like an IVR with extra steps.
Three things have to work for barge-in:
- Voice activity detection has to fire on the customer's first syllable, not their third.
- The TTS has to be cancellable mid-sentence with low latency.
- The STT has to catch the interruption even while the agent is still playing.
This is where the accent matters. Standard Dutch VAD is well-tuned. Limburgs has different prosody: longer vowels, a softer "g", different word stress on multi-syllable words. On a generic VAD trained on Polderlands corpora, barge-in fires late or not at all. The customer says "ja maar" and the agent keeps talking for another 1.5 seconds. By call three of a test session, the procurement lead has already decided the thing is not ready.
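The first two requirements can be sketched as a small controller that watches VAD frames while the agent is speaking and cancels playback once the caller has clearly started talking. This is an illustration under stated assumptions, not any vendor's implementation: `BargeInController`, the `cancel_tts` hook, and the default values are hypothetical, and echo cancellation and the STT side are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BargeInController:
    """Cancels agent speech when the caller starts talking over it."""

    cancel_tts: Callable[[], None]  # hypothetical hook into the TTS player
    vad_threshold: float = 0.5      # speech probability that counts as voice
    min_speech_frames: int = 2      # consecutive voiced frames required
    agent_speaking: bool = False
    _voiced_run: int = 0

    def on_agent_speech_start(self) -> None:
        self.agent_speaking = True
        self._voiced_run = 0

    def on_agent_speech_end(self) -> None:
        self.agent_speaking = False
        self._voiced_run = 0

    def on_vad_frame(self, speech_prob: float) -> bool:
        """Feed one VAD frame; returns True if this frame triggered a barge-in."""
        if not self.agent_speaking:
            self._voiced_run = 0
            return False
        if speech_prob >= self.vad_threshold:
            self._voiced_run += 1
        else:
            self._voiced_run = 0  # a quiet frame resets the run
        if self._voiced_run >= self.min_speech_frames:
            self.cancel_tts()
            self.agent_speaking = False
            self._voiced_run = 0
            return True
        return False
```

Requiring two consecutive voiced frames trades a little latency for fewer false cancels on coughs and line noise. If the threshold is too high for a soft onset, the run never starts and the agent keeps talking, which is the failure described above.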
Retell uses Deepgram Nova as STT under the hood (last we checked) and their own VAD layer. Deepgram's Dutch model is acceptable on Standard Dutch and weakens on regional accents. ElevenLabs uses their own ASR and is similarly Polderlands-leaning.
Whisper-large-v3 has noticeably better recall on Limburgs and West-Vlaams in our internal tests, most likely because its training data included more Belgian Dutch variants. The catch: Whisper is not a streaming model, so getting low-latency transcripts takes more work. You need either a fast hosted Whisper endpoint such as Groq's, fed short audio chunks, or self-hosted faster-whisper with chunked decoding, and you have to bring your own VAD. Silero v5 works fine for Dutch variants once you turn the threshold down to about 0.35.
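To see why the threshold matters for latency, here is a toy calculation of when speech is declared for a slowly rising onset. It does not call Silero. The probabilities are hypothetical values made up for illustration (not measured data), and the 32 ms frame length and two-frame confirmation rule are assumptions you should replace with your own VAD's settings.

```python
from typing import Optional, Sequence

FRAME_MS = 32  # assumed frame length; adjust to your VAD's frame size


def onset_latency_ms(
    probs: Sequence[float],
    threshold: float,
    min_frames: int = 2,
    frame_ms: int = FRAME_MS,
) -> Optional[int]:
    """Milliseconds from the first frame until speech is declared, or None if never."""
    run = 0
    for i, p in enumerate(probs):
        run = run + 1 if p >= threshold else 0
        if run >= min_frames:
            return (i + 1) * frame_ms
    return None


# Hypothetical speech probabilities for a soft, slowly rising onset.
soft_onset = [0.05, 0.20, 0.38, 0.42, 0.47, 0.55, 0.71, 0.90]

print(onset_latency_ms(soft_onset, threshold=0.50))  # default-style threshold
print(onset_latency_ms(soft_onset, threshold=0.35))  # lowered threshold
```

On this made-up input the lower threshold declares speech three frames earlier. The trade-off is more false triggers on background noise, so check the lowered value against recordings from the actual shops.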
For the Maastricht client we ran a blind A/B: 40 recorded test calls, half in Standard Dutch, half in Limburgs. Whisper-large-v3 caught the interruption inside 350ms median. Retell caught it inside 700ms on Standard Dutch but 1100ms on Limburgs. Past 800ms, the customer experience falls off a cliff.
That is the number that decided the stack. Not cost. Not vendor logo.
The 03:00 SIP question
The third constraint is the boring one. It is also the one that kills more voice agent projects than anything else.
KPN's 088 number routing runs on SIP through their IMS core. When KPN does a maintenance window (typically a Tuesday or Wednesday between 02:00 and 05:00), they sometimes reroute traffic to a different SBC. If your agent's trunk provider has its peering configured tightly, you get a brief reINVITE storm and a few minutes of failed calls. If it is configured loosely, you get a silent failure where calls connect but audio is one-way.
This actually happens. We have logs.
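The silent case is the one worth alerting on, because the call looks connected. A minimal sketch of such a check, assuming you can sample per-call RTP packet counters from your gateway: the `RtpStats` shape, the labels, and the 50-packet floor (about one second of audio at 20 ms per packet) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtpStats:
    """Per-call RTP counters over a sampling window (hypothetical shape)."""

    packets_sent: int
    packets_received: int


def classify_audio(stats: RtpStats, min_packets: int = 50) -> str:
    """Label a connected call by which directions are actually carrying audio."""
    sending = stats.packets_sent >= min_packets
    receiving = stats.packets_received >= min_packets
    if sending and receiving:
        return "ok"
    if sending:
        return "one_way_inbound_missing"   # we talk, the caller's audio never arrives
    if receiving:
        return "one_way_outbound_missing"  # the caller talks, our audio is not leaving
    return "no_audio"


print(classify_audio(RtpStats(packets_sent=400, packets_received=3)))
```

Paging on a run of calls that are not "ok" within a few minutes gives the on-call a reason to open the SIP trace before the client notices.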
With Retell or ElevenLabs, your escalation path is a support ticket. Both have status pages. Neither has a Dutch-speaking on-call engineer who knows what an 088 number is at 03:00.
With self-hosted LiveKit and a SIP trunk provider you chose (Twilio, Voxbone, or a Dutch reseller like Voys), your team owns the wiring. You can SSH into the SIP gateway, run sngrep, see the reINVITE, and either fix it or wait it out with a known answer ready when the client calls Wednesday morning.
The right question is not "who is more reliable", because all three are about equally reliable in steady state. The right question is: when something breaks at 03:00, do you want a support ticket or a terminal?
If nobody on your team can read a SIP trace, do not pick the self-hosted stack. The per-minute savings disappear the first time you spend a billable day debugging a one-way audio bug.
The scoring sheet we hand to the client
For the Maastricht agency we ended up sending the partner a single page that looked roughly like this. You can copy the structure for your own decisions.
Question                                              Weight   Score 1-5
Do you exceed 12k call-minutes per month?             3        _
Are 20% or more of calls in a regional NL accent?     4        _
Do you have an in-house or retained SIP dev?          4        _
Will the agent take payments or sensitive PII?        3        _
Do you need to ship inside two weeks?                 2        _

Weighted total >40: self-hosted LiveKit likely wins
Weighted total 25-40: Retell or ElevenLabs with a sharp prompt
Weighted total <25: managed platform, spend the time on evals
The agency scored 5 on the SIP question (they have a freelance ops dev on retainer), 4 on the accent question (real Limburg client base), and 3 on the volume question (1,400 weekly, growing). The arithmetic pointed at the self-hosted stack. So that is what we built.
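The sheet's arithmetic can be sketched in a few lines. The weights and thresholds come from the sheet. The dictionary keys and function names are ours, and the sketch assumes each score is already oriented so that a higher number favours self-hosting. The text only gives three of the agency's five scores, so the example sets the other two to the minimum of 1 to show a floor rather than the actual total.

```python
from typing import Dict

WEIGHTS: Dict[str, int] = {
    "volume_over_12k_minutes": 3,
    "regional_accent_share": 4,
    "sip_dev_available": 4,
    "payments_or_pii": 3,
    "ship_inside_two_weeks": 2,
}


def weighted_total(scores: Dict[str, int]) -> int:
    """Sum of weight * score over all five questions; scores must be 1 to 5."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five questions")
    for question, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"score for {question} must be between 1 and 5")
    return sum(WEIGHTS[q] * s for q, s in scores.items())


def recommend(total: int) -> str:
    if total > 40:
        return "self-hosted LiveKit likely wins"
    if total >= 25:
        return "Retell or ElevenLabs with a sharp prompt"
    return "managed platform, spend the time on evals"


# Three scores from the text; the two unknown ones are set to 1 as a floor.
agency_floor = {
    "volume_over_12k_minutes": 3,
    "regional_accent_share": 4,
    "sip_dev_available": 5,
    "payments_or_pii": 1,
    "ship_inside_two_weeks": 1,
}
total = weighted_total(agency_floor)
print(total, "->", recommend(total))
```

Even with the two unknown questions at their minimum, the three known scores alone clear the 40-point line.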
We are not zealots about it. For a different client (a Rotterdam logistics dispatcher, 600 calls a week, Standard Dutch only, no in-house engineering), we shipped on Retell in nine days and never looked back.
What we picked, and why it is not the headline
An HN thread last week made the front page about Rio de Janeiro's "homegrown" LLM, which turned out to be a merge of two existing models. The comments were the usual mix of disappointment and shrug. The part the comments mostly missed is that almost every "we built our own" claim in voice AI is also a merge: Whisper from someone, TTS from someone else, an LLM from a third someone, glued with LiveKit or Pipecat. The honest framing for clients is not "we built this from scratch". It is "we picked these four components on purpose, and here is what each one buys you".
That framing also tells you when not to roll your own. If you cannot explain to a client what each component does and why you chose it, ship on Retell or ElevenLabs and put the engineering time into the prompt, the function calls, and the eval set. The voice agent that wins is the one with the best evals, not the one with the most exotic infra.
For the tile chain we ended up with LiveKit Cloud + Whisper-large-v3 on Groq + Cartesia Sonic Dutch voice + Claude as the brain + a Voys SIP trunk. Total per-minute cost landed at €0.041. Barge-in median latency on Limburgs: 380ms. When KPN ran a maintenance window three weeks after launch, our on-call (a freelance SIP engineer in Eindhoven, retainer €600/mo) saw the reINVITE in sngrep within six minutes of the page and flagged the carrier. Calls were back up before the morning standup.
When we built that voice agent stack with the Maastricht agency for their tile-chain client, the hardest part was not the voice. It was figuring out that the eval set needed forty minutes of recorded calls from each shop, not just one location, because the dialect varies between Sittard and Heerlen more than we expected. That is the actual work: picking the right components for a real business with real customers in real accents. The vendor logo is the last decision, not the first.
The five-minute audit you can do today
Pull the last fifty calls your team handled. Listen to ten of them. Count how often the customer interrupts. Count how often the customer speaks in something other than Standard Dutch. Multiply the average call length in minutes by your weekly volume, then by 4.33 for a monthly figure. That gives you the volume and accent numbers, plus a feel for how much barge-in matters. The third number is really a name: whoever answers the page at 03:00. Whichever stack you pick, pick it from those, not from the vendor demo.
Key takeaway
Voice agent stack picks come down to three numbers: weekly minutes, regional accent, and who reads the SIP trace at 3am. The vendor logo is the last decision, not the first.
FAQ
How much does a Dutch voice agent cost per minute?
At 1,400 weekly calls, managed platforms like Retell or ElevenLabs land at €0.09 to €0.13 per minute all-in. A self-hosted LiveKit + Whisper + Cartesia stack lands at €0.03 to €0.05, plus engineering time.
Does Whisper-large-v3 handle Flemish or Limburg Dutch?
Better than the Dutch STT inside most managed voice platforms. Whisper-large-v3 recognises Belgian Dutch and Limburgs noticeably more accurately than Polderlands-trained models, though it still benefits from a custom eval set per region.
Who handles SIP trunk issues when KPN reroutes the 088 number?
On a managed provider you file a support ticket. With self-hosted LiveKit and your own SIP trunk, you (or a retained SIP engineer) read the trace yourself in sngrep. The right choice depends on whether you have someone who can do that.
What is barge-in and why does it matter for voice agents?
Barge-in is when the customer interrupts the agent mid-sentence. If the agent does not stop talking within roughly 500ms, the conversation feels broken. It is the single feature that separates a good voice agent from a glorified IVR.
When should you not build your own voice stack?
When nobody on the team can read a SIP trace, when monthly call volume is below 12,000 minutes, or when you cannot explain to a client what each component in the stack does and why you chose it.