Voice agents
Voice ordering in three Dutch dialects: a Rotterdam build
Friday night in Rotterdam West, four ovens going, the phone won't stop. Here is how we shipped a voice agent that handles three Dutch dialects without garbling the order.

Friday night, Rotterdam West. Four pizza ovens going, a kapsalon station, two riders waiting at the door. The phone rings sixty times in an hour. A sixteen-year-old new hire picks up, tries to take an order in textbook Dutch from a caller who is plainly speaking Rotterdams, half-shouting over a passing tram. The order goes in wrong. Two kapsalons land at the wrong street. The owner calls us on Monday.
This is the build log for the voice agent we shipped to a four-location Rotterdam restaurant chain over April and May. The brief was simple to say and ugly to execute: pick up the phone, take the order, push it into the POS, and confirm it back in the same dialect the caller used. Three dialects were on the table.
The dialect problem nobody benchmarks
Roughly 40% of the chain's orders still come in by phone. Thuisbezorgd and the chain's own site handle the rest. Phone orders are where the errors cluster, and where labour cost shows up most painfully on a Friday or Saturday night. The owner wanted an agent that could replace one phone-line worker per location during peak.
We scoped three dialects:
- Standaardnederlands. The news-anchor default. Whisper-large-v3 handles this at 4 to 6% word error rate out of the box.
- Rotterdams. Glottal stops, dropped final -n, the ij to ai shift, flattened ui diphthongs. Whisper baseline word error rate on our sample: 31%.
- Turks-Nederlands. Code-switching common in West and South Rotterdam. Words like abi, kanka, and yok drop into otherwise-Dutch sentences mid-order.
Before any tuning we benchmarked five providers on 200 real calls from the client's archive (consented, GDPR-scrubbed, hand-transcribed by a Rotterdam-born intern): Whisper-large-v3, Deepgram Nova-3, AssemblyAI, Azure Speech, and Google Chirp. None of them got below a 25% word error rate on Rotterdams. That number was the whole problem in one figure.
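For readers who want to reproduce this kind of benchmark: word error rate is the word-level edit distance between the hand transcript and the ASR output, divided by the number of words in the hand transcript. The sketch below is a minimal version of that calculation, assuming lowercasing and whitespace tokenisation as the only normalisation (the normalisation used in our own benchmark is not described here, and the function name is illustrative).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word against five reference words.
print(word_error_rate("twee kapsalons met extra saus", "twee kapsalon met saus"))  # 0.4
```

Note that a single wrong word in a five-word order already costs 20 points of WER, which is why short phone utterances are so unforgiving.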
Architecture
The stack we landed on, top to bottom:
- Telephony: Twilio Programmable Voice, with a SIP trunk to the chain's existing PBX as the fallback path.
- Voice activity detection: Silero VAD, with an RNNoise pre-pass for tram and market noise.
- Primary ASR: Whisper-large-v3 with a LoRA adapter trained on Rotterdams.
- Fallback ASR: Deepgram Nova-3. Triggered whenever the primary returns a confidence below 0.72 for a span.
- Intent and slot filling: a structured-output schema for the chain's menu.
- Text to speech: ElevenLabs Turbo v2.5, three custom voices.
- POS: Lightspeed K-Series, REST.
- State machine: XState, one Node worker per call.
The two-ASR setup is the one piece we would defend on its own. Running every audio span through both engines is expensive. Routing only the low-confidence spans through the fallback costs us roughly €0.04 per call on average and recovers about 8% of orders that would otherwise need a human handoff. The maths speaks for itself.
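The routing rule itself is small. Here is a minimal sketch, assuming the primary ASR returns a per-span confidence in the range 0 to 1 and that the fallback engine can be called with a handle to the same audio. The `Span` type, `audio_id` field, and `fallback` callable are hypothetical names for illustration, not the production interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

FALLBACK_THRESHOLD = 0.72  # from the text: below this, the span goes to the fallback


@dataclass
class Span:
    audio_id: str      # hypothetical handle to a chunk of call audio
    text: str          # primary ASR transcript for this span
    confidence: float  # primary ASR confidence, assumed to lie in [0, 1]


def route_spans(spans: List[Span], fallback: Callable[[str], str]) -> List[str]:
    """Keep confident primary transcripts; re-transcribe the rest with the fallback."""
    transcripts = []
    for span in spans:
        if span.confidence < FALLBACK_THRESHOLD:
            # Only low-confidence spans pay for the second engine.
            transcripts.append(fallback(span.audio_id))
        else:
            transcripts.append(span.text)
    return transcripts
```

Because the fallback is only invoked inside the low-confidence branch, the per-call cost scales with how often the primary engine is unsure rather than with call length.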
Latency budget: we hold a 700 ms ceiling between caller pause and agent response. Anything over a second feels broken on a phone call. The biggest single contributor was ElevenLabs TTS warm-up, which we mitigated by pre-rendering the eighty most common confirmation snippets and streaming them while the slot filler finishes the rest.
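The pre-rendering idea can be sketched as a cache lookup in front of live synthesis. This is an illustration under assumptions: confirmations are assumed to be split into text parts upstream, `prerendered` is assumed to map snippet text to ready audio bytes, and `synthesize` stands in for whatever live TTS call is used. None of these names come from a real SDK.

```python
from typing import Callable, Dict, Iterator, List


def stream_confirmation(
    parts: List[str],
    prerendered: Dict[str, bytes],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield audio part by part, using cached audio wherever it exists."""
    for part in parts:
        cached = prerendered.get(part)
        # Cached snippets are yielded immediately; only uncached parts
        # wait on live synthesis.
        yield cached if cached is not None else synthesize(part)
```

Because this is a generator, playback of the first cached snippet can begin before the later, order-specific parts have been synthesised.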
Fine-tuning Whisper on Rotterdams
You cannot ship a Rotterdams ASR without Rotterdams data. There is no clean public corpus. We posted on Marktplaats for paid speakers (€25 per hour, scripted and free-form), collected eight hours of fresh recordings, and combined that with the client's call archive. After augmentation (speed perturbation and room reverb simulation) we had roughly 14 hours of labelled audio.
We tagged the patterns we expected to break the model:
phonetic_rules:
  - id: final_n_drop
    examples: ["lopen -> lope", "kapsalonnen -> kapsalonne"]
  - id: ij_to_ai
    examples: ["fijn -> fain", "kijken -> kaiken"]
  - id: ui_flatten
    examples: ["uitjes -> oitjes", "huis -> hois"]
  - id: glottal_stop_t
    context: "word-final t after a short vowel"
  - id: r_uvular_drop
    context: "Rotterdam West specifically"
We trained a LoRA adapter rather than a full fine-tune. Six epochs on a single A100, about four hours of compute. Rotterdams word error rate dropped from 31% to 9.4%. Standaardnederlands performance stayed within noise of the base model, which mattered: we did not want to trade one dialect for another.
Turks-Nederlands was a different problem. The Whisper base model already transcribes most Turkish loan words correctly. The break point was the downstream slot filler, which would occasionally interpret abi as a name field. We solved that with a small Dutch-Turkish code-switch dictionary loaded into the prompt as a hint, plus a regex pre-pass on the transcript before it hits the structured-output call.
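A minimal sketch of what such a pre-pass can look like, assuming the goal is to mark known Turkish insertions so the slot filler does not read them as field values, and to build a short glossary hint for the prompt. The three-entry dictionary, the `<tr>` tag convention, and the function names are illustrative assumptions; the real dictionary is larger.

```python
import re

# Hypothetical mini-dictionary for illustration only.
CODE_SWITCH = {
    "abi": "form of address (literally 'older brother'), not a customer name",
    "kanka": "form of address ('mate'), not a customer name",
    "yok": "'there is none', used like 'without'",
}

# Whole-word, case-insensitive match on any dictionary term.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CODE_SWITCH)) + r")\b", re.IGNORECASE
)


def tag_code_switches(transcript: str) -> str:
    """Wrap known Turkish insertions in <tr>...</tr> before slot filling."""
    return _PATTERN.sub(lambda m: f"<tr>{m.group(0)}</tr>", transcript)


def prompt_hint(transcript: str) -> str:
    """Build a glossary hint with only the terms that occur in this transcript."""
    found = sorted({m.group(0).lower() for m in _PATTERN.finditer(transcript)})
    return "\n".join(f"{term}: {CODE_SWITCH[term]}" for term in found)


print(tag_code_switches("Abi, twee kapsalons, ui yok"))
# <tr>Abi</tr>, twee kapsalons, ui <tr>yok</tr>
```

The word-boundary anchors matter: without them, "abi" would also match inside ordinary Dutch words such as "kabinet".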
The menu schema is half the work
Voice agents in restaurants succeed or fail on how the menu is modelled. We spent two weeks on the schema alone. The chain's menu has 84 base items, six topping families, and modifier rules that look trivial until you write them down (kapsalon mayo goes on the meat or on the chips, not both unless the caller explicitly asks for both, and the Veluwseweg location uses a different sauce than the others). The schema validates every utterance, and any unresolved ambiguity triggers a clarifying question rather than a guess.
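To make the "clarify rather than guess" rule concrete, here is a minimal sketch for the kapsalon mayo modifier alone. The `KapsalonLine` type, the field names, and the English question strings are assumptions for illustration; the real schema covers all 84 items and the agent asks in the caller's dialect.

```python
from dataclasses import dataclass
from typing import Optional

# "both" is only valid when the caller explicitly asked for both.
VALID_MAYO = {"meat", "chips", "both"}


@dataclass
class KapsalonLine:
    quantity: int
    mayo_on: Optional[str] = None  # None means the caller did not say


def clarifying_question(line: KapsalonLine) -> Optional[str]:
    """Return a question for the caller, or None if the line is fully resolved."""
    if line.quantity < 1:
        return "How many kapsalons would you like?"
    if line.mayo_on not in VALID_MAYO:
        # Unresolved modifier: ask, never default.
        return "Mayo on the meat or on the chips?"
    return None
```

The key design choice is that an unset modifier has no default value, so the only way past validation is an explicit answer from the caller.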
The schema also encodes pricing, which lets the agent quote the total before confirmation. About a third of callers correct their order at that step ("nee, dan toch maar zonder kaas", roughly "no, make it without cheese after all"). That is a third of orders that would have gone in wrong with no quote.
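Quoting the total is plain arithmetic once prices live in the schema. A minimal sketch follows, with placeholder prices that are NOT the chain's real menu, and hypothetical item keys. It uses `Decimal` so that euro amounts do not pick up floating-point rounding.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# Placeholder prices for illustration only; not the chain's real menu.
PRICES: Dict[str, Decimal] = {
    "kapsalon": Decimal("9.50"),
    "extra_kaas": Decimal("1.00"),
}


def quote_total(lines: List[Tuple[str, int]]) -> Decimal:
    """Sum quantity * unit price; raises KeyError for an item not in the schema."""
    return sum((PRICES[item] * qty for item, qty in lines), Decimal("0"))


print(quote_total([("kapsalon", 2), ("extra_kaas", 1)]))  # 20.00
print(quote_total([("kapsalon", 2)]))                     # 19.00 after "zonder kaas"
```

Raising on an unknown item instead of pricing it at zero keeps the quote honest: the agent cannot read back a total for something the schema does not contain.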
Matching the confirmation voice to the caller's dialect
Every order gets read back to the caller before the line drops. In their own dialect.
A caller speaking Rotterdams who hears their order read back in textbook Dutch feels patronised. A caller speaking Standaardnederlands who hears it back in heavy Rotterdams thinks the agent is broken. Register-matching is not cosmetic. It is what makes the confirmation believable.
The first 2 to 3 seconds of the call run through a tiny dialect classifier sitting on top of the Whisper encoder embedding. Three classes, one per dialect, plus an "uncertain" bucket. The classifier picks the ElevenLabs voice for the rest of the call. We cloned three speakers: one news-anchor Standaardnederlands, one Rotterdams native (a fishmonger from the Markthal who took the gig for fun), and one bilingual Turks-Nederlands speaker who works as a translator in the city.
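The decision step on top of the classifier can be sketched as follows. Two things here are assumptions rather than a description of the shipped model: the "uncertain" bucket is modelled as a probability threshold over three class logits (it could equally be a fourth output class), and the 0.6 cut-off and voice identifiers are placeholders.

```python
import math
from typing import Dict, List

DIALECTS = ["standaardnederlands", "rotterdams", "turks_nederlands"]
# Hypothetical voice identifiers; real TTS voice IDs are opaque strings.
VOICES: Dict[str, str] = {d: f"voice_{d}" for d in DIALECTS}
UNCERTAIN_BELOW = 0.6  # assumed cut-off for the "uncertain" bucket


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float]) -> str:
    """Return a dialect label, or "uncertain" if no class is confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return DIALECTS[best] if probs[best] >= UNCERTAIN_BELOW else "uncertain"
```

A call labelled "rotterdams" would then look up `VOICES["rotterdams"]` for the rest of the call; what to do with "uncertain" is a product decision rather than a modelling one.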
Cost per confirmation in production: roughly €0.018 given typical order length. Worth every cent.
Edge cases that ate our time
Three-way calls. A caller speaking Rotterdams whose partner shouts additions in Turks-Nederlands from the next room. Diarisation gives us two transcript tracks; the slot filler merges them with a timestamp-ordered concat and a "multiple speakers" flag on the kitchen ticket, so the cook knows to expect a chaotic pickup.
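The merge step can be sketched in a few lines, assuming each diarised track is a list of utterances with a start time in seconds. The `Utterance` type and its field names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    start: float  # seconds from the start of the call
    speaker: str  # diarisation label, e.g. "S1"
    text: str


def merge_tracks(tracks: List[List[Utterance]]) -> Tuple[str, bool]:
    """Concatenate all utterances in timestamp order and flag multi-speaker calls."""
    merged = sorted((u for track in tracks for u in track), key=lambda u: u.start)
    multiple_speakers = len({u.speaker for u in merged}) > 1
    return " ".join(u.text for u in merged), multiple_speakers
```

The returned boolean is what would drive the "multiple speakers" flag on the kitchen ticket.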
Tram noise. The RET network runs at street level past two of the four locations. Silero VAD on raw audio was triggering false positives on tram brake squeal. RNNoise as a pre-pass halved the false-positive rate without measurable damage to recall.
Drunk callers after 22:00 on weekends. Slurring breaks the dialect classifier (everything routes to "uncertain"). We added a time-of-day rule: between 22:00 and 03:00, if classifier confidence stays below 0.6 across the first window, default to Standaardnederlands and politely ask the caller to repeat. It works because drunk callers do not notice register matching anyway.
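The only subtle part of that rule is that the window wraps past midnight. A minimal sketch, assuming the classifier emits one confidence value per frame in the first window (the function names are illustrative):

```python
from datetime import time
from typing import List

LATE_START = time(22, 0)
LATE_END = time(3, 0)
MIN_CONFIDENCE = 0.6


def in_late_window(now: time) -> bool:
    """True from 22:00 up to, but not including, 03:00."""
    # The window crosses midnight, so it is an OR of two half-ranges.
    return now >= LATE_START or now < LATE_END


def late_night_override(now: time, window_confidences: List[float]) -> bool:
    """True if the call should default to Standaardnederlands and ask for a repeat."""
    stayed_low = bool(window_confidences) and all(
        c < MIN_CONFIDENCE for c in window_confidences
    )
    return in_late_window(now) and stayed_low
```

Writing the window check as `start <= now < end` is the classic bug here: with a start of 22:00 and an end of 03:00 that condition is never true.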
Mid-call hangups. We persist the partial order in Redis for 90 seconds, keyed by caller ID. If they call back inside that window the agent opens with "U was bezig met drie kapsalonne, klopt dat nog?" (roughly "You were in the middle of ordering three kapsalons, is that still right?"). Roughly 4% of completed orders go through this recovery path.
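The recovery logic can be sketched without a Redis server by using an in-memory stand-in with the same 90-second expiry. In Redis itself the write is a single `SET <key> <json> EX 90`. The class name, the injected clock, and the choice to consume the entry on resume are assumptions for illustration.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple

TTL_SECONDS = 90


class PartialOrderStore:
    """In-memory stand-in for a Redis key with a 90-second expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def save(self, caller_id: str, order: dict) -> None:
        """Store the partial order together with its expiry time."""
        expires_at = self._clock() + TTL_SECONDS
        self._data[caller_id] = (expires_at, json.dumps(order))

    def resume(self, caller_id: str) -> Optional[dict]:
        """Return and remove the partial order if it has not expired yet."""
        entry = self._data.pop(caller_id, None)
        if entry is None or self._clock() >= entry[0]:
            return None
        return json.loads(entry[1])
```

On a callback, a non-`None` result from `resume` is what would trigger the "klopt dat nog?" opening instead of the normal greeting.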
The handover-to-human path
An agent that cannot say "I am going to put you through to a person" loses trust the first time it gets something wrong. We built three handover triggers: the caller saying any variant of "kan ik een mens spreken" ("can I speak to a person"), the dialect classifier returning "uncertain" twice in a row, and the slot filler returning a confidence below 0.55 on any required field. All three route to the on-shift manager's mobile, with a one-line context blurb sent as SMS at the moment of handover: "Klant belt over een eerdere bestelling, agent verstond locatie niet." ("Customer is calling about an earlier order, agent could not make out the location.") The blurb saves the manager a minute of "what were we talking about" each time.
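The three triggers reduce to one small decision function. The sketch below rests on assumptions: the regular expression is a deliberately narrow stand-in for "any variant" of the request, and the argument shapes and reason strings are illustrative rather than the production interfaces.

```python
import re
from typing import Dict, List, Optional

SLOT_THRESHOLD = 0.55  # from the text: below this on a required field, hand over

# Assumed pattern; the production matcher for "any variant" is broader.
HUMAN_REQUEST = re.compile(r"\b(mens|medewerker)\b.*\bspreken\b", re.IGNORECASE)


def handover_reason(
    utterance: str,
    dialect_history: List[str],
    required_slot_confidence: Dict[str, float],
) -> Optional[str]:
    """Return which trigger fired, or None to keep the agent on the call."""
    if HUMAN_REQUEST.search(utterance):
        return "caller_asked_for_human"
    # Two consecutive "uncertain" results from the dialect classifier.
    if dialect_history[-2:] == ["uncertain", "uncertain"]:
        return "dialect_uncertain_twice"
    low = sorted(
        name for name, conf in required_slot_confidence.items()
        if conf < SLOT_THRESHOLD
    )
    if low:
        return "low_slot_confidence:" + ",".join(low)
    return None
```

Returning a reason string rather than a bare boolean gives the SMS blurb something to be built from, for example naming the field the agent could not resolve.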
About 11% of calls hit one of these triggers. The handover blurb has cut average manager-call time by roughly 40 seconds, which the owner mentioned unprompted in week four. That detail mattered to him more than any of our top-line accuracy numbers.
Results after six weeks live
The agent has been answering all four lines since mid-April. The numbers from the client's POS audit:
- 73% of phone orders complete through the agent without human handoff.
- Average call length: 1 minute 12 seconds, down from 2 minutes 40 seconds when a human took the call.
- Order accuracy at delivery: 96.4%, compared with 91% for human-taken phone orders in the same window a year earlier.
- Two locations have stopped staffing a dedicated phone person during Friday and Saturday peak.
The 91% accuracy figure for human-taken orders surprised the owner more than any of our numbers. People assume humans are the accuracy baseline. They are not, especially at 21:30 on a Saturday with four pizzas in the oven and a queue at the counter.
What we would do differently
We spent too long on Whisper fine-tuning before benchmarking the fallback router. If we had wired the two-ASR confidence routing in first, we would have hit a shippable error rate earlier and bought ourselves time to widen the dialect set. The fine-tune still mattered, but the routing logic is what made the system production-ready.
We also under-budgeted GDPR work. Dutch DPA guidance on inbound call recording is specific about consent prompts, retention windows, and offering a non-recorded line. We built the consent prompt as the first turn of every call and route declines to a human queue, which the client's privacy officer signed off on. It added a week we had not planned for.
The closing piece
When we built this voice agent for the Rotterdam chain, the thing we kept running into was the gap between what general-purpose speech recognition advertises and what it actually does with the way a real city speaks. We solved it with a hybrid stack: fine-tune what you can, route around what you cannot, and never let the model confirm an order it is less than 90% sure about.
The smallest useful thing you can do today, if you sit on a phone-heavy operation: pull twenty real call recordings, run them through Whisper-large-v3 as-is, and read the transcripts side by side with the actual orders. The error pattern will tell you whether a voice agent is a tooling problem or a training problem before you spend a euro.
Key takeaway
Match the confirmation voice to the caller's dialect. A Rotterdammer hearing news-anchor Dutch feels patronised, and patronised callers do not correct mistakes.
FAQ
Can off-the-shelf speech recognition handle Dutch dialects?
Standaardnederlands transcribes well at 4 to 6% word error rate. Rotterdams and similar regional dialects come in around 25 to 35% WER, which is unusable for taking orders without targeted fine-tuning.
How long does it take to fine-tune Whisper on a Dutch dialect?
With 10 to 15 hours of labelled audio, a LoRA adapter, and a single A100, you can drop dialect WER from above 30% to under 10% in about four hours of compute. Data collection is the long part.
What is the riskiest part of shipping a voice ordering agent?
Hallucinated order confirmations. The agent has to read every item back to the caller before the line drops, and it has to do that in a register the caller trusts. Skip that step and orders go to the wrong address.
Do you need consent prompts on recorded phone orders in the Netherlands?
Yes. Dutch DPA guidance requires a clear consent prompt at the start of the call, a stated retention window, and an option to continue without recording. Build it as the first turn of the call flow.