Stress-testing voice agents: a Dutch sceptic and a judge

An outbound voice agent is ready to call its first real customer on Monday. Before it does, we put it on the phone with a Dutch sceptic and a judge model.

Jacob Molkenboer · Founder · A Brand New Company · 6 Jun 2026 · 8 min

It is Sunday night. An outbound voice agent that we built for a Dutch logistics client is supposed to start phoning warehouse managers on Monday at 09:00 to confirm pickup windows. The happy path works. We have heard the agent run through the script forty times with our own ops lead on the other end, playing nice. Nobody on the team wants to be the one who explains, on Tuesday, that the first real call went sideways because a manager in Tilburg said something the agent had never seen in testing.

So before the agent makes a real call, it spends the weekend on the phone with a different kind of customer. A Dutch sceptic, played by a second model. Every transcript scored by a third. A failing score blocks the Monday deploy.

Why happy-path tests miss the worst failures

Outbound voice agents fail in a specific way. They sound competent until they meet a person who is not in a hurry to agree with them. A warehouse manager who has been at the dock since 05:30 is not going to let an unfamiliar voice walk him through a five-step confirmation. He interrupts. He asks why. He gives one-word answers that are technically yes but mean no. He pauses for nine seconds and then says one short sentence that the agent has to handle without losing the thread.

None of that shows up in scripted QA. The ops lead who tested the agent forty times has a kind voice and a stake in the agent working. He is the wrong adversary. The right adversary is bored, busy, and slightly suspicious of the whole concept. The kind of person who, in another tab, is reading the Hacker News thread asking why the HN crowd is so anti-AI and quietly agreeing with the top comment.

The rig in three parts

We red-team outbound agents with a three-part rig before the first real call. A transcript-replay layer that drives the agent without burning telephony minutes. A sceptic persona that plays the worst-case customer in the agent's target language. A judge model that scores every transcript on five dimensions and fails the run if any one of them falls below its floor or the weighted total falls below the pass mark.

The whole loop runs in CI. A failing run blocks the deploy.

The transcript-replay layer

Real phone calls are expensive and slow to iterate. For the red team we strip the audio stack and run the agent as text-in, text-out. The same system prompt, the same tools, the same memory, the same model. Just without TTS and STT in the path.

This catches roughly 80% of the failures that would show up on a real call, at maybe 1% of the cost. The remaining 20% are audio-specific. Barge-in handling. Silence detection. Accents the STT model has not heard. Those need a real audio rig. The transcript rig lets you delay the audio rig until the logic is solid.

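import asyncio
from rig import outbound_agent, sceptic, judge

MAX_TURNS = 20  # hard cap so a stuck conversation cannot run forever

async def run_red_team(scenario):
    """Play one text-only call between agent and sceptic, then score it."""
    history = []  # ordered (speaker, text) tuples
    agent = outbound_agent.start(scenario)
    customer = sceptic.start(scenario)

    for _ in range(MAX_TURNS):
        line = await agent.say(history)
        history.append(("agent", line))
        # The sceptic can end the call at any agent turn.
        if customer.wants_to_hang_up(line):
            break
        reply = await customer.respond(history)
        history.append(("customer", reply))

    # The judge sees the whole transcript, including an abrupt ending.
    return await judge.score(history, scenario)

# Run a single scenario with: asyncio.run(run_red_team(scenario))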
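
To try the loop without any model behind it, the participants can be swapped for scripted stand-ins. This is a minimal sketch, not our rig: ScriptedAgent, ScriptedSceptic and replay are hypothetical names, and the Dutch lines are made up for the example. It only shows the turn-taking and the hang-up path.

```python
import asyncio

class ScriptedAgent:
    """Hypothetical stand-in for the real agent: replays canned lines."""
    def __init__(self, lines):
        self._lines = list(lines)
        self._turn = 0

    async def say(self, history):
        # Repeat the last canned line if the script runs out.
        line = self._lines[min(self._turn, len(self._lines) - 1)]
        self._turn += 1
        return line

class ScriptedSceptic:
    """Hypothetical stand-in for the sceptic: hangs up on a trigger word."""
    def __init__(self, hang_up_on):
        self._hang_up_on = hang_up_on

    def wants_to_hang_up(self, line):
        return self._hang_up_on in line

    async def respond(self, history):
        return "Waarom?"

async def replay(agent, customer, max_turns=20):
    """Same turn-taking as the loop above, minus the judge."""
    history = []
    for _ in range(max_turns):
        line = await agent.say(history)
        history.append(("agent", line))
        if customer.wants_to_hang_up(line):
            break
        history.append(("customer", await customer.respond(history)))
    return history

agent = ScriptedAgent([
    "Goedemorgen, ik bel over de ophaling.",
    "Het depot is dinsdag gesloten.",
])
customer = ScriptedSceptic(hang_up_on="gesloten")
transcript = asyncio.run(replay(agent, customer))

for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

The transcript ends on an agent line, which is how a hang-up shows up in the history that the judge later reads.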

The Dutch sceptic persona

The second component is the part where most teams cut corners. They write a single "rude customer" prompt and ship. That is not enough.

A useful sceptic prompt does five things. It interrupts on a fixed schedule. It refuses the first answer at least once per call. It demands a source for any number the agent quotes. It throws in regional details the agent should not know. And it ends the call abruptly if it detects a hallucination, so the rig can score "lost the customer" as a real outcome rather than as an incomplete run.

We pin the sceptic in Dutch because most of our voice agents run in Dutch and the failure modes are language-specific. A Dutch sceptic asks "waarom?" in a way that an English-prompted model translating to Dutch will not. The local register matters. Sceptics in the wild do not perform scepticism. They are just busy.

persona: dutch-warehouse-manager
language: nl-NL
register: direct
energy: medium-high, slightly impatient
behaviors:
  - interrupt after 8 seconds of agent monologue
  - reply "ja" when the answer should be "nee" (about 1 in 10 turns)
  - ask "waarom?" the first time the agent quotes a fact
  - demand a source for any number
  - end the call if the agent contradicts an earlier tool output
forbidden:
  - long sentences
  - politeness rituals after the first turn
  - explaining what you do for a living

The judge model

The third component is the judge. A third model reads the transcript after the call and scores it on five dimensions. Five because three is too coarse for a useful signal and ten is too much for a human to read at a glance.

A judge model is not a perfect reviewer. There is a body of work on judge bias toward longer answers, toward the model's own family, toward confident tone over correct content. The original LLM-as-a-Judge paper by Zheng et al. is still the right starting point on this. We mitigate the bias two ways. We use a different vendor's model for the judge than for the agent. And we calibrate the judge on a hand-labelled set of fifty transcripts before we trust its score on a new run.

Warning

Never use the same model family for the agent and the judge. The judge will quietly score its own family's outputs higher and you will pass a deploy that should have failed.

{
  "dimensions": [
    { "name": "factuality",            "weight": 0.30, "fail_below": 0.70 },
    { "name": "scope_discipline",      "weight": 0.25, "fail_below": 0.80 },
    { "name": "tone_match",            "weight": 0.15, "fail_below": 0.60 },
    { "name": "handoff_clarity",       "weight": 0.20, "fail_below": 0.70 },
    { "name": "interruption_recovery", "weight": 0.10, "fail_below": 0.50 }
  ],
  "pass_if": "weighted_score >= 0.75 AND no_dimension_below_floor"
}
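
The pass rule is easy to get subtly wrong, so here is a minimal sketch of how it could be evaluated. The weights, floors and 0.75 pass mark are copied from the rubric above. The gate function and the shape of the scores dict are assumptions for the example, and every score except the 0.42 is made up.

```python
# (name, weight, fail_below), copied from the rubric above.
RUBRIC = [
    ("factuality", 0.30, 0.70),
    ("scope_discipline", 0.25, 0.80),
    ("tone_match", 0.15, 0.60),
    ("handoff_clarity", 0.20, 0.70),
    ("interruption_recovery", 0.10, 0.50),
]
MIN_WEIGHTED_SCORE = 0.75

def gate(scores):
    """Apply both conditions: the weighted total and the per-dimension floors."""
    weighted = sum(weight * scores[name] for name, weight, _ in RUBRIC)
    below_floor = [name for name, _, floor in RUBRIC if scores[name] < floor]
    return {
        "weighted_score": round(weighted, 4),
        "below_floor": below_floor,
        "passed": weighted >= MIN_WEIGHTED_SCORE and not below_floor,
    }

# Illustrative scores: one bad factuality score, everything else strong.
scores = {
    "factuality": 0.42,
    "scope_discipline": 0.90,
    "tone_match": 0.90,
    "handoff_clarity": 0.90,
    "interruption_recovery": 0.90,
}
result = gate(scores)
print(result)
```

With these numbers the weighted total is 0.756, just above the pass mark, and the run still fails on the factuality floor. That is the reason the floors exist: a weighted average alone would let strong scores elsewhere hide a hallucination.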

A failure the rig caught

On the logistics agent, the rig flagged a failure that would have been awkward on a live call. The agent had a tool that returned the next available pickup slot. The tool worked. But when the sceptic asked "waarom is dat het eerste moment?", the agent invented a reason. It said the depot was closed for inventory on Tuesday afternoon. There was no such inventory.

The judge flagged it under factuality at 0.42. The transcript made the failure obvious. The fix was to add one line to the system prompt: if the tool does not return a reason, the agent says "dat is wat het systeem mij geeft, ik kan een collega laten terugbellen als u de exacte reden wilt weten" and offers a human handoff.

That fix took ten minutes. Finding the same failure on live calls would have taken weeks and probably one angry email to the client.
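
Once a failure like this is known, it can also be pinned as a cheap regression check that runs next to the judge. A minimal sketch, assuming a hypothetical tool output shaped as a dict with an optional "reason" field. The substring match is deliberately crude and does not replace the judge.

```python
# Opening of the fallback sentence from the system-prompt fix, lowercased.
FALLBACK = "dat is wat het systeem mij geeft"

def reason_is_grounded(agent_reply, tool_output):
    """True if the reply quotes the tool's reason or falls back honestly.

    tool_output is a hypothetical dict; "reason" may be missing or None.
    """
    reply = agent_reply.lower()
    reason = tool_output.get("reason")
    if reason:
        return reason.lower() in reply
    return FALLBACK in reply

tool_output = {"slot": "woensdag 10:00", "reason": None}

invented = "Het depot is dinsdagmiddag gesloten voor inventaris."
fixed = ("Dat is wat het systeem mij geeft, ik kan een collega laten "
         "terugbellen als u de exacte reden wilt weten.")

print(reason_is_grounded(invented, tool_output))
print(reason_is_grounded(fixed, tool_output))
```

A check like this only catches the exact failure it was written for. The judge stays responsible for the hallucinations nobody has seen yet.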

What the five dimensions actually mean

Factuality. Did the agent say anything that is not in the tool output, the knowledge base, or basic shared world knowledge? This is the dimension that catches hallucinated reasons, invented prices, and confident-sounding nonsense about company policy.

Scope discipline. Did the agent talk about anything outside the script? The cheapest way to lose a customer's trust is to volunteer an opinion on something you should not have an opinion on. A pickup-confirmation agent that drifts into a comment on fuel prices has already lost.

Tone match. Did the agent sound like the brand, at the energy level the customer is using? Dutch B2B logistics is direct. An agent that opens with "I hope you are having a wonderful morning" has misread the room before it has said anything else.

Handoff clarity. When the agent decides it cannot handle the call, does it say so cleanly, with a clear next step? "Een collega belt u binnen het uur terug" beats "let me see if I can find someone who can help with that" by a wide margin.

Interruption recovery. When the customer cuts the agent off, does the agent acknowledge the interruption and respond to what was actually said, or does it finish its previous sentence as if nothing happened? This is the dimension that separates a passable voice agent from one a real person can stand to talk to.
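
These definitions have to reach the judge as a prompt, and its reply has to be checked before it feeds the gate. A minimal sketch of both halves. The prompt wording, the function names and the JSON reply format are assumptions for the example, not the rubric we ship.

```python
import json

# One-line summaries of the five dimensions described above.
DIMENSIONS = {
    "factuality": "Everything the agent says is backed by tool output, "
                  "the knowledge base, or basic world knowledge.",
    "scope_discipline": "The agent stays inside the script.",
    "tone_match": "The agent sounds like the brand, at the customer's energy level.",
    "handoff_clarity": "When the agent hands off, it says so with a clear next step.",
    "interruption_recovery": "After being cut off, the agent responds to what was said.",
}

def build_judge_prompt(transcript):
    """Render the rubric and the (speaker, text) transcript into one prompt."""
    rubric = "\n".join(f"- {name}: {text}" for name, text in DIMENSIONS.items())
    dialogue = "\n".join(f"{speaker}: {text}" for speaker, text in transcript)
    return (
        "Score the agent on each dimension from 0.0 to 1.0.\n"
        f"{rubric}\n"
        "Reply with one JSON object mapping dimension name to score.\n\n"
        f"{dialogue}"
    )

def parse_judge_reply(reply):
    """Reject replies that skip a dimension or score outside 0.0 to 1.0."""
    scores = json.loads(reply)
    if not isinstance(scores, dict):
        raise ValueError("judge reply is not a JSON object")
    missing = sorted(set(DIMENSIONS) - set(scores))
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    parsed = {name: float(scores[name]) for name in DIMENSIONS}
    for name, value in parsed.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    return parsed

prompt = build_judge_prompt([("agent", "Goedemorgen."), ("customer", "Waarom?")])
print(prompt)
```

Treating a malformed judge reply as a failed run, rather than retrying until it parses, keeps the gate honest.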

Where the rig fits in the pipeline

For a new outbound agent, the full pipeline we run looks like this.

1. Build agent against the scripted happy path.
2. Run twenty scripted scenarios in the transcript rig.
3. Run twenty adversarial scenarios with the sceptic persona.
4. Pass the judge gate at 0.75 weighted score, no dimension below its floor.
5. Move to a real audio rig with synthetic voices on the STT and TTS round-trip.
6. Five real calls to internal numbers, recorded.
7. Pilot with one segment of real customers, hand-monitored.
8. Full rollout.

Steps 1 through 4 run in CI on every change to the prompt or the toolset. We do not move to step 5 until the gate passes. There are off-the-shelf eval scaffolds you can adapt for this. The OpenAI evals repo is a reasonable starting point to fork and rewrite in your customer's language. The difference, in practice, is between catching the inventory hallucination on Sunday night and catching it on Tuesday morning from a warehouse manager in Tilburg.
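
In CI, the gate comes down to an exit code. A minimal sketch of that wiring, assuming a hypothetical run_one coroutine that plays one scenario and returns a dict with a "passed" key. Here it is faked so the example runs on its own, and the scenario names are made up.

```python
import asyncio
import sys

async def run_gate(scenarios, run_one):
    """Run every scenario concurrently and return the names that failed."""
    results = await asyncio.gather(*(run_one(name) for name in scenarios))
    return [name for name, result in zip(scenarios, results) if not result["passed"]]

def exit_code(failed):
    """Non-zero blocks the deploy in most CI systems."""
    return 1 if failed else 0

async def fake_run_one(scenario):
    # Hypothetical stand-in for one agent-vs-sceptic run plus the judge gate.
    return {"passed": scenario != "hallucination-bait"}

def main():
    # Call this from the CI entry point; it ends the process with the gate result.
    scenarios = ["refusal", "interruption", "hallucination-bait"]
    failed = asyncio.run(run_gate(scenarios, fake_run_one))
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    sys.exit(exit_code(failed))

failed = asyncio.run(run_gate(["refusal", "interruption", "hallucination-bait"], fake_run_one))
print(failed, exit_code(failed))
```

One failed scenario is enough to fail the job. Averaging across scenarios would reintroduce the problem the per-dimension floors were added to prevent.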

Calibrating the judge

One last note on the judge. The first time you run the rig, the judge will be wrong. It will pass runs that a human reviewer would fail and fail runs that a human reviewer would pass. The fix is not to swap the model. The fix is to label fifty real or synthetic transcripts by hand, run the judge against them, and tune the rubric until the judge agrees with the human on at least 85% of the pass/fail verdicts.

This takes a long afternoon. You do it once per agent. After that, you only re-calibrate when the agent's domain changes or you change the judge model. Without this step, you have a number that looks rigorous and means nothing.
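
The agreement number itself is a few lines of arithmetic. A minimal sketch, assuming both label sets are lists of booleans in the same transcript order. The function name and the example labels are made up. It also splits the disagreements by direction, because a judge that passes what a human would fail is the more expensive mistake.

```python
def calibration_report(human, judge, target=0.85):
    """Compare judge pass/fail verdicts against human labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    return {
        "agreement": agreement,
        "calibrated": agreement >= target,
        # Judge passed a transcript the human failed.
        "false_passes": [i for i, (h, j) in enumerate(pairs) if j and not h],
        # Judge failed a transcript the human passed.
        "false_fails": [i for i, (h, j) in enumerate(pairs) if h and not j],
    }

# Illustrative labels: 50 transcripts, 30 human passes and 20 human fails.
human = [True] * 30 + [False] * 20
judge = list(human)
for i in (0, 1, 2):          # judge is too strict on three transcripts
    judge[i] = False
for i in (30, 31, 32, 33):   # judge is too lenient on four transcripts
    judge[i] = True

report = calibration_report(human, judge)
print(report)
```

Here the judge agrees on 43 of 50 transcripts, which is 86% and clears the bar. The four false passes are the transcripts to read first when tuning the rubric.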

When we built the outbound voice agent for the logistics client referenced above, the first version of our sceptic was too polite and missed the inventory hallucination on three runs in a row. We rewrote it flatter and busier and it caught the failure on the next pass. If you ship outbound voice agents and have not put your last one through a sceptic and a judge, the audit worth running this week is this: write the sceptic in your customer's actual language, score on factuality first, and gate the deploy on the result.

Key takeaway

Before any outbound voice agent calls a real customer, run it against a sceptic in the customer's language and gate the deploy on a separate judge model's score.

FAQ

Can the agent model and the judge model be the same?

No. A judge tends to score outputs from its own model family higher. Use a different vendor for the judge, or at minimum a different model family, and calibrate against human labels.

How many adversarial scenarios is enough?

Twenty is a reasonable floor for a single-purpose outbound agent. Each scenario should target a different failure mode: refusal, interruption, hallucination bait, scope drift, sudden language switch.

Does the transcript rig replace real-call testing?

No. It gates it. The transcript rig catches the logic failures cheaply. You still need a real audio rig and a small pilot before full rollout to catch barge-in, silence, and STT failures.

What if the agent runs in a language other than Dutch?

Pin the sceptic in that language and write the persona in-language. A translated sceptic prompt misses the register and idiom that real customers use to push back.

voice agents · ai agents · tooling · operations · automation
