Chat agent benchmarks: four numbers that decide if it ships
Before we put a chat agent on a client's site, we run it against the real support team. Four numbers decide whether it ships, and one of them is the silent killer.

It is a Tuesday in May. The client demo is on Friday. The pilot chat agent has been running in shadow mode for three weeks, drafting replies on every real ticket while the human team still answers the customer. The client's head of support asks the question that always gets asked. "Is it actually any good?"
If you cannot answer that with four numbers, you do not get to ship. We have learned this the slow way.
The four numbers
We benchmark every chat agent against the actual support team it would replace or assist, on the same tickets, over the same window. Four numbers decide ship or don't ship.
- Resolution rate
- Quality score
- Time to first useful reply
- Wrong-confident rate
The first three are obvious. The fourth is the one that gets agents pulled from production three months in.
Resolution rate
The share of conversations the agent closed without escalation, where the customer did not reopen the ticket within seven days.
The trap is that resolution rate is trivial to game. Escalate aggressively and every hard question goes to a human, so the tickets the agent keeps look clean. Close slow conversations early and the customer simply opens a new one next week. We guard against both in the metric: escalation above 40% of inbound tickets triggers a separate gate, and reopens count as failures.
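A minimal sketch of that definition, assuming a hypothetical `Ticket` record with three boolean fields (the field names are illustrative, not taken from a real ticketing system). It reports the resolution rate and the escalation gate side by side, so an agent cannot improve one by gaming the other.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical record shape; field names are illustrative.
    escalated: bool            # agent handed the ticket to a human
    closed_by_agent: bool      # agent closed it without escalation
    reopened_within_7d: bool   # customer came back within seven days


def resolution_report(tickets, max_escalation=0.40):
    """Return resolution rate, escalation rate, and the escalation gate."""
    n = len(tickets)
    if n == 0:
        raise ValueError("need at least one ticket")
    # A ticket counts as resolved only if the agent closed it, did not
    # escalate it, and the customer did not reopen it within seven days.
    resolved = sum(
        t.closed_by_agent and not t.escalated and not t.reopened_within_7d
        for t in tickets
    )
    escalation_rate = sum(t.escalated for t in tickets) / n
    return {
        "resolution_rate": resolved / n,
        "escalation_rate": escalation_rate,
        "escalation_gate_ok": escalation_rate <= max_escalation,
    }
```

Running the same function over the human team's tickets for the baseline window gives the comparison number on an identical definition.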
The baseline matters more than the headline. We measure the human team on the same definition first, for two weeks. An agent that hits 78% when the human team hits 81% is a perfectly good agent. An agent that hits 95% because it never escalates is broken.
Quality score
A 1-to-5 grade applied to a stratified sample of 200 closed conversations, graded in two passes. First, an LLM-as-judge scores each conversation against a rubric covering helpfulness, accuracy, tone, and completeness. Second, a human spot-checks 30 of those to calibrate the judge.
The judge model must not be the same model that wrote the answer. We use a different model family for the judge to reduce self-preference bias, where a judge favours answers written by its own model. The failure mode is well documented in public evaluation research, and the original LLM-as-a-Judge paper describes it as self-enhancement bias. We also anchor the rubric with three real examples per score so the judge cannot drift mid-pilot. If the judge drifts, the score drifts, and you start shipping garbage behind a green dashboard.
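The stratified draw can be sketched as below. The article does not say which attribute defines the strata, so ticket topic is an assumption here, as are the function name and the proportional allocation rule. The fixed seed is only there so the same week can be redrawn identically.

```python
import random
from collections import defaultdict


def stratified_sample(conversations, key, total=200, seed=0):
    """Draw about `total` items, allocated proportionally across strata."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for conv in conversations:
        strata[key(conv)].append(conv)

    population = len(conversations)
    sample = []
    for items in strata.values():
        # Proportional share, at least one per stratum, never more than exist.
        share = max(1, round(total * len(items) / population))
        sample.extend(rng.sample(items, min(share, len(items))))
    return sample
```

Because each stratum's share is rounded separately, the result can land a few items above or below `total` when the strata do not divide evenly.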
Time to first useful reply
Not time to first reply. Anyone can fire off "thanks, looking into this" in 200 milliseconds. We measure time to the first message that contains a concrete answer, an actionable instruction, or a clarifying question that actually moves the ticket forward.
For a human team this typically lands in the 4 to 12 minute range during business hours and stretches to hours overnight. For a chat agent it should land under 8 seconds. If it does not, something is architecturally wrong, almost always a slow retrieval call that blocks the reply and did not need to be synchronous.
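One way to compute the metric, assuming each message already carries a `useful` label from a grader or classifier. The `Message` shape and the label itself are hypothetical. The point of the sketch is that holding replies such as "thanks, looking into this" are skipped rather than counted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    sender: str        # "customer" or "agent"
    sent_at_s: float   # seconds since the ticket was opened
    useful: bool       # assumed label: answer, instruction, or real clarifying question


def time_to_first_useful_reply(messages) -> Optional[float]:
    """Seconds from the first customer message to the first useful agent message."""
    ordered = sorted(messages, key=lambda m: m.sent_at_s)
    start = next((m.sent_at_s for m in ordered if m.sender == "customer"), None)
    if start is None:
        return None
    for m in ordered:
        if m.sender == "agent" and m.useful and m.sent_at_s >= start:
            return m.sent_at_s - start
    return None  # the ticket never received a useful reply
```

Tickets that return `None` are worth counting separately, since dropping them silently would flatter the median.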
Wrong-confident rate
This is the number nobody talks about until it bites.
The share of agent replies that were factually wrong, delivered with no hedging, no "I think," no "let me check." A confident wrong answer is worse than a confident "I do not know." A customer who gets a confident wrong answer trusts the next confident wrong answer too, and you end up apologising for two errors instead of one.
We measure this by sampling 100 closed conversations per week, labelling each agent reply for correctness, and crossing the label with a tone classifier that flags hedging language. Anything above 2% is a fail.
A confident wrong answer is a refund, a regulatory complaint, or a churn event waiting to happen. We have killed pilots over a wrong-confident rate of 4%.
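The crossing of correctness labels with hedging can be sketched as follows. The keyword pattern is only a stand-in for the tone classifier, and both the phrase list and the function names are assumptions; a production classifier would be considerably more careful.

```python
import re

# Illustrative hedge phrases only; a real tone classifier would replace this.
HEDGE_PATTERN = re.compile(
    r"\b(i think|i believe|let me check|not sure|i do not know|i don't know|might)\b",
    re.IGNORECASE,
)


def is_hedged(reply: str) -> bool:
    """True if the reply contains any of the illustrative hedge phrases."""
    return bool(HEDGE_PATTERN.search(reply))


def wrong_confident_rate(labelled_replies):
    """labelled_replies: iterable of (reply_text, is_correct) pairs."""
    replies = list(labelled_replies)
    if not replies:
        raise ValueError("need at least one labelled reply")
    # Count replies that are wrong AND carry no hedging language.
    flagged = sum(
        (not is_correct) and not is_hedged(text) for text, is_correct in replies
    )
    return flagged / len(replies)
```

A wrong answer that says "let me check" does not count here. It is still an error, but it shows up in the quality score rather than in this gate.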
The shadow-mode setup
We run shadow mode for two weeks before any human grading begins. The agent sees every real inbound ticket and produces a draft reply, but the human team's actual response is what goes to the customer. This gives us a paired sample (same ticket, agent answer plus human answer), zero risk to live customers, and a natural baseline for the latency metric.
The scoring code itself is small. The point is that it is reproducible and that you run it weekly.
from dataclasses import dataclass
from statistics import median


@dataclass
class Conversation:
    ticket_id: str
    agent_reply: str
    human_reply: str
    agent_latency_ms: int
    closed_by_agent: bool
    reopened_within_7d: bool
    quality_agent: float  # 1.0 to 5.0
    quality_human: float  # 1.0 to 5.0
    wrong_and_confident: bool


def benchmark(sample):
    n = len(sample)
    if n == 0:
        raise ValueError("benchmark needs at least one conversation")
    # Resolved = closed by the agent and not reopened within seven days.
    resolution = sum(
        c.closed_by_agent and not c.reopened_within_7d for c in sample
    ) / n
    # Mean agent grade minus mean human grade on the same tickets.
    quality_delta = (
        sum(c.quality_agent for c in sample) / n
        - sum(c.quality_human for c in sample) / n
    )
    # Median agent latency, converted from milliseconds to seconds.
    p50_latency_s = median(c.agent_latency_ms for c in sample) / 1000
    wrong_confident = sum(c.wrong_and_confident for c in sample) / n
    return {
        "resolution_rate": resolution,
        "quality_delta_vs_human": quality_delta,
        "p50_latency_seconds": p50_latency_s,
        "wrong_confident_rate": wrong_confident,
        # All four gates must pass at the same time.
        "ship": (
            resolution >= 0.70
            and quality_delta >= -0.3
            and p50_latency_s <= 8.0
            and wrong_confident <= 0.02
        ),
    }
Tune the thresholds to your domain. Pricing, medical, and legal questions deserve harsher wrong-confident gates. Order-status questions tolerate more.
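One way to keep domain-specific gates explicit is a small table of overrides. This is a sketch under assumptions: the `Gates` class, the `ships` helper, and the domain names are hypothetical, and the order-status value is a placeholder. The defaults mirror the scoring code, and the 0.5% figure for pricing comes from the FAQ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gates:
    min_resolution: float = 0.70
    min_quality_delta: float = -0.3
    max_p50_latency_s: float = 8.0
    max_wrong_confident: float = 0.02


DEFAULT = Gates()
# Hypothetical overrides: high-stakes domains get a harsher wrong-confident
# gate; the order-status value is a placeholder, not a recommendation.
DOMAIN_GATES = {
    "pricing": replace(DEFAULT, max_wrong_confident=0.005),
    "order_status": replace(DEFAULT, max_wrong_confident=0.03),
}


def ships(metrics: dict, domain: str = "general") -> bool:
    """Apply the gates for `domain` to a metrics dict keyed like benchmark()'s output."""
    g = DOMAIN_GATES.get(domain, DEFAULT)
    return (
        metrics["resolution_rate"] >= g.min_resolution
        and metrics["quality_delta_vs_human"] >= g.min_quality_delta
        and metrics["p50_latency_seconds"] <= g.max_p50_latency_s
        and metrics["wrong_confident_rate"] <= g.max_wrong_confident
    )
```

Keeping the thresholds in data rather than inline makes a change to a gate visible in review instead of buried in a condition.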
Calibrating the human baseline
Before the agent runs at all, we benchmark the human support team for two weeks against the same four numbers. The baseline matters more than most teams expect: most have never measured their own wrong-confident rate, and the answer is almost never zero.
Without a baseline, "78% resolution" is meaningless. It could be a strong agent on a hard inbox, or a weak agent on an easy one. With a baseline, you know which.
Retrieval is half the score
A clean benchmark setup is wasted if the agent cannot find the right document. The agent only knows what it can retrieve, and a sloppy retrieval pipeline shows up as both lower quality and higher wrong-confident.
Two practical rules. First, version your index by document hash, so an outdated PDF cannot quietly poison three weeks of answers. Second, separate retrieval recall from generation quality in your dashboard. When the agent gets worse, you want to know whether the model started hallucinating or whether the index started returning the wrong page.
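Both rules fit in a few lines. The sketch below assumes the index stores a content hash per document name and that each benchmark question has a hand-labelled set of relevant document IDs. The function names and data shapes are hypothetical; only `hashlib.sha256` is a real library call.

```python
import hashlib


def document_version(content: bytes) -> str:
    """Content hash used as the version key for an indexed document."""
    return hashlib.sha256(content).hexdigest()


def stale_documents(indexed: dict, current: dict) -> list:
    """Names whose indexed hash no longer matches the current file content.

    indexed: name -> hash stored at index time
    current: name -> raw bytes of the file as it exists now
    """
    return sorted(
        name
        for name, content in current.items()
        if indexed.get(name) != document_version(content)
    )


def recall_at_k(retrieved_ids, relevant_ids, k=5) -> float:
    """Share of the known-relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)
```

If recall holds steady while quality falls, the generation step is the suspect. If recall drops first, check `stale_documents` before touching the prompt.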
The graders are the long pole
A clean rubric is the part of the project people underestimate. When a human grader disagrees with the LLM judge, we read the conversation together and amend the rubric example for the disputed score. After two weeks of this, the judge and the human agree on roughly 85% of samples, which is enough to trust the judge for bulk grading.
If the agreement rate drops below 80% mid-pilot, stop the dashboard, regrade the last week by hand, and find out what changed. Usually the rubric grew an edge case the judge never saw.
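The agreement check can be sketched like this. The article does not say whether agreement means an exact match on the 1-to-5 scale or a score within one point, so `tolerance` is an assumption left to the reader. The function names and status strings are illustrative.

```python
def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Share of paired samples where judge and human scores match.

    tolerance=0 requires an exact match; tolerance=1 accepts scores
    within one point of each other.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("scores must be paired")
    if not judge_scores:
        raise ValueError("need at least one paired sample")
    matches = sum(
        abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)


def judge_status(rate, trust_at=0.85, halt_below=0.80):
    """Map an agreement rate onto the article's 85% and 80% thresholds."""
    if rate < halt_below:
        return "halt: regrade the last week by hand"
    if rate >= trust_at:
        return "trusted for bulk grading"
    return "keep calibrating"
```

With a spot-check of 30 conversations, each disagreement moves the rate by more than three points, so a single bad week is a reason to look closer rather than proof of drift.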
What ships, what does not
An agent ships when all four numbers clear the gate at the same time. Three out of four is not ship. Two out of four is not even close. Requiring all four together is the point: an agent can be fast and helpful and still wrong, or accurate but slow and therefore useless, or accurate and fast only because it escalates everything hard.
When we built the chat agent for a Dutch logistics client, the wrong-confident rate sat at 6% for the first three weeks because the retrieval index was returning the wrong revision of their shipping-rate PDF. We fixed it by versioning the index by document hash, scoring retrieval recall separately, and refusing to answer pricing questions when retrieval confidence dropped below a threshold. If you are building AI agents for a real inbox, the four numbers are the bar.
What you can do today: take 50 closed tickets from last week, score the replies on a 1-to-5 rubric, and count how many were wrong without hedging. That number is your starting line, and you cannot improve it until you have measured it.
Key takeaway
An agent ships when resolution, quality, latency, and wrong-confident rate all clear the gate at the same time. Three out of four is not ship.
FAQ
How long should shadow mode run before we judge an agent?
At least two weeks. You need a paired sample on the same tickets the human team answered, plus a baseline read on latency and escalation rate before any grading begins.
Can we use the same model as judge and writer?
No. Use a different model family for the judge, or it will rate its own answers too generously. Calibrate the judge against human spot-checks weekly.
What is an acceptable wrong-confident rate?
Under 2% for general support. For pricing, medical, or legal answers, push it under 0.5% or refuse to answer when retrieval confidence drops below threshold.
Why measure the human team first?
Because a number like 78% resolution is meaningless without a baseline. The human score tells you whether the agent is strong or whether the inbox is easy.