Chat agent evaluation harnesses: a field guide for ops leads
After two front-page agent-mishap threads on Hacker News, every client asked us the same question: how do we know our chat agent didn't say something stupid last night?

It is Tuesday, 09:14. Marleen runs operations at a 30-person Dutch logistics firm. Her inbox has four tickets that all read variations of "the chat agent said something weird yesterday". One customer was quoted a delivery date that does not exist. One was told the office is open on a Sunday. One asked for a refund and got a recipe for stamppot. She does not have a data scientist on staff. She has a laptop, a coffee, and ninety minutes before the 10:30 standup.
This is the moment we built evaluation harnesses for.
Recent weeks made the case unavoidable. Two front-page Hacker News threads, weeks apart, described shipped agents behaving in ways their operators could not have predicted: one leaked system-prompt content under a novel jailbreak (a class of failure Simon Willison has been cataloguing for over two years), the other burned a four-figure cloud bill in a runaway loop scanning a hobbyist network. None of this is surprising. All of it is operationally expensive. The lesson is not "stop using agents". The lesson is "run evals you can actually rerun on a Tuesday".
What follows is the field guide we now ship alongside every client-facing chat agent we deliver. Seven evaluation harnesses, ranked from "Marleen can rerun this in three minutes" to "Marleen can rerun this if she blocks an hour and reads our notes". All seven run on her laptop. None of them require a notebook, a GPU, or someone with the title of MLE. The order is deliberate: the cheaper a harness is to rerun, the higher it sits.
1. Golden conversation replay
This is the floor. Forty to eighty real conversations from the last month, captured with consent, replayed end-to-end against the agent on every model upgrade or prompt change. We diff the new transcript against the approved one and flag any reply that diverges from its approved version by more than a set similarity threshold.
Marleen reruns this by typing one command:
npm run evals:golden
# walks /evals/golden/*.json, posts each turn,
# diffs assistant output against approved snapshot,
# prints a green/red table with the failing lines highlighted.
If the diff is green, ship. If it is red on a conversation Marleen recognises, she opens the failing snapshot side by side and decides whether the new behaviour is better, worse, or a wash. This catches roughly 70% of regressions in our portfolio.
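The replay-and-diff step can be sketched in a few lines. This is a minimal illustration, not the script behind npm run evals:golden: the GoldenTurn shape, the synchronous Reply stand-in for the agent, and the choice of Jaccard word overlap as the similarity measure are all assumptions made for the example.

```typescript
// Hypothetical snapshot shape; the real /evals/golden/*.json format may differ.
interface GoldenTurn {
  user: string;     // what the customer said
  approved: string; // the assistant reply that was signed off
}

// Stand-in for posting one user turn to the agent (a real agent call would be async).
type Reply = (userTurn: string) => string;

interface Failure {
  index: number;
  score: number;
  approved: string;
  actual: string;
}

// Lowercase the text and collect its distinct word tokens as object keys.
function wordSet(text: string): Record<string, boolean> {
  const seen: Record<string, boolean> = Object.create(null);
  const tokens = text.toLowerCase().split(/[^a-z0-9\u00c0-\u024f]+/);
  for (const t of tokens) {
    if (t.length > 0) seen[t] = true;
  }
  return seen;
}

// Jaccard similarity over word sets: 1 for identical vocabulary, 0 for none shared.
function similarity(a: string, b: string): number {
  const setA = wordSet(a);
  const setB = wordSet(b);
  const keysA = Object.keys(setA);
  const keysB = Object.keys(setB);
  if (keysA.length === 0 && keysB.length === 0) return 1;
  const shared = keysA.filter((k) => setB[k] === true).length;
  return shared / (keysA.length + keysB.length - shared);
}

// Replay every golden turn and collect the replies that fall below the threshold.
function replay(turns: GoldenTurn[], agent: Reply, threshold: number): Failure[] {
  const failures: Failure[] = [];
  turns.forEach((turn, index) => {
    const actual = agent(turn.user);
    const score = similarity(turn.approved, actual);
    if (score < threshold) {
      failures.push({ index, score, approved: turn.approved, actual });
    }
  });
  return failures;
}
```

An empty result from replay corresponds to the green table. Word overlap is deliberately crude: it ignores word order and paraphrase, so a production harness may prefer an embedding-based or edit-distance measure.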
The hard part is picking the forty conversations. We bias the set toward edge cases: the angry customer, the customer who switched languages mid-thread, the customer who asked the agent to do something it cannot do, the customer who pasted three paragraphs from a competitor's website into the chat. A golden set of forty "good morning, what are your hours" conversations tells you nothing. We rebuild this set every quarter from the prior quarter's tickets, and we retire any conversation whose underlying policy has since changed.
2. PII and policy regex sweep
Every agent reply for the last seven days, piped through a battery of regexes. Dutch BSN format. IBAN. Email and phone. Mention of competitor brand names. The three words a senior at the client has personally flagged as "we never say that".
npm run evals:pii -- --since=7d
# scans logs/agent-replies/*.jsonl
# matches against /evals/regex/*.yaml
# writes a CSV of hits to /reports/pii-YYYY-MM-DD.csv
The CSV opens in Excel. Marleen scans it. If there are zero rows, she closes it. If there is one row, she opens the conversation and decides whether the agent leaked something or the customer pasted it themselves. We run this nightly in cron, but the value of a Tuesday rerun is that Marleen sees the same view we do.
The regex list is collaborative. Every time the client says "we should never say X" in a Slack thread, X becomes a new pattern. After six months a typical list has 40 to 60 entries and reads like a quiet record of the team's actual brand discipline.
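For a sense of what such a pattern battery looks like in code, here is a minimal sketch. The Pattern and Hit shapes and the three example patterns are assumptions for illustration, not the contents of /evals/regex/*.yaml. The BSN check uses the standard Dutch "elfproef" so that arbitrary nine-digit numbers (order references, phone fragments) are not reported as BSNs.

```typescript
interface Pattern {
  name: string;
  regex: RegExp; // must carry the "g" flag so match() returns every occurrence
  validate?: (match: string) => boolean; // optional second check to cut false positives
}

interface Hit {
  replyId: string;
  pattern: string;
  match: string;
}

// Dutch BSN "elfproef": weights 9..2 for the first eight digits, -1 for the ninth;
// the weighted sum must be divisible by 11.
function passesElfproef(digits: string): boolean {
  if (!/^\d{9}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < 9; i++) {
    const weight = i === 8 ? -1 : 9 - i;
    sum += weight * Number(digits.charAt(i));
  }
  return sum % 11 === 0;
}

// Illustrative patterns only; a real list grows with every "we never say X".
const PATTERNS: Pattern[] = [
  { name: "bsn", regex: /\b\d{9}\b/g, validate: passesElfproef },
  { name: "iban-nl", regex: /\bNL\d{2}[A-Z]{4}\d{10}\b/g },
  { name: "email", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
];

// Run every pattern over every reply and return one row per validated match.
function sweep(replies: { id: string; text: string }[], patterns: Pattern[]): Hit[] {
  const hits: Hit[] = [];
  for (const reply of replies) {
    for (const p of patterns) {
      const found: string[] = reply.text.match(p.regex) || [];
      for (const m of found) {
        if (!p.validate || p.validate(m)) {
          hits.push({ replyId: reply.id, pattern: p.name, match: m });
        }
      }
    }
  }
  return hits;
}
```

Each Hit maps to one CSV row. Keeping the validator separate from the regex lets a non-developer add plain patterns while the few that need arithmetic stay in code.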
3. Tool-call audit diff
Agents that can do things (book a slot, refund an order, send a quote) leave a trail of tool calls. We log the JSON of every tool call alongside the conversation that triggered it. The harness replays the conversation in dry-run mode and diffs the new tool-call sequence against the original.
npm run evals:tools -- --since=7d --dry-run
# replays each conversation against the live agent
# with the tool runtime swapped for a recorder
# diffs the new tool-call JSON against the logged one
What this catches: the agent that used to ask one clarifying question before booking now books on the first turn. The agent that used to refund only on a specific reason code now refunds on any complaint. These are the failures that cost money before anyone notices.
The dry-run runtime must return realistic responses, not nulls. An agent that gets null back from check_inventory will behave nothing like the production agent. We keep a snapshot of recent tool responses and replay those. Skip this and your eval lies to you.
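The comparison itself is small. The sketch below assumes a hypothetical ToolCall shape (a name plus a JSON-like args object) and compares the logged and replayed sequences position by position; it is an illustration of the idea, not the harness behind npm run evals:tools.

```typescript
// Hypothetical log shape; real tool-call records will carry more fields.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface Difference {
  position: number;
  logged: string | null;   // null means the call is missing on this side
  replayed: string | null;
}

// Serialise with sorted keys so argument order alone never counts as a change.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonical(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// Compare the two sequences position by position and report every mismatch.
function diffToolCalls(logged: ToolCall[], replayed: ToolCall[]): Difference[] {
  const out: Difference[] = [];
  const n = Math.max(logged.length, replayed.length);
  for (let i = 0; i < n; i++) {
    const a = i < logged.length ? logged[i].name + canonical(logged[i].args) : null;
    const b = i < replayed.length ? replayed[i].name + canonical(replayed[i].args) : null;
    if (a !== b) out.push({ position: i, logged: a, replayed: b });
  }
  return out;
}
```

A positional diff is noisy when a single call is inserted or dropped, because every later position shifts. That noise is acceptable for a first pass, since a skipped clarifying question should be loud; a sequence-alignment diff would give tidier reports.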
4. Cost and latency ceiling
For each of the golden conversations, we record p50 and p95 for tokens in, tokens out, and wall-clock time. The harness fails the build if the new model or prompt exceeds any recorded ceiling by more than 20%.
npm run evals:budget
# runs the golden set, records token + latency stats,
# compares against /evals/budget.json,
# exits 1 if any ceiling exceeded by >20%.
The 20% threshold is not magic. It is wide enough that normal variation in retrieval-context size does not trip it, and tight enough that a behavioural change worth investigating will. We started at 10% and got tired of investigating noise. We tried 30% and missed a regression that doubled the per-conversation cost on a long tail of edge cases. 20% has been the right setting for two years.
This is the harness that would have caught the runaway-bill incident before it cost real money. Not because it would have stopped a loop in production (it would not have), but because it would have made the operator notice that the new prompt was triggering ten tool calls where the old one triggered two. That is the upstream signal. A blown ceiling is almost never an isolated cost story. It is a behaviour story with a price tag attached.
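The ceiling check is a percentile computation and a comparison. The sketch below assumes a hypothetical budget shape (one p50 and one p95 ceiling per metric) and uses the nearest-rank percentile method; the actual layout of /evals/budget.json is not specified in this guide.

```typescript
interface Stats {
  p50: number;
  p95: number;
}

// Hypothetical budget layout: metric name mapped to its recorded ceilings.
type Budget = Record<string, Stats>;

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Return the "metric.percentile" names that exceed their ceiling by more than the tolerance.
// A metric with no samples throws, which is the safer failure for a build gate.
function breaches(samples: Record<string, number[]>, budget: Budget, tolerance: number): string[] {
  const out: string[] = [];
  for (const metric of Object.keys(budget)) {
    const observed = samples[metric] || [];
    if (percentile(observed, 50) > budget[metric].p50 * (1 + tolerance)) out.push(metric + ".p50");
    if (percentile(observed, 95) > budget[metric].p95 * (1 + tolerance)) out.push(metric + ".p95");
  }
  return out;
}
```

A wrapper script would exit with status 1 when breaches(samples, budget, 0.2) is non-empty. Checking p95 separately from p50 matters: a long-tail regression of the kind described above can leave the median untouched.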
5. Knowledge-base grounding check
For agents with a RAG layer, every factual claim in the reply should be traceable to a retrieved chunk. We extract noun phrases from the assistant turn, look them up in the retrieved context, and flag claims that have no source.
npm run evals:grounding -- --since=7d
# for each reply, runs an extractor over the assistant text,
# checks each extracted claim against the retrieval context,
# writes /reports/ungrounded-YYYY-MM-DD.csv
Marleen does not need to read every claim. She reads the rows where the score is below 0.6. In a typical week that is maybe 12 rows. Half of them are harmless ("our team is here to help" does not need a source). The other half are the agent confidently inventing things. This is the harness Dutch operations leads find most valuable, because invented facts in Dutch sound exactly as fluent as real ones, and a Dutch customer will quote the agent back to you in writing.
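To make the scoring concrete, here is a deliberately crude sketch. It swaps the noun-phrase extractor for a simpler content-word filter (an assumption, to keep the example dependency-free) and defines the score as the share of a reply's content words that also appear in the retrieved chunks. That definition and the short English stopword list are also assumptions for illustration; a Dutch deployment would need its own list.

```typescript
// Tiny illustrative stopword list (English only).
const STOPWORDS = [
  "the", "and", "are", "for", "our", "you", "your", "will",
  "this", "that", "with", "from", "have", "has", "here",
];

// Crude stand-in for noun-phrase extraction: lowercase words of 3+ letters, minus stopwords.
function contentWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9\u00c0-\u024f]+/)
    .filter((w) => w.length > 2 && STOPWORDS.indexOf(w) === -1);
}

// Share of the reply's content words that can be found in the retrieved context.
// A reply with no content words scores 1: there is nothing to ground.
function groundingScore(reply: string, retrievedChunks: string[]): number {
  const claims = contentWords(reply);
  if (claims.length === 0) return 1;
  const context = contentWords(retrievedChunks.join(" "));
  const grounded = claims.filter((w) => context.indexOf(w) !== -1).length;
  return grounded / claims.length;
}

// Keep only the replies worth a human read.
function flagUngrounded(
  rows: { reply: string; chunks: string[] }[],
  threshold: number
): { reply: string; score: number }[] {
  return rows
    .map((r) => ({ reply: r.reply, score: groundingScore(r.reply, r.chunks) }))
    .filter((r) => r.score < threshold);
}
```

Word overlap explains why pleasantries such as "our team is here to help" land in the report: they score low without being false. It also cannot tell a correct claim from a wrong one that reuses the same words, so the rows still need a human read.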
6. Prompt-injection battery
A curated suite of about 200 known injection attempts, drawn from the OWASP Top 10 for LLM Applications, public CTF write-ups, and the genuinely creative ones our own customers have tried. We rerun the full suite against the agent on every prompt change.
npm run evals:injections
# posts each attack payload as a user turn,
# checks the response against a pass/fail rubric,
# prints a category breakdown (system-prompt leak,
# tool-misuse, refusal bypass, exfiltration, etc).
Model changes get the full battery too, not just prompt changes. Marleen reruns it before she ships any prompt edit. It takes about 11 minutes for 200 payloads against a streaming agent.
One nuance: the battery is not a pass/fail score for the model. It is a regression test for your specific system prompt and tool surface. The payload that was safe last week may not be safe this week, because the model changed underneath you. That is the whole point. Vendors do not always announce when they retrain a checkpoint, and the public benchmarks they cite rarely cover your exact threat model.
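A battery run reduces to three steps: post each payload, judge the response, and tally by category. The sketch below assumes a hypothetical Payload shape and a very simple rubric in which a response fails if it contains any forbidden marker, for example a canary string planted in the system prompt. Real rubrics are richer; this only illustrates the loop and the breakdown.

```typescript
// Hypothetical payload record; category names follow the breakdown printed by the harness.
interface Payload {
  id: string;
  category: string;
  attack: string;
}

interface Result {
  id: string;
  category: string;
  passed: boolean;
}

// Stand-in for posting one user turn to the agent (a real agent call would be async).
type Agent = (userTurn: string) => string;

// Simplified rubric: pass only if no forbidden marker appears in the response.
function passes(response: string, forbidden: string[]): boolean {
  const lower = response.toLowerCase();
  return forbidden.every((marker) => lower.indexOf(marker.toLowerCase()) === -1);
}

// Post every payload as a user turn and judge the response.
function runBattery(payloads: Payload[], agent: Agent, forbidden: string[]): Result[] {
  return payloads.map((p) => ({
    id: p.id,
    category: p.category,
    passed: passes(agent(p.attack), forbidden),
  }));
}

// Tally passes and failures per category for the summary table.
function breakdown(results: Result[]): Record<string, { passed: number; failed: number }> {
  const table: Record<string, { passed: number; failed: number }> = {};
  for (const r of results) {
    if (!table[r.category]) table[r.category] = { passed: 0, failed: 0 };
    if (r.passed) table[r.category].passed += 1;
    else table[r.category].failed += 1;
  }
  return table;
}
```

Because the results are keyed by payload id, comparing this week's results against last week's shows exactly which payloads flipped from pass to fail. That comparison is the regression signal.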
7. Tone and persona drift
This one needs the most judgement, which is why it is last. The agent has a defined voice (we write a one-page voice constitution with every client). We sample 20 recent replies, feed them through a small LLM judge with the constitution in context, and ask: does this reply sound like the brand?
npm run evals:voice -- --sample=20
# samples replies from the last 7 days,
# scores each one against /evals/voice-constitution.md,
# writes a CSV with score, rationale, and reply text.
Scores under 7/10 get a read. The rationale is usually correct. The replies that fail tend to cluster: overly formal Dutch on a brand that speaks plain Dutch, English filler words on a Dutch-only brand, apologies where the customer did nothing wrong. This is the harness that surfaces the slow drift you cannot feel day to day.
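The judge loop can be sketched without committing to any particular model or SDK. In the example below the Judge type is a stand-in for whatever LLM call you use, and the prompt wording and JSON answer format are assumptions for illustration. The only figure taken from the text above is the 7/10 read threshold.

```typescript
interface Verdict {
  score: number;
  rationale: string;
}

// Stand-in for an LLM call that takes a prompt and returns raw text (a real call would be async).
type Judge = (prompt: string) => string;

// Put the voice constitution and the reply in one prompt and ask for a JSON-only answer.
function buildPrompt(constitution: string, reply: string): string {
  return [
    "You are reviewing a customer-service reply against a brand voice guide.",
    "Voice guide:",
    constitution,
    "Reply:",
    reply,
    'Answer with JSON only: {"score": <0-10>, "rationale": "<one sentence>"}',
  ].join("\n");
}

// Parse the judge output defensively; anything malformed becomes null.
function parseVerdict(raw: string): Verdict | null {
  try {
    const parsed = JSON.parse(raw) as { score?: unknown; rationale?: unknown };
    if (typeof parsed.score !== "number" || parsed.score < 0 || parsed.score > 10) return null;
    const rationale = typeof parsed.rationale === "string" ? parsed.rationale : "";
    return { score: parsed.score, rationale };
  } catch (e) {
    return null;
  }
}

// Return the replies that need a human read: low scores and unparseable verdicts.
function needsRead(
  replies: string[],
  constitution: string,
  judge: Judge,
  minScore: number
): { reply: string; verdict: Verdict | null }[] {
  return replies
    .map((reply) => ({ reply, verdict: parseVerdict(judge(buildPrompt(constitution, reply))) }))
    .filter((row) => row.verdict === null || row.verdict.score < minScore);
}
```

Treating an unparseable verdict as "needs a read" is a deliberate choice in this sketch: a judge that stops following the output format is itself a signal worth looking at.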
The seven in order of effort
The order matters. The first five Marleen can run as a single npm script before lunch. Six and seven we also run weekly ourselves, with a Slack-channel summary going back to the client, although Marleen can still rerun either on demand, as she does with the injection battery before a prompt edit. The point of the ordering is not that the later ones matter less; it is that the earlier ones are the ones an operations lead will actually keep running. An eval that does not get rerun is not an eval. It is a one-off.
If you build only one of these this week, build the golden replay. If you build two, add the PII sweep. If you build three, add the tool-call diff. Those three catch most of the failures we see in the wild, and all three fit on a laptop.
When we built the support agent for a Dutch facility-management firm last quarter, the tool-call diff is what caught a four-figure problem before it shipped: a prompt tweak meant to be more polite ended up dispatching a technician on every "thanks, bye". We wrote up the wider field notes on the way we ship and supervise AI agents after that one.
The smallest thing you can do today: open a terminal, create /evals/golden/, paste in five real conversations from your agent's logs, and write the loop that replays them. Twenty minutes of work. Every model upgrade after that gets cheaper.
Key takeaway
An eval that does not get rerun on a Tuesday is not an eval. It is a one-off. Start with the five an ops lead can rerun before lunch, then build the rest.
FAQ
Do I need a data scientist to run these evaluation harnesses?
No. The first five run as command-line scripts against logs you already have. The two that need judgement (injections, voice drift) are still readable by an ops lead with no ML background.
How often should I rerun the golden conversation replay?
Before any prompt change, any model upgrade, and on a nightly cron. The cost is small and the regressions it catches are not.
What if I don't have logged conversations yet to build a golden set?
Start with synthetic ones written by your support team. Replace them with real captured conversations (with consent) as soon as you have a week of production traffic.
Why rank the harnesses by effort instead of by importance?
Because importance does not predict whether the harness gets rerun. Effort does. The harness that runs every Tuesday catches more than the one that runs once.