RAG regression testing: how we A/B retrieval against itself

A method for catching silent regressions in a 38,000-document Dutch HR RAG pipeline: shadow traffic, a judge model, and a regression set we rebuild every Monday.

Jacob Molkenboer · Founder · A Brand New Company · 7 Jun 2026 · 10 min

Monday, 09:14. An HR lead at a Dutch logistics company pastes a question into the chat agent we built her: "Mag een uitzendkracht meedoen aan de bonusregeling van 2024?" The agent answers in two sentences and cites the right clause of the CAO. Six weeks earlier, the same question would have surfaced a 2022 memo and a confidently wrong yes. Nothing about the model changed in those six weeks. We swapped the retriever twice, the chunker once, and the reranker three times. The only reason we know the new answer is better, and not just different, is the rig this post is about.

We run a RAG pipeline on top of 38,000 Dutch HR documents: CAOs, internal policies, leave handbooks, a decade of email memos that nobody has the heart to delete. The corpus grows by roughly 200 documents a week. Every change to the retrieval stack (a new embedding model, a different chunk size, a reranker swap, a metadata filter) is a coin flip in production unless you can measure it. The trouble is that RAG quality is not a single number. A change can improve recall on policy questions and quietly destroy precision on payroll questions. You will not notice for a month, and by then three other things have changed too.

So we A/B the pipeline against itself. Here is the method, end to end.

Two pipelines, one query

The production agent has a stable version, call it champion, and one or more candidate versions, challenger-a, challenger-b. Every incoming user query is routed to the champion for the live answer. In the same request, asynchronously, the query is also fired at every active challenger. The challenger answers are never shown to the user. They are written to a comparison table with the champion answer, the retrieved chunks from each pipeline, the latency, and the token cost.

This is shadow traffic, which is the only honest way to evaluate a retriever. Synthetic questions skew toward what the engineer can imagine. Real users ask things engineers cannot imagine, in a register engineers do not write in. On the Dutch HR corpus, half the production queries contain at least one of: a typo, a code-switch into English mid-sentence, or a regional spelling of a legal term that the official documents do not use. None of that survives a hand-written test set.

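import asyncio

# Strong references to in-flight logging tasks. The event loop only keeps
# weak references, so an unreferenced task can be garbage-collected mid-run.
_background_tasks: set[asyncio.Task] = set()

async def handle_query(q: str, user_id: str):
    # Champion and challengers start concurrently on the same query.
    champion_task = asyncio.create_task(run_pipeline("champion", q))
    challenger_tasks = [
        asyncio.create_task(run_pipeline(name, q))
        for name in active_challengers()
    ]

    answer = await champion_task

    # Fire and forget. Never block the user on a challenger.
    log_task = asyncio.create_task(
        log_shadow_results(q, user_id, answer, challenger_tasks)
    )
    _background_tasks.add(log_task)
    log_task.add_done_callback(_background_tasks.discard)
    return answer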
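What lands in the comparison table can be sketched roughly as follows. This is an illustration, not the production logger: `timed` and `collect_shadow_rows` are hypothetical names, the row is a plain dict, and it takes un-started coroutines (not the tasks above) so that the measured latency covers the whole call. Retrieved chunks and token cost would be further columns on the same row.

```python
import asyncio
import time


async def timed(name: str, coro):
    """Await one pipeline call; record latency and turn errors into data."""
    start = time.perf_counter()
    try:
        answer, error = await coro, None
    except Exception as exc:  # a failing challenger must never reach the user
        answer, error = None, repr(exc)
    return {
        "pipeline": name,
        "answer": answer,
        "error": error,
        "latency_s": time.perf_counter() - start,
    }


async def collect_shadow_rows(query: str, calls: dict) -> list[dict]:
    """calls maps pipeline name -> coroutine; returns one row per pipeline."""
    rows = await asyncio.gather(*(timed(name, c) for name, c in calls.items()))
    return [dict(row, query=query) for row in rows]
```

Catching the exception inside `timed` means one broken challenger still produces a row, which is itself a useful signal.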

The pattern is unremarkable. The discipline is in what you do with the rows.

A judge model that does not know which pipeline is which

For each shadow row we score the champion answer and each challenger answer with a second model. The judge sees the question, the retrieved chunks, and the two answers, labelled A and B in randomised order. It does not know which is the live one. It does not know which retriever produced which chunks. It scores four axes on a 1–5 scale: groundedness (every claim traceable to a chunk), completeness, directness, and recency (did it cite the most recent applicable version of the policy). Then it picks a winner or declares a tie.

Two notes on the judge. First, position bias is real and large; without randomising A/B order, the judge will favour position A more often than chance. The original LLM-as-a-judge paper (Zheng et al., 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena") measures this directly and is worth twenty minutes. Second, the judge has to be a different model family from the one generating the answer, or the scores collapse into self-flattery. We use a smaller model for generation and a larger one for judging. The judge costs roughly one cent per comparison at our volume, which is the cheapest line item in the whole system.
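The randomised A/B labelling can be sketched like this. It is a minimal illustration that only covers answer ordering; `BlindedPair`, `blind_pair` and `unblind` are hypothetical names, and the judge is assumed to reply with "A", "B" or "tie".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindedPair:
    answer_a: str
    answer_b: str
    champion_is_a: bool  # stays server-side, never sent to the judge


def blind_pair(champion: str, challenger: str, rng: random.Random) -> BlindedPair:
    """Assign the two answers to slots A and B with a fair coin flip."""
    if rng.random() < 0.5:
        return BlindedPair(champion, challenger, champion_is_a=True)
    return BlindedPair(challenger, champion, champion_is_a=False)


def unblind(pair: BlindedPair, verdict: str) -> str:
    """Map the judge's 'A' / 'B' / 'tie' verdict back to a pipeline role."""
    if verdict == "tie":
        return "tie"
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    judge_picked_a = verdict == "A"
    return "champion" if judge_picked_a == pair.champion_is_a else "challenger"
```

Storing `champion_is_a` next to the verdict also lets you check afterwards whether slot A wins more often than it should.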

Warning

Never use the same model for generation and judging. A model marking its own homework will tell you everything is fine, right up to the moment a customer escalates.

The Monday regression set

Shadow traffic tells you what is happening on live queries this week. It does not tell you whether the pipeline still answers the questions it used to answer correctly. For that we keep a regression set of exactly 200 questions, and we rebuild it every Monday morning at 06:00.

The rebuild is automated. A script samples from the previous week's shadow log under three constraints:

  • 80 questions where the champion scored 4 or 5 across the board last week. These are the wins we refuse to lose.
  • 60 questions where champion and challenger disagreed and the judge picked the champion. These are the cases the new retriever needs to keep matching.
  • 60 questions where the champion scored 2 or below. These are open wounds. We want the regression set to remember them.

The 80-60-60 split is not sacred, but the shape matters. Any regression set built only on what is hard today will lose the cases that became easy yesterday. Any set built only on past wins will not push the new retriever anywhere. The three buckets answer three different questions: are we still good on what we already won, are we matching last week's contested calls, are we making any dent in the open wounds. If you change one bucket you will probably break the others, so we treat the split as version-controlled and review it quarterly, not weekly.
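A rough sketch of the sampling step, under stated assumptions: `ShadowRow` is a hypothetical shape for a shadow-log row, "2 or below" is read as "any axis at 2 or below", and a row that fits two buckets goes to the contested one. None of that is confirmed by the real script.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShadowRow:
    question: str
    champion_scores: tuple  # groundedness, completeness, directness, recency
    answers_disagreed: bool
    judge_winner: str  # "champion", "challenger", or "tie"


BUCKETS = {"wins": 80, "contested": 60, "open_wounds": 60}


def bucket_of(row: ShadowRow) -> Optional[str]:
    """Assign a row to at most one bucket; contested calls take priority."""
    if row.answers_disagreed and row.judge_winner == "champion":
        return "contested"
    if min(row.champion_scores) >= 4:
        return "wins"
    if min(row.champion_scores) <= 2:  # assumption: any axis at 2 or below
        return "open_wounds"
    return None


def rebuild_regression_set(rows: list, seed: int) -> dict:
    """Sample each bucket from last week's log, one row per distinct question."""
    rng = random.Random(seed)
    pools = {name: [] for name in BUCKETS}
    seen = set()
    for row in rows:
        name = bucket_of(row)
        if name is None or row.question in seen:
            continue
        seen.add(row.question)
        pools[name].append(row)
    # A thin bucket yields fewer rows rather than raising.
    return {
        name: rng.sample(pools[name], min(size, len(pools[name])))
        for name, size in BUCKETS.items()
    }
```

Seeding the sampler makes a given Monday's set reproducible, which helps when someone asks why a question was or was not in it.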

Then a human, usually one of us, spends about an hour on Monday reviewing the 60 low-scoring rows, writing the correct answer by hand, and tagging the failure mode (wrong-version, missing-document, hallucinated-clause, wrong-language). This hour is non-negotiable. It is the only part of the loop that cannot be automated, and the whole rig loses meaning without it. The judge model can compare two answers, but it cannot tell you what the right answer is on a clause of Dutch labour law it has never seen.

On every deploy, every challenger runs the full 200-question set before it can be promoted. The pass bar is: no regression on any of the 80 wins, judge-preference rate above 55% on the 60 disagreements, and at least 20 of the 60 open wounds now scoring 4 or above. If a candidate fails any of the three, it does not ship. We do not negotiate with the numbers.
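The three-part pass bar reduces to a small gate function. A sketch, assuming a "regression" on a win means its lowest axis score falls below 4 and that an open wound counts as healed when its lowest axis reaches 4; `GateInput` and `promotion_failures` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInput:
    wins_min_scores: list     # lowest axis score per question in the wins bucket
    contested_preferred: int  # contested questions where the judge preferred the candidate
    contested_total: int
    wounds_min_scores: list   # lowest axis score per question in the open-wounds bucket


def promotion_failures(g: GateInput) -> list:
    """Return the failed checks; an empty list means the candidate may ship."""
    failures = []
    regressed = sum(1 for s in g.wins_min_scores if s < 4)
    if regressed > 0:
        failures.append(f"{regressed} of {len(g.wins_min_scores)} wins regressed")
    rate = g.contested_preferred / g.contested_total if g.contested_total else 0.0
    if rate <= 0.55:
        failures.append(f"judge preference {rate:.0%} is not above 55%")
    healed = sum(1 for s in g.wounds_min_scores if s >= 4)
    if healed < 20:
        failures.append(f"only {healed} open wounds now score 4 or above")
    return failures
```

Returning the list of failures rather than a bare boolean keeps the deploy log readable when a candidate is blocked.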

What the rig has actually caught

A reranker upgrade that improved average judge score by 0.3 points, and silently broke every question that depended on a table being retrieved as a single chunk. We only saw it because four of the 80 wins regressed to 2.

A chunk-size change from 800 to 1200 tokens that improved completeness on policy questions and destroyed directness on payroll questions, because the model started padding answers with surrounding context. The Lost-in-the-Middle effect bites earlier than most teams expect once chunks get long. The aggregate score barely moved. The per-tag breakdown was unambiguous.

An embedding model swap that scored beautifully on the regression set and then collapsed in shadow traffic the next day. The regression set was built from queries the old embedder handled well, so it had a survivorship bias against questions the old embedder fumbled. We now weight the regression set toward the long tail.

A metadata filter we added to scope queries to the user's department, which improved precision on department-specific questions and broke roughly forty queries where the right answer lived in a cross-department policy. The filter was correct on the head of the distribution and wrong on the long tail. The 60 open-wounds bucket caught it within two weeks, because half the new bottom-scorers came from one department asking about another department's parental-leave terms.

Takeaway

An aggregate quality score on a RAG pipeline is almost always lying to you. Break it down by question tag and by failure mode, or you are flying blind.
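A small sketch of why the breakdown matters. The rows below are made-up illustrative numbers, not results from our logs, and `mean_delta_by_tag` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_delta_by_tag(rows: list) -> dict:
    """rows are (tag, champion_score, challenger_score) tuples.

    Returns the mean challenger-minus-champion score per tag.
    """
    deltas = defaultdict(list)
    for tag, champion, challenger in rows:
        deltas[tag].append(challenger - champion)
    return {tag: mean(values) for tag, values in deltas.items()}


# Illustrative data only: the challenger gains on policy, loses on payroll.
example_rows = [
    ("policy", 3, 4),
    ("policy", 3, 4),
    ("payroll", 4, 3),
    ("payroll", 4, 3),
]
aggregate_delta = mean(ch - c for _, c, ch in example_rows)
per_tag_delta = mean_delta_by_tag(example_rows)
```

In this toy case the aggregate delta is zero while the two tags move a full point in opposite directions, which is the pattern the aggregate hides.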

What the dashboard looks like

The HR lead does not see any of this. She sees an answer in two sentences with a citation. The operations team sees one page: a grid of champion versus each challenger, judge-preference rate over seven days, per-tag breakdowns, the latency delta, and the cost delta. Two numbers at the top: shadow agreement (how often the two pipelines reach the same answer at all) and judge preference (when they disagree, who wins).

The interesting column is a third one: questions the judge could not score. When that number ticks up, something has changed in the corpus that neither pipeline knows how to handle yet, usually a new policy document that contradicts an older one. That is a content problem, not a retrieval problem, and it goes to the HR team, not to us.
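The headline figures are simple ratios over the comparison rows. A sketch, assuming each row already carries a `same_answer` flag from some upstream check and a `winner` that is `None` when the judge could not score; both fields and the `Comparison` type are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comparison:
    same_answer: bool      # assumption: decided upstream
    winner: Optional[str]  # "champion", "challenger", "tie", or None if unscored


def headline_numbers(rows: list) -> dict:
    total = len(rows)
    disagreements = [r for r in rows if not r.same_answer]
    decided = [r for r in disagreements if r.winner in ("champion", "challenger")]
    challenger_wins = sum(1 for r in decided if r.winner == "challenger")
    return {
        # How often the two pipelines reach the same answer at all.
        "shadow_agreement": (total - len(disagreements)) / total if total else 0.0,
        # When they disagree and the judge decides, how often the challenger wins.
        "judge_preference": challenger_wins / len(decided) if decided else 0.0,
        # Rows the judge could not score.
        "unscored": sum(1 for r in rows if r.winner is None),
    }
```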

The calibration loop between judge and human

The judge is a tool, not an oracle. To stop it drifting we sample 20 of its decisions a week and re-score them by hand. We log agreement rate per axis (groundedness, completeness, directness, recency) and per failure-mode tag. When overall agreement drops below 80%, or when a single tag falls below 70%, we rewrite the relevant section of the judge prompt and re-run the previous month's shadow log to confirm the scores stay stable on cases we already settled.
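The agreement bookkeeping might look like the sketch below. It assumes "agreement" means the judge and the human gave the same score on an axis (the `tolerance` parameter loosens that), which is one reasonable definition rather than the confirmed one; the function names are hypothetical.

```python
from collections import defaultdict

AXES = ("groundedness", "completeness", "directness", "recency")


def agreement_rates(samples, tolerance=0):
    """samples: list of (tag, judge_scores, human_scores); scores map axis -> 1..5.

    Returns (overall, per_axis, per_tag) agreement rates in [0, 1].
    """
    hits_axis, n_axis = defaultdict(int), defaultdict(int)
    hits_tag, n_tag = defaultdict(int), defaultdict(int)
    for tag, judge, human in samples:
        for axis in AXES:
            agree = abs(judge[axis] - human[axis]) <= tolerance
            hits_axis[axis] += agree
            n_axis[axis] += 1
            hits_tag[tag] += agree
            n_tag[tag] += 1
    per_axis = {a: hits_axis[a] / n_axis[a] for a in n_axis}
    per_tag = {t: hits_tag[t] / n_tag[t] for t in n_tag}
    total = sum(n_axis.values())
    overall = sum(hits_axis.values()) / total if total else 0.0
    return overall, per_axis, per_tag


def needs_prompt_rewrite(overall, per_tag):
    """Thresholds from the text: overall below 80%, or any single tag below 70%."""
    return overall < 0.80 or any(rate < 0.70 for rate in per_tag.values())
```

With only 20 samples a week, a single tag may rest on two or three decisions, so a per-tag dip is a prompt to look closer, not proof of drift.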

Two patterns recur. Judges trained on English benchmarks under-penalise the wrong-language failure mode: they will happily accept a Dutch answer to an English question as long as the chunks support it. And judges almost never catch missing-document on their own, because they only see the chunks the retriever returned, not the ones it missed. The first we fix with prompt rewrites. The second we fix with a separate retrieval-coverage check that runs offline against an index of every document filed in the last twelve months.

What it costs and why we still run it

Roughly: every live query becomes 1 + N answers, where N is the number of active challengers, plus one judge call per challenger. With one challenger, that is three model calls where there used to be one, so about 3x the inference cost on shadow-eligible traffic. So we sample. About 15% of queries get shadowed on average, but the sampler is not flat across the day. During the 09:00 to 11:00 morning surge, when HR managers do their inbox triage, we drop the shadow rate to 5% so the challenger calls do not compete with the champion for the rate-limit budget. Between 14:00 and 17:00, when traffic is quieter, we shadow roughly every other query. The Monday regression run is a fixed cost of around €40 in inference, plus the human hour.
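The time-of-day sampler and the call arithmetic fit in a few lines. A sketch: the two windows come from the text, but the rate used for all other hours is an assumption (the 15% figure is an overall average, not a stated off-peak rate), and the function names are hypothetical.

```python
import random
from datetime import datetime


def shadow_rate(now: datetime) -> float:
    """Shadow-sampling rate by local hour of day."""
    if 9 <= now.hour < 11:
        return 0.05  # morning surge: protect the champion's rate-limit budget
    if 14 <= now.hour < 17:
        return 0.50  # quiet afternoon: roughly every other query
    return 0.15      # assumption: the stated average stands in for other hours


def should_shadow(now: datetime, rng: random.Random) -> bool:
    """Decide per query whether to fire the challengers."""
    return rng.random() < shadow_rate(now)


def calls_per_shadowed_query(n_challengers: int) -> int:
    """One champion answer, N challenger answers, one judge call per challenger."""
    return 1 + 2 * n_challengers
```

Note that the judge call usually costs more than a generation call when the judge is the larger model, so the call count understates the cost multiple somewhat.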

For a chat agent that answers 4,000 HR questions a week and replaces roughly 12 hours of someone's time, the rig costs less per month than the coffee budget. The reason we keep it running is not the cost arithmetic. It is that without it, a senior engineer has to spend half a day every release convincing themselves the new version is at least as good as the old one, and they will be wrong about a third of the time. The wrong third is what the support team hears about two weeks later, in the worst possible framing.

When we built the Dutch HR agent for a logistics client, the regression set we shipped on day one decayed into uselessness within four weeks: the corpus had moved on but the questions had not. We solved it by rebuilding the set from live traffic every Monday, which is the only reason any of the numbers above mean anything six months later.

If you have a RAG system in production and no regression rig, the smallest useful thing you can do today is log the retrieved chunk IDs alongside every answer. Once you have a week of those logs, you can replay any query against any new retriever and see the diff. Everything in this post is built on top of that one column.
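Once the chunk IDs are logged, the replay diff is a set comparison. A minimal sketch; `retrieval_diff` is a hypothetical helper, and it ignores rank order, which a fuller version would want to compare too.

```python
def retrieval_diff(logged_ids: list, new_ids: list) -> dict:
    """Compare the chunk IDs logged for a query with a new retriever's output."""
    old, new = set(logged_ids), set(new_ids)
    union = old | new
    return {
        "dropped": sorted(old - new),  # chunks the new retriever no longer returns
        "added": sorted(new - old),    # chunks only the new retriever returns
        "jaccard": len(old & new) / len(union) if union else 1.0,
    }
```

Queries with a low overlap score are the ones worth reading by hand first.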

Key takeaway

Aggregate RAG scores lie. Shadow every query against a challenger, judge with a different model, and rebuild the regression set from real traffic every week.

FAQ

Why not just use a public RAG benchmark?

Public benchmarks do not match your corpus, your users' phrasing, or your failure modes. They are useful as a sanity check, not as a deploy gate. The signal you need lives in your own shadow traffic.

How big should the regression set be?

Big enough to cover your failure modes, small enough that a human can review the bottom slice in an hour. For us that is 200. Below 100 the variance is too high; above 500 it stops getting rebuilt.

Does the judge model need to be larger than the answer model?

Not strictly larger, but it must be a different model family, and ideally one with stronger reasoning. A same-family judge will rubber-stamp answers it would have produced itself.

What if the judge and the human disagree?

Track it. We sample 20 judge decisions a week and re-score them by hand. When agreement drops below 80% we rewrite the judge prompt. It is a calibration loop, not a fire-and-forget.

Can this work without shadow traffic, on a low-volume agent?

Yes, but slower. Replace shadow traffic with a hand-curated set that grows by 10 to 20 questions a week from real user logs. The Monday rebuild rhythm still applies.
