RAG audit checklist: stale chunks, PDFs, lying rerank

Before a knowledge base talks to a paying customer, three failures sink it more often than the model: stale chunks, orphaned PDFs, and a confident lying rerank.

Jacob Molkenboer · Founder · A Brand New Company · 10 May 2024 · 6 min

It is Tuesday morning. A customer asks your RAG-powered support agent when she can return a product she bought last week. The agent pulls a chunk that looks pristine. Right headings, right tone, right brand voice. It quotes a 30-day return window. You changed that window to 14 days in January. The new policy is in the knowledge base too. The retriever found both. The rerank picked the older one because it was longer, denser, and matched more keywords.

This is the failure mode we see most often when we audit a RAG knowledge base for a client. The model is fine. The retrieval is the problem, and three specific failure modes cause most of it.

What follows is the checklist we run before we let a RAG system talk to a paying customer. It takes about a day on a small base and a week on a large one. We have never finished it without finding something.

Stale chunks that look fresh

The first pass is the obvious one. We list every chunk in the index and sort by ingestion date. Anything older than 18 months gets flagged. So does anything whose source URL or file path no longer resolves.
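A minimal sketch of that first pass, assuming each chunk can be exported as a record with an id and an ingestion date. The field names `id` and `ingested_on` and the 548-day cutoff (roughly 18 months) are illustrative assumptions, not any particular vector store's schema.

```python
from datetime import date, timedelta

def flag_stale(chunks, today, max_age_days=548):
    """Return ids of chunks ingested more than ~18 months ago, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    old = [c for c in chunks if c["ingested_on"] < cutoff]
    old.sort(key=lambda c: c["ingested_on"])
    return [c["id"] for c in old]

# Hypothetical export of the index; real records would come from your store.
chunks = [
    {"id": "returns-v1", "ingested_on": date(2022, 3, 1)},
    {"id": "returns-v2", "ingested_on": date(2024, 1, 15)},
    {"id": "pricing-2021", "ingested_on": date(2021, 11, 20)},
]
stale_ids = flag_stale(chunks, today=date(2024, 5, 10))
```

Passing `today` in explicitly keeps the check reproducible when you re-run the audit later.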

The harder pass is the chunk that is technically current but factually retired. A pricing page that still exists at the old URL because marketing kept the page up "for SEO". A staff bio for someone who left in April. A product spec for a SKU you discontinued. None of those will fail a freshness check by timestamp. All of them will be confidently quoted to a customer.

What we do: we ask the client's operations lead to walk us through the top 200 chunks by retrieval frequency over the last 30 days. Not the top 200 by score, the top 200 by actual retrieval count. In our audits, about 10 to 15 percent get flagged as "we don't quote that anymore". Those chunks need to come out of the index, not just get marked deprecated. The retriever does not care about your metadata flags if you do not also filter on them.
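Ranking by retrieval count rather than score could look something like this, assuming your pipeline logs each query together with the ids of the chunks it returned. The log shape and the `chunk_ids` field are hypothetical.

```python
from collections import Counter

def top_retrieved(retrieval_log, n=200):
    """Rank chunk ids by how often they were actually retrieved."""
    counts = Counter(
        chunk_id
        for event in retrieval_log
        for chunk_id in event["chunk_ids"]
    )
    return counts.most_common(n)

# Hypothetical 30-day retrieval log, trimmed to three events.
log = [
    {"query": "return window", "chunk_ids": ["returns-v1", "returns-v2"]},
    {"query": "refund time", "chunk_ids": ["returns-v1"]},
    {"query": "shipping cost", "chunk_ids": ["shipping"]},
]
ranked = top_retrieved(log, n=2)
```

The resulting list of (chunk id, count) pairs is what the operations lead reviews line by line.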

Warning

A "deprecated" flag in metadata is not a filter. If your retriever is not actually applying where deprecated = false at query time, the chunk still wins.

Orphaned PDFs

Every knowledge base over a year old has them. A PDF that was uploaded once, chunked, embedded, and then forgotten. The PDF has been deleted from the source system (Notion, SharePoint, Drive) but the chunks are still in the vector store. We have seen indexes where roughly 8 percent of chunks point to source documents that no longer exist on the system that supposedly owns them.

This is worse than a stale chunk because there is no way for a human to verify the source. The agent quotes the PDF. The customer asks for the document. The link 404s. Trust collapses on the spot.

The audit step is mechanical. For every unique source URL or file ID in the index, issue a HEAD request or the equivalent API call. Log the ones that fail. Cross-reference against the source system's authoritative list of live documents. The delta is your orphan set. Delete those chunks. Do not "mark for review". Delete them.

A related class of orphan: the PDF that exists but should not be in the customer-facing index at all. Internal salary bands. A draft press release. A consultant's NDA-only diagnostic. We have found all three in customer-facing indexes. The audit asks one question of every source document: would you be comfortable if your worst-behaved customer pasted this verbatim into a tweet? If not, it does not belong in the index.

The rerank that lies

This is the section that surprises clients the most. The retriever returns 50 candidates. A rerank model scores them. The top 5 go to the language model. Everyone assumes the rerank is the safety net. In our audits, it is the single most common cause of confidently wrong answers.

Rerank models are trained on general relevance signals. They reward documents that look like answers: structured, dense, on-topic vocabulary. A long, well-formatted page from 2023 will outscore a short, plain announcement from last month, even when the announcement is the authoritative update. Cohere's rerank documentation is explicit about this. Rerank optimises for semantic relevance, not recency, not authority, not policy status. That is your job to layer on.

There is also the well-documented position bias in long contexts. The Lost in the Middle paper showed that even when the right document is retrieved, models tend to ignore it if it lands in the middle of the prompt. The rerank decides not only what makes the cut but also the order in which it arrives. If it puts the correct doc at rank 3 of 5 and the wrong doc at rank 1, the correct doc sits in the middle of the context and the model often follows rank 1.

The audit step here is the most useful one in the whole checklist. Pull a sample of 200 real customer questions from the last 60 days. For each, log the top 5 reranked chunks and the top 5 by raw vector similarity. Have a human (the client's domain expert, not us) mark which chunks are actually correct. We typically see the rerank disagree with the domain expert on 15 to 25 percent of queries.
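Once the expert has labelled the sample, the disagreement rate is a few lines of arithmetic. The sketch below assumes one definition of "disagree": the top-ranked chunk is not among the chunks the expert marked correct. Other definitions, such as no correct chunk anywhere in the top 5, are equally defensible. The record fields are hypothetical, and each ranking is assumed to be non-empty.

```python
def top1_disagreements(samples, ranking="reranked"):
    """Return queries whose top-ranked chunk is not one the expert marked correct."""
    return [s["query"] for s in samples if s[ranking][0] not in s["expert_correct"]]

def disagreement_rate(samples, ranking="reranked"):
    """Share of queries where the chosen ranking's top chunk disagrees with the expert."""
    if not samples:
        return 0.0
    return len(top1_disagreements(samples, ranking)) / len(samples)

# Hypothetical labelled sample: four queries instead of 200.
samples = [
    {"query": "return window", "reranked": ["returns-v1", "returns-v2"],
     "vector": ["returns-v2", "returns-v1"], "expert_correct": {"returns-v2"}},
    {"query": "shipping cost", "reranked": ["shipping"],
     "vector": ["shipping"], "expert_correct": {"shipping"}},
    {"query": "warranty length", "reranked": ["warranty"],
     "vector": ["warranty"], "expert_correct": {"warranty"}},
    {"query": "gift cards", "reranked": ["gift-cards"],
     "vector": ["faq", "gift-cards"], "expert_correct": {"gift-cards"}},
]
rerank_rate = disagreement_rate(samples)            # rerank vs expert
vector_rate = disagreement_rate(samples, "vector")  # raw similarity vs expert
```

Comparing the two rates shows whether the rerank is helping or hurting on your own questions, and `top1_disagreements` gives the list of queries to review by hand.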

Then look at the disagreements. The pattern is almost always the same: the rerank prefers the longer, older, more polished chunk. The fix is not "train a better rerank". The fix is to layer hard filters before rerank runs. Effective date ranges. Document type whitelists for specific intents. A "current policy" flag enforced at retrieval time, not metadata-only.

A worked example

Here is the kind of filter we end up adding. The retriever returns candidates. Before rerank, we drop anything that fails a hard policy check.

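from datetime import date

def pre_rerank_filter(candidates, query_intent):
    """Drop candidates that fail a hard policy check before rerank sees them.

    Assumes each candidate exposes a dict-like .meta and that effective_from
    and effective_to, when present, are datetime.date objects.
    """
    today = date.today()
    out = []
    for c in candidates:
        # Deprecated chunks never reach the rerank.
        if c.meta.get("deprecated"):
            continue
        eff_from = c.meta.get("effective_from")
        eff_to = c.meta.get("effective_to")
        # Not yet in effect.
        if eff_from and eff_from > today:
            continue
        # No longer in effect.
        if eff_to and eff_to < today:
            continue
        # Policy questions may only be answered from policy documents.
        if query_intent == "policy" and c.meta.get("doc_type") != "policy":
            continue
        out.append(c)
    return out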

That is fewer than twenty lines of logic. It is also the difference between a system that quotes January's policy in June and one that does not. We have never shipped a customer-grade RAG setup that did not need a version of this.

What to actually run today

The full audit is a week. The five-minute version is this. Open your vector store. Run a query that counts chunks grouped by ingestion month. If the oldest tail is more than 18 months back and nobody on your team can name what is in it, you have a stale-chunk problem. Then pick three URLs at random from the source field. Open them in a browser. If any of them 404, you have an orphan problem. Then run a single real customer question through your retriever twice: once with the rerank, once without. If the answers disagree, the rerank is making a call somebody on your team should be making.
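If your vector store has no group-by query, the month count can be done client-side on an export. A minimal sketch, assuming you can pull each chunk's ingestion date as a `datetime.date`; how you export them depends on your store.

```python
from collections import Counter
from datetime import date

def chunks_per_month(ingestion_dates):
    """Count chunks per ingestion month, oldest month first."""
    counts = Counter(d.strftime("%Y-%m") for d in ingestion_dates)
    return sorted(counts.items())

# Hypothetical export of ingestion dates.
dates = [date(2022, 3, 1), date(2022, 3, 9), date(2024, 1, 15)]
histogram = chunks_per_month(dates)
```

The first entries of `histogram` are the oldest tail: the months to ask your team about.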

When we built the support agent for one of our larger e-commerce clients, the rerank disagreement rate was 22 percent before we added the pre-rerank policy filter. After, it sat at 3 percent, and the remaining disagreements were edge cases the domain expert wanted to review anyway. The same playbook drives how we ship AI agents: retrieval first, model second, audit before launch.

Run the five-minute version today. If anything fails, you know where to look next.

Key takeaway

Before letting a RAG system answer a customer, audit three things: stale chunks, orphaned sources, and rerank disagreement against a domain expert.

FAQ

How often should I re-run a RAG audit?

Quarterly for active knowledge bases, monthly if your source documents change weekly. Run the five-minute spot check any time you push a material policy or product change.

Can I trust the rerank score as a confidence signal?

No. Rerank scores measure semantic relevance, not factual accuracy or recency. Apply hard filters for effective dates, document type, and deprecation before rerank runs, not after.

What counts as a stale chunk?

A chunk that was correct at ingestion but no longer reflects current reality. Old pricing, retired policies, discontinued products. Timestamp checks miss them because the metadata still reads as live.

Should I delete orphaned PDFs or mark them deprecated?

Delete them from the index. A deprecated flag is not a filter unless your retriever explicitly enforces it at query time, and most pipelines do not by default.
