RAG

Sizing a RAG knowledge base: a three-axis pre-quote score

Before we quote a RAG build we score three axes: doc churn, the recall target the client really needs, and how many experts can validate a fifty-question regression set.

Jacob Molkenboer· Founder · A Brand New Company· 1 Nov 2024· 6 min

Open oak index-card drawer with chartreuse tab divider, three cream paper stacks and brass divider on ivory surface.

The quote we got wrong

A logistics company asked us last spring for a RAG system over their internal ops docs. Six weeks, we said. Three engineers, fixed scope, including evaluation. The number was wrong by a factor of two, and the reason wasn't the embedding model or the vector store. It was that nobody at the client could sit down for two days and validate whether the answers were actually right.

That project shipped. It took eleven weeks. We absorbed the overage. Since then we run a small worksheet before we quote any RAG build, and it has stopped us walking into the same trap.

Three axes that actually move the price

The interesting variables in a RAG quote are not "how many documents" or "which embedding model". Those are inputs to a calculator. The variables that move the price are three, and we score each on a 1 to 5 scale.

D, document churn. How fast the source corpus changes. A static knowledge base is cheap. A folder of PDFs that gets a new revision every three days is not.
R, recall target. What the answer needs to be worth. Internal "where do I find X" tolerates recall@5 around 0.80. A compliance answer read to a regulator does not.
E, expert validation capacity. How many hours per week the client can put a domain expert in front of the answers. This is the silent killer of RAG projects and the one we ignored on that logistics build.

We call it the DRE score. We run it before the second meeting.

Scoring document churn

You don't need a survey. If the client has a docs folder under git, this gives you a defensible number in thirty seconds:

git -C ./client-docs log \
  --pretty=format: --name-only --since="6 months ago" \
  | grep -v '^$' | sort | uniq -c | sort -rn | head -20

If they live in Confluence or Notion, pull the page-modified date on the top 200 most-trafficked pages. The question we want answered is: in a typical month, what percentage of the indexed corpus changes? Under 2% is a D=1 (static). Over 25% is a D=5 (live document).

The D score doesn't only affect re-embedding cost. It affects how fast the regression set goes stale, how aggressively you cache, and whether you need an incremental indexer or can get away with a nightly full rebuild.

Setting a recall target you can defend

"We want it accurate" is not a recall target. The number we negotiate is recall@k on a held-out question set, measured with a framework like RAGAS or a hand-rolled equivalent. The conversation goes like this:

What does the user do with the answer? Skim it, paste it into a reply, or include it in a contract?

Each of those has a different acceptable miss rate. Internal triage is fine at recall@5 of 0.80. Customer-facing answers we push to 0.90. Compliance and contract language we won't sign off below 0.95, and even then only with a citation gate that refuses to answer when retrieval confidence drops.

R is the recall target multiplied by five, rounded. R=4 means recall@5 of 0.80. R=5 means 0.95. A higher R buys more time on retrieval tuning: hybrid search, contextual chunking, re-ranking. Anthropic's writeup on contextual retrieval is a good baseline for what's reachable without exotic tooling.

The expert bottleneck

This is the axis nobody asks about. A RAG system is only as good as the regression set it's evaluated against, and a regression set is only as good as the experts who wrote it. If the client has one domain expert who can spare three hours a week, you have a ceiling. Two hours per question is realistic for a high-stakes set, including the back-and-forth on what counts as a correct answer.

So we ask, before we quote: how many hours per week can a person who actually knows the right answers sit with us, for the next four weeks, and then on a monthly retainer? The E score:

E=1: nobody, or only the founder, with under an hour a week.
E=3: one expert, two to four hours a week.
E=5: a small group, six or more combined hours a week, with at least two who can disagree productively.

If E is 1, the project ceiling is "demo that looks good in a meeting". We tell the client this. Sometimes that is all they need and the quote is small. Sometimes it changes the staffing on their side and the quote gets honest.

Warning

If E=1 and R=5, the project is not deliverable on any budget. You can build the system. You cannot prove it works. Walk away, or downscope R before you sign.

Why fifty questions, not five hundred

The fifty in "fifty-question regression set" is not arbitrary. In our experience it is roughly the smallest set where the recall number stops swinging wildly per question added. Under thirty, one badly chunked source moves your headline by several points. Over eighty, the marginal question rarely teaches you anything new and the expert is tired.

The runner is small enough to drop into the statement of work:

import json, statistics
from pathlib import Path

def hits_at_k(expected_ids, retrieved_ids, k=5):
    return int(bool(set(expected_ids) & set(retrieved_ids[:k])))

questions = json.loads(Path("regression.json").read_text())
scores = [
    hits_at_k(q["expected_doc_ids"], my_retriever(q["question"]))
    for q in questions
]
print(f"recall@5: {statistics.mean(scores):.2%} "
      f"over {len(scores)} questions")

We hand the client this file on day one. The regression set belongs to them, not to us. When we hand off, they should be able to rerun it without our help. If they cannot, the project has not shipped.

The DRE worksheet in practice

D + R + E gives a score between 3 and 15. The tiers we quote against:

3 to 6, light. A two-week prototype, a fixed weekly index, ship and walk away. Small fixed price.
7 to 10, standard. Hybrid retrieval, a re-ranker, the regression set, four to six weeks. Mid-tier fixed price with a defined evaluation milestone.
11 to 13, hard. Incremental indexing, citation gates, weekly expert sessions, eight to twelve weeks. Time and materials, with a recall floor written into the contract.
14 to 15, R&D. We discuss whether the project should exist in its current shape. Usually it shouldn't.

The scorecard is not magic. It is a way to have the same conversation with every prospect before money changes hands, and to refuse work we cannot deliver.

What changes when you put this in the sales call

Three things, in our experience. Quotes come back with detail the prospect can argue with, which is the conversation you want. Projects that should not exist die earlier and cheaper. And the client's domain expert becomes a named role from week one, not a surprise dependency in week six.

When we built the support-answer agent for a Dutch insurance broker last year, the thing we missed in scoping was exactly E. Two underwriters, plenty of corpus, no calendar. We ended up running the regression sessions on Friday afternoons across six weeks, and added a citation gate that refused to answer when retrieval confidence fell below a threshold. It is now the pattern we apply to every AI agent build that touches a real knowledge base.

If you have a RAG quote on your desk this week, the smallest useful thing you can do is ask the client one question: who, by name, will sit with us for two hours a week to score the answers? If they cannot tell you, you do not have a quote yet.

Key takeaway

The constraint that wrecks RAG quotes isn't model choice or vector store: it's how many hours a real domain expert can spend scoring answers.

FAQ

What's a reasonable recall target for an internal RAG system?

For internal triage, recall@5 around 0.80 is workable. Customer-facing answers we push to 0.90. Compliance answers shouldn't ship below 0.95, and only with a citation gate that refuses low-confidence retrievals.

Why fifty questions in the regression set?

Below thirty, one badly chunked source swings the headline number by several points. Above eighty, the marginal question rarely teaches anything new and the domain expert burns out before the project ships.

Can we skip the expert if the corpus is well-structured?

No. Structure helps retrieval, not evaluation. Someone has to decide what 'correct answer' means for each question, and that someone is a domain expert, not the engineer who built the index.

What if document churn is very high?

Score it D=5 and plan for incremental indexing plus a regression set refreshed monthly. Quote the refresh as a retainer line, not a one-off, so the system stays honest after handover.

ragknowledge baseai agentsstrategyoperationsarchitecture

Building something?

Start a project