RAG vs file search: scoring two knowledge-agent stacks
A 60-person Dutch consultancy ships a knowledge agent that confidently cites a Confluence page deprecated in 2024. Now you have to decide what to do.

Tuesday morning, Utrecht. A 60-person tax consultancy. The managing partner forwards me a screenshot. Their internal knowledge agent told a junior associate that the BTW (Dutch VAT) treatment for a specific cross-border invoice scenario was X. The Confluence page it cited was archived in 2024, and the rule has changed twice since.
The partner wants to know whether the agent is salvageable or whether they should rip it out.
This is not a rare email. It lands in our inbox roughly once a month, almost always from a sub-€12M Dutch professional-services firm that built (or paid for) an internal RAG-flavoured assistant in 2025 and is now living with the consequences. The question they actually want answered is binary: vector store, or no vector store. Closed-corpus RAG, or a file-search tool with the model doing the routing.
Here is the method we use to score that decision.
The two architectures on the table
Option A is what most teams already have. A scheduled job pulls pages out of Confluence and SharePoint, chunks them, embeds them into a vector store (pgvector, Pinecone, Qdrant, take your pick), and the agent retrieves the top-k chunks before answering. Chunking rules, reranker, query rewriter, the whole pipeline.
Option B is newer and looks deceptively simple. You expose Confluence and SharePoint to the model as a file-search tool (or as MCP servers), and let the model issue search queries directly against the source systems, then read the documents it finds. No vector store. No nightly index job. No chunking decisions in your repo.
Both can answer the same question. They fail in different ways. That is the entire point of the scoring sheet.
Hallucination rate, measured honestly
You cannot trust the vendor demo here. The hallucination rate of a knowledge agent is dominated by two things: whether the retrieved context actually contains the answer (recall), and whether the model invents details when it does not (faithfulness).
Closed-corpus RAG: recall is bounded by your chunking and your reranker, and by how fresh the index is. If a Confluence page was edited this morning and your indexer runs nightly, the agent retrieves a chunk that contradicts the live page until the next run. Faithfulness is usually good if you instruct the model to cite chunks and to refuse when it finds no evidence.
File-search routing: recall depends on the search quality of Confluence and SharePoint, which is honestly mediocre. The model issues two or three queries, reads what comes back, and stops. If the answer sits on page seven of the SharePoint result list, it will not get there. The official Atlassian Confluence search API is fine for direct title matches and surprisingly weak on semantic phrasing.
In our scoring sheet, we run a fixed eval set of fifty questions from the client's actual ticket history against both architectures and sort every answer into one of three categories: correct with source, refused with reason, or wrong (hallucinated or stale). Anything above a 5% wrong rate fails the gate; on fifty questions, that means three or more wrong answers.

    import csv
    import json
    from pathlib import Path

    # agent_clients is assumed to be a local wrapper module; each agent exposes
    # ask(question) -> {"answer": str, "citations": [url], "refused": bool}
    from agent_clients import rag_agent, file_search_agent

    EVAL_SET = Path("eval/questions.csv")  # cols: question, ground_truth_url

    def score(agent, name):
        """Run every eval question through one agent and label each answer."""
        rows = []
        with EVAL_SET.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                out = agent.ask(row["question"])
                # A refusal is its own category; otherwise the answer only
                # counts as correct if it cites the ground-truth page.
                verdict = (
                    "refused" if out["refused"]
                    else "correct" if row["ground_truth_url"] in out["citations"]
                    else "wrong"
                )
                rows.append({**row, "agent": name, **out, "verdict": verdict})
        return rows

    results = score(rag_agent, "rag") + score(file_search_agent, "file_search")
    Path("eval/out.json").write_text(json.dumps(results, indent=2), encoding="utf-8")

This is not glamorous. It is the only number that matters.
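
To turn those verdicts into the 5% gate, one possible sketch follows. It assumes the row layout the script writes to eval/out.json (a list of dicts with agent and verdict keys); the names wrong_rates, passes_gate and GATE are illustrative and not part of the original tooling.

```python
from collections import Counter

GATE = 0.05  # maximum tolerated share of "wrong" verdicts (the 5% gate)


def wrong_rates(results):
    """Return {agent: share of that agent's rows whose verdict is "wrong"}."""
    totals = Counter(r["agent"] for r in results)
    wrong = Counter(r["agent"] for r in results if r["verdict"] == "wrong")
    return {agent: wrong[agent] / totals[agent] for agent in totals}


def passes_gate(results, gate=GATE):
    """Return {agent: True if its wrong rate is at or below the gate}."""
    return {agent: rate <= gate for agent, rate in wrong_rates(results).items()}
```

Load the JSON file with json.loads and pass the resulting list to passes_gate to get one pass/fail flag per architecture.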
Refresh latency in practice
How fast does a change to a Confluence page show up in the agent's answer?
Closed-corpus RAG: equal to your reindex cadence. Most teams run nightly because reindexing a 40k-page Confluence space hourly is wasteful. Some ship a webhook-driven incremental indexer; in practice it breaks within ninety days because somebody moves a space and the webhook payloads change shape. The honest field number is twenty-four to seventy-two hours, depending on whether the nightly job stays green.
File-search routing: real-time. The model queries the live system. The trade-off is that every answer pays the Confluence search-API latency, plus the model reading multiple full pages. We see end-to-end response times of nine to fourteen seconds for a non-trivial answer, against two to four seconds for the RAG pipeline.
For a tax consultancy where the rule changed twice this week, real-time wins. For a marketing agency answering brand-guideline questions that change once a quarter, the speed of RAG wins. Match the architecture to how fast your truth moves.
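
One way to put a number on refresh latency for the RAG side is to compare when each page was last edited at the source with when it was last indexed. The sketch below assumes you can export both as mappings from page ID to timestamp; the function name, the parameters and the idea of a lag budget are hypothetical, and how you obtain the timestamps depends on your indexer and your wiki.

```python
from datetime import datetime, timedelta


def stale_pages(indexed_at, modified_at, now, budget=timedelta(0)):
    """Return {page_id: lag} for pages the index no longer reflects.

    indexed_at:  {page_id: datetime the indexer last embedded the page}
    modified_at: {page_id: datetime the page was last edited at the source}
    A page counts as stale if it was edited after its last index run or was
    never indexed; lag is how long the live edit has been waiting. Only
    pages whose lag exceeds `budget` are returned.
    """
    stale = {}
    for page_id, modified in modified_at.items():
        indexed = indexed_at.get(page_id)
        if indexed is None or modified > indexed:
            lag = now - modified
            if lag > budget:
                stale[page_id] = lag
    return stale
```

Running this daily against a nightly indexer shows whether the lag stays inside one night or quietly stretches to several days.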
Maintenance ownership past month fourteen
This is the question that gets under-weighted in the architecture meeting and then dominates year two.
Closed-corpus RAG needs an owner. Chunk size, overlap, the choice of embedder, the reranker threshold, the prompt the retriever sends to the LLM, all of those drift. Six months in, somebody on the client's side has to be able to say: "the agent is missing tables in the technical SOPs because we chunked by token count and the tables span chunks; let's switch to header-aware chunking for the engineering space". If that person is not on payroll, the agent quietly rots.
File-search routing has almost no chunking knobs. You write a system prompt, you scope which spaces or sites the tool can search, you tune the refusal behaviour. That is roughly it. The model does the chunk-equivalent work at read time. The maintenance cost shifts from "data engineer who understands embeddings" to "policy person who understands which folders the agent should and should not see".
That second shape is usually staffable inside a sub-€12M firm. The first one is not. We have seen exactly two clients of that size successfully retain a person who could keep a custom RAG pipeline healthy past month fourteen. Both were software companies.
The scoping point is also where the recent agent-going-rogue stories matter. When an autonomous agent in the wild starts touching files it should not, the root cause is almost always permission breadth, not model behaviour. A file-search agent with a tight folder allow-list is easier to bound than a vector store that already ingested everything six months ago.
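
The allow-list idea can be sketched as a thin filter between the search tool and the model. The result shape (dicts with a space key) and the ALLOWED_SPACES values are assumptions for illustration only; a real Confluence or SharePoint integration would apply the same rule inside the tool, on top of the source system's own permissions rather than instead of them.

```python
ALLOWED_SPACES = frozenset({"TAX", "HR-POLICY"})  # hypothetical space keys


def filter_allowed(results, allowed=ALLOWED_SPACES):
    """Drop search hits whose space is not on the allow-list.

    results: iterable of dicts like {"title": str, "url": str, "space": str}
    Hits without a space key are dropped as well (deny by default).
    """
    return [hit for hit in results if hit.get("space") in allowed]
```

Denying by default matters here: a hit the tool cannot attribute to a space never reaches the model.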
The scoring sheet that fits on one page
We score the two architectures on seven rows, each out of five, weighted to taste:
Criterion                       RAG   File-search
Hallucination rate (eval)        3         4
Refresh latency                  2         5
Year-2 maintenance load          2         4
Cost per 1k answers              5         3
Source-system search quality     5         2
Auditability (which page)        5         4
Vendor data-retention risk       4         2
The vendor-retention row is where current model-provider policy matters. Default retention windows vary by vendor and by model tier; some keep prompts and responses on their side for a window measured in days before deletion. Read your provider's policy before you sign: see for instance anthropic.com/legal. For Dutch firms whose own DPA promises seven-day deletion to their clients, this is not a deal-breaker, but it is a paperwork item to brief the DPO on. The same row, scored against an on-prem embedding model behind your own RAG pipeline, looks better.
If the two weighted totals are within two points of each other, the right answer is almost always file-search routing, because year-two maintenance is the silent killer. If RAG wins by four or more, you have a real custom-pipeline use case: high volume, narrow corpus, strict cost ceiling.
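
The weighting and the rule of thumb can be written down as a short sketch. The scores are copied from the sheet above; the weights argument, the function names and the choice to apply the two-point and four-point thresholds to the raw weighted sums are assumptions, since the method only says "weighted to taste".

```python
# Scores from the sheet: (RAG, file-search), each out of five.
SCORES = {
    "Hallucination rate (eval)": (3, 4),
    "Refresh latency": (2, 5),
    "Year-2 maintenance load": (2, 4),
    "Cost per 1k answers": (5, 3),
    "Source-system search quality": (5, 2),
    "Auditability (which page)": (5, 4),
    "Vendor data-retention risk": (4, 2),
}


def weighted_totals(scores, weights=None):
    """Return (rag_total, file_search_total); rows without a weight count as 1."""
    weights = weights or {}
    rag = sum(weights.get(row, 1.0) * pair[0] for row, pair in scores.items())
    fs = sum(weights.get(row, 1.0) * pair[1] for row, pair in scores.items())
    return rag, fs


def recommend(rag, fs):
    """Apply the rule of thumb from the text to the two weighted totals."""
    margin = rag - fs
    if margin >= 4:
        return "rag"
    if margin <= 2:
        return "file-search"
    return "judgement call"  # the text gives no rule for this band
```

With every weight left at 1, the sheet totals 26 for RAG and 24 for file-search, which lands inside the two-point band and so points to file-search routing.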
When file-search routing actually loses
It is fashionable to argue that vector stores are obsolete. We do not believe that. File-search routing loses cleanly in three cases.
The corpus is not searchable through a native API. If your knowledge lives in a custom PHP intranet, a Joomla site, or a folder of 11,000 PDFs on a NAS, there is no search tool to route to. You will be building one anyway, at which point you are halfway to a vector store.
The volume is high enough that per-query token cost dominates. File-search routing reads more tokens per answer. At a thousand answers a day, that adds up. RAG with a small reranker beats it on unit economics.
You need deterministic citations for compliance. RAG with strict citation enforcement makes "which chunk did you use" trivially auditable. File-search agents will sometimes paraphrase across three documents and lose the precise paragraph.
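
The unit-economics case, the second of the three above, is plain arithmetic. The sketch below makes it explicit under a simplifying assumption of one blended price per million tokens (real price lists usually separate input and output tokens). The function name and parameters are hypothetical, and every input is yours to measure.

```python
def daily_token_cost(answers_per_day, tokens_per_answer, price_per_million_tokens):
    """Return daily model spend, in whatever currency the price is quoted in."""
    return answers_per_day * tokens_per_answer * price_per_million_tokens / 1_000_000
```

Compute it once per architecture with your own measured tokens per answer; the ratio between the two results is what the cost row of the sheet is really scoring.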
For most sub-€12M services firms, the year-two maintenance question dominates everything else on the scoring sheet. Score it first, then argue about embeddings.
What we did for one Utrecht tax advisory
When we built the knowledge agent for that tax advisory last quarter, the partner asked us for RAG. We ran the eval. File-search routing scored 4/5 on hallucination rate against their own ticket archive; the planned RAG pipeline scored 3/5 and was going to need a half-time data engineer to keep alive. We shipped file-search routing with a tight space allow-list, billed less, and they stopped getting screenshots forwarded at 8am. If you want to see the scoring sheet on real numbers, the work lives under our AI agents practice.
The smallest thing you can do today: pull fifty real questions from your own ticket history, paste them into a spreadsheet, and mark for each one whether the answer changed in the last week. If more than ten did, your refresh-latency budget is hours, not days. That single number kills half the architecture meeting before it starts.
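
If the spreadsheet is exported as CSV, the count can be automated. This is a minimal sketch that assumes a column named changed_last_week holding yes or no; the column name, the file layout and the function names are assumed conventions, not something the method prescribes.

```python
import csv


def load_rows(path):
    """Read the exported spreadsheet into a list of dicts, one per question."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def refresh_budget(rows, threshold=10):
    """Return "hours" if more than `threshold` answers changed last week, else "days"."""
    changed = sum(
        1 for row in rows if row["changed_last_week"].strip().lower() == "yes"
    )
    return "hours" if changed > threshold else "days"
```

The default threshold of ten matches the fifty-question set described above; scale it if your sample is a different size.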
Key takeaway
Match the knowledge-agent stack to how fast your truth moves and who can maintain it in month fourteen, not to the vendor demo.
FAQ
When is closed-corpus RAG the right call for a sub-€12M firm?
When your corpus has no good native search, query volume is high enough that token cost dominates, and you have a data engineer who can keep the pipeline healthy past month fourteen.
Does file-search routing skip embeddings entirely?
Yes. The model issues live queries against the source system's native search and reads the documents it gets back. No vector store, no nightly index job, no chunking rules to maintain.
How do I measure hallucination rate without trusting a vendor demo?
Build a fifty-question eval set from your own ticket archive, run both architectures against it, and count correct-with-source, refused-with-reason, and wrong. Above 5% wrong fails the gate.
What about year-two maintenance?
RAG needs a chunking and embedding owner. File-search routing needs a folder-permissions owner. Most sub-€12M firms can staff the second role, not the first.