Citation-first RAG: 22,000 NHG aktes, one safe write path

A 24-person mortgage-advisor network in Breda runs 1,560 weekly questions through a RAG agent that refuses to answer until every passage points back to a vetted source.

Jacob Molkenboer · Founder · A Brand New Company · 17 Jun 2026 · 8 min

On a Tuesday in March, an advisor at a Breda mortgage office had a klant (client) on the line asking whether a self-employed construction worker with two years of annual accounts (jaarcijfers) could still pass the NHG assessment under the 2026 voorwaarden (terms and conditions). She had thirty seconds before the silence got awkward. The relevant passage was somewhere in an NHG-leidraad (guideline) PDF, somewhere in a Stater LMS export, and somewhere in a memo the compliance lead wrote in December. Her actual workflow was three Chrome tabs and a gut feeling.

That office is one of 24 in the network. Together they handle around 22,000 active aktes (mortgage deeds), and they process the kind of questions that, if answered wrong, end up at the AFM. They had asked us to build something like ChatGPT for their aktes, but with one hard guarantee: no answer ever reaches the klantadvies-portaal (client-advice portal) write path without a citation pointing back to a vetted source. That last clause is what made the project interesting.

The shape of the problem

The brief was specific. About 1,560 mortgage questions per week (we measured for a month before scoping). Five categories: NHG-grenzen (NHG limits), energielabel-eisen (energy-label requirements), tweede-woning (second-home) rules, BKR quirks, and oversluiting (refinancing). Two sources of regulatory truth: the NHG voorwaarden 2026 (PDF, 240 pages) and the AFM leidraad Hypothecaire kredietverlening. Two data sources for client-specific context: a Stater LMS that has been running since 2012, and a Sharepoint full of notes files going back six years.

The Stater LMS is the gotcha. If you have worked with Dutch mortgage backends you know the story: a stable, slow, SOAP-shaped backend that nobody wants to touch. We had read-only access to a replica, which is the right answer. The 22,000 aktes were exported in nightly batches and re-indexed into our own store. We never let the agent talk to Stater live.

Why naive RAG was the wrong starting point

The first instinct, and the one we refused to ship, was to chunk all the PDFs, embed them, throw them in a vector database, and call an LLM with top-k context. We built exactly that, but only as a baseline. On 200 evaluator questions, it produced confident-sounding answers that were wrong in 17% of cases. Wrong as in: the answer matched the 2023 voorwaarden, not the 2026 ones. Wrong as in: the AFM leidraad was paraphrased, not quoted. Wrong as in: a passage from the Sharepoint notes was hallucinated outright.

For a regulator-facing klantadvies workflow, 17% is a project killer. The AFM's mortgage supervision pages are explicit about advice traceability: an advisor has to be able to point at the basis of every recommendation. A black-box LLM that paraphrases is, in this sector, worse than no tool at all.

So we threw out the baseline and started from a different premise. The agent is not allowed to write a single token into the portaal field until a citation gate has passed.

The citation-first retrieval pipeline

The pipeline reads, roughly, like this:

user vraag
  -> intent classifier (5 categories + "other")
  -> source router (NHG | AFM | Stater | Sharepoint | klant-akte)
  -> retriever (hybrid: BM25 + dense, separate index per source)
  -> reranker (cross-encoder, top 8 -> top 3)
  -> citation validator (must match source hash + page + paragraph)
  -> answer composer (LLM, with strict template)
  -> write gate (rejects any answer without >=1 valid citation)
  -> klantadvies-portaal
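
The diagram names a hybrid retriever but does not say how the BM25 and dense rankings are merged. One common option is reciprocal rank fusion (RRF). The sketch below assumes RRF, which is not confirmed as the method used in this project. The passage ids and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 8
) -> list[str]:
    """Fuse several ranked lists of passage ids into one ranking.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the article.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    fused = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return fused[:top_n]


# Hypothetical results from the two retrievers over the same index.
bm25_hits = ["nhg-p12", "nhg-p40", "nhg-p07"]
dense_hits = ["nhg-p40", "nhg-p99", "nhg-p12"]
candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The default top_n of 8 mirrors the "top 8" that the reranker receives in the diagram. Passages found by both retrievers rise to the top, which is the point of running two of them.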

Two design choices did most of the work.

Separate indexes per source

NHG voorwaarden, AFM leidraden, Stater akte-data, and Sharepoint notities each get their own index. A question about NHG-grens 2026 should never pull a passage from a six-year-old internal notitie. The intent classifier routes the query to one or two indexes only. This is not a performance trick. It is a regulatory boundary. An auditor can point at the routing logic and see that internal notes can never become the source of advice on hard NHG rules.
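
A routing boundary like this can be as small as a lookup table. The following is a minimal sketch, not the production router: the intent labels and the intent-to-index mapping are assumptions made up for illustration.

```python
from enum import Enum


class Source(Enum):
    NHG = "nhg"
    AFM = "afm"
    STATER = "stater"
    SHAREPOINT = "sharepoint"


# Hypothetical routing table: intent label -> indexes the retriever may query.
# The intent names and the exact mapping are illustrative assumptions.
ALLOWED_SOURCES: dict[str, frozenset[Source]] = {
    "nhg_grenzen": frozenset({Source.NHG}),
    "energielabel": frozenset({Source.NHG, Source.AFM}),
    "akte_lookup": frozenset({Source.STATER}),
    "oversluiting": frozenset({Source.NHG, Source.STATER}),
}


def route(intent: str) -> frozenset[Source]:
    """Return the indexes allowed for an intent; unknown intents get none."""
    return ALLOWED_SOURCES.get(intent, frozenset())
```

Because the table is plain data, an auditor can read it without reading the retriever. An intent that is not listed gets an empty set, so the safe default is to retrieve nothing.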

Citation validation before composition

The validator is the load-bearing piece. After the reranker picks its top three passages, we recompute each passage's hash and compare it against the original source document. If the hash has drifted (because someone replaced a PDF, or because an export got re-encoded), the passage is rejected. The agent then either retries with the next-best passage or returns "geen geverifieerde bron" (no verified source) instead of an answer.
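
The hash check can be sketched in a few lines. This version assumes SHA-256 over the exact passage text and a hypothetical passage id of the form document:page:paragraph. The article does not specify either, so treat both as placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    passage_id: str    # hypothetical id: document + page + paragraph
    indexed_hash: str  # SHA-256 recorded when the passage was indexed


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verified(ranked: list[Passage], current_text: dict[str, str]) -> list[Passage]:
    """Keep passages whose current source text still hashes to the indexed value.

    Rank order is preserved, so the caller can fall through to the next-best
    passage. A passage missing from the current source counts as drifted.
    """
    return [
        p for p in ranked
        if p.passage_id in current_text
        and sha256_text(current_text[p.passage_id]) == p.indexed_hash
    ]


# Usage: the second passage's source was edited after indexing.
ranked = [
    Passage("doc-a:p12:par3", sha256_text("original text A")),
    Passage("doc-b:p40:par1", sha256_text("original text B")),
]
current = {"doc-a:p12:par3": "original text A", "doc-b:p40:par1": "edited text B"}
survivors = verified(ranked, current)
result = survivors if survivors else "geen geverifieerde bron"
```

Hashing the exact text means that even a whitespace change rejects the passage. That is strict on purpose here, and a looser normalization step would be a separate design decision.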

The write gate, in code

The simplest version of the write gate looks like this. It is the last thing that runs before any token hits the klantadvies-portaal field:

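# ComposedAnswer, Citation and WriteDecision are project types defined elsewhere.
def gate(answer: ComposedAnswer, citations: list[Citation]) -> WriteDecision:
    """Decide whether a composed answer may be written to the klantadvies-portaal."""
    if not citations:
        return WriteDecision.refuse("no citations")

    # Keep only citations whose hash still matches the source, that come from
    # an index allowed for this intent, and that cite a 2026-or-later version.
    valid = [
        c for c in citations
        if c.source_hash_matches()
        and c.is_from_allowed_source(answer.intent)
        and c.version_year >= 2026
    ]
    if not valid:
        return WriteDecision.refuse("no valid citations after validation")

    # Every claim sentence in the answer must map to >=1 cited passage.
    unmapped = [
        s for s in answer.claim_sentences
        if not any(c.supports(s) for c in valid)
    ]
    if unmapped:
        return WriteDecision.refuse(f"{len(unmapped)} unmapped claims")

    return WriteDecision.allow(citations=valid)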

Nothing clever. The cleverness is in what we chose to refuse. If the answer cannot be fully grounded, the advisor sees a "geen advies, alleen bronnen" (no advice, sources only) view: the retrieved passages, with page numbers, and a note that the answer is the advisor's call. That outcome happens about 6% of the time. Advisors hate it slightly less than they hate giving a wrong answer to a klant.

What we did with the Stater LMS

The 14-year-old Stater install was, at first, the scariest part of the project. In practice it was the easiest, because we did not touch it.

The pattern we used is one we have repeated on other legacy backends: read-only nightly export, hashed at the row level, indexed into our own store. The agent never reads from Stater at request time. There are two reasons. The first is operational: a slow upstream call inside a klant-facing flow is a customer service problem waiting to happen. The second is regulatory: if the export pipeline runs at 03:00 and the indexer logs every row's source hash, you have an audit trail that the live-call pattern does not give you for free.
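
Row-level hashing of an export might look like the sketch below. The field names (akte_id, rentevast_maanden), the canonical-JSON encoding, and the shape of the audit entry are assumptions for illustration, not the real Stater schema.

```python
import hashlib
import json
from datetime import date


def row_hash(row: dict) -> str:
    """Deterministic SHA-256 of one exported row.

    Keys are sorted and separators fixed so that the same data always
    produces the same hash, regardless of column order in the export.
    """
    canonical = json.dumps(
        row, sort_keys=True, separators=(",", ":"), ensure_ascii=False, default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_entries(rows: list[dict], export_date: date, key: str = "akte_id") -> list[dict]:
    """One audit record per row: which akte, from which export, with which hash."""
    return [
        {
            "akte_id": str(row[key]),
            "export_date": export_date.isoformat(),
            "source_hash": row_hash(row),
        }
        for row in rows
    ]


# Hypothetical rows from one nightly batch.
batch = [
    {"akte_id": 1001, "rentevast_maanden": 120},
    {"akte_id": 1002, "rentevast_maanden": 60},
]
log = audit_entries(batch, date(2026, 3, 10))
```

Storing the hash next to the export date is what lets an auditor later confirm which version of a row an answer was grounded in.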

For the Sharepoint notes we did something deliberately less aggressive. Those notes are not regulator sources. They are internal context. The agent is allowed to retrieve them, but the validator marks any answer grounded only in Sharepoint as "intern advies, niet voor klantcommunicatie" (internal advice, not for client communication). This is just a metadata flag, but it is the one piece of plumbing that made the compliance lead happy enough to sign off.
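
As a rough sketch, the flag reduces to one check over the sources of an answer's valid citations. The source label "sharepoint" and the fallback label "klantadvies" are hypothetical names chosen for this example.

```python
INTERNAL_ONLY = "intern advies, niet voor klantcommunicatie"
CLIENT_FACING = "klantadvies"  # hypothetical label for everything else


def audience_label(citation_sources: list[str]) -> str:
    """Label an answer by the sources of its valid citations."""
    if not citation_sources:
        # The write gate should have refused this answer before labelling.
        raise ValueError("cannot label an answer without citations")
    if all(source == "sharepoint" for source in citation_sources):
        return INTERNAL_ONLY
    return CLIENT_FACING
```

An answer that cites Sharepoint alongside a regulator source keeps the client-facing label in this sketch. Whether that is acceptable is a compliance decision, not a technical one.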

What 1,560 weekly questions actually looks like

The number sounds neat. The distribution is not. After three months of production traffic, the shape we saw:

About 38% of questions are about NHG-grenzen and energielabel-eisen. The agent answers these cleanly because the source documents are clean.
About 22% are about a klant-specifieke akte ("wat is de huidige rentevastperiode voor akte X"). These resolve from the Stater export.
About 18% are oversluit-vragen with both regulatory and akte-specific context. These are the most expensive in the pipeline because they pull from two indexes and run the validator twice.
About 16% are BKR and AFM-grenzen edge cases.
The remaining 6% are the "geen advies" refusals.

The accuracy numbers, measured against a panel of senior advisors as evaluators: 94% of non-refused answers were judged correct and adequately cited. 5% were correct but cited the wrong leidraad-paragraph (we fixed most of these by tightening the reranker prompt). 1% were wrong. All of that 1% came from the Sharepoint index, which is why we now flag those answers as internal-only by default.
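
To turn the percentages into weekly volumes, the arithmetic is short. The sketch assumes the panel's accuracy rates apply uniformly to a typical week of 1,560 questions, which the measurements above do not strictly guarantee.

```python
WEEKLY_QUESTIONS = 1560

# Shares of weekly traffic, as reported after three months in production.
SHARES = {
    "NHG-grenzen + energielabel": 0.38,
    "akte-specific": 0.22,
    "oversluiting": 0.18,
    "BKR / AFM edge cases": 0.16,
    "refused": 0.06,
}
assert abs(sum(SHARES.values()) - 1.0) < 1e-9  # the five buckets cover all traffic

weekly = {name: round(WEEKLY_QUESTIONS * share) for name, share in SHARES.items()}

# Accuracy rates apply to non-refused answers only.
answered = WEEKLY_QUESTIONS - weekly["refused"]  # 1560 - 94 = 1466
correct_and_cited = round(answered * 0.94)       # 1378
wrong_paragraph = round(answered * 0.05)         # 73
wrong = round(answered * 0.01)                   # 15
```

Under that assumption, about 94 questions a week end in a refusal and roughly 15 answers a week are wrong, which is the volume the internal-only flag has to contain.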

A few things that did not work

Some honest reporting from the build.

We tried using a single combined index with metadata filters instead of separate indexes per source. It tested fine on 50 questions and fell apart at 500, because the dense retriever started cross-mixing passages with similar phrasing across the NHG and AFM corpora. Separate indexes were not just cleaner; they were measurably more accurate.

We tried caching answers at the question level. Mortgage advisors phrase the same question in twelve ways, and the naive cache hit rate was about 3%. We replaced it with caching at the retrieved-passage level, which has a hit rate of about 41% and saves real money.
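
What exactly gets cached per passage is not spelled out here. One possible reading is to memoize whatever expensive per-passage work the pipeline does, keyed by the passage's content hash, so that differently phrased questions that land on the same passage reuse it. The sketch below assumes that reading. The "citation card" payload is a hypothetical example.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class PassageCache(Generic[T]):
    """Memoize expensive per-passage work, keyed by the passage's content hash."""

    def __init__(self) -> None:
        self._store: dict[str, T] = {}
        self.hits = 0
        self.misses = 0

    def get(self, passage_hash: str, compute: Callable[[], T]) -> T:
        """Return the cached value, or compute and store it on a miss."""
        if passage_hash in self._store:
            self.hits += 1
            return self._store[passage_hash]
        self.misses += 1
        value = compute()
        self._store[passage_hash] = value
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Usage: two differently phrased questions retrieve the same passage.
cache: PassageCache[str] = PassageCache()
calls: list[int] = []


def build_card() -> str:
    calls.append(1)  # stands in for the expensive step
    return "citation card for passage abc123"


cache.get("abc123", build_card)
cache.get("abc123", build_card)
```

Keying on the content hash has a useful side effect: when a source document is replaced, its passages get new hashes and the stale entries are simply never hit again.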

We tried to skip the cross-encoder reranker because it adds 400 ms. Without it, accuracy dropped by 6 percentage points. The advisors did not notice the 400 ms. They did notice the wrong answers.

Warning

If you build RAG over regulator-facing documents and you do not version source files by hash, you will eventually serve an answer from a superseded leidraad. The agent will sound confident. The auditor will not be.

What the advisor sees

The interface is dull on purpose. A question field at the top. An answer beneath it. Under the answer, three citation cards: source name, document version, page number, the passage in full. A button to open the source PDF at the cited page. A button to copy the answer-plus-citations block into the klantadvies-portaal.

There is no chat interface. We tried one. Advisors did not want it. What they wanted was a fast lookup that produced text they could paste, with footnotes they could defend if their compliance lead asked.

The smallest thing you could do today

If you are looking at a RAG project in a regulated industry, write the citation validator before you write anything else. Define what a valid citation means for your domain (source whitelist, version constraint, hash check, paragraph mapping), and refuse to ship without it. Everything else (vector store, embeddings, model choice) is a tuning parameter. The validator is the architecture.

When we built the RAG agent for the Breda network, the model was never the bottleneck. The bottleneck was deciding what we were willing to write into a klantadvies-portaal without a source. We solved it by making that decision explicit in code, before the LLM was allowed to compose a single token. The NHG voorwaarden are a public document; the answer your advisor pastes into a portaal becomes a record. Treat the second one with the same seriousness as the first.

Key takeaway

In regulated finance, RAG is not a retrieval problem. It is a citation-integrity problem. Build the write gate before you build the model call.

FAQ

Why use separate indexes per source instead of one combined vector store?

Cross-mixing between NHG voorwaarden and AFM leidraden happens at scale because the phrasing is similar. Separate indexes per source also give an auditor a clear regulatory boundary.

What happens when no valid citation exists for a question?

The agent refuses to write an answer. It shows the advisor the retrieved passages with page numbers and labels the result "geen advies, alleen bronnen" (no advice, sources only). This happens about 6% of the time in production.

Why not connect the agent directly to the Stater LMS at request time?

Speed and audit trail. A nightly read-only export, hashed at the row level and indexed into our own store, is faster for the advisor and gives a verifiable record an auditor can read back.

How accurate is the agent in production?

94% of non-refused answers were judged correct and adequately cited by a panel of senior advisors. 5% cited the wrong leidraad-paragraph. 1% were wrong, all from the Sharepoint index, whose answers are now flagged as internal-only by default.
