RAG for a private bank: the playbook behind every cited answer
On the Meir at 09:47, a relatiebeheerder needs to answer a client about a 2019 autocallable. The KID was reissued last month. The FSMA reads logs. Now scale that to 25 people.

It is Tuesday, 09:47, on the Meir in Antwerpen. A relatiebeheerder opens an email from a long-time client who wants to know whether a 2019 autocallable still matches her risk profile, and whether the latest KID reflects the recent issuer downgrade. The fiche lives in Olympic Banking, last touched in 2019. The suitability report is a PDF in a shared drive. The KID has been re-versioned twice this year. The relatiebeheerder has 34,000 product fiches behind him, fourteen other clients in his inbox, and one rule from the regulator: no recommendation leaves this office without a citation a compliance officer can open.
That one email is one of roughly 1,180 we counted in a single week when we started the project. The bank has 25 people. The math does not work without help. A RAG system that can absorb a third of that mail volume, with citations a compliance officer trusts on the first read, is not a productivity upgrade. It is the reason this project got funded at all.
The compliance constraint that shaped every architectural choice
Before we wrote a line of code, we wrote one sentence on the whiteboard: no token reaches the client-mail draft path without a verifiable citation to a vetted document. Everything else followed from that.
Under MiFID II, the recommendation chain has to be reconstructable months after the fact. The FSMA reads logs. A confident answer with no source is worse than no answer — it is a compliance event. So the system we built is not "search plus generation." It is a citation engine that occasionally writes prose.
The hard rule, baked into the runtime:
def gate(draft: Draft) -> Draft:
    if not draft.citations:
        raise NoCitationError("blocked before draft path")
    for c in draft.citations:
        if c.doc.status != "vetted":
            raise UnvettedSourceError(c.doc.id)
        if not draft.text_supports(c.span):
            raise UngroundedClaimError(c.span)
    return draft
If the gate raises, the draft does not reach the relatiebeheerder. It routes to a review queue with a one-line reason. We never want a confident hallucination to land in an Outlook draft folder where a tired hand will click send.
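The routing around the gate can be sketched roughly as follows. This is a minimal illustration, not the bank's code: the exception names mirror the gate above, while `GateError`, `ReviewItem`, `route`, and the in-memory `review_queue` are hypothetical stand-ins for whatever queue the real system uses.

```python
from dataclasses import dataclass
from typing import Callable


class GateError(Exception):
    """Assumed common base class for anything the citation gate refuses."""


class NoCitationError(GateError):
    pass


class UnvettedSourceError(GateError):
    pass


class UngroundedClaimError(GateError):
    pass


@dataclass
class ReviewItem:
    draft_id: str
    reason: str  # the one-line reason shown to the reviewer


def route(draft_id: str, run_gate: Callable[[], None],
          review_queue: list[ReviewItem]) -> bool:
    """Return True if the draft may proceed, False if it was queued for review."""
    try:
        run_gate()
    except GateError as exc:
        # A blocked draft never reaches the draft folder; it is queued instead.
        review_queue.append(ReviewItem(draft_id, f"{type(exc).__name__}: {exc}"))
        return False
    return True
```

Giving the three errors one base class keeps the caller to a single `except` clause, so a new refusal reason added later is routed to review by default rather than slipping through.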
Inventory before vectors
The first six weeks were not embeddings, not prompts, not even Python. We walked the archive.
The bank ran on Olympic Banking, a custom Sybase-era core that had been extended since 2012. Fourteen years of XML exports, two Access front-ends nobody admitted to using, and a SharePoint that contained the same suitability report in four versions across three folders. Of the 34,000 product fiches, only 11,200 were live instruments. The remaining 22,800 were either matured, withdrawn, or duplicates from a 2017 migration that had never been finished.
We classified everything into five buckets:
- Vetted & live — KIDs, suitability reports, current fiches, product committee minutes. Indexable.
- Vetted & archived — last-known-good versions of withdrawn products. Indexable, flagged with a withdrawal date.
- Draft — anything the product committee had not signed. Quarantined.
- Duplicate — older copies of a still-live document. Removed from the index, retained for audit.
- Unknown provenance — Word files nobody could attribute to a committee. Excluded entirely.
About 19% of what the bank thought was reference material did not survive the audit. That percentage is the most important number in this post. If you skip it, every recall and precision improvement you make later is measuring the wrong thing.
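The five buckets can be expressed as a small decision function. The sketch below is an assumption about how such a rule might look: the field names on `ArchiveDoc` and the order of the checks are ours for illustration, and in practice much of this classification was a human judgement, not a flag in a database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    VETTED_LIVE = "vetted_live"
    VETTED_ARCHIVED = "vetted_archived"
    DRAFT = "draft"
    DUPLICATE = "duplicate"
    UNKNOWN_PROVENANCE = "unknown_provenance"


# Only these two buckets are allowed into the index.
INDEXABLE = {Bucket.VETTED_LIVE, Bucket.VETTED_ARCHIVED}


@dataclass
class ArchiveDoc:
    committee: Optional[str]     # None if nobody can attribute the file
    signed_off: bool             # product committee signature present
    superseded: bool             # a newer copy of the same live document exists
    withdrawn_on: Optional[str]  # ISO date if the product was withdrawn


def classify(doc: ArchiveDoc) -> Bucket:
    # Provenance is checked first: an unattributable file is excluded outright.
    if doc.committee is None:
        return Bucket.UNKNOWN_PROVENANCE
    if not doc.signed_off:
        return Bucket.DRAFT
    if doc.superseded:
        return Bucket.DUPLICATE
    if doc.withdrawn_on is not None:
        return Bucket.VETTED_ARCHIVED
    return Bucket.VETTED_LIVE
```

A document is indexed only if `classify(doc) in INDEXABLE`; everything else stays out of retrieval, whatever happens to it for audit purposes.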
Chunking by KID section, not by tokens
Most RAG tutorials chunk on 500-token windows with 50-token overlap. We did not. A KID has fixed sections under PRIIPs — "What is this product?", "What are the risks and what could I get in return?", "What are the costs?" — and a citation only matters if it lands inside the right section. We chunked on the section, never across it.
SECTIONS = [
    "what_is_this_product",
    "risks_and_return",
    "costs",
    "how_long_should_i_hold_it",
    "complaints",
]


def chunk(doc: Document) -> list[Chunk]:
    sections = parse_kid_sections(doc.pdf)
    return [
        Chunk(
            id=f"{doc.id}#{name}",
            doc_id=doc.id,
            section=name,
            text=text,
            vetted_at=doc.vetted_at,
            isin=doc.isin,
        )
        for name, text in sections.items()
        if name in SECTIONS and text
    ]
One side effect we did not expect: this made the index a third smaller than naive chunking, and recall went up. A risk section never gets confused with a costs section because they are different rows in the database.
There is a second reason worth naming. KID sections evolve at different rates. The risks block updates when the product committee revisits volatility assumptions, often quarterly. The costs block moves only when the issuer reprices, which is rare. Token-chunking forces a full reindex of the document on any edit, because a change early in the text shifts every window after it. Section-chunking reindexes only the rows that actually moved, and the audit log shows which section was touched, by whom, on what date. That property paid for itself the first time compliance asked who had changed the risk wording on a specific ISIN.
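One way to find the rows that actually moved is to compare a content hash per section between two versions of the same KID. This is a sketch under that assumption; `section_hashes` and `changed_sections` are illustrative names, and the inputs are the same name-to-text mapping a section parser would return.

```python
import hashlib


def section_hashes(sections: dict[str, str]) -> dict[str, str]:
    """Map each section name to a SHA-256 digest of its text."""
    return {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, text in sections.items()
    }


def changed_sections(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Names of sections that are new or whose text differs from the old version."""
    old_h, new_h = section_hashes(old), section_hashes(new)
    return sorted(name for name in new_h if old_h.get(name) != new_h[name])
```

Only the names returned by `changed_sections` need re-embedding. Sections that disappear from the new version are not reported here and would need a separate delete step.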
Hybrid retrieval with a reranker
Dense vectors alone underperformed badly on ISIN codes, ticker mentions, and the shorthand a relatiebeheerder types into a query box. We added BM25 over the same chunks, fused the top 50 from each, and reranked the union with a cross-encoder.
def retrieve(query: str, k: int = 8) -> list[Chunk]:
    bm25_hits = bm25.search(query, top_k=50)
    dense_hits = vectors.search(embed(query), top_k=50)
    fused = rrf([bm25_hits, dense_hits])
    reranked = reranker.score(query, fused[:80])
    return [c for c in reranked if c.score > 0.42][:k]
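For readers who have not met it, `rrf` here stands for reciprocal rank fusion. A minimal sketch of the standard formula follows. It works on chunk ids instead of hit objects, and the constant `k=60` is the conventional default from the literature, not a value taken from this project.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties on the id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

Because only ranks are used, the BM25 and dense scores never need to be put on a common scale, which is the main reason to prefer this over a weighted sum.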
The 0.42 floor was not a guess. We held out 220 historical relatiebeheerder questions with known correct sources and walked the threshold until F1 plateaued. Below 0.42 we were pulling in adjacent products. Above 0.55 we were missing valid answers when the relatiebeheerder misspelled an instrument name.
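Walking the threshold amounts to computing F1 at each candidate value over the held-out set. The sketch below assumes the held-out data has been flattened into (reranker score, is-correct-source) pairs; `f1_at` and `sweep` are hypothetical helper names, and no real evaluation data is implied.

```python
def f1_at(threshold: float, scored: list[tuple[float, bool]],
          total_relevant: int) -> float:
    """F1 when only chunks scoring above the threshold are kept."""
    kept = [relevant for score, relevant in scored if score > threshold]
    true_positives = sum(kept)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(kept)
    recall = true_positives / total_relevant
    return 2 * precision * recall / (precision + recall)


def sweep(scored: list[tuple[float, bool]], total_relevant: int,
          thresholds: list[float]) -> list[tuple[float, float]]:
    """Return (threshold, F1) pairs so the plateau can be read off."""
    return [(t, f1_at(t, scored, total_relevant)) for t in thresholds]
```

Plotting the output of `sweep` over a fine grid of thresholds shows where F1 stops improving, which is the point to pin as the floor.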
Two languages, one index
A Belgian private bank serves clients in Dutch, French, and occasionally German. The product committee writes KIDs in Dutch first; a regulated French translation lands within ten business days. Two languages means two embedding spaces if you are not careful, and a relatiebeheerder asking a Walloon client's question in French should still retrieve the Dutch source if that is the only version signed off this week.
We use a multilingual embedding model with a shared space and tag every chunk with its language. The retrieval call carries the query language; the generation call carries the response language. If the only vetted chunk is in Dutch and the query is in French, the system retrieves the Dutch chunk, generates the answer in French, and cites the Dutch span. A footnote on the citation tells the compliance officer what happened, and which translation pair was used.
Cross-language retrieval was the feature that bought the most goodwill in month one. It was also the one most likely to fail silently when a model release moved the embedding space without anyone noticing. We pinned the embedder and added a regression test that asks a known French query against a Dutch KID and fails the build if the top-1 cited chunk changes.
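A regression check of that kind can be very small. The sketch below is an assumption about its shape: the query text, the chunk id `KID-EXAMPLE#risks_and_return`, and the `retrieve(query, language)` signature are all placeholders, with the real retrieval function injected by the test harness.

```python
from typing import Callable

# Hypothetical pinned expectation: a known French query and the Dutch chunk it must cite.
PINNED = {
    "query": "Quels sont les risques de ce produit ?",
    "query_lang": "fr",
    "expected_top1": "KID-EXAMPLE#risks_and_return",
}


def check_cross_language_top1(retrieve: Callable[[str, str], list[str]]) -> None:
    """Raise AssertionError if the top-1 chunk for the pinned query has changed."""
    hits = retrieve(PINNED["query"], PINNED["query_lang"])
    top1 = hits[0] if hits else None
    if top1 != PINNED["expected_top1"]:
        raise AssertionError(
            f"cross-language top-1 drifted: expected {PINNED['expected_top1']}, got {top1}"
        )
```

Run in CI against the pinned embedder, a check like this turns a silent shift in the embedding space into a failed build.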
Citation-first generation
The generation step is constrained. The model is given the retrieved spans and a system instruction that, simplified, says: state only facts that appear verbatim or near-verbatim in the spans below; cite each sentence with the chunk id; if you cannot answer from the spans, say so and stop.
Then we post-validate. Every sentence in the draft is matched back to its claimed span with a textual-entailment check. Sentences that fail are stripped. If the entire draft fails, the system returns "draft unavailable, escalate to product desk" — never a fabricated paragraph.
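The post-validation step can be sketched as a filter over cited sentences. The entailment model itself is not shown: `entails(premise, hypothesis)` is an injected callable and an assumption of this sketch, as are the `CitedSentence` type and the bracketed citation format.

```python
from dataclasses import dataclass
from typing import Callable

ESCALATION = "draft unavailable, escalate to product desk"


@dataclass
class CitedSentence:
    text: str
    chunk_id: str  # the chunk the model claims supports this sentence


def post_validate(sentences: list[CitedSentence], spans: dict[str, str],
                  entails: Callable[[str, str], bool]) -> str:
    """Keep only sentences whose cited span supports them; escalate if none do."""
    kept = [
        s for s in sentences
        # A citation to a chunk that was never retrieved fails automatically.
        if s.chunk_id in spans and entails(spans[s.chunk_id], s.text)
    ]
    if not kept:
        return ESCALATION
    return " ".join(f"{s.text} [{s.chunk_id}]" for s in kept)
```

Note that the check is per sentence against its own cited span, not against the retrieved set as a whole, so a sentence cannot borrow support from a chunk it did not cite.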
If your guardrail only runs at the retrieval layer, you will ship hallucinations. The check that matters is the one between the generated sentence and the cited span, after generation, before the draft path.
Access control before retrieval
Not every relatiebeheerder should see every fiche. Some products are restricted to private-banking clients; some to specific suitability tiers; some to the team that originated them. The naive approach is to filter results after retrieval. Do not do this. If a restricted document is retrieved and then filtered out, you have leaked its existence in latency and in occasional ranking artefacts that show up when the same query returns visibly different counts for two roles.
We push the access predicate into the retrieval call itself. The vector store knows the caller's role and the chunk's clearance, and the index never returns a chunk the caller cannot see. The refusal path is identical whether the document does not exist or the caller is not allowed to see it. Two months in, a compliance officer asked us how we would prove that property; we ran the eval set with three different role personas and showed the diff. The conversation ended in fifteen minutes.
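The difference between pre-filtering and post-filtering is easiest to see in a toy index. The sketch below is not a real vector store: scores are precomputed, and `IndexedChunk`, `allowed_roles`, and `search` are hypothetical names. Most production vector stores expose an equivalent metadata filter on the query call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    id: str
    score: float              # similarity to the query, precomputed for this toy example
    allowed_roles: frozenset  # roles cleared to see this chunk


def search(index: list[IndexedChunk], caller_role: str,
           top_k: int = 8) -> list[IndexedChunk]:
    # The predicate is applied while selecting candidates, so a restricted chunk
    # never enters the ranked list and cannot influence counts or ordering.
    visible = (c for c in index if caller_role in c.allowed_roles)
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

With this shape, a caller without clearance gets the same empty result for a restricted document as for a document that does not exist, which is the property the role-persona diff was meant to demonstrate.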
Evaluation, not vibes
We built a golden set of 412 questions, hand-annotated by two of the bank's senior relatiebeheerders over four afternoons. Each question carries the correct source document, the correct section, and a one-line summary of the correct answer. We re-run the full evaluation every time we change a model, a threshold, or a chunking rule.
The number compliance cares about is not accuracy. It is citation correctness: of answered questions, how often is the cited span actually the one that supports the answer. At launch we measured 96.1% citation correctness, 88% answered, 11.4% refused, and effectively zero ungrounded sentences across the eval set. Refusals route to the product desk.
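The three rates are simple to compute once each evaluation result records what was cited. The sketch below assumes a refusal is represented as `cited_chunk=None` and that a citation is correct when it matches the annotated document and section exactly; `EvalResult` and `summarise` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    expected_chunk: str         # annotated source document and section
    cited_chunk: Optional[str]  # None means the system refused to answer


def summarise(results: list[EvalResult]) -> dict[str, float]:
    answered = [r for r in results if r.cited_chunk is not None]
    correct = sum(r.cited_chunk == r.expected_chunk for r in answered)
    answered_rate = len(answered) / len(results)
    return {
        "answered_rate": answered_rate,
        "refused_rate": 1 - answered_rate,
        # The denominator is answered questions only, not the whole golden set.
        "citation_correctness": correct / len(answered) if answered else 0.0,
    }
```

Keeping refusals out of the citation-correctness denominator matters: a system that refuses more often should not look more or less correct for that reason alone.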
Rollout
We rolled out to five relatiebeheerders for the first two weeks. Their feedback was unglamorous: the answer is correct but the tone is too formal; the answer is correct but it cites the wrong KID version when two are equally fresh; the answer is correct but the relatiebeheerder needs the response in French because the client is Walloon. Each of those is a different ticket. None of them are model choices.
By week six, all 25 were on the system. Average response time on a relatiebeheerder query dropped from "I will get back to you this afternoon" to under fifteen seconds. The bank's compliance officer audits a random 2% of drafts weekly and signs off on the chain.
What broke first
Three things, all predictable in hindsight.
First, the Olympic Banking archive's character encoding was inconsistent. Older fiches mixed Windows-1252 and UTF-8, and silently corrupted Dutch and French accents made retrieval miss exact-match queries on things like "rémunération." We wrote a normaliser and re-indexed.
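A normaliser for this kind of archive can start from a simple rule: try UTF-8 first, fall back to Windows-1252, then normalise the Unicode form. This is a minimal sketch under the assumption that each file uses one encoding throughout; text that was already mis-decoded and saved again needs a separate repair pass that is not shown here.

```python
import unicodedata


def normalise(raw: bytes) -> str:
    """Decode bytes that may be UTF-8 or Windows-1252 and return NFC text."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Accented Windows-1252 bytes are almost never valid UTF-8, so this
        # fallback is a reasonable guess; undefined bytes become U+FFFD.
        text = raw.decode("cp1252", errors="replace")
    # NFC makes "e" plus a combining accent compare equal to the precomposed "é".
    return unicodedata.normalize("NFC", text)
```

The NFC step matters as much as the decoding: without it, two visually identical spellings of "rémunération" can still fail an exact-match query.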
Second, when a KID was re-versioned mid-week, the index still served the old chunk until the next nightly job. A relatiebeheerder got a perfectly cited answer to a stale document. We moved to event-driven indexing on the document management system's webhook and added a freshness check at retrieval that downweights any chunk older than the latest vetted version for the same ISIN.
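The freshness rule can be sketched as a score adjustment applied after retrieval. The penalty factor of 0.5, the `Hit` type, and the `latest_vetted` lookup (ISIN to date of the newest vetted version) are assumptions for illustration, not the values used in production.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    chunk_id: str
    isin: str
    vetted_at: date
    score: float


def apply_freshness(hits: list[Hit], latest_vetted: dict[str, date],
                    penalty: float = 0.5) -> list[Hit]:
    """Downweight chunks older than the latest vetted version for their ISIN."""
    adjusted = []
    for h in hits:
        latest = latest_vetted.get(h.isin)
        stale = latest is not None and h.vetted_at < latest
        score = h.score * penalty if stale else h.score
        adjusted.append(Hit(h.chunk_id, h.isin, h.vetted_at, score))
    # Re-sort so a fresh chunk can overtake a stale one that matched better.
    return sorted(adjusted, key=lambda h: h.score, reverse=True)
```

Downweighting instead of dropping keeps the stale chunk available when nothing fresher matches, while ensuring the current version wins whenever both are retrieved.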
Third, the vector store had a quiet IO ceiling we did not feel until concurrency rose. At five relatiebeheerders the latency sat under 1.2 seconds. At twenty-five, two simultaneous queries against the same shard pushed p95 close to five seconds. The fix was unglamorous: a read replica, connection pooling, and a 50 ms in-memory cache on identical queries inside a sliding ten-second window. p95 settled at 1.6 seconds. We did not need a fancier vector store. We needed a sysadmin's instinct.
The smallest thing you can do this week
Before any of this — before a vector store, before a reranker, before a prompt — open the folder you would ask a RAG system to draw from and answer one question: can a compliance officer tell, today, which document is the canonical version? If the answer is "mostly," you do not have a RAG project. You have a document-management project that happens to come first.
When we built this for the bank on the Meir, the piece we kept and reused was the citation gate at the top of this post: the small function that refuses to let an ungrounded sentence reach a draft folder. It is the kind of artefact that survives a model change, a vendor change, and a regulator change. If you are looking at a similar archive with a regulator over your shoulder, this is the shape of AI-agent work we keep doing at ABN.
Key takeaway
In a regulated RAG system the citation chain is the product. The model is replaceable; the gate that refuses to let an ungrounded sentence reach a draft folder is not.
FAQ
Why chunk by KID section instead of fixed-token windows?
Because a citation only helps compliance if it lands inside the right regulated section. Section-aware chunks make "risks" and "costs" separate rows and stop the model from blending them, and they let you reindex only the section that actually changed.
What does "citation correctness" actually measure?
Of the questions the system answers, the percentage where the cited span genuinely supports the generated sentence. Accuracy can be high while citation correctness is low, and only the second one is auditable.
How do you handle a KID that gets re-versioned mid-week?
Event-driven re-indexing on the DMS webhook, plus a freshness rule at retrieval that downweights any chunk older than the latest vetted version for the same ISIN.
How do you stop a French query from leaking a restricted Dutch document?
Push the access predicate into the retrieval call itself, not a post-filter. The vector store never returns a chunk the caller is not cleared to see, and the refusal path is identical whether the document does not exist or the caller is not allowed to see it.
Does the FSMA require human review of every AI-drafted answer?
The regulator does not mandate a specific workflow, but it does require a reconstructable recommendation chain. In practice, that means every draft has to carry citations a reviewer can open months later.