RAG under AFM supervision: a Dutch insurance broker playbook

A schade-expert in Gouda picks up at 14:47. A client's freight container is in the IJsselmeer, the policy runs 84 pages, and he has six minutes to find the clause.

Jacob Molkenboer · Founder · A Brand New Company · 21 Jun 2026 · 12 min

It is 14:47 on a Tuesday in Gouda. A schade-expert at a 27-person verzekeringsmakelaar picks up the phone. A client's freight container is on the bottom of the IJsselmeer. The hoofdpolis runs 84 pages, the carrier's small-print module is another 31, and the client wants to know, right now, whether marine cargo extends to inland water and whether the new flood exclusion from the herziene voorwaarden of 2025 applies to a vessel that left harbour the day before the herziening took effect. He has six minutes before the loss adjuster calls back. On his second monitor, a RAG agent has already drafted an answer with the artikel, versie, and effective date attached.

That call happens 1,320 times a week at this firm. Multiply it across 18 insurers, each with its own polisvoorwaarden and each shipping two or three herzieningen per year, then add a 13-year-old dossier-archief in a custom CCS Insurance install that nobody at the office still fully understands. The schade-experts are sharp. They are not search engines.

This is the playbook for the RAG agent we built for them, focused on the part that actually matters: shipping it under AFM supervision without anyone losing sleep over a Kifid complaint.

What the regulator actually wants

The shorthand is “an AI that cites its sources.” The Wft, in practice, demands something narrower. Under article 4:23 every dekkingsadvies must be passend, and every passend advies must be reconstructable. If a schade-expert tells a client “your container is covered,” and the client later disputes that at Kifid, the firm has to show which voorwaarde, in which versie, on which datum, supported the conclusion. Not roughly. Exactly.

An LLM that hallucinates a clause is not a productivity tool in that context. It is a regulatory event. The entire architecture follows from that single constraint, and almost every interesting decision we made was about narrowing what the model is allowed to say.

Takeaway

If a generated passage cannot be traced back to a vetted source with article number, document version, and effective date, the agent does not get to say it.

Why generic RAG falls over on Wft documents

The default RAG recipe — chunk every PDF by 512 tokens, embed with a multilingual model, cosine-similarity search, stuff the top eight into a prompt — works fine for a customer-support knowledge base. It collapses on insurance corpora for three reasons.

First, polisvoorwaarden are not prose. They are nested structures: hoofdstuk → artikel → lid → sub. A clause about flood damage can sit eight levels deep and be modified by a definitions block 40 pages earlier. Chop it at 512 tokens and you lose the definitions. Chop it bigger and the embedding becomes too diffuse to retrieve. Either way, retrieval recall drops and the model fills the gap with plausible nonsense.

Second, versies matter. Insurer A's “Algemene Voorwaarden Transport 2022” and “Algemene Voorwaarden Transport 2024-Q3” share 90% of their text. Cosine similarity loves them equally. The schade-expert needs exactly the version that was attached to the polis on the inceptiedatum of the dossier under discussion. Cosine has no opinion about which one that is.

Third, the 13-year CCS archive is full of free-text claim notes, scanned PDFs, and email threads where binding decisions were actually made. Half of it is not searchable until you OCR it, and the OCR has to preserve dossiernummer and datum or you can never link a retrieved passage back to a real case. The first OCR pass we ran gave us 94% character accuracy and 31% dossier-linkage accuracy. The second pass, with layout-aware extraction and a regex lookup against the CCS index, hit 99.6% on linkage. The first number was useless; the second was the only one that mattered.
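The regex lookup half of that second pass can be sketched roughly as follows. This is a minimal illustration, not the production extractor: the dossiernummer format ("D-" plus six digits), the `CCS_INDEX` mapping, and the `link_page` helper are all hypothetical, and the layout-aware extraction step is left out entirely. The one design point it shows is that a page is linked only when exactly one indexed dossier appears on it; anything missing or ambiguous stays unlinked.

```python
import re
from datetime import date
from typing import Mapping, Optional

# Hypothetical dossiernummer format: "D-" followed by exactly six digits.
# The real CCS numbering scheme is not described in this article.
DOSSIER_RE = re.compile(r"\bD-\d{6}\b")

# Hypothetical stand-in for the CCS index: dossiernummer -> opening date.
CCS_INDEX = {
    "D-104233": date(2016, 3, 14),
    "D-220871": date(2021, 9, 2),
}


def link_page(ocr_text: str, index: Mapping[str, date]) -> Optional[str]:
    """Return a dossiernummer only when exactly one indexed dossier is mentioned."""
    # Keep only regex hits that actually exist in the index, so OCR noise
    # that merely looks like a dossiernummer is ignored.
    known = {hit for hit in DOSSIER_RE.findall(ocr_text) if hit in index}
    # Zero hits or several different hits: leave the page unlinked.
    return next(iter(known)) if len(known) == 1 else None
```

Refusing to guess on ambiguous pages trades coverage for precision, which seems the right trade when the linkage number is the one that matters.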

The corpus, structured

We modelled four document families and gave each its own ingestion path:

Polisvoorwaarden (31,600 documents across 18 insurers). Parsed structurally, not by token count. Each artikel becomes one chunk, with its parent hoofdstuk title, all in-scope definitions, and a metadata header carrying insurer, product, version, and effective date range.
Wft, Bgfo, and lower regulation. Pulled from wetten.overheid.nl with the official article anchors. Versioned by wijzigingsdatum, with deep links back to the source.
Kifid uitspraken. Scraped, deduped, tagged by onderwerp and uitkomst. Used only as supporting precedent, never as primary advice.
CCS dossier-archief. Thirteen years of claims. Internal-only, never cited to a client, used as a retrieval signal for “have we seen this before” and nothing else.

Every chunk carries a vetted: true | false flag. The CCS archive is never vetted. Polisvoorwaarden are vetted by the compliance officer when a herziening lands; this takes them roughly 40 minutes per insurer per release because the diffs are tooled. Wft and Kifid are vetted at ingestion against the official feeds, with a hash check on every refresh.
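As a rough sketch, a polisvoorwaarden chunk with its metadata header could be modelled like this. The attribute names `doc_type`, `effective_range`, `vetted`, and `citation_handle` mirror the ones the citation gate reads; everything else (the `DateRange` class, the inclusive end date, the handle format) is an assumption made for illustration, and the other document families would need their own fields.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    start: date
    end: Optional[date] = None  # None means the version is still in force

    def contains(self, day: date) -> bool:
        # Assumption: both ends of the range are inclusive.
        return self.start <= day and (self.end is None or day <= self.end)


@dataclass(frozen=True)
class Chunk:
    doc_type: str              # e.g. "polisvoorwaarden"
    insurer: str
    product: str
    version: str
    artikel: str
    effective_range: DateRange
    body: str                  # artikel text plus parent title and in-scope definitions
    vetted: bool = False       # unvetted until compliance signs off

    @property
    def citation_handle(self) -> str:
        # Human-readable triple; the exact format is an assumption.
        return f"{self.insurer} / {self.version} / art. {self.artikel}"
```

Defaulting `vetted` to `False` means a chunk that skips the compliance step by accident fails closed rather than open.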

Retrieval in two passes

The retrieval pipeline runs in two passes. The first finds candidates; the second exists to make the agent shut up when it should not speak.

The first pass is hybrid: BM25 over structured fields (insurer, product, dossiernummer, dekkingscode) intersected with dense retrieval over the chunk body. We use a Dutch-tuned embedding model and add a small boost when the chunk's effective date range covers the dossier's inceptiedatum. The first-pass result set is up to 40 chunks; we deduplicate by artikel and rerank with a small cross-encoder before continuing.
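The merge step of that first pass might look something like the sketch below. It assumes the BM25 and dense scores have already been computed and normalised elsewhere, uses BM25 purely as a filter (the intersection) and ranks on the dense score, and leaves out the cross-encoder rerank. The function name, the `boost` value of 0.1, and the shape of the `chunks` mapping are all illustrative assumptions, not the production scoring.

```python
from datetime import date


def first_pass(bm25_scores, dense_scores, chunks, inceptiedatum, boost=0.1, limit=40):
    """Merge two precomputed score maps into a ranked, deduplicated id list.

    bm25_scores, dense_scores: dict of chunk id -> score (assumed normalised).
    chunks: dict of chunk id -> {"artikel": str, "start": date, "end": date or None}.
    """
    ranked = []
    # Intersection: a chunk must be found by both retrievers.
    for cid in bm25_scores.keys() & dense_scores.keys():
        meta = chunks[cid]
        score = dense_scores[cid]
        covers = meta["start"] <= inceptiedatum and (
            meta["end"] is None or inceptiedatum <= meta["end"]
        )
        if covers:
            score += boost  # small boost when the version was in force at inception
        ranked.append((score, cid))

    # Highest score first; chunk id breaks ties so the order is deterministic.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))

    # Deduplicate by artikel, keeping the best-scoring chunk for each one.
    seen_artikels, result = set(), []
    for _, cid in ranked:
        artikel = chunks[cid]["artikel"]
        if artikel not in seen_artikels:
            seen_artikels.add(artikel)
            result.append(cid)
    return result[:limit]
```

Because deduplication happens after the boost is applied, two near-identical versions of the same artikel resolve in favour of the one whose effective range covers the inceptiedatum, provided their dense scores are close.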

The second pass is the citation gate. Before any chunk reaches the drafting prompt, it has to pass three checks:

def citation_gate(chunk, dossier):
    # 1. Version must match what was on the polis at inceptiedatum
    if chunk.doc_type == "polisvoorwaarden":
        if not chunk.effective_range.contains(dossier.inceptiedatum):
            return False, "version mismatch"

    # 2. Source must be vetted; CCS archive never reaches the draft
    if not chunk.vetted:
        return False, "unvetted source"

    # 3. Citation handle must resolve to a live URL or internal doc id
    if not resolve_citation(chunk.citation_handle):
        return False, "broken citation"

    return True, None

A chunk that fails the gate is logged with its reason and dropped from the prompt context. If fewer than two chunks survive, the agent is not allowed to draft a dekkingsadvies at all; it returns a structured onvoldoende_onderbouwing response and routes the question to a human schade-expert with the top-40 first-pass set attached as background reading. The temptation when shipping this was to let the model fall back on the CCS archive in those moments, pattern-matching on a similar old case and writing it up. We did not. The CCS archive is never summarised into an answer. That is precisely the failure mode that ends in a Kifid uitspraak against the broker.
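The floor and the refusal path can be sketched as a thin wrapper around the gate. This is an illustration under assumptions: `GateOutcome`, `apply_gate`, and `MIN_SURVIVING_CHUNKS` are hypothetical names, and the gate is passed in as a callable returning `(ok, reason)` so the sketch stays self-contained.

```python
from dataclasses import dataclass, field

MIN_SURVIVING_CHUNKS = 2  # the floor: fewer survivors means no draft at all


@dataclass
class GateOutcome:
    status: str                                      # "draft" or "onvoldoende_onderbouwing"
    context: list = field(default_factory=list)      # chunks allowed into the prompt
    background: list = field(default_factory=list)   # first-pass set for the human
    rejections: list = field(default_factory=list)   # (chunk, reason) pairs for the log


def apply_gate(candidates, dossier, gate):
    """Run every first-pass candidate through the gate and enforce the floor."""
    survivors, rejections = [], []
    for chunk in candidates:
        ok, reason = gate(chunk, dossier)
        if ok:
            survivors.append(chunk)
        else:
            rejections.append((chunk, reason))

    if len(survivors) < MIN_SURVIVING_CHUNKS:
        # Refuse: nothing reaches the drafting prompt, and the full first-pass
        # set travels with the question as background reading for the human.
        return GateOutcome("onvoldoende_onderbouwing", [], list(candidates), rejections)
    return GateOutcome("draft", survivors, [], rejections)
```

Note that in the refusal branch `context` is empty on purpose: even a single surviving chunk is withheld from the model, so there is no path by which a thin context turns into a draft.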

Drafting with hard citations

The drafting prompt is boring. That is on purpose. It receives the surviving chunks, a system instruction that every claim must be tagged with the chunk id it came from, and a refusal clause for anything not covered.

You are drafting a dekkingsadvies for a schade-expert.
You may only state facts that are supported by one of
the provided passages. Tag every factual sentence with
the passage id in square brackets, e.g. [P-7].
If the provided passages do not cover a question,
write exactly: "Onvoldoende onderbouwing in de vetted bronnen."
Do not summarise, do not extrapolate, do not infer.

The output then runs through a structural validator: every sentence that makes a factual claim must contain at least one [P-x] tag. Sentences without tags are stripped before the draft is shown. The schade-expert sees the draft with footnoted passages, can open any source in one click, and accepts, edits, or rejects it. That accept/edit/reject signal feeds back into the reranker training set.
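A validator along those lines could be sketched as below. It makes several simplifying assumptions that should be read as such: every sentence is treated as a factual claim, the tag sits inside the sentence before its final punctuation, sentence splitting is a naive regex, the exact refusal sentence is let through untagged, and tags pointing at passage ids that were never provided are also stripped. The names `strip_untagged` and `provided_ids` are hypothetical.

```python
import re

# Captures the passage id inside a tag such as "[P-7]".
TAG_RE = re.compile(r"\[(P-\d+)\]")
# Naive splitter: whitespace after . ! ? when followed by a capital or a quote.
# "art. 4.2" is not split because a digit follows the space.
SENTENCE_SPLIT = re.compile(r'(?<=[.!?])\s+(?=[A-Z"])')
REFUSAL = "Onvoldoende onderbouwing in de vetted bronnen."


def strip_untagged(draft: str, provided_ids: set) -> str:
    """Keep only sentences whose tags all refer to passages that were provided."""
    kept = []
    for sentence in SENTENCE_SPLIT.split(draft.strip()):
        if sentence == REFUSAL:
            kept.append(sentence)  # assumption: the refusal line passes untagged
            continue
        tags = TAG_RE.findall(sentence)
        # Drop sentences with no tag, or with a tag the prompt never contained.
        if tags and all(tag in provided_ids for tag in tags):
            kept.append(sentence)
    return " ".join(kept)
```

A production version would want a proper Dutch sentence tokenizer, since abbreviations followed by a capitalised word will still trip the regex.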

What v1 got wrong

We did not arrive at this shape on the first try. The first build had three problems we only spotted in week six of internal testing.

The cross-encoder we used initially was English-tuned, and it kept downranking the Dutch polisvoorwaarden chunks against the Wft references — exactly backwards from what the schade-experts needed. Swapping for a Dutch reranker fine-tuned on roughly 600 accept/edit/reject signals fixed the ordering inside a week.

The citation handles in v1 pointed to raw chunk IDs. Useful for the system, opaque to the compliance officer trying to audit a draft. In v2 every handle carries the human-readable triple of (insurer, document version, artikel) so a click and a glance answers “where did this come from.”

The biggest miss was overconfidence in the gate. Early on we let the model proceed if just one chunk survived. That produced confident-sounding single-source advice that was technically defensible but practically thin. Raising the floor to two chunks cut output volume by 9% and cut complaints from internal reviewers by more than half.

The audit trail AFM actually reads

Every conversation writes a single immutable row: the question, the first-pass retrieval set, the surviving chunks, the prompt sent to the model, the model's raw output, the schade-expert's final advice, and a hash of all of it. The hash chains to the previous row, so any tampering shows up on verification. Storage is append-only; the application user has no DELETE privilege on that table.
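The hash chaining can be illustrated with a small in-memory sketch. It assumes SHA-256 and a canonical JSON serialisation, neither of which is specified above, and it uses a plain list where the real system uses an append-only table; `append_row`, `verify`, and `GENESIS` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first row


def row_hash(payload: dict, prev_hash: str) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same payload
    # always serialises to the same bytes.
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_row(log: list, payload: dict) -> dict:
    """Append one conversation record, chained to the previous row's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    row = {"payload": payload, "prev_hash": prev, "hash": row_hash(payload, prev)}
    log.append(row)
    return row


def verify(log: list) -> bool:
    """Recompute every hash in order; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in log:
        if row["prev_hash"] != prev or row["hash"] != row_hash(row["payload"], prev):
            return False
        prev = row["hash"]
    return True
```

Because each hash covers the previous one, changing a single field in an old row invalidates that row and every row after it, which is what makes a spot check on a laptop meaningful.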

When AFM came in for a thematisch onderzoek in March, the compliance officer pulled six random conversations from the previous quarter and reconstructed the citation chain on a laptop in fifteen minutes. That was the point. Everything else in the system is a means to that fifteen minutes.

What it did in production

Over the first sixteen weeks the agent handled 21,300 internal queries. The numbers the firm cared about:

Median time-to-first-draft on a coverage question dropped from 11 minutes to 38 seconds.
Of the drafts shown, 71% were accepted by the schade-expert with zero edits, 22% with edits, 7% rejected outright.
The onvoldoende_onderbouwing refusal fired on 6.4% of questions. Every one of those went straight to a human, and not into a fabricated answer.
Zero Kifid complaints traceable to an agent-drafted advies in the period.

The interesting number is the 6.4%. That is the rate at which the system correctly admitted it could not help. A generic RAG would have happily answered all of those, and roughly half would have been wrong in ways that surface six months later.

The wider regulatory wind

This kind of architecture will not be optional for much longer. Financial regulators across Europe are paying close attention to AI in adviesketens, and AFM's published priorities increasingly point at verklaarbaarheid as the open question. If your RAG system cannot tell a regulator which clause supported which sentence, you are already on the wrong side of a curve that is steepening, not flattening.

The smallest thing you can do today

If you have a RAG system in any regulated context, run one audit this afternoon. Pick five answers it generated last week and try to reconstruct the citation chain back to a versioned, vetted source. If you cannot do that for all five, your gate is not yet where it needs to be. When we built this for the Gouda broker, the thing we kept running into was the 13-year CCS archive: the temptation to let the model “just look in there for similar cases” was constant, and giving in would have been the failure mode. We solved it by treating the archive as a retrieval-only signal that never enters the draft, paired with a hard citation gate. The boring constraint is what kept the firm out of Kifid.

Key takeaway

In regulated AI advice, the citation gate is not a feature on top of the RAG system. It is the whole product.

FAQ

Does the agent replace the schade-expert?

No. It drafts a citation-backed dekkingsadvies that the schade-expert accepts, edits, or rejects. The final advice and the regulatory responsibility stay with the human.

How do you handle voorwaarden updates from 18 insurers?

Each herziening is ingested as a new versioned document. The compliance officer vets the diff before the new version is flagged as available for retrieval, which takes about 40 minutes per insurer.

What happens when fewer than two chunks survive the citation gate?

The agent returns a structured onvoldoende_onderbouwing response and routes the question to a human, with the top-40 first-pass retrieval set attached. It does not attempt to answer.

Can a 27-person broker afford this kind of system?

Yes, because most of the cost is in the ingestion and citation gate, not the model calls. Per-query LLM cost is a few cents; the saved time per schade-expert pays for it inside a quarter.

Is the CCS dossier-archief ever shown to a client?

No. It is used as an internal retrieval signal for pattern matching across the 13-year history, but it is never quoted in a draft and never reaches the client-facing dekkingsadvies.
