
RAG for bouwbesluit: a citation-gated agent playbook

A Deventer architectenbureau fields 1,240 bouwbesluit-vragen a week. Here is the RAG playbook that answers them with a citation on every line.

Jacob Molkenboer · Founder · A Brand New Company · 25 May 2026 · 10 min


It is 9:43 on a Tuesday in Deventer. Mariëlle is the senior projectarchitect on a school renovation in Apeldoorn, and the gemeente's plantoetser has just asked whether the new gymzaal still meets the eisen for daglichttoetreding under Bbl artikel 4.96. The reference she wants lives in three places: a 2018 NEN 2580 measurement report, a 2023 Bouwbesluit memo that was already half-rewritten when the Omgevingswet went live in January 2024, and a BIMcloud comment thread from the original 2014 design phase. She has fourteen minutes before the call.

That is the moment the RAG agent has to land. Not a chatbot demo, not a “summarise this PDF” toy. A system that answers a regulated question with the right citation, every time, or refuses cleanly.

This is the playbook we wrote for a 26-person architectenbureau in Deventer that fields ~1,240 of these questions a week across 38,200 NEN 2580-clausules, 14,600 Bbl-artikelen and a twelve-year-old Archicad BIMcloud archive. The hard part is not the retrieval. The hard part is making sure no token ever reaches a vergunningsaanvraag draft without a citation a Wkb-kwaliteitsborger can sign off on.

Chunk on the clause, not the token

The first instinct on every RAG project is to chunk by character count. On bouwregelgeving that is malpractice.

A Bbl-artikel has a fixed structure: artikel.lid.onderdeel.sub. A clause that ends mid-sentence loses its scope. A clause merged with its sibling loses its uitzondering. We chunk on the smallest legally addressable unit — usually the lid — and carry the full address on the chunk.

from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    clause_id: str          # "bbl:4.96.2.a"
    source_doc: str         # "bbl" | "nen-2580" | "bimcloud"
    version_date: date      # 2024-01-01
    parent_path: list[str]  # ["Hfd. 4", "Afd. 4.4", "Par. 4.4.2"]
    superseded_by: str | None  # clause_id of the replacing chunk, if any
    legal_basis: str        # "Omgevingswet art. 4.3"
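To make the clause-level split concrete, here is a minimal sketch of a lid splitter. It assumes an input layout in which every lid starts on its own line with its number and a full stop; the function name split_into_leden, the address format and the sample text are placeholders for illustration, not the kantoor's parser and not real Bbl text.

```python
import re

# Assumption: each lid starts at the beginning of a line as "1. ", "2. ", ...
LID_START = re.compile(r"^(\d+)\.\s+", re.MULTILINE)


def split_into_leden(artikel: str, text: str) -> list[tuple[str, str]]:
    """Split one artikel into (clause_id, lid_text) pairs, one per lid."""
    matches = list(LID_START.finditer(text))
    if not matches:
        # Artikel without numbered leden: keep it whole as a single chunk.
        return [(f"bbl:{artikel}", " ".join(text.split()))]
    chunks = []
    for i, m in enumerate(matches):
        # A lid runs until the next lid starts, or until the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = " ".join(text[m.end():end].split())  # collapse line wraps
        chunks.append((f"bbl:{artikel}.{m.group(1)}", body))
    return chunks


# Placeholder text, not a real artikel.
sample = (
    "1. Eerste lid, placeholder tekst.\n"
    "2. Tweede lid met een uitzondering,\n"
    "   tenzij placeholder.\n"
)
leden = split_into_leden("9.99", sample)
for clause_id, lid_text in leden:
    print(clause_id, "->", lid_text)
```

Anything before the first numbered lid (a heading, for example) is dropped by this sketch; a production splitter would also need to handle onderdelen and continuation lines that happen to start with a number.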

The superseded_by field is what makes the system survive a wetswijziging. When the Bouwbesluit 2012 became the Bbl in 2024, every chunk that referenced the old artikel got a forward link. The retriever follows it. The generator is told, in the system prompt, that any citation older than the project's peildatum must be rewritten in current law before it leaves the agent.
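Following the forward link can be sketched in a few lines. This is an illustration under assumptions: ChunkRef is a trimmed-down stand-in for the chunk record, resolve_current is a hypothetical helper name, and the two addresses in the example are invented and do not represent a real legal mapping.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkRef:
    """Stand-in for the chunk record; only the fields this sketch needs."""
    clause_id: str
    version_date: date
    superseded_by: str | None = None


def resolve_current(clause_id: str, index: dict[str, ChunkRef]) -> ChunkRef:
    """Follow superseded_by links until a chunk with no successor is reached."""
    seen: set[str] = set()
    current = index[clause_id]  # raises KeyError for an unknown address
    while current.superseded_by is not None:
        if current.clause_id in seen:
            raise ValueError(f"cycle in superseded_by chain at {current.clause_id}")
        seen.add(current.clause_id)
        current = index[current.superseded_by]
    return current


# Invented addresses, for illustration only.
index = {
    "old:3.75.1": ChunkRef("old:3.75.1", date(2012, 4, 1), superseded_by="new:4.96.1"),
    "new:4.96.1": ChunkRef("new:4.96.1", date(2024, 1, 1)),
}
print(resolve_current("old:3.75.1", index).clause_id)
```

The cycle check is there because a bad forward link in the data would otherwise make the retriever loop forever.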

Hybrid retrieval with a clause graph

Architects do not ask in one register. They alternate between three modes inside a single conversation:

“Wat staat er in 4.96 lid 2?” — exact lookup, BM25 wins every time.
“Mag een binnenwand bij een schoolgebouw uit gipsblokken?” — semantic, dense vectors win.
“Welke artikelen verwijzen naar NEN 6068?” — graph query, neither will do.

We run all three and re-rank. BM25 on the clause text and address. Dense embeddings (we use bge-m3 — it handles Dutch and the legal register reasonably well) on a concatenation of clause text plus parent paragraph headers. And a Neo4j-backed clause graph that stores every cross-reference, every tenzij-exception, and the mapping between an NEN-norm and the Bbl-artikel that incorporates it by reference.

The re-ranker is a small cross-encoder fine-tuned on ~3,000 question/clause pairs the kantoor's senior architects labelled over two weeks. That labelling round is non-negotiable. Without it you ship a model that ranks impressively on MTEB and unimpressively on a real plantoetser-vraag.
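The text does not say how the three result lists are pooled before the cross-encoder sees them. One common option is reciprocal rank fusion, sketched below as an assumption rather than as the kantoor's method; rrf_pool, the constant k=60 and the clause ids are illustrative.

```python
from collections import defaultdict


def rrf_pool(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Pool ranked clause-id lists with reciprocal rank fusion (RRF)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, clause_id in enumerate(ranking, start=1):
            # A clause earns more the higher it ranks in each list it appears in.
            scores[clause_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on clause id for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Illustrative result lists from the three retrievers.
bm25_hits = ["bbl:4.96.2", "bbl:4.96.1", "bbl:4.97.1"]
dense_hits = ["bbl:4.96.1", "bbl:4.96.2", "nen-2580:4.5"]
graph_hits = ["bbl:4.96.2"]

candidates = rrf_pool([bm25_hits, dense_hits, graph_hits])
print(candidates)  # this pooled list is what a cross-encoder would then re-rank
```

RRF needs only ranks, not comparable scores, which is convenient when BM25 scores, cosine similarities and graph hits live on different scales.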

Citation-first generation

The generator never speaks first. It cites first.

We force a two-pass output. Pass one is a JSON object of {claim, citation} pairs. Pass two stitches them into Dutch prose. A guard between the passes rejects the response if any claim has no citation, if any citation's chunk does not actually entail the claim's text (we check via a small NLI head), or if the citation address does not resolve in the clause graph.

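# ClaimCitation, ClauseGraph, Result, Reject, Accept and entails() are not shown here.
def validate(pairs: list[ClaimCitation], graph: ClauseGraph) -> Result:
    for p in pairs:
        # Every claim must carry a citation.
        if not p.citation:
            return Reject("uncited claim", p.claim)
        # The cited address must resolve in the clause graph.
        chunk = graph.get(p.citation)
        if chunk is None:
            return Reject("hallucinated address", p.citation)
        # The cited chunk must entail the claim, as scored by the NLI head.
        if not entails(chunk.text, p.claim, threshold=0.82):
            return Reject("citation does not support claim", p)
    return Accept()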

Rejected responses do not surface to the architect. They route to a needs-human queue with the failed pair attached. After eight weeks of production traffic the queue averaged 31 items per week, about 2.5% of the ~1,240 weekly vragen, and almost all of them turned out to be genuinely ambiguous, not model failures.
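The routing step can be sketched end to end without a model by injecting the entailment check as a callable. Everything named here is an assumption for illustration: Pair, parse_pass_one, guard, the reason codes and the substring-based toy_entails, which stands in for the NLI head only so the example runs.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pair:
    claim: str
    citation: str


def parse_pass_one(raw: str) -> list[Pair]:
    """Parse pass-one JSON: a list of {"claim": ..., "citation": ...} objects."""
    return [
        Pair(claim=str(item.get("claim", "")), citation=str(item.get("citation") or ""))
        for item in json.loads(raw)
    ]


def guard(
    pairs: list[Pair],
    clause_text: dict[str, str],
    entails: Callable[[str, str], bool],
) -> str | None:
    """Return a reason code for the first failing pair, or None if all pass."""
    for p in pairs:
        if not p.citation:
            return "uncited_claim"
        premise = clause_text.get(p.citation)
        if premise is None:
            return "unresolved_address"
        if not entails(premise, p.claim):
            return "unsupported_claim"
    return None


def toy_entails(premise: str, claim: str) -> bool:
    # Stand-in only: substring matching is NOT natural-language inference.
    return claim.lower() in premise.lower()


needs_human: list[tuple[str, list[Pair]]] = []
clause_text = {"demo:1.1": "Placeholder clause text about daylight area."}
raw = (
    '[{"claim": "daylight area", "citation": "demo:1.1"},'
    ' {"claim": "no source given", "citation": null}]'
)
pairs = parse_pass_one(raw)
reason = guard(pairs, clause_text, toy_entails)
if reason is not None:
    needs_human.append((reason, pairs))  # queued for review, never shown to the architect
print(reason)  # uncited_claim
```

Returning a reason code rather than a bare boolean is what makes a weekly review of refusals possible later on.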

The BIMcloud archive nobody indexed

The twelve-year Archicad BIMcloud archive is where every consultancy underestimates the work. It is not a database. It is a sediment.

Layer one is the IFC exports — geometry plus a thin layer of property sets. Layer two is the PDF-printed plantekeningen, half of them scanned from paper before 2017. Layer three is a tangle of comment threads, change requests, and “zie e-mail Jeroen 14/3” annotations that only make sense if you were in the room.

What we built sits in three jobs:

A nightly walker over the BIMcloud REST API that snapshots IFC and PDF exports per project, fingerprinted so we re-index only what changed.
OCR over the pre-2017 PDFs with ocrmypdf (Tesseract back-end, Dutch and English packs). Confidence below 0.7 gets flagged and never enters the index.
An IFC-to-clause linker that reads property sets like Pset_FireSafety and proposes a candidate Bbl-artikel. A senior architect confirms or rejects in a 30-second UI we built for them. Six weeks in, the linker auto-approves at 0.9 precision and the human queue is empty most days.
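The fingerprinting and the OCR gate from the first two jobs can be sketched as follows. This is a sketch under assumptions: SHA-256 is one reasonable choice of fingerprint, the function names are hypothetical, and the OCR confidence is assumed to arrive already normalised to the 0 to 1 range (how it is extracted from the OCR output is out of scope here).

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to decide whether an export needs re-indexing."""
    return hashlib.sha256(data).hexdigest()


def changed_exports(exports: dict[str, bytes], known: dict[str, str]) -> list[str]:
    """Return export paths whose content differs from the stored fingerprint."""
    changed = []
    for path, data in sorted(exports.items()):
        digest = fingerprint(data)
        if known.get(path) != digest:
            changed.append(path)
            known[path] = digest  # remember for the next nightly run
    return changed


def admit_ocr_page(mean_confidence: float, threshold: float = 0.7) -> bool:
    """Pages below the threshold are flagged and kept out of the index."""
    return mean_confidence >= threshold


# Illustrative paths and contents.
store: dict[str, str] = {}
night_1 = {"p-001/model.ifc": b"ifc v1", "p-001/plan.pdf": b"pdf v1"}
night_2 = {"p-001/model.ifc": b"ifc v2", "p-001/plan.pdf": b"pdf v1"}
first_run = changed_exports(night_1, store)
second_run = changed_exports(night_2, store)
print(first_run, second_run)
```

On the second night only the IFC export is reported, because the PDF bytes hash to the fingerprint already in the store.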

The archive is read-only to the agent. Live project work happens in current Archicad files; the agent uses the archive for “wat hebben we hier in 2019 ook al weer over geschreven” questions. That separation matters for the Wkb gate.

Wkb as a hard gate, not a checkbox

The Wet kwaliteitsborging voor het bouwen turned the kwaliteitsborger into a named, liable party for gevolgklasse 1 projects, which now covers most of what this kantoor builds. That has consequences for any agentic system you put near a permit.

We split the agent into two paths:

Read path. Answers questions, cites clauses, surfaces BIMcloud passages. No outputs that touch a permit document. No structured writes.
Draft path. Generates concept-text for hoofdstukken of a vergunningsaanvraag. Every paragraph requires a kwaliteitsborger signature inside the dashboard before the text can be exported into the official template.

The split is not a UX nicety. It is enforced at the API gateway. Read-path responses carry a bbl:answer:v1 content-type. Draft-path responses carry bbl:draft:v1 and require an HMAC header signed with the kwaliteitsborger's hardware token. No signature, no draft, full stop.
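The gateway rule can be sketched with Python's standard hmac module. The two content-type strings come from the text above; the function names, the choice of HMAC-SHA256 over the response body, and the way the key is passed in are assumptions for illustration. In a real deployment the key would stay on the hardware token rather than sit in process memory.

```python
from __future__ import annotations

import hashlib
import hmac


def sign_draft(body: bytes, key: bytes) -> str:
    """HMAC-SHA256 over the draft body, hex-encoded."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def gateway_allows(content_type: str, body: bytes, signature: str | None, key: bytes) -> bool:
    """Read path passes unsigned; draft path needs a valid signature."""
    if content_type == "bbl:answer:v1":
        return True
    if content_type == "bbl:draft:v1":
        if not signature:
            return False  # no signature, no draft
        expected = sign_draft(body, key)
        # Constant-time comparison avoids leaking how many characters matched.
        return hmac.compare_digest(expected.encode(), signature.encode())
    return False  # unknown content types are refused


key = b"demo-key-for-illustration-only"
draft = b"Concept-tekst hoofdstuk 3 (placeholder)"
good_signature = sign_draft(draft, key)

print(gateway_allows("bbl:draft:v1", draft, good_signature, key))        # True
print(gateway_allows("bbl:draft:v1", draft + b"!", good_signature, key))  # False
print(gateway_allows("bbl:draft:v1", draft, None, key))                  # False
```

Refusing unknown content types by default keeps a mislabelled draft from slipping through on the read path.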

Warning

If the kwaliteitsborger is the same person who runs the agent in daily work, you have not built a gate. You have built theatre. Separate the role from the seat.

Identity and the agent that runs at 3 a.m.

Two pieces of news from the last two months changed how we provision agent identity for this kantoor.

Anthropic announced that certain API capabilities will require identity verification starting July 8, 2026. For a firm that runs nightly agents against a regulated corpus, that is not a hurdle, it is a feature: the kwaliteitsborger's identity belongs on the API account, not a service-account stub. We moved billing to a verified Anthropic Console workspace owned by the kantoor's compliance lead. Audit trail starts at the bill.

For ephemeral runs — the per-project sandboxes the agent uses for one-shot drafts — we adopted the pattern Cloudflare wrote up for temporary accounts scoped to AI agents. Every nightly job spins up a scoped account with a 24-hour TTL, project-scoped permissions, and an audit log that ends up in the same store as the kwaliteitsborger's sign-offs. If something goes wrong, the blast radius is one project and one day.
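The blast-radius argument boils down to two checks on every action: is the account still inside its TTL, and is the action aimed at the project the account was issued for. The sketch below models only that logic; ScopedAccount is a hypothetical class and does not represent Cloudflare's or any other provider's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopedAccount:
    """Hypothetical model of a per-project, time-limited agent account."""
    project_id: str
    issued_at: datetime
    ttl: timedelta = timedelta(hours=24)

    def allows(self, project_id: str, now: datetime) -> bool:
        within_ttl = self.issued_at <= now < self.issued_at + self.ttl
        return within_ttl and project_id == self.project_id


issued = datetime(2026, 5, 25, 3, 0, tzinfo=timezone.utc)
account = ScopedAccount(project_id="p-001", issued_at=issued)

print(account.allows("p-001", issued + timedelta(hours=1)))   # True
print(account.allows("p-002", issued + timedelta(hours=1)))   # False: wrong project
print(account.allows("p-001", issued + timedelta(hours=25)))  # False: expired
```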

What we measure, weekly

The kantoor's senior partner did not ask for an AI dashboard. He asked four questions, and we built the dashboard around them:

How many vragen did the agent answer this week, and how many escalated to a human?
Of the escalations, how many were the agent being correctly cautious vs. failing to retrieve?
What is the median time from vraag to cited answer? Target: under 6 seconds. Current: 4.1.
Did any draft-path output leave the system without a kwaliteitsborger signature? Target: zero. Actual: zero, monitored as a hard alarm.

The second metric is the one most teams skip. A RAG system that refuses too often is not safer; it is just useless in a louder way. We tag every refusal with a reason code and review the top three each week with the architects who saw them.
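The four questions map onto a small weekly aggregation. The sketch below assumes one log record per vraag; VraagLog, its field names and the two reason codes are hypothetical, and the sample numbers are made up to exercise the code.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class VraagLog:
    latency_s: float                  # vraag to cited answer; ignored on escalations
    escalated: bool = False
    reason_code: str | None = None    # set only on escalations
    unsigned_draft_export: bool = False


def weekly_report(logs: list[VraagLog]) -> dict:
    answered = [v for v in logs if not v.escalated]
    escalated = [v for v in logs if v.escalated]
    reasons = Counter(v.reason_code or "unlabelled" for v in escalated)
    return {
        "answered": len(answered),
        "escalated": len(escalated),
        "escalation_rate": len(escalated) / len(logs) if logs else 0.0,
        "median_latency_s": median(v.latency_s for v in answered) if answered else None,
        "top_reasons": reasons.most_common(3),  # reviewed weekly with the architects
        "unsigned_draft_exports": sum(v.unsigned_draft_export for v in logs),  # must be 0
    }


logs = [
    VraagLog(3.0),
    VraagLog(4.1),
    VraagLog(5.2),
    VraagLog(0.0, escalated=True, reason_code="ambiguous_question"),
    VraagLog(0.0, escalated=True, reason_code="retrieval_miss"),
    VraagLog(0.0, escalated=True, reason_code="ambiguous_question"),
]
report = weekly_report(logs)
print(report)
```

Splitting escalations by reason code is what separates "correctly cautious" from "failed to retrieve" in the second question.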

What to skip on day one

Three things look essential for this class of problem and are not:

A fine-tuned generator. Your labelled examples should go to the re-ranker first. The base model only needs a tight system prompt and the citation guard.
A vector database with a fancy filter language. Postgres plus pgvector plus a clause_address btree index handles this corpus comfortably, and the kantoor's existing IT staff can operate it without a new vendor contract.
A streaming UI. Architects want the full cited answer or none of it. Streaming half a citation reads as a half-formed legal opinion, and they will treat it as one in front of a plantoetser.

The work behind one cited paragraph

When we built this for the Deventer kantoor, the thing we kept hitting was the gap between “the model produced a citation” and “the citation actually supports the claim”. The NLI guard between the two generation passes was the single change that made the kwaliteitsborger willing to sign off on the draft path at all. If you are scoping a similar system, that is where the next two weeks of your engineering budget should go, long before you debate which embedding model to pick. The architecture sits inside our AI agents practice if you want to walk through it against your own corpus.

The smallest thing you can do today: take one week of your firm's bouwbesluit-vragen, label which artikel each one should resolve to, and run a plain BM25 baseline over the Bbl text. If BM25 alone hits above 60% top-3 accuracy on clause addresses, your retrieval problem is small and your prompt problem is large. If it does not, retrieval is the project, and everything else is decoration.
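A baseline like this fits in one file. The sketch below is a plain Okapi BM25 in pure Python with a top-3 accuracy check; the class and function names, the parameter defaults (k1=1.5, b=0.75), the clause ids and the clause texts are all placeholders, so swap in your own Bbl text and labelled vragen.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


class BM25:
    """Minimal Okapi BM25 over a {clause_id: text} corpus."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tf = [Counter(tokenize(docs[i])) for i in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.tf]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        df = Counter(term for tf in self.tf for term in tf)  # document frequency
        n = len(self.ids)
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def top_k(self, query: str, k: int = 3) -> list[str]:
        scored = []
        for doc_id, tf, length in zip(self.ids, self.tf, self.lengths):
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in tokenize(query)
                if t in tf
            )
            scored.append((score, doc_id))
        scored.sort(key=lambda s: (-s[0], s[1]))  # best first, id as tie-break
        return [doc_id for _, doc_id in scored[:k]]


def top3_accuracy(index: BM25, labelled: list[tuple[str, str]]) -> float:
    """Share of questions whose gold clause id appears in the BM25 top 3."""
    hits = sum(1 for question, gold in labelled if gold in index.top_k(question, 3))
    return hits / len(labelled)


# Placeholder corpus and labels, not real clause text.
clauses = {
    "x:1.1": "daglicht oppervlakte verblijfsgebied",
    "x:1.2": "brandwerendheid scheidingsconstructie",
    "x:2.1": "ventilatie capaciteit verblijfsruimte",
    "x:2.2": "geluidwering gevel",
    "x:3.1": "toegankelijkheid drempel hoogte",
}
labelled = [
    ("hoeveel daglicht is nodig", "x:1.1"),
    ("eisen ventilatie", "x:2.1"),
    ("parkeren fiets", "x:3.1"),  # no lexical overlap, so BM25 misses it
]
accuracy = top3_accuracy(BM25(clauses), labelled)
print(f"top-3 accuracy: {accuracy:.0%}")
```

The third question shares no words with its gold clause, which is exactly the kind of miss that tells you whether dense retrieval is worth adding.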

Key takeaway

On regulated corpora, the RAG win is not a bigger model. It is a clause-level chunker, a citation guard that rejects unsupported claims, and a signed draft path.

FAQ

Why chunk on the lid instead of a fixed character window?

A lid is the smallest unit a Bbl-artikel can be cited at. Splitting mid-lid loses the uitzondering or scope, and merging two lids creates a citation that does not exist in the wet.

Does the agent draft the vergunningsaanvraag itself?

Only on the draft path, and only paragraph-by-paragraph after a kwaliteitsborger signs each one with a hardware token. Read-path answers never reach a permit document.

Is a fine-tuned LLM needed for Dutch bouwregelgeving?

Not on day one. The labelled examples have a higher return on the re-ranker than on the generator. A tight system prompt plus a citation guard covers most of the gap.

How did you handle the 2024 Omgevingswet transition in old citations?

Each chunk carries a superseded_by link. Citations older than the project's peildatum are forwarded to their current Bbl-equivalent before the response is rendered.

What stack runs the retrieval layer?

Postgres with pgvector for dense and address indexes, an external BM25 service for lexical lookups, and Neo4j for the clause cross-reference graph. No new vendor was added.
