Citation-first RAG: how we shipped a Dutch law firm's agent

A 27-person Den Haag firm answers 1,560 bezwaarschrift questions a week. Their associates wanted speed, not hallucinations. Here is how we built the agent that refuses to draft without an ECLI.

Jacob Molkenboer · Founder · A Brand New Company · 19 Jun 2026 · 9 min

It is a Tuesday afternoon in March, and a senior associate at a 27-person publiek-recht firm in Den Haag is staring at a stack of 14 bezwaarschriften that all need a conceptbrief by Friday. Each one needs three things: a clean recital of the facts, the right Awb-artikelen, and one or two ECLI-cited uitspraken that argue the angle she wants to take. Last year, that afternoon would have eaten six billable hours of pulling case law from Rechtspraak.nl, three different bound binders, and a thirteen-year-old Cleverdesk dossier-archief that nobody on the IT side wanted to touch.

This year it takes her forty minutes. The agent does the pulling. It refuses to draft anything until every passage in its context is pinned to a vetted ECLI or a specific Algemene wet bestuursrecht artikel. Below is the citation-first RAG playbook we wrote for that, in the order we wish we had done it.

The shape of a bezwaarschrift afternoon

The firm handles roughly 1,560 bezwaarschrift-vragen a week: a mix of intake triage, “is this even bezwaar-vatbaar,” missing-termijn checks, motivering reviews, and the conceptbrief drafts themselves. Roughly half are answered inside ten minutes by a paralegal; the rest pull a junior or senior in for real research. The corpus they reach for is two stacks:

  • ~41,000 Awb-jurisprudentie uitspraken, curated by the firm's own knowledge partner over fourteen years. Not all of Rechtspraak.nl, but a vetted subset with internal annotations on relevance.
  • A 13-year-old custom Cleverdesk dossier-archief with ~28,000 closed dossiers, including pleitnota's, memoranda, and the firm's own “what we would argue next time” notes.

Two things mattered before the first line of code. One: every output had to be citable. The firm's professional liability insurer was not going to wear a phantom-ECLI risk. Two: the Cleverdesk archive could not be migrated in the same quarter. It was alive, in use, and full of access-control quirks that nobody had documented since 2014.

The citation-first invariant

The single rule we wrote on the whiteboard, in marker, on day two:

The invariant

No token reaches the conceptbrief draft path unless every passage in its context is bound to a verified ECLI or a specific Awb-artikel. The verification happens before the LLM sees the chunk, not after.

Most RAG systems are written in the opposite order: retrieve, generate, then ask the model to “cite your sources.” That works for blog summaries. It does not work for a kantoor whose tuchtrecht risk is real. By the time the LLM has composed a sentence around an unsupported claim, the damage is already done. Even a “rejected” output costs review time.

So we inverted it. The retrieval layer's job is not “find relevant chunks.” It is “find chunks that already carry a verified citation handle, and pass nothing else downstream.” A passage without an ECLI or an Awb-artikel reference simply does not enter the prompt.

Pulling thirteen years out of Cleverdesk without migrating it

The Cleverdesk archive is a custom PHP/MySQL system from 2013 with two web frontends bolted on, a SOAP endpoint nobody remembers writing, and a documents table that stores everything as base64 in a longblob column. It is the kind of system you do not turn off, and also do not replatform on a deadline.

What we did instead: a one-way change-data-capture stream from Cleverdesk's MySQL into a parallel index. A small Go process tails the binlog, extracts the longblob payloads, runs them through Apache Tika for text, and writes them into a dedicated table in our own Postgres with pgvector. The Cleverdesk UI keeps working. Nothing inside Cleverdesk changes. The firm's access-control rules are mirrored as row-level security policies, so the agent can never surface a dossier the asking user cannot already see in the legacy UI.

create table dossier_chunks (
  id          bigserial primary key,
  dossier_id  text not null,
  source      text not null check (source in ('cleverdesk','rechtspraak','firm_memo')),
  ecli        text,
  awb_artikel text,
  passage     text not null,
  embedding   vector(1024),
  acl_group   text[] not null,
  created_at  timestamptz default now()
);

create index on dossier_chunks using hnsw (embedding vector_cosine_ops);
create index on dossier_chunks (ecli) where ecli is not null;
create index on dossier_chunks (awb_artikel) where awb_artikel is not null;

Two indexes you might not expect: one on ecli and one on awb_artikel. They are not for vector search. They are for the verification gate. We hit them on every single query.
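The extraction step that sits between the binlog and this table can be sketched roughly as follows. This is an illustration in Python rather than the Go process itself: `ExtractedDocument` and `to_extracted_document` are hypothetical names, and the Tika call is stood in for by a caller-supplied `extract_text` function, since the real client code is not shown here.

```python
import base64
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractedDocument:
    dossier_id: str
    source: str
    text: str
    acl_group: list[str]


def to_extracted_document(
    dossier_id: str,
    blob: bytes,
    acl_group: list[str],
    extract_text: Callable[[bytes], str],
) -> Optional[ExtractedDocument]:
    """Turn one captured documents-table row into text ready for chunking."""
    # The legacy table stores the file as base64 text inside the longblob.
    raw = base64.b64decode(blob)
    # extract_text is a stand-in for the text extractor (Tika in the article).
    text = extract_text(raw).strip()
    if not text:
        return None  # nothing extractable, so nothing to index
    # The ACL groups travel with the text so row-level security can apply later.
    return ExtractedDocument(dossier_id, "cleverdesk", text, acl_group)
```

The ACL groups are attached at this point, before any chunking, so a chunk can never exist in the index without the access rule of its parent dossier.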

Chunking jurisprudentie without losing the ratio decidendi

The 41,000 Awb-uitspraken came in as PDFs, HTML scrapes from rechtspraak.nl, and a few thousand documents from a paid corpus the firm subscribes to. Generic chunking, say 800 tokens with 100 tokens of overlap, destroyed these. A bezwaar-relevante overweging often sits eight paragraphs into an uitspraak, and the legal conclusion ten paragraphs further down. Cut between them and you have orphans on both sides.

We chunked along the document's own structure instead: feitelijke gronden, beoordeling, overwegingen, dictum. Every chunk carries the ECLI of its parent uitspraak, plus a typed label for which section it came from. A retrieval that surfaces a dictum chunk pulls its accompanying overweging automatically, because the LLM cannot reason about the operative ruling without seeing the reasoning that produced it.
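A minimal sketch of that structure-aware chunking, assuming each section starts with a heading line that matches one of the four labels. Real uitspraken are messier than this, and `SectionChunk`, `chunk_uitspraak`, and `with_reasoning` are illustrative names, not the production ones.

```python
from dataclasses import dataclass

# Assumed heading labels, taken from the section names listed above.
SECTION_LABELS = ("feitelijke gronden", "beoordeling", "overwegingen", "dictum")


@dataclass(frozen=True)
class SectionChunk:
    ecli: str
    section: str
    text: str


def chunk_uitspraak(ecli: str, body: str) -> list[SectionChunk]:
    """Split on section heading lines; every chunk inherits the parent ECLI."""
    sections: list[tuple[str, list[str]]] = []
    for line in body.splitlines():
        heading = line.strip().lower().rstrip(":")
        if heading in SECTION_LABELS:
            sections.append((heading, []))
        elif sections:
            # Text before the first recognised heading is ignored.
            sections[-1][1].append(line)
    chunks = []
    for label, lines in sections:
        text = "\n".join(lines).strip()
        if text:
            chunks.append(SectionChunk(ecli, label, text))
    return chunks


def with_reasoning(
    hits: list[SectionChunk], all_chunks: list[SectionChunk]
) -> list[SectionChunk]:
    """For every retrieved dictum, also pull the overwegingen of the same uitspraak."""
    out = list(hits)
    for hit in hits:
        if hit.section != "dictum":
            continue
        for c in all_chunks:
            if c.ecli == hit.ecli and c.section == "overwegingen" and c not in out:
                out.append(c)
    return out
```

The expansion runs after retrieval, so a dictum that scores well on its own still arrives in the prompt together with the reasoning behind it.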

For the firm-memo and pleitnota corpus, we used a different strategy: the firm's own paragraph numbering was already meaningful, so we respected it.

Retrieval that doesn't pretend it understands Dutch law

We deliberately did not fine-tune a model on Awb jurisprudence. Fourteen months from now we will revisit it, but the cost-to-confidence ratio in 2026 is wrong: every domain fine-tune is one Hoge Raad ruling away from being subtly stale, and stale legal reasoning is worse than no reasoning.

Retrieval instead runs in three lanes that score independently and merge:

  • Dense vector over a multilingual embedding model. Cosine similarity. Handles paraphrase well, ignores rare ECLIs.
  • BM25 over the same chunks. Catches the exact ECLI or Awb-artikel reference the user might already have typed.
  • Citation-graph expansion. If a chunk cites another uitspraak by ECLI, that target is pulled too at a discounted score. This is how we recover the ladder of cases that build on each other.

The three lanes feed a reranker that knows about source type. A senior-memo chunk is allowed to win against an uitspraak only if the question is procedural rather than substantive. That rule was written by one of the firm's partners on a whiteboard, not by us.
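The merge and the source-type rule can be sketched as below. The assumptions are loud ones: each lane is taken to return scores already normalised to the 0..1 range, the lanes are combined by a plain weighted sum, and `GRAPH_DISCOUNT = 0.5` is a made-up value. The actual reranker and its weights are not described in this post.

```python
GRAPH_DISCOUNT = 0.5  # assumed discount for chunks reached via a cited ECLI


def merge_lanes(
    dense: dict[str, float], bm25: dict[str, float], graph: dict[str, float]
) -> dict[str, float]:
    """Sum per-chunk scores across lanes; graph-expanded hits count for less."""
    merged: dict[str, float] = {}
    for lane, weight in ((dense, 1.0), (bm25, 1.0), (graph, GRAPH_DISCOUNT)):
        for chunk_id, score in lane.items():
            merged[chunk_id] = merged.get(chunk_id, 0.0) + weight * score
    return merged


def rerank(
    scores: dict[str, float], sources: dict[str, str], procedural: bool
) -> list[str]:
    """Order chunk ids by score, applying the partner's source-type rule."""

    def key(chunk_id: str) -> tuple[bool, float]:
        is_memo = sources[chunk_id] == "firm_memo"
        # On substantive questions a memo chunk sorts after every non-memo chunk.
        demoted = is_memo and not procedural
        return (demoted, -scores[chunk_id])

    return sorted(scores, key=key)
```

With this shape, the same merged scores produce a different ordering depending only on whether the question was classified as procedural, which keeps the partner's rule in one auditable place.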

The verification gate

This is the part that matters most, and the part most RAG playbooks skip.

Between the retriever and the LLM sits a verification step. Every candidate chunk is checked against two ground-truth indexes:

  1. An ECLI registry mirrored from rechtspraak.nl, refreshed nightly. A chunk's ECLI must resolve to a real uitspraak, and the chunk's claimed publication date must match.
  2. An Awb-artikel registry derived from wetten.overheid.nl, including the article's geldigheidsperiode. A chunk citing artikel 6:7 must be checked against the version of the law that was in force at the moment of the underlying besluit, not today's.

If a chunk's citation does not verify, it is dropped silently, with a log line. We do not patch or guess. Either the citation is real and current, or the chunk does not exist for the purposes of this query.

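from datetime import date

def gate(chunk: Chunk, besluit_date: date) -> Chunk | None:
    """Return the chunk only if every citation it carries verifies, else None."""
    # Invariant: a passage with no citation handle never enters the prompt.
    if not chunk.ecli and not chunk.awb_artikel:
        return None
    if chunk.ecli:
        # The ECLI must resolve in the registry and the cited date must match.
        u = ecli_index.get(chunk.ecli)
        if not u or u.published_at != chunk.cited_date:
            return None
    if chunk.awb_artikel:
        # Check the version in force on the date of the besluit, not today's.
        v = awb_index.in_force(chunk.awb_artikel, on=besluit_date)
        if not v:
            return None
    return chunk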

The gate is boring code. It is also where 80% of the project's quality lives. We invested two of the six project weeks here and we would do it again.
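The `awb_index.in_force` lookup used by the gate is where the geldigheidsperiode check happens. One way to sketch it, assuming the registry stores one record per article version with an inclusive validity range. `ArtikelVersion` and `AwbIndex` are hypothetical names, and the dates in the usage example are invented, not real amendment dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ArtikelVersion:
    artikel: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force


class AwbIndex:
    def __init__(self, versions: list[ArtikelVersion]) -> None:
        self._by_artikel: dict[str, list[ArtikelVersion]] = {}
        for v in versions:
            self._by_artikel.setdefault(v.artikel, []).append(v)

    def in_force(self, artikel: str, on: date) -> Optional[ArtikelVersion]:
        """Return the version of the article valid on the given date, if any."""
        for v in self._by_artikel.get(artikel, []):
            if v.valid_from <= on and (v.valid_to is None or on <= v.valid_to):
                return v
        return None


# Usage with invented dates: two versions of the same article.
awb_index = AwbIndex([
    ArtikelVersion("6:7", date(2000, 1, 1), date(2009, 12, 31)),
    ArtikelVersion("6:7", date(2010, 1, 1), None),
])
older = awb_index.in_force("6:7", on=date(2005, 6, 1))
```

Returning the matching version rather than a boolean lets the caller log which text of the article the chunk was verified against.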

ECLI normalisation, the boring part that matters

ECLI strings look standardised. They are not, in the wild. We saw ECLI:NL:RBDHA:2019:1234, ECLI:NL:RbDHA:2019:1234, ECLI NL RBDHA 2019 1234, and once, in a 2014 memo, a hand-typed ECLI:NL:RBSGR:2019:1234 where the rechtbank code had been retired three years earlier.

One gotcha worth flagging: if you are building anything that cites Dutch case law, normalise ECLI strings at ingest, store the canonical form, and also store every observed variant. You will need both. One for matching the user's typed query, one for matching what your own corpus said in 2014. We keep an alias table mapping observed forms to canonical forms, and a retired-court table for the ones like RBSGR that were renamed. The verification gate consults both.
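A sketch of the normalise-and-remember step, under the assumption that the variants differ only in letter case and in the separator between the five ECLI parts. `normalise_ecli`, `record_alias`, and `uses_retired_court` are illustrative names, and the retired-court set holds only the one code mentioned above.

```python
import re
from typing import Optional

# Five parts: "ECLI", country code, court code, year, ordinal number.
# Separators may be colons or whitespace in hand-typed variants.
_ECLI_RE = re.compile(
    r"^ECLI[:\s]+([A-Z]{2})[:\s]+([A-Z0-9]{1,7})[:\s]+(\d{4})[:\s]+([A-Z0-9.]{1,25})$",
    re.IGNORECASE,
)

# Only the code named in the text; a real table would hold more entries.
RETIRED_COURTS = {"RBSGR"}


def normalise_ecli(raw: str) -> Optional[str]:
    """Return the canonical upper-case, colon-separated form, or None."""
    m = _ECLI_RE.match(raw.strip())
    if not m:
        return None
    country, court, year, ordinal = m.groups()
    return ":".join(["ECLI", country.upper(), court.upper(), year, ordinal.upper()])


def record_alias(aliases: dict[str, str], raw: str) -> Optional[str]:
    """Store the observed variant next to its canonical form."""
    canonical = normalise_ecli(raw)
    if canonical is not None:
        aliases[raw] = canonical
    return canonical


def uses_retired_court(canonical: str) -> bool:
    """Flag canonical ECLIs whose court code is in the retired-court table."""
    return canonical.split(":")[2] in RETIRED_COURTS
```

Normalisation here never rewrites a retired court code. It only flags it, so the decision about what a hand-typed RBSGR citation was meant to point at stays with the gate and its logs.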

Evaluating on bezwaarschriften the firm already lost

The most useful evaluation set is not a synthetic one. The firm gave us 220 historic bezwaarschriften where the outcome was already known (won, lost, or settled), and the partners had marked, after the fact, which uitspraken would have helped. We measured the agent on whether it surfaced those uitspraken in its top 10 retrieved results.

Baseline (dense vector only): 41% of marked uitspraken in top 10. With the three-lane retrieval and citation-graph expansion: 78%. With the verification gate dropping unsupported chunks: still 78%, because the gate drops noise, not signal. That last number is the one the partners cared about.
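The metric behind those percentages can be sketched in a few lines. This assumes a micro-average, meaning the share of all partner-marked uitspraken across the whole evaluation set that appear in the top 10 for their bezwaarschrift. A per-case average would be computed differently, and the function name is illustrative.

```python
def marked_in_top_k(
    cases: list[tuple[list[str], set[str]]], k: int = 10
) -> float:
    """Share of all marked ECLIs that appear in the top k retrieved results.

    Each case is (retrieved ECLIs in rank order, ECLIs the partners marked).
    """
    found = 0
    total = 0
    for retrieved, marked in cases:
        found += len(set(retrieved[:k]) & marked)
        total += len(marked)
    return found / total if total else 0.0
```

Running the same function over the output of each retrieval configuration gives numbers that are directly comparable, since the marked set never changes between runs.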

We do not report a hallucination number, because the architecture makes the question malformed: an unsupported claim cannot reach the draft path. What we do report is the rate at which the agent refuses to answer, currently 11% of queries, mostly because the relevant jurisprudentie is too recent to have been ingested. The associates prefer a refusal to a bluff.

What changed for the associates

The agent does not draft conceptbrieven end to end. That was never the goal and the firm did not want it. What it does:

  • Answers the “is this bezwaar-vatbaar” intake questions in seconds, with the relevant Awb-artikel cited.
  • For a given casus, pulls the top five candidate uitspraken with a one-paragraph summary of why each is relevant and a direct ECLI link to the full text.
  • Generates a structured research brief (facts, legal grounds, supporting case law, opposing case law) that the associate edits into a conceptbrief.

The firm tracked associate time on bezwaar-research before and after. Median time on a conceptbrief dropped from 3h10 to 1h05. The partners did not cut billable hours; they redirected them toward intake of new files. The firm took on 30 more dossiers in Q1 2026 with the same headcount.

The smallest thing you can do this week

Before you architect anything, write down your equivalent of the citation-first invariant. One sentence. Make it falsifiable. If you cannot say in one line what your agent is forbidden from doing, you do not yet have a spec. You have a vibe. Tape it to a wall.

When we built this RAG agent for the Den Haag firm, the thing we kept running into was that every “smart” addition to the prompt eroded the invariant a little. We ended up solving it by moving every verification step out of the prompt and into deterministic code that the LLM never sees.

Key takeaway

Run the citation gate in deterministic code before the LLM sees a chunk. Verifying after generation is theatre: by then the unsupported sentence already exists.

FAQ

Why not fine-tune a model on Dutch case law?

The corpus shifts with every Hoge Raad ruling. A fine-tuned model goes stale silently. We kept the LLM general and pushed all domain knowledge into retrieval and verification, which we can refresh nightly.

Can the agent draft a full conceptbrief end to end?

It can, but the firm does not want it to. It produces a structured research brief that the associate edits. The human stays on the draft path. That is a deliberate choice, not a technical limit.

How do you handle access control across the legacy Cleverdesk archive?

We mirror Cleverdesk's per-user dossier ACLs into Postgres row-level security policies. The agent inherits exactly what the asking user could already see in the legacy UI, never more.

What happens when a chunk's ECLI does not verify?

It is dropped silently and logged. We do not patch, guess, or warn the user. Either a citation is real and current, or that chunk does not exist for the purposes of the query.

Did you migrate Cleverdesk in the same project?

No. We tailed its MySQL binlog into a parallel Postgres index and left the legacy system running untouched. Migration is a separate conversation, on a separate timeline, with a separate risk profile.
