RAG for legal publishing: shipping cited answers at 2,260/week

An editor opens a question at 9:14 on a Tuesday: which Hoge Raad rulings since 2019 narrow artikel 6:248 BW? The agent answers in eleven seconds, with seven citations.

Jacob Molkenboer · Founder · A Brand New Company · 9 Dec 2025 · 9 min

An editor at the Leuven legal publisher opens a question at 9:14 on a Tuesday: which Hoge Raad rulings since 2019 have narrowed the scope of artikel 6:248 BW in commercial contracts? The agent answers in eleven seconds, with seven citations, every one resolving back to a vetted ruling on Rechtspraak.nl. She drags two paragraphs into next week's nieuwsbrief. By 9:18 she is on the next query.

That moment was eighteen months of work. The publisher serves 4,800 subscribers across Flanders and the Netherlands. Their editors answer about 2,260 jurisprudence questions per week across 84,000 ECLI rulings and a 12-year-old custom XML archive of lower-court material the publisher digitised before ECLI standardisation existed. The brief was simple to say and hard to do: ship a RAG agent that drafts answers for the editor, but never let an uncited passage reach the subscriber portal or the newsletter queue.

Two write paths, one citation gate

Most RAG demos stop at "answer in chat". A publisher cannot. Two write paths in their stack mattered.

The subscriber portal, where editors save answers as research memos under the firm's masthead. Once saved, a memo is invoiceable advice.

The newsletter queue, where a paragraph promoted to the weekly digest lands in 4,800 inboxes before any human re-reads it.

We treated both as production write paths and built the agent as a pipeline that physically cannot deliver a token to either one without a verified source attached. The agent's output type is not string. It is (string, [Citation]), and the writer middleware refuses to persist a memo whose citations have not round-tripped to Rechtspraak.nl in the last 48 hours. If even one citation 404s during the verification call, the whole answer drops to "draft, editor review" and never enters the queue.
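A minimal sketch of that contract, not the production middleware. The names `Citation`, `CitedAnswer` and `route_answer`, and the `verified_at` field, are assumptions made for illustration; only the 48-hour window and the "draft, editor review" fallback come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_VERIFICATION_AGE = timedelta(hours=48)  # window named in the text


@dataclass(frozen=True)
class Citation:
    ecli: str
    url: str
    # When the URL last resolved successfully; None if it never has (assumed field).
    verified_at: Optional[datetime] = None


@dataclass(frozen=True)
class CitedAnswer:
    text: str
    citations: tuple[Citation, ...]


def is_fresh(citation: Citation, now: datetime) -> bool:
    """True if the citation resolved successfully within the last 48 hours."""
    if citation.verified_at is None:
        return False
    return now - citation.verified_at <= MAX_VERIFICATION_AGE


def route_answer(answer: CitedAnswer, now: datetime) -> str:
    """Decide where an answer may go: 'queue' or 'draft, editor review'."""
    # An answer without citations is treated as a failure, never as a pass.
    if not answer.citations:
        return "draft, editor review"
    # One stale or unresolved citation demotes the whole answer.
    if all(is_fresh(c, now) for c in answer.citations):
        return "queue"
    return "draft, editor review"
```

The point of the type is that a bare string has nowhere to go: the writer only accepts a `CitedAnswer`, and the routing decision depends on every citation, not on the average one.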

The corpus we were handed

84,000 ECLI rulings sounds like one thing. It is three.

Modern rulings (2013 onward): well-formed ECLI XML with structured rechtsoverwegingen, parties, court, date, and a stable URL on Rechtspraak.nl. Roughly 51,000 of them.

2007 to 2013: ECLI-tagged but inconsistent. About 9% have malformed XML. Half have no rechtsoverweging boundaries. We could index the body text but not always pinpoint a citation to a specific paragraph.

Pre-2007 custom archive: 23,000 lower-court rulings the publisher scanned and tagged in a bespoke XML schema in 2014. Schema lives in a Word doc. Nobody who wrote it still works there.

The temptation with this kind of corpus is to normalise everything into one shape and index it. Resist that. We kept the three corpora as three indexes with three retrievers, because the failure modes differ and you want to know which corpus a bad answer came from.
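One way to keep that provenance is to label every hit with the index it came from. This is a sketch under assumptions: the `Retriever` signature, the `Hit` type and the corpus labels are hypothetical, and raw scores from different indexes are not assumed to be comparable.

```python
from dataclasses import dataclass
from typing import Callable

# A retriever takes a query and returns (chunk_id, score) pairs, best first.
Retriever = Callable[[str], list[tuple[str, float]]]


@dataclass(frozen=True)
class Hit:
    corpus: str    # which index produced this hit
    chunk_id: str
    score: float   # only comparable with scores from the same corpus


def retrieve_all(query: str, retrievers: dict[str, Retriever]) -> list[Hit]:
    """Query every index separately and label each hit with its corpus."""
    hits: list[Hit] = []
    for corpus, retrieve in retrievers.items():
        for chunk_id, score in retrieve(query):
            hits.append(Hit(corpus=corpus, chunk_id=chunk_id, score=score))
    return hits
```

Because the label travels with the hit, a bad answer can be traced to a corpus without re-running the query.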

Chunking by ruling structure, not by token count

Every RAG tutorial recommends 512-token sliding windows. Legal text punishes that approach. A rechtsoverweging is a self-contained legal argument, sometimes 80 tokens, sometimes 1,400. Slice it down the middle and you have produced a chunk that cites half a holding.

We chunked along the ECLI XML structure.

# ECLIRuling and Chunk are the project's own types (definitions not shown here).
def chunk_ruling(ruling: ECLIRuling) -> list[Chunk]:
    """Return one chunk per rechtsoverweging, plus one for the dictum."""
    # Every chunk of a ruling links to the same canonical page.
    url = f"https://uitspraken.rechtspraak.nl/details?id={ruling.ecli}"
    chunks = []
    # Each rechtsoverweging is one chunk, regardless of length.
    for ro in ruling.rechtsoverwegingen:
        chunks.append(Chunk(
            ecli=ruling.ecli,
            ro_number=ro.number,
            court=ruling.court,
            date=ruling.date,
            text=ro.text,
            url=url,
        ))
    # Plus one chunk for the dictum (the operative ruling).
    chunks.append(Chunk(
        ecli=ruling.ecli,
        ro_number="dictum",
        court=ruling.court,
        date=ruling.date,
        text=ruling.dictum,
        url=url,
    ))
    return chunks

For the pre-2007 archive, we wrote a parser that mapped the bespoke 2014 schema onto the same Chunk shape. That took three weeks. It was the single most valuable engineering decision in the project: a uniform Chunk shape across three corpora meant the retriever, the reranker and the citation gate did not need to know which archive an answer came from.

Hybrid retrieval with a domain reranker

We ran BM25 and dense vectors side by side. BM25 catches exact statute references, ECLI numbers and Dutch legal phrasing that embeddings smear. Vectors catch paraphrases and synonyms ("opzegging" vs "beëindiging"). Use Elasticsearch's BM25 or whatever your stack ships with. The implementation is not the story.
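This post does not depend on how the two result lists are merged. One common option is reciprocal rank fusion, sketched here as an illustration and not as a description of the production pipeline. The function name and the constant `k = 60` (a conventional default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; items ranked high in several lists win."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Only the rank is used, so BM25 and vector scores never need
            # to be put on the same scale.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output stable.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Rank-based fusion is a reasonable first pass precisely because the reranker does the real ordering afterwards.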

The story is the reranker. Off-the-shelf rerankers are trained on web text and they happily promote a 2003 lower-court ruling above a 2024 Hoge Raad arrest because the keyword match is denser. So we fine-tuned a small reranker on 18,000 editor-graded pairs the publisher had been collecting since 2021 in their internal research tool. They did not know it at the time, but those grades were the most valuable asset in the building.

Fine-tuning the reranker moved nDCG@10 from 0.71 to 0.86 on our held-out eval set. The same reranker, untuned, scored 0.63.
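For readers who have not met the metric, here is a small sketch of nDCG@10 for a single query. It assumes numeric editor grades where higher means more relevant and uses the linear-gain form of the formula; the grade scale used on this project is not stated here, so treat the inputs as hypothetical.

```python
import math


def dcg_at_k(grades: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(grade / math.log2(position + 2)
               for position, grade in enumerate(grades[:k]))


def ndcg_at_k(grades: list[float], k: int = 10) -> float:
    """DCG of the ranking as returned, divided by DCG of the ideal ordering.

    `grades` holds the grade of each returned result, in ranked order.
    The ideal ordering is taken from the same list, which assumes every
    graded candidate for the query is in it.
    """
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```

A score of 1.0 means the reranker put the results in exactly the order the editors' grades imply. The figure reported for an eval set is the mean over its queries.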

The citation gate

The agent's job is not to answer. The agent's job is to compose an answer in which every factual claim is grounded in a specific chunk it just retrieved. We enforce that with a two-pass structure.

Pass one: the model writes a draft with inline citation markers ([ECLI:NL:HR:2021:1313#3.4]) tied to chunks it retrieved.

Pass two: a separate verifier prompt receives the draft and the cited chunks (and nothing else), and must answer "is each claim supported by the chunk it cites, yes or no". Any "no" drops the claim to the editor review queue.
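A cheap mechanical check fits between the two passes: pull the markers out of the draft and confirm each one points at a chunk that was actually retrieved. The sketch below is an assumption about how that could look, not the production code. The regex follows the marker format shown above, and the function names are hypothetical.

```python
import re

# Matches markers such as [ECLI:NL:HR:2021:1313#3.4] or [ECLI:NL:HR:2021:1313#dictum].
CITATION_MARKER = re.compile(
    r"\[(ECLI:[A-Z]{2}:[A-Z0-9]+:\d{4}:[A-Z0-9.]+)#([^\]\s]+)\]"
)


def extract_citations(draft: str) -> list[tuple[str, str]]:
    """Return (ecli, paragraph) pairs for every marker in the draft, in order."""
    return CITATION_MARKER.findall(draft)


def unretrieved_citations(
    draft: str, retrieved: set[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Markers that do not match any retrieved chunk, i.e. likely inventions."""
    return [c for c in extract_citations(draft) if c not in retrieved]
```

Anything this returns never needs to reach the verifier: a citation to a chunk the model was never shown cannot be grounded.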

Before the draft reaches the subscriber portal write path, a third process resolves every ECLI back to its canonical URL on Rechtspraak.nl and confirms the ruling still exists with a HEAD request. We cache the HEAD result for 48 hours. Rulings get withdrawn rarely, but it happens, and the publisher would rather hold a memo than publish one that cites a vacated ruling.
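A sketch of that existence check with a 48-hour cache, using only the standard library. `RulingChecker`, `head_ok` and the injectable `probe` and `clock` parameters are hypothetical names, and the assumption that the site answers a HEAD request with a success status only for live rulings should be confirmed against the real endpoint.

```python
import time
import urllib.request
from typing import Callable

CACHE_TTL_SECONDS = 48 * 60 * 60  # cache window named in the text


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request; True only for a 2xx response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError (e.g. 404) and timeouts
        return False


class RulingChecker:
    """Confirms a ruling still resolves, re-probing at most once per 48 hours."""

    def __init__(self, probe: Callable[[str], bool] = head_ok,
                 clock: Callable[[], float] = time.time) -> None:
        self._probe = probe
        self._clock = clock
        self._cache: dict[str, tuple[float, bool]] = {}

    def exists(self, ecli: str) -> bool:
        now = self._clock()
        cached = self._cache.get(ecli)
        if cached is not None and now - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]
        url = f"https://uitspraken.rechtspraak.nl/details?id={ecli}"
        result = self._probe(url)
        self._cache[ecli] = (now, result)
        return result
```

Passing the probe and the clock in as parameters keeps the cache logic testable without touching the network.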

Warning

Do not let your verifier prompt see the original question. If it does, it will rationalise weak citations to please the asker. Show it the draft and the chunks, nothing else.

The 12-year-old XML archive

Migrating the pre-2007 archive was the project's longest tail. The schema had three undocumented quirks: rulings could nest inside other rulings (for joined cases), the date field used four different formats across the corpus, and 312 rulings had been hand-edited with HTML inside the XML, which the original parser silently dropped.

We wrote a strict parser, ran it against the full archive, logged every parse exception to a CSV, and sat with the publisher's senior editor for two afternoons to triage the long tail. About 1.5% of the archive needed a manual touch. That number is small, but the editor's two afternoons of input were the difference between a corpus we trusted and a corpus we hoped was clean.
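The shape of that run is simple: parse everything, never swallow an error, and write each failure to a file an editor can open. The sketch below assumes documents are passed in as a name-to-text mapping and checks only XML well-formedness; the real parser also enforced the bespoke schema, which is not reproduced here. `parse_archive` is a hypothetical name.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import TextIO


def parse_archive(documents: dict[str, str], log: TextIO) -> dict[str, ET.Element]:
    """Parse every document; record each failure as a CSV row, drop nothing silently."""
    writer = csv.writer(log)
    writer.writerow(["document", "error"])
    parsed: dict[str, ET.Element] = {}
    for name, xml_text in documents.items():
        try:
            parsed[name] = ET.fromstring(xml_text)
        except ET.ParseError as exc:
            # The message includes line and column, which speeds up triage.
            writer.writerow([name, str(exc)])
    return parsed


# Usage with two made-up documents, one of them malformed.
log = io.StringIO()
parsed = parse_archive(
    {
        "a.xml": "<ruling><date>2001-01-01</date></ruling>",
        "b.xml": "<ruling><p>unclosed</ruling>",
    },
    log,
)
```

The CSV is the artefact that matters: it turns "some files are broken" into a list two people can work through in an afternoon.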

If you are migrating a legacy XML archive into a RAG pipeline, plan for those afternoons. They are not optional.

Editor in the loop, not human in the loop

"Human in the loop" is the phrase everyone uses. It is too low a bar. The editor is not validating; the editor is editing. We designed the UI so the agent's draft lands in the editor's normal workflow exactly the way a junior researcher's draft would: as a memo with track changes on, in the publisher's CMS, with the citations already linked. The editor reads, accepts or rejects each cited passage, and saves. The accept/reject signal goes straight into the grade collector and trains the next reranker.

That changes the economics. The editors are not doing extra work to grade the agent. They are doing their normal work, and the system records it. After four months in production, we had 4,800 fresh grades from real use, on top of the 18,000 historical ones. The reranker we trained on the combined set lifted held-out citation precision from 0.91 to 0.94.
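In data terms the grade collector can be very small. A sketch, assuming each accept or reject is stored with the query and the cited chunk and turned into a binary label; the `EditorDecision` type and the 0/1 labelling are assumptions, and the historical grades may use a different scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditorDecision:
    query: str      # the editor's original question
    chunk_id: str   # e.g. "ECLI:NL:HR:2021:1313#3.4"
    accepted: bool  # True if the editor kept the cited passage


def to_training_examples(
    decisions: list[EditorDecision],
) -> list[tuple[str, str, int]]:
    """Turn accept/reject decisions into (query, chunk_id, label) rows."""
    return [(d.query, d.chunk_id, 1 if d.accepted else 0) for d in decisions]
```

Nothing here asks the editor for an extra click, which is the whole design constraint.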

The eval suite

We built a 480-question eval set graded by three senior editors, with each question scored on three axes: factual correctness, citation precision (do the cited rulings actually support the claim) and citation recall (did the agent miss a controlling authority the editors would have cited). The suite runs on every meaningful change to the pipeline. A full run takes 22 minutes and costs about €1.40 in inference. No prompt change, no retriever change, no chunker change ships to production until it comes back green.

Without that suite, every change is a vibe check. With it, the prompt tweak that bumped fluency but dropped citation precision from 0.94 to 0.88 was caught the same day, rather than three weeks later when an editor would have noticed a memo cited a reversed ruling. We also keep a per-corpus breakdown. The pre-2007 archive consistently lags the modern corpus by about six points on citation recall, and that gap tells us where the next round of parser work should go.
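The two citation metrics and the release check reduce to a few set operations. This is a sketch with assumed definitions: "green" is taken to mean no metric falls below its previous baseline, and the function names are hypothetical.

```python
def citation_precision(cited: set[str], supporting: set[str]) -> float:
    """Share of cited rulings that editors judged to support the claim."""
    if not cited:
        return 0.0
    return len(cited & supporting) / len(cited)


def citation_recall(cited: set[str], expected: set[str]) -> float:
    """Share of the authorities editors would have cited that the agent cited."""
    if not expected:
        return 1.0
    return len(cited & expected) / len(expected)


def regressions(scores: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """Metrics that fell below baseline; an empty list means the run is green."""
    return [metric for metric, floor in baseline.items()
            if scores.get(metric, 0.0) < floor - tolerance]


# The prompt tweak described above: precision fell from 0.94 to 0.88.
failed = regressions({"citation_precision": 0.88}, {"citation_precision": 0.94})
```

Running `regressions` once per corpus, with a separate baseline for each, gives the per-corpus breakdown without any extra machinery.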

What this cost in shape, not euros

Eighteen months. Three engineers on our side, two editors on theirs at about 40% time, one senior editor at 20%. Total compute bill across training and eighteen months of production is under €18,000. The expensive part was editor time, and it was worth it.

A useful frame from a recent Hacker News discussion on AI-native startups: the moat is not the model. The moat is the eighteen-month dataset of editor grades, the parser for the bespoke 2014 schema, and the citation gate that turns the agent from a confident chatbot into a system editors are willing to sign their name to.

What we would do differently

Build the citation gate before the retriever. We built it last and almost shipped without it. The gate is the product. Build it on day one with a stub retriever behind it and you will design every other component to feed it cleanly. First-pass retrieval can be the dumbest BM25 you can stand up. Save the reranker work until the gate is real.

Version the corpus on day one. Rechtspraak.nl quietly republishes rulings with corrected metadata. We caught a case where an answer cited a ruling whose ECLI had been superseded. Now we snapshot the canonical URL response weekly and diff it.
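The weekly diff does not need much. A sketch, assuming each snapshot is stored as a mapping from ECLI to a hash of the response body; `fingerprint` and `diff_snapshots` are hypothetical names, and hashing the whole body is the simplest possible choice, not necessarily the one used in production.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Stable hash of a response body, stored in place of the body itself."""
    return hashlib.sha256(body).hexdigest()


def diff_snapshots(previous: dict[str, str],
                   current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {ecli: fingerprint} snapshots taken a week apart."""
    shared = previous.keys() & current.keys()
    return {
        "changed": sorted(e for e in shared if previous[e] != current[e]),
        "removed": sorted(previous.keys() - current.keys()),
        "added": sorted(current.keys() - previous.keys()),
    }
```

Every ECLI under "changed" or "removed" is a candidate for re-indexing, and for a second look at any memo that cites it.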

Treat the editor-grade collector as the first product, not the last. The 18,000 historical grades trained the reranker. If we had shipped a grade-collecting UI in week one of the project, we would have had 5,000 fresh grades on the actual agent's output by month four. That is the dataset that compounds, and there is no substitute for starting it early.

When we built the agent for the Leuven publisher, the thing we kept underestimating was the verifier prompt: how often it silently approved a citation that did not support the claim. We ended up solving it by making the verifier blind to the original question and forcing it to quote the supporting sentence from the chunk before grading. That kind of plumbing is most of what our AI agents work looks like in practice.
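Both fixes can be sketched in a few lines. The prompt wording and function names below are assumptions, not the production prompt. The two properties that matter are that the prompt builder has no parameter for the question, and that a "yes" only counts when the quoted sentence really appears in the chunk.

```python
def build_verifier_prompt(claim: str, chunk_text: str) -> str:
    """Prompt for one claim. There is deliberately no `question` parameter."""
    return (
        "You are checking a citation.\n"
        f"Claim: {claim}\n"
        f"Source passage: {chunk_text}\n"
        "First quote, verbatim, the sentence from the source passage that "
        "supports the claim. Then answer yes or no: does the passage "
        "support the claim?"
    )


def _normalise(text: str) -> str:
    """Collapse whitespace so line breaks do not defeat the substring check."""
    return " ".join(text.split())


def accept_verdict(quote: str, verdict: str, chunk_text: str) -> bool:
    """Accept only a 'yes' that comes with a quote found in the chunk."""
    quote_found = bool(quote.strip()) and _normalise(quote) in _normalise(chunk_text)
    return verdict.strip().lower() == "yes" and quote_found
```

The substring check is what stops the silent approvals: a verifier that cannot point at a sentence has not verified anything.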

If you have a RAG system in production, open your verifier prompt today and check whether it sees the user's question. If it does, that is a one-line change that will tighten your citations by Monday.

Key takeaway

If your RAG output can reach production without a citation gate, you don't have a RAG agent. You have a confident chatbot in front of a database.

FAQ

Why not just use 512-token chunks like every tutorial recommends?

Legal arguments don't end at token 512. A rechtsoverweging cut in half produces a chunk that cites half a holding. Chunk along document structure, not token count.

How do you stop the model from inventing citations?

Two-pass structure. The drafter writes with citation markers; a separate verifier sees only the draft plus cited chunks (not the question) and grades each claim. Failed claims drop to editor review.

Is the eighteen-month dataset of editor grades reusable, or specific to this publisher?

Specific. The grades encode this publisher's editorial standards. That's why it's a moat. A competitor would need to collect their own.

What did the 12-year-old custom XML archive need that the modern ECLI corpus didn't?

A bespoke parser that handled nested rulings, four date formats, and inline HTML in 312 hand-edited files. About 1.5% of the archive needed manual editor triage.

Why three indexes instead of one normalised corpus?

Failure modes differ. When an answer is wrong, you want to know which corpus it came from. Three indexes also let you score citation recall per corpus and target parser work where it pays.
