
RAG freshness gates: a withdrawn arrest in a legal brief

At 16:47 on a Thursday, a Maastricht paralegal spotted a withdrawn Hoge Raad arrest in a 14-page conclusie draft. The bulk feed had dropped one field.

Jacob Molkenboer · Founder · A Brand New Company · 22 Jun 2026 · 8 min


It was 16:47 on a Thursday in May when a paralegal at a 28-person Maastricht advocatenkantoor noticed something off in a conclusie draft. A Hoge Raad arrest cited on page nine — ECLI:NL:HR:2019:1278 — had been withdrawn months earlier. The brief was due Monday. The associate handling it had already started clearing his desk.

The cite hadn't come from a junior. It came from the firm's RAG agent, which had been pulling jurisprudentie for six months without incident. The Slack message to us came in at 16:53. By 18:20 we knew the cause. By the following Tuesday, every legal draft on that system had to clear a new gate before the agent could call it done.

What the bulk feed quietly did

Rechtspraak.nl publishes a bulk feed of Dutch jurisprudentie under its Open Data programme. Every uitspraak ships as an XML record with a stable ECLI identifier, the full text, and a set of metadata fields, including the intrekkings-marker that flags whether an arrest has been withdrawn, vernietigd in revision, or otherwise stripped of authority.

That marker is what tells the rest of the world: do not cite this. For six months we'd been ingesting the feed, embedding the body text, and storing the marker as a sidecar flag. Any chunk with the marker set was excluded from retrieval at query time. Clean separation.

On 2026-05-04 the feed shipped a record for ECLI:NL:HR:2019:1278 in which the intrekkings-marker field was present but empty. Not absent — empty. Our ingestion script treated the empty string as "not withdrawn" and overwrote the previously-set flag. The chunk became retrievable again. Six weeks later it surfaced on a semantically adjacent query about cassatie-grenzen, scored a 0.84 cosine similarity, and ended up in paragraph 27 of a brief that was about to walk into court.
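The failure mode can be reproduced in a few lines. This is a hypothetical reconstruction, not the firm's ingestion code: the element name `intrekking`, the `"true"` encoding, and the helper `naive_is_withdrawn` are all assumptions made for illustration. It shows how a mapping that only asks "does the field say true?" turns a present-but-empty field into "not withdrawn" and overwrites the earlier flag.

```python
import xml.etree.ElementTree as ET


def naive_is_withdrawn(record_xml: str) -> bool:
    """Hypothetical buggy mapping: empty and absent both read as False."""
    root = ET.fromstring(record_xml)
    # findtext returns None when the element is absent and "" when it is empty.
    value = root.findtext("intrekking")
    return value == "true"


previous = "<record><intrekking>true</intrekking></record>"
drifted = "<record><intrekking></intrekking></record>"

# First run sets the flag; the later run overwrites it unconditionally.
sidecar = {"withdrawn": naive_is_withdrawn(previous)}
sidecar["withdrawn"] = naive_is_withdrawn(drifted)
```

After the second assignment the sidecar flag is `False`, even though nothing upstream ever said the withdrawal had been reversed.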

What made it harder to spot is that the case body was unchanged. The text of the arrest was still correct. The intrekkings-marker is a single Boolean-ish field at the periphery of the record, and the embedding model has no reason to care about it. The signal lives entirely in the metadata, and the metadata is exactly the part that no human ever looks at unless something has gone wrong.

Warning

Silent upstream schema drift is the most common cause of stale RAG output we see. An empty string is not the same as "false." Unconditional overwrites are how withdrawn data gets resurrected.

Why our existing RAG hygiene didn't see it

The firm's RAG stack was, by any reasonable benchmark, well-built. Weekly re-embedding of the full corpus. Source attribution on every passage. A guardrail that refused to cite anything without an ECLI. A second pass that flagged any citation older than five years for human review. The associate had reviewed seven citations that morning. He had not flagged this one because it was from 2019 — comfortably inside the window — and the agent's confidence score was high.

What the stack didn't have was a per-citation liveness check at draft-emit time. The check we ran was at ingestion. Once a passage was in the vector store with its sidecar metadata, we trusted that metadata. The single point of failure was the ingestion script, and it failed silently on one record six weeks before anyone noticed.

The forensics took about ninety minutes. We pulled the ingestion logs, found the 4 May run, diffed the record against the previous ingestion of the same ECLI, and watched the intrekkings-marker flip from true to empty string. We then went to Rechtspraak.nl directly and confirmed the live record still showed the case as ingetrokken. The bug was entirely on our side, in the four lines of ingestion code that mapped the XML field onto our sidecar metadata. The associate's draft had been compromised by a missing empty-value check.

This is the part that bothers us. The error wasn't a hallucination. The model didn't invent the cite. It correctly retrieved a real ECLI, with a real passage, that genuinely matched the legal question. The pipeline had quietly told it the passage was valid when it wasn't.

The citation-freshness gate

The fix is a gate that sits between the agent's draft and the human reviewer. Before any paragraph containing a jurisprudentie reference is allowed to leave the agent loop, every ECLI in it is checked against the live Rechtspraak.nl record. Not the cached one. The live one.

The gate does three things:

1. Extracts every ECLI:NL:* reference from the draft using a strict regex.
2. For each one, fetches the canonical XML record from data.rechtspraak.nl with a hard 800ms timeout.
3. Checks the intrekkings-marker, the procesgang status, and any "vernietigd" annotation. If any of those resolve to a withdrawn or overruled state, the draft is blocked and a structured error surfaces to the agent and to the reviewer.

The gate runs after the draft is composed and before the agent returns a finished artifact. It is not a soft warning. The agent cannot self-override it.

What the gate looks like in code

Here is a stripped-down version of the validator. We run it as a separate worker so a slow Rechtspraak response doesn't tie up the agent thread.

import re
import asyncio
import httpx
from dataclasses import dataclass

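# ECLI court codes can run to seven characters (RBAMS, GHARL) and older ordinals
# contain letters as well as digits, so a narrower pattern would skip citations.
ECLI_PATTERN = re.compile(r"\bECLI:NL:[A-Z][A-Z0-9]{0,6}:\d{4}:[A-Z0-9]{1,25}\b")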
RECHTSPRAAK = "https://data.rechtspraak.nl/uitspraken/content"

@dataclass
class CitationStatus:
    ecli: str
    live: bool
    reason: str | None = None

async def check_one(client: httpx.AsyncClient, ecli: str) -> CitationStatus:
    try:
        r = await client.get(RECHTSPRAAK, params={"id": ecli}, timeout=0.8)
        r.raise_for_status()
    except (httpx.TimeoutException, httpx.HTTPError) as e:
        # Fail closed: if we can't verify, we don't ship.
        return CitationStatus(ecli, live=False, reason=f"unverifiable: {e}")

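    body = r.text
    # Simplification: these substring checks scan the whole record, decision text
    # included, so a ruling that merely uses the word "vernietigd" is blocked too.
    # That errs on the safe side; a fuller version should parse the metadata fields.
    if "<intrekking>" in body or "ingetrokken" in body.lower():
        return CitationStatus(ecli, live=False, reason="ingetrokken")
    if "vernietigd" in body.lower():
        return CitationStatus(ecli, live=False, reason="vernietigd in revision")
    return CitationStatus(ecli, live=True)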

async def gate(draft_text: str) -> list[CitationStatus]:
    eclis = sorted(set(ECLI_PATTERN.findall(draft_text)))
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(check_one(client, e) for e in eclis))
    return results

Two design choices worth flagging. First: the gate fails closed. If Rechtspraak.nl is unreachable, the draft does not ship. We would rather hold a brief for ten minutes than ship a stale citation. Second: the gate's output is a structured list, not a yes/no. When a citation is blocked, the reviewer sees the ECLI and the reason and can either accept the agent's draft with that paragraph removed or rewrite the passage by hand.
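As a rough sketch of the second choice, the list returned by the gate can be folded into a single blocking decision plus a reviewer-facing message. `GateDecision`, `decide` and `format_for_reviewer` are hypothetical names, not part of the shipped validator; `CitationStatus` is repeated here only so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CitationStatus:  # same shape as the dataclass in the validator above
    ecli: str
    live: bool
    reason: Optional[str] = None


@dataclass
class GateDecision:  # hypothetical wrapper around the gate's output
    allowed: bool
    blocked: list[CitationStatus]


def decide(results: list[CitationStatus]) -> GateDecision:
    """Block the draft if any citation is not verifiably live."""
    blocked = [r for r in results if not r.live]
    return GateDecision(allowed=not blocked, blocked=blocked)


def format_for_reviewer(decision: GateDecision) -> str:
    """Render the ECLI and the reason for each blocked citation."""
    if decision.allowed:
        return "All citations verified against the live source."
    lines = [f"{r.ecli}: {r.reason}" for r in decision.blocked]
    return "Draft blocked:\n" + "\n".join(lines)
```

Note that an unverifiable citation lands in `blocked` just like a withdrawn one, which is what keeps the decision fail-closed.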

Latency, in production, has been fine. The average draft has 6–14 ECLI references. The gate runs them in parallel and adds 220ms to median draft time. The reviewer doesn't notice.

The other three things we changed

A gate at the boundary is the first line of defence, not the only one. We made three more changes that week.

First, ingestion no longer collapses an empty field into "false." The intrekkings-marker has three valid states: set to true, set to false, or genuinely absent. An empty string is none of those; it is logged as a parse error and the previous value is preserved. The ingestion script grew an explicit state machine for marker fields, which we should have written eighteen months ago.
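A minimal sketch of that three-state handling, under assumptions: the raw values `"true"` and `"false"`, the `Marker` enum and the function names are illustrative, not the production state machine. `None` stands for a field that is absent from the record.

```python
import logging
from enum import Enum
from typing import Optional

log = logging.getLogger("ingest")


class Marker(Enum):
    WITHDRAWN = "withdrawn"
    ACTIVE = "active"
    ABSENT = "absent"


def parse_marker(raw: Optional[str]) -> Marker:
    """Map a raw field onto one of three states; anything else is a parse error."""
    if raw is None:
        return Marker.ABSENT
    value = raw.strip().lower()
    if value == "true":
        return Marker.WITHDRAWN
    if value == "false":
        return Marker.ACTIVE
    raise ValueError(f"unparseable marker value: {raw!r}")


def next_marker(previous: Marker, raw: Optional[str], ecli: str) -> Marker:
    """Return the new state, keeping the previous one when the field cannot be parsed."""
    try:
        return parse_marker(raw)
    except ValueError as exc:
        log.error("marker parse error for %s: %s; keeping %s", ecli, exc, previous.name)
        return previous
```

With this shape, `next_marker(Marker.WITHDRAWN, "", ecli)` stays `WITHDRAWN`, so an empty field can no longer make a withdrawn arrest retrievable again.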

Second, every ingestion run now writes a delta report showing how many records changed marker state since the last run. Any record whose marker flipped from withdrawn back to active without a corresponding Rechtspraak announcement triggers a hold. In the past six weeks the report has flagged two more cases. Neither was a real reversal. Both were upstream parse artifacts.
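One way to compute such a report, sketched under assumptions: each run is reduced to a mapping from ECLI to a withdrawn flag, and `announced_reversals` is a hypothetical set of ECLIs for which a reversal has been confirmed upstream. The function names are illustrative.

```python
def delta_report(previous: dict[str, bool], current: dict[str, bool]) -> dict[str, list[str]]:
    """Compare withdrawn flags (ECLI -> withdrawn?) between two ingestion runs."""
    newly_withdrawn: list[str] = []
    reactivated: list[str] = []
    for ecli, was_withdrawn in previous.items():
        if ecli not in current:
            continue  # record missing from this run; handled elsewhere
        now_withdrawn = current[ecli]
        if was_withdrawn and not now_withdrawn:
            reactivated.append(ecli)
        elif now_withdrawn and not was_withdrawn:
            newly_withdrawn.append(ecli)
    return {
        "newly_withdrawn": sorted(newly_withdrawn),
        "reactivated": sorted(reactivated),
    }


def needs_hold(report: dict[str, list[str]], announced_reversals: set[str]) -> list[str]:
    """Withdrawn-to-active flips with no confirmed reversal are held for review."""
    return [e for e in report["reactivated"] if e not in announced_reversals]
```

The asymmetry is deliberate: a record becoming withdrawn is expected traffic, while a record becoming active again is rare enough to deserve a human look.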

Third, we added a weekly cross-check against a second, independent source. For Dutch jurisprudentie that means kennisbank.overheid.nl. Two sources disagreeing on the status of an ECLI does not auto-resolve; it raises a ticket. For RAG systems that produce regulated output, two-source verification at the corpus level is cheap insurance.
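The cross-check itself can be very small. The sketch below assumes both sources have already been reduced to a mapping from ECLI to a withdrawn flag; the function name and the ticket fields are hypothetical. It only reports disagreements and never picks a winner.

```python
def cross_check(primary: dict[str, bool], secondary: dict[str, bool]) -> list[dict]:
    """Return one ticket per ECLI on which the two sources disagree."""
    tickets = []
    # Only ECLIs present in both sources can disagree on status.
    for ecli in sorted(primary.keys() & secondary.keys()):
        if primary[ecli] != secondary[ecli]:
            tickets.append(
                {
                    "ecli": ecli,
                    "primary_withdrawn": primary[ecli],
                    "secondary_withdrawn": secondary[ecli],
                }
            )
    return tickets
```

Each ticket carries both values so whoever picks it up can see which source claims what before checking the live record.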

Takeaway

A vector store is a cache. The source of truth lives somewhere else. Any agent that forgets that distinction will, eventually, ship something stale with full confidence.

The cost of agentic confidence

There is a wider point here, and Anthropic's write-up on building effective agents circles the same observation: the more autonomy you give an agent, the more its individual mistakes look like the work of a careful human. The legal RAG didn't fail in a way that screamed for attention. It produced a calm, well-formatted, plausible draft with a real citation to a real case. The only signal that anything was wrong was external — a paralegal's memory of a withdrawal announcement.

Reliable agentic systems are not the ones that are clever. They are the ones with hard, deterministic checks at every point where their output crosses a boundary into the real world. For a legal agent, the boundary is the moment a draft leaves the loop. For an inbox-triage agent, it is the moment a reply gets sent. For a billing automation, it is the moment money moves. At each of those boundaries you want a gate that does not depend on the model's judgment. That is the through-line of Anthropic's argument as well: keep the patterns simple, and verify at every external action rather than trusting the model to verify itself.

This is the failure mode that worries us most. Agents do not fail loudly. They fail in ways that look like competent work. To a busy human, a 14-page conclusie with one bad cite reads identically to a 14-page conclusie with all good cites. The only honest answer is to assume the agent will be wrong somewhere, at some point, and to put the check in a place where the wrongness cannot escape.

It is also why the broader conversation around identity controls for agent-driven actions is worth watching. The industry is, slowly, moving toward stronger verification on high-stakes agentic capabilities. "An agent did it" is not yet an acceptable answer when the action has real-world consequences, and the systems we ship should not pretend otherwise.

The five-minute audit

If you run any RAG agent that produces output a human will sign their name to — a legal brief, a medical summary, a financial recommendation, a customer-facing reply — ask one question. Between the moment the model finishes drafting and the moment the draft becomes an artifact, is there a deterministic check against the source of truth? Not a soft warning. Not a confidence score. A check that can refuse to ship.

The mistake we made for six months was assuming that a check at ingestion time was equivalent to a check at output time. It is not. Ingestion checks defend against bad data getting in. Output checks defend against bad data getting out. They are different problems and they need different gates. If your RAG has only the first, your agent is one upstream schema change away from a brief like this one.

If the answer to the audit is no, write the check today. Start with a regex over the draft for whatever the citation pattern is (ECLI, CVE, SKU, ISO number, internal ticket ID) and a single HTTP call per reference. Fail closed. You can refine it later. The version we shipped to the Maastricht firm on the Tuesday after the incident was fewer than eighty lines.
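A domain-agnostic starting point might look like the sketch below. `citation_gate` and the `is_live` callback are hypothetical: the callback stands in for the single HTTP call per reference and is expected to return True only when the source confirms the reference is still valid.

```python
import re
from typing import Callable


def citation_gate(
    draft: str,
    pattern: re.Pattern,
    is_live: Callable[[str], bool],
) -> dict[str, str]:
    """Return {reference: reason} for every reference that must block the draft."""
    blocked: dict[str, str] = {}
    for ref in sorted(set(pattern.findall(draft))):
        try:
            if not is_live(ref):
                blocked[ref] = "not live at source"
        except Exception as exc:
            # Fail closed: a lookup that errors counts as a blocked reference.
            blocked[ref] = f"unverifiable: {exc}"
    return blocked


# Example pattern for CVE identifiers; swap in whatever your citations look like.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")
```

An empty dictionary means the draft may ship. Anything else goes back to the reviewer with the reference and the reason attached.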

When we built the Rechtspraak-aware RAG agent for that advocatenkantoor, we learned that ingestion-time hygiene is not the same as emit-time hygiene. We solved it by treating every external citation as untrusted until validated against its live source, every time.

Key takeaway

A vector store is a cache, not a court of record. Every citation an agent emits needs a live check against its source of truth before the draft leaves the loop.

FAQ

Why didn't the RAG agent's confidence score catch the stale citation?

Confidence scores rate semantic relevance, not legal validity. The agent retrieved a real case that genuinely matched the query. It just happened to be a withdrawn one. Liveness has to be checked separately, against the source of truth.

Could you cache the citation-freshness checks?

We cache results for fifteen minutes, never longer. The whole point of the gate is to catch upstream changes the corpus missed. A long cache would reintroduce the exact failure mode we just fixed.
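A short-lived cache of that kind fits in a few lines. This is a sketch, not the production cache: the `TTLCache` class and its injectable `clock` parameter are assumptions, with the clock exposed so expiry can be tested without waiting.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Remember liveness results for a bounded time (default: fifteen minutes)."""

    def __init__(self, ttl_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, bool]] = {}

    def get(self, ecli: str) -> Optional[bool]:
        """Return the cached result, or None if missing or expired."""
        entry = self._entries.get(ecli)
        if entry is None:
            return None
        stored_at, live = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[ecli]  # expired: force a fresh live check
            return None
        return live

    def put(self, ecli: str, live: bool) -> None:
        self._entries[ecli] = (self._clock(), live)
```

A `None` from `get` means "go to the live source". It seems safest to store only definitive answers, so an unverifiable lookup is retried on the next draft instead of being remembered.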

Does this apply to non-legal RAG, like medical or financial?

Yes. Any domain where the source of truth changes after ingestion needs an emit-time check. Drug recalls, CVE updates, withdrawn ISO standards, deprecated APIs. Same pattern, different regex, same fail-closed rule.

What if Rechtspraak.nl is down when the agent tries to validate?

The gate fails closed. The draft is held and the reviewer sees an explicit error. We would rather block a Monday-morning brief for an hour than ship a citation we couldn't verify against its source.
