RAG citations: the guardrail we ship after Germany's ruling

A German court has made hallucinated answers the publisher's problem, not the model's. Here is the small server-side check we now run before every RAG response leaves the box.

Jacob Molkenboer · Founder · A Brand New Company · 11 Jun 2026 · 6 min

It's a Tuesday in May. You push the new RAG help-bot to prod at 14:00, watch the first hundred conversations stream in, and feel the quiet pride of a thing that finally answers the question instead of pointing at the FAQ. By Friday, a screenshot is circulating in your customer's Slack. The bot has confidently invented a refund policy that does not exist. Your support lead asks, with what passes for patience, whose problem this is.

Until last week the honest answer in most Dutch SaaS shops was a shrug and "the model". That answer just stopped working.

What the German ruling actually says

A German court has held Google liable for false statements appearing in its AI Overviews. The legal detail will be picked apart for years, but the shape of the holding is the part that matters to anyone shipping a chat agent: the entity that publishes the generated sentence is the entity that owns the consequences of that sentence. Not the model vendor. Not the retrieval index. The product that put the words on screen.

If you run a customer-facing RAG agent on a .nl domain, treat that as your new operating assumption. Germany is not the Netherlands, but Dutch and German consumer-protection regimes rhyme more than they differ, and the EU AI Act's transparency obligations sit on top of both. Engineering circles noticed within the week; everyone else has since caught up.

Why "the model said it" is no longer a defense

A RAG stack feels safer than a raw LLM call because there is a retrieval step, and the retrieval step feels like grounding. It is not grounding. It is suggestion. The model reads the retrieved chunks the way a tired junior reads source material at 23:30: mostly accurately, with occasional confident embellishment in the gaps.

In our own production logs across fourteen live agents, the failure mode is almost never "model invents from nothing". It is "model paraphrases three real chunks and adds a fourth sentence that ties them together with a fact that appears in none of them". That fourth sentence is the one that ends up in the screenshot.

The verification gap nobody wants to close

Most teams have shipped one of three patterns. The first ships the raw model output and prays. The second adds a "based on the following context, do not invent" line to the system prompt and calls it grounding. The third asks the model to cite, then displays the citations next to the answer.

None of these check anything. The first two are open-loop. The third is the cruel one: when we audited a client's bot earlier this year, 18% of the inline [1][2] markers pointed at chunks that did not contain the cited claim. The user sees a citation and trusts it. The citation was generated by the same process that generated the hallucination.

The ruling does not care which of the three patterns you shipped. It cares about the sentence that was published.

The three-line guardrail we now ship by default

Every new RAG agent we ship runs a post-generation check before the response leaves the server. The idea is older than the ruling; the urgency is not. The check is not glamorous and not novel. It is the thing every team meant to build and never quite prioritised.

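import numpy as np
# sent_tokenize needs NLTK's Punkt tokenizer data downloaded once.
from nltk.tokenize import sent_tokenize as split_sentences

# embed(text) -> np.ndarray; use your provider's embeddings endpoint.
def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0  # zero vector: no similarity

def verify_citations(answer: str, chunks: list[str], embed, threshold: float = 0.78) -> list[str]:
    """Return the sentences of `answer` that no retrieved chunk resembles closely enough."""
    chunk_vecs = [embed(c) for c in chunks]  # embed each chunk once, not once per sentence
    unverified = []
    for sent in split_sentences(answer):
        sent_vec = embed(sent)  # embed the sentence once, not once per chunk
        # default=0.0: with no chunks at all, every sentence counts as unverified
        if max((cosine(sent_vec, v) for v in chunk_vecs), default=0.0) < threshold:
            unverified.append(sent)
    return unverified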

Three lines of real logic, wrapped in a function so we can log it. The full integration is fatter (sentence splitting respects citation markers, the embedding call is batched, the threshold is per-tenant), but the core is the loop above. Every sentence in the answer must have at least one retrieved chunk that resembles it strongly enough to count as a source. If any sentence does not, we have a decision to make, and we make it explicitly.

In practice we route the failed answer through one of three handlers, in order. First, drop the offending sentence and re-render; if the answer still stands without it, ship the trimmed version. Second, if the offending sentence carries the answer, regenerate once with the failed sentence quoted back into the prompt as "do not assert this without a source". Third, if the regeneration also fails verification, return a graceful "I do not have a confident source for this, here is the closest passage we have" response and surface the conversation to a human.
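The routing described above can be sketched roughly like this. It is a minimal illustration under stated assumptions, not a definitive implementation: `route_answer`, `still_answers`, and `regenerate` are hypothetical names, `verify` stands in for a two-argument wrapper around the citation check, and the fallback assumes `chunks` is ordered by retrieval rank so the first chunk is the closest passage.

```python
from typing import Callable

FALLBACK = ("I do not have a confident source for this. "
            "Here is the closest passage we have: ")

def route_answer(
    answer: str,
    chunks: list[str],
    verify: Callable[[str, list[str]], list[str]],  # returns the unverified sentences
    still_answers: Callable[[str], bool],           # hypothetical: does the trimmed text still answer?
    regenerate: Callable[[list[str]], str],         # hypothetical: re-prompt, forbidding the failed sentences
) -> tuple[str, str]:
    """Return (action, text); action is 'ship', 'trimmed', 'regenerated' or 'fallback'."""
    failed = verify(answer, chunks)
    if not failed:
        return "ship", answer

    # Handler 1: drop the offending sentences and ship the rest if it still stands.
    trimmed = answer
    for sent in failed:
        trimmed = trimmed.replace(sent, "")
    trimmed = " ".join(trimmed.split())
    # Re-verify, in case a failed sentence was not found verbatim and nothing was removed.
    if trimmed and not verify(trimmed, chunks) and still_answers(trimmed):
        return "trimmed", trimmed

    # Handler 2: regenerate exactly once, then verify the new answer.
    retry = regenerate(failed)
    if not verify(retry, chunks):
        return "regenerated", retry

    # Handler 3: graceful fallback; the caller escalates 'fallback' to a human.
    closest = chunks[0] if chunks else ""  # assumption: chunks are sorted by retrieval rank
    return "fallback", FALLBACK + closest
```

Returning the action alongside the text keeps the decision explicit in the logs, and lets the caller decide how a `fallback` reaches a human.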

The threshold of 0.78 is not magic. It is what tuned out the long tail of "model glued two chunks together with a third fact" without strangling answers that are paraphrasing one chunk well. Tune it on your own corpus. We re-tune it monthly per client.
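One way to approach the tuning, sketched under the assumption that you have hand-labelled a sample of answer sentences as supported or unsupported and recorded each sentence's best cosine score against its chunks. `pick_threshold` is a hypothetical helper, the sample values are made up for illustration, and plain accuracy is only one possible objective; you may prefer to weight a passed hallucination more heavily than a blocked paraphrase.

```python
def pick_threshold(scored: list[tuple[float, bool]], candidates: list[float]) -> float:
    """scored: (best cosine against any chunk, human label: is the sentence supported?).

    Returns the candidate threshold whose pass/fail decision agrees with the
    labels most often. On ties, the first candidate in the list wins.
    """
    def accuracy(threshold: float) -> float:
        correct = sum((score >= threshold) == supported for score, supported in scored)
        return correct / len(scored)

    return max(candidates, key=accuracy)

# Made-up labelled sample, for illustration only.
sample = [
    (0.91, True), (0.84, True), (0.80, True),    # faithful paraphrases
    (0.76, False), (0.74, False), (0.69, False), # glued-together embellishments
]
best = pick_threshold(sample, candidates=[0.70, 0.75, 0.78, 0.82, 0.85])
```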

Warning

Cosine similarity catches paraphrase, not arithmetic. If your agent says "the discount is 15% so the total is €127.50", the verifier will happily pass a wrong number that paraphrases a right one. Numbers need a separate symbolic check.
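A first step toward that separate check might look like the sketch below. It is deliberately naive and rests on assumptions: `unsupported_numbers` is a hypothetical helper, it only tests whether each number in a sentence literally appears in some retrieved chunk, and it does not re-derive arithmetic. A computed total will therefore be flagged, which is the point: it should go to a real calculation or a human rather than pass on similarity alone.

```python
import re

# Matches integers and simple decimals such as 15, 127.50 or 127,50.
# Thousands separators (1,000.50) are not handled in this sketch.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def unsupported_numbers(sentence: str, chunks: list[str]) -> list[str]:
    """Return the numbers in `sentence` that appear in none of the chunks."""
    seen = {n.replace(",", ".") for chunk in chunks for n in NUMBER.findall(chunk)}
    return [n for n in NUMBER.findall(sentence) if n.replace(",", ".") not in seen]

flagged = unsupported_numbers(
    "The discount is 15% so the total is €127.50",
    ["Orders over €100 get a 15% discount."],
)
```

Here the 15 is found in the chunk, while the derived total is not and gets flagged for a proper check.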

What it does not catch, honestly

The guardrail catches the failure mode that produces the screenshot. It does not catch a model that retrieves the wrong chunk and faithfully paraphrases it. It does not catch a model that summarises three correct chunks into a misleading whole. It does not catch tone, bias, or the confidently-helpful answer that is technically grounded and operationally disastrous.

Those need other layers: a retrieval evaluator, an answer-level judge, a small set of red-team probes that run in CI on every prompt change. We run all of them on the agents that touch money. Some model vendors now ship native citation primitives that can shorten the verifier loop (Anthropic's citations feature is the cleanest of the bunch), but the server-side check still belongs to you. The citation verifier defends against the specific failure the German court just made expensive, and it is the cheapest of the four to add.

If you only ship one new check this quarter, ship this one.

The five-minute audit you can run today

Pull your last 200 production conversations. For each assistant message that contains an inline citation, grep the cited chunk for the noun phrases in the cited sentence. If fewer than 90% of the citations actually contain the claim, your bot is already producing the sentence that the next plaintiff will screenshot. When we built the support agent for a Dutch logistics SaaS earlier this year, this gap was exactly what bit us; we ended up shipping the verifier above as the last server-side hop before the response stream opens to the client. It is one of the boring layers of building AI agents well, and the one that lets the founder sleep.
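If you would rather script the audit than grep by hand, a rough sketch follows. It makes several assumptions: `citation_hit_rate` is a hypothetical helper, a marker `[n]` cites `chunks[n-1]` and sits before the sentence's closing punctuation, and shared content words stand in for real noun-phrase matching. The `min_overlap` value is a guess to tune, not a recommendation.

```python
import re

MARKER = re.compile(r"\[(\d+)\]")
WORD = re.compile(r"[a-z0-9]+")
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or",
             "for", "on", "with", "that", "this", "it", "be", "as", "at", "by"}

def content_words(text: str) -> set[str]:
    """Lower-cased words and numbers, minus a small stopword list."""
    return {w for w in WORD.findall(text.lower()) if w not in STOPWORDS}

def citation_hit_rate(messages, min_overlap: float = 0.6):
    """messages: iterable of (answer_text, chunks). Returns hits / citations checked,
    or None if no citations were found."""
    checked = hits = 0
    for answer, chunks in messages:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", answer):
            claim = content_words(MARKER.sub(" ", sentence))
            if not claim:
                continue
            for n in MARKER.findall(sentence):
                checked += 1
                idx = int(n) - 1
                if 0 <= idx < len(chunks):  # an out-of-range marker counts as a miss
                    overlap = len(claim & content_words(chunks[idx])) / len(claim)
                    hits += overlap >= min_overlap
    return hits / checked if checked else None
```

Feed it your exported conversations and compare the result against the 90% bar; anything below it is worth reading by hand before trusting the number.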

Key takeaway

If you ship a customer-facing RAG agent, verify every sentence against the retrieved chunks before it leaves the server. The publisher owns the sentence now.

FAQ

What did the German ruling actually decide?

A German court held Google liable for false statements appearing in its AI Overviews. The publisher of the generated sentence carries the risk, not the underlying model vendor.

Will this apply to a Dutch SaaS?

Dutch and German consumer-protection law overlap heavily, and the EU AI Act covers both. Assume the same liability shape until a Dutch case says otherwise.

Does showing citations next to the answer solve the problem?

No. In one client audit, 18% of inline citation markers pointed at chunks that did not contain the cited claim. The citation has to be verified, not just displayed.

What cosine similarity threshold should I use?

We start at 0.78 and tune monthly per corpus. Too high strangles real paraphrase; too low passes confident embellishment. There is no universal value.
