RAG for accountants: every claim cites RJ or Wta first

A 22-person Groningen accountancy wanted a RAG agent that could draft DGA letters. The constraint: every passage had to cite RJ or Wta before a single token shipped.

Jacob Molkenboer · Founder · A Brand New Company · 20 Jun 2026 · 10 min

It is a Tuesday evening in late November. A senior accountant in Groningen is on her fourth concept letter to a DGA, the kind that explains why a holding-BV's deelnemingsvrijstelling does not survive a particular herstructurering. She catches a citation. RJ 216.5. She frowns. RJ 216 covers fusies en overnames (mergers and acquisitions), not deelnemingen. The model has invented a bepaling that does not exist. Two months later we shipped a RAG agent for her firm that refuses to write a sentence until it has a vetted RJ-bepaling or Wta-artikel in hand. This is the playbook.

The constraint sets the architecture

Most RAG demos optimise for "did it answer the question." Accountancy answers a different question first: can I sign my name under this? A partner at a Dutch accountancy is regulated under the Wet toezicht accountantsorganisaties (Wta) and inspected by the AFM. If the conceptbrief naar de DGA cites a non-existent RJ-bepaling, the cost is not a confused client. It is a sanction file.

That fact rearranges every layer of the stack. Retrieval is no longer "fetch the top-5 chunks." It is "produce a verifiable passage with an enforceable provenance chain." Generation is no longer "draft the letter." It is "draft the letter given the citation budget, and if the budget is empty, decline."

We call this cite-before-write. The model is not allowed to emit a normative claim — a "must," a "may," an "is required to" — without an attached citation that has been verified to exist in the live corpus. Everything else is a tooling problem.

The two corpora are not the same animal

This firm has 22 people, 1,240 inbound client questions per week, and two reference bodies:

19,800 handreikingen, RJ-bepalingen, and Wta articles issued through the NBA and the Raad voor de Jaarverslaggeving. Normative, versioned, well-structured, citable.
A 12-year-old Visionplanner advies-archief: 86,000 internal memos written by ten different accountants in three different in-house styles, partially OCR'd from scanned PDFs after a 2017 office move.

If you index both with the same chunker and the same embedding model, the agent will happily cite an internal memo from 2014 as if it were a normative passage. We treat them differently from the first byte.

The NBA, RJ, and Wta side is the authority corpus. Only passages from here are eligible to satisfy the citation gate. The Visionplanner archive is the experience corpus. It can suggest framings, point at similar historical cases, and surface what the firm did last time, but its passages are never accepted as a citation. They are marked [INTERNAL — non-citable] at retrieval time.
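
A minimal sketch of that split at retrieval time, assuming every retrieved passage arrives with a source label. The Passage type, the tag_passage helper, and the "VP-archief" label are hypothetical names used for illustration, not the firm's actual code.

```python
from dataclasses import dataclass

# Sources that may satisfy the citation gate, as listed in the text above.
AUTHORITY_SOURCES = {"RJ", "NBA-handreiking", "Wta"}
# Same marker as in the text; \u2014 is the dash between the two words.
NON_CITABLE_TAG = "[INTERNAL \u2014 non-citable]"


@dataclass(frozen=True)
class Passage:
    source: str          # e.g. "RJ", "Wta", or the hypothetical "VP-archief"
    text: str
    citable: bool = False


def tag_passage(source: str, text: str) -> Passage:
    """Mark a retrieved passage as citable or as internal-only context."""
    if source in AUTHORITY_SOURCES:
        return Passage(source=source, text=text, citable=True)
    # Anything outside the authority corpus is context, never a citation.
    return Passage(source=source, text=f"{NON_CITABLE_TAG} {text}", citable=False)
```

Tagging at retrieval time rather than at generation time means the label travels with the passage, so no later stage has to re-derive where a passage came from.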

Chunking by bepaling, not by tokens

Token-based chunking is the wrong granularity for normative documents. RJ-bepalingen are already chunked, by the people who wrote them. A bepaling has a stable identifier (e.g. RJ 270.103), a version, and a clean scope. Cutting it at 512 tokens dissolves the only handle that makes citation possible.

We wrote a parser that walks each document in the authority corpus and emits one record per addressable unit:

from dataclasses import dataclass

@dataclass
class CitableUnit:
    source: str            # "RJ" | "NBA-handreiking" | "Wta"
    identifier: str        # e.g. "RJ 270.103", "Wta art. 25"
    version_year: int      # the published edition
    superseded_by: str | None  # None for the current edition (Python 3.10+ syntax)
    text: str              # the bepaling, verbatim
    parent_path: list[str] # ["RJ 270", "Opbrengsten", "Dienstverlening"]
    source_url: str        # canonical, deep-linked

The superseded_by field matters. A 2014 advies that cites RJ 271 (Personeelsbeloningen) is not wrong because of the citation. It is wrong because that paragraph was rewritten in 2019. We carry both editions and flag the diff at retrieval time. The agent is allowed to cite the current edition only, unless the question is explicitly historical.
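
A minimal sketch of how a stale citation could be detected, assuming superseded_by holds the lookup key of the replacing edition. The Edition record, the key scheme ("RJ 271@2014"), and both helper names are hypothetical; the real index may link editions differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edition:
    key: str                      # hypothetical unique key, e.g. "RJ 271@2014"
    superseded_by: Optional[str]  # key of the replacing edition, None if current


def current_edition(key: str, index: dict[str, Edition]) -> Edition:
    """Follow superseded_by links until the current edition is reached."""
    seen: set[str] = set()
    unit = index[key]
    while unit.superseded_by is not None:
        if unit.key in seen:
            raise ValueError(f"supersession cycle at {unit.key}")
        seen.add(unit.key)
        unit = index[unit.superseded_by]
    return unit


def is_stale(key: str, index: dict[str, Edition]) -> bool:
    """True when the cited edition is no longer the current one."""
    return current_edition(key, index).key != key
```

With an index holding a 2014 edition that points at a 2019 edition, is_stale returns True for the 2014 key and False for the 2019 key, which is the signal needed to flag the diff.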

The citation gate

The gate is a small, boring filter that sits between retrieval and generation. It is the most important component in the system.

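from dataclasses import dataclass, field

AUTHORITY_SOURCES = {"RJ", "NBA-handreiking", "Wta"}
RELEVANCE_CUTOFF = 0.62  # learned, not hand-picked

@dataclass
class GateResult:  # fields inferred from how the gate uses them
    allow_draft: bool
    reason: str | None = None
    fallback: str | None = None
    citations: list[CitableUnit] = field(default_factory=list)

def citation_gate(passages: list[CitableUnit], question: str) -> GateResult:
    # Only current-edition passages from the authority corpus are eligible.
    eligible = [p for p in passages
                if p.source in AUTHORITY_SOURCES
                and p.superseded_by is None]

    if not eligible:
        return GateResult(
            allow_draft=False,
            reason="no_normative_passage",
            fallback="ask_human_partner",
        )

    # score_relevance is the cross-encoder reranker; score once, best first.
    scored = sorted(((score_relevance(p, question), p) for p in eligible),
                    key=lambda pair: pair[0], reverse=True)
    relevant = [p for score, p in scored if score > RELEVANCE_CUTOFF]

    if not relevant:
        return GateResult(
            allow_draft=False,
            reason="weak_relevance",
            fallback="ask_clarifying_question",
        )

    # Pass on at most four passages, all of which cleared the cutoff.
    return GateResult(allow_draft=True, citations=relevant[:4])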

The relevance score comes from a small cross-encoder reranker fine-tuned on 1,800 historical (question, cited-bepaling) pairs from the firm's own archive. The 0.62 cutoff was learned, not picked. More on that in the evaluation section.

Takeaway

If you cannot show the agent a passage it is allowed to cite, the right behaviour is not "answer anyway." It is "decline and ask for a partner." A RAG agent that knows when to shut up is worth more than one that drafts confidently from nothing.

Hybrid retrieval, then a real reranker

Embeddings alone misfire on Dutch accountancy text because the lexicon is so dense. "Deelneming" means one thing in RJ 214 and a near-opposite thing in colloquial business Dutch. We run BM25 over the bepaling text and a multilingual embedding index in parallel, then merge with reciprocal rank fusion.
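
Reciprocal rank fusion itself is only a few lines. The sketch below assumes each retriever returns a list of bepaling identifiers ordered best first. The function name is ours, and k = 60 is the commonly used default constant, not a value taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of identifiers; each list is ordered best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, identifier in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it contains.
            scores[identifier] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for a stable order.
    return sorted(scores, key=lambda ident: (-scores[ident], ident))


bm25_hits = ["RJ 214.302", "RJ 214.305", "RJ 270.103"]
dense_hits = ["RJ 214.305", "Wta art. 25", "RJ 214.302"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because the fusion uses ranks rather than raw scores, BM25 scores and embedding similarities never need to be normalised onto a common scale.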

The reranker is where the real work happens. We fine-tuned a small cross-encoder (around 110M params) on triples extracted from the firm's archive: (client question, cited bepaling, three near-miss bepalingen). Eight years of advisory correspondence give you enough signal to teach a model the difference between "deelneming als bedoeld in RJ 214" and "deelnemingsvrijstelling als bedoeld in artikel 13 Wet Vpb 1969."

That reranker rescues about 11% of queries where the embedding index puts the right bepaling at rank 7 or below. Without it the citation gate would reject too many questions and the partners would lose patience.

Structured output with citation slots

Generation does not produce free prose. It produces a structured object with citation slots that the renderer expands into the conceptbrief.

{
  "salutation": "Geachte heer Van Dijk,",
  "paragraphs": [
    {
      "text": "De voorgenomen herstructurering raakt de waardering van de deelneming in {{client.holding_name}}.",
      "citations": ["RJ 214.302", "Wta art. 25"]
    },
    {
      "text": "Onder de huidige verslaggevingsregels dient de deelneming op nettovermogenswaarde te worden gewaardeerd.",
      "citations": ["RJ 214.305"]
    }
  ],
  "open_questions": [],
  "non_citable_context": ["VP-archief 2019-Q3, advies #4471"]
}

If a paragraph has zero entries in citations, the renderer drops it. The agent's prompt makes this explicit and the schema enforces it. Schema-shaped generation is the cheapest way to keep a model honest.
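
A minimal sketch of the renderer's drop rule, assuming the draft has already been parsed into a dict shaped like the object above. The render_letter name and the parenthesised citation format are illustrative assumptions. Template placeholders are left untouched here.

```python
def render_letter(draft: dict) -> str:
    """Expand a structured draft into letter text, dropping uncited paragraphs."""
    lines = [draft["salutation"], ""]
    for paragraph in draft["paragraphs"]:
        citations = paragraph.get("citations", [])
        if not citations:
            continue  # no citation, no paragraph
        refs = "; ".join(citations)
        lines.append(f"{paragraph['text']} ({refs})")
        lines.append("")
    return "\n".join(lines).rstrip()


draft = {
    "salutation": "Geachte heer Van Dijk,",
    "paragraphs": [
        {"text": "Cited sentence.", "citations": ["RJ 214.305"]},
        {"text": "Uncited sentence.", "citations": []},
    ],
}
letter = render_letter(draft)
```

The uncited paragraph never reaches the output, so a missing citation shows up as a visible gap in the letter rather than as an unsupported claim.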

Evaluation: a golden set of 312 historical letters

We took 312 conceptbrieven from the past three years that had been reviewed and signed by partners, and stripped them down to (client question, paragraph, citation) triples. That gave us 1,847 ground-truth assertions.

For each, we ran the pipeline and measured three things:

Citation recall: does the agent surface at least one bepaling the partner actually used?
Hallucinated citation rate: does the agent ever emit an identifier that does not exist in the corpus? Target: zero.
Decline rate: how often does the gate refuse to draft? This is not a failure metric. It is a calibration knob.
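
A sketch of how the three metrics could be computed over the golden set. The EvalCase type is hypothetical, and the denominators are assumptions: recall is taken over all cases (so a declined case counts as a miss), and the hallucination rate is taken over all emitted identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    expected: set[str]                                # bepalingen the partner cited
    emitted: list[str] = field(default_factory=list)  # identifiers the agent emitted
    declined: bool = False                            # True when the gate refused


def evaluate(cases: list[EvalCase], corpus_ids: set[str]) -> dict[str, float]:
    """Compute citation recall, hallucinated citation rate, and decline rate."""
    n = len(cases)
    # A case is a hit when at least one emitted identifier matches the partner's.
    hits = sum(1 for c in cases if c.expected & set(c.emitted))
    emitted = [ident for c in cases for ident in c.emitted]
    unknown = sum(1 for ident in emitted if ident not in corpus_ids)
    declined = sum(1 for c in cases if c.declined)
    return {
        "citation_recall": hits / n if n else 0.0,
        "hallucinated_citation_rate": unknown / len(emitted) if emitted else 0.0,
        "decline_rate": declined / n if n else 0.0,
    }
```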

The first run came in at 71% citation recall, 2.4% hallucinated citation rate, 18% decline rate. The hallucination figure was disqualifying. The fix was not a bigger model. It was a deterministic post-check that resolved every emitted identifier against the corpus index and dropped the paragraph if the lookup missed. After that pass, the hallucinated citation rate sat at 0% across the full evaluation set. Recall climbed to 84% after we added the cross-encoder reranker.
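
The post-check can be sketched in a few lines, assuming the draft is a dict with a "paragraphs" list and the corpus index exposes the set of known identifiers. The post_check name and the "dropped_paragraphs" key are illustrative, not the production code.

```python
def post_check(draft: dict, corpus_ids: set[str]) -> dict:
    """Keep only paragraphs whose every citation resolves against the corpus."""
    kept, dropped = [], []
    for paragraph in draft["paragraphs"]:
        citations = paragraph.get("citations", [])
        if citations and all(c in corpus_ids for c in citations):
            kept.append(paragraph)
        else:
            dropped.append(paragraph)
    # Return a copy so the unfiltered draft stays available for logging.
    return {**draft, "paragraphs": kept, "dropped_paragraphs": dropped}


corpus_ids = {"RJ 214.302", "RJ 214.305"}
raw = {
    "paragraphs": [
        {"text": "Resolvable.", "citations": ["RJ 214.305"]},
        {"text": "Invented.", "citations": ["RJ 216.5"]},
    ]
}
checked = post_check(raw, corpus_ids)
```

The check is a set lookup, not a model call, which is why it can hold the hallucinated identifier rate at zero regardless of what the generator emits.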

The Visionplanner archive: archaeology, not search

The 12-year-old advies-archief was the hardest part of the project. Three quarters of it lives inside Visionplanner's document store. The rest is loose PDFs from a long-gone shared drive. Roughly 14% of the PDFs were scanned, not born-digital, and the OCR was done in 2018 with a tool that confused "ƒ" (the old guilder symbol, still sprinkled through pre-2002 amendments) for "f" and silently dropped diacritics.

We rebuilt the OCR pass with a current model, reattached client metadata by matching dossier IDs from filename patterns, and ran a deduplication pass that collapsed 86,000 documents down to 41,300 distinct advies-units. Then we labelled each unit with the bepaling(en) cited in its body, using a small classifier trained on the partners' own historical taxonomy.

The output of that labelling is what makes the experience corpus useful. When a 2026 client asks the same question someone asked in 2017, the agent can surface the old answer with the bepalingen it used, then check whether those bepalingen have been superseded. About 6% of historical matches now flag a superseded citation. Those are the ones the partners want to see.

Throughput and the partner-in-the-loop pattern

1,240 inbound questions per week is not a model-latency problem. Most of those questions get answered by a junior, and the agent's job is to draft the first version of the conceptbrief and surface the citations. The partner reviews and signs.

We sized the system for two SLAs: a draft in under 35 seconds for 95% of cases, and a same-day partner-review queue. The agent handles around 78% of inbound questions end-to-end to draft stage. The remaining 22% either trip the citation gate or get routed straight to a partner because they touch domains the system explicitly does not own.

That last bit matters. The system keeps a deny-list of question types it will not touch even when it has citations: anything that looks like belastingadvies in domains the firm does not file, anything involving a controleopdracht where independence rules are in play, and anything where the client question mentions a date more than 180 days in the future. Those go straight to a human.
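
A sketch of how such a deny-list could sit in front of the agent, assuming an upstream step has already labelled the question with topics and extracted any dates. The Question type and the topic labels are hypothetical stand-ins for whatever classifier the real system uses.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

MAX_FUTURE_HORIZON = timedelta(days=180)
# Hypothetical labels for the first two deny-list rules described above.
DENIED_TOPICS = {"unfiled_tax_domain", "audit_independence"}


@dataclass
class Question:
    text: str
    topics: set[str] = field(default_factory=set)              # from a classifier
    mentioned_dates: list[date] = field(default_factory=list)  # extracted dates


def route(question: Question, today: date) -> str:
    """Return "partner" for deny-listed questions, otherwise "agent"."""
    if question.topics & DENIED_TOPICS:
        return "partner"
    # Any date more than 180 days ahead sends the question to a human.
    if any(d - today > MAX_FUTURE_HORIZON for d in question.mentioned_dates):
        return "partner"
    return "agent"
```

The router runs before retrieval, so a deny-listed question goes to a partner even when the corpus would have produced valid citations.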

What we got wrong on the first pass

We initially fed the agent every NBA-handreiking back to 2002. It performed worse, not better. Older handreikingen contain language patterns the embedding model picked up and reused in places where the modern guidance was clearer. Restricting the authority corpus to current-edition bepalingen plus explicitly historical context lifted recall by four points.

We also tried letting the agent draft directly into Dutch from the start. It produced fluent text and subtly wrong terminology. Drafting into a structured intermediate, then rendering, gave us a place to validate before any prose hit the screen.

The smallest thing you can do today

If you run a knowledge-heavy advisory practice and you are thinking about RAG, do one audit this afternoon: take ten of your last conceptbrieven, pull the citations, and check whether each one still points at the current edition of the cited bepaling. If even one is stale, you have just measured the hallucination floor of any naive system you build on top.

When we built this for the Groningen firm, the thing we ran into was not retrieval quality. It was the gap between "the model wrote something plausible" and "a partner can sign it." We solved it by treating the citation as the unit of correctness and refusing to generate without one. If you are looking at similar work, our notes on building AI agents for regulated practices are the closest match.

Key takeaway

Treat the citation as the unit of correctness. If the agent cannot show a vetted passage, the right behaviour is to decline, not to draft.

FAQ

Why keep the authority corpus and the internal archive in separate indexes?

So the agent can never cite an internal memo as if it were normative. Two indexes plus a source filter at the gate is the cheapest way to enforce that boundary.

Can the agent draft without a verified citation?

No. If retrieval returns no eligible passage above the relevance cutoff, the gate blocks generation and routes the question to a partner or asks a clarifying question.

What happens when a cited bepaling is superseded?

Each citable unit carries a superseded_by field. The agent is restricted to current-edition passages unless the question is explicitly historical, and surfaces the diff when a historical match resurfaces.

How do you guarantee zero hallucinated citation identifiers?

A deterministic post-check resolves every emitted identifier against the corpus index after generation. Any paragraph whose citation does not resolve is dropped before the draft reaches a human.
