RAG for field monteurs: the Roermond citation playbook

It is 11 a.m. in Venlo. A monteur stands at the foot of a 22-tonne overhead hijskraan, fault code E-407 on her tablet, nine minutes before the shift changes.

Jacob Molkenboer · Founder · A Brand New Company · 21 Jun 2026 · 10 min

It's a Tuesday in March, just past 11:00. A monteur, call her Saskia, is standing in a Venlo distribution shed at the foot of a 22-tonne Demag overhead hijskraan. Her tablet shows fault code E-407 on the DC-Pro controller. She has nine minutes before the next pick window opens. The wholesaler in Roermond who employs her has 47,200 EPLAN schematics, a Trimergo T2 service archive that has been growing since 2015, and a licensed set of NEN and IEC clauses. Somewhere in there is the exact paragraph she needs from NEN 3140 about working on a live B-circuit before lockout.

Until last year she would phone the office. Maaike or Bram in the technische dienst would dig for fifteen minutes, sometimes longer. Now she types the question into a Telegram bot, a retrieval-augmented generation (RAG) agent we shipped last spring, and gets an answer in under eight seconds. Every passage on screen has a clickable link back to the source paragraph. The bot has been wrong twice in the last 90 days. Both times it refused to answer, which is the only kind of wrong we can live with.

This is how we built it.

The corpus we inherited

The wholesaler is 24 people, around €11M in turnover, services about 380 industrial sites across Limburg and Noord-Brabant. They handed us three piles.

  • 47,200 EPLAN schematics exported as PDF/A from EPLAN Electric P8, dating back to 2011. Most have OCR'd text layers; about 6% are scanned reprints with no usable text.
  • 12,800 IEC 60204 and NEN 3140 clauses, from licensed NEN Connect XML exports, segmented into article + sub-article + opmerking blocks.
  • ~218,000 service notes from Trimergo T2, the ERP/service system the company has run since 2015. Each note is a free-text monteurverslag plus a structured fault code, machine ID, and customer ID.

And one hard rule from the directie: no answer leaves the system without a citation a monteur can verify on the spot. If the model wants to say "lock out the B-rail and bridge the safety relay," it must point to the paragraph that said so. No paragraph, no answer.

Why retrieval, not a fine-tune

People ask us this every month. Why not fine-tune a model on the corpus and skip the retrieval layer? Three reasons we keep coming back to.

First, the corpus changes. Manufacturer bulletins land weekly. NEN updates clauses on its own cycle. A fine-tune freezes a snapshot; retrieval reads what is there tonight.

Second, citations are not a UI feature here. They are a regulatory feature. NEN 3140 requires that the elektrotechnisch verantwoordelijke can show which clause justified a working procedure. A fine-tuned model can quote a clause from memory. It cannot prove the quote came from the licensed copy you actually have on file.

Third — and this is the one nobody likes — fine-tuned models hallucinate confidently. Retrieval lets us cap the answer at the passages we found. If we found nothing, the agent says so.

Takeaway

For high-consequence field work the question is not "how do we make the model smarter?" It is "how do we make sure it cannot answer without proof?"

The ingestion pipeline

The pipeline runs nightly on a single Hetzner AX52. Total cold ingest of all three corpora took 4 days 11 hours the first time. Incremental nightly runs finish in 23 minutes.

The shape, in pseudo-code:

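# ingest.py: runs nightly, idempotent on content hash
import hashlib

def ingest(source, db, segment, embed_model, build_citation):
    """Index new documents from one source; return the number of blocks written."""
    written = 0
    for doc in source.iter_documents():
        h = hashlib.sha256(doc.bytes).hexdigest()
        if db.exists(content_hash=h):
            continue                       # unchanged since an earlier run
        for block in segment(doc):         # source-specific
            block.content_hash = h         # stored so the check above matches next run
            block.embed = embed_model(block.text)
            block.citation = build_citation(doc, block)
            db.upsert(block)
            written += 1
    return written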

The interesting work is in segment and build_citation, and it is different for every source.

EPLAN schematics

EPLAN PDFs are tabular and dense. Naive page-chunking destroys context. We parse each schematic with a per-template extractor — Demag, Siemens S7-1500, Wago 750, Phoenix Contact, and so on — that recognises the title-block and the function-group labels (=K01+S04-Q12 style). Each function group becomes a block. The citation is the schematic ID plus the function-group address, which monteurs already know how to read out loud.
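To make the function-group idea concrete, here is a minimal sketch of pulling the labels out of a page's text layer. It assumes every label has the three-part =K01+S04-Q12 shape; the regular expression, the FunctionGroup class, and the citation string format are illustrative names, not the per-template extractors described above.

```python
import re
from dataclasses import dataclass

# Assumed label shape: =<function>+<location>-<device>, e.g. =K01+S04-Q12.
LABEL_RE = re.compile(
    r"=(?P<function>[A-Z]+\d+)\+(?P<location>[A-Z]+\d+)-(?P<device>[A-Z]+\d+)"
)

@dataclass(frozen=True)
class FunctionGroup:
    function: str
    location: str
    device: str

    @property
    def address(self) -> str:
        return f"={self.function}+{self.location}-{self.device}"

def find_function_groups(page_text: str) -> list[FunctionGroup]:
    """Return each distinct label once, in first-seen order."""
    seen: dict[str, FunctionGroup] = {}
    for m in LABEL_RE.finditer(page_text):
        fg = FunctionGroup(m["function"], m["location"], m["device"])
        seen.setdefault(fg.address, fg)
    return list(seen.values())

def schematic_citation(schematic_id: str, fg: FunctionGroup) -> str:
    """Citation = schematic ID plus function-group address (format is hypothetical)."""
    return f"{schematic_id} {fg.address}"
```

A real extractor also has to decide which drawing elements belong to which group. That is the template-specific part, and it is not shown here.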

NEN and IEC clauses

We licensed the NEN Connect XML exports. This is non-trivial paperwork; budget two months. Each clause becomes a block with its full hierarchical path as the citation: NEN 3140:2018 §6.3.2 b. We do not summarise clauses. We store them verbatim.
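A sketch of how such a hierarchical citation could be assembled. The Clause fields are hypothetical and do not reflect the NEN Connect XML schema; the only thing taken from the text above is the target format, NEN 3140:2018 §6.3.2 b.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Clause:
    standard: str               # e.g. "NEN 3140"
    edition: str                # e.g. "2018"
    article: str                # e.g. "6.3.2"
    sub_article: Optional[str]  # e.g. "b", or None for a whole article
    text: str                   # stored verbatim, never summarised

def clause_citation(c: Clause) -> str:
    """Build the full hierarchical path used as the citation."""
    path = f"{c.standard}:{c.edition} §{c.article}"
    return f"{path} {c.sub_article}" if c.sub_article else path
```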

Trimergo T2 service notes

This is the messy one. Eleven years of monteurverslagen, written in three regional Dutch dialects, with abbreviations like kk vk (kortsluiting vermogensschakelaar) that nobody outside the firm understands. We built a 412-entry glossary the indexer applies before embedding, and we deliberately keep the original verbatim text as the citation so the receiving monteur sees the words his collega actually wrote in 2017.
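The split between "what gets embedded" and "what gets cited" can be sketched as follows. This is an illustration under assumptions: the one-entry glossary stands in for the real 412 entries, the sample note is made up, and the whole-word matching rule is our guess at a reasonable default.

```python
import re

# Hypothetical excerpt; keys are lowercase. The real glossary has 412 entries.
GLOSSARY = {"kk vk": "kortsluiting vermogensschakelaar"}

def _compile(glossary):
    # Longest abbreviations first, so multi-word entries win over shorter ones.
    keys = sorted(glossary, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in keys)
    # Lookarounds stop a match inside a longer word.
    return re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)

def expand_for_embedding(text, glossary=GLOSSARY):
    """Replace known abbreviations with their expansion before embedding."""
    rx = _compile(glossary)
    return rx.sub(lambda m: glossary[m.group(1).lower()], text)

def index_note(note_text):
    """Embed the expanded text, but cite the words the monteur actually wrote."""
    return {
        "embed_text": expand_for_embedding(note_text),
        "citation_text": note_text,
    }
```

Keeping both strings on the record means a glossary change only requires re-embedding, not re-ingesting the source note.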

Warning

If your corpus has internal jargon, build the glossary before you embed, not after. Re-embedding 218k notes because you discovered "kk vk" three weeks in is a bad day.

The retrieval shape

One vector store is not enough. We run hybrid retrieval — BM25 plus dense embeddings in Qdrant — and we route by intent first.
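How the BM25 and dense result lists are merged is not spelled out here. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as a description of the production system. The constant 60 is the conventional RRF default, and the function and parameter names are ours.

```python
from collections import defaultdict

def rrf_fuse(rankings, rrf_k=60, top_n=50):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (rrf_k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

RRF only needs ranks, not comparable scores, which is why it is a popular way to combine a lexical and a dense retriever.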

A small classifier (DistilBERT, fine-tuned on 1,800 hand-labelled monteur questions) tags each query with one or more of: safety, schematic, history, procedure. Each tag fans out to a different index with its own k and its own rerank threshold. A safety query searches NEN/IEC first and only consults Trimergo if the safety pass returned nothing. A history query ("hebben we deze E-407 op deze kraan eerder gehad?") goes straight to Trimergo filtered by machine ID.
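The two routing rules described above can be sketched like this. The index names ("standards", "trimergo") and the injected search callable are hypothetical, and the schematic and procedure routes are left out because their behaviour is not described here.

```python
def retrieve(tags, query, search, machine_id=None):
    """Route a tagged query to the right index.

    `search(index, query, filters)` is an injected callable returning a list
    of passages, so this sketch runs without a real vector store.
    """
    results = []
    if "safety" in tags:
        hits = search("standards", query, None)
        if not hits:
            # Consult the service notes only if the NEN/IEC pass found nothing.
            hits = search("trimergo", query, None)
        results.extend(hits)
    if "history" in tags:
        # History questions go straight to Trimergo, filtered by machine ID.
        results.extend(search("trimergo", query, {"machine_id": machine_id}))
    return results
```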

The reranker is a Cohere Rerank 3 call over the top-50 candidates. We keep the top 6 above a score threshold of 0.42. If nothing crosses 0.42 the agent refuses. That threshold was the single most impactful number we tuned — half a point lower and the citation precision drops off a cliff.
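The gate itself is a few lines. The sketch below takes already-scored candidates as plain (passage_id, score) pairs, so it makes no assumption about the reranker client. Whether a score exactly at 0.42 counts as passing is our assumption.

```python
def select_passages(scored, threshold=0.42, keep=6):
    """Keep the best `keep` passages at or above `threshold`.

    `scored` is an iterable of (passage_id, rerank_score) pairs for the
    top-50 candidates. Returns (passage_ids, refuse).
    """
    above = [(pid, score) for pid, score in scored if score >= threshold]
    above.sort(key=lambda pair: pair[1], reverse=True)
    kept = [pid for pid, _ in above[:keep]]
    return kept, not kept   # refuse when nothing crossed the threshold
```

Returning the refusal flag from the same function keeps the decision in one place: the answering model is never called with an empty passage list.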

The grounded answer contract

This is the part nobody writes about and it is the part that makes the difference.

We do not let the LLM write a free-form answer with citations sprinkled in. We make it fill a contract. The prompt to the answering model (Claude Sonnet 4.5 by default, with a self-hosted Llama 3.1 70B fallback when we hit rate limits) looks roughly like this:

You are answering a question from a field monteur.

You have been given N passages, each with an ID and a citation.
You may ONLY use facts that appear in these passages.

Return JSON of this shape:
{
  "answer_nl": "...",          // Dutch, <=120 words, imperative voice
  "steps": [                    // ordered procedure if applicable
    { "do": "...", "from": "P3" }
  ],
  "citations": ["P1","P3","P4"],
  "refuse": false,              // true if passages are insufficient
  "refuse_reason": null
}

If any step would normally require a citation and you cannot find one
in the passages, set refuse=true and explain what is missing.

Each step references a passage ID. The frontend renders the steps with the passage text expanding underneath. A monteur can read the answer and tap the source in the same gesture. If refuse is true, the bot tells her exactly which kind of source it could not find and routes the question to whoever is on call at the office.
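A contract is only useful if something checks it before the answer reaches the tablet. Below is a minimal validation sketch. It assumes the model returns strict JSON (without the explanatory // comments shown in the prompt) and that the caller knows which passage IDs were supplied. The function name and the problem messages are ours.

```python
import json

def validate_answer(raw_json, passage_ids):
    """Check a model reply against the contract; return (ok, problems)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value is not an object"]

    problems = []
    if data.get("refuse"):
        # A refusal is acceptable as long as it says what is missing.
        if not data.get("refuse_reason"):
            problems.append("refusal without refuse_reason")
        return not problems, problems

    known = set(passage_ids)
    cited = set(data.get("citations") or [])
    if not cited:
        problems.append("answer without citations")
    for pid in sorted(cited - known):
        problems.append(f"citation {pid} was not among the retrieved passages")
    for i, step in enumerate(data.get("steps") or [], start=1):
        src = step.get("from")
        if src not in known:
            problems.append(f"step {i} cites unknown passage {src!r}")
        elif src not in cited:
            problems.append(f"step {i} cites {src} but it is missing from citations")
    return not problems, problems
```

A reply that fails validation can be handled like a refusal: nothing unverified is shown, and the question goes to the office.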

Evaluation: the 312-question gold set

We sat with the three senior monteurs for two afternoons and built a 312-question gold set covering the long tail. Each question has either a known-correct answer with the citation that justifies it or, where the corpus genuinely does not contain the answer, a known-correct refusal.

We run the gold set on every pipeline change. The metrics we watch:

  • Citation precision. Of the citations the agent emits, what fraction actually contain the claim? Current: 0.97.
  • Refusal recall. When the corpus does not support an answer, does the agent refuse? Current: 0.99. This is the metric we will not trade off.
  • Answer usefulness. A human rates the answer on a 1–5 scale. Current average: 4.3, up from 3.6 at launch.
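The first two metrics reduce to simple ratios once each gold-set run has been judged. A sketch, assuming a hypothetical record layout: should_refuse comes from the gold set, refused from the agent, and citation_ok holds one human verdict per emitted citation.

```python
def citation_precision(runs):
    """Fraction of emitted citations that actually contain the claim."""
    verdicts = [ok for run in runs for ok in run["citation_ok"]]
    return sum(verdicts) / len(verdicts) if verdicts else None

def refusal_recall(runs):
    """Of the questions the corpus cannot answer, the fraction the agent refused."""
    must_refuse = [run for run in runs if run["should_refuse"]]
    if not must_refuse:
        return None
    return sum(run["refused"] for run in must_refuse) / len(must_refuse)
```

Both functions return None rather than 0 when there is nothing to measure, so an empty slice of the gold set does not look like a failure.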

We do not look at BLEU or ROUGE. We do not look at perplexity. A monteur cares whether the answer is right and whether she can verify it. Nothing else.

Why constrained beats unconstrained

In February 2024, a small-claims tribunal in British Columbia held Air Canada responsible for a bereavement-fare policy its chatbot had invented out of thin air. The airline argued the bot was a separate legal entity. The tribunal disagreed. The case keeps getting cited because the failure mode is universal: a confident generated answer, no source the customer could check, and a company on the hook for whatever the model said.

We think about field tooling the same way. A monteur at the foot of a hijskraan is not a critic of language models. She needs the answer to be right or to be honestly absent. Unconstrained generation cannot promise that. A retrieval pipeline with a hard citation contract can.

The lesson is the same in both contexts: where the cost of a confident wrong answer is high, the system has to be allowed to say nothing.

What we would do differently

Two things, in hindsight.

First, we would build the glossary before we touched the embedding model. We re-embedded 218k Trimergo notes in week three after we discovered the reranker was scoring kk vk entries near zero because the embedding model had no idea what the abbreviation meant. Costly in euros, more costly in calendar time.

Second, we would ship the current refusal UX on day one. We launched with a refusal flow that said only "I don't have enough information." Monteurs treated it as a bug and stopped using the bot. The current UX says exactly which kind of source is missing ("I have a manufacturer bulletin but no NEN clause to support step 3") and routes the question to a human with the context attached. Adoption jumped from 41% to 78% of weekly monteur-vragen in eight days.

When we built the monteur-agent for the Roermond wholesaler, the thing we kept running into was that "good enough" and "verifiable" are different problems, and most teams ship the first one. Solving the second is the AI agents work we do. If your field people are still phoning the office for procedure citations, you already have the dataset you need.

Open your service archive tonight. Pull a random fifty notes. Could a monteur in the field verify, from those notes alone, that the procedure inside them is the current one? That is the size of the problem you are solving for.

Key takeaway

Where the cost of a confident wrong answer is high, the system has to be allowed to say nothing — and prove every word it does say.

FAQ

Why not just fine-tune a model on the corpus?

Fine-tunes freeze a snapshot, cannot prove provenance for NEN compliance, and hallucinate confidently. Retrieval caps the answer at what the system actually found and lets it refuse cleanly.

What is the rerank threshold and why does it matter?

It is a Cohere Rerank 3 score of 0.42, applied to the top 50 candidates; we keep the top 6 that clear it. If nothing reaches 0.42 the agent refuses. Half a point lower and citation precision falls off a cliff.

How long did the build take end to end?

About 14 weeks from kickoff to production. The licensing paperwork for NEN Connect XML was the long pole, not the engineering. Plan for it early.

What runs in production?

One Hetzner AX52 for ingestion and BM25, Qdrant for vectors, Claude Sonnet 4.5 as the answerer with a self-hosted Llama 3.1 70B fallback, Cohere Rerank 3, Telegram as the monteur interface.

Do you ever let the agent answer without a citation?

No. The grounded contract forbids it. If the passages do not support a step, the model is required to set refuse=true and name the kind of source that was missing.
