
Citation-first RAG: a playbook for legacy ERP retailers

A trade counter in Antwerp. A contractor on the phone. A 2011 boiler manual nobody has time to read. Two years ago, six minutes. Today, four seconds and a citation to page 23.

Jacob Molkenboer · Founder · A Brand New Company · 13 Oct 2025 · 9 min


It is 10:47 on a Tuesday in Antwerp. A verkoper (salesperson) on the trade counter has a contractor on the line asking whether the Atag Q38S boiler will accept a 100/150 concentric flue or whether it needs the older 80/125. The manual is a 47-page PDF from 2011, in Dutch, scanned at 200 dpi, sitting on a supplier extranet behind a login that times out every twenty minutes. Two years ago, that question cost six minutes and a colleague. Today, the verkoper types it into a chat box and gets the answer in four seconds, with a link to page 23 of the PDF.

The retailer has 33 people, 38,000 OEM handleidingen (manuals) across roughly 180 leveranciers (suppliers), a Navision ERP that was last upgraded in 2010, and a sales team that fielded 1,420 questions like this one last week alone. Here is how we shipped the citation-first RAG agent that answers those questions, and the rule that kept us from breaking the rest of the business while doing it.

The rule before the architecture

Every passage that touches a verkoper's screen must point back to a leverancier-PDF the operations lead has personally signed off on. Every. Single. One. If the system cannot cite, it does not answer. It says "I don't have a vetted source for this" and routes the question to the human queue.
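The rule can be sketched as a gate that sits between the model's draft and the screen. This is a minimal illustration, not the production code: `Chunk`, `Reply`, and `gate_answer` are hypothetical names, and it assumes each retrieved chunk carries the `vetted_at` stamp of its source PDF.

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "I don't have a vetted source for this"

@dataclass
class Chunk:
    text: str
    doc_id: str
    page: int
    vetted_at: Optional[str]  # None means the PDF was never signed off

@dataclass
class Reply:
    text: str
    citations: list       # (doc_id, page) pairs shown next to the answer
    route_to_human: bool  # True sends the question to the human queue

def gate_answer(draft: str, supporting: list) -> Reply:
    """Release a draft only if at least one vetted chunk backs it."""
    vetted = [c for c in supporting if c.vetted_at]
    if not vetted:
        # No vetted source: refuse and hand over to a person.
        return Reply(REFUSAL, [], route_to_human=True)
    citations = [(c.doc_id, c.page) for c in vetted]
    return Reply(draft, citations, route_to_human=False)
```

The point of the shape is that the refusal is the default branch: a draft with nothing vetted behind it never reaches the verkoper, however confident it reads.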

This is not a guard rail bolted on at the end. It is the load-bearing wall. The temptation when you build RAG on top of an ERP and a 38,000-file corpus is to let the model "blend" knowledge: a bit of catalog data, a bit of manual, a sentence of reasoning, a confident answer. That is how you ship a system that, six weeks in, tells a contractor a 100/150 flue is supported on a boiler that explicitly disallows it on page 24. The contractor installs it. Something leaks. You get a call.

Warning

An agent that cannot point at the exact paragraph it is paraphrasing should never be allowed to write that paraphrase into a product page, a pick instruction, or a quote PDF. Read-only first. Always.

The corpus before the model

We spent the first three weeks on the PDFs, not the LLM. Of the 38,000 files, around 11,000 were scanned (no embedded text), about 2,800 were password-protected with a per-supplier password that nobody had documented, and roughly 600 were corrupt or 0-byte placeholders from a long-dead migration script.

We did three things, in this order.

First, we hashed every PDF and deduplicated. The corpus dropped to 24,100 unique files. The other 13,900 were renamed copies: the same manual sitting under three different leverancier folders because two suppliers had been bought by a third.
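A minimal sketch of that deduplication step, assuming the PDFs sit under one root folder on disk. The function names are illustrative, and the choice of SHA-256 over raw file bytes is an assumption; it only catches byte-identical copies.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, buf_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB reads so large scans do not load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while data := f.read(buf_size):
            h.update(data)
    return h.hexdigest()

def group_duplicates(root: Path) -> dict:
    """Map each content hash to every path that carries those exact bytes."""
    groups = defaultdict(list)
    for pdf in sorted(root.rglob("*.pdf")):
        if pdf.stat().st_size == 0:
            continue  # 0-byte placeholders are quarantined, not deduplicated
        groups[sha256_of(pdf)].append(pdf)
    return dict(groups)
```

Each key is one unique file; any value with more than one path is a set of renamed copies, and one of them can be kept as the canonical location.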

Second, we ran the scans through a layout-aware OCR pass that preserves tables, not a generic Tesseract job. Most of the verkoper questions are about a number in a table: bore diameter, kW rating, NEN classification, compatible part SKU. A flat text dump loses the row-and-column relationship, and the agent then hallucinates which kW belongs to which model.

Third, we built a vetting workflow for the operations lead. Every supplier folder got a one-page review: is this the current revision, are these PDFs the authoritative source, are there any superseded manuals to quarantine. She vetted 180 leveranciers in eleven afternoons. We tagged each PDF with vetted_at, vetted_by, and revision. Nothing without those three fields is allowed into the retrieval index.
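The three-field rule is easy to enforce mechanically at index time. A minimal sketch, assuming each PDF's metadata is available as a plain dict; `is_indexable` and `split_for_index` are hypothetical helper names.

```python
REQUIRED_FIELDS = ("vetted_at", "vetted_by", "revision")

def is_indexable(meta: dict) -> bool:
    """True only when all three vetting fields are present and non-empty."""
    return all(meta.get(field) for field in REQUIRED_FIELDS)

def split_for_index(docs: list) -> tuple:
    """Separate documents into (index these, quarantine these)."""
    accepted = [d for d in docs if is_indexable(d)]
    quarantined = [d for d in docs if not is_indexable(d)]
    return accepted, quarantined
```

Treating an empty string the same as a missing field is deliberate in this sketch: a blank `vetted_by` is not a sign-off.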

Chunking that respects the document

The default RAG advice (split by 500 tokens, overlap 50) is wrong for technical manuals. A table that spans 14 rows gets cut in half. A warning box that says "do not use with…" gets separated from the model number it applies to. The model then retrieves the warning without the context that triggers it.

We chunked by layout block, then by semantic section, and only fell back to token count for the long prose. Each chunk carries the section heading, the page number, the table caption if it is inside a table, and the PDF revision.

def chunk_block(block, doc):
    return {
        "text": block.text,
        "doc_id": doc.id,
        "page": block.page,
        "section": block.nearest_heading(),
        "table_caption": block.table_caption,  # None if not a table
        "bbox": block.bbox,                    # for the citation viewer
        "revision": doc.revision,
        "vetted_at": doc.vetted_at,
        "supplier": doc.supplier,
    }

The bbox matters more than it looks. When the verkoper clicks the citation, we open the PDF at that page with the exact bounding box highlighted. They see the paragraph, not a vague "page 23." That single feature did more for trust than any other choice we made.
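As an illustration of what the citation viewer needs from a chunk, here is a minimal sketch. It assumes `bbox` is an `(x0, y0, x1, y1)` tuple in page coordinates and that the PDFs are served from a hypothetical `base_url`; the `#page=N` fragment is honoured by common PDF viewers, while the highlight rectangle has to be drawn by whatever viewer you embed.

```python
def citation_target(chunk: dict, base_url: str) -> dict:
    """Turn a chunk's metadata into what a viewer needs to open and highlight."""
    x0, y0, x1, y1 = chunk["bbox"]  # assumed corner format, see lead-in
    return {
        "url": f"{base_url}/{chunk['doc_id']}.pdf#page={chunk['page']}",
        "page": chunk["page"],
        "highlight": {"x": x0, "y": y0, "width": x1 - x0, "height": y1 - y0},
        "label": f"{chunk['section']}, p. {chunk['page']} (rev. {chunk['revision']})",
    }
```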

Retrieval that is boring on purpose

Hybrid search. BM25 plus a vector index plus a re-ranker. Nothing exotic.
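The article does not say how the BM25 and vector result lists are merged before re-ranking. One common, equally boring option is reciprocal rank fusion, sketched here as an assumption rather than a description of the shipped system. The constant `k=60` is the conventional default, and the inputs are plain lists of chunk ids, best first.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of chunk ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

The fused list is what would then go to the re-ranker. Because only ranks are used, the BM25 and cosine scores never need to be put on a common scale.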

The interesting part is the filter chain. Before the model ever sees a chunk, we filter by:

Supplier scope. The verkoper's question mentions an Atag boiler; we don't want chunks from Vaillant manuals in the top-k.
Revision currency. If a supplier has a 2023 revision of the same manual, the 2011 revision drops out unless the verkoper explicitly asks for legacy stock.
Vetting status. No vetted_at, no retrieval, no exceptions.
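The three filters above can be sketched as one function over chunk metadata. This is a simplified illustration: `manual_id` (a key that ties revisions of the same manual together) is a hypothetical field not shown in the chunk schema earlier, and `revision` is assumed to be comparable, for example a year.

```python
def filter_chunks(chunks: list, supplier: str, include_legacy: bool = False) -> list:
    """Apply supplier scope, vetting status, and revision currency, in that order."""
    scoped = [
        c for c in chunks
        if c["supplier"] == supplier and c.get("vetted_at")
    ]
    if include_legacy:
        return scoped  # the verkoper explicitly asked for legacy stock

    # Find the newest vetted revision of each manual.
    latest = {}
    for c in scoped:
        key = c["manual_id"]
        latest[key] = max(latest.get(key, c["revision"]), c["revision"])
    return [c for c in scoped if c["revision"] == latest[c["manual_id"]]]
```

Checking vetting before currency matters: an unvetted newer revision must not push a vetted older one out of the index.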

Supplier scope sounds trivial. It is not. The Navision SKU often points to a "house brand" that is actually rebadged from a real OEM. A SKU starting with KV- is a Karver, but Karver is Atag in a blue box. We built the SKU-to-real-OEM mapping by parsing Navision's purchase order history, not its product master, because the product master had been hand-edited too many times to trust.
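How the purchase order history was parsed is not spelled out here. One plausible approach, sketched under the assumption that each purchase order line records the SKU and the vendor that actually supplied it, is a simple majority vote per SKU. `build_oem_map` and the tuple layout are hypothetical.

```python
from collections import Counter, defaultdict

def build_oem_map(po_lines) -> dict:
    """Map each SKU to the vendor it was most often purchased from.

    po_lines: iterable of (sku, vendor) pairs taken from purchase order history.
    """
    votes = defaultdict(Counter)
    for sku, vendor in po_lines:
        votes[sku][vendor] += 1
    # most_common(1) returns the single highest-count (vendor, count) pair.
    return {sku: counts.most_common(1)[0][0] for sku, counts in votes.items()}
```

A vote is more forgiving than trusting the latest line: one mis-keyed purchase order does not flip the mapping.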

The Navision bridge that does not write

Navision 2009 is not friendly. There is no clean API. The integration path the agent uses is a read-replica of three tables exported nightly to Postgres: Item, Item Variant, and the last 36 months of Sales Line. That is enough to answer questions like "did we sell this part to this contractor last year, and at what price."
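To make the shape of such a lookup concrete, here is a sketch that uses an in-memory SQLite table as a stand-in for the Postgres replica so it runs anywhere. The table and column names (`sales_line`, `customer_no`, `item_no`, `posting_date`, `unit_price`) and the rows are invented for illustration; the real export will differ, and a Postgres driver such as psycopg uses `%s` placeholders rather than `?`.

```python
import sqlite3

# Hypothetical schema; a SELECT is the only statement the agent ever issues.
LAST_SALES_SQL = """
SELECT posting_date, unit_price
FROM sales_line
WHERE customer_no = ? AND item_no = ? AND posting_date >= ?
ORDER BY posting_date DESC
"""

def last_sales(conn, customer_no: str, item_no: str, since: str) -> list:
    """Return (posting_date, unit_price) rows, newest first."""
    return conn.execute(LAST_SALES_SQL, (customer_no, item_no, since)).fetchall()

def demo_connection():
    """In-memory stand-in for the nightly replica, filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_line "
        "(customer_no TEXT, item_no TEXT, posting_date TEXT, unit_price REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_line VALUES (?, ?, ?, ?)",
        [
            ("C001", "KV-1001", "2024-11-02", 41.50),
            ("C001", "KV-1001", "2023-01-15", 38.00),
            ("C002", "KV-1001", "2024-12-01", 43.00),
        ],
    )
    return conn
```

Dates are stored as ISO strings in this stand-in, so the `>=` comparison works lexicographically; with a real `date` column the same query applies unchanged.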

The agent reads. It does not write. Not one column. The temptation, when you have a working RAG layer that knows the catalog deeply, is to let it propose product page updates or fill in missing OEM specs on the productpagina. We refused, for two reasons.

One: the productpagina is downstream of Navision in this shop. Anything we wrote there would be overwritten by the next nightly sync. Two: the moment an agent can write to the catalog, every retrieval bug becomes a data-integrity bug. We kept the read path and the write path on different sides of a fence, and the fence is enforced at the network layer. The retrieval service has no write credentials for Navision, full stop.

The WMS that the agent never touches

Same rule, harder application. The warehouse runs a pick-instructie queue driven by Navision sales orders. A salesperson asked, reasonably, if the agent could auto-generate the picker note when a complex order goes in: "remember the gasket kit, the manual says it ships separately."

No.

The picker reads what the system tells him to pick. If the agent generates that text and gets it wrong, the picker has no way to catch it, because the picker is not the domain expert. So we built the inverse: the agent flags the order for the verkoper, who reads the manual citation and writes the picker note in Navision the way they always did. The agent saves the verkoper four minutes per flagged order. It does not skip the human.

The eval set that ops built, not us

The operations lead and two senior verkopers wrote the eval set. Three hundred and twelve questions, each with a ground-truth answer and a ground-truth PDF page. We ran the agent against this set on every meaningful change. The number we cared about was not "answer correctness" in the abstract. It was citation precision: when the agent cited page 23 of manual X, was the answer actually on page 23 of manual X.
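Citation precision as described above can be computed in a few lines. A minimal sketch, assuming the agent's output is a list of `(question_id, doc_id, page)` tuples and that a refusal is recorded with `doc_id` set to `None`; excluding refusals from the denominator is an assumption of this sketch.

```python
def citation_precision(results: list, gold: dict) -> float:
    """Share of cited answers whose (doc, page) matches the ground truth.

    results: (question_id, doc_id, page) per question; doc_id None = refusal.
    gold:    question_id -> (doc_id, page) from the ops-built eval set.
    """
    cited = [(qid, doc, page) for qid, doc, page in results if doc is not None]
    if not cited:
        return 0.0
    hits = sum(1 for qid, doc, page in cited if gold.get(qid) == (doc, page))
    return hits / len(cited)
```

The match is strict: citing the page next to the ground-truth page counts as a miss.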

Starting baseline: 71% citation precision. We pushed it to 94% over six weeks. The remaining 6% are honest ambiguities (the answer is on pages 23 and 24, the agent cited 24, the eval insists on 23), and we stopped chasing them because the verkopers said the system was already faster than they were.

What broke

The thing that surprised us was language. The PDFs are in Dutch, French, German, English, and occasional Italian. Verkopers ask in Dutch but quote the customer in whatever language the customer used. Our first re-ranker was English-tuned and quietly de-prioritised the Dutch chunks when the question contained an English brand name. Citation precision dropped 11 points overnight when we changed brands. We swapped to a multilingual re-ranker and added a query-language detector that boosts chunks in the verkoper's working language. Fixed.
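The language boost can be illustrated with a deliberately naive sketch. The stopword-based detector, the word lists, and the `boost=1.2` multiplier are all placeholders chosen for the example; a production system would use a proper language-identification model and a tuned weight. Scores are assumed to be positive, higher meaning more relevant.

```python
# Tiny, hand-picked stopword lists: enough to illustrate, not to deploy.
STOPWORDS = {
    "nl": {"de", "het", "een", "van", "voor", "met", "op"},
    "fr": {"le", "la", "les", "des", "pour", "avec", "sur"},
    "de": {"der", "das", "ein", "für", "mit", "ist", "auf"},
    "en": {"the", "a", "for", "with", "on", "does", "will"},
}

def detect_language(query: str, default: str = "nl") -> str:
    """Pick the language whose stopwords occur most; fall back on ties."""
    tokens = query.lower().split()
    counts = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(counts.values())
    winners = [lang for lang, n in counts.items() if n == best]
    return winners[0] if best > 0 and len(winners) == 1 else default

def boost_by_language(scored: list, query_lang: str, boost: float = 1.2) -> list:
    """Re-order (chunk_id, lang, score) triples, favouring the query language."""
    rescored = [
        (score * boost if lang == query_lang else score, chunk_id)
        for chunk_id, lang, score in scored
    ]
    return [chunk_id for _, chunk_id in sorted(rescored, key=lambda t: -t[0])]
```

Falling back to a default on ties keeps a question that is mostly a brand name and a part number in the verkoper's working language.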

The second surprise was the supplier extranets. Three large suppliers update their PDFs without changing the filename. Our nightly sync was happy: same filename, no work to do. Except the content had changed, and our hash was no reliable signal either, because it was hashing a header timestamp the supplier embedded. We moved to content-only hashing (strip metadata, then hash) and caught 340 silently-updated manuals in the first week.
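One way to approximate content-only hashing is to hash the extracted page text instead of the raw file bytes, so embedded timestamps and other metadata never enter the digest. This sketch assumes the per-page text has already been extracted upstream (by the OCR or parsing step); `content_hash` is an illustrative name, and hashing text rather than stripping metadata from the file itself is a simplification.

```python
import hashlib
import re

def content_hash(page_texts: list) -> str:
    """Hash the normalised text of each page, ignoring file-level metadata."""
    h = hashlib.sha256()
    for text in page_texts:
        # Collapse whitespace so re-flowed but identical text hashes the same.
        normalised = re.sub(r"\s+", " ", text).strip()
        h.update(normalised.encode("utf-8"))
        h.update(b"\x0c")  # form feed as a page separator
    return h.hexdigest()
```

A text-based digest will not notice a change that only touches an image or a drawing, so it suits text-heavy manuals better than diagram-heavy ones.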

The smallest thing you can do today

If you are sitting on a corpus of supplier PDFs and a sales team that re-asks the same questions every week, do not start with the model. Start with the operations lead and a spreadsheet. List every supplier. Tick the ones whose manuals are the authoritative source. Quarantine the rest. That spreadsheet, vetted by a human who knows the business, is the single most valuable artefact in a citation-first RAG system. The embeddings are commodity. The vetting is not.

When we built this agent for the Antwerp retailer, the thing we kept running into was the gap between "the LLM gave a plausible answer" and "the verkoper can defend that answer to a contractor on the phone." We solved it by refusing every answer that could not point at a vetted page, and by keeping the write paths to Navision and the WMS entirely off-limits. The same discipline runs through the rest of our AI agents work.

Key takeaway

If the agent cannot point at the exact vetted PDF page, it does not answer. The read path and the write path stay on opposite sides of a fence enforced at the network layer.

FAQ

Why insist on a citation for every answer?

Because the verkoper has to defend the answer to a contractor on the phone. Without a vetted page to point at, you cannot tell a confident model from a wrong one in time to matter.

Can the agent update product pages or WMS pick notes automatically?

No. The agent is read-only. It holds no credentials for the write paths to Navision or the WMS, and the separation is enforced at the network layer. Humans still write; the agent saves them research time.

How big does the eval set need to be?

Ours is 312 questions, written by the operations lead and two senior verkopers. The size matters less than who wrote it. Ops-built eval sets catch failures synthetic ones miss.

What about scanned and password-protected PDFs?

Run the scans through a layout-aware OCR pass so tables survive. For passwords, document the per-supplier credentials once and store them in a secret manager. Quarantine anything corrupt.

Do you need a vector database for 38,000 PDFs?

Postgres with pgvector handled this corpus fine. The bottleneck was never vector recall. It was chunking quality, supplier scope filtering, and the multilingual re-ranker.

