Cited RAG for pump datasheets: a Dordrecht case study
A 27-person Dordrecht pump distributor now answers 1,180 weekly technical questions across 38,000 datasheets. Every answer has to cite its source PDF passage before it goes out.

Tuesday, 14:07. An applications engineer in Dordrecht is staring at a Grundfos CRN 32-3 datasheet on one screen and a chat window from a marine contractor in Rotterdam on the other. The contractor needs an NPSH curve at 1450 rpm with a different impeller trim than the one on the page. The engineer has answered this kind of question 41 times this month. The answer lives in three places: page 8 of the PDF, a footnote in a 2019 KSB engineering bulletin, and the 15-year-old onderdelen-catalogus (the parts catalogue) that started life as a Microsoft Access file in 2010 and now lives in MySQL behind a brittle ODBC bridge.
That bridge, and the 38,000 datasheets piling up behind it, is the reason this company called us to build a cited RAG system.
The 1,180-tickets-a-week problem
Numbers first. The distributor, 27 people, mostly outside sales and applications engineers, was fielding about 1,180 technical questions a week across email, WhatsApp Business, and the chat widget on their website. Roughly 40% were repeats. About 25% needed a parts cross-reference that only existed in the legacy catalogus. The rest needed a real engineer reading a real datasheet.
The applications team had two failure modes. They were either too slow (a four-hour response time on a Monday morning meant the contractor had already called a competitor) or too fast and wrong (a quoted impeller diameter that did not match the curve the customer actually needed).
Their first instinct, like everyone's, was to throw GPT at it. They tried. The results were predictable. The model would confidently quote an NPSH figure that did not appear in any datasheet they sold. It would invent a Wilo part number with the right prefix and the wrong suffix. It would tell a customer that a pump was ATEX-certified when it was not.
This is where most companies stop and conclude that AI does not work for them. It does. Just not like that.
38,000 PDFs that nobody had read end-to-end
The first thing we did was count the datasheets. The number, 38,142 at the time we started, was higher than the team's own estimate by an order of magnitude. Their network share had been accumulating since the late 1990s. Grundfos, KSB, Wilo, plus eleven smaller brands. Versions, revisions, deprecated models still in service at three Rotterdam waste-water plants.
We did not throw all 38,000 into a vector store on day one. That is the most common RAG mistake we see. A vector store full of unverified, undated, possibly superseded documents is a confident liar with infinite memory.
Instead we built a small pre-processing pipeline:
from pathlib import Path

def ingest(pdf_path: Path) -> Doc | None:
    """Vet one datasheet: return a Doc for indexing, or quarantine it."""
    meta = extract_meta(pdf_path)  # brand, model, revision, date
    # Unknown or unvetted brands never reach the citable index.
    if not meta.brand or meta.brand not in VETTED_BRANDS:
        return quarantine(pdf_path, reason="unvetted_brand")
    # Old or superseded revisions are set aside as well.
    if meta.revision_date < CUTOFF or meta.superseded_by:
        return quarantine(pdf_path, reason="stale_revision")
    pages = render_pages(pdf_path, dpi=300)
    text = vlm_extract(pages)  # tables matter, pump curves are tables
    chunks = chunk_by_section(text, meta)
    return Doc(meta=meta, chunks=chunks, source_uri=pdf_path)
About 9,400 PDFs ended up in quarantine on the first pass. Most were superseded revisions. Around 600 were genuinely orphaned, brands the distributor had stopped carrying years ago. We did not delete those. They went into a separate index that the agent can read but not cite.
That distinction, read-but-not-cite, turned out to matter more than anything else in the architecture.
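A minimal sketch of how a read-but-not-cite split could be enforced, assuming every chunk carries a flag set at ingest time. The names `Chunk`, `citable`, and `split_context`, and the sample URIs, are illustrative and not taken from the production code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source_uri: str
    page: int
    citable: bool  # assumption: False for chunks from the quarantined index


def split_context(retrieved: list[Chunk]) -> tuple[list[Chunk], list[Chunk]]:
    """Separate what the gate may cite from what the model may only read."""
    citable = [c for c in retrieved if c.citable]
    background = [c for c in retrieved if not c.citable]
    return citable, background


# Made-up chunks, purely to show the split.
retrieved = [
    Chunk("current datasheet text", "file:///example/current.pdf", 8, citable=True),
    Chunk("orphaned brand text", "file:///example/orphaned.pdf", 2, citable=False),
]
citable, background = split_context(retrieved)
```

In a setup like this, only the `citable` list would be handed to the citation check, while both lists could still be shown to the model as reading material.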
Bridging a 15-year-old Access catalogue
The onderdelen-catalogus has a story.
It was built in 2010 by the founder's brother-in-law, who was a Delphi developer in his day job and an Access hobbyist on weekends. It has 47 tables, three of which have names that begin with tbl_OUD_ and which the company has been afraid to drop since 2014. It was migrated to MySQL 5.1 in 2012 via an ODBC bridge that nobody fully understands, including the original author. It has been the source of truth for parts cross-references for 14 years.
You cannot rip this out. We did not try.
Instead we put a thin read-only adapter in front of it that emits structured rows the agent can call as a tool:
@tool
def lookup_part(query: str) -> list[Part]:
    """
    Cross-reference a part number, OEM number, or free-text
    description against the onderdelen-catalogus. Returns up
    to 8 matches with manufacturer, model, supersession chain,
    and stock status. Never invents a row.
    """
    rows = catalogus.query(query, limit=8)
    return [Part.from_row(r) for r in rows]
The system depends on two rules in that adapter.
The agent cannot write to the catalogus. Ever. Not directly, not through a wrapped function, not through the offerte-engine (the quoting system).
The adapter returns rows or it returns nothing. It never returns a best guess. If the query yields zero rows, the agent is told zero rows and has to ask the customer for clarification rather than hallucinate a match.
The MySQL bridge is, frankly, ugly. The adapter is not. That is the contract we wanted.
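The two rules can be sketched in a few lines. This is not the real adapter: the catalogus sits in MySQL behind ODBC, and the sketch below uses an in-memory SQLite table as a stand-in so it runs anywhere. The table name, columns, sample row, and the `next_step` field are all assumptions made for illustration.

```python
import sqlite3

# Stand-in for the catalogus: an in-memory SQLite table with one made-up row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_no TEXT, description TEXT)")
conn.execute("INSERT INTO parts VALUES ('EX-0001', 'example shaft seal')")
conn.commit()
# Rule 1: read-only. This pragma is SQLite-specific; a MySQL deployment would
# more likely rely on a database account that only has SELECT rights.
conn.execute("PRAGMA query_only = ON")


def lookup_part(query: str, limit: int = 8) -> list[dict]:
    """Rule 2: return matching rows or an empty list, never a best guess."""
    cur = conn.execute(
        "SELECT part_no, description FROM parts "
        "WHERE part_no = ? OR description LIKE ? LIMIT ?",
        (query, f"%{query}%", limit),
    )
    return [{"part_no": p, "description": d} for p, d in cur.fetchall()]


def tool_response(query: str) -> dict:
    """Make zero matches an explicit instruction to the agent, not silence."""
    rows = lookup_part(query)
    if not rows:
        return {"rows": [], "next_step": "ask_customer_for_clarification"}
    return {"rows": rows, "next_step": "answer_with_rows"}
```

Calling `tool_response("no-such-part")` yields an empty `rows` list plus the clarification hint, which gives the agent something concrete to act on instead of a gap to fill.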
The citation gate
This is the part of the architecture we are proudest of, and it is the smallest part of the code.
Every answer the agent generates passes through a citation gate before it reaches the offerte-engine, the email composer, or the chat widget. The rule is: every numeric claim, every part number, every certification, every flow-rate or pressure or RPM figure must be tied to a specific passage in a specific document at a specific page. If the gate cannot find the citation, the agent does not answer. It escalates to a human.
The cheapest way to stop a RAG agent from lying is to refuse to let it speak without a footnote. The hard part is enforcing this in code, not in the system prompt.
The gate is about 180 lines of Python. It parses the model's draft answer, extracts every claim that looks like a number or a part identifier, and checks each one against the retrieved context. The context, crucially, carries provenance: page number, document URI, retrieved-at timestamp, and a hash of the source bytes.
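To make the two ingredients concrete, here is a rough sketch of a provenance record and a regex-based claim extractor. The field names, the two patterns, and the list of units are assumptions chosen for illustration; a production extractor would need far broader coverage than this.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    document_uri: str
    page: int
    retrieved_at: datetime
    sha256: str  # hash of the source bytes


def make_provenance(uri: str, page: int, source_bytes: bytes) -> Provenance:
    """Stamp a retrieved chunk with where and when it came from."""
    digest = hashlib.sha256(source_bytes).hexdigest()
    return Provenance(uri, page, datetime.now(timezone.utc), digest)


# Hypothetical patterns: a number followed by a unit, and a part-like identifier
# (two or more capitals followed by digit groups, such as "CRN 32-3").
NUMERIC = re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:rpm|bar|kW|mm|m3/h|m)\b")
PART_ID = re.compile(r"\b[A-Z]{2,}[A-Z0-9]*(?:[- ]\d+)+\b")


def extract_claims(text: str) -> list[tuple[str, str]]:
    """Return (kind, value) pairs for everything that must carry a citation."""
    claims = [("numeric", m.group(0)) for m in NUMERIC.finditer(text)]
    claims += [("part_id", m.group(0)) for m in PART_ID.finditer(text)]
    return claims


claims = extract_claims("The CRN 32-3 runs at 1450 rpm.")
```

Hashing the source bytes means a citation can later be checked against the exact file revision it was drawn from, even if the PDF on the share is replaced.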
The system prompt asks the model to cite. The gate makes sure it actually did.
def gate(draft: Answer, context: list[Chunk]) -> Verdict:
    claims = extract_claims(draft.text)
    for c in claims:
        evidence = find_evidence(c, context)
        if not evidence:
            return Verdict.escalate(reason=f"uncited:{c.kind}:{c.value}")
        if evidence.confidence < 0.82:
            return Verdict.escalate(reason=f"weak:{c.kind}")
        draft.attach_citation(c, evidence)
    return Verdict.allow(draft)
The 0.82 threshold was not pulled out of the air. It was tuned against an eval set we built from 600 real historical tickets that the applications team had already answered. We knew the right answer for each one. We tuned until the false-positive rate (agent gave a wrong number with confidence) was below 0.3%, then accepted whatever escalation rate that produced. It produced about 18% escalations on day one. It is now around 11%.
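The tuning loop described here can be sketched as a simple sweep. The sketch assumes each eval ticket has been reduced to a pair of evidence confidence and whether the answer was correct, and it assumes the false-positive rate is measured over all tickets; both are simplifications, and the function name and toy data are invented.

```python
from typing import Optional


def sweep(
    results: list[tuple[float, bool]], max_fp_rate: float = 0.003
) -> Optional[tuple[float, float, float]]:
    """Find the lowest threshold that keeps wrong-but-allowed answers rare.

    results: (evidence confidence, answer was correct) per eval ticket.
    Returns (threshold, false_positive_rate, escalation_rate), or None if
    no threshold between 0.00 and 1.00 meets the target.
    """
    if not results:
        return None
    n = len(results)
    for step in range(101):
        threshold = step / 100
        allowed = [(conf, ok) for conf, ok in results if conf >= threshold]
        fp_rate = sum(1 for _, ok in allowed if not ok) / n
        if fp_rate < max_fp_rate:
            # Accept whatever escalation rate this threshold produces.
            return threshold, fp_rate, 1 - len(allowed) / n
    return None


# Toy data, invented for the example: six tickets, two of them wrong.
toy = [(0.95, True), (0.90, True), (0.85, True),
       (0.80, False), (0.60, False), (0.99, True)]
best = sweep(toy, max_fp_rate=0.01)
```

The order of operations matters: the error budget is fixed first, and the escalation rate is read off as a consequence instead of being tuned directly.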
Chunking that respects the engineer
Pump datasheets are not prose. They are tables, curves, and footnotes. Naive chunking, split every 500 tokens, breaks them in ways that are subtly catastrophic. A curve loses its axis labels. A table loses its header row. A footnote loses the asterisk it was qualifying.
We chunk by document section using the structure the manufacturers already provided. Grundfos, KSB, and Wilo publish their datasheets with relatively consistent section anchors: Technical data, Performance curves, Dimensions and weights, Materials of construction. You can verify the convention yourself by pulling any current curve from the Grundfos Product Center; the anchors have been stable for years. We extract those sections and treat each as one chunk, with a separate small chunk per individual curve image (rendered as an image plus a VLM-extracted text description that retains the axis units).
Tables get treated as tables, not as flattened strings. The agent sees them as rows-and-columns and can reason about them as such. This sounds obvious. Most RAG pipelines do not do it.
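A simplified sketch of both ideas: splitting at the section anchors named above, and keeping a table as rows keyed by its header. It assumes the extracted text puts each anchor on its own line and that tables arrive pipe-delimited; both are assumptions about the extraction output, and `split_sections` and `table_to_rows` are illustrative names.

```python
import re

# Section anchors named in the text; real datasheets may use variants.
ANCHORS = [
    "Technical data",
    "Performance curves",
    "Dimensions and weights",
    "Materials of construction",
]
ANCHOR_RE = re.compile(
    r"^(" + "|".join(re.escape(a) for a in ANCHORS) + r")[ \t]*$",
    re.MULTILINE,
)


def split_sections(text: str) -> list[dict]:
    """One chunk per known section; text before the first anchor is dropped."""
    matches = list(ANCHOR_RE.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({"section": m.group(1), "body": text[m.end():end].strip()})
    return chunks


def table_to_rows(lines: list[str]) -> list[dict[str, str]]:
    """Keep the header attached to every row instead of flattening the table."""
    header, *rows = [[cell.strip() for cell in ln.split("|")] for ln in lines]
    return [dict(zip(header, row)) for row in rows]


# Placeholder text, not taken from any real datasheet.
sample = (
    "Technical data\n"
    "Parameter | Value\n"
    "example | 1\n"
    "Performance curves\n"
    "Curve description goes here\n"
)
sections = split_sections(sample)
rows = table_to_rows(sections[0]["body"].splitlines())
```

Because every row carries its column names, a chunk that contains only part of a table still tells the model what each value means.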
The eval set that paid for itself
We could not have built the citation gate without an eval set, and we could not have built the eval set without three afternoons of an applications engineer's time.
600 historical tickets. For each one, the engineer marked the right answer, the document that contained it, the page, and one line of why. That set became the test harness. Every change to the retriever, the chunker, the gate, or the prompt is checked against it before it goes near production.
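A test harness along these lines can be quite small. The sketch below assumes exact-match grading on the answer string and a pipeline that returns `None` when it escalates; the record fields mirror what the engineer marked, but the class and metric names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_answer: str
    document: str
    page: int
    why: str  # the engineer's one-line justification


@dataclass(frozen=True)
class Outcome:
    answer: Optional[str]  # None means the pipeline escalated
    document: Optional[str] = None
    page: Optional[int] = None


def run_eval(cases: list[EvalCase], pipeline: Callable[[str], Outcome]) -> dict:
    """Score a pipeline against tickets with known answers and sources."""
    answered = wrong = right_source = 0
    for case in cases:
        out = pipeline(case.question)
        if out.answer is None:
            continue  # escalations are counted by their absence
        answered += 1
        if out.answer != case.expected_answer:  # assumption: exact match
            wrong += 1
        if (out.document, out.page) == (case.document, case.page):
            right_source += 1
    n = len(cases)
    return {
        "escalation_rate": 1 - answered / n if n else 0.0,
        "wrong_answer_rate": wrong / n if n else 0.0,
        "citation_match_rate": right_source / answered if answered else 0.0,
    }
```

Running this before and after every change to the retriever, chunker, gate, or prompt turns "did that help?" into a comparison of three numbers.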
This is the unglamorous part of shipping a RAG system. There is no model upgrade and no clever prompt that substitutes for it. Hamel Husain has been arguing this point for over a year in his widely read piece on why your AI product needs evals, and we agree with him: the moat is the eval set, not the model. Three afternoons of engineer time is the cheapest investment you can make in a RAG system you intend to keep.
Results, with the boring caveats
After four months in production, the 1,180 weekly questions are now handled in roughly this split: 71% answered fully by the agent with citations, 18% escalated to an engineer with the agent's draft attached, 11% routed directly to a human because the gate refused to produce an answer at all.
Median response time on the 71% bucket is 38 seconds. Median response time on the 18% bucket, where the engineer reviews a drafted answer, is 9 minutes. The remaining 11% still take whatever they took before: an engineer with a PDF.
The applications team did not shrink. Two of them now spend a third of their week on the eval set and on reviewing the agent's edge cases. The other engineers have time to do site visits again, which is what the founder wanted in the first place.
The offerte-engine still does not let the agent write to it. The agent prepares a draft quote with parts and prices. A human clicks approve. That has not changed and we do not plan to change it.
If your RAG agent can directly write to a system of record, you do not have a RAG agent. You have a confident intern with database credentials. Keep the write path human-gated until your eval set says otherwise.
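One way to picture a human-gated write path is a commit step that refuses anything without a named approver. The sketch below is an assumption-heavy illustration: `DraftQuote`, `QuoteStore`, and `approve` are invented names, and the real safeguard is that the agent holds no credentials or tool for the commit call at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftQuote:
    customer: str
    lines: list[tuple[str, int]]  # (part number, quantity)
    approved_by: Optional[str] = None


class QuoteStore:
    """Stand-in for the quoting system's write path (hypothetical)."""

    def __init__(self) -> None:
        self.committed: list[DraftQuote] = []

    def commit(self, quote: DraftQuote) -> None:
        # Refuse any draft that no human has signed off on.
        if not quote.approved_by:
            raise PermissionError("quote needs human approval before commit")
        self.committed.append(quote)


def approve(quote: DraftQuote, reviewer: str) -> DraftQuote:
    """Meant to be called from the review UI, never from the agent's tools."""
    quote.approved_by = reviewer
    return quote


store = QuoteStore()
draft = DraftQuote(customer="example customer", lines=[("EX-0001", 2)])
```

With this shape, the agent's output stops at `draft`; only `store.commit(approve(draft, reviewer))`, triggered by a person, changes the system of record.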
What we would do differently
Two things.
We spent too long on the chunker in the first month. The first version was good enough. The third version was marginally better and cost us three weeks. The eval set would have told us this if we had built it first. We did not. We build the eval set before the retriever now.
We underestimated how much the Access-to-MySQL bridge mattered to morale. The team's previous experience with software vendors was that everyone wanted to replace it. We did not replace it. We wrapped it. The first conversation in which we said "we are not touching the catalogus" was the moment the project became real for the founder.
The smallest thing you can do today
Pick 50 tickets from the last month that your team already answered. Mark the right answer, the source document, and the page. That is your eval set. Everything else, the vector store, the chunker, the model, is downstream of it.
When we built the RAG agent for this Dordrecht distributor, the part that took longest was not the retrieval pipeline. It was teaching the system to refuse to answer when it should not. We ship this end-to-end under AI agents; the citation gate has become a standard module in every knowledge-base build we deliver.
Key takeaway
A RAG agent is only as honest as its citation gate. Build the eval set first, refuse to answer without provenance, and keep the write path human-gated.
FAQ
How long did the build take end to end?
Four months from scoping to production. Roughly two months on ingest, chunking and the citation gate, and two on the eval set, the catalogus adapter, and live tuning against real tickets.
Did you replace the Access-to-MySQL catalogue?
No. We wrapped it in a read-only adapter that the agent calls as a tool. The legacy schema and the existing internal applications continue to use it unchanged.
What stops the agent from inventing a part number?
The citation gate. Every numeric or identifier claim must tie to a retrieved passage with provenance. Uncited or weakly cited claims trigger an escalation instead of an answer.
Which model powers the agent?
We do not pin this. The retriever and the gate are model-agnostic. We swap the generation model based on cost and latency, and the eval set tells us within an hour whether the swap is safe.
Does the agent write to the quoting system?
No. It prepares a draft offerte with parts and prices. A human reviews and approves. The write path stays human-gated by design, not because the model is not capable of more.