RAG in a regulated dental lab: a citation-first playbook
A 27-person tandtechnisch lab fields 1,320 behandelaar questions a week. Here is how we built a RAG agent that cites every passage before it touches the order queue.

Tuesday 07:40, Enschede. The QA lead at a 27-person tandtechnisch laboratorium (dental laboratory) opens her inbox: 38 messages from behandelaars (the treating practitioners) overnight. Can I cement this zirconia crown with Variolink Esthetic? Is the IFU (instructions for use) for this implant abutment still current? Which cleaning protocol applies to this casting alloy? She has 22,000 procedure documents in the QMS (quality management system) to check answers against. By 09:00 she has triaged six. The other 32 wait. By Friday, 1,320 questions will have arrived.
The lab does crown-and-bridge and implant work, audited annually against NEN-EN-ISO 13485, registered under EU MDR. Their procedure library is mostly PDF, some scanned, some Word, plenty pasted into a 13-year-old custom Lab-Manager built in PHP 7.4 by a developer who has now retired in Twente. Two senior tandtechnici used to absorb most of the question volume on top of their bench work. Both wanted to stop and go back to making crowns full-time.
The brief: build a RAG agent that answers behandelaar questions accurately, cites the exact passage in the exact IFU, and never lets an uncited answer flow into the kroon-en-brug productie-orderqueue. The playbook below is what we ran. It now runs unchanged on every regulated-knowledge project we take on.
The constraint that shapes everything
In MDR and ISO 13485 environments, the value is not the answer. It is the audit trail. An answer that cannot be traced back to a vetted, version-controlled document is worse than no answer, because the auditor does not care that the model got it right. They care that there is a paper trail.
The architecture rule we set on day one: no answer ships without a citation, and every citation points to a document in the QMS with current effective status. If retrieval can't find a match, the agent says "no source found". It does not paraphrase from training data. It does not improvise. The model is allowed to be helpful only when the QMS gives it permission.
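The rule above can be sketched as a guard in front of the model call. This is a minimal illustration, not the lab's code: `Passage`, `Reply`, `answer_or_refuse`, and the injected `generate` callable are hypothetical names. Only the refusal text "no source found" comes from the rule itself.

```python
from dataclasses import dataclass, field

NO_SOURCE = "no source found"  # refusal text quoted from the architecture rule


@dataclass
class Passage:
    document_id: str  # hypothetical: id of the QMS document the passage came from
    text: str


@dataclass
class Reply:
    answer: str
    citations: list[str] = field(default_factory=list)


def answer_or_refuse(question: str, passages: list[Passage], generate) -> Reply:
    """Call the model only when retrieval returned at least one passage."""
    if not passages:
        # No retrieved source: refuse instead of falling back on training data.
        return Reply(answer=NO_SOURCE)
    text = generate(question, passages)  # `generate` stands in for the LLM call
    return Reply(answer=text, citations=[p.document_id for p in passages])
```

The design point is that the refusal path never reaches the model, so there is nothing for it to improvise from.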
The corpus, sorted
22,000 documents is small in absolute terms. It fits in any vector database. The mess, on the other hand, is real. We inventoried it in two passes.
The first pass was structural: PDF-with-OCR-needed, PDF-text, DOCX, scanned image, BLOB-inside-Lab-Manager, hyperlink-only. Each gets a different extraction path. Tesseract for the scans, pdfminer for the text PDFs, a small Go service to walk the Lab-Manager BLOBs out into a staging bucket.
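One way to keep the structural pass honest is a lookup from structural class to extraction path. The sketch below is an assumption-laden illustration: the enum, the table, and the string labels are hypothetical. It assumes "the scans" covers both scanned images and PDFs that need OCR, and it leaves DOCX and hyperlink-only documents unmapped because the text does not name a tool for them.

```python
from enum import Enum


class SourceKind(Enum):
    PDF_NEEDS_OCR = "pdf_needs_ocr"
    PDF_TEXT = "pdf_text"
    DOCX = "docx"
    SCANNED_IMAGE = "scanned_image"
    LAB_MANAGER_BLOB = "lab_manager_blob"
    HYPERLINK_ONLY = "hyperlink_only"


# Labels are illustrative; None marks classes whose tool the text does not name.
EXTRACTION_PATH = {
    SourceKind.PDF_NEEDS_OCR: "tesseract",
    SourceKind.SCANNED_IMAGE: "tesseract",
    SourceKind.PDF_TEXT: "pdfminer",
    SourceKind.LAB_MANAGER_BLOB: "go_blob_export",
    SourceKind.DOCX: None,
    SourceKind.HYPERLINK_ONLY: None,
}


def extraction_path(kind: SourceKind) -> str:
    """Return the extraction path for a structural class, or fail loudly."""
    path = EXTRACTION_PATH[kind]
    if path is None:
        raise NotImplementedError(f"no extraction path configured for {kind.value}")
    return path
```

Failing loudly on an unmapped class means a document cannot slip into the index through a default path nobody chose.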
The second pass was editorial. Every document was tagged with status (effective, superseded, draft, archived), document type (IFU, risico-analyse, work instruction, CAPA, complaint), and product scope (crown, bridge, implant, ortho, removable). The retrieval index used for answers only contains documents tagged effective. Superseded and draft documents are stored separately and flagged: if a behandelaar asks about a historical procedure, the agent can find the superseded version and explicitly warn that it is no longer in force.
document_id: IFU-CR-2024-017
title: "Cementatie zirkonia kroon - Variolink Esthetic"
status: effective # effective | superseded | draft | archived
effective_from: 2024-11-03
superseded_by: null
doc_type: IFU # IFU | risk_analysis | work_instruction | CAPA | complaint
product_scope: [crown]
material_scope: [zirconia]
qms_owner: K. Veldhuis
review_due: 2026-05-01
source_path: /qms/cr/2024/IFU-CR-2024-017.pdf
sha256: 4a8c8d1f9b2e…
The retrieval filter is then a hard predicate: status == "effective" AND review_due > today(). Anything past review date drops out of the answer pool and triggers a Slack ping to the QA owner the same morning. That single rule caught nineteen overdue IFUs in the first week of staging — none of them were urgent, but none of them should have been answerable either.
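The predicate is small enough to write out. The sketch below assumes a `DocMeta` record carrying the metadata fields shown above; `is_answerable` and `partition` are hypothetical names. The Slack ping is left out, and the overdue list is what would feed it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocMeta:
    document_id: str
    status: str  # effective | superseded | draft | archived
    review_due: date
    qms_owner: str


def is_answerable(doc: DocMeta, today: date) -> bool:
    """Hard predicate: status == "effective" AND review_due > today."""
    return doc.status == "effective" and doc.review_due > today


def partition(docs: list[DocMeta], today: date) -> tuple[list[DocMeta], list[DocMeta]]:
    """Split documents into the answer pool and the overdue list for QA."""
    answerable = [d for d in docs if is_answerable(d, today)]
    # Effective on paper but past review: drops out and needs an owner ping.
    overdue = [d for d in docs if d.status == "effective" and d.review_due <= today]
    return answerable, overdue
```

Note the strict comparison: a document whose review is due today is already out of the answer pool.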
Chunking that survives a tandtechnicus reading it
We tried character-window chunking first. It worked for the blog-scrape projects we'd done before. It failed here. An IFU has a Contra-indicaties section that is three lines long; splitting it across two chunks destroys the meaning. A risico-analyse has a table of hazards; chopping a row in half is worse than not retrieving it.
So we wrote a Dutch-aware section parser. It looks for known headings (Indicaties, Contra-indicaties, Bewaarcondities, Reinigingsprotocol, Risico, Validatie), tables, and numbered lists. Each section becomes one chunk, or several when its prose exceeds the 400-token packing limit, with the document title, heading path, and version number stamped into the chunk header.
def chunk_qms_document(doc: Document) -> list[Chunk]:
    """Split a QMS document into one chunk per table or packed paragraph group."""
    sections = split_on_headings(doc.text, KNOWN_HEADINGS_NL)
    chunks = []
    for sec in sections:
        if sec.is_table():
            # Tables stay whole so that no row is ever cut in half.
            chunks.append(Chunk(
                text=sec.text,
                heading_path=sec.heading_path,
                doc_id=doc.id,
                version=doc.version,
                kind="table",
            ))
            continue
        # Prose is packed greedily up to 400 tokens per chunk,
        # with the heading path prepended to the chunk text.
        for para_group in greedy_pack(sec.paragraphs, max_tokens=400):
            chunks.append(Chunk(
                text=f"{sec.heading_path}\n\n{para_group.text}",
                heading_path=sec.heading_path,
                doc_id=doc.id,
                version=doc.version,
                kind="prose",
            ))
    return chunks
The average chunk landed at around 280 tokens. The heading context inside the chunk text is not redundant. It is what makes the retrieval match the question's intent, because behandelaars rarely write the way the QMS does.
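The chunker above calls a `greedy_pack` helper that the article does not show. A minimal sketch of what such a helper could look like follows. It assumes paragraphs arrive as plain strings and uses whitespace word counts as a stand-in for real tokenizer counts; `ParaGroup` and `count_tokens` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ParaGroup:
    paragraphs: list[str]

    @property
    def text(self) -> str:
        return "\n\n".join(self.paragraphs)


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokenizer tokens.
    return len(text.split())


def greedy_pack(paragraphs: list[str], max_tokens: int = 400) -> list[ParaGroup]:
    """Pack consecutive paragraphs into groups without exceeding max_tokens.

    A single paragraph longer than max_tokens is kept whole in its own group,
    so no paragraph is ever split.
    """
    groups: list[ParaGroup] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and used + n > max_tokens:
            groups.append(ParaGroup(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        groups.append(ParaGroup(current))
    return groups
```

Keeping an oversized paragraph whole trades a few long chunks for the guarantee that a three-line Contra-indicaties section never ends up in two pieces.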
Retrieval, in two stages
Single-vector retrieval missed about 18% of the questions in our eval set, mostly because behandelaar Dutch ("kan ik die kroon ook cementeren met…") doesn't match QMS Dutch ("Cementatieprotocol voor lithiumdisilicaat-restauraties"). We stacked two stages.
Stage one is hybrid: BM25 plus dense vector (we use bge-m3, which handles Dutch well and is small enough to self-host). Take the top 40. Stage two is a cross-encoder rerank, scoped to the product class the behandelaar is asking about. Take the top six.
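The text does not say how the BM25 and dense rankings are merged. The sketch below assumes reciprocal rank fusion, a common choice, and reads "scoped to the product class" as a filter applied before reranking. Both are assumptions. `rrf_fuse`, `rerank`, the dict layout of a candidate, and the injected `score` callable (standing in for the cross-encoder) are all hypothetical.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 40) -> list[str]:
    """Merge ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


def rerank(question: str, candidates: list[dict], product: str, score, top_n: int = 6) -> list[dict]:
    """Keep candidates in the product scope, then order them by rerank score."""
    scoped = [c for c in candidates if product in c["product_scope"]]
    ranked = sorted(scoped, key=lambda c: score(question, c["text"]), reverse=True)
    return ranked[:top_n]
```

Fusing on rank rather than raw score avoids having to normalise BM25 scores against cosine similarities.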
The cross-encoder costs roughly 8 ms per chunk on a small GPU and gave us a 14-percentage-point accuracy gain on the eval set. Skipping it to save €40 a month on inference would have undone two weeks of corpus tagging.
The eval set is the work
We did not have one. We built it by exporting 600 historical questions from the Lab-Manager helpdesk module and asking the two senior tandtechnici to write down, by hand, what the correct answer was and which document it came from. That took two weeks. It is the single most valuable artifact in the project. Every architecture change since has been measured against it. If you skip this step, you are not doing RAG. You are guessing in production.
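Measuring a change against the eval set can be as simple as the loop below. It is a sketch under assumptions: `GoldItem` and `citation_accuracy` are hypothetical names, and `answer_fn` stands in for the agent, returning the document ids it cited. It scores at document level, which is a cruder proxy than having an expert judge whether the cited passage actually answers the question.

```python
from dataclasses import dataclass


@dataclass
class GoldItem:
    question: str
    expected_document_id: str  # the document the senior technician wrote down


def citation_accuracy(gold: list[GoldItem], answer_fn) -> float:
    """Fraction of questions where the expected document is among the citations."""
    if not gold:
        return 0.0
    hits = 0
    for item in gold:
        cited_ids = answer_fn(item.question)  # list of cited document ids
        if item.expected_document_id in cited_ids:
            hits += 1
    return hits / len(gold)
```

Run it before and after every retrieval or chunking change, and keep the per-question misses: they are usually more informative than the headline number.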
Wiring the 13-year-old Lab-Manager
The Lab-Manager runs on PHP 7.4 and a MySQL 5.7 instance that nobody wants to touch. It holds the patient-order map, the material registry, and the MDR-registratie — every device that leaves the lab is logged here. Schema changes were off the table from the kickoff call.
So we did not touch it. We built a thin read-only sidecar in Go that subscribes to the MySQL binlog through Debezium, emits Kafka events for each order, material, and IFU link, and materialises a Postgres read model the agent can query. The agent never writes to the Lab-Manager database. It reads the Postgres mirror, formulates the answer, and posts its recommendation as a comment on the order via the Lab-Manager's existing REST endpoint, which the previous developer, thankfully, built before he retired. The order itself stays human-controlled.
Never refactor a 13-year-old line-of-business app to make room for an AI agent. Wrap it, mirror it, read it. The old system keeps running. The new system never blocks it.
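The mirroring step can be illustrated with Debezium's change-event envelope, in which `op` is `c` (create), `u` (update), `d` (delete), or `r` (snapshot read), and `before` and `after` hold the row images. The sidecar described above is written in Go and targets Postgres. The Python sketch below only shows the apply logic against an in-memory dict, and it assumes every mirrored table has an `id` primary key. `apply_change` is a hypothetical name.

```python
def apply_change(mirror: dict, event: dict) -> None:
    """Apply one Debezium-style change event to an in-memory read model."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates all upsert the new row image.
        row = event["after"]
        mirror[row["id"]] = row
    elif op == "d":
        # Deletes carry the old row image in "before".
        mirror.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
```

Because every branch is an upsert or an idempotent delete, replaying the same event twice leaves the mirror unchanged, which matters when a consumer restarts.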
The gate before the production queue
This is the part of the playbook the auditors care about, and the part most teams underbuild.
Every answer the agent produces is a JSON object with three fields: answer, citations, confidence. Citations are pointers: document_id, version, chunk_id, heading_path. Before the answer can attach to a kroon-en-brug productie-order, a gate function runs.
def gate(answer: AgentAnswer, order: Order) -> Decision:
    """Decide whether an answer may attach to a production order unreviewed."""
    if not answer.citations:
        return Decision.HUMAN_REVIEW  # no source = never auto-approve
    for c in answer.citations:
        doc = qms.get(c.document_id)
        if doc is None:
            # Assumes qms.get returns None for an unknown document id.
            return Decision.HUMAN_REVIEW
        if doc.status != "effective":
            return Decision.HUMAN_REVIEW
        # An empty product_scope skips the scope check.
        if doc.product_scope and order.product not in doc.product_scope:
            return Decision.HUMAN_REVIEW
        if doc.version != c.version:
            return Decision.HUMAN_REVIEW  # cited a stale revision
    if answer.confidence < 0.75:
        return Decision.HUMAN_REVIEW
    return Decision.AUTO_APPROVE
About 71% of answers auto-approve. The rest queue for the QA lead's morning review. She used to triage 1,320 questions a week. She now triages around 380. The 71% are not low-stakes — they include cementation protocols, alloy compatibility, cleaning sequences. They are simply the questions where the source is unambiguous and the gate can prove it.
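The answer object that the gate consumes can be modelled with a few small types. This is a sketch, not the production schema: the field names follow the description above (answer, citations, confidence, and the four pointer fields), while the string type for `version`, the enum values, and `parse_answer` are assumptions.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class Citation:
    document_id: str
    version: str  # assumption: versions are compared as strings
    chunk_id: str
    heading_path: str


@dataclass(frozen=True)
class AgentAnswer:
    answer: str
    citations: tuple[Citation, ...]
    confidence: float


def parse_answer(raw: str) -> AgentAnswer:
    """Parse the agent's JSON output; malformed citations raise TypeError."""
    data = json.loads(raw)
    # Citation(**c) rejects both missing and unexpected pointer fields.
    citations = tuple(Citation(**c) for c in data["citations"])
    return AgentAnswer(
        answer=data["answer"],
        citations=citations,
        confidence=float(data["confidence"]),
    )
```

Parsing strictly before the gate runs means a citation with a missing pointer field fails early instead of reaching the queue half-formed.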
Two things we got wrong on the first pass
First, we let the agent answer in English when behandelaars asked in English. Half the IFUs are Dutch-only, and the agent dutifully translated them. The QA team flagged it within a day: translated regulatory text is not the regulatory text. We forced answers to match the source-document language and added an explicit note when there was a mismatch: "Source document is in Dutch; the verbatim passage is below."
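The language rule can be sketched as a rendering step that never translates the passage. The function and the language table below are hypothetical, and language detection is assumed to happen upstream. Only the wording of the note comes from the fix described above.

```python
LANG_NAMES = {"nl": "Dutch", "en": "English"}  # illustrative subset


def render_passage(passage: str, source_lang: str, question_lang: str) -> str:
    """Return the verbatim passage, prefixed with a note on a language mismatch."""
    if source_lang == question_lang:
        return passage
    name = LANG_NAMES.get(source_lang, source_lang)
    note = f"Source document is in {name}; the verbatim passage is below."
    return f"{note}\n\n{passage}"
```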
Second, we cached chunk embeddings but not chunk text. When an IFU got a new revision, the embeddings index updated but a stale chunk-text path persisted in the answer cache for about ten minutes. The agent cited the right document, with the wrong sentence. We now invalidate the answer cache on any QMS write event, and we hash the chunk text into the citation pointer so a mismatch is caught at the gate.
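Both fixes are short in code. The sketch below uses SHA-256 from Python's standard `hashlib` for the chunk-text hash and a deliberately blunt cache that clears everything on any QMS write. `chunk_hash`, `citation_matches`, and `AnswerCache` are hypothetical names, and the real cache and event shapes are not described in this article.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Hash the chunk text so the citation pointer can carry it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def citation_matches(cited_hash: str, current_chunk_text: str) -> bool:
    """True only if the cited sentence is still the text stored for that chunk."""
    return cited_hash == chunk_hash(current_chunk_text)


class AnswerCache:
    """Answer cache that is flushed on any QMS write event."""

    def __init__(self) -> None:
        self._answers: dict[str, str] = {}

    def put(self, question: str, answer: str) -> None:
        self._answers[question] = answer

    def get(self, question: str):
        return self._answers.get(question)

    def on_qms_write(self, event: dict) -> None:
        # Blunt by design: any write clears every cached answer.
        self._answers.clear()
```

Clearing the whole cache costs some recomputation after each revision, but it removes the window in which the right document can be cited with the wrong sentence.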
The numbers at six months
- 1,320 weekly questions → 380 to human review, 940 auto-answered with citation.
- Median answer latency: 2.1 seconds end-to-end.
- Citation accuracy on the eval set (does the cited passage actually answer the question): 94.6%.
- Two senior tandtechnici back on bench work full-time.
- Zero MDR audit findings related to the agent in the most recent surveillance audit.
The last bullet is the one the lab cares about. The first one is what paid for the build.
The playbook, distilled
If you are building RAG in a regulated environment, the rules that survived eighteen months of this project:
- Citation is not a feature. It is the contract. No citation, no answer.
- Status filters in retrieval beat clever reranking. A hard filter on effective status is worth more than any embedding upgrade.
- Section-aware chunking, not character windows, for any document a human actually reads.
- Build the eval set first, with the domain experts, on real historical questions. It will be slow. Do it anyway.
- Wrap the legacy system. Do not refactor it.
- Gate the queue, not the model. The agent will be wrong sometimes. The queue cannot be.
When we built this agent for the Enschede lab, the hardest part wasn't the retrieval — it was wiring the answer gate to a 13-year-old Lab-Manager without touching its schema, and proving to the QA lead that every auto-approved answer could survive an audit. We do this work as part of our AI agents practice, and the playbook above is what we now run on every regulated-knowledge project.
If you want to start tomorrow: pull a sample of 50 questions your team answered last month and write down, next to each, which document the answer came from. That list is your starting eval set. Everything else is downstream of it.
Key takeaway
In regulated RAG, the citation is the contract: gate the production queue on document status and version, not on model confidence alone.
FAQ
Why not use a general LLM with a long context window instead of RAG?
Because the auditor needs a citation, not a confident paragraph. A long-context answer on its own does not prove which document, version, and passage a claim came from, and regulated environments treat unsourced answers as defects.
Did you fine-tune a model on the lab's documents?
No. Fine-tuning bakes content into weights and breaks the audit trail when documents change. Retrieval over a versioned QMS, with a citation gate, is cheaper and more defensible.
How do you handle a document being superseded mid-question?
The QMS write triggers a cache invalidation and a re-index. The gate also compares the cited version against the current effective version, so any drift queues the answer for human review.
Could a smaller lab use this approach?
Yes. The corpus tagging and eval set carry most of the weight. The infra (Postgres mirror, hybrid retrieval, cross-encoder rerank) runs comfortably on a single mid-sized server for any lab under 100 staff.