Clinical RAG playbook: a citing agent on legacy EPD data

A 31-person Eindhoven prosthetics clinic answers 780 fitting questions a week. Their EPD is 14 years old. Here is how we shipped a RAG agent that cites every passage.

Jacob Molkenboer · Founder · A Brand New Company · 13 Sept 2025 · 11 min

The 11pm phone call

A clinic owner calls. It is late. An orthopedist in Rotterdam has been waiting forty minutes for the fitting history of a patient her surgeon needs to operate on Monday. The file exists. It is on paper. It is in a binder. The binder is in a row of binders that fills one wall of the archive room. The night staff cannot find it.

This was the situation we walked into at a 31-person prosthetics and orthotics clinic in Eindhoven. Roughly 780 calls a week come in from orthopedists asking the same kind of question. Which liner did you fit on patient X in 2019? What was the alignment offset? Did the structural test ever flag anything per ISO 10328? The clinic stores 18,000 scanned fitting reports going back to 2004. The structured records live in a 14-year-old PHP-based EPD that was never built for export. The answers are in there. Finding them is the bottleneck.

What we shipped is a RAG agent that answers a typical call in under nine seconds, hyperlinks every claim to the page of the scan it came from, and refuses to answer when it cannot cite. Here is the playbook.

Constraints first, model second

The hard constraints came from the clinic, not from us:

Nothing leaves the building. Patient data stays on the on-prem server.
Every answer cites the source passage. No citation, no answer.
The orthopedist verifies in one click. If the link opens a 60-page scan and they still have to scroll to page 47, that is one click too many.
The agent reads both the modern EPD records and the scanned paper era.
It runs on the existing server. No new GPU.

That last one decides more than people expect. The existing box is a four-year-old Dell rack server with 32 cores and 256 GB of RAM. No GPU. Adding one means buying it, securing it inside their network, and signing a new DPIA with the patients' representative council. None of that happens fast.

We landed on a CPU-only stack: Qdrant in single-node mode, BGE-M3 multilingual embeddings, a small reranker, and a Dutch-tuned generator that runs comfortably in 24 GB of RAM. End-to-end latency sits around six seconds for a typical query, twelve for a complex multi-document one. Good enough for a phone call.

The OCR pass is the pipeline

This is the part everyone underestimates. Scanned fitting reports from 2004 are not PDFs. They are fax-quality TIFFs with Dutch shorthand, occasional Sharpie annotations, and a stamped header that overlaps the patient name about eleven percent of the time. Run a generic OCR and you get a corpus that looks fine and is wrong in dangerous places.

We ran a three-pass pipeline:

Layout detection per page with a vision model that returns block coordinates and block types (typed text, handwriting, form field, signature). The clinic used four distinct report templates between 2004 and 2017, plus three sub-variants for paediatric fittings. The vision model needed two weeks of labelling to learn them. Off-the-shelf layout models scored under 70% block-type accuracy on the older scans. The tuned one cleared 96%.
OCR per block, with separate models for typed Dutch and for handwriting. Handwriting OCR runs slower and only on detected handwriting blocks, which keeps total pass time manageable.
A confidence check that flags low-certainty fields and routes them to a human reviewer. We told the clinic up front: about four percent of fields would need a human look. They accepted that. The alternative was a confident-sounding agent that occasionally invented a liner brand.

Warning

If your OCR pipeline has no human-in-the-loop step for low-confidence fields, your RAG agent will hallucinate at the OCR layer and you will never see it in the retrieval logs. The lie is upstream of the model.
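The confidence check in the third pass can be sketched roughly as follows. This is a minimal illustration, not the clinic's actual code: the `OcrField` type, the `route` helper and the 0.90 threshold are all assumptions made for the example. The one rule taken from the project is that numerical values read from handwriting always go to a reviewer.

```python
from dataclasses import dataclass

# Hypothetical threshold; the value used in the project is not stated here.
REVIEW_THRESHOLD = 0.90


@dataclass
class OcrField:
    name: str          # e.g. "alignment_offset_mm"
    text: str          # raw OCR output for the field
    confidence: float  # 0.0 to 1.0, as reported by the OCR model
    block_type: str    # "typed", "handwriting", "form_field" or "signature"


def needs_review(field: OcrField, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when a human must check the field before it is indexed."""
    # Numbers read from handwriting are reviewed no matter how confident
    # the OCR model claims to be.
    if field.block_type == "handwriting" and any(c.isdigit() for c in field.text):
        return True
    return field.confidence < threshold


def route(fields: list[OcrField]) -> tuple[list[OcrField], list[OcrField]]:
    """Split fields into (ready_for_index, review_queue), preserving order."""
    ready: list[OcrField] = []
    review: list[OcrField] = []
    for field in fields:
        (review if needs_review(field) else ready).append(field)
    return ready, review
```

Only the `ready` list should reach the chunker. Anything in `review` stays out of the index until a person has confirmed or corrected it.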

Chunking that respects the document

Generic chunking (500 tokens with 50 overlap) destroys clinical context. A fitting report has sections that mean different things: patient history, measurement table, prescribed components, structural test results, follow-up notes. Chop those into uniform tiles and you will return a "structural test result" passage that is missing its date and its patient ID, and the agent will cite it anyway.

We chunked by section, using the layout output from the OCR pass to decide boundaries. Every chunk carries:

the patient pseudonym (the clinic does not allow real names in the index)
the report date
the section type
page coordinates so the citation link can deep-link to the exact region
a hash of the source file for tamper detection
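As a rough sketch, the per-chunk metadata listed above could be carried in a record like this one. The field names, the bounding-box convention and the helper functions are assumptions for illustration; only the `#page=` link style mirrors the citation format shown later in this article.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    pseudonym: str      # e.g. "PSEUD-8842"; never a real name
    report_date: str    # ISO 8601 date of the fitting report
    section_type: str   # e.g. "structural_test" (hypothetical label)
    page: int           # 1-based page number in the scan
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 from the layout pass
    source_sha256: str  # hash of the source file at ingest time
    text: str


def file_sha256(data: bytes) -> str:
    """Hex SHA-256 digest of the raw source file."""
    return hashlib.sha256(data).hexdigest()


def is_untampered(chunk: Chunk, current_bytes: bytes) -> bool:
    """True when the file on disk still matches the hash stored at ingest."""
    return chunk.source_sha256 == file_sha256(current_bytes)


def citation_link(chunk: Chunk, scan_path: str) -> str:
    """Build a link that opens the scan on the cited page.

    Highlighting the exact region from `bbox` depends on the viewer,
    so it is left out of this sketch.
    """
    return f"{scan_path}#page={chunk.page}"
```

Freezing the dataclass keeps a chunk's provenance fields from being changed after ingest, which suits the tamper check.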

For the legacy EPD, we exported the relevant tables straight from MySQL and treated each EPD entry as its own chunk class, joined to the scan corpus by patient ID. The EPD has structured fields. Do not over-engineer it.

Pulling data out of the legacy EPD

The EPD is a PHP application from the late 2000s. There is no public API. There is a clunky CSV export that loses field-level timestamps. Our approach was direct: a read-only MySQL user, a nightly job that streams new rows to a parquet snapshot, and a checksum on the source tables to catch schema drift.

-- read-only role for the agent's ingest job
CREATE USER 'rag_ingest'@'10.0.0.%'
  IDENTIFIED BY '...';

GRANT SELECT ON epd.patient_fitting TO 'rag_ingest'@'10.0.0.%';
GRANT SELECT ON epd.component       TO 'rag_ingest'@'10.0.0.%';
GRANT SELECT ON epd.test_result     TO 'rag_ingest'@'10.0.0.%';
GRANT SELECT ON epd.note            TO 'rag_ingest'@'10.0.0.%';

-- No FLUSH PRIVILEGES needed: MySQL applies GRANT statements immediately.

We do not delete from the EPD. We do not write to it. The Hacker News front page had a piece this week arguing that the only scalable delete in Postgres is DROP TABLE. The EPD runs on MySQL, not Postgres, but the caution applies with even more force in a clinical database where every row is a regulated record under the AVG, the Dutch name for the GDPR. The ingest job is read-only by policy and by grant. Schema-drift checksums run before every snapshot; a mismatch halts the job and pages a human rather than silently re-indexing against a changed shape.
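One way to implement a schema-drift check is to fingerprint the column definitions and compare the result with the value recorded at the last good snapshot. The sketch below is one reading of the idea, not the production job. `schema_fingerprint`, `check_schema` and `SchemaDriftError` are hypothetical names, and the rows are assumed to come from a query against MySQL's `information_schema.columns`.

```python
import hashlib


class SchemaDriftError(RuntimeError):
    """Raised when the live schema no longer matches the recorded one."""


def schema_fingerprint(columns):
    """Hash (table, column, type) rows into a single hex digest.

    `columns` is assumed to be the result of something like:
      SELECT table_name, column_name, column_type
      FROM information_schema.columns
      WHERE table_schema = 'epd'
    Rows are sorted first so the digest does not depend on row order.
    """
    digest = hashlib.sha256()
    for table, column, col_type in sorted(columns):
        digest.update(f"{table}\x1f{column}\x1f{col_type}\n".encode("utf-8"))
    return digest.hexdigest()


def check_schema(columns, expected_fingerprint):
    """Halt the ingest job when the schema has drifted."""
    actual = schema_fingerprint(columns)
    if actual != expected_fingerprint:
        # The caller is expected to stop the snapshot and page a human.
        raise SchemaDriftError(
            f"schema fingerprint changed: expected {expected_fingerprint[:12]}, "
            f"got {actual[:12]}"
        )
```

Raising an exception instead of logging a warning is deliberate: the nightly job cannot continue past the check by accident.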

The authority layer: ISO and NEN as ground truth

Here is the part that makes the agent useful instead of just fast.

A typical orthopedist question is not "what did you do?" It is "what did you do, and was it within spec?" Spec means ISO 10328 for lower-limb prosthesis structural testing, or NEN-EN 12183 for manual wheelchairs, or one of a dozen sibling standards. If the agent answers from patient records alone, it is half the answer.

We ingested the standards corpus as a separate, vetted index. Every clinical passage that touches a testable property (axial load, fatigue cycles, propulsion force) is paired at retrieval time with the matching clause from the relevant standard. The generator is constrained to cite both: one source from the patient record, one source from the standards corpus. If either is missing, the agent says so and stops.

Takeaway

In regulated domains, retrieval is two indexes glued together: the messy patient corpus and the clean standards corpus. The model's job is to refuse when only one of them shows up.

The generation step does very little

We deliberately kept the generator small and dumb. Its job is to produce a one-paragraph summary, a citation block, and a confidence label. That is it. The orthopedist does not want prose. They want the answer, the page link, and a sense of how sure the system is.

A typical response looks like this:

Patient PSEUD-8842, fitting dated 2019-04-11.
Liner: Ottobock 6Y75, size 28.
Alignment offset: 5 mm posterior.
Structural test: passed per ISO 10328 P5 loading.

Sources:
  [1] Fitting report 2019-04-11, page 3
      /scans/2019/04/PSEUD-8842_fit_001.pdf#page=3
  [2] EPD entry 2019-04-11T14:22, component table
      epd://patient_fitting/884201942
  [3] ISO 10328:2016, clause 6.3, loading level P5

Confidence: HIGH (4/4 retrieval matches above threshold)

Note what is missing. No hedging. No "based on the available information." No preamble. The orthopedist clicks source [1], sees page 3 of the scan, and the call is over in 90 seconds.

Evaluation the clinic could trust

The clinic's medical director is a careful person. She did not want a vendor demo. She wanted to see the agent fail.

We built a 400-question evaluation set with three of her staff. Each question had a known correct answer pulled by hand from the archive. The set was weighted: 60% routine fitting queries, 25% questions that required a standards reference, 10% deliberately ambiguous cases where the right answer was "refuse and call a human", and 5% deliberately wrong-premise questions (asking about a patient who was never seen, or about a component that was never fitted). Those last two categories, 15% of the set, existed only to keep the refusal muscle honest.

We ran the agent against the full set weekly during build, and we kept every failure. A few patterns emerged:

OCR errors on handwritten alignment values produced the most dangerous wrong answers. Mitigation: handwriting blocks always route to human review for any numerical value.
Older reports (pre-2010) sometimes used component names the modern catalog does not recognise. Mitigation: a small lookup table maintained by the senior fitter, versioned in git.
The agent occasionally cited a standards clause that had been superseded. Mitigation: a version tag on every standards chunk and a hard filter on the retrieval call.
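The third mitigation, a version tag plus a hard filter, can be illustrated in a few lines. This is a plain-Python sketch that filters candidate chunks after retrieval; in practice the same condition would more likely be passed to the vector store as a payload filter. The `CURRENT_EDITIONS` map and the chunk keys are assumptions for the example.

```python
# Hypothetical version map: the edition of each standard that is in force.
CURRENT_EDITIONS = {"ISO 10328": "2016"}


def filter_current(chunks, editions=CURRENT_EDITIONS):
    """Keep only standards chunks whose edition tag matches the version map.

    Each chunk is assumed to be a dict with "standard" and "edition" keys.
    Chunks for standards missing from the map are dropped, not trusted.
    """
    return [c for c in chunks if editions.get(c["standard"]) == c["edition"]]
```

Dropping unknown standards by default means a newly ingested document cannot be cited until someone adds it to the version map.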

The release bar was: 98% correct-or-refuse on the eval set, with zero confidently-wrong answers in three consecutive runs. It took us six weeks to hit it. The first three of those weeks were OCR work.
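The release bar translates into a small check that can run after every evaluation pass. The sketch below assumes one particular reading of the bar: any refusal counts toward "correct-or-refuse", and "confidently wrong" means an answered, incorrect result labelled HIGH. The `EvalResult` type and the lower confidence labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    refused: bool     # the agent declined to answer
    correct: bool     # answer matched the hand-pulled truth (ignored if refused)
    confidence: str   # "HIGH", or a lower label such as "LOW" (assumed)


def run_passes(results, min_rate=0.98):
    """True when one eval run meets the bar."""
    if not results:
        return False
    ok = sum(1 for r in results if r.refused or r.correct)
    confidently_wrong = sum(
        1 for r in results
        if not r.refused and not r.correct and r.confidence == "HIGH"
    )
    return confidently_wrong == 0 and ok / len(results) >= min_rate


def release_ready(runs, needed=3):
    """True when the most recent `needed` runs all pass."""
    return len(runs) >= needed and all(run_passes(r) for r in runs[-needed:])
```

Under this reading, a run with a few low-confidence misses can still pass, while a single HIGH-confidence miss fails it outright.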

Provenance as the design principle

The most discussed story on Hacker News this week is about a Brazilian "homegrown" LLM that turned out to be a merge of an existing open model. The lesson is not that merges are bad. The lesson is that provenance is the whole game. A model that cannot show where it came from is a model nobody serious will deploy.

The same logic governs clinical RAG. A clinician will not accept an answer they cannot trace. We wrote the citation contract into the retrieval layer, not the prompt: if the retrieval call returns fewer than two sources above the threshold, the agent route returns a hard refusal before the generator ever runs. The model never sees a question it could answer with a hallucination, because it never gets the chance.
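A gate of this kind might look like the sketch below. It is an illustration under stated assumptions: the `Hit` and `Answer` types, the 0.75 threshold and the `required` parameter are hypothetical, and how the real system decides which indexes a given question needs is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SCORE_THRESHOLD = 0.75  # hypothetical value


@dataclass
class Hit:
    index: str    # "patient" or "standards"
    ref: str      # citation target, e.g. a scan link or a clause label
    score: float  # retrieval or reranker score


@dataclass
class Answer:
    refused: bool
    missing: list[str]          # which indexes failed to produce a source
    text: Optional[str] = None  # generator output, only when not refused


def answer_or_refuse(
    hits: list[Hit],
    generate: Callable[[list[Hit]], str],
    required: tuple[str, ...] = ("patient", "standards"),
    threshold: float = SCORE_THRESHOLD,
) -> Answer:
    """Refuse before generation unless every required index has a usable hit."""
    usable = [h for h in hits if h.score >= threshold]
    present = {h.index for h in usable}
    missing = [name for name in required if name not in present]
    if missing:
        # The generator is never called on this path.
        return Answer(refused=True, missing=missing)
    return Answer(refused=False, missing=[], text=generate(usable))
```

Because the refusal is an early return in ordinary code, no prompt wording can talk the system past it. The `missing` list also gives the human who picks up the call a starting point.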

Keeping it running

A RAG agent in a clinic is not a demo. It runs at 02:00 on a Saturday when an emergency department needs to know whether the patient on the trolley was ever fitted with a specific liner. We instrumented three things and ignored everything else.

First, retrieval scores. Every query logs the top-five retrieval scores and the threshold. A weekly review flags drift: if the median top score is sliding, something is rotting in the index, usually because a new report format slipped past the layout detector. Second, refusal rate by category. A spike in refusals on a specific report year usually means the OCR confidence threshold needs retuning for that batch. Third, eval-set replay on every model or index change. The 400-question set runs in nineteen minutes on the existing server, and the team will not promote a change with even one confidently-wrong regression.
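The weekly drift review on retrieval scores can be reduced to a simple rule. The sketch below is one possible definition of "sliding", not the clinic's actual alerting logic: the median must fall in each of the last few weeks and drop by a minimum amount overall. The window of three weeks and the 0.05 drop are assumed values.

```python
from statistics import median


def weekly_medians(weeks):
    """Median top retrieval score per week, oldest first; empty weeks are skipped."""
    return [median(scores) for scores in weeks if scores]


def is_sliding(medians, window=3, min_drop=0.05):
    """True when the median fell at every one of the last `window` steps
    and the total drop over that span is at least `min_drop`."""
    if len(medians) < window + 1:
        return False
    recent = medians[-(window + 1):]
    falling = all(later < earlier for earlier, later in zip(recent, recent[1:]))
    return falling and (recent[0] - recent[-1]) >= min_drop
```

Requiring both a steady fall and a minimum total drop keeps one noisy week from triggering a review.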

Backups are boring and important. The indexes themselves rebuild from the source scans and the EPD snapshot, so they are reproducible rather than backed up. The eval set, the lookup table, the layout detector weights, and the standards version map all live in a git repo with off-site mirrors. If the server burns down, we can rebuild the agent in a day from clean inputs.

What this costs and what it returns

The clinic measured two things after eight weeks in production. Average time to answer an orthopedist's phone query dropped from about seven minutes to under two. The refusal rate, the share of questions the agent declines, settled at about six percent. The clinic prefers that number to the alternative. Refusals route to a human, the same way they did before, only now with the binder already pulled because the agent has already told the human which patient and which year to look up.

What you can do this afternoon

If you are sitting on a corpus of scanned records and a legacy EPD and you want to know whether a regulated RAG agent is feasible, do one small thing today. Pick fifty questions your staff actually answer by hand, write the correct answer next to each one, and store them as your eval set. That eval set is the project. Everything else is plumbing.

When we built this for the Eindhoven clinic, the hard part was not the RAG agent itself. It was the OCR pipeline and the discipline to refuse to answer when the retrieval was thin. We ended up shipping a system that refuses about six percent of queries, and the clinicians prefer that to a confident agent that is sometimes wrong.

Key takeaway

In regulated RAG, retrieval is two indexes glued together: the messy patient corpus and the clean standards corpus. The model's job is to refuse when only one shows up.

FAQ

Why not send the data to a hosted LLM?

Patient records cannot leave the clinic under their compliance posture. Everything runs on-prem: vector store, embeddings, reranker, generator. Slower than cloud, but the alternative was no project at all.

How do you handle OCR errors on handwriting?

Handwriting blocks are detected separately from typed text. Any numerical value extracted from a handwriting block is flagged for human review before it enters the index. About four percent of fields get touched by a reviewer.

Why two indexes instead of one?

Patient records and regulatory standards have different trust profiles and different update cadences. Keeping them separate lets the agent enforce a citation rule: one source from each, or it refuses to answer.

What happens when the agent cannot find an answer?

It refuses, names what was missing (patient record, standards clause, or both), and routes the call to a human. Refusal rate is about six percent. The clinic considers that the correct behaviour.

How long did the project take?

Roughly four months end to end. The first six weeks were the OCR pipeline and the eval set. Retrieval, generation and the citation contract took the rest. The clinic ran a four-week parallel period before cutover.
