RAG for medical device IFUs: the citation-first playbook

An Eindhoven distributor was answering 1,940 product questions a week from 28,000 IFU PDFs. We had to ship a RAG agent that cites a CE certificate before it speaks.

Jacob Molkenboer · Founder · A Brand New Company · 18 Jun 2026 · 10 min

It's Monday, 9:14 in Eindhoven. The sales coordinator at a 29-person medical device distributor opens the shared inbox and finds forty-three new questions about catheter compatibilities, IFU revisions, and contraindications. By Friday there will be 1,940 of them. On her second monitor sits Notum, the custom MDR registration system someone built in PHP in 2012, quietly tracking which of their 4,100 product references are currently authorised, which are grandfathered in, and which had a CE certificate expire on a notified body's portal last week.

The team has 28,000 instructions-for-use PDFs scattered across a SharePoint that nobody loves. They have a regulatory affairs lead who already works Saturdays. They have a sales pipeline that wants every offer to ship with a correct, current, traceable answer because MDR Article 10 obliges them to. And they have a board that asked, reasonably, whether a RAG agent could take the load off.

It could. But not the way most RAG agents get built. This is the playbook we wrote for them.

The wall standard RAG hits on regulated documents

The default RAG stack — embed everything, chunk it, retrieve top-k, hand it to a model — produces an answer that sounds correct. In a medical device context, sounding correct is the failure mode.

The same IFU lives in five revisions over its lifetime. Revision 3.1 might warn against use with patients on anticoagulants. Revision 3.2 might soften that warning to "consult the prescriber." If your embedding index can't tell which of those two passages is current, and a cosine similarity index has no notion of revision or validity, the model will be handed whichever chunk scored highest. That chunk might be from a revision withdrawn fourteen months ago.

Worse: the device the chunk describes might be operating under a CE certificate that the notified body suspended in March. The IFU is real. The text is correct for its time. The answer is still wrong, because the authorisation behind it is gone.

Article 10 doesn't ask you to be plausible. It asks you to be traceable. We had to build a retrieval layer where every passage was tied, before retrieval, to a currently valid certificate and a currently authoritative revision. Nothing about the model layer fixes that. The structure has to be in the index.
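To make the ranking problem concrete, here is a toy sketch. It uses word-count cosine similarity as a crude stand-in for a real embedding model, and the two passages and the query are invented for illustration. The point is only that the score measures textual overlap and carries no information about which revision is in force.

```python
import math
import re
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (a stand-in for an embedding)."""
    va = Counter(re.findall(r"[a-z]+", a.lower()))
    vb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical passages: revision 3.1 is withdrawn, revision 3.2 is current.
rev_3_1 = "Do not use this catheter in patients receiving anticoagulant therapy."
rev_3_2 = (
    "For patients receiving anticoagulant therapy, "
    "consult the prescriber before using this catheter."
)
query = "Can this catheter be used in patients receiving anticoagulant therapy?"

scores = {"3.1": cosine(query, rev_3_1), "3.2": cosine(query, rev_3_2)}
best = max(scores, key=scores.get)  # the withdrawn revision scores highest here
```

In this toy case the withdrawn wording happens to overlap more with the question, so it ranks first. A real embedding model will give different numbers, but it has the same blind spot unless validity lives in the index.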

The document graph comes before the chunks

We started by ignoring the 28,000 PDFs entirely for two weeks and modelling the world they describe.

Every device has a UDI-DI. Each UDI-DI is tied to one or more CE certificates, each with a notified body number, an issue date, an expiry date, and a scope. Each device carries a sequence of IFU revisions, each tied to a date window during which it was the authoritative version. Each post-market surveillance report is anchored to a UDI-DI and a reporting period. Each device sits in a class (IIa, IIb, III) that changes which review steps an answer needs to pass.

That's a graph, not a bag of documents. We modelled it in Postgres before a single PDF was chunked:

create table device (
  udi_di       text primary key,
  product_ref  text not null,
  device_class text not null,               -- IIa, IIb, III
  status       text not null                -- active, grandfathered, withdrawn
);

create table ce_certificate (
  cert_id       text primary key,
  notified_body text not null,               -- e.g. 0123, 0344
  issued_on     date not null,
  valid_until   date not null,
  scope_summary text not null
);

create table device_certificate (
  udi_di  text references device,
  cert_id text references ce_certificate,
  primary key (udi_di, cert_id)
);

create table ifu_revision (
  ifu_id          uuid primary key,
  udi_di          text references device,
  revision        text not null,             -- "3.2"
  effective_from  date not null,
  effective_until date,                      -- null while current
  source_pdf      text not null
);

create table pms_report (
  report_id    uuid primary key,
  udi_di       text references device,
  period_start date not null,
  period_end   date not null,
  source_pdf   text not null
);

The PDFs only get to be chunked and embedded once they're hung off a node in this graph. An orphan PDF, one that doesn't belong to any device the distributor actually lists, never enters the index. We deleted 3,200 of those in week one. Most were marketing material. Four were genuinely worrying: IFUs for devices the distributor had stopped representing in 2019 but had never removed from the shared drive. If a retrieval layer hands those passages to a sales agent in 2026, you have an Article 10 problem that no log-after-the-fact will save you from.
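The orphan rule can be sketched as a small filter that runs before any chunking. This is an illustration under assumptions, not the production code: `PdfRecord`, `split_indexable`, and the choice to treat both "active" and "grandfathered" devices as listed are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PdfRecord:
    path: str
    udi_di: Optional[str]  # None when the PDF could not be matched to a device


# Assumption: these statuses count as "a device the distributor actually lists".
LISTED_STATUSES = ("active", "grandfathered")


def split_indexable(pdfs, device_status):
    """Separate PDFs that hang off a listed device from orphans.

    device_status maps UDI-DI -> status, as in the device table.
    """
    indexable, orphans = [], []
    for pdf in pdfs:
        status = device_status.get(pdf.udi_di) if pdf.udi_di else None
        if status in LISTED_STATUSES:
            indexable.append(pdf)
        else:
            orphans.append(pdf)  # reviewed by a human, never chunked
    return indexable, orphans
```

Returning the orphans instead of silently dropping them matters: the four worrying IFUs were found because someone looked at that list.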

Citing the certificate, not the chunk

Once you have a graph, the retrieval result is no longer a bare passage. It's an envelope:

{
  "passage": "When the device is used in patients receiving anticoagulant therapy, ...",
  "ifu_revision": "3.2",
  "udi_di": "0419901234567890",
  "product_ref": "CAT-IIB-4471",
  "cert_id": "CE-0344-MDR-21887",
  "notified_body": "0344",
  "valid_until": "2027-03-12",
  "device_status": "active"
}

The model never sees a passage without that envelope. The system prompt is short and unambiguous: every claim in your answer must be followed by a citation pointing to the IFU revision and the certificate that authorises the device. If two retrieved passages disagree, prefer the one whose IFU revision is currently effective. If the current revision is silent on the question, say so. Do not synthesise.
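One way to make sure the model never sees a bare passage is to render every envelope into a labelled source block before it reaches the prompt. The sketch below assumes the envelope fields shown above; the `[S1]` label format and the function names are hypothetical.

```python
def render_hit(n: int, hit: dict) -> str:
    """Format one retrieval envelope as a numbered, citable source block."""
    header = (
        f"[S{n}] IFU rev {hit['ifu_revision']} | {hit['product_ref']} | "
        f"UDI-DI {hit['udi_di']} | cert {hit['cert_id']} "
        f"(NB {hit['notified_body']}, valid until {hit['valid_until']})"
    )
    return f"{header}\n{hit['passage']}"


def build_context(hits: list) -> str:
    """Join all envelopes into the context section of the prompt."""
    return "\n\n".join(render_hit(i, h) for i, h in enumerate(hits, start=1))
```

Because the header is built from the envelope and not from the passage text, a chunk with missing provenance raises a `KeyError` here instead of slipping into the prompt.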

We borrowed a trick from Anthropic's contextual retrieval work: at embedding time, prefix each chunk with a short generated context line ("This passage is from IFU revision 3.2 for catheter CAT-IIB-4471, currently authorised under CE-0344-MDR-21887"). That alone moved recall on revision-specific questions from 71% to 92% on our internal evaluation set of 600 hand-labelled Q&A pairs. The evaluation set was built by the regulatory lead and the senior sales coordinator over three afternoons. We will not ship a regulated RAG agent without one.
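A minimal sketch of the prefixing step, assuming the context line is filled in from the graph with a template. The article describes the line as generated, so the template and the function names `contextual_prefix` and `text_to_embed` are assumptions.

```python
def contextual_prefix(ifu_revision: str, device_label: str, cert_id: str) -> str:
    """Build the context line that is prepended to a chunk before embedding."""
    return (
        f"This passage is from IFU revision {ifu_revision} for {device_label}, "
        f"currently authorised under {cert_id}."
    )


def text_to_embed(chunk: str, ifu_revision: str, device_label: str, cert_id: str) -> str:
    """Return the string handed to the embedding model: prefix, blank line, chunk."""
    return contextual_prefix(ifu_revision, device_label, cert_id) + "\n\n" + chunk
```

For example, `contextual_prefix("3.2", "catheter CAT-IIB-4471", "CE-0344-MDR-21887")` reproduces the context line quoted above.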

The freshness gate sits between retrieval and generation

Embedding indexes drift. Notified bodies suspend certificates without warning. PMS reports get filed mid-quarter. The freshness gate is the deterministic check we run after retrieval and before generation:

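from datetime import date

# `Hit` is the retrieval envelope and `notum` is the read-only client for the
# Notum mirror; both are defined elsewhere in the agent.
def freshness_gate(hits: list[Hit], today: date) -> list[Hit]:
    """Keep only hits with a valid certificate, the current revision, and an active device."""
    kept = []
    for h in hits:
        cert = notum.current_cert(h.cert_id)
        if cert is None or cert.valid_until < today:
            continue                            # cert lapsed
        if cert.status != "active":
            continue                            # suspended / withdrawn
        rev = notum.current_revision(h.udi_di)
        if h.ifu_revision != rev:
            continue                            # superseded text
        if notum.device_status(h.udi_di) != "active":
            continue                            # device pulled
        kept.append(h)
    return kept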

If kept is empty, the agent does not call the LLM at all. It returns a structured refusal: "No currently authoritative source supports an answer to this question. Routing to regulatory affairs." That refusal carries a payload (the original question, the candidate hits that were filtered out, and the reason each one was dropped) straight into a Teams channel the compliance lead actually reads.
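The refusal payload needs the gate to record why each hit was dropped, which the compact version of the gate does not do. The sketch below is one way to extend it. The `Registry` class is an in-memory stand-in for the Notum mirror, and the reason codes and the shape of the returned dict are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Hit:
    passage: str
    udi_di: str
    ifu_revision: str
    cert_id: str


@dataclass(frozen=True)
class Cert:
    valid_until: date
    status: str


class Registry:
    """In-memory stand-in for the Notum mirror (hypothetical interface)."""

    def __init__(self, certs, revisions, devices):
        self.certs, self.revisions, self.devices = certs, revisions, devices

    def current_cert(self, cert_id: str) -> Optional[Cert]:
        return self.certs.get(cert_id)

    def current_revision(self, udi_di: str) -> Optional[str]:
        return self.revisions.get(udi_di)

    def device_status(self, udi_di: str) -> Optional[str]:
        return self.devices.get(udi_di)


def drop_reason(h: Hit, reg: Registry, today: date) -> Optional[str]:
    """Return why a hit must be dropped, or None if it is currently authoritative."""
    cert = reg.current_cert(h.cert_id)
    if cert is None or cert.valid_until < today:
        return "cert_lapsed"
    if cert.status != "active":
        return "cert_not_active"
    if h.ifu_revision != reg.current_revision(h.udi_di):
        return "revision_superseded"
    if reg.device_status(h.udi_di) != "active":
        return "device_not_active"
    return None


def gate_or_refuse(question: str, hits: list, reg: Registry, today: date) -> dict:
    """Run the gate; return hits to generate from, or a structured refusal."""
    kept, dropped = [], []
    for h in hits:
        reason = drop_reason(h, reg, today)
        if reason is None:
            kept.append(h)
        else:
            dropped.append(
                {"cert_id": h.cert_id, "ifu_revision": h.ifu_revision, "reason": reason}
            )
    if kept:
        return {"action": "generate", "hits": kept}
    return {
        "action": "refuse",  # the caller must not invoke the LLM on this branch
        "message": "No currently authoritative source supports an answer to "
                   "this question. Routing to regulatory affairs.",
        "question": question,
        "dropped": dropped,
    }
```

The caller branches on `action`. Only the `generate` branch ever builds a prompt, so the model is never asked to answer from filtered-out material.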

One discipline matters here. The freshness gate has to run on the inputs, never on the outputs. If you generate first and filter the answer for stale citations after, the model has already committed to a position. Post-hoc filtering then looks like censorship of an answer that was "nearly right," and the team starts arguing about edge cases. Filter the inputs and the model never gets the chance to be confidently wrong.

Wiring Notum in without rewriting Notum

Notum is fourteen years old, runs on PHP 5.6 with a MySQL 5.5 backend, and the original author left in 2017. Nobody on the team has the appetite to rewrite it, and we agreed that's the correct instinct. It works. Its data model (UDI-DI, certificate, revision, status) has survived three regulatory regimes, including the move from MDD to MDR. The right move was to read from it, not through it.

We added one read-only MySQL view that flattened the three tables the freshness gate needs. A small Python sync job pulls that view every fifteen minutes into a Postgres mirror, with a row-level checksum so we don't reindex what hasn't changed. The agent only ever talks to the mirror. Notum doesn't know we exist, which is exactly the contract you want with a system that nobody wants to redeploy.
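The checksum step of such a sync job can be sketched as follows. It assumes each row of the flattened view arrives as a dict keyed by UDI-DI and that the previous run's checksums are available as a plain mapping; the function names are hypothetical and the database reads and writes are left out.

```python
import hashlib
import json


def row_checksum(row: dict) -> str:
    """Stable SHA-256 digest of a row, independent of key order."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_rows(source_rows, known_checksums, key="udi_di"):
    """Compare the current view against the last sync.

    Returns (rows to upsert and reindex, new checksum map, keys that disappeared).
    """
    changed, latest = [], {}
    for row in source_rows:
        digest = row_checksum(row)
        latest[row[key]] = digest
        if known_checksums.get(row[key]) != digest:
            changed.append(row)
    removed = sorted(set(known_checksums) - set(latest))
    return changed, latest, removed
```

Only the rows in `changed` need re-embedding. The `removed` list deserves the same attention, since a device vanishing from the view is itself a freshness event.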

One reconciliation surprise: 137 SKUs had a "current revision" in Notum that disagreed with the public record on EUDAMED. Mostly stale Notum entries from a 2023 product line refresh that never got back-filled. We exported the diff, the regulatory lead worked through it over four afternoons, and Notum was reconciled before the agent went live. The agent surfaced the problem. The agent did not fix it. That is the correct division of labour.
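The diff itself is a small comparison once both sides are reduced to the same shape. The sketch below assumes each source has been exported as a mapping from UDI-DI to current revision; the EUDAMED export format and the function name are assumptions.

```python
def revision_mismatches(notum: dict, eudamed: dict) -> list:
    """List every device whose current revision differs between the two sources.

    A device present in only one source is reported with None on the other side.
    """
    rows = []
    for udi_di in sorted(notum.keys() | eudamed.keys()):
        ours, public = notum.get(udi_di), eudamed.get(udi_di)
        if ours != public:
            rows.append({"udi_di": udi_di, "notum": ours, "eudamed": public})
    return rows
```

The output is a worklist for a person, not input for an automatic fix, which matches the division of labour described above.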

What twelve weeks of production looked like

The agent has now been live for three months. We sample 5% of its answers each week for audit, and the regulatory lead reviews them on Friday morning.
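A reproducible way to draw the weekly sample is to seed the random generator with the week label, so the same week always yields the same audit set. This is a sketch under assumptions: the answer IDs, the ISO-week label, and the function name are hypothetical.

```python
import random


def weekly_audit_sample(answer_ids, week: str, rate: float = 0.05) -> list:
    """Pick a fixed fraction of answers for audit, deterministically per week."""
    ids = sorted(answer_ids)        # fixed order so the seed fully decides the sample
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))
    rng = random.Random(week)       # e.g. "2026-W25"
    return rng.sample(ids, k)
```

Seeding by week means the sample can be regenerated later if an auditor asks which answers were reviewed.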

Roughly 78% of the 1,940 weekly questions get answered without human escalation. The bulk are compatibility checks, contraindication lookups, and "which IFU applies to this lot number" questions.
The remaining 22% return the structured refusal and route to compliance. About a third of those turn out to be genuinely unanswerable from current documentation, usually because of a CE certificate transition or a missing PMS report. The agent is finding the gaps in the document set, not papering over them.
Zero hallucinated citations across two audit rounds. Every citation the model emits maps to a real cert ID, a real IFU revision, and a real passage. The citation-first envelope is what buys you that.
The sales coordinator's Monday inbox clears by 10:30 instead of 16:00. The board noticed that one before they noticed the rest.
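The citation check in the audit can also be run mechanically on every answer. The sketch below assumes the model is instructed to cite in the form `[IFU <revision> | <cert_id>]`. That format and the function name are hypothetical, so adapt the pattern to whatever your system prompt prescribes.

```python
import re

# Assumed citation format: [IFU 3.2 | CE-0344-MDR-21887]
CITATION = re.compile(r"\[IFU (?P<rev>[^|\]]+) \| (?P<cert>[^\]]+)\]")


def unsupported_citations(answer: str, hits: list) -> list:
    """Return (revision, cert_id) pairs cited in the answer but absent from the hits."""
    allowed = {(h["ifu_revision"], h["cert_id"]) for h in hits}
    cited = [
        (m["rev"].strip(), m["cert"].strip()) for m in CITATION.finditer(answer)
    ]
    return [c for c in cited if c not in allowed]
```

An empty list means every citation maps back to an envelope that passed the freshness gate. Anything else is a candidate for the Friday review.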

Takeaway

For regulated RAG, the index is not a flat corpus. It's a graph hanging off the same authority records your business already uses to ship products. If retrieval can't say "this passage, this revision, under this certificate," the model can't either.

Where we'd push next

Three things sit on the next quarter's roadmap. First, watch the notified bodies' public registers for certificate status changes and trigger a reindex within the hour, instead of waiting for the change to be entered in Notum and picked up by the fifteen-minute sync. Second, plug the agent into the quotation (offerte) engine so an outgoing quote carries an inline citation block showing the regulatory basis for every device line item. Third, and this is the one the regulatory lead asked for, extend the freshness gate to PMS report cadence, so that a device whose latest report is overdue cannot return positive answers without an explicit compliance flag.
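The third item could look roughly like the check below. It is a sketch of a feature that is not built yet: the maximum report ages per device class are placeholder values that the regulatory lead would have to set, and the flag names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# Placeholder cadence per device class; not taken from the article.
MAX_REPORT_AGE = {
    "IIa": timedelta(days=730),
    "IIb": timedelta(days=365),
    "III": timedelta(days=365),
}


def pms_flag(device_class: str, latest_period_end: Optional[date], today: date) -> Optional[str]:
    """Return a compliance flag if the latest PMS report is missing or overdue."""
    if latest_period_end is None:
        return "pms_report_missing"
    if today - latest_period_end > MAX_REPORT_AGE[device_class]:
        return "pms_report_overdue"
    return None
```

A non-empty flag would travel with the hit, so the answer either carries the compliance flag or is withheld, in the same way a lapsed certificate is handled today.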

When we built this IFU agent for the Eindhoven distributor, the thing we ran into was that Notum's idea of "current revision" had drifted from EUDAMED on 137 SKUs, and no amount of clever retrieval was going to mask that. We ended up running a one-time reconciliation with human eyes before the agent answered a single question, the kind of unglamorous step we now build into every RAG agent we ship into regulated workflows.

Open the three most-asked product questions from last week. For each one, can you, by hand, in under five minutes, point to the exact document, the exact revision, and the certificate that currently authorises the device described? If the answer is no, no RAG layer is going to fix it. Fix the graph first.

FAQ

Can a RAG agent be used for MDR-regulated answers at all?

Yes, but only when every retrieved passage carries its authorising CE certificate ID and IFU revision, and a deterministic gate refuses to call the model when those aren't current.

Why not rewrite the 14-year-old MDR registration system?

Because it works. Reading from a flattened view of its tables is faster, cheaper, and lower-risk than touching code that has already survived three regulatory regimes.

How do you stop the model from hallucinating citations?

Cite at the retrieval layer, not the generation layer. The model can only emit citations that came back inside a hit envelope. No envelope, no answer, by design.

How fresh does the freshness gate need to be?

Faster than the slowest regulatory event you care about. Nightly is too slow for a certificate suspension. We run a 15-minute Notum sync; watching the notified bodies' public registers for status changes is next on the roadmap.
