HL7v2 misrouting: a silent LOINC fallback and 680 lab results

A 28-person lab in Rotterdam shipped 680 ORU messages to the wrong huisarts before lunch on a Friday. The cause was one stale LOINC table. Here is what broke.

Jacob Molkenboer · Founder · A Brand New Company · 13 Mar 2026 · 10 min

The phone rang at 11:47 on a Friday. A huisartsenpraktijk in Schiebroek had just opened an ORU result message addressed to one of their patients, except the patient was not theirs. The CRP value looked plausible, units in mg/L, reference range intact, but the BSN at the top of the PID segment belonged to a 64-year-old man registered four kilometres south. By the time the lab's quality manager picked up, three more practices had called.

The lab is a 28-person medisch laboratorium in Rotterdam-Noord. They run about 4,200 result messages a day through a Mirth Connect channel that picks up CSV from the LIS, maps each analyte code against LOINC, resolves the destination practice via AGB code, and drops the HL7v2 ORU^R01 onto an OZIS VPN tunnel. The channel had run without a routing incident for nineteen months.

That Friday morning, 680 ORU messages went to the wrong practice before someone killed the channel.

The pipeline before the incident

A LOINC code is a global identifier for a laboratory observation. A CRP measured in serum is 1988-5. The lab does not route on LOINC; it routes on the GP's AGB code, a Dutch healthcare-provider identifier. But the LOINC code tells the channel which test type a row in the LIS belongs to, and the test type drives a reference workflow that in a handful of cases overrides the default routing.

Specifically, for a small number of point-of-care panels that were ordered through a shared specialist intake, the LOINC code resolved to a different AGB than the one stamped on the LIS row. About 4% of daily traffic took that path. The rest went straight through.

Four times a year the lab pulls fresh reference data, a hash of the latest LOINC release plus their own panel-to-AGB overrides, into a versioned JSON file mounted at /srv/mirth/refdata/loinc-current.json. The Mirth Connect channel reads it on deploy. The previous quarter's file stays in place at /srv/mirth/refdata/loinc-2026-Q1.json, unread, as a manual rollback safety net.
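As a rough illustration of how such a file might be shaped and consulted, here is a minimal sketch in JavaScript. The field names (releaseId, overrides), the makeTable helper, the panel code, and the AGB values are all assumptions made for illustration, not the lab's actual schema. Only 1988-5 (CRP) comes from the text above.

```javascript
// Hypothetical shape of a reference file: a release stamp plus
// panel-to-AGB overrides keyed on LOINC code. All values are placeholders.
const refdata = {
  releaseId: '2026-Q1',
  overrides: {
    'PANEL-0001': '01999999', // made-up panel code -> made-up intake AGB
  },
};

// Wraps a parsed reference file in a small lookup object.
function makeTable(data) {
  return {
    releaseId: data.releaseId,
    // Returns the override AGB when the LOINC code has one,
    // otherwise the AGB stamped on the LIS row.
    resolve(loincCode, defaultAGB) {
      const hasOverride =
        Object.prototype.hasOwnProperty.call(data.overrides, loincCode);
      return {
        agb: hasOverride ? data.overrides[loincCode] : defaultAGB,
        releaseId: data.releaseId,
      };
    },
  };
}

const table = makeTable(refdata);
const direct = table.resolve('1988-5', '01012345');         // CRP, no override
const overridden = table.resolve('PANEL-0001', '01012345'); // override path
```

Carrying the release stamp on every resolved result is a design choice in this sketch: it lets a downstream check see which table version produced a routing decision without going back to the file.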

The silent fallback

The Q2 update went out at 06:30. The DevOps engineer ran the lab's standard deploy script:

cd /srv/mirth/refdata
cp loinc-current.json loinc-2026-Q1.json
curl -fsSL "$INTERNAL/loinc-2026-Q2.json" -o loinc-current.json
mirth-cli redeploy --channel oru-routing

The curl returned a 304 from an upstream Squid proxy that had cached the previous quarter's file under the same URL. The deploy script did not check the file's release version. It checked only that the file existed and parsed as JSON. Both were true. loinc-current.json and the loinc-2026-Q1.json backup now held the same Q1 table. The channel redeployed cleanly. Health checks were green.

What had changed in Q2 was the override for one panel, a metabolic screening package that had been moved from the specialist intake back to direct GP ordering. The Q1 table sent results for that panel to the intake's AGB. By Friday lunchtime, every patient who had switched to direct ordering after April 1 had their results delivered to a practice they no longer attended.

Warning

An idempotent deploy is not a correct deploy. If your pipeline can pass health checks while operating on stale reference data, you do not have a working deploy. You have a working syntax check.

It took the on-call engineer an hour after the kill to find the proxy. The first hypothesis was a bug in the new override file, so they pulled the live JSON and diffed it against Q1. Identical. The second hypothesis was a Mirth deploy race, but the channel logs showed a clean reload at 06:30:14 and no further events until the kill. Only when they replayed the curl from the deploy host with verbose flags did the 304 show up, and only then did anyone remember the Squid proxy that had been put in front of the supplier portal eight months earlier to keep on-call laptops from saturating the office uplink. Nobody had thought to mark the supplier URL as no-cache. The proxy was doing what it had been asked to do.

Blast radius

680 ORU messages reached 53 distinct AGB codes between 06:31 and 11:53. Of those 53 recipients, 47 were wrong. Six were correct by coincidence, because the patient happened to be shared with the receiving practice. The lab killed the channel at 11:53, paused all outbound HL7 traffic, and started a four-step recovery:

1. Outbound channel halted. LIS queue allowed to fill, about 1,400 messages over the next six hours.
2. Each of the 47 receiving practices called by the lab and asked to mark the messages as received in error in their HIS and to confirm they had not opened the attached PDF rendering.
3. Affected patients identified, one BSN per ORU, deduplicated to 612 individuals. Autoriteit Persoonsgegevens notified inside the 72-hour breach window per GDPR Article 33.
4. The Q2 LOINC table re-fetched from origin with a SHA-256 hash check against a value posted out-of-band on the supplier portal. Channel redeployed. The 1,400 queued messages reprocessed.

The lab was back online at 18:40. Total downtime: just under seven hours. Total cost, including staff hours, legal counsel, the AP notification, and the patient outreach campaign that followed, landed around €34,000 against an annual integration budget of €11,000.

The AP filing was a four-page form. Most of the time went into explaining the technical chain in language a non-technical reviewer would accept. The lab sent a Dutch-language letter to all 612 patients the following week, naming the receiving practice, confirming that the message had been recalled, and offering a direct line for questions. Eleven patients called back. None filed a complaint. The AP closed the file in August with no further measures, noting that the corrective controls described in the response addressed the root cause. The lab's professional liability insurer was notified the same day as a precaution, and no claim was ever made against the policy.

The diff-gate we now run

The fix was not a more careful deploy script. We wrote one of those too, but the lesson from Friday was that the channel itself should refuse to send a message it cannot justify. The channel now resolves every outbound ORU twice, once against the current reference table and once against the previous quarter's frozen copy, and halts the message if the two disagree.

The Transformer step that does the work is roughly this:

// Mirth Connect Transformer step
// Resolves the destination AGB against both the current and previous tables.
// Routes the message to manual review and stops if the tables share a release,
// disagree on the AGB, or yield a missing or malformed AGB.
// LOINC_CURRENT and LOINC_PREVIOUS are the two loaded reference tables.

var loincCode  = msg['OBX']['OBX.3']['OBX.3.1'].toString();
var defaultAGB = msg['ORC']['ORC.21']['ORC.21.10'].toString();

var current  = LOINC_CURRENT.resolve(loincCode, defaultAGB);
var previous = LOINC_PREVIOUS.resolve(loincCode, defaultAGB);

// Check 1: the reference table must have advanced since last quarter
if (current.releaseId === previous.releaseId) {
  channelMap.put('halt_reason',
    'reference table did not advance, possible cache hit');
  // routeMessage looks the target up by channel name;
  // routeMessageByChannelId expects the channel's ID instead
  router.routeMessage('manual-review-queue', msg);
  return;
}

// Check 2: a routing change between quarters needs a human sign-off
if (current.agb !== previous.agb) {
  channelMap.put('halt_reason',
    'AGB drift: ' + previous.agb + ' -> ' + current.agb +
    ' (loinc ' + loincCode + ')');
  router.routeMessage('manual-review-queue', msg);
  return;
}

// Check 3: an AGB code must be present and exactly eight digits
if (current.agb == null || !/^\d{8}$/.test(current.agb)) {
  channelMap.put('halt_reason', 'AGB unresolved or malformed');
  router.routeMessage('manual-review-queue', msg);
  return;
}

Three checks. First, and most important, the gate refuses to send if the two reference tables share a release identifier. That single check would have caught Friday's incident before message one. Second, if the two tables resolve to different AGBs for the same LOINC code, the message goes to manual review. That catches genuine routing changes between quarters and forces a human to sign them off before they roll out at scale. Third, a malformed or missing AGB is treated as a hard fail.
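The same three checks can be pulled out of the channel into a plain function, so they can be unit-tested without Mirth. This is a minimal sketch, assuming each resolved result is an object with agb and releaseId fields as in the Transformer step. The gate function name, its return convention, and the sample AGB values are hypothetical.

```javascript
// Returns null when the message may be sent, or a halt reason string.
// Check order mirrors the Transformer step: release, drift, format.
function gate(current, previous, loincCode) {
  if (current.releaseId === previous.releaseId) {
    return 'reference table did not advance, possible cache hit';
  }
  if (current.agb !== previous.agb) {
    return 'AGB drift: ' + previous.agb + ' -> ' + current.agb +
      ' (loinc ' + loincCode + ')';
  }
  if (current.agb == null || !/^\d{8}$/.test(current.agb)) {
    return 'AGB unresolved or malformed';
  }
  return null;
}

// Friday's failure: both tables carry the same release stamp.
const stale = gate(
  { agb: '01012345', releaseId: '2026-Q1' },
  { agb: '01012345', releaseId: '2026-Q1' },
  '1988-5'
);

// A healthy quarter: the release advanced and routing is unchanged.
const healthy = gate(
  { agb: '01012345', releaseId: '2026-Q2' },
  { agb: '01012345', releaseId: '2026-Q1' },
  '1988-5'
);
```

Because the release check runs first, a stale table halts every message regardless of its contents, which is the behaviour the first check is meant to guarantee.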

The manual-review queue is a separate Mirth channel that writes to a Postgres table and pings a Teams webhook. The lab's quality manager works through it twice a day. In the first week the gate halted 41 messages. Of those, 39 were legitimate routing changes from the metabolic panel migration, approved retroactively in one batch. Two came from a separate bug in the LIS export, where an AGB field had been left blank because the patient had moved practice the day before. Both were resolved without a misroute, which is the entire point of the gate. The 41 halts were not 41 incidents. They were 41 confirmations that the channel was paying attention.

The deploy script itself was rewritten the following Monday. It now fetches the supplier-portal manifest first, reads the SHA-256 of the expected file from the manifest, downloads the file, verifies the hash matches, and only then atomically swaps it into place. A hash mismatch aborts with a non-zero exit code that the channel manager surfaces in the next health-check window. The supplier URL was also added to the proxy's no-cache list, with a comment in the proxy config that points back to the incident ticket. None of those changes replaces the in-channel gate, but together they cover both the deploy-time and the run-time failure modes.
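The verify-then-swap sequence might look like the following. This is a minimal sketch in Node.js, not the lab's script. It assumes the manifest has already been fetched and yields a hex SHA-256 for the expected file; installVerified, sha256Hex, and their parameters are hypothetical names, and the download step is left out so the example stays self-contained.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hex SHA-256 of a buffer or string.
function sha256Hex(data) {
  return crypto.createHash('sha256').update(data).digest('hex');
}

// Writes `body` next to `targetPath` and renames it into place only if its
// hash matches `expectedSha256`. Throws on mismatch before touching the target.
function installVerified(body, expectedSha256, targetPath) {
  const actual = sha256Hex(body);
  if (actual !== expectedSha256.toLowerCase()) {
    throw new Error(
      'hash mismatch: expected ' + expectedSha256 + ', got ' + actual);
  }
  const tmpPath = path.join(
    path.dirname(targetPath),
    '.' + path.basename(targetPath) + '.' + process.pid + '.tmp');
  fs.writeFileSync(tmpPath, body);
  // rename is atomic when source and target sit on the same filesystem
  fs.renameSync(tmpPath, targetPath);
}
```

A wrapper script would call installVerified inside a try block and exit non-zero when it throws, so a failed verification is visible to whatever monitors the deploy. Note that a hash check alone only proves the file matches the manifest. If the manifest itself can be served stale, it needs the same no-cache treatment as the file.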

What the gate does not solve

The gate covers reference-table drift. It does not cover the case where both tables agree but both are wrong, and it does not cover patient-level routing errors where the LIS sends the wrong BSN against an otherwise correct AGB.

The lab built a second gate to cover the patient-level case. It compares the BSN in the PID-3 field against a nightly snapshot of the patient registry behind the destination AGB. If the BSN is not in the registry, the message goes to the same manual-review queue. The snapshot is built from an authorised LSP query and held only as a hash table keyed on BSN, never the full record, so the gate never has to read identifying data it does not need. In the four months since it went live the second gate has halted twelve messages. Ten were patients who had switched GPs without telling the lab. One was a typo in a manually entered AGB on a referral form. One was a genuine LIS bug where two consecutive patient rows had swapped BSNs during an overnight batch export. None reached a receiving practice.
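A minimal sketch of that lookup, assuming the nightly snapshot has been loaded into a Map from AGB code to a Set of BSNs. The buildSnapshot and isRegistered names are hypothetical, and every identifier in the sample data is made up.

```javascript
// Builds the lookup from rows of [agb, bsn]. Only the two identifiers are
// kept: no names, addresses, or clinical data.
function buildSnapshot(rows) {
  const snapshot = new Map();
  for (const [agb, bsn] of rows) {
    if (!snapshot.has(agb)) snapshot.set(agb, new Set());
    snapshot.get(agb).add(bsn);
  }
  return snapshot;
}

// True when the BSN from PID-3 is registered with the destination practice.
function isRegistered(snapshot, agb, bsn) {
  const patients = snapshot.get(agb);
  return patients !== undefined && patients.has(bsn);
}

const snapshot = buildSnapshot([
  ['01012345', '000000012'],
  ['01012345', '000000024'],
  ['01067890', '000000036'],
]);

const ok = isRegistered(snapshot, '01012345', '000000012');
const misrouted = isRegistered(snapshot, '01067890', '000000012');
```

In this sketch an unknown AGB is treated the same as an unknown BSN: the lookup returns false and the message would go to manual review rather than out the door.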

What we did not do: we did not swap integration engines, we did not introduce a message-broker abstraction layer, and we did not rewrite the LOINC mapping in Rust. The incident was a deploy-pipeline failure, not an architectural one. LOINC is fine. Mirth Connect is fine. What the lab lacked was a check that today's outbound traffic was different from yesterday's in the way they expected.

The smallest version of this you can run on Monday

You do not need Mirth Connect, HL7v2, or a clinical lab to apply this. If you have any pipeline that depends on a reference table (a tax-rate file, a shipping-zone matrix, a feature-flag JSON, a price book), write a ten-line script that loads the current and previous versions and refuses to deploy when they share a version stamp or when their diff is empty. Have it fail loud. Have it page someone.

The script does not need to understand the contents of the file. A SHA-256 of the parsed payload is enough to tell you whether today's deploy is shipping the same bytes as yesterday's. A version field somewhere inside the payload tells you whether the bytes that did change were the ones you intended to change. Either check on its own catches most failures of this kind. The two together make a quiet failure almost impossible.
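A minimal sketch of such a check in Node.js, assuming both files are JSON and carry a top-level releaseId field. The field name, the function names, and the sample data are assumptions, so adapt them to whatever version stamp your own file uses.

```javascript
const crypto = require('crypto');

// SHA-256 over the re-serialised payload, so whitespace-only differences
// in the file do not count as a change. Key order still matters.
function payloadDigest(payload) {
  return crypto.createHash('sha256')
    .update(JSON.stringify(payload))
    .digest('hex');
}

// Returns a list of reasons to refuse the deploy; empty means go ahead.
function refdataProblems(current, previous) {
  const problems = [];
  if (current.releaseId === undefined ||
      current.releaseId === previous.releaseId) {
    problems.push('version stamp did not advance');
  }
  if (payloadDigest(current) === payloadDigest(previous)) {
    problems.push('payload is identical to the previous version');
  }
  return problems;
}

const previous = { releaseId: '2026-Q1', overrides: { A: '01012345' } };
const cachedCopy = JSON.parse(JSON.stringify(previous)); // what a cache would serve
const fresh = { releaseId: '2026-Q2', overrides: {} };

const staleProblems = refdataProblems(cachedCopy, previous);
const freshProblems = refdataProblems(fresh, previous);
```

In a real deploy the two payloads would come from JSON.parse over the current and previous files, and a non-empty problem list would end the script with a non-zero exit code and an alert.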

When we built the process automation for this lab, the thing we ran into was that every layer of the pipeline was correct in isolation: the deploy script worked, the channel deployed cleanly, the messages parsed, the recipients existed. The failure lived in the gap between layers. The diff-gate is how we close that gap.

Today's audit takes five minutes. Open one of your scheduled reference-data deploys and ask out loud, “how would I know if this just shipped yesterday's version of the data?”, then write down the first answer you cannot defend. If the answer is a passing health check, a green build, or a deploy script that returned zero, that is not an answer. That is a syntax check pretending to be a sanity check. Pick the first deploy you cannot defend and run a manual diff against last quarter before the next scheduled refresh.

Key takeaway

If your deploy can succeed while shipping yesterday's reference data, you do not have a working deploy. You have a working syntax check.

FAQ

What is an ORU^R01 message?

ORU^R01 is the HL7v2 message type used to send unsolicited observation results, typically lab results, from a sender like a LIS to a receiver like a GP information system over a clinical network.

Why did the deploy script not catch the stale file?

The script verified that the file existed and parsed as JSON. It did not verify the release identifier inside the file, so a cached previous-quarter response from the proxy looked identical to a successful update.

Does the diff-gate add latency to message routing?

Two lookups against in-memory reference tables add roughly one millisecond per message in this channel. Throughput stays inside the lab's existing SLA against the GP-side receivers.

Is a 72-hour AP notification always required for a misroute?

Under GDPR Article 33, a personal data breach must be reported to the supervisory authority within 72 hours of the controller becoming aware of it, unless the breach is unlikely to result in a risk to the rights and freedoms of the people affected. Misrouted clinical results almost always carry enough risk to require notification.

process automation · integrations · architecture · operations · workflow · case study
