
Idempotency in healthcare webhooks: 312 routes double-booked

On Sunday 26 October 2025, somewhere between 02:00 and 03:00 local time, a Nedap Ons webhook quietly re-fired. By Wednesday morning, 312 nursing routes had been booked twice.

Jacob Molkenboer · Founder · A Brand New Company · 5 May 2026 · 11 min


The planner at a 24-person thuiszorg west of Eindhoven opens her laptop at 06:40 on Wednesday and looks at the route board for the day. The first thing she notices is that Annelies, one of her wijkverpleegkundigen, has been assigned twelve clients before noon. Annelies does eight on a normal day. The second thing she notices is that every other route on the board has the same shape: doubled. The board claims 624 visits before lunch, on a team that physically handles 312.

She calls us at 06:43.

This is the walkthrough of what we found, what broke, and the small piece of code that now sits between every ECD mutation and the VECOZO tunnel.

The system we built

The planning-agent we shipped in March routes wijkverpleegkundigen across the city on a rolling 14-day horizon. It pulls the dienstrooster from Nedap Ons, indicaties from the ECD, traffic forecasts from a public NDW feed, and constraint preferences from the planner (who lives where, who refuses certain wijken after dark, which clients only accept the same nurse on Mondays). It writes back to Nedap Ons via the standard Ons API, which travels, like every healthcare mutation in this country, through the VECOZO tunnel.

Why the agent exists at all: scheduling 24 nurses across roughly 90 indicated clients on a rolling fortnight is a constraint problem the planner used to solve by hand on Friday afternoons, in four hours of dragging coloured blocks around a spreadsheet. The agent produces a Pareto front of three options in about two minutes and lets the planner pick. That is the value. Everything below this paragraph is about what happens when the gate around the value is wrong.

The agent gets its trigger from a Nedap Ons webhook. Each time the Q4 schedule changes (a sick day, a swap, a new client) Ons posts the diff and the agent re-plans the affected window.

The webhook contract, paraphrased from the docs:

Events are delivered at-least-once. Consumers must deduplicate using the event_id field. We retry on 5xx and on connection timeouts up to 24 hours.
Nedap Ons API, webhook delivery section

We had read this. We had not honoured it correctly.

What happened on 26 October

The night of 26 October 2025 was the night the clock went back. Europe rolled from CEST (UTC+2) to CET (UTC+1) at 03:00 local, falling to 02:00. The hour between 02:00 and 03:00 local time happens twice on that night.

At 02:47 CEST, the planner pushed a swap into Ons (Annelies and Inge trading two routes). Ons fired the webhook. Our handler consumed it, wrote a re-plan back through VECOZO, and acked the event.

At 02:13 CET (26 minutes of elapsed time later, but inside the same printed local hour) Ons fired what it believed was the same event again. We have not had the full post-mortem from Nedap, but the working theory is a known class of retry-during-DST bugs where the retry queue groups events by a local-time window and either fires twice or drops events that fall inside the duplicated hour.
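The duplicated hour is easy to reproduce. A minimal sketch, assuming a Node runtime with its built-in time-zone data; the helper name localClock and the choice of the Europe/Amsterdam zone are ours for illustration and are not part of the Ons payload:

```typescript
// Two distinct UTC instants on 26 October 2025 that print inside the same local hour.
const amsterdam = new Intl.DateTimeFormat("en-GB", {
  timeZone: "Europe/Amsterdam",
  hour: "2-digit",
  minute: "2-digit",
  hour12: false,
})

// Hypothetical helper: render a UTC instant as Dutch wall-clock time (HH:MM).
function localClock(isoUtc: string): string {
  return amsterdam.format(new Date(isoUtc))
}

const firstDelivery = "2025-10-26T00:47:00Z"  // still CEST (UTC+2)
const secondDelivery = "2025-10-26T01:13:00Z" // already CET (UTC+1)

// Elapsed time is computed on UTC instants, never on printed local times.
const elapsedMinutes =
  (Date.parse(secondDelivery) - Date.parse(firstDelivery)) / 60000

console.log(localClock(firstDelivery), localClock(secondDelivery), elapsedMinutes)
```

The second delivery prints an earlier local time than the first, although it happened later. Any logic that orders or windows events by printed local time inherits that inversion.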

We reconstructed the timeline from three sources: Nedap's webhook delivery log, our own structured logs (we keep full payloads for 90 days), and the ECD's audit trail. The three agreed on every event but disagreed on its timestamp by between zero and 47 minutes, depending on which system's clock had been authoritative at that moment of the roll-over. The audit trail won.

Our handler also keyed its dedupe check on a local-time window: WHERE received_at BETWEEN now() - interval '1 hour' AND now(). From the agent's perspective the second event landed outside that window. We treated it as new.
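To make the failure concrete, here is a reduced in-memory sketch. It assumes the seen-event rows carried a naive local timestamp with no offset, which is one plausible way a local-time window goes wrong; the row shape and function names are hypothetical, not our production schema:

```typescript
// A seen-event row with a naive local timestamp, reduced to
// minutes since local midnight (no offset information survives).
type SeenRow = { eventId: string; receivedLocalMin: number }

// Flawed check: "have we seen this id within the last hour of local time?"
function seenInLocalWindow(
  rows: SeenRow[],
  eventId: string,
  nowLocalMin: number,
  windowMin = 60,
): boolean {
  return rows.some(
    r =>
      r.eventId === eventId &&
      r.receivedLocalMin >= nowLocalMin - windowMin &&
      r.receivedLocalMin <= nowLocalMin,
  )
}

// Time-independent check: the provider's id alone decides.
function seenById(rows: SeenRow[], eventId: string): boolean {
  return rows.some(r => r.eventId === eventId)
}

// First delivery was recorded at 02:47 local (CEST).
const seenRows: SeenRow[] = [{ eventId: "evt_swap", receivedLocalMin: 2 * 60 + 47 }]
// The retry arrives at 02:13 local (CET), after the clock went back.
const retryLocalMin = 2 * 60 + 13

const windowSaysSeen = seenInLocalWindow(seenRows, "evt_swap", retryLocalMin)
const idSaysSeen = seenById(seenRows, "evt_swap")
console.log({ windowSaysSeen, idSaysSeen })
```

In this sketch the first delivery sits "in the future" relative to the retry's local clock, so the window misses it while the id lookup does not.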

Re-plan ran. New routes were written. Ons acked them. Both copies sat in the ECD as legitimate dienst-mutaties, with valid VECOZO signatures, indistinguishable from a real swap.

The first eighteen webhook events on the morning of the 27th (Monday, a new client intake among them) were then processed against a now-doubled schedule. The planner's "swap two routes" became "swap two of four, the other two stay". The agent did exactly what it was told to do.

By Wednesday morning the divergence had compounded to 312 doubled routes.

Warning

Any webhook contract that says "at-least-once" means your dedupe gate has to be unconditional and time-independent. If your dedupe window is a wall-clock interval, you have written a clock-skew bug and DST will find it.

The roll-back

We rolled back the schedule from the previous night's signed snapshot at 07:20. The planner re-applied her real changes by hand by 09:00. No client was visited twice; the on-duty nurses caught the duplicates at the morning hand-over because no human believes Annelies does twelve clients before lunch.

The signed snapshot is something we now ship in every agent we build: a hash-chained dump of the agent's understanding of the world, written to S3 with Object Lock every fifteen minutes. To roll back, we replay the mutations between the last good snapshot and now in dry-run, diff against the live state, and produce a list of compensating mutations the planner approves in one click. The 07:20 roll-back took eleven minutes of wall-clock time, of which nine were spent on the planner reading the diff.
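The hash-chaining part can be sketched in a few lines. This is an illustration under assumptions, not our snapshot format: the Snapshot shape, the all-zero genesis value and the function names are invented for the example, and the S3 Object Lock write is left out entirely.

```typescript
import { createHash } from "node:crypto"

// Hypothetical snapshot record: each entry commits to the one before it.
type Snapshot = { takenAt: string; state: string; prevHash: string; hash: string }

const GENESIS = "0".repeat(64) // assumed starting value for the first link

// Hash of one link: previous hash, timestamp and serialized state.
function linkHash(prevHash: string, takenAt: string, state: string): string {
  return createHash("sha256")
    .update(prevHash).update("\n")
    .update(takenAt).update("\n")
    .update(state)
    .digest("hex")
}

// Append a snapshot, chaining it to the last one (or to GENESIS).
function appendSnapshot(chain: Snapshot[], takenAt: string, state: string): Snapshot[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS
  return [...chain, { takenAt, state, prevHash, hash: linkHash(prevHash, takenAt, state) }]
}

// Recompute every link; any edited or reordered snapshot breaks the chain.
function verifyChain(chain: Snapshot[]): boolean {
  let prev = GENESIS
  for (const s of chain) {
    if (s.prevHash !== prev) return false
    if (s.hash !== linkHash(s.prevHash, s.takenAt, s.state)) return false
    prev = s.hash
  }
  return true
}

let snapshots = appendSnapshot([], "2025-10-25T23:45:00Z", '{"routes":312}')
snapshots = appendSnapshot(snapshots, "2025-10-26T00:00:00Z", '{"routes":312}')
console.log(verifyChain(snapshots))
```

The point of the chain is that a roll-back target can be checked before it is trusted: a snapshot that was altered after the fact no longer verifies, and neither does anything written after it.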

Then we wrote the gate.

The gate

The gate sits in front of every mutation the planning-agent emits. It runs before the VECOZO envelope is built and signed. It does two things: it refuses to forward an Ons event we have already acked, and it refuses to forward any mutation whose source timestamp is more than a configurable skew from our own UTC clock.

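// gate.ts: runs in front of every ECD mutation
import { createHash } from "node:crypto"
import { db } from "./db"
// Assumed local modules (paths are illustrative): the VECOZO sender and the payload types.
import { sendToVecozo } from "./vecozo"
import type { OnsEvent, EcdMutation } from "./types"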

const MAX_SKEW_MS = 5 * 60 * 1000        // 5 minutes
const REPLAY_WINDOW_DAYS = 30

export async function gate(event: OnsEvent, mutation: EcdMutation) {
  // 1. Idempotency on the source event_id. Not a derived hash,
  //    not a content fingerprint, not the receipt timestamp.
  const key = event.event_id
  const seen = await db.oneOrNone(
    `SELECT acked_at FROM webhook_seen
       WHERE event_id = $1
         AND received_at > now() - interval '${REPLAY_WINDOW_DAYS} days'`,
    [key],
  )
  if (seen) {
    return { decision: "drop", reason: "replay", first_seen: seen.acked_at }
  }

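  // 2. Clock-skew: compare event.occurred_at (strict UTC, from Ons)
  //    against our own UTC clock. Date.now() is wall-clock time, not
  //    monotonic. Quarantine if drift > MAX_SKEW_MS, or if occurred_at
  //    does not parse: Date.parse returns NaN, and NaN > x is false.
  const skewMs = Math.abs(Date.now() - Date.parse(event.occurred_at))
  if (!Number.isFinite(skewMs) || skewMs > MAX_SKEW_MS) {
    return { decision: "quarantine", reason: "skew", skewMs }
  }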

  // 3. Reserve the key in the same transaction as the mutation.
  //    If the ECD write fails, the row rolls back and we can retry.
  return db.tx(async t => {
    await t.none(
      `INSERT INTO webhook_seen (event_id, received_at, acked_at, mutation_hash)
       VALUES ($1, now(), now(), $2)`,
      [key, mutationHash(mutation)],
    )
    await sendToVecozo(mutation)
    return { decision: "forward" }
  })
}

function mutationHash(m: EcdMutation) {
  return createHash("sha256").update(JSON.stringify(m)).digest("hex")
}

Five things to call out: the first three because we got each of them wrong in the first draft, the last two because they are the parts of the story that get skipped.

The replay window is 30 days, not 24 hours

Nedap's stated retry window is 24 hours, but we have observed retries from cold storage after a Nedap-side incident at the 11-day mark. Thirty days of webhook_seen rows is cheap; a duplicated visit on a real client is not.

occurred_at must be strict UTC at the source

Ons emits ISO-8601 with an offset, so we get UTC information for free, but the first draft parsed it through a local-time helper that resolved ambiguous times to the first occurrence in the duplicated DST hour. We now parse strictly and reject any timestamp without an explicit offset. There is no "02:30 local, you decide" branch anywhere in the handler.
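A strict parser along those lines might look like the sketch below. The function name parseStrictUtcMs and the exact pattern are assumptions for illustration; the accepted shape should be adjusted to whatever the provider actually emits.

```typescript
// Accept only ISO-8601 date-times that carry an explicit "Z" or ±hh:mm offset.
const EXPLICIT_OFFSET =
  /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d{1,3})?(Z|[+-]\d{2}:\d{2})$/

// Hypothetical helper: return epoch milliseconds, or throw.
function parseStrictUtcMs(value: string): number {
  if (!EXPLICIT_OFFSET.test(value)) {
    // Without an offset, Date.parse would read the value as local time,
    // which is exactly the ambiguity we want to refuse.
    throw new Error(`timestamp without explicit offset: ${value}`)
  }
  const ms = Date.parse(value)
  if (Number.isNaN(ms)) {
    throw new Error(`unparseable timestamp: ${value}`)
  }
  return ms
}

// The two deliveries from the incident resolve to two different instants.
const firstMs = parseStrictUtcMs("2025-10-26T02:47:00+02:00")
const secondMs = parseStrictUtcMs("2025-10-26T02:13:00+01:00")
console.log(secondMs > firstMs)
```

With the offset in hand, the later delivery sorts after the earlier one even though its printed local time is smaller.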

The dedupe insert and the VECOZO send live in the same transaction

If the send fails, the row rolls back. If the row insert fails (unique violation on event_id), we never send. A retry can never see "ack written, send not made", and "send made, ack not written" is limited to the narrow case where the commit itself fails after a successful send. This is the same pattern Stripe documents for its own idempotency keys, and the reason it exists is the same: at-least-once delivery is the default because exactly-once is a famously expensive distributed-systems problem.

What the gate does not do

It does not, on its own, catch a content-level duplicate that arrives with a fresh event_id. If Nedap ever re-emits the same swap with a new id (we have not observed this, but cannot rule it out), the gate will pass it through. The mutation_hash column exists for the second-level check we run nightly: it flags any pair of mutations with the same hash and the same target client within a 72-hour window, and quarantines both for the planner to inspect. The in-line gate is the cheap, fast check; the hash-pair scan is the slow, out-of-line one. They cover different failure modes and we run both.
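The nightly scan reduces to a group-and-compare. A sketch under assumptions: the LoggedMutation shape and field names are hypothetical, and the real scan runs against the database rather than an in-memory array.

```typescript
// Hypothetical row shape for a forwarded mutation.
type LoggedMutation = { id: string; clientId: string; mutationHash: string; sentAtMs: number }

const PAIR_WINDOW_MS = 72 * 60 * 60 * 1000 // 72 hours

// Return id pairs that share client and hash and were sent within the window.
// Comparing neighbours in time order is enough to flag every mutation involved.
function findSuspectPairs(log: LoggedMutation[]): Array<[string, string]> {
  const buckets = new Map<string, LoggedMutation[]>()
  for (const m of log) {
    const key = `${m.clientId}|${m.mutationHash}`
    const bucket = buckets.get(key) ?? []
    bucket.push(m)
    buckets.set(key, bucket)
  }

  const pairs: Array<[string, string]> = []
  buckets.forEach(bucket => {
    bucket.sort((a, b) => a.sentAtMs - b.sentAtMs)
    for (let i = 1; i < bucket.length; i++) {
      if (bucket[i].sentAtMs - bucket[i - 1].sentAtMs <= PAIR_WINDOW_MS) {
        pairs.push([bucket[i - 1].id, bucket[i].id])
      }
    }
  })
  return pairs
}

const HOUR_MS = 60 * 60 * 1000
const sampleLog: LoggedMutation[] = [
  { id: "m1", clientId: "c1", mutationHash: "abc", sentAtMs: 0 },
  { id: "m2", clientId: "c1", mutationHash: "abc", sentAtMs: 26 * 60 * 1000 }, // same swap, fresh event_id
  { id: "m3", clientId: "c2", mutationHash: "abc", sentAtMs: 0 },              // different client
  { id: "m4", clientId: "c1", mutationHash: "abc", sentAtMs: 100 * HOUR_MS },  // outside the window
]
const suspectPairs = findSuspectPairs(sampleLog)
console.log(suspectPairs)
```

One caveat if the hash comes from JSON.stringify: two payloads with the same fields in a different key order hash differently, so a scan like this only catches byte-identical content unless the payload is canonicalised first.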

How we test it

Every webhook payload we have ever received is replayable from a recorded fixture. The CI suite replays the full 26 October incident, the eighteen events that compounded it on the Monday morning, and a synthetic set of clock anomalies: an NTP step forward, an NTP step back, a container clock drift past MAX_SKEW_MS, a missing event_id, a malformed occurred_at. A green build means the gate handles all of those. A red build blocks deploy.
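The synthetic clock cases are easiest to write when the clock is an argument rather than a global. A reduced sketch, assuming a pure checkSkew function and a hand-written case table; both are illustrative and much smaller than the real fixture suite:

```typescript
type SkewDecision =
  | { decision: "pass" }
  | { decision: "quarantine"; reason: "skew" | "malformed" }

// Pure skew check: the caller supplies "now", so tests can fake any clock.
function checkSkew(occurredAt: string, nowMs: number, maxSkewMs: number): SkewDecision {
  const occurredMs = Date.parse(occurredAt)
  if (Number.isNaN(occurredMs)) {
    return { decision: "quarantine", reason: "malformed" }
  }
  if (Math.abs(nowMs - occurredMs) > maxSkewMs) {
    return { decision: "quarantine", reason: "skew" }
  }
  return { decision: "pass" }
}

const MAX_SKEW = 5 * 60 * 1000
const eventAt = "2025-10-26T00:47:00Z"
const t0 = Date.parse(eventAt)

// Each case fakes the handler's clock relative to the event's own timestamp.
const clockCases = [
  { name: "healthy clock",         occurredAt: eventAt,      nowMs: t0 + 2000,           expect: "pass" },
  { name: "NTP step forward",      occurredAt: eventAt,      nowMs: t0 + 10 * 60 * 1000, expect: "quarantine" },
  { name: "NTP step back",         occurredAt: eventAt,      nowMs: t0 - 10 * 60 * 1000, expect: "quarantine" },
  { name: "malformed occurred_at", occurredAt: "not-a-date", nowMs: t0,                  expect: "quarantine" },
]

const failedCases = clockCases
  .filter(c => checkSkew(c.occurredAt, c.nowMs, MAX_SKEW).decision !== c.expect)
  .map(c => c.name)

console.log(failedCases.length === 0 ? "all clock cases pass" : failedCases)
```

The malformed case is the one worth keeping forever: a comparison against NaN is always false, so a check that forgets it lets unparseable timestamps straight through.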

Why this matters more in regulated work

Regulatory pressure on agentic systems is no longer theoretical. The EU AI Act places agents that take action against patient records in the high-risk tier, with logging, traceability and human-oversight requirements rolling in through August 2027. The Dutch WGBO already requires that every entry in a patient's dossier be attributable to a named author and to a precise moment. A duplicated VECOZO mutation, in that frame, is not a billing inconvenience. It is an audit-trail integrity failure, and your agent is the named author on both copies.

If you are operating an AI agent against a regulated record system, the part of the stack you cannot cheap out on is the boring part. The agent itself is a glorified prompt with tools. The thing that decides whether you can stay in business is the gate in front of the tool calls.

For a payments platform, the equivalent of a duplicated webhook is "billed twice". For an ECD, it is "visited twice", or worse, "scheduled and not visited because the agent thought the slot was already done".

The cost asymmetry is the part most teams underestimate. The gate above is roughly eighty lines of code, sits behind a feature flag, and runs in under two milliseconds per event. The post-incident audit, the calls with the ECD vendor, and the explanation we owed our client took three weeks of senior time across two countries. The gate is the cheapest part of this whole story.

A five-minute audit for your own stack

Open the handler that consumes your most expensive webhook, the one that triggers writes to a record system you do not own. Search for the dedupe check. Ask four questions.

One: is the dedupe key the provider's event_id, or something you computed? If you computed it from the payload, your check is wrong on any retry where the payload differs by a single whitespace character.

Two: is the dedupe window a wall-clock interval? If yes, the next DST transition or the next NTP step will produce a duplicate.

Three: is the dedupe insert in the same transaction as the downstream mutation? If not, there is a window where you have one but not the other.

Four: when the gate drops or quarantines an event, do you log it with the reason and surface it on a dashboard the on-call human looks at? A silent gate is a worse failure mode than a missing one. The first thing we ship with every new gate is the row in the operations dashboard that goes red when the drop count is non-zero.
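For the fourth question, the minimum is a counter keyed by decision and reason, plus a structured log line per blocked event. A sketch with hypothetical names (GateStats, gate_blocked); wiring it to a real dashboard is left out:

```typescript
type GateDecision =
  | { decision: "forward" }
  | { decision: "drop" | "quarantine"; reason: string }

// Counts blocked events by decision and reason, and logs each one.
class GateStats {
  private counts = new Map<string, number>()

  record(eventId: string, d: GateDecision): void {
    if (d.decision === "forward") return
    const key = `${d.decision}:${d.reason}`
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1)
    // Structured log line, so the reason is searchable later.
    console.warn(JSON.stringify({ msg: "gate_blocked", eventId, decision: d.decision, reason: d.reason }))
  }

  // Total number of dropped or quarantined events.
  blocked(): number {
    let total = 0
    this.counts.forEach(n => { total += n })
    return total
  }

  // The dashboard row: red as soon as anything was blocked.
  status(): "green" | "red" {
    return this.blocked() === 0 ? "green" : "red"
  }
}

const stats = new GateStats()
stats.record("evt_1", { decision: "forward" })
const statusBefore = stats.status()
stats.record("evt_1", { decision: "drop", reason: "replay" })
const statusAfter = stats.status()
console.log(statusBefore, statusAfter)
```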

If any of those answers is wrong, your next incident is already in the queue. It is waiting for a clock to do something weird, for a vendor to flush a backlog, or for an NTP daemon to step instead of slew.

When we built the planning-agent for this Eindhoven thuiszorg, the part we underestimated was not the routing maths or the constraint solver. It was the contract between two systems we did not own. The gate above is the smallest version of that contract we could write. If you are running an AI agent against a healthcare or finance record system and want a second pair of eyes on the boundary, that is the work we do.

Key takeaway

An at-least-once webhook plus a wall-clock dedupe window equals a clock-skew bug, and DST will find it before any other clock anomaly does.

FAQ

What is VECOZO?

VECOZO is the Dutch healthcare communication backbone. Every mutation to an electronic patient record (ECD) travels through its signed tunnel between providers, insurers and registration systems.

Why is at-least-once webhook delivery the default?

Because exactly-once delivery across two systems requires distributed consensus, which is expensive and slow. At-least-once plus consumer-side idempotency is the standard trade-off used by Stripe, AWS, GitHub and most others.

How long should a webhook dedupe window be?

Long enough to cover the provider's documented retry window, with margin for cold-storage replays. Thirty days is a sensible default; one hour is not. The storage cost of seen-event rows is negligible.

Does this only matter at DST transitions?

No. DST is the most predictable clock anomaly, but NTP steps, leap seconds, container clock drift and provider-side retry-queue bugs all produce the same class of duplicate. The gate has to be time-independent.
