Process automation

Idempotent webhooks: 87 ghost container slots at ECT Delta

On Monday 9 June, 87 confirmation emails from ECT Delta hit a Rotterdam forwarder's inbox before 07:14. Three lines of code, eleven days earlier.

Jacob Molkenboer · Founder · A Brand New Company · 18 Jun 2026 · 10 min
The dispatcher at a 21-person freight forwarder in Rotterdam opened her inbox at 07:14 on Monday, 9 June, and saw 87 confirmation emails from ECT Delta. Same containers. Different slot IDs. Some slots were 40 minutes apart on the same chassis. The first call she made was to her terminal planner. The second was to us. The cause was three lines of code merged eleven days earlier, sitting on the wrong side of the line between at-least-once delivery and exactly-once processing.

By 09:30 the team had cancelled 71 of the 87 ghost bookings, paid €4,250 in late-cancellation fees to the terminal, and missed a CSCL vessel's pickup window on three reefers because the chassis those reefers needed had been pulled to a duplicate slot and never recovered in time. The remaining sixteen slots could not be cleanly cancelled before their window closed; we paid for those as well. Total damage by lunchtime: roughly €6,800, plus a customer who wanted to know why his temperature-sensitive cargo was now sitting on tomorrow's manifest.

This is the walkthrough. We're sharing it because the same shape of failure is sitting in a lot of terminal integrations right now, and the fix is mechanical.

The Monday timeline

Friday 30 May, 16:42 — a junior on our side ships a change to the company's process-automation agent. Portbase had pushed a Cargo Information Service notification for an inbound container; the agent claims a slot at ECT Delta, files the PIN, and writes the booking back into the TMS. Standard flow. The junior moves the webhook handler off Lambda direct invocation and onto an SQS Standard queue "for retry safety." It passes review. It passes integration tests.

Monday 9 June, 02:11 — Portbase begins delivering notifications for a heavier-than-normal Monday turn. SQS does what SQS Standard queues do: it delivers at least once. For seventeen of the messages that morning, the broker fans the same notification out twice, three times, in one case six times. The agent processes each delivery as a fresh job and claims a fresh slot. By 07:00 the terminal has 87 booked appointments against 39 containers.

Monday 9 June, 06:51 — the on-call engineer's pager fires. Not from our monitoring; ours saw nothing wrong. The page came from the TMS team, who had noticed an unusual count of slot IDs appearing against a single container. By the time anyone looked at the agent's logs, the damage was done. The agent had been doing exactly what it was told.

The failure is not exotic. It is the most documented behaviour in AWS:

"Standard queues provide at-least-once delivery, which means that each message is delivered at least once. Occasionally (because of the highly distributed architecture that allows nearly unlimited throughput), more than one copy of a message might be delivered out of order."

AWS SQS Developer Guide

The junior knew that. Everyone knows that. The thing he missed is that "exactly-once" is not a property of the queue. It is a property of the consumer. And our slot-claim call had no deduplication of any kind in front of it.

Anatomy of the duplicate

Here is the bad version, simplified to fit on a page.

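import json

async def handle_portbase_event(msg):
    payload = json.loads(msg.body)
    container = payload["equipment_number"]
    # No idempotency key: every call yields a brand-new slot.
    slot = await ect.claim_slot(
        container=container,
        window=payload["preferred_window"],
    )
    # A crash here leaves a claimed slot the TMS never hears about.
    await tms.write_booking(container, slot.id)
    # Only reached on the happy path; any exception means redelivery.
    await msg.delete()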

There are three problems and they compound.

First, ect.claim_slot has no idempotency key. ECT Delta's slot API will happily issue you a new slot for the same container every time you call it. From the terminal's side, you are 87 different shipments that all happen to share an equipment number.

Second, the write to the TMS happens after the slot claim. If the consumer crashes between the two calls (which it did twice that morning, on a database connection blip), SQS redelivers the message, the next worker claims another slot, and the TMS never learns about the first one.

Third, msg.delete() runs only on the happy path. Any thrown exception lets the message return to the queue after the visibility timeout. By design. With nothing else in place, every retry is a fresh booking.
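To make the compounding concrete, here is a minimal simulation of the handler's shape under redelivery. It is a sketch, not our production code: FakeTerminal, simulate, the container number, and the window string are all made up for illustration, and the fake terminal simply issues a new slot on every call, which is the behaviour described above.

```python
import asyncio
import itertools
import json


class FakeTerminal:
    """Hypothetical slot API with no idempotency support: every call books a new slot."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.slots = []  # (slot_id, container) for every slot ever issued

    async def claim_slot(self, container, window):
        slot_id = f"SLOT-{next(self._ids)}"
        self.slots.append((slot_id, container))
        return slot_id


async def handle_portbase_event(body, terminal, bookings):
    # Same shape as the handler above: claim first, record second, no key anywhere.
    payload = json.loads(body)
    container = payload["equipment_number"]
    slot_id = await terminal.claim_slot(container, payload["preferred_window"])
    bookings.setdefault(container, []).append(slot_id)


async def simulate(deliveries):
    terminal, bookings = FakeTerminal(), {}
    body = json.dumps({"equipment_number": "ABCU1234567",
                       "preferred_window": "08:00-08:40"})
    for _ in range(deliveries):  # at-least-once: one message, several deliveries
        await handle_portbase_event(body, terminal, bookings)
    return terminal.slots, bookings


slots, bookings = asyncio.run(simulate(3))
print(f"{len(slots)} slots claimed for {len(bookings)} container")
```

Three deliveries of one message produce three distinct slots for one container. Nothing in the handler can tell a redelivery from a new event.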

The fix is not "switch to FIFO with content-based deduplication." FIFO would have helped on the duplicate-delivery side, but the underlying agent would still be capable of claiming two slots for the same container on two distinct messages: a redrive, a manual replay, a webhook re-fired by Portbase's own retry logic two hours later. The fix has to live at the consumer.

The four-step exactly-once gate

After the incident we wrote the gate down and now run it on every terminal integration. Four steps, in this order. Skipping any of them re-opens the hole.

1. Derive a stable idempotency key at the edge

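The key is not the SQS message ID. SQS assigns that ID when a message is enqueued, so it identifies a queue message, not a business event: the same Portbase notification re-fired upstream or replayed by hand arrives with a new ID. Use a key the upstream system controls. For Portbase CIS, that is the notificationId on the envelope; for ECT slot claims, we hash (equipment_number, requested_window_start, agent_run_id).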

import hashlib

def idempotency_key(event):
    # Built only from upstream-controlled fields, so it is stable across redeliveries.
    raw = f"{event['notificationId']}:{event['equipment_number']}"
    return hashlib.sha256(raw.encode()).hexdigest()

The agent attaches this key to every downstream call it makes for this event. Same event, same key, every time, forever.
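A quick way to check the "same event, same key" property is to feed the function two envelopes that differ only in their queue-level ID. The envelope layout and the queue_message_id field below are hypothetical, as are the notification and container values; only idempotency_key mirrors the function above.

```python
import hashlib


def idempotency_key(event):
    raw = f"{event['notificationId']}:{event['equipment_number']}"
    return hashlib.sha256(raw.encode()).hexdigest()


# Hypothetical envelopes: the queue-level ID changes, the upstream event does not.
first = {"queue_message_id": "q-001",
         "event": {"notificationId": "N-1001", "equipment_number": "ABCU1234567"}}
refire = {"queue_message_id": "q-002",
          "event": {"notificationId": "N-1001", "equipment_number": "ABCU1234567"}}
other = {"queue_message_id": "q-003",
         "event": {"notificationId": "N-1002", "equipment_number": "ABCU1234567"}}

key_first = idempotency_key(first["event"])
key_refire = idempotency_key(refire["event"])
key_other = idempotency_key(other["event"])

print(key_first == key_refire)  # True: same event, same key
print(key_first == key_other)   # False: new notification, new key
```

A test like this is cheap to keep in the suite, and it is the first thing to run when the dedup rate looks wrong.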

2. Reserve the key before any side effect

A small Postgres table with a unique index. Or DynamoDB with a conditional PutItem. The point is that the insert fails if the key already exists, atomically, with no read-then-write race.

INSERT INTO idempotency_log (key, status, created_at)
VALUES ($1, 'in_flight', now())
ON CONFLICT (key) DO NOTHING
RETURNING key;

If RETURNING gives you nothing, another worker already owns this event. You acknowledge the message and exit. You do not claim a slot. You do not write to the TMS. You do nothing.

Warning

The reservation must happen before the first external call, not after. We have seen teams put the dedup check after the slot-claim "to be sure the work succeeded first." That is the bug. You cannot un-claim a terminal slot without paying for it.
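The reservation step can be sketched end to end in a few lines. To keep the example runnable without a database server, this version uses Python's built-in sqlite3 as a stand-in for Postgres and checks rowcount instead of RETURNING; open_log, reserve, and handle are illustrative names, not part of any library.

```python
import sqlite3


def open_log(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotency_log ("
        " key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def reserve(conn, key):
    """Return True if this caller now owns the key, False if it already existed."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO idempotency_log (key, status) VALUES (?, 'in_flight') "
            "ON CONFLICT (key) DO NOTHING",
            (key,),
        )
    return cur.rowcount == 1


def handle(conn, key, side_effect):
    if not reserve(conn, key):
        return "duplicate"  # acknowledge the message and stop: no external call
    side_effect()           # the slot claim would go here, after the reservation
    return "processed"


conn = open_log()
calls = []
results = [handle(conn, "evt-1", lambda: calls.append("claim")) for _ in range(3)]
print(results, len(calls))
```

Three deliveries, one side effect. The ordering inside handle is the whole point: the reservation is the first thing that happens, and the external call only runs for the caller that won the insert.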

3. Pass the key through to the terminal

ECT Delta accepts an Idempotency-Key header on the slot endpoint. Most modern terminal APIs do, and the broader industry pattern is captured in the IETF draft on the Idempotency-Key HTTP header. Send the key. If the terminal sees the same key twice within its retention window, it returns the original slot, not a new one. This is your belt against a redelivery that somehow slipped past step 2: a clock skew, a replica lag, a manual replay against a forgotten queue.

slot = await ect.claim_slot(
    container=container,
    window=payload["preferred_window"],
    headers={"Idempotency-Key": idem_key},
)

If the terminal does not support this header, the gate stops here and you escalate to the integration owner. Do not paper over it with retries on your side. We have one terminal we still cannot integrate cleanly against because of this; we route those bookings to a human queue and accept the cost.
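For testing your own client, it helps to have a fake that behaves the way a key-aware slot endpoint is expected to. The class below is a hypothetical in-memory stand-in, not ECT Delta's implementation, and it ignores the retention window entirely; it only models "same key, same slot".

```python
import itertools


class FakeSlotApi:
    """Hypothetical terminal that remembers the slot issued for each Idempotency-Key."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._by_key = {}

    def claim_slot(self, container, window, headers):
        key = headers["Idempotency-Key"]
        if key in self._by_key:
            return self._by_key[key]  # replay: hand back the original slot
        slot_id = f"SLOT-{next(self._ids)}"
        self._by_key[key] = slot_id
        return slot_id


api = FakeSlotApi()
first = api.claim_slot("ABCU1234567", "08:00-08:40", {"Idempotency-Key": "k1"})
replay = api.claim_slot("ABCU1234567", "08:00-08:40", {"Idempotency-Key": "k1"})
other = api.claim_slot("ABCU1234567", "09:00-09:40", {"Idempotency-Key": "k2"})
print(first, replay, other)
```

The same two-call check, pointed at a terminal's sandbox instead of the fake, is a reasonable first test of whether the header is actually honoured.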

4. Commit the outcome and the key in one transaction

Write the TMS booking and update the dedup row's status to committed in the same database transaction. If the worker dies mid-flight, the key stays in in_flight state and reconciliation finishes the job: query the terminal for the slot the key owns, write it to the TMS, mark committed. No new slot is ever claimed for a key that already has one.

BEGIN;
  UPDATE idempotency_log
     SET status = 'committed', slot_id = $2
   WHERE key = $1;
  INSERT INTO tms_bookings (container, slot_id, source_event)
       VALUES ($3, $2, $1);
COMMIT;
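The all-or-nothing behaviour is easy to demonstrate. This sketch again swaps in sqlite3 for Postgres; the UNIQUE constraint on slot_id is an assumption added here purely to force a failure on the second statement, and the keys, slots, and container numbers are made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE idempotency_log (
    key TEXT PRIMARY KEY, status TEXT NOT NULL, slot_id TEXT);
CREATE TABLE tms_bookings (
    container TEXT NOT NULL, slot_id TEXT NOT NULL UNIQUE, source_event TEXT NOT NULL);
"""


def commit_outcome(conn, key, slot_id, container):
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute(
            "UPDATE idempotency_log SET status = 'committed', slot_id = ? WHERE key = ?",
            (slot_id, key),
        )
        conn.execute(
            "INSERT INTO tms_bookings (container, slot_id, source_event) VALUES (?, ?, ?)",
            (container, slot_id, key),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.executemany(
        "INSERT INTO idempotency_log (key, status) VALUES (?, 'in_flight')",
        [("k1",), ("k2",)],
    )

commit_outcome(conn, "k1", "SLOT-1", "ABCU1234567")
try:
    # Reusing SLOT-1 violates the UNIQUE constraint, so the UPDATE is rolled back too.
    commit_outcome(conn, "k2", "SLOT-1", "XYZU7654321")
except sqlite3.IntegrityError:
    pass

statuses = dict(conn.execute("SELECT key, status FROM idempotency_log"))
print(statuses)
```

The failed attempt leaves k2 in in_flight rather than half-committed, which is exactly the state a reconciliation pass knows how to pick up.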

Reconciliation is the part most teams skip because it feels like over-engineering until the first time it saves you. On our gate, a redrive worker wakes every minute and looks for rows still in_flight after two minutes. For each one it calls the terminal's slot-lookup endpoint with the original idempotency key, takes the slot the terminal already has on file for that key, and finishes the original transaction. We have triggered this path six times in two months — three database failovers, two deploys that landed mid-message, one network partition. In every case the correct slot showed up in the TMS within ninety seconds, with no human input.
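One pass of such a redrive worker might look roughly like this. It is a sketch under several assumptions: sqlite3 stands in for Postgres, created_at is stored as epoch seconds, lookup_slot is a hypothetical callable wrapping the terminal's slot-lookup endpoint, and a key the terminal does not recognise is simply left alone for a human to look at.

```python
import sqlite3

STALE_AFTER_SECONDS = 120  # rows "still in_flight after two minutes"

SCHEMA = """
CREATE TABLE idempotency_log (
    key TEXT PRIMARY KEY, status TEXT NOT NULL, slot_id TEXT, created_at REAL NOT NULL);
CREATE TABLE tms_bookings (
    container TEXT NOT NULL, slot_id TEXT NOT NULL, source_event TEXT NOT NULL);
"""


def reconcile_once(conn, lookup_slot, now):
    """Finish stranded rows. lookup_slot(key) returns (container, slot_id) or None."""
    stale = conn.execute(
        "SELECT key FROM idempotency_log WHERE status = 'in_flight' AND created_at <= ?",
        (now - STALE_AFTER_SECONDS,),
    ).fetchall()
    finished = []
    for (key,) in stale:
        found = lookup_slot(key)
        if found is None:
            continue  # assumption: nothing on file at the terminal, leave the row alone
        container, slot_id = found
        with conn:  # same two writes as the original transaction
            cur = conn.execute(
                "UPDATE idempotency_log SET status = 'committed', slot_id = ? "
                "WHERE key = ? AND status = 'in_flight'",
                (slot_id, key),
            )
            if cur.rowcount == 1:  # skip if another worker finished it first
                conn.execute(
                    "INSERT INTO tms_bookings (container, slot_id, source_event) "
                    "VALUES (?, ?, ?)",
                    (container, slot_id, key),
                )
                finished.append(key)
    return finished


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
now = 1_000.0
with conn:
    conn.executemany(
        "INSERT INTO idempotency_log (key, status, created_at) VALUES (?, 'in_flight', ?)",
        [("k-old", now - 300), ("k-fresh", now - 10)],
    )

on_file = {"k-old": ("ABCU1234567", "SLOT-7")}  # what the fake terminal remembers
finished = reconcile_once(conn, on_file.get, now)
print(finished)
```

Only the stale row is touched; the fresh one is still inside its normal processing window and is left to the worker that owns it. Note that the pass never calls a claim endpoint, only a lookup.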

What you watch after the gate is live

A gate fails silently if you let it. A bug in the key-derivation function, a clock skew on the dedup store, a deploy that drops the header on outbound calls — none of these throw. They just stop deduplicating, and the next bad Monday looks exactly like the last one. We added three signals after the incident and have caught two regressions with them since.

The first is the share of events whose reservation insert hits ON CONFLICT DO NOTHING and returns no row. On a steady-state Portbase feed this sits at 1–3%: Portbase re-fires perfectly good notifications at a low background rate, and you want to see that. A drop to zero usually means key derivation broke. A spike to 20% means an upstream system is misbehaving and you want the page.

The second is the age of the oldest in_flight row in the idempotency log. Anything older than the consumer's longest plausible processing window is a stranded job that reconciliation should already have closed. We alert at ten minutes.

The third is the number of distinct slot IDs the terminal returns for the same key within the dedup window. A healthy gate gets exactly one. Two means either the terminal forgot its own idempotency cache or we sent the wrong key, and both are worth investigating before they accumulate into another inbox of 87.
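The three signals reduce to a handful of comparisons. The sketch below is illustrative only: GateStats and gate_alerts are hypothetical names, the 20% and ten-minute thresholds come from the text above, and min_events is an assumption added so that a quiet window with few events does not trip the zero-ratio alert.

```python
from dataclasses import dataclass


@dataclass
class GateStats:
    total_events: int                # reservation attempts in the window
    conflicts: int                   # attempts that hit ON CONFLICT
    oldest_in_flight_seconds: float  # age of the oldest in_flight row
    max_slots_per_key: int           # distinct slot IDs seen for any single key


def gate_alerts(stats, min_events=500, spike_ratio=0.20, in_flight_limit=600.0):
    alerts = []
    if stats.total_events >= max(min_events, 1):
        ratio = stats.conflicts / stats.total_events
        if stats.conflicts == 0:
            alerts.append("conflict ratio is zero: key derivation may be broken")
        elif ratio >= spike_ratio:
            alerts.append(f"conflict ratio {ratio:.0%}: upstream may be misbehaving")
    if stats.oldest_in_flight_seconds > in_flight_limit:
        alerts.append("stranded in_flight row: reconciliation is behind")
    if stats.max_slots_per_key > 1:
        alerts.append("more than one slot for a single key")
    return alerts


healthy = GateStats(total_events=1000, conflicts=20,
                    oldest_in_flight_seconds=45.0, max_slots_per_key=1)
broken = GateStats(total_events=1000, conflicts=0,
                   oldest_in_flight_seconds=900.0, max_slots_per_key=2)
print(gate_alerts(healthy))
print(len(gate_alerts(broken)))
```

The healthy window yields no alerts; the broken one trips all three. How the counters are collected is left out on purpose, since that depends on whatever metrics stack is already in place.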

What changed for the forwarder

We deployed the gate within 36 hours of the incident. Two months on, the same agent has processed 14,800 Portbase notifications with zero duplicate bookings. The dedup table sits at 180 MB. The latency cost is one Postgres roundtrip, call it 4 ms at the p99, which matters less than not paying €4,250 on a Monday.

The wider point: at-least-once is a contract every modern queue ships with. SQS Standard, Pub/Sub, Kafka, Portbase's own webhook retry, ECT Delta's UI replays. You will see the same event again. The only question is whether your consumer is ready when you do.

Takeaway

Exactly-once is not a queue feature. It is discipline at the consumer: derive a key, reserve it, pass it through, commit it transactionally.

When we built the AI agents for that Rotterdam forwarder, the thing we ran into was that nobody — not us, not the terminal, not Portbase — owned the end-to-end exactly-once story. We ended up writing the gate as a small library every one of our terminal integrations now imports on day one, and we audit for it on every code review.

If you run a terminal integration today, the smallest useful thing you can do this afternoon: grep your consumer for the call that books, charges, or claims something external, and check whether the line above it reserves an idempotency key. If it doesn't, you have the same bug. Fix that one first.

FAQ

Why not just switch from SQS Standard to SQS FIFO?

FIFO with content-based dedup helps within a five-minute window, but the underlying agent can still claim two slots for the same container on two genuinely distinct events: a manual replay, a Portbase re-fire two hours later, a redrive from DLQ.

Where should the idempotency store live?

Wherever you can do an atomic conditional insert. We use Postgres with a unique index on the key column. DynamoDB with a conditional PutItem works equally well. Redis SETNX is fine for short windows but loses the audit trail.

What if the terminal API does not accept an Idempotency-Key header?

Stop at step 2 and route those bookings through a human approval queue. Do not retry on your side. Without server-side recognition of the key, retries become a guaranteed duplicate the moment your dedup store is wrong.

How long do we keep idempotency keys around?

Long enough to outlive every upstream retry policy you depend on. For Portbase CIS plus SQS plus operator replays, 14 days is the floor. We keep 90 days for forensics and partition the table by month for cleanup.
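With monthly partitions, cleanup becomes a question of which whole months have aged out. A minimal sketch of that calculation, assuming partitions are identified by (year, month) pairs; the function names and the example dates are made up.

```python
from datetime import date, timedelta


def next_month_start(year, month):
    return date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)


def droppable_partitions(partitions, today, keep_days=90):
    """Return the (year, month) partitions whose every row is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    return [(y, m) for (y, m) in partitions if next_month_start(y, m) <= cutoff]


months = [(2025, 3), (2025, 4), (2025, 5), (2025, 6)]
print(droppable_partitions(months, today=date(2025, 8, 15)))
```

A partition is only dropped once its newest possible row has passed the cutoff, so the effective retention is always at least the configured number of days.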

Tags: process automation, AI agents, integrations, architecture, case study, workflow
