
Email automation

Email agent post-mortem: a stale RAG index, 340 bad replies

At 10:47 on a Tuesday, a Utrecht customs broker found 340 confident email replies built on tariff data retired in January. Here is how, and how we stopped it.

Jacob Molkenboer · Founder · A Brand New Company · 7 Nov 2025 · 9 min

At 10:47 on a Tuesday morning, a customs broker on the second floor of a 28-person Utrecht freight-forwarder opened her Outlook search box and typed the name of a Polish importer. She found seventeen replies the email agent had sent overnight. Every one of them quoted an HS code that had been retired in the January 2026 EU tariff update. None of the customers had written back yet, which was the only piece of good news.

By the time the operations lead called us, the count was 340. Three hours of confident auto-replies, every one of them wrong in a small but expensive way. This is what happened, why nothing fired, and the four-step pre-deploy gate we now run on every retrieval layer the freight-forwarder ships.

What the agent was doing

The freight-forwarder runs douane-aangiftes (customs declarations) through the AGS portal for around forty importers a month. Their email agent had been live for nine weeks. We had built it to triage douane-vragen, the day-to-day questions importers send about classification, BTW (VAT) codes, and origin documents. Roughly 60% of those questions repeat enough that a RAG-backed agent can answer them within minutes instead of hours. The rest get routed to the operations team.

The retrieval layer indexed two corpora: the European tariff database (HS codes, duty rates, BTW mappings) and the freight-forwarder's own internal notes (Incoterms quirks per importer, recurring shipment patterns, classification precedents). The agent stitched the two together and drafted a reply. A human approved anything above €15,000 declared value. Below that threshold, low-confidence answers got held for review. High-confidence answers went straight out.
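The routing rule is simple enough to sketch. This is a minimal illustration, not the production code: the €15,000 limit comes from the description above, while the confidence cutoff, the `Draft` type, and the return labels are assumptions made up for the example.

```python
from dataclasses import dataclass

APPROVAL_LIMIT_EUR = 15_000   # from the routing rule described above
CONFIDENCE_CUTOFF = 0.80      # hypothetical value; the real cutoff is not stated


@dataclass
class Draft:
    declared_value_eur: float
    confidence: float  # 0.0 to 1.0, as reported by the agent


def route(draft: Draft) -> str:
    """Decide where a drafted reply goes next."""
    if draft.declared_value_eur > APPROVAL_LIMIT_EUR:
        return "human_approval"
    if draft.confidence < CONFIDENCE_CUTOFF:
        return "hold_for_review"
    return "send"
```

Note what this rule cannot see: a stale corpus does not lower the confidence score, so a wrong answer on a small shipment takes the `send` branch like any other.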

The hot-swap

A junior developer on the freight-forwarder's side had inherited the index a fortnight earlier. The team asked him to refresh the HS code corpus with the new tariff numbers and the latest goederencode-to-VAT mappings. He did it carefully. He pulled the new dataset from the shared drive, ran the embedder, swapped the FAISS index in place, restarted the worker, and watched the health check go green.

He did not bump the embedding-model version string. The model itself had not changed, only the corpus underneath it. From the agent's perspective, nothing had moved. The query embedder, the cache layer, the confidence threshold, every one of them said "still version 4.2, all clear."

What he missed: the file he pulled from the shared drive was a stale export from August 2025. The customs lead had updated the canonical file in October but never re-uploaded it to the document store. So the fresh index he built was eight months behind the live TARIC data. The version tag gave no signal at all.

Why nothing fired

This is the failure mode we worry about with retrieval layers. There is no dimension mismatch when only the underlying documents shift. The retriever returns plausible chunks. The agent generates fluent Dutch text. The confidence scores look the way they always do. No exception lands in Sentry, no PagerDuty page, no Slack alert.

The agent kept replying. For HS code 8517.62.00 in particular (routers and wireless network gear), it gave an outdated duty rate to nineteen separate customers in three hours. The actual rate had moved by 1.4 percentage points. Small enough that nobody flagged it in real time. Big enough that on a €40,000 shipment of network switches, a customer would walk away thinking they could clear at €560 less than they actually could. Multiply that across 340 threads and you arrive at the size of the bill we did not want to write.
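The arithmetic behind that figure, as a quick sketch. It assumes duty is levied on the full declared value of the shipment, which is a simplification of how customs value is actually determined.

```python
from decimal import Decimal

shipment_value_eur = Decimal("40000")
rate_shift = Decimal("0.014")  # 1.4 percentage points, expressed as a fraction

# Duty the customer did not expect to pay, under the assumption above.
understated_duty_eur = shipment_value_eur * rate_shift
```

`understated_duty_eur` comes out at 560 euros, the gap quoted above.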

Warning

The failure mode is silence, not noise. If your retrieval layer's health checks only fire on errors, you do not have health checks. You have a smoke detector wired into the same circuit as the fire.

The four-step pre-deploy gate

We rolled the index back at 11:12 AM and sent corrections to every affected thread before the end of the day. The cleanup took an afternoon. The pre-deploy gate took a week to build, and we now run it on every retrieval layer the freight-forwarder ships, including the ones that never touch customs data. The pattern transfers.

1. Corpus diff against the previous snapshot

Before the new index goes live, we run a diff against the last known-good snapshot. Not a byte diff. A semantic diff: how many documents are added, removed, or changed by more than 30%, and what is the date range of the source files. If more than 10% of the corpus is older than the previous snapshot's median document age, we hard-stop the deploy. That single rule would have caught the August export.

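from statistics import median_low


class DeployBlocked(Exception):
    """Raised when a gate check refuses to promote a new index."""


def corpus_age_check(new_docs, prev_snapshot, max_older_ratio=0.10):
    """Block the deploy if too many new docs predate the previous median.

    Each doc is expected to expose a `source_date` (datetime.date).
    """
    if not new_docs:
        raise DeployBlocked("New corpus is empty. Hard stop.")
    # median_low returns an actual date from the data. Plain median()
    # averages the two middle values, which fails on date objects.
    prev_median = median_low(d.source_date for d in prev_snapshot.docs)
    older = sum(1 for d in new_docs if d.source_date < prev_median)
    ratio = older / len(new_docs)
    if ratio > max_older_ratio:
        raise DeployBlocked(
            f"{older} of {len(new_docs)} docs older than "
            f"previous median ({prev_median.isoformat()}). "
            f"Hard stop. Re-pull the source files."
        )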
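The age check covers the date-range half of the diff. The added, removed, and changed counts could be sketched as below. The input shape (a dict from document id to text) is an assumption, and `difflib` character similarity is only a stand-in for whatever semantic comparison you prefer; the 30% threshold is the one named above.

```python
from difflib import SequenceMatcher


def corpus_diff(prev: dict, new: dict, change_threshold: float = 0.30) -> dict:
    """Count added, removed, and heavily changed documents.

    prev and new map a document id to its text (assumed shape).
    """
    added = [doc_id for doc_id in new if doc_id not in prev]
    removed = [doc_id for doc_id in prev if doc_id not in new]
    changed = []
    for doc_id in new.keys() & prev.keys():
        # ratio() is 1.0 for identical texts and 0.0 for nothing in common.
        similarity = SequenceMatcher(None, prev[doc_id], new[doc_id]).ratio()
        if 1.0 - similarity > change_threshold:
            changed.append(doc_id)
    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "changed": sorted(changed),
    }
```

The output is deliberately small: three lists a reviewer can read in one pass.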

2. Fifty golden queries

A small set of questions with known-correct answers, frozen in a YAML file, owned by the customs lead (not by engineering). Every pre-deploy run answers all fifty. If more than two answers disagree with the gold standard, the deploy stops and a human reviews the diff. The set updates quarterly, never silently. We picked fifty because that is the number the customs lead is willing to re-review in one sitting. The right number for your team is whatever they will actually look at.
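A minimal sketch of the golden-query runner, under some assumptions: the queries are already loaded into a list of dicts (YAML parsing is left out), `answer` is a hypothetical callable wrapping the agent, and "disagrees" is reduced to "the expected value does not appear in the reply". A real comparison would likely be stricter.

```python
from typing import Callable

MAX_DISAGREEMENTS = 2  # more than two disagreements stops the deploy


def run_golden_queries(golden: list, answer: Callable[[str], str]) -> list:
    """Return the golden entries whose expected value is missing from the reply."""
    failures = []
    for item in golden:
        reply = answer(item["question"])
        if item["expected"] not in reply:
            failures.append(
                {"question": item["question"], "expected": item["expected"], "got": reply}
            )
    return failures


def gate_passes(failures: list) -> bool:
    """True when the number of disagreements is within the allowed limit."""
    return len(failures) <= MAX_DISAGREEMENTS
```

Returning the failures rather than a bare boolean matters here: the list is the diff the human reviews when the deploy stops.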

3. Manifest with content hash and model card

Every index now carries a JSON manifest. The query path reads it at boot. If the manifest's embedding-model-version is bit-for-bit identical to the previous index's, the deploy script forces the developer to either bump the version or pass a --no-bump flag with a written reason. The flag fires a Slack message into the engineering channel. It is, deliberately, hard to slip past.

{
  "embedding_model": "text-embedding-3-large",
  "embedding_model_version": "2026-06-17.1",
  "corpus_content_hash": "sha256:7c4f9a2e...",
  "corpus_source_date_range": ["2026-01-01", "2026-06-15"],
  "document_count": 8412,
  "built_by": "thijs@example.nl",
  "built_at": "2026-06-17T09:14:00Z"
}
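Two pieces of this are easy to sketch: computing the content hash, and refusing an unchanged version string. The function names and the exception are illustrative, not the deploy script itself, and the Slack notification is left out.

```python
import hashlib
from typing import Optional


def corpus_content_hash(docs: list) -> str:
    """Hash document texts in sorted order so input ordering does not matter."""
    digest = hashlib.sha256()
    for text in sorted(docs):
        data = text.encode("utf-8")
        # A length prefix keeps document boundaries unambiguous.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return "sha256:" + digest.hexdigest()


class VersionNotBumped(Exception):
    """Raised when a new index reuses the previous version string without a reason."""


def check_version_bump(prev_manifest: dict, new_manifest: dict,
                       no_bump_reason: Optional[str] = None) -> None:
    """Require a version bump, or a written reason for skipping it."""
    same_version = (
        prev_manifest["embedding_model_version"]
        == new_manifest["embedding_model_version"]
    )
    if same_version and not no_bump_reason:
        raise VersionNotBumped("Version unchanged: bump it or give a written reason.")
```

With a hash like this in the manifest, the Tuesday swap would have shown a changed corpus under an unchanged version string, which is exactly the combination worth stopping on.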

4. Shadow run for ten minutes

The new index goes up beside the old one, not in place of it. For ten minutes, every live query hits both. The two answer sets are scored against each other on simple semantic similarity. If the divergence rate is above 5%, the new index does not get promoted. The old one keeps serving. A human gets paged.

Ten minutes is enough for this freight-forwarder because Tuesday morning is their peak. For lower-volume corpora, we run the shadow against a recorded query log from the previous week.
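The promotion decision can be sketched in a few lines. The 5% limit is the one given above; the per-pair similarity floor is a hypothetical value, and `difflib` character similarity is a stand-in for a real semantic scorer.

```python
from difflib import SequenceMatcher

DIVERGENCE_LIMIT = 0.05  # above 5% divergence, the new index is not promoted
SIMILARITY_FLOOR = 0.90  # hypothetical: below this, a pair counts as diverged


def similarity(a: str, b: str) -> float:
    """Stand-in scorer; swap in an embedding-based comparison in practice."""
    return SequenceMatcher(None, a, b).ratio()


def divergence_rate(pairs: list) -> float:
    """Share of (old_answer, new_answer) pairs that differ too much."""
    if not pairs:
        raise ValueError("No shadow traffic recorded; cannot score the new index.")
    diverged = sum(1 for old, new in pairs if similarity(old, new) < SIMILARITY_FLOOR)
    return diverged / len(pairs)


def should_promote(pairs: list) -> bool:
    """Promote only when divergence stays at or below the limit."""
    return divergence_rate(pairs) <= DIVERGENCE_LIMIT
```

The empty-traffic case raises instead of returning zero divergence. No queries during the window is not evidence that the new index agrees with the old one.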

The cleanup

340 replies in three hours is small enough that a human can write personalised follow-ups. We did. The corrections went out in batches of fifty, each one signed by the customs lead, each one starting with the same first sentence: "We need to correct an answer we sent you earlier today." No hedging, no "due to a technical issue." Specific HS code, specific revised duty rate, specific number.

Two customers replied the next day asking how we caught it. Both stayed.

What we kept and what we changed

We kept the agent. We kept the RAG pattern. The volume of douane-vragen the freight-forwarder gets is real, and the cost of having a human read every classification question for the third time that week was higher than the cost of one bad Tuesday. What we changed was the assumption that the model not moving is the same as the answer not moving.

The customs lead now signs off on every gate output before a deploy ships. She does not write code. She reads the diff, eyeballs the golden-query results, and ticks a box. That five-minute checkpoint is the most important part of the system. The gate exists so that her review is on a small, comprehensible delta, not on the entire corpus.

Since rollout the freight-forwarder has shipped 23 index updates through the gate. Two were blocked. One had a stale export similar to the original incident, caught at corpus diff. The other was a clean update where the customs lead spotted a wording shift in a BTW interpretation note that engineering had not flagged. The agent would have replied accurately, just in a tone she disliked. Small thing. Caught before it shipped.

The five-minute thing you can do today

Building the email agent for this Utrecht client, we learned that retrieval failures look identical to retrieval successes until somebody downstream catches them. We solved it by treating every index swap as a deploy and putting a four-step gate in front of it.

Open the manifest file for your RAG index. If it does not have a content hash, an embedding-model version, and a corpus date range, add them now. The gate is harder to build. The manifest is the foundation. Without it, you cannot tell whether yesterday's index is the same as today's, and you cannot roll back when it is not.
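A check like this takes a few minutes to write. It is a sketch that assumes the manifest is JSON and uses the same field names as the example manifest earlier in this post; adjust them to your own.

```python
import json

REQUIRED_FIELDS = (
    "corpus_content_hash",
    "embedding_model_version",
    "corpus_source_date_range",
)


def missing_manifest_fields(manifest_json: str) -> list:
    """Return the required fields that are absent or empty in the manifest."""
    manifest = json.loads(manifest_json)
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]
```

An empty list means the foundation is in place. Anything else is the to-do list.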

Key takeaway

The model staying the same is not proof the answer stayed the same. Give every RAG index a manifest with a content hash and an embedding-model version so you can tell the difference.

FAQ

What is a RAG hot-swap?

Replacing a retrieval-augmented generation index in place while the agent keeps running. A common pattern, and a risky one when the new index quietly disagrees with what the agent's version tags still claim.

Why does the embedding-model version matter if the model itself has not changed?

It is the marker downstream systems use to invalidate caches, alert humans, and trigger re-validation. If only the corpus changes but the version stays the same, every safety check assumes nothing happened.

What is AGS?

Aangifte Goederen Systeem, the Dutch customs declaration portal freight-forwarders use to file douane-aangiftes. It is being phased out for DMS 4.0 over the next year, which is one more reason to keep your tariff data version-tagged.

How long should a shadow run last before promoting an index?

Long enough to cover the index's normal query distribution. For most email agents, ten to thirty minutes of live traffic at peak is enough. For low-volume corpora, run against a recorded query log instead.

Do you need all four gate steps, or is one enough?

The corpus diff alone would have caught this incident. The other three catch different failure modes: prompt regressions, silent model drift, and tone changes a customs lead would notice but engineering would not.

Tags: email automation, ai agents, rag, operations, case study, automation
