Email agent incident: 312 replies from a stale RAG index

One Tuesday morning, the email agent at a 26-person broker in Den Bosch confirmed the wrong coverage limit on 312 polisaanvragen (policy applications). The cause was a Pinecone namespace that quietly went stale during a reindex.

Jacob Molkenboer · Founder · A Brand New Company · 18 Jun 2026 · 9 min

The first call came in at 09:47. A policyholder in Tilburg, calmly furious, wanted to know why his liability cover was now €250,000 instead of the €1,000,000 he had signed up for last month. The account manager pulled the email thread. There it was, in the agent's own friendly Dutch: "Wij bevestigen uw dekkingslimiet aansprakelijkheid van EUR 250.000." ("We confirm your liability coverage limit of EUR 250,000.") Confident. Specific. Wrong.

By 10:15, three more calls. By 10:40, the shared new-business inbox had 27 reply-all "huh?" threads from customers and brokers. We pulled the agent off the SMTP relay at 10:52. By then it had sent 312 of these.

The agent had been running for fourteen months without an incident worth writing down. The team trusted it for the routine reply path: incoming policy application, retrieve the matching voorwaarden, draft a confirmation, send through the relay with a from header pointing at the right account manager. The team trusted it enough to skip the "send to review queue" toggle for confirmations under €5,000 in annual premium.

This is the incident report. The fix at the end is the thing worth keeping.

The five-minute version

Monday night, our content team rebuilt the Pinecone index that holds the broker's product voorwaarden — about 4,800 chunks across 11 product lines. Reindexing is a quarterly chore. The script writes a new namespace, runs a smoke query, swaps the alias, then deletes the old namespace after a 24-hour grace period.

The smoke query passed. The alias swap reported success. The next nightly cron, at 03:00 on Tuesday, deleted the old namespace.

But the alias swap never actually pointed at the new namespace. The Pinecone SDK call returned a 200, the wrapper around it logged swap_ok=true, and the alias itself still pointed at the old namespace, which by then had been deleted. Pinecone happily resolves an alias like that, until it doesn't, by falling back to the most recent namespace with the same prefix. In this client's project that was voorwaarden-2024Q3: terms from before the September 2025 product refresh, which raised three of the standard cover limits by a factor of four.

The agent retrieved old chunks. The chunks looked authoritative. The model dutifully wrote a confirmation citing the old number. The relay sent it. 312 times, before anyone noticed the discrepancy in a human reply.

An index that returns answers is not the same as an index that returns current answers. A vector store can be stale and lively at the same time. Nothing in the retrieval call will tell you.

How the existing guardrails missed it

The agent was not unguarded. Before we walked into Tuesday, the pipeline had:

a Pydantic schema check on the model's structured output, so it could not hallucinate the shape of a confirmation;
a regex pass that refused to send any mail containing a euro amount that did not appear verbatim in the retrieved context;
a per-account-manager send-rate cap (twelve mails per minute) that would have caught a runaway loop;
a "send to review queue" toggle for premiums above €5,000.

None of them fired. The euro amount in the output did appear in the retrieved context — that was the whole problem. The context was wrong, but it was internally consistent with itself and with the draft. The Pydantic schema was satisfied. The rate cap was nowhere near tripped. The premium amounts on the affected polisaanvragen were all between €240 and €1,800 per year.

The lesson, sharper than we'd like: validating the model's output against its retrieved context cannot catch retrieval that is silently from the wrong source. The OWASP LLM Top 10 files this kind of failure under vector and embedding weaknesses, which is the polite way of saying you need a check on the source itself.
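To make the gap concrete, here is a minimal sketch of a context-grounding check in the spirit of the regex pass above, run against a stale chunk. The regex, the function name amounts_grounded, and both chunk texts are illustrative assumptions rather than the production code. Only the draft sentence and the two limits come from the incident.

```python
import re

# Matches amounts such as "EUR 250.000" or "€1.000.000" (Dutch thousands separator).
EURO_AMOUNT = re.compile(r"(?:EUR|€)\s?\d{1,3}(?:\.\d{3})*")


def amounts_grounded(draft: str, context: str) -> bool:
    """Return True when every euro amount in the draft appears verbatim in the context."""
    return all(amount in context for amount in EURO_AMOUNT.findall(draft))


# Hypothetical chunk texts, for illustration only.
stale_chunk = "Dekkingslimiet aansprakelijkheid: EUR 250.000 per aanspraak."
current_chunk = "Dekkingslimiet aansprakelijkheid: EUR 1.000.000 per aanspraak."
draft = "Wij bevestigen uw dekkingslimiet aansprakelijkheid van EUR 250.000."

# The guardrail compares the draft with whatever was retrieved, so a stale
# source makes the wrong draft look perfectly grounded.
stale_passes = amounts_grounded(draft, stale_chunk)
# It would only object if it were handed the source it never saw.
current_passes = amounts_grounded(draft, current_chunk)
```

The check is doing its job in both cases. It simply has no way to know which of the two chunks it should have been given.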

What the swap actually looked like

We went through the Pinecone control-plane logs the next morning with their support team. The relevant sequence, slightly simplified:

# Monday 23:14 — reindex job on the deploy server
pinecone-cli upsert \
  --index voorwaarden-prod \
  --namespace voorwaarden-2026Q2-rc1 \
  --file ./chunks-2026Q2.jsonl

# 23:41 — smoke query against the new namespace, BY NAME
pinecone-cli query --namespace voorwaarden-2026Q2-rc1 \
  --vector @./smoke.vec --top-k 3
# returns three chunks from the new namespace. OK.

# 23:42 — alias swap (wrapper around metadata update)
./swap-alias.sh voorwaarden-current voorwaarden-2026Q2-rc1
# logs: swap_ok=true

# Tuesday 03:00 — nightly GC, deletes namespaces older than 24h with no alias
pinecone-cli delete --namespace voorwaarden-2026Q1
# (the still-aliased namespace, because the alias never moved)

The swap-alias.sh script was wrong in a small way that had been wrong for at least two quarters. It updated a metadata field that Pinecone also exposes as alias, but that field is not the routing alias. The routing alias is set through a separate endpoint. The script's swap_ok=true was based on the HTTP 200 from the metadata update, which had nothing to do with where queries actually went.

The previous two reindexes worked anyway because the script also renamed the new namespace to the canonical name voorwaarden-current as a fallback, and queries against that name resolved correctly. This quarter, someone (one of us, honestly) cleaned up the script and removed the rename step because "the alias does that now". The alias did not, in fact, do that.

The dual-index swap-gate

You can solve this at three layers. We now solve it at all three, because each one would have caught the incident on its own, and we are not done being embarrassed.

Layer 1: fingerprint the index, stamp the retrieval

Every time we build a new namespace, we compute a SHA-256 over the sorted chunk IDs plus their content hashes and write it to a small Postgres table. The vector-store wrapper stamps that fingerprint onto every retrieval result. The SMTP relay refuses any mail whose retrieval fingerprint is not in a small allowlist updated by the deploy pipeline.

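# rag/retrieve.py
# RetrievalResult, resolve_alias, index_fingerprints, embed, utcnow and the
# pinecone wrapper are project helpers defined elsewhere in the codebase.
def retrieve(query: str, k: int = 6) -> RetrievalResult:
    # Resolve the alias on every call so the result records the namespace actually queried.
    namespace = resolve_alias("voorwaarden-current")
    fingerprint = index_fingerprints.get(namespace)  # Postgres lookup
    chunks = pinecone.query(namespace=namespace, vector=embed(query), top_k=k)
    # Stamp namespace and fingerprint onto the result so the relay can check them later.
    return RetrievalResult(
        chunks=chunks,
        namespace=namespace,
        fingerprint=fingerprint,
        retrieved_at=utcnow(),
    )

# relay/gate.py
# Fingerprints the deploy pipeline has approved for sending.
ALLOWED_FINGERPRINTS = load_from_deploy_manifest()

def allow_send(mail: OutboundMail) -> bool:
    # Hold the mail and page on-call when it is grounded in an unapproved index.
    if mail.retrieval.fingerprint not in ALLOWED_FINGERPRINTS:
        hold(mail, reason="unknown_index_fingerprint")
        page_oncall()
        return False
    return True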

Had this been in place Monday night, the relay would have held all 312 mails the moment the agent retrieved from the stale namespace, because that namespace's fingerprint was not on the allowlist the deploy pipeline had written.
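The fingerprint itself can be computed in a few lines. The sketch below assumes the chunks are available as a mapping from chunk ID to chunk text. The function names and the byte framing between fields are our illustrative choices, not a fixed format.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of one chunk's text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def index_fingerprint(chunks: dict[str, str]) -> str:
    """SHA-256 over the sorted chunk IDs plus their content hashes.

    `chunks` maps chunk ID -> chunk text (an assumed input shape).
    """
    digest = hashlib.sha256()
    for chunk_id in sorted(chunks):
        # Frame each entry as ID, NUL byte, content hash, newline so that
        # adjacent fields cannot run together ambiguously.
        digest.update(chunk_id.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(content_hash(chunks[chunk_id]).encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Sorting the IDs makes the fingerprint independent of upsert order, while any changed, added, or removed chunk produces a different value. That is the property the relay allowlist relies on.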

Layer 2: dual-index disagreement check

For the first 72 hours after every reindex, we keep the previous namespace alive and run every production query against both the new and the old index. We compute a Jaccard similarity over the top-6 chunk IDs and a cosine similarity over the chunk-content embeddings.

If either drops below a per-product threshold (we tune this from the smoke set, usually ~0.6 Jaccard for stable products and ~0.4 for products that genuinely changed), the agent's draft goes to the human review queue with both retrievals attached. The reviewer sees side-by-side what the old and new index returned, and either approves the new behaviour or rolls back.
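A rough sketch of the disagreement check follows. It assumes the cosine comparison is taken between the mean embedding of each result set, which is one reasonable reading of the description above. The function names are hypothetical, and the cosine threshold of 0.9 is a placeholder rather than a tuned value. Only the Jaccard figure comes from our smoke set.

```python
import math


def jaccard(ids_a: list[str], ids_b: list[str]) -> float:
    """Overlap of two chunk-ID lists: |A ∩ B| / |A ∪ B|."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def needs_review(
    new_ids: list[str],
    old_ids: list[str],
    new_vecs: list[list[float]],
    old_vecs: list[list[float]],
    jaccard_min: float = 0.6,   # per-product value, tuned from the smoke set
    cosine_min: float = 0.9,    # placeholder, not a value from the incident
) -> bool:
    """True when the old and new index disagree enough to route to review."""
    ids_agree = jaccard(new_ids, old_ids) >= jaccard_min
    content_agrees = cosine(mean_vector(new_vecs), mean_vector(old_vecs)) >= cosine_min
    return not (ids_agree and content_agrees)
```

Running both measures matters because they fail differently. A reindex that re-chunks the same text shifts the IDs but not the content, while a changed limit can keep the IDs and shift the content.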

This is more expensive — two retrievals per mail — but only during the 72-hour overlap, and only for retrieval-grounded outbound. For this client that costs us about €11 per quarter in Pinecone read units. Acceptable.

Layer 3: a held-mail buffer with a content tripwire

The relay holds every retrieval-grounded mail for 90 seconds before sending. During those 90 seconds, a separate consumer watches the held queue. If more than 20 mails citing the same retrieved chunk pass through inside that window, the queue pauses and pages. Twenty is roughly twice the legitimate burst from a Monday-morning new-business batch for this client. Yours will be different. The point is that the tripwire watches what is in the mails, not just how many there are.

Layer 3 would have caught this incident the moment the same wrong dekkingslimiet appeared in the 21st outbound draft. We would still have sent twenty bad mails, which is 292 fewer than 312 and well below the threshold where the right thing to do is call the customers personally.
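The tripwire can be sketched as a sliding-window counter per chunk ID. The class and method names here are hypothetical, and the real consumer would pause the queue and page instead of returning a boolean. The 20-mail limit and 90-second window are the values described above.

```python
from collections import defaultdict, deque


class ChunkTripwire:
    """Counts held mails per cited chunk ID inside a sliding time window."""

    def __init__(self, max_per_chunk: int = 20, window_seconds: float = 90.0) -> None:
        self.max_per_chunk = max_per_chunk
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # chunk ID -> timestamps of recent mails

    def observe(self, chunk_ids: list[str], now: float) -> bool:
        """Record one held mail; return True when the queue should pause."""
        tripped = False
        for chunk_id in set(chunk_ids):  # count each chunk once per mail
            times = self._seen[chunk_id]
            times.append(now)
            # Drop timestamps that have fallen out of the window.
            while times and now - times[0] > self.window_seconds:
                times.popleft()
            if len(times) > self.max_per_chunk:
                tripped = True
        return tripped
```

Called once per held mail with the chunk IDs from its retrieval stamp, it stays quiet for varied traffic and trips on the 21st mail that cites the same chunk within 90 seconds.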

Takeaway

If your RAG pipeline can only check the model's output against its retrieved context, you are checking that the agent is internally consistent — not that it is right. Add a check on the source.

What it cost the client

Three full days for the operations lead and one of the account managers, calling 312 customers and brokers, sending corrections, and re-issuing two policies that had already been booked on the wrong terms. No regulatory escalation — the AFM-relevant data (the IPID, the policy document itself) was correct in the attachment, and the body text disagreeing with the attachment was, legally, a clerical error.

Trust cost is harder to count. Two of the brokers in the IB-network the client works through asked, politely, what changed. We wrote them an honest one-pager. They stayed.

What we would not do again

We would not run a reindex Monday night before a Tuesday volume day. We now reindex on Friday morning, with a human watching the first hour of overlap traffic.

We would not trust a wrapper script's ok=true for an operation we cannot independently verify with a different call. The swap-alias.sh script now ends with a query through the alias for a canary chunk that exists only in the new namespace, and fails loud if the returned chunk's ID is not the one we just upserted.
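As an illustration of that canary check, the sketch below uses a small in-memory stand-in for an aliased vector store. FakeVectorStore, its two alias dictionaries, verify_swap, and the chunk IDs are all hypothetical. None of this is the Pinecone API. It only reproduces the shape of the bug: a label that moves while the routing pointer does not.

```python
from typing import Optional


class SwapVerificationError(RuntimeError):
    """Raised when a query through the alias does not reach the new namespace."""


class FakeVectorStore:
    """In-memory stand-in for an aliased vector store. Not the Pinecone API."""

    def __init__(self) -> None:
        self.namespaces: dict[str, set[str]] = {}  # namespace -> chunk IDs
        self.metadata_alias: dict[str, str] = {}   # descriptive label, routes nothing
        self.routing_alias: dict[str, str] = {}    # the pointer queries actually follow

    def upsert(self, namespace: str, chunk_ids: set[str]) -> None:
        self.namespaces.setdefault(namespace, set()).update(chunk_ids)

    def fetch_via_alias(self, alias: str, chunk_id: str) -> Optional[str]:
        namespace = self.routing_alias.get(alias)
        ids = self.namespaces.get(namespace, set())
        return chunk_id if chunk_id in ids else None


def verify_swap(store: FakeVectorStore, alias: str, canary_id: str) -> None:
    """Fail loudly unless the canary is reachable through the alias."""
    if store.fetch_via_alias(alias, canary_id) != canary_id:
        raise SwapVerificationError(f"alias {alias!r} does not reach canary {canary_id!r}")


store = FakeVectorStore()
store.upsert("voorwaarden-2026Q1", {"chunk-001"})
store.routing_alias["voorwaarden-current"] = "voorwaarden-2026Q1"
# The canary chunk exists only in the new namespace.
store.upsert("voorwaarden-2026Q2-rc1", {"chunk-001", "canary-2026Q2"})

# The buggy swap: only the metadata label moves, and the write "succeeds".
store.metadata_alias["voorwaarden-current"] = "voorwaarden-2026Q2-rc1"
try:
    verify_swap(store, "voorwaarden-current", "canary-2026Q2")
    buggy_swap_detected = False
except SwapVerificationError:
    buggy_swap_detected = True

# The real swap moves the routing pointer, and the canary becomes reachable.
store.routing_alias["voorwaarden-current"] = "voorwaarden-2026Q2-rc1"
verify_swap(store, "voorwaarden-current", "canary-2026Q2")
```

The design choice is that the verification takes the same path production traffic takes. A success flag from the write call says nothing about the read path, so the check has to go through the alias.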

We would not skip the review queue for any retrieval-grounded mail in the first 72 hours after a reindex, regardless of premium size. The cost of an extra reviewer click for three days is small. The cost of 312 confident, specific, wrong emails is not.

What to do today

Pull up the last successful retrieval log from your agent and ask: if the index it queried were silently swapped for an old copy, what in the current pipeline would notice? If the honest answer is "the customer who replies", you have the same gap we had on that Tuesday morning. The smallest useful first step is layer 1: fingerprint the index, stamp the retrieval, refuse anything unknown at the send boundary. About sixty lines of code and one Postgres table.

When we rebuilt the email agent for this Den Bosch broker, the wrinkle worth naming was that the Pinecone metadata-update endpoint returns 200 for a no-op, which means the wrong script can lie convincingly for years. We solved it by treating every retrieval-grounded outbound as suspect-by-default and proving freshness at the relay, not at the retriever.

FAQ

Why did the smoke query pass if the alias was broken?

The smoke query was run against the new namespace by name, not through the alias. It proved the upsert worked. It did not prove that production traffic would actually reach the new namespace.

Is this a Pinecone-specific failure?

No. Any aliased vector store has the same failure mode. The dual-index swap-gate works against pgvector, Weaviate, Qdrant or anything else with a routing pointer that can silently lie about where queries land.

How long do you keep the dual-index overlap running?

72 hours after a reindex, sometimes longer for products with high voorwaarden churn. The extra read units cost a handful of euros per quarter. Cheap relative to one bad Tuesday.

Could you have just versioned the index name with the date?

Yes, and we now do as a secondary safety net. The alias model is fine, but only if every swap is verified by a query through the alias for a canary chunk that exists only in the new namespace.
