RAG tenant leakage: anatomy of a seven-hour incident

At 10:40 on a Tuesday, a Dutch freight forwarder asked why our agent was quoting a competitor's tariff sheet. The answer was a bucket rename that nobody flagged.

Jacob Molkenboer · Founder · A Brand New Company · 6 Jun 2026 · 8 min

At 10:40 on a Tuesday, an account manager at a Dutch freight forwarder forwarded us a Slack screenshot. Their internal RAG agent, the one we had shipped six weeks earlier to answer "what's our rate from Tilburg to Lyon for a 13.6m trailer", had just quoted them tariffs from a competitor. Not similar tariffs. Lifted verbatim from a competitor's PDF, citation block and all.

The customer was polite. They wanted to know if it was a one-off, or if their own pricing was being leaked the other way. We told them we would know within the hour. It took seven.

What follows is what happened, in the order we understood it, with the code we changed afterwards. The bug was not exotic. The lessons are not new. They are, however, the kind that get written into design docs by every team that has lived through them, and skipped by every team that has not.

The architecture, before anything broke

The system was a fairly typical retrieval-augmented setup, the sort of thing you will find in any mid-sized B2B knowledge agent.

One S3 bucket per tenant, holding the customer's source PDFs: tariff sheets, SLA addenda, route handbooks, customs documentation. A sync worker watching S3 events via SQS, chunking and embedding each new or updated object, then upserting into a shared Pinecone index with the tenant ID stamped onto every chunk's metadata. A retrieval service that read the tenant from the JWT and filtered the index query by metadata.tenant before passing the top-k chunks to the LLM. Bog standard.

The bucket-to-tenant mapping lived in a YAML file loaded at worker startup. Once loaded, it was equivalent to this dict, with this resolver on top:

BUCKET_TO_TENANT = {
    "vondel-logistiek-docs": "vondel",
    "rijnpoort-cargo-docs": "rijnpoort",
    "kanaal-zuid-shipping-docs": "kanaalzuid",
    # ...
}

def resolve_tenant(bucket_name: str) -> str:
    return BUCKET_TO_TENANT.get(bucket_name, "general-knowledge")

That fallback is the first half of the story. We will get to the other half.

There was also a shared layer in the index that we called the general-knowledge namespace. As the query below shows, it was a reserved value of the tenant metadata field rather than a separate partition of the index. The intent had been reasonable: a shared layer where we indexed truly public documents like Incoterms 2020 references, CMR consignment-note explainers, and the FENEX general conditions. The retrieval service always included it in the filter so the agent could answer generic logistics questions without each tenant having to re-upload the same public references:

results = index.query(
    vector=question_embedding,
    top_k=8,
    filter={"tenant": {"$in": [tenant_id, "general-knowledge"]}},
)

This pattern is common. Pinecone's metadata filtering supports the $in operator, which makes unioning a tenant with a shared layer a one-line change. We had not invented anything weird. We had, however, built two faults in series, and on Tuesday morning they aligned.

The rename

The customer's IT lead emailed our on-call channel at 09:14: "Renaming our docs bucket from vondel-logistiek-docs to vondel-knowledge-2026, please update your end."

S3 has no rename. You create a new bucket and copy. The customer's IT lead did exactly that. Their internal sync tool started moving objects at 09:14 and finished around 09:22. From S3's perspective those were not renamed files. They were brand-new objects in a brand-new bucket, each one firing an ObjectCreated event.

Our SQS subscription was repointed to the new bucket within minutes. That was the part the IT lead asked about, and we handled it the way you would expect. What nobody updated was the BUCKET_TO_TENANT map in our worker config. The worker started receiving hundreds of ObjectCreated events for vondel-knowledge-2026, looked them up in a dict that still listed vondel-logistiek-docs, missed every time, and fell through to the default. Every tariff PDF the customer had ever uploaded was being re-embedded into general-knowledge.
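To make the failure concrete, here is a minimal sketch that replays the rename against the original resolver. The bucket names and the fallback come from the incident; the trimmed map and the object keys are invented for illustration.

```python
from collections import Counter

# Trimmed copy of the stale map: it still lists the OLD Vondel bucket only.
BUCKET_TO_TENANT = {
    "vondel-logistiek-docs": "vondel",
    "rijnpoort-cargo-docs": "rijnpoort",
}


def resolve_tenant(bucket_name: str) -> str:
    # Original behaviour: an unknown bucket silently becomes the shared layer.
    return BUCKET_TO_TENANT.get(bucket_name, "general-knowledge")


# Hypothetical ObjectCreated events; the object keys are made up.
events = [
    {"bucket": "vondel-knowledge-2026", "key": "tariffs/2026-intra-eu.pdf"},
    {"bucket": "vondel-knowledge-2026", "key": "sla/addendum-3.pdf"},
    {"bucket": "rijnpoort-cargo-docs", "key": "routes/handbook.pdf"},
]

# Count where each event would have been indexed.
routed = Counter(resolve_tenant(e["bucket"]) for e in events)
print(dict(routed))  # {'general-knowledge': 2, 'rijnpoort': 1}
```

Nothing in that run raises, logs, or warns. Both Vondel objects land in the shared layer, and the worker reports success.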

Warning

If your tenant resolver has a default branch, it is one outage away from being your data-leakage vector. Fail closed. An unknown bucket is an exception, not a namespace.

How a competitor's PDF was already in the shared namespace

Re-indexing one customer's documents into general-knowledge would have been bad on its own. But the namespace was not empty in the way we believed.

Eight months earlier, during a prospect pitch with a different freight forwarder, one of our engineers had hand-uploaded that prospect's public tariff brochure into general-knowledge to benchmark ranking quality against the Incoterms baseline. It was a five-minute experiment. The PDF was meant to be removed after the demo. It was not. It had been sitting in the shared namespace ever since, never touched by any agent, because no production tenant had ever asked about that company's specific rates.

Until 10:38, when Vondel's account manager asked their agent: "what's our rate Tilburg to Lyon, 13.6m trailer, weekly". The query embedded close to three things at once:

Vondel's own 2026 tariff sheet, freshly re-embedded under general-knowledge by the broken worker.
An Incoterms 2020 explainer, the legitimate inhabitant of the shared layer.
The competitor's brochure from the eight-month-old demo.

The competitor's brochure had cleaner section headers, the kind of "Tarieven binnenland en intra-EU, per kilometer en per stop" structure that embeds into tighter clusters and ranks higher under cosine similarity than Vondel's own less-structured internal handbook. The agent picked it. The citation block at the bottom of the response said "Source: Tariffs 2024, [Competitor Name]". The account manager screenshotted it. The Slack message landed in our inbox at 10:40, two minutes after the query.
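The ranking effect is easy to reproduce on paper. The sketch below uses toy three-dimensional vectors in place of real embeddings and a plain list in place of the index. Every number and chunk ID is invented, and the `vondel-handbook` chunk assumes some older Vondel content was still correctly tagged. It only illustrates how a union filter lets a stray document outrank a tenant's own content.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Toy "embeddings". All values are invented for illustration.
CHUNKS = [
    {"id": "vondel-handbook", "tenant": "vondel", "vec": [0.70, 0.60, 0.39]},
    {"id": "vondel-tariff-2026", "tenant": "general-knowledge", "vec": [0.80, 0.50, 0.33]},
    {"id": "incoterms-2020", "tenant": "general-knowledge", "vec": [0.30, 0.90, 0.30]},
    {"id": "competitor-brochure", "tenant": "general-knowledge", "vec": [0.90, 0.40, 0.17]},
]


def query(question_vec, allowed_tenants, top_k=8):
    """Filter by tenant metadata first, then rank the survivors by similarity."""
    hits = [c for c in CHUNKS if c["tenant"] in allowed_tenants]
    hits.sort(key=lambda c: cosine(question_vec, c["vec"]), reverse=True)
    return [c["id"] for c in hits[:top_k]]


question = [0.92, 0.38, 0.10]

# Union filter, as in the original retrieval service.
union = query(question, {"vondel", "general-knowledge"})
# Strict filter: only chunks tagged with the caller's tenant.
strict = query(question, {"vondel"})

print(union[0])  # competitor-brochure
print(strict)    # ['vondel-handbook']
```

The filter decides what is eligible; similarity decides what wins. Once the brochure was eligible, nothing downstream had any reason to prefer Vondel's own pages.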

The seven-hour timeline

We do not write postmortems with hour-by-hour drama, but it is worth showing the shape, because most of the seven hours were not spent fixing. They were spent finding.

10:40. Ticket arrives. We acknowledge in 90 seconds.
10:55. First hypothesis: embedding drift from a recent model swap. Wrong. We had not swapped models in a month.
11:30. We confirm the citation block is real. The source PDF exists in general-knowledge. Question: how did it get there?
12:15. Audit log shows the PDF was uploaded eight months ago by a now-former engineer. We mark it for quarantine but still do not understand why Vondel's query reached it.
13:10. Customer's IT lead mentions the bucket rename "earlier today" in a follow-up about something else. The penny drops.
13:45. We trace the resolver and find the fallback. We stop the worker.
14:30. Full audit of which Vondel chunks were written into general-knowledge: 1,847 vectors. We delete them by metadata filter.
15:10. Competitor brochure deleted with prejudice. Audit confirms no other tenant ever retrieved it.
16:14. Full re-sync from vondel-knowledge-2026 into the correct namespace. Smoke tests pass. Worker re-enabled.
17:38. Written postmortem delivered to the customer. They reply, "Thanks, that was fast", which we appreciated and did not deserve.

The thing worth noting in that timeline is the gap between 11:30 and 13:10. We spent an hour and forty minutes treating the symptom (a poisoned shared namespace) without understanding the trigger (the bucket rename). The customer mentioned the rename casually, almost as small talk. Had they not, we would have spent the rest of the day looking in the wrong place.

What we changed

Four changes shipped within 48 hours. None of them are clever. All of them are things we should have done at design time.

The resolver fails closed

No default namespace. An unknown bucket raises, the worker logs, the message returns to SQS, and on-call gets paged. The cost of a handful of false-positive pages over the next year is a fraction of the cost of one false-negative leak.

class UnknownBucketError(Exception):
    pass

def resolve_tenant(bucket_name: str) -> str:
    try:
        return BUCKET_TO_TENANT[bucket_name]
    except KeyError as e:
        raise UnknownBucketError(
            f"No tenant mapping for bucket {bucket_name!r}. "
            f"Add it to config or quarantine the message."
        ) from e
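The log, requeue and page sequence around that resolver could look roughly like the sketch below. It is not the production worker. `index_object`, `delete_message` and `page_on_call` are hypothetical callables standing in for the embedding pipeline, the SQS delete call and the paging integration, and the one-entry map is illustrative. It relies on standard SQS behaviour: a message that is not deleted becomes visible again after its visibility timeout.

```python
import json
import logging

logger = logging.getLogger("sync-worker")

BUCKET_TO_TENANT = {"vondel-knowledge-2026": "vondel"}  # illustrative entry


class UnknownBucketError(Exception):
    pass


def resolve_tenant(bucket_name: str) -> str:
    try:
        return BUCKET_TO_TENANT[bucket_name]
    except KeyError as e:
        raise UnknownBucketError(f"No tenant mapping for bucket {bucket_name!r}") from e


def handle_message(body: str, index_object, delete_message, page_on_call) -> bool:
    """Process one SQS message carrying an S3 event notification.

    Returns True if the object was indexed and the message deleted,
    False if the message was left on the queue for redelivery.
    """
    # Simplification: only the first record of the notification is handled.
    record = json.loads(body)["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    try:
        tenant = resolve_tenant(bucket)
    except UnknownBucketError as exc:
        logger.error("refusing to index s3://%s/%s: %s", bucket, key, exc)
        page_on_call(str(exc))
        return False  # message is NOT deleted, so SQS will redeliver it
    index_object(tenant, bucket, key)
    delete_message()  # delete only after a successful index
    return True
```

The ordering is the point: resolution happens before any embedding work, and the delete happens only after a successful index, so a failed lookup cannot consume the message or write a vector.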

The retrieval filter is strict

We killed the general-knowledge namespace. Every tenant now owns a private copy of the shared corpus: the Incoterms references, the CMR explainers, the FENEX conditions. Storage cost went up by roughly forty euros per tenant per month. That is, transparently, the price of not running a cross-tenant union in a retrieval filter. We are happy to pay it.

results = index.query(
    vector=question_embedding,
    top_k=8,
    filter={"tenant": {"$eq": tenant_id}},   # exact, not $in
)

If you ever find yourself widening a tenant filter with $in or $or, treat it as a security review, not a feature. The OWASP Top 10 for LLM Applications files this class of problem under Sensitive Information Disclosure (LLM06 in the 2023 list, LLM02 in the 2025 revision). It is the easiest bug to write and the hardest one to spot in code review, because the code reads as correct English.
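One cheap way to make a widened filter fail loudly is a guard that runs just before the query is sent. The helper below is a hypothetical sketch, not part of any library. It assumes the filter is a MongoDB-style dict whose top-level keys are combined with AND, so extra clauses can only narrow the result.

```python
def assert_strict_tenant_filter(query_filter: dict, tenant_id: str) -> None:
    """Raise ValueError unless the filter pins `tenant` to exactly one value.

    Hypothetical guard: anything other than an exact $eq match on the
    caller's tenant is rejected, including $in, $ne, or a missing clause.
    """
    if not tenant_id:
        raise ValueError("tenant_id must be a non-empty string")
    expected = {"$eq": tenant_id}
    clause = query_filter.get("tenant")
    if clause != expected:
        raise ValueError(
            f"tenant clause must be exactly {expected!r}, got {clause!r}"
        )


# Passes silently: an exact match, optionally narrowed by another field.
assert_strict_tenant_filter(
    {"tenant": {"$eq": "vondel"}, "doc_type": {"$eq": "tariff"}}, "vondel"
)

# Rejected: the original union with the shared layer.
try:
    assert_strict_tenant_filter(
        {"tenant": {"$in": ["vondel", "general-knowledge"]}}, "vondel"
    )
except ValueError as exc:
    print(f"rejected: {exc}")
```

A guard like this turns a future "just add the shared layer back" change into a failing call rather than a quiet diff.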

The bucket is its own map

We removed the YAML mapping entirely. Each bucket now carries its tenant ID as an S3 tag, and the worker reads the tag rather than a sidecar config file. A renamed bucket that nobody re-tagged can no longer be mis-routed, because an untagged bucket fails closed at the resolver. The map and the bucket are the same artefact, which means they cannot drift.
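A tag-based resolver can be sketched in a few lines. The tag key `tenant-id` is an assumption made for this example, as is the decision to treat any failure to read tags as an unknown bucket. `get_bucket_tagging` is the boto3 S3 client call that returns a bucket's `TagSet`; the client is passed in so the function can be exercised with a stub.

```python
class UnknownBucketError(Exception):
    pass


TENANT_TAG_KEY = "tenant-id"  # hypothetical tag name, chosen for this sketch


def resolve_tenant_from_tags(s3_client, bucket_name: str) -> str:
    """Return the tenant ID stored in the bucket's tags, or fail closed."""
    try:
        response = s3_client.get_bucket_tagging(Bucket=bucket_name)
    except Exception as e:
        # boto3 raises a ClientError for a bucket with no tags at all.
        # Any failure to read tags is treated as "unknown": fail closed.
        raise UnknownBucketError(
            f"Cannot read tags for bucket {bucket_name!r}"
        ) from e

    tags = {t["Key"]: t["Value"] for t in response.get("TagSet", [])}
    tenant = tags.get(TENANT_TAG_KEY, "").strip()
    if not tenant:
        raise UnknownBucketError(
            f"Bucket {bucket_name!r} has no {TENANT_TAG_KEY!r} tag"
        )
    return tenant
```

In a worker this would be called as `resolve_tenant_from_tags(boto3.client("s3"), bucket_name)`. Bucket tags do not travel with copied objects, so a freshly created bucket starts with no tenant tag and is rejected until someone tags it.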

Adversarial retrieval tests, finally

The bug that bit us would have been caught by a single test: ask a question that only matches the competitor's PDF, then assert that the retrieval returns zero results for tenant Vondel. We did not have that test. We had functional tests ("does the agent answer correctly?") but no negative tests ("does the agent refuse to retrieve from documents it should never see?").

This is the same gap that recent open-source work on AI-powered vulnerability discovery pokes at: running adversarial probes against a system to find the failure modes that humans miss in review. For RAG the equivalent is a small fuzz harness. Generate questions tied to each piece of out-of-tenant content. Query the system as every tenant. Assert that the wrong tenant never retrieves it. We now run one nightly across all tenants and the entire shared corpus. In the first week it surfaced two other latent issues, both small, both worth fixing before they grew.
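A harness of that shape does not need much code. The sketch below is an illustration under stated assumptions, not the nightly job itself. `retrieve(tenant_id, question)` is a hypothetical function returning hits with `id` and `tenant` keys, the store is a keyword-matching stand-in for a vector index, and all probes and documents are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    owner: str     # tenant (or layer) that legitimately owns the content
    question: str  # question written to match only that content


def find_leaks(retrieve, tenants, probes):
    """Ask every probe as every non-owning tenant and collect foreign hits."""
    leaks = []
    for probe in probes:
        for tenant in tenants:
            if tenant == probe.owner:
                continue
            for hit in retrieve(tenant, probe.question):
                if hit["tenant"] != tenant:
                    leaks.append((tenant, probe.question, hit["id"]))
    return leaks


# Toy store standing in for the vector index. All content is invented.
STORE = [
    {"id": "vondel-tariff", "tenant": "vondel", "text": "tilburg lyon trailer rate"},
    {"id": "rijnpoort-sla", "tenant": "rijnpoort", "text": "rotterdam duisburg barge sla"},
    {"id": "stray-brochure", "tenant": "general-knowledge", "text": "domestic tariff per kilometre"},
]


def make_retriever(shared_layers):
    """Build a keyword retriever that unions the caller with `shared_layers`."""
    def retrieve(tenant_id, question):
        allowed = {tenant_id} | set(shared_layers)
        words = set(question.lower().split())
        return [
            c for c in STORE
            if c["tenant"] in allowed and words & set(c["text"].split())
        ]
    return retrieve


TENANTS = ["vondel", "rijnpoort"]
PROBES = [
    Probe("vondel", "rate tilburg lyon"),
    Probe("rijnpoort", "barge sla rotterdam"),
    Probe("general-knowledge", "tariff per kilometre"),
]

leaky = find_leaks(make_retriever(["general-knowledge"]), TENANTS, PROBES)
strict = find_leaks(make_retriever([]), TENANTS, PROBES)
print(len(leaky), len(strict))  # 2 0
```

With the union retriever, both tenants reach the stray brochure; with the strict one, the leak list is empty. If a shared layer is ever kept on purpose, the `hit["tenant"] != tenant` check would need an explicit allow-list of reviewed documents rather than a blanket exemption.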

Takeaway

Every RAG system that filters by tenant needs a test that proves the filter holds. Without one, you are trusting a metadata field you cannot see.

The smallest thing to do today

Open the file in your RAG codebase that translates an incoming request into a retrieval filter. Read it line by line. If there is any branch that widens the filter, an $in, an $or, a fallback to a "shared" or "default" or "public" namespace, write a single test today that asserts the widening cannot reach another tenant's data. The test does not have to be sophisticated. It has to exist.

When we built the RAG layer for this freight forwarder, the thing we ran into was an architecture that quietly assumed buckets were stable and shared namespaces were trustworthy. We ended up solving it by removing both assumptions and letting the worker fail loudly when the world surprises it.

Key takeaway

A bucket rename revealed two latent flaws: a tenant resolver with a default fallback, and a retrieval filter that unioned tenants with a shared namespace.

FAQ

How can a competitor's PDF end up in a customer's RAG answers?

Usually through a shared namespace that retrieval unions with the tenant filter, plus a piece of content that landed in that namespace by mistake. Both are fixable, and both should be tested for.

Does renaming an S3 bucket move the objects?

No. S3 has no rename. You create a new bucket and copy. Anything downstream that keys off the old bucket name will not see the new one unless you update it explicitly.

Is a shared "general knowledge" namespace in RAG always a bad idea?

Not always, but it is risky. If you keep one, gate writes behind review and run nightly tests that assert no tenant retrieves a document from it that they should not.

What did the seven hours actually cost?

One customer with a leaked competitor citation in a single agent response, no two-way pricing leak, and a week of trust to rebuild. The financial damage was small. The reputational risk was not.
