Incident-walkthrough

Stale MCP cache: how an agent rewrote 1,920 SKUs silently

At 09:14 on a Wednesday, a Zwolle e-commerce agency's content agent kicked off its weekly rewrite. By 16:08, every product page on a Shopify store sounded like a different company.

Jacob Molkenboer· Founder · A Brand New Company· 14 Jun 2026· 9 min

Open craft-paper ledger on ivory desk, torn page with product card rows, green sticky tag, broken red wax seal.

At 16:08 on a Wednesday in May, the marketing lead at a 23-person Zwolle e-commerce agency opened her store on her phone to send a screenshot to a vendor. The hero product, a stainless steel coffee scoop, now described itself in second person, in present tense, with three emoji per paragraph and a closing line urging "tag a friend who needs better mornings". The voice was friendly, casual, and completely wrong. Their brand voice was understated and Dutch-direct. The scoop description had been correct that morning.

She refreshed the next product. Wrong. The next. Wrong. She opened the Shopify admin and sorted by updated at: 1,920 product descriptions had been rewritten between 09:14 and 16:08 that day. The agency had no campaign scheduled. No human had touched the store.

The content agent had been running. So had everything else they had built on top of it.

The setup

This agency runs Shopify storefronts for fourteen DTC brands. Two years ago they built a small fleet of internal agents to handle the work that used to live in spreadsheets: weekly SKU description refreshes, alt-text backfills, blog repurposing, sale-banner sweeps. The orchestrator is a Node service running on a small Fly machine. Domain tools are exposed through the Model Context Protocol (MCP), each tool a separate process the orchestrator spawns at boot.

One of those tools, style-guide, returns the active brand voice for a given client_id. The team built it eighteen months ago. Style guides change rarely, so the tool caches aggressively: in-memory LRU, two-hour TTL, keyed by client_id:locale. A second layer caches to disk so the tool can warm-start after a restart.

You can see the problem already. The team could not, until we ran the post-mortem with them.

What the tool returned

In March they offboarded a client. Call them Client A. Client A had been on a casual-friendly voice, big on emoji and second-person address. The agency removed Client A's storefront from the orchestrator's job list, archived the Slack channel, sent the goodbye email.

What they did not do, because nobody had thought to: invalidate the on-disk cache for Client A. There was no admin task for it because the tool had never needed one. Client A's voice sat happily in /var/cache/style-guide/client_a.json for ten weeks.

On the Wednesday morning of the incident, the orchestrator dispatched the weekly description-refresh job for Client B (the coffee scoop client). It built a prompt that began roughly like this:

const tools = [
  await mcp.call("style-guide", { client_id: "client_b", locale: "nl-NL" }),
  await mcp.call("product-catalog", { client_id: "client_b" }),
  // ...
]

The style-guide tool process had been restarted overnight (Fly recycles machines every 24h). On boot it read its on-disk cache index. The index file had been written by a developer doing a debug session a year ago, and it had a default entry: when client_id was unknown, fall back to the most recently used style guide. That fallback was Client A's voice. Code path nobody had touched since.

Why was client_b "unknown"? Because the in-memory LRU had not warmed yet, the disk index keyed style guides by an internal UUID that had rotated when the team moved environments six weeks earlier, and the lookup quietly missed. The tool returned a 200 with Client A's style guide. No error. No warning log. The cache contract said: always return a style guide.

The orchestrator did exactly what it was supposed to: take the tool result at face value and write 1,920 product descriptions in the voice it was handed.

Seven hours of silence

Why seven hours? Three reasons, none of them technical.

The agency had a Slack channel called #agent-alerts that the orchestrator posted to. It posted job-start, job-end, and any exception. That morning it posted "weekly-refresh started, 1,920 SKUs queued" and nobody read it, because the channel had been muted across the team after a noisy week in February. The agency also had a dashboard at agents.internal/agency/runs that showed throughput. It was green. Throughput was healthy. Nobody looked at the output of the runs, only their pulse.

And critically, the agency had a sample-review step in the original spec for this job. The first product description gets posted to Slack for a human thumbs-up before the rest run. That step had been bypassed eight months earlier when the team was deep in a sprint, because the human at the other end was on holiday and the queue had backed up. The bypass was a one-line config change. The config had been re-deployed since then. The bypass survived.

The first person to notice was the marketing lead, on her phone, because the product page on the website looked wrong. By that point the orchestrator had finished its job and shut down cleanly. The job ran, the job ended, the job logged success.

Warning

An agent that can write at scale needs a gate that fires on the first artifact, not the last. By the time a post-hoc check would have caught Client A's voice on the coffee scoop, the run was over.

The recovery

Shopify keeps the previous body_html on every product revision through the API, but the agency was not relying on that. Eighteen months ago, when they first wired the agent into Shopify, the team had taken thirty minutes to add a "stash previous value into a metafield before mutating" step. It felt like overkill at the time. It was the only reason the rollback took four hours instead of four days. They wrote a rollback script that night against the Admin GraphQL API:

import { GraphQLClient } from "graphql-request"

const shopify = new GraphQLClient(process.env.SHOPIFY_GRAPHQL_URL, {
  headers: { "X-Shopify-Access-Token": process.env.SHOPIFY_TOKEN },
})

const QUERY = `
  query($id: ID!) {
    product(id: $id) {
      id
      handle
      metafields(first: 1, namespace: "abn", key: "prev_description") {
        edges { node { value } }
      }
    }
  }
`

// For each affected product: read the stashed metafield, write it back
// as descriptionHtml, then delete the metafield once the restore confirms.

By 22:30 that night, 1,920 products were restored. They posted a short note to the affected client's Slack at 07:30 the next morning, before the client noticed.

What changed in the agent

The post-mortem produced a list. None of these are clever. All of them are obvious in hindsight. Every one of them existed somewhere in the agency's backlog already.

Tool responses get a tenant signature

Every MCP tool now signs its response with the client_id it believes it answered for, and the orchestrator asserts that the answered id matches the one it asked about. If it does not match, the call fails loudly.

// In every MCP tool's response
return {
  requested: { client_id, locale },
  resolved:  { client_id: actualClientLookedUp, locale: actualLocale },
  payload: styleGuide,
}

// In the orchestrator, after every tool call
if (response.requested.client_id !== response.resolved.client_id) {
  throw new ToolContractViolation(
    `style-guide returned data for ${response.resolved.client_id} ` +
    `when ${response.requested.client_id} was requested`
  )
}

No silent fallbacks, ever

The "if unknown, return most-recent" branch was deleted. Unknown client_id is now an error. The orchestrator decides what to do with an unknown client. The tool does not get to guess.

The sample-review step is no longer bypassable by config

It is a hard wait on a Slack reaction. If the channel is muted or the reviewer is asleep, the job stalls and pages the on-call engineer after ten minutes. Throughput dropped slightly. Trust improved a lot.

Cache invalidation is part of offboarding

Their offboarding checklist now ends with a script that purges every tool cache for the departing client_id. The script also asserts that the id is no longer referenced in any active job config. If it is, the offboarding does not complete.

A brand-voice classifier runs as a gate on every batch

This is the change we recommended hardest. Before the orchestrator commits a batch to Shopify, a separate small model reads the generated description and the canonical style guide and answers a single question: does this match. If the answer is no on more than one sample in twenty, the batch halts and a human is paged. Two hundred milliseconds per product. Worth every one.

The cache TTL was not the real problem

It is tempting to read this and conclude that the cache TTL was too long. It was not. A two-hour TTL on a style guide that changes monthly is fine. The actual problem was that the tool had no concept of "no data" as a valid response. Every cache, every tool, every integration eventually faces a moment where the right answer is "I do not know." Tools that treat that moment as an error to be papered over are how silent incidents start.

The MCP specification does not enforce any particular tool contract. That is by design: the protocol is transport, not policy. The policy is yours to write. Write it.

The smallest change you can make today

Pull up one of your agent's tools and read its happy path. Now read its error path. If the error path includes any phrase like "fall back to default" or "use last known good", you have the same bug. Open an issue. Replace the fallback with an explicit error, and let the caller decide what "no data" means in context. That is twenty minutes of work, and it is the difference between a 7-hour incident and a tool call that fails in the first second of the run.

When we built the agent fleet for a Dutch fashion retailer last spring, the thing we ran into was a near-identical version of this story: an MCP tool with a sensible-looking default that turned out to be the wrong default for one of their seven brands. We solved it with the tenant-signature pattern above and a brand-voice classifier as a hard gate before any write. If you are wiring AI agents into a CMS or a PIM, that gate is the part you build first. The actual work, the agents that write copy or chase invoices or triage tickets, is the easy half.

Key takeaway

An agent that can write at scale needs a gate that fires on the first artifact, not the last.

FAQ

What is MCP?

Model Context Protocol, an open spec for connecting tools to an LLM-driven orchestrator. Each tool runs as its own process and exposes a typed interface the model can call.

Could this happen with a different agent framework?

Yes. The bug is not MCP-specific. Any tool layer that returns a default value when context is missing has the same failure mode.

How do you safely roll back thousands of product descriptions?

Stash the previous body_html into a Shopify metafield before any agent write. That single habit turns a rollback into a script instead of an emergency.

What is a brand-voice classifier?

A small model that reads a generated description and the canonical style guide and answers whether they match. Cheap, fast, and the right place to put the gate before a batch commits.

Why did the muted Slack channel matter?

Alerts go to channels people read. If the agent's only alarm lives in a muted channel, the alarm does not exist. Route critical alerts to a channel the on-call cannot mute.

ai agentsautomationintegrationstoolingarchitecturee-commerce

Building something?

Start a project