Graph token cache poisoning: a four-hour Teams agent outage

A tenant admin rotated one Entra app secret at 16:47 on a Friday, and a 28-person Ghent accountancy lost their Teams agent for four hours. Here's the walkthrough.

Jacob Molkenboer · Founder · A Brand New Company · 16 Dec 2024 · 9 min

16:47, a Friday in late May. The IT lead at a 28-person accountancy in Ghent rotated a client secret on an Entra ID app registration as part of his quarterly hygiene pass. By 16:51 the firm's Teams agent, the one that triages client emails into Outlook folders and posts daily intake summaries into a Teams channel, stopped responding. By 17:30 the partner group chat had eight messages asking why "the bot" was broken. By 20:55 it was back. The four hours in between went to unpicking an MSAL token cache that was perfectly happy with a credential nobody else in the tenant could see. This is the post we wanted to write down before we forgot the details.

This was our build, deployed eight months earlier on a stack we know well. The failure was not in the code we wrote. It was in the assumption that rotating a single client secret is a low-risk operation.

The setup

The agent is a .NET 8 worker hosted on an Azure App Service Premium V3. It uses MSAL.NET with the confidential client flow against Microsoft Graph, scoped to Mail.ReadWrite, ChannelMessage.Send, and Chat.Read. Two instances run behind the App Service plan for redundancy. The MSAL distributed token cache is wired into an Azure Cache for Redis instance so that both workers see the same tokens and don't waste throttle budget acquiring duplicates.

The app registration in Entra ID had two active client secrets, staggered by six months. Standard pattern: one in use, one warming. The IT lead rotated the older one because his calendar reminder said so.

What rotating the secret actually did

In the Entra portal, removing a client secret is a single click followed by a confirmation. Microsoft documents the operation as immediate. There is no propagation delay, no soft-delete, no recovery window. The secret stops being valid the moment the API call returns.

Here is what we assumed would happen. The agent runs with the newer secret stored in Azure Key Vault. The older secret being deleted should be a no-op, because nothing references it.

Here is what actually happened. The agent had been redeployed three weeks earlier with the older secret as its active credential, after a separate rotation went sideways and someone (us) had left both secrets in App Configuration with the older one as "primary." The newer secret was sitting in Key Vault, marked active, but the worker process had never re-read it. So the worker was authenticating with a secret that the tenant admin just deleted, while the dashboard in Entra suggested the newer secret was in use.

Warning

An app registration's "active credentials" list in Entra tells you which secrets exist. It does not tell you which one your running process is actually sending in its token requests. Those are different facts.

The token cache failure mode

This is where it got interesting, and where the four hours came from.

MSAL.NET aggressively caches access tokens. The confidential client flow returns a token with a 60 to 90 minute lifetime. While that token is still inside its lifetime window, MSAL hands it out without going back to Entra. Microsoft's own MSAL guidance is explicit: do not call AcquireTokenForClient in a loop expecting fresh tokens, because you will get the cached one until it expires.

So at 16:47, when the secret was deleted, both worker instances had a valid cached access token sitting in Redis. The token was minted using the now-deleted secret, but the token itself is a signed JWT and Entra cannot retroactively invalidate it. Graph happily accepted the token. The agent kept working.

For about four minutes.

Then the token's expiry clock ran out. MSAL went to get a new one. The client credentials flow has no refresh token, so this was a full token request, and it sent the (deleted) client secret. Entra returned AADSTS7000215: Invalid client secret provided. MSAL caught the error and, per its policy, evicted the bad entry from the cache and bubbled the exception up to our code.
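The delay between deleting the secret and seeing the first failure is easier to reason about as a toy simulation. The sketch below is not MSAL; it is a minimal cache-first token lookup with made-up names (`validSecrets`, `AcquireToken`, the `old-secret` and `new-secret` strings) and an assumed 60-minute lifetime. It only shows that the secret is checked when the cache misses, never while a cached token is still inside its lifetime.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for Entra: the client secrets the app registration still accepts.
var validSecrets = new HashSet<string> { "old-secret", "new-secret" };

// Stand-in for the shared token cache: one app token plus its expiry.
(string Token, DateTimeOffset ExpiresOn)? cached = null;
var issued = 0;

// Cache-first acquisition: the secret only leaves the process on a miss or expiry.
string AcquireToken(string secret, DateTimeOffset now)
{
    if (cached.HasValue && now < cached.Value.ExpiresOn)
        return cached.Value.Token; // served from cache, secret never checked

    if (!validSecrets.Contains(secret))
        throw new InvalidOperationException("AADSTS7000215: Invalid client secret provided.");

    issued++;
    cached = ($"token-{issued}", now.AddMinutes(60)); // assumed 60-minute lifetime
    return cached.Value.Token;
}

var start = DateTimeOffset.UnixEpoch; // arbitrary clock origin for the simulation

Console.WriteLine(AcquireToken("old-secret", start));                // token-1
validSecrets.Remove("old-secret");                                   // admin deletes the secret
Console.WriteLine(AcquireToken("old-secret", start.AddMinutes(55))); // token-1, still cached

try
{
    AcquireToken("old-secret", start.AddMinutes(61));                // cache entry has expired
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // AADSTS7000215: Invalid client secret provided.
}
```

The second call succeeds even though the secret is already gone, which is why the outage started a few minutes after the deletion rather than at the moment of the click.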

Our worker logged the error to Application Insights, threw a ServiceException, and the App Service health probe (which checks Graph reachability as part of its readiness logic) marked the instance unhealthy. App Service rolled the instance. The new instance came up, read the same configuration, attempted the same authentication, failed the same way, and was marked unhealthy. Same for the second instance. Within ten minutes the App Service plan had cycled both workers four times and was sitting in a flapping state.

The Redis cache made the flapping worse, not better. Each new instance pulled the stale cache state, retried the auth call with the same dead secret, and burned the same exception. Distributing the cache had bought us throttle headroom in normal operation. It had also turned a single-instance failure into a synchronized one.

The four-hour debug path

The on-call engineer (one of ours) was paged at 17:08. Here is the actual debug walk, because every step looked like the obvious cause and only the last one was right.

17:08 to 17:35. It's the deploy.

Application Insights showed the failure starting at 16:51. The first instinct was a bad deploy, because there had been a feature branch merged at 14:00 that day. We rolled back to the previous container image. Same failure. Eliminated.

17:35 to 18:20. It's Graph.

The error code AADSTS7000215 sounds like a Graph-side problem. The AADSTS error reference lists it as "invalid client secret provided." We assumed Entra was wrong, because the secret in Key Vault was the right one and Key Vault said so. We opened a Microsoft support ticket framed the wrong way: "Graph is rejecting a valid secret." Forty minutes lost to a hypothesis nobody at Microsoft was going to validate.

18:20 to 19:40. It's the cache.

We flushed the Redis MSAL cache, restarted both instances, and watched the same error come back. This was correct behavior, but it told us the cache was a symptom and not the cause. We then dumped the actual secret value the worker was sending, using a diagnostic build gated behind a feature flag we keep for exactly this, and compared it against the two secrets shown in Entra. The worker was sending the secret that Entra showed as "deleted two hours ago."

19:40 to 20:55. It's the configuration.

We traced the secret back to App Configuration, which loaded from a Key Vault reference. The reference was correct. The fetched value was the deleted secret. The Key Vault secret had two versions, and the worker was pinned to an older version by URI. We had set that URI six months ago during the prior rotation, and forgotten to remove the version pin when we marked the new secret active.

The fix was five lines. Replace this:

// pinned to a specific version, never drifts
appSettings: {
  GraphClientSecret: '@Microsoft.KeyVault(SecretUri=https://kv-ghent.vault.azure.net/secrets/graph-client-secret/9c4f...e2a1)'
}

With this:

// no version, always returns the current secret
appSettings: {
  GraphClientSecret: '@Microsoft.KeyVault(SecretUri=https://kv-ghent.vault.azure.net/secrets/graph-client-secret/)'
}

Redeploy. Tokens issued cleanly. Teams agent back online at 20:55.

What we changed in our deployment pattern

Three structural changes came out of this. They are boring. That is the point.

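One: Key Vault references never pin a version. The version-pinned URI was the root cause. Pinning is appropriate for an immutable artifact, like a certificate you have decided not to renew yet. It is wrong for a credential you intend to rotate. The unpinned URI resolves to the latest version of the secret. On App Service, expect the resolved value to be cached and refreshed within roughly 24 hours, or immediately when the app's configuration changes, so build a restart into the rotation if you need the new value sooner. If you want safe rollouts, do them at the secret level, not the URI level.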
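A check for pinned references is small enough to run in CI. The following is a minimal sketch, not part of our project template: `ExtractSecretUri` and `IsVersionPinned` are hypothetical helper names, the version string in the example is a made-up placeholder, and the check assumes the `SecretUri=` form of the reference shown above. It treats a third path segment after `/secrets/{name}` as a version pin.

```csharp
using System;

// Hypothetical helper: strips the App Service reference wrapper if present.
string ExtractSecretUri(string reference)
{
    const string prefix = "@Microsoft.KeyVault(SecretUri=";
    return reference.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) && reference.EndsWith(')')
        ? reference[prefix.Length..^1]
        : reference;
}

// Hypothetical helper: true when the path is /secrets/{name}/{version}.
bool IsVersionPinned(string reference)
{
    var path = new Uri(ExtractSecretUri(reference)).AbsolutePath;
    var segments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
    return segments.Length >= 3
        && segments[0].Equals("secrets", StringComparison.OrdinalIgnoreCase);
}

// The version below is a made-up placeholder, not a real secret version.
var pinned = "@Microsoft.KeyVault(SecretUri=https://kv-ghent.vault.azure.net/secrets/graph-client-secret/0123456789abcdef0123456789abcdef)";
var unpinned = "@Microsoft.KeyVault(SecretUri=https://kv-ghent.vault.azure.net/secrets/graph-client-secret/)";

Console.WriteLine(IsVersionPinned(pinned));   // True
Console.WriteLine(IsVersionPinned(unpinned)); // False
```

This sketch does not handle the alternative `VaultName=...;SecretName=...;SecretVersion=...` reference syntax; if you use that form, the equivalent check is whether `SecretVersion` is present.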

Two: the worker logs which secret thumbprint it just used, on every token acquisition. Not the secret itself. The first six characters of a SHA-256 hash of the secret. That single log line would have cut three hours off this incident, because we would have seen at a glance that the worker was using a thumbprint that no longer existed in Entra.

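using System.Security.Cryptography;
using System.Text;

// Short fingerprint of the secret for log correlation; never log the secret itself.
var thumbprint = Convert.ToHexString(
    SHA256.HashData(Encoding.UTF8.GetBytes(secret))
)[..6];
_logger.LogInformation(
    "Acquired Graph token using secret {Thumbprint}",
    thumbprint
);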

Three: the agent health probe distinguishes between "Graph is down" and "we can't authenticate." The old probe returned 503 on any Graph failure, which is what made the App Service plan flap. The new probe returns 200 on auth failure but emits a critical alert and pauses the worker's task loop. The reasoning: an auth failure means our config is wrong. Cycling the container does not fix our config. Stop cycling. Page the human.
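The decision logic behind that probe can be sketched in a few lines. This is an illustration under assumptions, not our production probe: `Classify`, the tuple shape, and the idea of passing the last Graph check's error text as a string are all hypothetical, and the list of error codes is illustrative rather than complete.

```csharp
#nullable enable
using System;

// Illustrative list of Entra error codes that point at our own credential, not at an outage.
string[] credentialErrors = { "AADSTS7000215", "AADSTS7000222" };

// Hypothetical classifier: lastError is null when the last Graph check succeeded.
(int StatusCode, bool PauseWorker, bool PageHuman) Classify(string? lastError)
{
    if (lastError is null)
        return (200, false, false);      // healthy

    foreach (var code in credentialErrors)
    {
        if (lastError.Contains(code, StringComparison.Ordinal))
            return (200, true, true);    // config problem: stay up, stop work, alert
    }

    return (503, false, false);          // dependency problem: let the platform recycle
}

Console.WriteLine(Classify(null));                                             // (200, False, False)
Console.WriteLine(Classify("AADSTS7000215: Invalid client secret provided.")); // (200, True, True)
Console.WriteLine(Classify("503 Service Unavailable"));                        // (503, False, False)
```

The design choice is that only failures a restart could plausibly fix are allowed to reach the platform as a 503. Anything that points at our own credential keeps the instance up and routes to a person instead.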

The credential hygiene angle nobody likes

One more thing. The week we were debugging this, the Hacker News front page carried a story about Microsoft's open source tools being co-opted to steal credentials from AI developers. The root cause there is different from ours, but the lesson rhymes. Any credential that touches an AI agent has a larger blast radius than a human-only system, because the agent runs continuously, holds tokens longer, and tends to be deployed with less observability than a customer-facing API. A secret rotation that would be a one-minute distraction for a web app becomes a four-hour outage for an agent, because nobody is sitting in front of the agent watching it fail.

If you operate a Teams agent, an Outlook agent, or anything else that talks to Microsoft Graph on behalf of a tenant, do the five-minute audit today. Open your Key Vault secret references. Check whether any of them pin a version. If they do, rewrite them to drift to the current version. Then check whether your auth failure path actually pages a human, instead of cycling a container until the tenant admin notices the bot is silent.

When we built the Teams agent for this Ghent firm, the assumption that broke us was that Entra and our deployment stack would agree on which secret was the active one. They did not. We rebuilt our AI agent deployment pattern around that lesson, and the version-pinning rule is now in the project template. If you are running into the same flapping behavior on a Graph-backed agent, the fix is almost always in the Key Vault URI, not in your code.

Five minutes today: open one Key Vault reference URI on your agent and check whether it ends in a version hash. If it does, remove the hash and redeploy.

Key takeaway

Pinning a Key Vault secret URI to a version turns a routine credential rotation into a multi-hour outage. Reference the latest version, not a specific one.

FAQ

What is token cache poisoning in this context?

It's when an MSAL token cache keeps serving (or trying to refresh) tokens minted with a credential that no longer exists. The cache is fine; the upstream secret has been deleted, and every refresh attempt fails the same way.

Why does rotating one client secret break the agent if the other secret is still valid?

Because the running process doesn't necessarily use the secret Entra shows as active. If your Key Vault reference is pinned to a specific version, your worker keeps sending the old secret regardless of what the Entra portal displays.

How do I check whether my Graph integration is exposed to this?

Open every Key Vault reference your worker uses for client secrets or certificates. If the SecretUri ends in a version hash, you're exposed. Rewrite the URI without a version so it always returns the current secret.

Should the App Service health probe call Graph at all?

Only if you can distinguish auth failures from service failures. Cycling a container fixes a transient Graph outage. It cannot fix a credential mismatch, and trying will turn one stale secret into a synchronized flap across every instance.
