AI agent incident: when a policy change ate the output

A 38-person Antwerp legaltech watched its contract-review agent return two-sentence summaries instead of twelve. The cause sat in a workspace flag none of them owned.

Jacob Molkenboer · Founder · A Brand New Company · 12 Jun 2026 · 9 min

At 14:08 Brussels time on a Wednesday in late May, a paralegal at Clausemark, a 38-person legaltech in Antwerp, dropped a 14-page distribution agreement into the contract-review queue. The agent returned a summary. Two sentences. It should have been twelve.

Her first reaction was the right one: she re-ran the contract. Same result. She tried a shorter NDA. Two sentences. She walked to the engineering room.

What follows is the post-mortem of the next six hours, written with permission and a few names changed. The trigger was a vendor retention policy change that landed on production workspaces that morning. The damage was not data loss. The damage was a silent quality regression on every contract the agent had touched since 06:00.

How the failure surfaced

Clausemark runs a contract-review agent on a frontier model, routed through a thin internal wrapper they call cm-review. Lawyers paste clauses, the agent extracts obligations, flags non-standard language, and produces a clause-by-clause summary. Average output: about 1,800 tokens.

Between 06:14 and 13:52, 117 contracts went through the pipeline. None failed. Every single one returned a summary that stopped partway through the contract. The QA loop on the dashboard, which checks for token count, schema validity, and refusal language, marked all 117 as green. The output was valid JSON. The summaries were just short.

This is the canonical AI agent bug: nothing crashed, no exception bubbled up, no alert fired. A senior associate happened to read one of the summaries before sending it to a client, and noticed it ended at clause 4 of 11. He flagged it on Slack. The paralegal at the start of the queue confirmed she had been seeing odd-feeling outputs all morning but had assumed the contracts were simpler than usual.

Warning

If your agent's only health check is “did it return valid JSON,” you are not monitoring the agent. You are monitoring the serializer.

The first 90 minutes

The engineering team had three theories within ten minutes:

  • The vendor pushed a new model snapshot overnight and the system prompt no longer reliably elicits the full schema.
  • Their orchestrator (a homegrown layer on top of LangGraph) was hitting a recursion limit and bailing early.
  • Someone changed the prompt template and forgot to push the migration.

All three were wrong, but all three were checked first because all three were familiar. The actual cause sat in a place none of them had looked at since the system was first wired up: the workspace settings page on the model vendor side.

The shape of the regression

By 15:30, the lead engineer (Joren) had a clean reproduction. He pulled the raw API request for a known-good contract from the prior day's logs and re-ran it, byte-for-byte, against production. The response came back at 312 output tokens. Same request, same model, same prompt, eleven hours apart. Truncated.

Then he ran the same request against his personal console with a different API key. Full output. 1,847 tokens.

The variable was the workspace, not the request. That single observation cut the search space in half.

Root cause: a flag we did not own

At some point during the morning, the workspace's zero_data_retention flag had silently flipped from true to false. The reason was structural. The model vendor had published a policy change requiring a 30-day retention floor for the model family Clausemark was running. Workspaces that had previously been opted into zero retention had their flag flipped off when the policy took effect, because zero retention was no longer a permissible state for those models. The change was announced. The change was not loud.

That alone is not a bug. The bug was in Clausemark's wrapper code.

def build_request(contract: str, ws: Workspace) -> dict:
    if ws.privacy_mode:
        return {
            "model": "frontier-3.7",
            "max_tokens": 4096,
            "system": HIGH_PRIVACY_PROMPT,
            "messages": [{"role": "user", "content": contract}],
        }
    # Fallback for non-privacy workspaces. Predates the current model.
    return {
        "model": "frontier-3.7",
        "max_tokens": 512,
        "system": STANDARD_PROMPT,
        "messages": [{"role": "user", "content": contract}],
    }

Two years earlier, when Clausemark was first set up, the engineer who wrote build_request assumed privacy mode would always be on. It was a hard requirement of one of their three flagship clients. The non-privacy branch was a “just in case” fallback with a 512-token cap, copied from an older proof-of-concept. It had never been hit in production. Until 06:14 that morning.

When the workspace flag flipped, ws.privacy_mode went false with it, and every request started taking the fallback branch. 512 output tokens is enough for a valid JSON envelope and the first few clauses of a summary. So the schema passed. The content was roughly a quarter of what the lawyers expected.

The fix, in two parts

The immediate fix took eleven minutes:

def build_request(contract: str, ws: Workspace) -> dict:
    # One output budget for every workspace; only the system prompt varies.
    return {
        "model": "frontier-3.7",
        "max_tokens": 4096,
        "system": HIGH_PRIVACY_PROMPT if ws.privacy_mode else STANDARD_PROMPT,
        "messages": [{"role": "user", "content": contract}],
    }
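
A regression test can pin the property the immediate fix restores: the output budget must not depend on the workspace. The sketch below restates the fixed function with stand-in definitions so that it runs on its own. The Workspace dataclass and the two prompt strings are placeholders, since the post does not show the real ones.

```python
from dataclasses import dataclass

# Placeholders: the real Workspace type and prompt texts are not shown in the post.
HIGH_PRIVACY_PROMPT = "high-privacy system prompt"
STANDARD_PROMPT = "standard system prompt"


@dataclass
class Workspace:
    privacy_mode: bool


def build_request(contract: str, ws: Workspace) -> dict:
    # Same shape as the fixed function above: one budget, prompt varies.
    return {
        "model": "frontier-3.7",
        "max_tokens": 4096,
        "system": HIGH_PRIVACY_PROMPT if ws.privacy_mode else STANDARD_PROMPT,
        "messages": [{"role": "user", "content": contract}],
    }


def test_output_budget_ignores_privacy_mode() -> None:
    on = build_request("clause text", Workspace(privacy_mode=True))
    off = build_request("clause text", Workspace(privacy_mode=False))
    # The budget is identical on both paths.
    assert on["max_tokens"] == off["max_tokens"] == 4096
    # The system prompt is the only field allowed to differ.
    assert {key for key in on if on[key] != off[key]} == {"system"}
```

A test like this would have failed against the original two-branch version, where the fallback path carried a 512-token cap.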

The structural fix took the next two weeks and is the reason this post exists. Eleven minutes restored output quality. The two weeks made sure the next vendor surprise would be caught before lawyers were the canary.

What we changed about how the agent runs

We worked through this with the Clausemark team and rebuilt their reliability layer. Four changes mattered.

Output-shape checks, not output-validity checks

“The JSON parsed” is not a quality signal. The new health check counts the clauses found in the input contract and compares that number against the clauses summarised in the output. A summary that covers 4 of 11 clauses is now a P1 page, not a green check. We also sample 2% of outputs into a second model that grades completeness on a 1 to 5 scale. Any sustained drop in the rolling 30-minute average triggers an alert.
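
A minimal sketch of the clause-coverage part of that check follows. It assumes top-level clauses in the input start a line with a number and a full stop ("1. Definitions"), and that the agent's JSON output carries a "clauses" list with one entry per summarised clause. The heading pattern, the field name, and the 0.9 threshold are all illustrative assumptions, not Clausemark's actual values.

```python
import json
import re

# Assumption: a top-level clause heading starts a line as "<number>. <text>".
CLAUSE_HEADING = re.compile(r"^\s*(\d+)\.\s+\S", re.MULTILINE)


def count_input_clauses(contract: str) -> int:
    """Count distinct top-level clause numbers found in the contract text."""
    return len(set(CLAUSE_HEADING.findall(contract)))


def clause_coverage(contract: str, raw_output: str) -> float:
    """Return summarised clauses / input clauses, capped at 1.0."""
    expected = count_input_clauses(contract)
    if expected == 0:
        # No detectable clause structure, so there is nothing to compare against.
        return 1.0
    # Assumption: the output JSON has a "clauses" list, one item per clause.
    summarised = len(json.loads(raw_output).get("clauses", []))
    return min(summarised / expected, 1.0)


def is_page_worthy(contract: str, raw_output: str, threshold: float = 0.9) -> bool:
    """True when the summary covers a smaller share of clauses than the threshold."""
    return clause_coverage(contract, raw_output) < threshold
```

On the incident's own numbers, a summary covering 4 of 11 clauses scores about 0.36 and pages. Real contracts number their clauses in many ways, so the heading pattern is the part most likely to need work.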

Make vendor-side state observable

The workspace settings were a black box. We added a daily job that hits the vendor's workspace admin API, pulls the workspace configuration (model access, retention setting, rate limits, default headers), diffs it against a checked-in YAML file, and opens a pull request if anything has drifted. The diff that would have caught this incident on day one looks like this:

 workspace:
   name: clausemark-prod
   models:
     - frontier-3.7
-  zero_data_retention: true
+  zero_data_retention: false

The same pattern works for any vendor whose admin surface has a read API. For vendors that do not, screenshot the settings page weekly into a versioned folder. Imperfect, but better than nothing.
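
The comparison step of such a job can be sketched with the standard library alone. The example assumes the live configuration has already been fetched into a dict; the admin API call is left out because the post does not name the vendor. It also serialises to JSON rather than YAML to avoid a third-party dependency, and the function and file names are hypothetical.

```python
import difflib
import json


def render(config: dict) -> list[str]:
    """Serialise a config deterministically so diffs are stable across runs."""
    return json.dumps(config, indent=2, sort_keys=True).splitlines()


def config_drift(expected: dict, live: dict) -> str:
    """Return a unified diff of expected vs live config; empty string if equal."""
    lines = difflib.unified_diff(
        render(expected),
        render(live),
        fromfile="workspace.expected",
        tofile="workspace.live",
        lineterm="",
    )
    return "\n".join(lines)


# Checked-in expectation, mirroring the fields in the diff above.
expected = {
    "name": "clausemark-prod",
    "models": ["frontier-3.7"],
    "zero_data_retention": True,
}
# Stand-in for what the vendor returned after the policy change.
live = dict(expected, zero_data_retention=False)

drift = config_drift(expected, live)
if drift:
    # In a real job this text would become the pull request body.
    print(drift)
```

Sorting keys before diffing matters: without it, a vendor that reorders fields in its response would open a pull request every day.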

Treat vendor policy changes as deploys

The retention change had been on the vendor changelog for weeks before it landed. Nobody at Clausemark read it, because nobody owned reading it. One engineer now has a standing Friday hour whose only purpose is reading vendor changelogs (model provider, auth provider, database vendor) and filing tickets for anything that touches behaviour, pricing, or data residency. A policy change you can read four weeks in advance is a policy change you can deploy against, instead of a policy change that deploys you.

Kill dead branches

The 512-token fallback was a branch that had not run in production in two years. It still passed code review every time the file was touched, because nobody questioned a guarded “just in case” path. We now have a quarterly review of every conditional branch in the agent pipeline that has zero hits in the prior 90 days of logs. Most get deleted. The ones that survive get a comment explaining why they exist and what would have to change for them to be deleted.
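
The zero-hit query behind that review can be as small as the sketch below. It assumes every conditional branch in the pipeline logs a stable branch id with a timestamp when it runs, and that a list of known branch ids is maintained somewhere. Both the logging convention and the branch names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def dead_branches(
    known: Iterable[str],
    hits: Iterable[tuple[str, datetime]],
    now: datetime,
    window_days: int = 90,
) -> list[str]:
    """Return known branch ids with no logged hit inside the review window."""
    cutoff = now - timedelta(days=window_days)
    seen = {branch for branch, hit_at in hits if hit_at >= cutoff}
    return sorted(set(known) - seen)


now = datetime(2026, 6, 12, tzinfo=timezone.utc)
known = ["build_request.privacy", "build_request.fallback"]
hits = [
    ("build_request.privacy", now - timedelta(days=1)),
    # Last ran long before the window opened, so it counts as dead.
    ("build_request.fallback", now - timedelta(days=400)),
]

print(dead_branches(known, hits, now))  # ['build_request.fallback']
```

The output is a review list, not a delete list: a branch with zero hits may still be a legitimate safety path, which is what the surviving comments are for.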

Takeaway

A dead branch in an AI agent pipeline is a loaded gun on the wall. Sooner or later, a vendor flag will pull the trigger.

The wider context

This is not an isolated kind of failure. The pattern repeats across the agent incidents we have read or worked through. “The agent did something weird” is almost never the right framing. The agent did exactly what the surrounding code, prompts, tool definitions, and workspace settings told it to do. The surrounding code was wrong. The surrounding code had a path nobody had thought about in two years. The surrounding code did not know its own vendor had moved the floor.

Frameworks like Burr exist partly because of this. Their pitch (build agents as inspectable state machines, with every transition logged and replayable) is not about modelling better. It is about being able to answer the question “what did the agent see, and what did it decide” twelve hours after the fact, when a paralegal walks in with a two-line summary and asks why.
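
To make the idea concrete without relying on Burr's actual API, here is a plain-Python sketch of a pipeline that records what each step saw and what it produced, and can re-run a step from its recorded input. The class, the step name, and the state keys are all invented for illustration. The step deliberately reproduces the pre-fix budget logic to show what the log would have revealed.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict
Step = Callable[[State], State]


@dataclass
class LoggedMachine:
    steps: dict[str, Step]
    log: list[dict] = field(default_factory=list)

    def run(self, order: list[str], state: State) -> State:
        for name in order:
            before = dict(state)  # shallow copy of what the step saw
            state = self.steps[name](dict(state))
            self.log.append({"step": name, "before": before, "after": dict(state)})
        return state

    def replay(self, index: int) -> State:
        """Re-run one logged step from its recorded input."""
        entry = self.log[index]
        return self.steps[entry["step"]](dict(entry["before"]))


def pick_budget(state: State) -> State:
    # Mirrors the original two-branch wrapper, for illustration only.
    return {**state, "max_tokens": 4096 if state["privacy_mode"] else 512}


machine = LoggedMachine(steps={"pick_budget": pick_budget})
final = machine.run(["pick_budget"], {"privacy_mode": False})

print(machine.log[0])  # shows privacy_mode False going in, max_tokens 512 coming out
```

The copies here are shallow, which is enough for flat state like this; nested state would need deep copies for the log to stay trustworthy.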

What this incident actually cost

Nothing went to a client. The associate who caught the truncation went back through the 117 contracts, re-ran each one through the fixed pipeline, and compared. Eleven were already in outbound email drafts; none had been sent. The total user-visible cost was zero. The internal cost was about 40 engineer-hours and a difficult conversation about why a vendor policy flip could quietly degrade output quality for almost eight hours before anyone noticed.

That conversation is the actual artefact. The eleven-minute fix did not change anything important. The conversation changed how Clausemark thinks about the boundary between code they own and state a vendor owns.

What to do this afternoon

When we rebuilt the reliability layer around Clausemark's contract-review pipeline, the problem we kept running into was that every AI agent eventually depends on a piece of state its owners do not control: a vendor flag, a model version, a workspace policy. We solved it by making that state a first-class deploy artifact, diffed and reviewed like code. If you run AI agents in production, the smallest thing you can do today is open your vendor admin console, screenshot the workspace settings, drop the image in a folder, and put a calendar reminder to do it again in a month. If anything has changed without a corresponding ticket in your tracker, you have just found your next P1.

FAQ

What actually caused the truncated summaries?

The workspace's zero_data_retention flag flipped off silently when the model vendor's new 30-day retention policy landed. That pushed every request onto a fallback code path with a 512-token output cap, a path that had been dormant for two years.

Why didn't the health check catch it?

The check validated JSON schema, token count, and refusal language. A valid JSON envelope covering only the first few clauses passed every test. Output completeness was never measured.

How do you monitor for this kind of silent regression?

Diff vendor workspace settings against a checked-in YAML on a daily cron, and add an output-completeness check that compares input structure to output coverage, not just schema validity.

Could this happen with other LLM vendors?

Yes. Any vendor flag that changes model behaviour, output limits, or routing outside your deploy pipeline can produce a silent regression. The class of bug is vendor-agnostic.

Was any client data exposed?

No. The flag flip changed retention, not access. The cost was quality degradation on internal outputs that were caught before any contract summary went to a client.

Tags: ai agents, case study, operations, architecture, workflow, business
