AI agents
AWS Bedrock data terms: anatomy of a 38-hour outage
At 19:42 on a Tuesday, a 24-person legal-tech team in Nijmegen watched their contract classifier go silent mid-deployment. Here is what broke and the 38-hour walk back.

It is 19:42 on a Tuesday in March. The on-call engineer at a 24-person legal-tech SaaS in Nijmegen pushes a routine deploy: two new clause templates, one parser fix. CI is green. The canary is green. Two minutes in, every contract uploaded to staging starts coming back with the same shape: an empty array where there should be ten or twelve classified clauses.
The product is a contract-review tool for mid-sized Dutch law firms. The classifier takes a PDF, extracts the text, and asks a hosted model to label each clause: confidentiality, indemnity, governing-law, termination-for-convenience, the usual landscape. The labels feed a reviewer UI. Without labels, the UI is a blank page.
By 20:10 the team has rolled back to the previous deploy. Same empty arrays. By 20:35 they have rolled back two more deploys. Same empty arrays. The on-call pings the founder. The outage that follows lasts 38 hours, and the cause turns out to live in a service terms PDF nobody on the team had read.
What had changed inside Bedrock
What the team did not know yet: Amazon had published an update to the Bedrock service terms three days earlier. The change broadened the set of Anthropic models for which AWS could route customer prompts and completions back to Anthropic for evaluation and future model training, unless the account had set a specific opt-out flag on each model invocation. The change had surfaced as an HN front-page story that same week, but nobody on the team had connected the headline to their own stack.
Their AWS account was not opted out. The Terraform module that provisioned the Bedrock provisioned-throughput unit had been written eight months earlier, before the opt-out mechanism existed. The terms change had taken effect on Monday. Their classifier, calling a Claude model through Bedrock with prompts that contained verbatim clauses from clients' contracts, started failing a server-side compliance check that returned an HTTP 200 with an empty content array.
200 OK. Empty content. No error. No log line worth grepping.
Managed inference providers can change terms and enforce them server-side without surfacing an error to your SDK. A 200 with an empty payload is the worst kind of failure: your code thinks the call succeeded, your monitoring thinks the call succeeded, and your users see a blank page.
Why the failure looked like our code
The classifier wrapper had been written defensively. If the model returned an empty array, the wrapper logged "no clauses detected" and returned an empty array to the caller. The wrapper had been tested against malformed PDFs and against contracts in languages the model handled badly. Empty array meant "we tried, nothing useful came back." The deploy that evening had touched the parser. The on-call engineer reasonably assumed the parser was the culprit. So did the next three engineers paged in.
They spent eleven hours bisecting the parser change before someone thought to send the same prompt directly to Bedrock with curl. Same empty content. They tried a hardcoded prompt from the unit-test fixtures, also pulled from a real contract. Same empty content. Then they tried the children's-book example from the AWS docs. Full response, normal completion. That was hour 14.
They tried the same fixture-contract prompt through a personal Anthropic key one of the engineers had on a side project, hitting the model directly. Full response. The clauses came back labelled correctly. Hour 15. At that point the team knew the problem was upstream of their code, not inside it.
Hour 18, getting unstuck
By hour 18 the working hypothesis was firm: the model was refusing prompts that contained client contract text, and the Bedrock wrapper was eating the refusal into an empty array rather than surfacing it as a structured error. They opened a support case with AWS. The first response came back six hours later and pointed them at the updated service terms and an invocation-level flag they had never set.
That was hour 24.
The team had two options. Set the flag on every model call, redeploy, and trust that nothing else in the terms had quietly shifted. Or cut over to a direct Anthropic API key with an EU data-residency addendum and stop depending on the Bedrock wrapper entirely.
They picked option two for three reasons. The contracts they processed were under Dutch law and frequently flagged for GDPR-sensitive clauses. Their general counsel was uneasy about implicit US training-data flows even with the flag set, because the audit trail for prior days was unrecoverable. And the team had been planning the move off Bedrock for six months and had a half-finished migration branch already pushed.
The cutover
The half-finished branch helped. The full cutover was still not a five-minute swap.
What had to change:
- The SDK call.
boto3.invoke_modelbecomes a direct Anthropic SDK call. - Authentication. An IAM role for Bedrock becomes an API key, stored in their existing secrets manager and rotated on the same schedule as their database credentials.
- Region pinning. Direct Anthropic processing region is pinned to the EU under their commercial agreement and a data-processing addendum, not configured in code.
- Retry and backoff. Bedrock has its own throttling story. Direct Anthropic uses different rate-limit headers and a different 429 cadence.
- Cost accounting. The team had a Bedrock dashboard wired into Grafana. Replaced with the Anthropic billing endpoint and the same Grafana panel.
The classifier wrapper had been thin, which helped. The before and after look almost the same:
import boto3, json
client = boto3.client("bedrock-runtime", region_name="eu-central-1")
resp = client.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"messages": messages,
"max_tokens": 2048,
}),
)
body = json.loads(resp["body"].read())
clauses = parse_clauses(body["content"])
import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Processing region is pinned to the EU under the data-residency
# addendum in the commercial agreement, not in this call.
resp = client.messages.create(
model="claude-3-5-sonnet-20241022",
messages=messages,
max_tokens=2048,
)
clauses = parse_clauses(resp.content)
The cutover itself took four hours. They did it behind a feature flag, shifting traffic in 10% increments, and watched the per-clause label distributions for drift against a frozen golden-set of 200 contracts. If the label distribution at 10% looked different from the Bedrock distribution at 10%, they would have pulled back. It did not. At hour 38 from the start of the outage, they were running at 100% on the direct path.
What changed in the architecture afterwards
Three things became permanent after the incident.
The first is a synthetic health check that runs every five minutes against the production inference endpoint. The check sends a prompt that looks like real contract text but contains no client data. It asserts schema and rough cardinality (between 4 and 20 clauses) on the response. Any drift fires an alert before a real contract hits the endpoint. This single change would have collapsed the 38 hours into five minutes. It is also the cheapest thing on the list.
The second is that the wrapper now distinguishes empty array from null. Empty array means "we got a usable response from the provider that contained zero clauses, the contract was probably a cover sheet." Null means "the provider failed to return a usable response, look upstream right now." The reviewer UI handles them differently. Null shows an incident banner and pauses uploads. Empty array shows "no clauses found, please review manually." Two months in, the distinction has caught two more provider-side incidents that did not become outages.
The third is that the classifier no longer depends on a single inference provider. The wrapper has a primary path (direct Anthropic) and a fallback (a smaller open-weights classifier running on their own GPU). The fallback's labels are noisier and the reviewer UI surfaces that to the lawyer using it. The fallback was first exercised live during a planned drill three weeks after the cutover. It worked. The team has not had to fall back in anger yet.
What we would tell the team that was paged at midnight
If your inference is in a managed wrapper, you do not own the contract between you and the model. You own the contract between you and the wrapper. When the wrapper renegotiates upstream, you find out by reading release notes you did not subscribe to, or by your product going silent at 19:42 on a Tuesday.
This is not an argument against managed inference. Bedrock, Vertex, and Azure OpenAI all earn their keep on procurement, billing, IAM, and one invoice instead of three. It is an argument for treating the model behind the wrapper as an external dependency in the same way you treat a payment processor or an email provider: with a health check, a documented fallback, and a quarterly review of the terms that you are actually operating under.
When we built the email-agent for a Rotterdam logistics firm earlier this year, the failure that took us the longest to diagnose was identical in shape: an inference call that returned 200 OK with empty content, no provider error, no log line. We ended up landing on the same synthetic health-prompt pattern, and a hard rule that any empty-but-successful response trips an alert before it ever reaches the user. If you are shipping AI agents into production, the failure modes that hurt you most are the silent ones.
If you have one inference call in production this week, write a five-line health-check that hits the same model, the same region, and the same prompt shape every minute, and alerts on any response that does not match your schema. That alone would have caught this outage in under five minutes.
Key takeaway
When a managed inference provider changes its terms, the failure mode is silence, not an error. Health-check your model endpoints the way you health-check your database.
FAQ
Was this a Bedrock bug or a terms change?
A terms change with server-side enforcement. The compliance check returned an empty content array instead of a refusal or an error, so the SDK and the wrapper both treated the call as a normal success.
Could an opt-out flag have fixed the outage without cutting over?
Yes. Setting the documented flag on every model invocation would have restored service in under an hour. The team chose to leave Bedrock for unrelated GDPR and provider-independence reasons that had been queued for months.
Does Anthropic offer EU data residency directly?
Yes, under a commercial agreement and data-processing addendum. Region pinning is negotiated in the contract and enforced at the platform level, not configured in the SDK call itself.
What is the smallest thing to do this week?
Add a synthetic health prompt that runs against your production inference endpoint every minute and alerts on schema drift or empty-but-successful responses. Five lines of code, catches the silent failures.