
Claude Fable casing drift: how an HR agent broke 3,400 records

At 06:40 Amsterdam time the schema validator alerted on roughly 3,400 candidate intake rows with null first names. The onboarding agent had been writing them all night.

Jacob Molkenboer · Founder · A Brand New Company · 24 Feb 2025 · 10 min


At 06:40 Amsterdam time on a Wednesday, the on-call pager for a 22-person HR-tech SaaS in Apeldoorn went off. A nightly schema validator had flagged 3,407 candidate intake rows from the last eight hours. The first_name column was null on every one of them. The agent had cheerfully written them, the database had cheerfully accepted them, and the validator had cheerfully waited until its 06:40 cron to notice.

The product was an onboarding agent. New hires upload an ID, the agent extracts fields, calls a tool to insert a candidate row, and routes them to the right manager. Standard stuff. The agent had been live for fourteen months. Nothing about its prompt, tool schemas, or downstream code had changed in the last six weeks.

What had changed: at 22:15 the night before, the team had flipped the model id from the previous Sonnet build to claude-fable-5. They did not roll back. They did not redeploy. They changed one string in one config file and shipped. The blast radius was a casing convention. Here is what happened, and what we now do differently when we run a model swap on a live agent.

The agent loop, before and after

The agent had one tool that mattered for this story: create_candidate. Its schema looked roughly like this.

{
  "name": "create_candidate",
  "description": "Insert a new candidate row from extracted ID fields.",
  "input_schema": {
    "type": "object",
    "properties": {
      "first_name":  { "type": "string" },
      "last_name":   { "type": "string" },
      "dob":         { "type": "string", "format": "date" },
      "nationality": { "type": "string" }
    },
    "required": ["first_name", "last_name", "dob"]
  }
}

On the old model, tool-call arguments almost always came back in snake_case, as declared. On the new model, for reasons we will get to, they came back in camelCase. Neither version treated the schema's field names as a hard contract, but the previous build had produced snake_case in roughly 99.6% of calls. The new build drifted the other way.

The tool handler did this:

// services/candidates.ts
export async function createCandidate(args: Record<string, unknown>) {
  await db.insert("candidates", {
    first_name:  args.first_name,
    last_name:   args.last_name,
    dob:         args.dob,
    nationality: args.nationality,
  })
}

You can see the bug already. args.first_name is undefined when the model returns firstName. The insert succeeded because every column was nullable. The row went in. The candidate showed up in the dashboard with a blank name and no nationality. The manager-routing step keyed off the candidate id, not the name, so the candidate was still routed correctly. Nobody saw a broken UI. Nobody got an exception in Sentry. The agent loop did not retry, because from its point of view the tool returned 200 OK.
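A minimal sketch of that silent failure, with the database call left out. The payload values and the toRow helper are invented for illustration, and the `?? null` stands in for the assumption that the database driver stores an undefined value as NULL.

```typescript
// Shape of the row the original handler would hand to the database.
type CandidateRow = {
  first_name: unknown;
  last_name: unknown;
  dob: unknown;
  nationality: unknown;
};

// Hypothetical payloads: the same data under two key conventions.
const snakePayload: Record<string, unknown> = {
  first_name: "Iris",
  last_name: "de Vries",
  dob: "1990-04-12", // made-up date, for illustration only
};
const camelPayload: Record<string, unknown> = {
  firstName: "Iris",
  lastName: "de Vries",
  dob: "1990-04-12",
};

// Mirrors the original handler's literal key reads.
// Assumption: the driver writes undefined as NULL, modeled here with `?? null`.
function toRow(args: Record<string, unknown>): CandidateRow {
  return {
    first_name: args.first_name ?? null,
    last_name: args.last_name ?? null,
    dob: args.dob ?? null,
    nationality: args.nationality ?? null,
  };
}

const goodRow = toRow(snakePayload); // first_name is "Iris"
const badRow = toRow(camelPayload); // first_name is null, and nothing throws
```

Property access on a missing key is not an error in JavaScript, so the renamed fields become nulls without any exception. Keys that are spelled the same in both conventions, such as dob, still come through.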

Why the keys changed

We do not know exactly why the new model preferred camelCase. The team's working hypothesis in the post-mortem is that frontier models absorb naming conventions from their training corpus, and JavaScript and TypeScript code dominate that corpus. snake_case is a Python and SQL convention; camelCase is a JS convention. The schema declared snake_case, but JSON Schema does not pin field naming as an invariant in the way you might assume.

The model could read the schema as "produce JSON with these four keys, of these types, with these three required" and produce JSON that satisfies those constraints under any aliasing it found natural. That sounds like a model bug. It isn't. It is the schema being under-specified relative to what we needed. If a contract permits multiple correct readings, expect to see all of them eventually, and expect the distribution to shift the moment the writer on the other side changes.

Strict mode that wasn't

The team had set strict: true on the tool definition. They assumed this meant the model could not return arguments that didn't match the schema. What it actually means, per Anthropic's tool-use documentation, is that the model is constrained to produce JSON conformant with the declared types. Field naming is part of that contract on paper. In practice, when the model returned extra fields, the validator inside the team's SDK version treated them as additional properties and silently passed them through.

The reason is mechanical. The tool's input_schema was a normal JSON Schema object with properties declared and no additionalProperties: false. By default, JSON Schema permits additional properties. The schema generator the team used had never bothered to set the flag, because the previous model had not had a habit of emitting extras. When the new model started emitting first_name as firstName, the validator saw that as an unknown but allowed extra property, and the declared snake_case properties were simply missing from the payload. Missing optional properties pass validation. The required ones were technically missing too, but at this layer the SDK's strict-mode enforcement only type-checked the fields that were present; it did not run a final required-key sweep over the payload it was about to hand back.

That isn't the model's fault. It is the gap between "the schema says snake_case" and "the validator rejects everything else." The chain was:

1. Model emits { "firstName": "Iris", "lastName": "de Vries", ... }.
2. SDK accepts the tool call as schema-valid JSON.
3. Handler reads args.first_name, gets undefined.
4. Insert runs with nulls. No error.

Strict mode in the SDK did not turn unexpected casing into a hard reject. The model was producing valid JSON. The keys it chose simply weren't the keys the handler asked for.
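The two checks that were missing from that chain, an unknown-key rejection and a required-key sweep, can be sketched in a few lines. This is not a JSON Schema validator and not the SDK's behavior; the ToolSchema type and the checkKeys function are hypothetical names, and the schema object repeats only the parts of create_candidate needed here.

```typescript
// Just enough of the tool schema to check key names.
interface ToolSchema {
  properties: Record<string, { type: string }>;
  required: string[];
}

const createCandidateSchema: ToolSchema = {
  properties: {
    first_name: { type: "string" },
    last_name: { type: "string" },
    dob: { type: "string" },
    nationality: { type: "string" },
  },
  required: ["first_name", "last_name", "dob"],
};

// Reports keys the schema never declared and required keys that are absent.
// A payload is only safe to hand to the handler when both lists are empty.
function checkKeys(
  schema: ToolSchema,
  payload: Record<string, unknown>,
): { unknown: string[]; missing: string[] } {
  const declared = Object.keys(schema.properties);
  const present = Object.keys(payload);
  return {
    unknown: present.filter((key) => declared.indexOf(key) === -1),
    missing: schema.required.filter((key) => present.indexOf(key) === -1),
  };
}

// The payload shape from the incident, with an invented dob value.
const report = checkKeys(createCandidateSchema, {
  firstName: "Iris",
  lastName: "de Vries",
  dob: "1990-04-12",
});
// report.unknown → ["firstName", "lastName"]
// report.missing → ["first_name", "last_name"]
```

Either list being non-empty is enough to reject the call before the handler runs. Setting additionalProperties: false in the schema expresses the first check declaratively, for validators that enforce it.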

Warning

If your tool handler reads arguments by literal key name and your database columns are nullable, a model behavior shift can write garbage for hours before anything raises. Your error budget is whatever the gap is between "data lands" and "data gets read."

The 06:40 alert and the next 47 minutes

The validator is a nightly job that checks every row against a strict-mode Pydantic model mirroring the database schema. It runs at 06:40 because that is when the Apeldoorn ops lead wants to see results before standup. It flagged 3,407 rows with null first_name and null nationality. The fingerprint was identical across every row: the same two fields blank, every other field present.

That fingerprint was the lucky break. We knew within ten minutes that it wasn't a database bug, a network blip, or a corrupt upload. Two specific columns were going null in lockstep, on every row, for eight hours straight. That is a code path, not an outage.

The business impact was already visible by the time we caught it. The Apeldoorn HR-tech client had fourteen corporate accounts using the platform; each had received a slice of overnight applicants. Morning-shift managers were already opening cases where they could not find a candidate by name in the search box, because the name was blank. Six of them had started phoning candidates to re-confirm details, which is exactly the work the agent existed to remove. A 47-minute fix window matters when the alternative is your customer's customer getting called twice before 09:00.

The on-call engineer pulled three things in parallel:

The model id in production config: claude-fable-5.
The last twenty raw tool-call payloads from the request log.
The diff of the agent prompt and tool schemas over the last 30 days.

The raw payloads were the smoking gun. Every single one returned camelCase keys. Payloads from 22:00 the previous night, before the model flip, were snake_case. Payloads from 22:30 onward were camelCase. From the model's point of view, both are valid JSON for a schema that did not pin casing as an invariant.

The repair

By 07:27 the fix was in production. Three changes:

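// services/candidates.ts
import { z } from "zod"

// Known camelCase aliases mapped back to the declared snake_case keys.
const KEY_MAP: Record<string, string> = {
  firstName: "first_name",
  lastName: "last_name",
  dateOfBirth: "dob",
}

// Rename mapped aliases; leave every other key untouched so the
// strict parse below can reject anything unexpected.
function normalize(args: Record<string, unknown>) {
  const out: Record<string, unknown> = {}
  for (const [k, v] of Object.entries(args)) {
    out[KEY_MAP[k] ?? k] = v
  }
  return out
}

// .strict() makes unknown keys a parse error instead of stripping them.
const CandidateSchema = z.object({
  first_name:  z.string().min(1),
  last_name:   z.string().min(1),
  dob:         z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  nationality: z.string().min(2),
}).strict()

// Parse before insert: a bad payload throws here and never reaches the database.
export async function createCandidate(rawArgs: Record<string, unknown>) {
  const args = CandidateSchema.parse(normalize(rawArgs))
  await db.insert("candidates", args)
}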

The Zod schema is set to .strict(), which rejects unknown keys instead of silently dropping them. The handler normalizes the camelCase aliases we know about before parsing. If a future model emits something we haven't mapped, the parse throws, the tool call returns an error to the agent, and the agent retries. Loud is better than silent. Pydantic's strict mode has the same shape on the Python services we run.

Backfill was straightforward because the team had the raw model output stored. We re-extracted the missing fields from the original tool-call payloads and patched the rows. By 09:15 the data was clean and the manager workload was caught up. We logged the incident, kept the raw payloads as a regression fixture, and added a unit test that asserts the handler tolerates both casings.
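A regression test along those lines might look like the sketch below. To keep it dependency-free, a hand-written check stands in for the Zod parse and is simpler than the real schema (it only requires non-empty strings). The fixtures are invented values, and parseCandidate and runCasingRegression are hypothetical names.

```typescript
const KEY_MAP: Record<string, string> = {
  firstName: "first_name",
  lastName: "last_name",
  dateOfBirth: "dob",
};
const COLUMNS = ["first_name", "last_name", "dob", "nationality"];

// Same renaming step as the repaired handler.
function normalize(args: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    out[KEY_MAP[key] ?? key] = args[key];
  }
  return out;
}

// Stand-in for the strict Zod parse: reject unknown keys,
// require every column to be a non-empty string.
function parseCandidate(rawArgs: Record<string, unknown>): Record<string, string> {
  const args = normalize(rawArgs);
  for (const key of Object.keys(args)) {
    if (COLUMNS.indexOf(key) === -1) throw new Error(`unknown key: ${key}`);
  }
  const row: Record<string, string> = {};
  for (const col of COLUMNS) {
    const value = args[col];
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`missing or empty: ${col}`);
    }
    row[col] = value;
  }
  return row;
}

// Invented fixtures; in practice these would be stored raw payloads.
const SNAKE_FIXTURE = { first_name: "Iris", last_name: "de Vries", dob: "1990-04-12", nationality: "NL" };
const CAMEL_FIXTURE = { firstName: "Iris", lastName: "de Vries", dateOfBirth: "1990-04-12", nationality: "NL" };
const UNMAPPED_FIXTURE = { given_name: "Iris", last_name: "de Vries", dob: "1990-04-12", nationality: "NL" };

// Throws on the first failed expectation, returns true otherwise.
function runCasingRegression(): boolean {
  const fromSnake = parseCandidate(SNAKE_FIXTURE);
  const fromCamel = parseCandidate(CAMEL_FIXTURE);
  for (const col of COLUMNS) {
    if (fromSnake[col] !== fromCamel[col]) throw new Error(`casing mismatch on ${col}`);
  }
  let rejected = false;
  try {
    parseCandidate(UNMAPPED_FIXTURE);
  } catch {
    rejected = true;
  }
  if (!rejected) throw new Error("unmapped key was accepted");
  return true;
}
```

The third fixture matters as much as the first two: tolerating known aliases is only safe if an alias nobody has mapped still fails loudly.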

Three shifts in how we ship models

The incident wasn't really about casing. It was about three assumptions that turned out to be soft.

1. A model id change is a deployment, not a config toggle

Flipping a model id should go through the same review and canary that a code change does. The team had treated it as a config toggle. Now the model id is pinned per environment, changes go through a PR, and a 1% canary runs for 24 hours against a shadow handler that diff-checks tool-call payloads against the last known good distribution.

The shadow handler runs in parallel with the production handler for the canary's traffic share. Every payload is normalized, hashed by its key set, and compared to the distribution from the previous seven days. If more than 2% of payloads in a 30-minute window present a key set we have never seen, the canary fails and the rollback is automatic. We get the diff in Slack with the specific new keys, the model id, and a sample payload. Two operations engineers can read that message in a minute and know whether to ship or revert.
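The core of that comparison can be sketched as follows. This is an illustration under assumptions, not the team's shadow handler: the sorted key list stands in for a hash, the baseline is passed in as a plain array of fingerprints, and windowing, Slack reporting, and rollback are left out. All function names are hypothetical; the 2% threshold is the one quoted above.

```typescript
// Stable fingerprint of a payload: its key names, sorted and joined.
function keySetFingerprint(payload: Record<string, unknown>): string {
  return Object.keys(payload).sort().join(",");
}

// Share of recent payloads whose key set never appears in the baseline.
function unseenShare(
  baselineFingerprints: string[],
  recentPayloads: Record<string, unknown>[],
): number {
  if (recentPayloads.length === 0) return 0;
  let unseen = 0;
  for (const payload of recentPayloads) {
    if (baselineFingerprints.indexOf(keySetFingerprint(payload)) === -1) {
      unseen++;
    }
  }
  return unseen / recentPayloads.length;
}

const UNSEEN_THRESHOLD = 0.02; // 2% of payloads in the window

// True when the canary should fail and trigger a rollback.
function canaryFails(
  baselineFingerprints: string[],
  recentPayloads: Record<string, unknown>[],
): boolean {
  return unseenShare(baselineFingerprints, recentPayloads) > UNSEEN_THRESHOLD;
}
```

Fingerprinting the key set rather than the values keeps the check cheap and free of personal data, and a casing flip shows up as a brand-new fingerprint on the first payload.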

2. Nullable columns are a vulnerability when an agent is the writer

Humans typing into a form get a required-field error. Agents writing through an SDK get an optimistic insert. We made every column on agent-written tables NOT NULL or wrapped them in a domain-level guard that throws before the insert. The database is the last honest reviewer; let it do its job. Where a column has to stay nullable for a legitimate reason, the table now carries a CHECK constraint that asserts the row makes sense as a whole, rather than letting each column wave its own permission through.
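A domain-level guard of the kind described can be very small. The sketch below is an assumption about its shape, not the team's code: assertNoEmptyColumns is a hypothetical name, and treating an empty string the same as null is a choice made for this example.

```typescript
// Throws before the insert if any required column is null, undefined, or "".
function assertNoEmptyColumns(
  row: Record<string, unknown>,
  requiredColumns: string[],
): void {
  const empty = requiredColumns.filter((col) => {
    const value = row[col];
    return value === null || value === undefined || value === "";
  });
  if (empty.length > 0) {
    throw new Error(`refusing insert, empty required columns: ${empty.join(", ")}`);
  }
}
```

Called immediately before the insert, the guard turns the overnight stream of null rows into a failure on the first call. A NOT NULL constraint gives the same guarantee one layer lower, so the two work as belt and braces rather than alternatives.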

3. The contract has to be enforced on your side too

Strict mode in a tool SDK is a property of the parser, not of the contract. Every tool handler in production now passes arguments through a Zod or Pydantic strict parser before it touches the database. Unknown keys are a hard reject. The agent gets a structured error back and tries again. We log every parse failure with the raw payload, the model id, and the agent prompt version, so when a model behavior shifts we see it inside an hour, not eight.
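One way to wire the reject, the structured error, and the log entry together is sketched below. The types and the guardedParse name are hypothetical, the failure log is a plain array standing in for a real logging pipeline, and the result shape is generic rather than any SDK's tool-result format.

```typescript
interface ToolContext {
  modelId: string;
  promptVersion: string;
}

interface ParseFailure extends ToolContext {
  rawPayload: string; // JSON of exactly what the model sent
  reason: string;
}

type ToolResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Runs a strict parser over the raw arguments. On failure it records the
// payload with its context and returns an error the agent can act on.
function guardedParse<T>(
  rawArgs: Record<string, unknown>,
  parse: (args: Record<string, unknown>) => T,
  ctx: ToolContext,
  failures: ParseFailure[],
): ToolResult<T> {
  try {
    return { ok: true, value: parse(rawArgs) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    failures.push({
      modelId: ctx.modelId,
      promptVersion: ctx.promptVersion,
      rawPayload: JSON.stringify(rawArgs),
      reason,
    });
    return { ok: false, error: reason };
  }
}
```

The parse argument would be the Zod or Pydantic-style strict parser for the tool in question. Keeping the raw payload in the log entry is what makes a later backfill possible without re-running the model.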

There is a broader point here that has been showing up across the industry as new model versions ship more frequently. Behavior shifts in frontier models don't fire alarms; they degrade outputs in ways your existing tests don't cover. Confidence in a model output is not the same as a guarantee of one, and the only people who find out before their customers do are the ones who built the validators themselves. The teams getting this right write fewer prompts and more parsers.

The five-minute audit you can run today

If you have an agent in production today and you read this far, here is a five-minute check. Open one tool handler. Find the line that reads an argument by literal key. Ask yourself two questions.

First: if the model returned the same data under a slightly different key name, would your handler notice? Second: if it didn't notice, what is the worst row it could write to the database? If the answer to the second question is "a row that no validator will catch for hours," fix that one handler before you do anything else. The rest can wait.

When we built the onboarding agent for the Apeldoorn HR-tech client, the thing we ran into was exactly this gap between "JSON parsed fine" and "the row means what it should." We ended up wrapping every tool handler in a strict schema parser and adding the shadow canary to every model swap, which is the kind of AI agent review we now run by default on any production deployment.

Today's smallest action: grep your tool handlers for direct key reads. Count them. That number is your blast radius the next time a model behavior shifts under you.
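If grep is not to hand, the same count can be approximated in a few lines. This is a rough sketch: countDirectKeyReads is a hypothetical helper, it assumes the handler's argument object has a known parameter name (args by default), and it only catches dot access, not bracket access or destructuring.

```typescript
// Counts occurrences of `<paramName>.<identifier>` in a source string.
function countDirectKeyReads(source: string, paramName = "args"): number {
  const pattern = new RegExp(`\\b${paramName}\\.[A-Za-z_]\\w*`, "g");
  const matches = source.match(pattern);
  return matches ? matches.length : 0;
}

// The field reads from the original handler in this article.
const handlerSource = `
  first_name:  args.first_name,
  last_name:   args.last_name,
  dob:         args.dob,
  nationality: args.nationality,
`;
const directReads = countDirectKeyReads(handlerSource); // → 4
```

A regex count over-reports and under-reports at the edges, so treat the number as a prompt to read the handlers, not as a metric.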

Key takeaway

A model rollout is a deployment, not a config change. Pin the version, canary the swap, and reject unknown keys at the tool handler.

FAQ

What is schema drift in an AI agent context?

When the shape of model outputs (field names, casing, optional keys) shifts between rollouts without a code change, and downstream handlers silently misread the payload while the JSON itself stays valid.

Why didn't strict mode catch the casing change?

Strict mode constrained JSON types, not key naming. The model returned valid JSON with camelCase keys; the handler read snake_case keys and got undefined. The SDK validator passed unknown keys through because the schema did not set additionalProperties: false.

How quickly should a model rollout be reversible?

Within minutes. Pin the model id per environment, gate the flip behind a feature flag, and run a 1% canary with a shadow handler that diff-checks tool-call payloads against the last known good distribution.

Should tables written by an agent be nullable?

Treat agent-written columns the same as required form fields. Make them NOT NULL or wrap them in a domain guard. An agent writing nulls is the equivalent of a form letting users submit a blank required field.

How do you backfill records broken by a silent agent bug?

Store the raw model output for every tool call. If a downstream handler corrupts a row, you can re-derive the correct values from the original payload without re-running the model or asking the user again.
