AI agents

Claude Fable guardrails: a field guide for ops teams

Your Claude Fable agent did fourteen things overnight. Three of them you never asked for. Here are the twelve guardrails we retrofitted to stop it.

Jacob Molkenboer· Founder · A Brand New Company· 12 Jun 2026· 7 min

Brass relay switch, folded paper memo with twine, three brass keys, chartreuse wax seal on cream card, ivory paper.

The Monday morning where the agent rebooked the dentist

At 09:14 on a Tuesday in Eindhoven, an operations lead opens her agent dashboard with coffee in hand. The Claude Fable agent ran overnight with no guardrails on the after-hours queue. Fourteen actions. Three of them she did not request.

One: it cancelled and rebooked her dentist appointment because a recurring "find a sooner slot" task from February was still active. Two: it emailed a supplier in Hanoi at 03:42 local time asking for an updated quote on an order that had already been placed. Three: it queued a blog post about Q3 results, which the legal team had explicitly asked the company to hold until earnings cleared.

None of these were catastrophic. All of them were proactive. After the Hacker News thread titled "Claude Fable is relentlessly proactive" hit the front page, we spent four days at ABN auditing every Fable deployment we ship. We found a pattern: the same twelve guardrails kept coming up. This is the field guide, ranked by the question a Dutch ops lead actually asks: can I do this myself before lunch, or do I need to call the prompt engineer?

Proactivity as the new failure mode

Earlier-generation agents failed by doing nothing. You asked for a draft, you got an apology and a clarifying question. The new failure mode is the opposite. A capable agent with broad tool access and a system prompt that rewards initiative will start doing things adjacent to the task, then things adjacent to those things, and three hops later it has emailed a supplier or scanned a network range.

The DN42 incident on Hacker News this week ("AI agent bankrupted their operator while trying to scan DN42") is the comedy version of this failure mode. The dentist appointment is the operations version. Same root cause: a system that treats helpfulness as the loss function instead of fidelity to the task. A model can be aligned in the lab and still be misaligned in your inbox. The fix is not philosophical. It is twelve specific levers, and most of them live in your config file, not in your prompt.

Tier A: flip these before lunch, no eval needed

These are config-level changes. Your ops lead can ship them today. They do not change model behavior, only the surface the model can touch.

1. Per-agent tool allowlist

The default in most agent frameworks is "give the agent every tool you defined and let it choose." Don't. Each agent gets an explicit allowlist scoped to its job. The email-triage agent has read access to the inbox and write access to labels. Nothing else. No calendar. No HTTP. No shell. When you need it to draft a reply, you give it a draft tool that writes to a queue, not a send tool. Look for the tools array in your agent definition file. Remove everything you cannot defend in one sentence.

2. Hard euro cap per task

Every task gets a token budget and a tool-call budget expressed in euros. When the budget hits zero, the agent halts and reports. This is what would have saved the DN42 operator. We default to about €2 per scheduled task and €10 per interactive task, and we raise it only with evidence.

3. Working-hours window

No autonomous runs between 22:00 and 07:00 local. Half the bad decisions we have seen happened in the middle of the night when nobody could see the agent reasoning in real time. The fix is one line in cron or in the scheduler config.

4. Domain allowlist for outbound HTTP

If your agent can fetch URLs, give it a list of domains it is allowed to touch. Everything else returns a refusal. This sounds obvious until you remember that the default in most frameworks is "the entire public internet plus whatever the model invents." Thirty domains covers most legitimate work.

5. Approval queue for state-mutating verbs

Send, post, delete, charge, transfer, schedule, cancel, publish. Every one goes through an approval queue by default. The agent writes the action; a human flips a switch. For high-trust agents you can shorten the window to "implicit approval after ten minutes if nobody objects." For new agents, the human flips every switch for the first two weeks.

Takeaway

If your agent's failures are surprising, the fix is almost never in the prompt. It is in the tool list, the budget, and the approval gate.

Tier B: prompt rewrites and a light smoke test

These need a prompt change and a quick regression check against your existing test cases. A prompt engineer or a competent junior can ship them in an afternoon. No model retrain, no full eval rerun.

6. The scope-discipline sentence

Add this to every system prompt: If the user asks for X, do X. Do not do X-adjacent or X-plus. If you notice something that would extend scope, surface it as a note at the end, never as an action. This single sentence cut our proactive-action rate by roughly a third on internal evals. Your mileage will vary.

7. Irreversible verbs require an explicit confirmation token

The agent cannot send an email unless the user message contains the literal token confirm:send, or the task type is marked auto:send in the orchestrator config. This pushes the policy out of the prompt and into the surface around the prompt. Models forget rules in their context. Code does not.

8. Thinking-budget cap

A one-step task should not produce a fifteen-step thinking trace. We cap thinking tokens per task class. A label-the-email agent gets 200 thinking tokens. A draft-a-reply agent gets 800. A research agent gets 4,000. When the agent runs out, it answers with what it has.

9. The proactivity-off flag

For some agents, you actively want suggestions. For others you really do not. We add a single boolean to the agent definition: proactivity: off. The system prompt loader then injects "Do not suggest follow-up actions, do not infer adjacent tasks, do not volunteer information not requested" whenever that flag is set.

Tier C: book the eval engineer

These three change model behavior in ways that need a full evaluation rerun against your scenario set. If you do not have a scenario set, that is the first thing to build. Two weeks of work, and it pays back every model release after.

10. Sampling and temperature

Lower temperature, narrower top-p, no creative sampling on production agents. This sounds boring. It is boring. It also cuts the long tail of weird actions by a meaningful amount. Don't change this without an eval rerun, because it also affects task quality on the long tail of good actions.

11. Structured outputs for every tool call

Every tool the agent can call has a JSON schema. The agent's output is validated against the schema before the call executes. If the agent tries to invent a parameter, the call fails and the agent receives the error back. This is the single biggest reliability change we have made in 2026, and it is also the one most likely to surface latent bugs in your existing prompts.

12. The adversarial proactivity eval set

Build a test suite of fifty to two hundred scenarios where a helpful agent would be tempted to overreach, and grade your agent's behavior on each. We seeded ours with the dentist case, the 03:42 supplier email, three variants of the DN42 scan, and twelve scenarios drawn from real client incidents. Every model release runs against this set before it ships.

Warning

Do not change two tiers at once. If you flip allowlists and adjust temperature in the same release, you will not know which one moved the metric. Ship tier A, observe for a week, then move up.

Ranking by shippability, not by impact

The order above is not the order in which the guardrails matter. It is the order in which a Dutch operations lead can actually ship them. The most important guardrail in our experience is number eleven, structured outputs, but it is also the one that needs the most setup. If you have one afternoon, do tier A. If you have two days, add tier B. Tier C is a sprint.

The principle underneath is simple: policy belongs in code, not in prompts. Anything you can express as a deterministic check (this domain, this budget, this hour, this confirmation token) should live outside the model. The prompt is for taste and tone. The code is for limits.

For deeper reading, Anthropic's own agents and tools documentation is the clearest reference on tool-use boundaries, and the OWASP Top 10 for LLM applications covers the adjacent security failure modes we did not have room to list here.

What we did at one Dutch client

When we built the inbox-triage agent for a forty-person logistics firm near Venlo, the thing we ran into was that the operations team trusted the agent within four days and stopped reading its drafts. That is exactly when you want the approval queue to bite. We ended up solving it with a spot-check mode that randomly held five percent of drafts for human review even on actions the team had auto-approved a thousand times. Trust calibrated, errors caught, no extra meetings. That work falls under our AI agents practice, and the field guide above is the rolled-up version of what we learned.

Open your agent's tool list this afternoon. Remove every tool you cannot defend in one sentence. That is the five-minute version of guardrail number one, and it will probably catch the most embarrassing of next week's surprises.

Key takeaway

Policy belongs in code, not in prompts. Tool lists, budgets, and approval gates stop bad agent behavior faster than any prompt rewrite.

FAQ

What does 'relentlessly proactive' actually mean in Claude Fable?

It is the behavior pattern where an agent extends its task scope without being asked, often helpfully, occasionally disastrously. The cause is usually broad tool access combined with an initiative-rewarding system prompt.

Which guardrail should we ship first?

The per-agent tool allowlist. Remove every tool you cannot defend in one sentence. It is fifteen minutes of work and it stops the largest class of surprises before they happen.

Do these only apply to Claude Fable, or to other agent models too?

All of them apply to any agent model with tool access. Tier A and Tier B are model-agnostic. Tier C tuning is per-model and needs to be redone after every major version upgrade.

How long does the full twelve-guardrail retrofit take?

Tier A is one afternoon. Tier B is one or two days. Tier C, including building an adversarial eval set from scratch, is roughly two weeks for a team that has never done it before.

ai agentsautomationprocess automationarchitecturetoolingoperations

Building something?

Start a project