Capping Claude spend at €1,500 per seat: a real playbook

The CFO opened the Anthropic dashboard on a Tuesday and saw €11,400 in Claude charges for May. We had to cap usage without breaking the agent.

Jacob Molkenboer · Founder · A Brand New Company · 9 Apr 2024 · 9 min

The CFO at a Dutch logistics company opened her Anthropic dashboard on a Tuesday morning and saw €11,400 in Claude charges for May. Seven seats. Average €1,629 each. One seat at €3,890. The agent had shipped in late April and was now answering customer queries, drafting shipment exception emails, and reconciling EDI mismatches. Nobody disputed the value. But the bill had no ceiling, and finance does not like things without ceilings.

We had two weeks to give her a number she could plan against. We landed on €1,500 per seat per month, hard. It is the same ceiling finance was already applying to other paid engineering tools, which made the conversation short. The harder part was capping spend without turning the agent into a useless degraded shell every time someone hit the limit on the 23rd of the month.

This is the playbook. Three pieces: a budget gate that runs before every model call, a fallback chain that swaps to a cheaper model when a seat gets close to its ceiling, and a daily Slack ping that tells finance where the money went before they ask.

Why per-seat beats per-org

Every team we have helped budget Claude starts with the same instinct: put one number on the org-wide API key and cap there. It is the wrong number. One power user (the head of operations who lives in the agent) will eat half the budget, and the other six seats will hit the wall at the end of the month for reasons that have nothing to do with their own usage. When the wall hits, the agent stops working for everyone at once. That is the worst possible failure mode.

Per-seat caps solve a different problem. They give each user a budget that matches their role. They make overage debuggable (one seat is over, not "the agent"). And they give finance a unit cost they can model: seats times ceiling equals worst-case spend, plus a small buffer for context-heavy prompts that slipped through the estimator.
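
As a rough illustration of that unit-cost model, the arithmetic fits in one function. The 2% buffer is a made-up default for the sketch; the playbook only says the buffer should be small.

```typescript
// Worst-case monthly spend: seats × per-seat ceiling, plus a buffer for
// calls whose real cost overshoots the estimate.
// bufferPct = 0.02 is a hypothetical default, not a figure from this post.
function worstCaseSpendEur(
  seats: number,
  capPerSeatEur: number,
  bufferPct = 0.02,
): number {
  return seats * capPerSeatEur * (1 + bufferPct);
}

// Seven seats at €1,500 each, with and without the buffer.
console.log(worstCaseSpendEur(7, 1500).toFixed(2));
console.log(worstCaseSpendEur(7, 1500, 0).toFixed(2));
```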

The €1,500 figure is not magic. It is roughly the cost of a paid engineering tool that genuinely replaces a half-day of work per week. Below that, finance treats it like SaaS. Above that, it starts looking like a headcount conversation, and the calculus changes. Pick your own number, but pick one per seat, not one per org.

The budget gate

The gate is a function that runs before every call to Anthropic's API. It does four things: read the seat's spend so far this month, estimate the cost of the next call, decide which model to route to, and write the result back to the ledger after the response returns.

We store the ledger in Postgres because we need transactional writes (input tokens and output tokens always land together) and because finance wants to query it without asking us. One row per call.

create table claude_calls (
  id            bigserial primary key,
  seat_id       text not null,
  billing_month date not null,
  model         text not null,
  task_type     text not null,
  in_tokens     int  not null,
  out_tokens    int  not null,
  cached_in     int  not null default 0,
  cost_eur      numeric(10,4) not null,
  was_fallback  boolean not null default false,
  created_at    timestamptz not null default now()
);

create index claude_calls_seat_month_idx
  on claude_calls (seat_id, billing_month);
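
The gate's month-to-date lookup could be built against this table roughly as follows. This is a sketch, not the production query: it assumes billing_month holds the first day of the month, computes that day in UTC (the client runs in Europe/Amsterdam, so the real boundary may differ), and only builds the query object so the example runs without a database. The function names are hypothetical.

```typescript
// Query text plus positional parameters. node-postgres accepts an object of
// this shape as a query config; executing it is left out of the sketch.
interface SqlQuery {
  text: string;
  values: string[];
}

// First day of the month (UTC) as YYYY-MM-DD, matching the assumed
// convention for the billing_month column.
function billingMonthOf(at: Date): string {
  const year = at.getUTCFullYear();
  const month = String(at.getUTCMonth() + 1).padStart(2, "0");
  return `${year}-${month}-01`;
}

// Sum of cost_eur for one seat in the month containing `at`.
// coalesce() makes a seat with no calls yet return 0 instead of null.
function spentThisMonthQuery(seatId: string, at: Date): SqlQuery {
  return {
    text:
      "select coalesce(sum(cost_eur), 0) as used_eur " +
      "from claude_calls where seat_id = $1 and billing_month = $2",
    values: [seatId, billingMonthOf(at)],
  };
}
```

The (seat_id, billing_month) index above is what keeps this lookup cheap enough to run before every call. Note that node-postgres returns numeric columns as strings by default, so the result needs parsing before it is compared with the cap.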

The estimator is the only piece that needed real thinking. Input cost is exact because we count tokens before sending, using Anthropic's token-counting endpoint. Output cost is a forecast, because we do not know how long the model will talk. We use the seat's rolling 7-day average output length per task type, multiplied by 1.3 to leave headroom. Over-estimating beats under-estimating, because the gate is what stops a runaway loop from charging €40 for one query.

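// budget-gate.ts
// spentThisMonth, countTokens and rollingAvgOutput live elsewhere and are not
// shown in this post; the declarations only document the assumed signatures.
type Model = "claude-sonnet-4-7" | "claude-haiku-4-5";
interface Prompt { text: string; taskType: string; fallbackOk: boolean }
declare function spentThisMonth(seatId: string): Promise<number>;
declare function countTokens(model: Model, text: string): Promise<number>;
declare function rollingAvgOutput(seatId: string, taskType: string): Promise<number>;

// Minimal error type (assumed shape) so callers can render the block message.
export class BudgetExceeded extends Error {
  constructor(public seatId: string, public usedEur: number) {
    super(`seat ${seatId} has used €${usedEur.toFixed(2)} of its monthly cap`);
  }
}

const MONTHLY_CAP_EUR = 1500;
const FALLBACK_AT_PCT = 0.85;
const OUTPUT_HEADROOM = 1.3; // over-estimate output on purpose

// Per-token prices, treated as EUR because the cap is in EUR. Convert first
// if your price list is quoted in another currency.
const PRICE: Record<Model, { in: number; out: number }> = {
  "claude-sonnet-4-7": { in: 3.0 / 1e6, out: 15.0 / 1e6 },
  "claude-haiku-4-5":  { in: 0.8 / 1e6, out:  4.0 / 1e6 },
};

export async function pickModel(seatId: string, prompt: Prompt) {
  const usedEur = await spentThisMonth(seatId);
  const pct = usedEur / MONTHLY_CAP_EUR;

  // Hard stop: the seat has already reached its cap.
  if (pct >= 1.0) throw new BudgetExceeded(seatId, usedEur);

  // Past the threshold, downgrade only tasks the cheaper model can handle.
  const model: Model = pct >= FALLBACK_AT_PCT && prompt.fallbackOk
    ? "claude-haiku-4-5"
    : "claude-sonnet-4-7";

  // Input cost is exact; output cost is a forecast with headroom.
  const inTokens    = await countTokens(model, prompt.text);
  const expectedOut = await rollingAvgOutput(seatId, prompt.taskType);
  const projected   = PRICE[model].in  * inTokens
                    + PRICE[model].out * expectedOut * OUTPUT_HEADROOM;

  // Refuse the call if it would push the seat over the cap.
  if (usedEur + projected > MONTHLY_CAP_EUR) {
    throw new BudgetExceeded(seatId, usedEur);
  }

  return { model, projected };
}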

The function is short on purpose. Every line of policy you add to the gate is a line that gets debugged at 23:00 when a seat is stuck at "agent is thinking" because of a math edge case. We kept the gate boring and put complexity in the fallback policy instead, where it is easier to reason about and easier to revert.

One gotcha worth naming up front: do not estimate output tokens from the prompt length. The correlation is weak for any task involving structured output (JSON, code, classification lists). A short prompt asking for a 40-row JSON array will produce 5,000 output tokens; a long prompt asking for a yes/no answer will produce 8. Use a rolling per-task-type average, refresh it nightly, and trust it more than your intuition.
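
A minimal sketch of that rolling average, computed in memory from ledger rows so it runs on its own. In production this would more likely be a nightly SQL aggregate over claude_calls. The record shape, the function names, and the 1,000-token default for task types with no history are assumptions for illustration.

```typescript
interface CallRecord {
  taskType: string;
  outTokens: number;
  createdAt: Date;
}

const WINDOW_DAYS = 7;
const HEADROOM = 1.3; // same multiplier the gate applies

// Mean output tokens for one task type over the trailing 7-day window.
// Falls back to `defaultTokens` when there is no recent history; that
// default is a placeholder, not a value from the playbook.
function rollingAvgOutputTokens(
  calls: CallRecord[],
  taskType: string,
  now: Date,
  defaultTokens = 1000,
): number {
  const cutoff = now.getTime() - WINDOW_DAYS * 24 * 60 * 60 * 1000;
  const recent = calls.filter(
    (c) => c.taskType === taskType && c.createdAt.getTime() >= cutoff,
  );
  if (recent.length === 0) return defaultTokens;
  const total = recent.reduce((sum, c) => sum + c.outTokens, 0);
  return total / recent.length;
}

// Forecast used for budgeting: the average plus headroom, rounded up.
function forecastOutputTokens(avgTokens: number): number {
  return Math.ceil(avgTokens * HEADROOM);
}
```

Keying the average on task type is what lets the short-prompt, long-JSON case and the long-prompt, yes/no case each get a sensible forecast.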

The fallback model

When a seat crosses 85% of its monthly cap, the gate stops routing fallback-eligible tasks to Sonnet and sends them to Haiku for the rest of the month. We picked 85% because finance gets the warning ping at 80%, and we wanted at least one working day between the warning and the model switch. Most of the time, nobody notices the switch happened.

Three rules made the fallback survive contact with real users.

First, the fallback only triggers on tasks the cheaper model can actually do. The agent has six task types: inbox triage, customer reply drafting, shipment exception classification, EDI mismatch reconciliation, dashboard summarisation, and a free-form "ask anything" mode. The first three run fine on Haiku. The last three do not. So when a seat hits 85%, only the first three switch. Free-form mode returns a soft block with a one-line explanation instead. Users tolerate "not this month, sorry" more than they tolerate quietly degraded answers they cannot diagnose.

Second, the system prompt changes with the model. Haiku follows instructions tightly, but it does not infer the same way. We rewrote the inbox-triage prompt twice for Haiku before it produced output the ops team would accept. Same task, different prompt, different cost, same outcome. Sonnet-class quality is not free, and it is also not always required.

Third, every fallback response carries an HTTP header (X-Claude-Tier: fallback) that the front-end uses to render a small grey marker next to the response. Quiet, not alarming. Power users learn what it means within a week. Everyone else ignores it, and that is fine.
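
The header from the third rule can be attached with a few lines on the server and checked with one on the client. This is a sketch with hypothetical function names; the Content-Type entry is an assumption about the response format, and only X-Claude-Tier comes from the setup described here.

```typescript
// Response headers for an agent reply. Only fallback responses carry the
// tier marker; primary-model responses are left unmarked.
function tierHeaders(wasFallback: boolean): Record<string, string> {
  const headers: Record<string, string> = {
    "Content-Type": "application/json", // assumed response format
  };
  if (wasFallback) headers["X-Claude-Tier"] = "fallback";
  return headers;
}

// Front-end side: decide whether to render the small grey marker.
function showsFallbackMarker(headers: Record<string, string>): boolean {
  return headers["X-Claude-Tier"] === "fallback";
}
```

Keeping the signal in a header rather than in the response body means the answer text itself is identical whichever model produced it.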

Takeaway

A budget cap is only useful if it degrades gracefully. The interesting design question is which tasks survive the fallback, not whether the fallback exists.
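
One way to make "which tasks survive the fallback" explicit is a small policy table. The sketch below uses hypothetical task-type keys. It assumes EDI reconciliation and dashboard summarisation simply stay on the primary model past the threshold, which matches the fallbackOk flag in the gate but is not spelled out in the rules above. The hard stop at 100% is handled by the gate and is not repeated here.

```typescript
type TaskType =
  | "inbox_triage"
  | "reply_drafting"
  | "exception_classification"
  | "edi_reconciliation"
  | "dashboard_summary"
  | "free_form";

type Route = "primary" | "fallback" | "soft_block";

// What each task type does once the seat is past the fallback threshold.
const PAST_THRESHOLD: Record<TaskType, Route> = {
  inbox_triage: "fallback",
  reply_drafting: "fallback",
  exception_classification: "fallback",
  edi_reconciliation: "primary", // assumption: keeps the primary model
  dashboard_summary: "primary",  // assumption: keeps the primary model
  free_form: "soft_block",       // one-line "not this month" explanation
};

function routeFor(task: TaskType, pctOfCap: number, fallbackAt = 0.85): Route {
  if (pctOfCap < fallbackAt) return "primary";
  return PAST_THRESHOLD[task];
}
```

Because the policy is data rather than branching logic, changing what a task does past the threshold is a one-line edit that is easy to revert.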

The Slack ping that keeps finance calm

The third piece is the one that actually changed the relationship with finance. A daily job runs at 09:00 Europe/Amsterdam, reads the ledger, and posts one message to #claude-spend:

Claude spend, June 3 (day 3 of 30)
Org total:  €1,068 / projected €10,680 (cap €10,500)
Top 3 seats:
  • ops-lead@…    €312  (21% of cap, on track)
  • cs-anna@…     €198  (13% of cap, on track)
  • finance-rob@… €141  ( 9% of cap, on track)
Fallback today:  0 seats
Warnings (>80%): 0 seats
Blocked  (>100%): 0 seats

The message is generated by a 60-line script that reads from the same Postgres ledger. No dashboard, no Looker board, no Notion page. The reason it works is not the format. It is that finance gets the number before they think to ask for it, every day, in the channel they already read for other budget signals. We have not had a single ad hoc "what is going on with Claude" question since the ping went live.
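
The formatting half of such a script might look like the sketch below. It is not the client's 60-line script: the record shape and function names are invented for illustration, the projection is a simple linear extrapolation (which matches the sample figures above), and the per-seat "on track" status is left out. Posting the text to Slack, for example through an incoming webhook, is omitted.

```typescript
interface SeatSpend {
  seat: string;
  usedEur: number;
  usedFallbackToday: boolean;
}

const SEAT_CAP_EUR = 1500;

// Whole euros with thousands separators, e.g. €1,068.
function eur(amount: number): string {
  return "€" + Math.round(amount).toLocaleString("en-US");
}

function buildSpendSummary(
  dateLabel: string,
  dayOfMonth: number,
  daysInMonth: number,
  seats: SeatSpend[],
): string {
  const total = seats.reduce((sum, s) => sum + s.usedEur, 0);
  // Linear projection: average daily spend so far times days in the month.
  const projected = (total / dayOfMonth) * daysInMonth;
  const orgCap = seats.length * SEAT_CAP_EUR;
  const pctOf = (s: SeatSpend) => s.usedEur / SEAT_CAP_EUR;

  const topSeats = [...seats]
    .sort((a, b) => b.usedEur - a.usedEur)
    .slice(0, 3)
    .map((s) => `  • ${s.seat} ${eur(s.usedEur)} (${Math.round(pctOf(s) * 100)}% of cap)`);

  return [
    `Claude spend, ${dateLabel} (day ${dayOfMonth} of ${daysInMonth})`,
    `Org total: ${eur(total)} / projected ${eur(projected)} (cap ${eur(orgCap)})`,
    "Top 3 seats:",
    ...topSeats,
    `Fallback today: ${seats.filter((s) => s.usedFallbackToday).length} seats`,
    // Seats above 80% but not yet at the cap.
    `Warnings (>80%): ${seats.filter((s) => pctOf(s) > 0.8 && pctOf(s) < 1).length} seats`,
    // Seats at or past the cap; the label follows the sample message.
    `Blocked (>100%): ${seats.filter((s) => pctOf(s) >= 1).length} seats`,
  ].join("\n");
}
```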

The warning ping at 80% goes to a smaller channel with the seat owner tagged. It is a soft heads-up: "you are tracking high, your fallback will start in roughly four days at this rate". Half the time the seat owner replies with "yes, big shipment week" and we leave it alone. The other half, they look at their own usage and self-correct before anything switches.
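
The "roughly four days at this rate" estimate in that heads-up can come from a simple run-rate calculation. The sketch below assumes the rate is the seat's average daily spend since the start of the month; the source does not say which window is used, and the function name is hypothetical.

```typescript
// Days until a seat reaches the fallback threshold at its current burn rate.
// Returns 0 if the seat is already past the threshold and Infinity if it
// has not spent anything yet.
function daysUntilFallback(
  usedEur: number,
  dayOfMonth: number,
  capEur = 1500,
  fallbackAt = 0.85,
): number {
  const thresholdEur = capEur * fallbackAt;
  if (usedEur >= thresholdEur) return 0;
  const dailyBurnEur = usedEur / dayOfMonth; // month-to-date average
  if (dailyBurnEur <= 0) return Infinity;
  return Math.ceil((thresholdEur - usedEur) / dailyBurnEur);
}

// A seat at €1,200 on day 20 burns €60 a day and has €75 of room left.
console.log(daysUntilFallback(1200, 20));
```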

What the bill looks like now

The June bill, three days in, is tracking to €10,180 for the month. Down from €11,400 in May, with the same seven seats and 14% more total agent calls. The drop comes from two places: Haiku absorbing the long tail of triage tasks for the seat that previously ran everything on Sonnet, and the power user (€3,890 in May) staying flat at €1,492 through the gate's hard stop.

The agent stayed in production through the whole transition. No incidents, no rollback. The CFO has stopped asking and started using the daily ping as an input to her own forecast. That is the change we were paid for. Everything else was scaffolding.

What we would do differently

Two things.

We would start with the ledger before the cap. The first week was loud and useless because we capped before we measured, and we had no idea what a "normal" seat looked like. Two weeks of just-measure data would have given us a defensible cap based on our own usage instead of one imported from someone else's blog post.

We would also split the cap by task type from day one. Inbox triage at €400/seat, free-form at €600/seat, EDI reconciliation at €500/seat is more useful than a single €1,500 line. Finance prefers it because it maps to processes they already know. Users prefer it because the block message can say "you have used this month's reconciliation budget" instead of "you have used this month's Claude budget". The code is barely longer, and it gives you a much better answer to the question "where did the money go" than a single column of total spend.
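
A sketch of what the per-task-type check could look like, using the three example figures above. The task-type keys, the handling of unconfigured task types, and the exact wording are assumptions; the real version would read month-to-date spend per task type from the ledger's task_type column.

```typescript
// Per-seat monthly caps by task type, using the split suggested above.
const TASK_CAP_EUR: Record<string, { capEur: number; label: string }> = {
  inbox_triage: { capEur: 400, label: "inbox triage" },
  free_form: { capEur: 600, label: "free-form" },
  edi_reconciliation: { capEur: 500, label: "reconciliation" },
};

interface TaskBudgetDecision {
  allowed: boolean;
  message?: string;
}

// usedEur is the seat's month-to-date spend for this task type only.
function checkTaskBudget(
  taskType: string,
  usedEur: number,
  projectedEur: number,
): TaskBudgetDecision {
  const budget = TASK_CAP_EUR[taskType];
  if (!budget) {
    // Assumption: task types without a configured budget are refused.
    return { allowed: false, message: `No budget configured for "${taskType}"` };
  }
  if (usedEur + projectedEur > budget.capEur) {
    return {
      allowed: false,
      message: `You have used this month's ${budget.label} budget`,
    };
  }
  return { allowed: true };
}
```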

The smallest thing to do today

Open your Anthropic console, sum your last 30 days of spend per API key, and divide by active users. If that number is over €1,500, you have a budget question, and the ledger is the first thing to build. If it is under, you have a measurement question, and the ledger is still the first thing to build. Everything in this playbook hangs off that one Postgres table.

When we built the agent platform for the logistics client above, the gate was the second-smallest piece of code we shipped and the one finance cared about most. We design AI agents for European mid-market companies that need to ship without scaring their CFO, and the budget gate is the part we now build on day one rather than month three.

Key takeaway

Cap Claude spend per seat, not per org, and design the fallback so users notice the gate as a small feature, not a wall.

FAQ

What if Anthropic's prices change mid-month?

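Update the PRICE map as soon as the new prices take effect. The ledger stores the cost in euros computed at the time of each call, so historical totals stay accurate and need no recalculation. The only rows worth a one-time backfill are calls made after the price change but before the map was updated, since those were recorded at the old rate.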

Does the gate account for prompt caching discounts?

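Partly. The ledger already has a cached_in column, but the gate shown above prices every input token at the full rate. To close the gap, add a cached-input price to the PRICE map, read cache_read_input_tokens from the response's usage data, and record it in the ledger. The cost column should reflect what you were actually billed, not list price.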
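
A minimal sketch of that cost calculation, assuming input_tokens in the usage data excludes cache reads so the two counts can be priced separately. The cachedIn price is a placeholder to replace with the current rate for your model, and cache-write tokens are ignored for brevity. The type and function names are hypothetical.

```typescript
// Subset of the usage data returned with a response.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
}

// Per-token prices; cachedIn is the discounted rate for cache reads.
interface ModelPrice {
  in: number;
  out: number;
  cachedIn: number;
}

// Cost of one call with cache reads billed at their own rate.
function actualCost(usage: Usage, price: ModelPrice): number {
  const cachedTokens = usage.cache_read_input_tokens ?? 0;
  return (
    usage.input_tokens * price.in +
    cachedTokens * price.cachedIn +
    usage.output_tokens * price.out
  );
}

// Placeholder prices for illustration only.
const examplePrice: ModelPrice = { in: 3.0 / 1e6, out: 15.0 / 1e6, cachedIn: 0.3 / 1e6 };
console.log(
  actualCost(
    { input_tokens: 1000, output_tokens: 500, cache_read_input_tokens: 9000 },
    examplePrice,
  ),
);
```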

Why Postgres for the ledger and not Redis?

Finance wants to query it. Redis is fine as a per-seat counter cache, but the audit trail needs durable, queryable storage. We use Postgres for both and let it serve the recent counter rows from memory.

What happens when a seat is fully blocked?

The UI shows a clear monthly-cap message with a link to request an override. The override request goes to the seat owner's manager, not to engineering, because the cap is a budget decision, not a technical one.

Can the same pattern work for OpenAI or other model providers?

Yes. The gate, ledger, and Slack ping are provider-agnostic. Only the PRICE map and the token-counting call change. We use the same shape across providers in mixed-model deployments.
