← Blog

Strategy

Reserved-capacity AI contracts: a field-guide to 16 mistakes

Friday, 23:47, second cup of decaf, the HN thread is 612 comments deep, and on Monday the agency owner signs a two-year reserved-capacity commit he shouldn't.

Jacob Molkenboer· Founder · A Brand New Company· 19 Jun 2026· 9 min
Half-open green leather folio, brass pen on linen, chartreuse wax seal, brass scale, cold coffee cup on ivory desk.

Friday, 23:47. The agency owner is on his second cup of decaf, scrolling Hacker News with a docked iPad on the kitchen counter. The thread is 612 comments deep and titled something like ACE-spec compute quotas: what you actually pay after the discount tier. Someone has pasted a screenshot of a vendor quote: two-year reserved-capacity commit, three-figure-thousands per month, "guaranteed throughput." The CFO has been asking, gently, why the OpEx line for AI doubled in Q1. The owner sleeps badly and signs the term sheet on Monday.

The contract will outlive two of his customers.

This is a field-guide to the sixteen places that decision goes wrong. We have watched the pattern enough times — fourteen agents in production for our own clients, plus the audits we run for other studios — that the failure modes are boringly predictable. We sorted them by what it costs to fix: a config-PR you can ship before lunch, a sprint of refactor, or a customer recontract round that costs you a quarter and a relationship.

What the thread actually said

Before the sixteen mistakes, a flag on the source material. Most HN compute-quota threads are not arguments for or against reserved capacity. They are arguments about a specific arithmetic: at what monthly burn does a committed-throughput rate beat pay-as-you-go, and what does the vendor count as your tokens when measuring against the commit. The arithmetic moves every two months. The thread you read in May is wrong by July.

A Dutch sub-€12M agency runs, in our experience, between 8 and 22 customer-facing agents. Average sustained burn is small: a few million tokens per day across all customers, almost all of it cacheable. The math for a reserved commit at that scale is rarely in your favour. The math for a thin router across three providers, with a Hetzner GPU node handling embeddings and the long tail of cheap calls, almost always is. Hetzner's GPU dedicated servers start well under €200 per month for an entry RTX 4000 SFF Ada; for an agency of your size that line item alone replaces the embedding bill.

So: sixteen mistakes. Numbered because they are countable, ranked by cost to undo.

Tier one — fixable in a config-PR before standup

These are the ones to find first. They do not require a customer conversation. The CTO opens the router config, opens a PR, ships before lunch.

1. The model name is hardcoded in three places. Search the repo for the literal model string. If you find it in code rather than in a single config map, you cannot move spend between providers without a deploy. Rename to an alias (tier-fast, tier-deep, tier-cheap) and resolve at the edge.

2. No per-customer token budget. One customer's runaway loop is paying for everyone else's commit. Add a monthly token ceiling per customer, log overruns to Slack, fall back to the cheap tier when crossed.

3. Prompt caching is off. If your system prompt is 4,000 tokens and you do not have caching enabled, you are paying full input rate every turn. The vendors that support it document it well — Anthropic's prompt caching docs are the clearest. A single header flip drops input cost by an order of magnitude on long-context agents.

4. Every call goes to the flagship. Customer email triage does not need a frontier model. Route by task: classify and summarise on a small model, escalate to flagship only when the small model's confidence is low. Two routing rules, twenty lines of code.

5. No fallback URL. When the provider 5xx's at 14:00 CET — and they do — your agent dies and the customer sees it. The router should have a secondary base URL per task tier. The fallback does not have to be cheap. It has to be there.

6. Eval traffic is hitting prod endpoints. Your nightly regression suite is consuming the same quota as live customer agents. Move evals to a separate API key with its own budget. You want the alert when evals overrun; you do not want them eating your commit.

Takeaway

If you cannot move 30% of spend between providers by editing one file, you do not have an AI architecture. You have a vendor lock-in disguised as one.

Tier two — a sprint of refactor

These cost a developer-week. They will not break a customer contract, but they need testing, staging-environment time, and a real rollout.

7. No router at all. The SDK is imported in eleven service files, each making direct calls. There is no place to add a model, a fallback, a budget, or a log redactor. Build the router: a single function that takes (tier, messages, customer_id) and returns a completion. Everything else routes through it. Once it exists, the next twelve months of optimisation become config changes.

8. Embeddings billed at API rates. A 1024-dim embedding model runs comfortably on a single GPU you rent for around €200 a month. If you are paying per-token API rates for embedding the same product catalogue every night, the GPU pays itself back inside a month. Use one of the recent open-weights models, batch the work, expose it behind the same router.

# rough sizing — your numbers will differ
# 10M embed tokens/day @ €0.10/1M = €30/day = €900/month
# vs Hetzner GPU node                = ~€184/month, headroom to spare

9. The cheap tier is a cheap model. It should be a cheap path. Most calls in a customer-service agent are not creative. They are retrieval plus format. Run those against your own RAG with a small model and never hit the foundation vendor at all. The flagship is for the ambiguous 8% of turns.

10. Logging full prompts and full completions, indefinitely. This is technically a tier-one fix but it has tier-three consequences if you skip it. Redact PII at the router, set a 30-day retention, document the policy. The day you have to recontract is the day a customer asks where their data went. Have an answer that is not "uh."

11. Model versions pinned to strings the vendor will deprecate. Foundation-model vendors retire snapshots on rolling schedules. If your contract names a specific snapshot and your code calls that same snapshot and the vendor sunsets it next quarter, you have two crises that did not need to be the same crisis. Pin the alias, not the version, in customer-facing language.

12. Fine-tuning artefacts you cannot move. Some vendors will fine-tune on your data and call the result "yours." Read the export clause. If you cannot download the weights or the equivalent prompt-distillation, you do not own it. We have helped two clients out of this. It is a slow, polite, expensive divorce. Avoid the wedding.

Tier three — the customer recontract round

These are the expensive ones. They typically force a conversation with every customer whose SOW touches the agent, and they often force a price renegotiation, because the original price assumed the vendor's commit math.

13. The SOW names a model by brand. "Powered by [Vendor X]'s [Model Y]" reads well in a sales deck. It reads badly in court when Vendor X raises prices 40% and your customer says they signed for Model Y, not Model Z. Sell the outcome (response time, accuracy on the eval set, hours saved). Never sell the brand on the inside of the box.

14. Data-residency clauses that name one vendor's region. When you wrote "data processed in EU-Frankfurt by [Vendor]" you closed the door on the other two providers, both of whom also offer Frankfurt processing under different names. Write the clause around the residency, not the route.

15. Per-seat pricing baked on volume-discount math. Your customer pays €X per seat. That price assumed the vendor's tier-three commit discount. If the customer's usage does not show up — or shows up at the wrong tier — your margin is a rounding error. Price on tokens-consumed-by-the-customer, with a generous included bundle, not on a static seat fee that hides the vendor cost.

16. The reserved-capacity commit itself. The big one. Multi-year, throughput-guaranteed, take-or-pay. If you signed this before you had six months of router data showing your real burn distribution by customer and by task tier, you signed it blind. The fix is not technical. The fix is calling the vendor's account team and asking, honestly, to restructure to a shorter term with a usage-credit conversion. Some will. Some will not. You will know inside a week.

Warning

The reserved commit is rarely the worst mistake in absolute money. It is almost always the worst mistake in optionality. Optionality compounds; money is a line item.

The order to fix them in

Not the order they are numbered. Fix tier-one first, all of it, in one PR cycle. You will recover enough margin in week one to fund tier two. Tier two builds the router and the GPU node — two to three sprints. Only then, with real data in hand, do you open the tier-three conversations with customers and the vendor.

If you do it the other way around — start with the painful customer conversations because they feel most urgent — you will have those conversations with no leverage, because you cannot yet show the customer or the vendor a credible alternative.

What we keep relearning

A foundation-model vendor's salesperson is paid on commit volume. An HN thread is paid on engagement. Your CFO is paid on the cash position. None of these three is paid on the long-term shape of your agency. You are. The router is the cheapest insurance you will ever buy for your right to change your mind.

When we rebuilt the agent stack for a Dutch insurance broker last autumn, the thing we ran into was a previous vendor commit locking routing to one provider's region — so we put a thin router in front, treated the existing commit as a paid-up bucket to drain, and moved embeddings to a Hetzner GPU box. Six months in, the broker's spend was 46% lower on the same agent count; the work sits inside our AI agents practice.

Five minutes, tomorrow morning

Open the router config. If you do not have one, open the file that imports the SDK. Grep for the literal model name. Count the occurrences. That number is your tier-one mistake count, before you have even started reading the contracts.

Key takeaway

If you cannot move 30% of your AI spend between providers by editing one file, you have a vendor lock-in disguised as an architecture.

FAQ

Should a sub-€12M agency ever sign a reserved-capacity AI commit?

Rarely, and never before six months of router telemetry. Most agencies your size are better served by pay-as-you-go across two or three providers plus a small GPU node for embeddings.

Which providers do you actually route across?

We deliberately do not name them in client SOWs. The point of a router is that the answer changes every quarter without breaking a contract. Three is the practical sweet spot: one flagship, one fast, one cheap.

Is a €200/month Hetzner GPU really enough for production embeddings?

For most agency workloads, yes. A single entry-tier card handles millions of embedding tokens a day if you batch the catalogue refresh overnight. Live query traffic is a fraction of that.

How long does it take to build the router?

For a team that knows the SDK, one to two sprints to a production-grade version with fallback, budgeting, and PII redaction. The expensive part is rewiring eleven scattered call sites, not the router itself.

What if we already signed the multi-year commit?

Call the account team this week. Many vendors will restructure to a shorter term or convert remaining commit into usage credits, especially if you can show you intend to grow the relationship rather than exit it.

ai agentsstrategyarchitectureoperationsbusinesstooling

Building something?

Start a project