
AI agent spend caps: the €4,820 Sunday DN42 incident

At 06:14 on a Sunday, a customer-success agent in Eindhoven started a DN42 reconnaissance loop. Eleven hours later it had burned €4,820 on one ticket worth €189.

Jacob Molkenboer · Founder · A Brand New Company · 10 May 2025 · 9 min


The alert hit at 06:14 on a Sunday. Egress on the customer-success agent had crossed €1,000 inside its first six hours alive. By 09:30 the counter sat at €1,840. By the time the engineering lead at the Eindhoven SaaS killed it at 17:32, it read €4,820.

The agent had been in production for five days. Twenty-nine-person company, mid-tier B2B SaaS, a standard customer-success copilot wired into Intercom, Stripe, the product database, and (this is where it gets interesting) an outbound shell tool labelled "diagnostics". The ticket that started the loop was the most ordinary thing on the queue: a paying customer reported their webhook deliveries were arriving late.

The story is on the HN front page this week. It is also exactly the shape of incident we have spent the last eighteen months designing around for our own deployed agents. So here is the walkthrough, and the orchestrator pattern that would have caught it on the first retry.

What the agent thought it was doing

The customer's webhook endpoint sat behind a self-hosted reverse proxy on a small VPS in Frankfurt. The agent, asked to confirm reachability, ran curl. The curl failed with a TLS handshake error. The agent reasoned, correctly, that the next step was to check the underlying network path. It ran traceroute. The traceroute showed asymmetric routing through an autonomous system the agent did not recognise.

At this point a human engineer would have stopped, opened Slack, and asked the customer for their network setup. The agent did not have a Slack tool. It had a shell. So it kept going.

The path it chose was DN42, a decentralised network experiment that runs overlay VPN tunnels and BGP between hobbyist autonomous systems. The customer's VPS provider happened to have a DN42 peer in the same datacentre. The model, having read the hop list from traceroute, decided the cheapest path to confirming the customer's network setup was to enumerate the DN42 AS list, locate the peer, and route a test through it.

That decision was internally coherent. It was also wrong by every other measure. DN42 does not carry production traffic. The peer was a side project belonging to a sysadmin in Bremen. The actual webhook delay was a misconfigured retry interval, sitting in plain text in the customer's Stripe dashboard, twelve clicks from where the agent already had read access.

The cost shape of an eleven-hour loop

The €4,820 broke down like this:

€2,940 in outbound bandwidth from the AWS instance hosting the agent runtime. Eleven hours of near-constant scanning, with retries and back-off retries layered on top.
€1,180 in model inference. Each scan returned a wall of text. The agent re-fed all of it into context on every tool call, with no summarisation step in between.
€520 in third-party API fees. The agent had a "lookup AS owner" tool that pinged a paid WHOIS service. It called it 4,712 times.
€180 in downstream alerting noise (pager events, second-tier on-call). Real money once you count humans on a weekend rate.
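As a quick sanity check on the breakdown above, the four lines can be summed and set against the customer's monthly fee. The figures are the ones quoted in this post; the variable names are only illustrative.

```python
# Cost lines from the incident, in euros (figures quoted in the breakdown).
COSTS_EUR = {
    "bandwidth": 2940,
    "inference": 1180,
    "third_party_api": 520,
    "alerting": 180,
}
MONTHLY_FEE_EUR = 189  # what the customer pays per month

total = sum(COSTS_EUR.values())
months_of_revenue = total / MONTHLY_FEE_EUR

# Print each line with its share of the total bill.
for name, cost in COSTS_EUR.items():
    print(f"{name:16s} €{cost:>5,}  {cost / total:5.1%}")
print(f"{'total':16s} €{total:>5,}  = {months_of_revenue:.1f} months of subscription")
```

The lines add up to €4,820, which is roughly 25.5 months of the customer's subscription.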

Look at the inference line. €1,180 of tokens on a single ticket from a customer who pays €189 a month. The model running the loop was not a frontier tier and not the premium long-context variant. It was a mid-tier production model with a context window that kept growing because nobody had told it to compress.

A growing context plus an open shell plus no per-tool budget is not an edge case. It is the default shape of every off-the-shelf agent framework we have audited this year.

The root cause was not the model

The post-mortem in the company's internal Notion frames this as "the model went rogue". That framing is wrong, and it is the framing that will get the next team into the same hole.

The model did exactly what its tools and its halt condition allowed. The orchestrator the team built was the standard shape you get from following any popular agent tutorial: a planner, a tool list, a working memory, and a halt condition keyed off "task complete". The halt condition was the only budget gate. There was a global daily cap of €500, with alerting set to "warn at 80%", routed to a Slack channel nobody watched on Sundays.

The team had read OWASP LLM10: Unbounded Consumption. They had not implemented anything against it. That is, from what we see in our migration audits, the median state of agent deployments in mid-2026.

Three guardrails the orchestrator was missing

Per-tool budget

Every tool in your agent's toolbox has a cost shape. A read from your product database costs a few cents in inference and zero in third-party fees. A WHOIS lookup costs €0.11 per call. A shell command can cost anything from zero to infinity depending on what it runs. The orchestrator has to know all three cost shapes, and it has to refuse an over-budget call before the call is made, not after.

TOOL_BUDGETS = {
    "read_db":      {"per_call_eur": 0.02, "per_ticket_cap_eur": 0.40},
    "whois_lookup": {"per_call_eur": 0.11, "per_ticket_cap_eur": 0.55},
    "shell":        {"per_call_eur": None, "per_ticket_cap_eur": 1.00},
    "send_email":   {"per_call_eur": 0.00, "per_ticket_cap_eur": 0.00, "max_calls": 3},
}

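class BudgetExceeded(Exception):
    """Raised before a tool call that would break its per-ticket cap."""

def invoke_tool(tool_name, args, ticket_state):
    # TOOLS, estimate_cost and actual_cost are assumed to be defined elsewhere.
    budget = TOOL_BUDGETS[tool_name]
    spent = ticket_state["tool_spend"].get(tool_name, 0.0)
    calls = ticket_state.setdefault("tool_calls", {}).get(tool_name, 0)
    # Enforce the optional call-count cap (used by send_email).
    max_calls = budget.get("max_calls")
    if max_calls is not None and calls >= max_calls:
        raise BudgetExceeded(f"{tool_name} reached its call cap ({max_calls})")
    # Compare against None explicitly: a price of 0.00 is falsy but valid.
    estimated = budget["per_call_eur"]
    if estimated is None:
        estimated = estimate_cost(tool_name, args)
    if spent + estimated > budget["per_ticket_cap_eur"]:
        raise BudgetExceeded(
            f"{tool_name} would exceed cap "
            f"({spent + estimated:.2f} > {budget['per_ticket_cap_eur']:.2f})"
        )
    result = TOOLS[tool_name](**args)
    cost = actual_cost(tool_name, result)
    ticket_state["tool_spend"][tool_name] = spent + cost
    ticket_state["tool_calls"][tool_name] = calls + 1
    return result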

The shell tool's per_call_eur is None because you cannot price it ahead of time. Estimate after parsing the args (a curl to a known endpoint is cheap, an nmap of a /16 is not), or refuse to estimate and fall back to a hard call-count cap.
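One way to estimate after parsing the args is sketched below. It is a rough illustration, not what the Eindhoven team or any framework ships: the function name `estimate_shell_cost`, the price constants, and the list of cheap commands are all hypothetical. It prices a command by the number of addresses it targets and returns None when it cannot tell, so the caller can fall back to a call-count cap.

```python
import ipaddress
import shlex

# Hypothetical prices; replace them with numbers from your own logs.
EUR_PER_PROBED_HOST = 0.001
FLAT_CHEAP_EUR = 0.01
CHEAP_COMMANDS = {"curl", "dig", "ping"}

def estimate_shell_cost(command):
    """Return an estimated cost in EUR, or None if the command cannot be priced."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return None  # unbalanced quotes: refuse to estimate
    if not tokens:
        return None
    hosts = 0
    for token in tokens[1:]:
        try:
            network = ipaddress.ip_network(token, strict=False)
        except ValueError:
            continue  # not an IP address or CIDR range
        hosts += network.num_addresses
    if hosts > 1:
        # A range scan: price grows with the number of addresses targeted.
        return hosts * EUR_PER_PROBED_HOST
    if tokens[0] in CHEAP_COMMANDS:
        return FLAT_CHEAP_EUR
    return None  # unknown shape: let the call-count cap decide

print(estimate_shell_cost("curl https://example.com/health"))
print(estimate_shell_cost("nmap 10.0.0.0/16"))
print(estimate_shell_cost("traceroute 192.0.2.1"))
```

Under these assumed prices, the /16 scan comes out thousands of times more expensive than the curl, and the traceroute is left unpriced on purpose.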

Per-ticket budget

One ticket has a value. The Eindhoven customer pays €189 a month. The agent spent enough on one ticket to refund more than two years of subscription. A per-ticket ceiling of €2.00 would have killed the loop before the third WHOIS call. The number does not have to be sophisticated. It has to exist.
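A per-ticket ceiling can be as small as the sketch below. The class and exception names are hypothetical, and amounts are kept in integer cents to avoid floating-point drift; the €2.00 default is the figure suggested above.

```python
class TicketBudgetExceeded(Exception):
    """Raised when a charge would push a ticket past its ceiling."""

class TicketBudget:
    def __init__(self, ceiling_cents=200):
        self.ceiling_cents = ceiling_cents
        self.spent_cents = 0

    def charge(self, amount_cents):
        """Record spend, refusing it if the ticket ceiling would be passed."""
        if self.spent_cents + amount_cents > self.ceiling_cents:
            raise TicketBudgetExceeded(
                f"ticket at {self.spent_cents}c, charge of {amount_cents}c "
                f"would pass the {self.ceiling_cents}c ceiling"
            )
        self.spent_cents += amount_cents

budget = TicketBudget()
budget.charge(150)  # accepted: 150c spent
budget.charge(40)   # accepted: 190c spent
try:
    budget.charge(20)  # refused: would reach 210c
except TicketBudgetExceeded as exc:
    print(exc)
```

The check runs before the spend is recorded, so a refused charge leaves the running total untouched.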

Loop signature detection

The agent called traceroute four hundred and seven times. Four hundred and seven. The second identical call should have raised a flag in the orchestrator. The fifth should have halted execution and escalated to a human.

from collections import Counter

def detect_loop(call_history, window=10, threshold=4):
    recent = call_history[-window:]
    sigs = Counter((c["tool"], c["args_signature"]) for c in recent)
    top = sigs.most_common(1)
    if top and top[0][1] >= threshold:
        return top[0][0]
    return None

The args_signature is a hash of normalised arguments (strip varying timestamps, collapse adjacent IPs to their /24). Four identical-shaped calls in the last ten is your trip wire. Halt the loop, log the trail, page a human. Cheap to write, cheap to run, would have ended the incident by 06:20.
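The normalisation step might look like the sketch below. The regexes, the `<ts>` placeholder, and the 16-character truncated SHA-256 are assumptions made for illustration; the only parts taken from the text above are stripping timestamps and collapsing IPv4 addresses to their /24.

```python
import hashlib
import ipaddress
import json
import re

# ISO-8601-style timestamps such as 2025-01-05T06:14:00Z (assumed format).
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")
# Bare IPv4 addresses; addresses that already carry a /prefix are left alone.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b(?!/\d)")

def _collapse_ip(match):
    try:
        network = ipaddress.ip_network(f"{match.group(0)}/24", strict=False)
    except ValueError:
        return match.group(0)  # looks like an IP but is not one, e.g. 999.1.1.1
    return str(network)

def normalise(value):
    """Recursively strip timestamps and collapse IPs in strings."""
    if isinstance(value, str):
        value = TIMESTAMP_RE.sub("<ts>", value)
        return IPV4_RE.sub(_collapse_ip, value)
    if isinstance(value, dict):
        return {key: normalise(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalise(item) for item in value]
    return value

def args_signature(args):
    """Short, stable hash of the normalised arguments."""
    canonical = json.dumps(normalise(args), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

a = args_signature({"host": "10.1.2.3", "at": "2025-01-05T06:14:00Z"})
b = args_signature({"host": "10.1.2.77", "at": "2025-01-05T06:15:09Z"})
print(a == b)  # same /24, different timestamps: same signature
```

Two probes of neighbouring hosts a minute apart hash to the same signature, which is what lets a counter over recent calls see them as the same move.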

What we run in production

The orchestrator we deploy for client agents sits between the model and the tools. It does three things the team in Eindhoven did not.

First, every tool call goes through a single invoke_tool path that records cost in cents and refuses calls past a hard ceiling. Not soft, not warn-at-80%, not Slack-on-Friday. The call returns a BudgetExceeded error string back to the model, which then has to reason its way to a different plan or hand off to a human.
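Turning the refusal into something the model can read might look like this. It is a minimal sketch: `call_tool_for_model`, the error wording, and the `refusals` list are hypothetical, and the `invoke` argument stands in for whatever budget-checking call path you already have.

```python
class BudgetExceeded(Exception):
    """Raised by the call layer when a tool call would break a cap."""

def call_tool_for_model(invoke, tool_name, args, ticket_state):
    """Run a tool call and always hand the model a string back."""
    try:
        result = invoke(tool_name, args, ticket_state)
    except BudgetExceeded as exc:
        # The refusal is data for the model, not a crash for the loop.
        ticket_state.setdefault("refusals", []).append(tool_name)
        return (
            f"ERROR BudgetExceeded: {exc}. "
            "Choose a cheaper plan or hand off to a human."
        )
    return str(result)

# Stand-in call path that refuses every shell call, for demonstration only.
def fake_invoke(tool_name, args, ticket_state):
    if tool_name == "shell":
        raise BudgetExceeded("shell would exceed cap (1.20 > 1.00)")
    return {"rows": 3}

state = {}
print(call_tool_for_model(fake_invoke, "read_db", {}, state))
print(call_tool_for_model(fake_invoke, "shell", {"cmd": "nmap"}, state))
```

Keeping a list of refusals in the ticket state also gives the orchestrator a cheap signal for when to stop and escalate.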

Second, the agent runs against a context budget too. Every fifth tool result gets summarised into a 200-token note and the raw result is dropped from context. This alone would have killed roughly €700 of the Sunday's inference bill.
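A sketch of that compaction step follows, reading "every fifth tool result" as "once five raw results have accumulated, replace them with one note". That reading is an assumption, as are the names `compact_context` and `summarise`; in practice `summarise` would be a cheap model call capped at around 200 tokens, and here it is a stub.

```python
def compact_context(tool_results, summarise, every=5):
    """Replace each full batch of `every` raw results with one short note."""
    compacted = []
    full_batches = len(tool_results) // every
    for i in range(full_batches):
        batch = tool_results[i * every:(i + 1) * every]
        compacted.append(summarise(batch))
    # Results that do not yet fill a batch stay in context untouched.
    compacted.extend(tool_results[full_batches * every:])
    return compacted

# Stub summariser for demonstration; a real one would call a small model.
def stub_summarise(batch):
    return f"[note covering {len(batch)} tool results]"

raw = [f"traceroute output {n}" for n in range(12)]
context = compact_context(raw, stub_summarise)
print(len(raw), "->", len(context))
print(context)
```

Twelve raw results shrink to two notes plus the two most recent raw outputs, so the context grows with the number of notes rather than the number of calls.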

Third, the halt condition is not "task complete". It is a vote across three signals: the model's own claim, a cheap classifier that reads the last three tool results, and a hard wall on total spend. If two of three say stop, the loop stops. The model does not get to talk its way past the wall.
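The two-of-three vote can be written in a few lines. The function and parameter names below are hypothetical, and the classifier is represented only by its boolean verdict. Note that this vote decides when the loop ends; the call-layer ceiling described in the first point still refuses individual calls on its own.

```python
def should_halt(model_says_done, classifier_says_stop, spent_eur, spend_wall_eur):
    """Stop the loop when at least two of the three signals say stop."""
    votes = [
        bool(model_says_done),        # the model's own claim
        bool(classifier_says_stop),   # cheap classifier over recent results
        spent_eur >= spend_wall_eur,  # total spend has reached the wall
    ]
    return sum(votes) >= 2

# The model wants to continue, but the classifier and the spend wall disagree.
print(should_halt(False, True, spent_eur=2.10, spend_wall_eur=2.00))
```

Because the model holds only one vote, its claim that the task is finished, or that it should keep going, is never enough by itself to settle the outcome.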

Takeaway

Your orchestrator's job is not to help the agent finish the ticket. Its job is to refuse the agent's next call when the call costs more than the ticket is worth.

The case for boring middleware

There is a thread in the current discourse about whether more proactive models reduce the need for orchestrator-level guardrails. The HN front page this week has a separate piece arguing that the newest tier of one major model is "relentlessly proactive", and another about its invisible guardrails. The Eindhoven incident supports the opposite conclusion: more proactive models make orchestrator budgets more important, not less. A model that tries harder will try for longer and with more tools.

The middleware that catches this is not glamorous. It is a function that refuses a call. It is a counter that increments on each tool invocation. It is a frequency count over the last ten tool signatures. None of it requires a new model. All of it requires deciding, before deployment, that every tool has a price and every ticket has a ceiling.

The handover

When we built the customer-success agent for a Dutch logistics client earlier this year, the thing that nearly tripped us in week two was almost identical: a shell tool, an open-ended ticket, and a model that wanted to keep going. We solved it by moving the budget check out of the model's prompt and into the call layer, where the model cannot argue with it. That pattern is what we now ship by default in our AI agents engagements.

The smallest thing you can do today: open your agent's tool registry, add a column for "cost per call in cents", and another for "calls per ticket cap". Fill in the numbers from last week's logs. If any tool has no number, you have your first incident already half-written.
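That audit can be scripted in a few lines. The registry layout, the field names, and the numbers below are hypothetical; the point is only that a tool with no number gets flagged before deployment rather than after an incident.

```python
REQUIRED_FIELDS = ("cost_per_call_cents", "calls_per_ticket_cap")

def unpriced_tools(registry):
    """Return {tool_name: [missing fields]} for tools lacking a number."""
    gaps = {}
    for name, entry in registry.items():
        missing = [field for field in REQUIRED_FIELDS if entry.get(field) is None]
        if missing:
            gaps[name] = missing
    return gaps

# Example registry with made-up numbers; shell has no price yet.
registry = {
    "read_db":      {"cost_per_call_cents": 2,  "calls_per_ticket_cap": 20},
    "whois_lookup": {"cost_per_call_cents": 11, "calls_per_ticket_cap": 5},
    "shell":        {"cost_per_call_cents": None},
}

for tool, missing in unpriced_tools(registry).items():
    print(f"{tool}: missing {', '.join(missing)}")
```

A check like this fits naturally in CI, so a new tool cannot be registered without both columns filled in.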

Key takeaway

Every tool needs a price, every ticket needs a ceiling, and the budget check belongs in the call layer where the model cannot argue with it.

FAQ

What is DN42?

A decentralised network experiment that runs overlay VPN tunnels and BGP between hobbyist autonomous systems. It is not a production network and was never designed to carry customer traffic.

How small should a per-ticket agent budget be?

Tie it to the ticket's economic value. For a €189/month subscription, a €1 to €2 ceiling is reasonable. The exact number matters less than having one at all.

Will newer model tiers solve this without orchestrator guardrails?

No. A more proactive model tries harder for longer with more tools. Spend caps belong outside the prompt, in the call layer, where the model cannot argue with them.

Do the major SDKs enforce spend caps natively?

As of mid-2026, no. Vendor SDKs return token counts but leave budget enforcement to your orchestrator code. You have to write the call-layer middleware yourself.

