AI agents
OpenAI agent loops: the three places your budget burns
A founder pings us on Slack at 9:14 on a Tuesday. The OpenAI bill for last month is 4,200 euros. The agent does the same job it did in March, when the bill was 600.

A founder pings us on Slack at 9:14 on a Tuesday. The OpenAI bill for last month is 4,200 euros. The agent does roughly the same job it did in March, when the bill was 600. Nothing in the prompt has changed. The model has not changed. Traffic is up maybe 30%. The bill is up 700%.
This conversation happens roughly twice a month with different numbers. The cause is almost always the same three things, in some combination.
The bill is rarely the model
When the OpenAI invoice jumps and the workload did not, the model is rarely the cause. The cause is the loop around it. Agents are loops by definition. Every loop has somewhere it can leak.
Three failure modes account for most of the bleed we see in production. They are independent and they compound. A well-behaved agent on the same model at the same volume can cost 5 to 20x what the dashboard quote suggested, and you will not see it on per-call cost. You will only see it on the monthly invoice.
Context that grows every turn
The first leak is context. Each turn of an agent loop sends the full message history back to the API. If the agent has called eight tools, each returning four kilobytes of JSON, by turn twenty you are sending roughly 640 KB of input tokens per call. The output is a hundred tokens. The bill is the input.
At gpt-4o-mini input rates this is cheap. At gpt-4.1 it stings. At o3-class reasoning models it is the entire budget. The fix is not the model. The fix is what you keep in the context window.
Prompt caching helps if you wire it correctly. OpenAI returns a discounted price on any input prefix that repeats inside roughly five minutes (prompt caching docs). The catch is that the prefix has to be byte-identical. One timestamp at the top of your system prompt, one rotating tool result above the cache line, and your hit rate drops to zero. Most agent code we audit leaves caching on the table by accident.
The pattern that works: a stable system prompt, then stable tool definitions, then a rolling summary, then the current turn. The summary is regenerated only every N turns, not every turn. The rotating bits live at the bottom of the context, where they belong.
Tool calls that retry their own failure
The second leak is tool retries. An agent calls a tool. The tool returns an error. The agent decides to try again. Same arguments. Same error. The agent decides to try once more. By turn six it has spent more on retries than the actual task would have cost end to end.
We saw this last month on an inbox-triage agent. The calendar API returned a 429 on a Monday morning peak. The agent retried six times in a row, each retry adding the failed response to the context. The token count for that task was 38,000 instead of the usual 4,000. Multiply that by a few hundred Monday-morning tasks and you have your spend graph.
If your tool layer cannot tell the difference between a transient error and a fatal one, the model will treat them all the same. Wrap your tools so they return a structured error: retryable, suggested wait, suggested alternative. Cap the retry budget at the loop level, not in the prompt.
The model you picked for the wrong job
The third leak is using one model for every step. Agent loops have at least three distinct cognitive jobs in them: routing (which tool, which branch), generation (what to put in the tool arguments or the response), and synthesis (compress N tool results into one answer for a human).
Routing is a 50-token decision. It does not need o3. Generation usually does not either. Synthesis sometimes does, when the final answer goes to a human and the inputs are messy.
Splitting an agent loop across two or three models is the single highest-leverage change we make in production. A typical setup is a small model (gpt-4o-mini or gpt-5-mini class) for routing and argument generation, a larger model only for the final synthesis step. The routing layer is calling the API ten times per task. The synthesis layer is calling it once. Putting the cheap model where the volume is matters more than the per-call cost on either tier.
The current OpenAI pricing tiers shift often, but the input-to-output cost ratio between tiers (roughly 6x to 20x at the moment) is the lever to watch.
What to measure first
Three numbers, an afternoon to wire up.
Tokens per task, not per turn. A turn-level metric hides loop length. A 20-turn task at 8,000 tokens per turn is the same per-call number as a 4-turn task at 40,000 tokens per turn. The fixes are different. Track input and output separately, because their cost structure and their root causes are different.
Turns to completion at P50 and P95. Averages lie when five percent of your tasks loop for forty turns. The P95 is where the budget goes, and the P95 is what you optimise.
Cache hit rate on input tokens. If this is under 50%, your cache key is broken. If it is over 80%, caching is working and the leak is somewhere else. Easy to read, hard to fake.
A fourth number worth wiring up is tool-call success rate per attempt. If this is below 90% on any single tool, the agent will spend more time arguing with that tool than doing its job.
The instrumentation is unglamorous. Wrap your client call once and log against a task ID you control:
import json, time
from openai import OpenAI
client = OpenAI()
def log_event(event):
print(json.dumps(event))
def chat_with_telemetry(task_id, model, messages, tools=None):
t0 = time.time()
resp = client.chat.completions.create(
model=model,
messages=messages,
tools=tools,
)
u = resp.usage
details = getattr(u, "prompt_tokens_details", None)
cached_in = getattr(details, "cached_tokens", 0) if details else 0
log_event({
"task_id": task_id,
"model": model,
"ms": int((time.time() - t0) * 1000),
"in_tokens": u.prompt_tokens,
"in_cached": cached_in,
"out_tokens": u.completion_tokens,
})
return resp
Aggregate by task_id at the end of each run. You will see, within a day of traffic, which loop type is bleeding and why. Most teams discover the answer is not the one they would have guessed.
When the OpenAI bill jumps and the workload did not, the model is not the cause. The loop around it is. Measure tokens per task, turns to completion, and cache hit rate before you change anything else.
The cheapest fix is the one you can see
When we built the inbox-triage agent we mentioned at the top, the monthly spend dropped from 4,200 euros back to roughly 600 after we wired these three numbers up and split the loop across three models. Running AI agents in production without per-task token accounting is flying instrument-free.
The smallest thing you can do today: open your OpenAI usage dashboard, sort by model, and check whether one model accounts for more than 80% of your spend. If it does, and you are using it for routing decisions, you have your afternoon's work.
Key takeaway
If your OpenAI agent bill jumped without the workload changing, the model is not the cause. Measure tokens per task, turns to completion, and cache hit rate first.
FAQ
How much can OpenAI agent costs realistically drop with this approach?
We typically see 60 to 85 percent reductions on production loops after instrumenting tokens per task, fixing the cache key, and splitting routing onto a cheaper model. The exact number depends on which leak dominates.
Does prompt caching help all OpenAI models?
It applies to most current chat completion models. The discount only triggers when the input prefix matches byte-for-byte within roughly five minutes. Rotating timestamps at the top of system prompts kill it silently.
Should I use a different model per agent step?
Often yes. Routing and argument generation rarely need a frontier model. Reserve the expensive model for the synthesis step where the answer goes to a human. Two or three models per loop is normal in production.
What is the fastest way to measure token spend per task?
Wrap your OpenAI client call to log prompt_tokens, completion_tokens, and cached tokens against a task_id you control. Aggregate at task end. An afternoon of work, no vendor tooling required.