AI agents

Agent observability: seven signals we wire before launch

Every agent we ship gets the same seven dashboards before it sees a customer. Here is what we monitor, the Grafana queries we run, and the thresholds that wake us up.

Jacob Molkenboer· Founder · A Brand New Company· 7 Jun 2026· 9 min

Brass switchboard with seven cream paper tags, one glowing chartreuse patch cord, red wax seal on ivory paper.

It is 2:17 in the morning when the on-call phone goes off. An invoice-chase agent has burned €1,400 in model spend over the last forty minutes. A Shopify webhook started returning empty arrays at 1:31, and the agent decided that on the next loop it would find the missing order. It never found it. It kept looking, and it kept paying for the privilege.

That is the failure mode agent observability has to catch before it happens. Not the model getting confused. The infrastructure under the model going sideways while the model keeps talking.

This is the field guide for the seven observability panels we wire into every agent dashboard before a single customer reaches it. Each one with the Grafana query we actually run and the threshold that actually pages us.

The instrumentation under the panels

Before the seven panels, a word on what produces the data. Every agent we ship in 2026 emits OpenTelemetry traces for each conversation and Prometheus metrics from a thin middleware that wraps the model client and every tool. The tracing layer gives us per-conversation forensics when an alert fires. The metrics layer gives us the dashboards.

The counters and histograms below all live in that middleware. You write them once. Every agent the team ships afterwards inherits the dashboard for free. If you take one structural choice from this post, take that one.

1. Cost per resolved conversation

Token spend is not billing trivia. It is the cleanest cost signal in agent observability, and the cleanest sign that an agent is stuck.

A healthy customer-support agent at the studio costs us around €0.04 per resolved conversation on Sonnet 4.5, with a long tail running to about €0.12. Anything over €0.30 is either a genuinely hard ticket or a loop. Either way, we want to know.

sum by (agent_id) (
  rate(agent_tokens_total{kind="output"}[5m]) * 15
  + rate(agent_tokens_total{kind="input"}[5m]) * 3
) / 1e6

Rates in dollars per second, with per-model pricing baked into the constants (we update those when the provider updates pricing). We page when the per-conversation average over a five-minute window exceeds 4× the seven-day baseline. Not an absolute threshold. The baseline drifts with feature work, and you want the alert to drift with it. The same panel also catches a quieter failure: an agent configured for Sonnet that silently routes to Opus because a fallback fired. Cost per conversation triples and the chart tells you why.

2. Per-tool error rate

The single biggest source of silent agent failure is a tool that suddenly returns garbage. Empty arrays instead of orders. 200 OK with an HTML error page in the body. A schema change the agent doesn't know about.

We instrument every tool call with both a counter and a histogram, tagged with tool and error_class. The panel breaks error rate out per tool, one line each, because a 6% error rate across all tools could be one tool at 60% and the rest at zero. You need to see that on the same screen.

sum by (tool) (rate(agent_tool_call_errors_total[5m]))
/
sum by (tool) (rate(agent_tool_call_total[5m]))

Alert threshold: any individual tool above 5% errors for ten minutes pages on-call. We tried looser thresholds. Every time we did, an external API went degraded and we found out from the customer.

3. p95 turn latency

Median latency is a comforting lie. People abandon agents on the long tail, not on the median.

histogram_quantile(0.95,
  sum by (le, agent_id) (
    rate(agent_turn_duration_seconds_bucket[5m])
  )
)

Threshold: chat agents page above 8 seconds at p95. Voice agents page above 1.2 seconds. Voice is unforgiving, and a two-second pause sounds like the line went dead. Both are five-minute sliding windows so a single slow request doesn't trip the page. If you are new to histogram quantiles in Prometheus, the official notes on bucket choice are worth twenty minutes before you ship the dashboard. Wrong buckets and your p95 is fiction.

4. Step count per session

If you only watch one observability panel, watch this one. Step count is where loops live.

We emit a histogram of how many planning iterations each conversation needed before it resolved or handed off. A healthy distribution has a long, thin tail. An unhealthy one has a spike at the maximum step limit, which means a meaningful chunk of conversations are hitting the ceiling and giving up.

histogram_quantile(0.99,
  sum by (le) (rate(agent_session_steps_bucket[30m]))
)

Threshold: p99 above 12 steps pages. The interesting alert here is not the page. It is the chart. When the p99 climbs from 8 to 11 over a week, something is degrading. Usually a retrieval source went stale and the agent is searching harder for the same answer. Step count moves before recall, before latency, before cost. It is the earliest warning we have.

Warning

Don't ship cost as your only alert. We did this for our first three agents. Cost is a lagging indicator of every other problem. By the time the bill jumps, the customer has been talking to a broken agent for an hour.

5. Schema validation failure rate

Structured outputs fail in two ways. The model returns malformed JSON, or it returns valid JSON that doesn't match your schema. Both are recoverable if you retry. Both quietly inflate your cost and latency budgets if you don't measure them.

sum(rate(agent_schema_validation_failures_total[5m]))
/
sum(rate(agent_structured_outputs_total[5m]))

Threshold: above 2% over five minutes is unusual. Above 5% is almost always a prompt change or a schema change that landed without a backfill. We page on 5%. We chart 2% as a warning line so the on-call sees the climb before it pages.

6. Human handoff rate

This is the business signal. The engineering signals tell you the agent works. Handoff rate tells you the agent helps.

Every agent we ship has an explicit escape hatch, a function the model can call when it decides it cannot resolve the customer's problem. We count those calls and divide by completed conversations.

sum(rate(agent_handoff_total[1h]))
/
sum(rate(agent_conversation_completed_total[1h]))

Threshold: 2× the seven-day baseline over one hour. The interesting thing about this signal is that it goes up before quality goes down. The model starts handing off more aggressively maybe an hour or two before recall on the eval set starts looking ugly. Use that head start.

7. Prompt injection and suspicious input rate

Prompt injection is the failure mode you can't fully prompt your way out of. Input the model treats as instruction rather than data: instruction overrides, role hijacks, hidden-data exfiltration, encoded payloads. OWASP keeps a living catalogue of these failure modes in their LLM Top 10, and it is the right place to start picking what to detect.

Every agent we ship runs incoming user messages through a fast classifier, a small model trained on a corpus of known injection patterns. It emits a counter when it fires. We do not block on the classifier. We log, count, and alert.

sum by (agent_id, signal) (
  rate(agent_input_classifier_total{class="suspicious"}[5m])
)

Threshold: any sustained rate over baseline for thirty minutes pages the security rotation, not the engineering one. Different people, different runbook. The first thing they do is pull the raw inputs from logs and decide whether the traffic is one curious user or a coordinated probe.

What we deliberately don't measure

We don't track "accuracy" as a live observability metric. Accuracy is something you measure offline on a labelled eval set after every prompt or model change. Trying to compute it in production means sampling conversations, sending them to a judge model, and trusting the judge. Three more things that can break.

We also don't alert on token spend in absolute terms. We alert on cost per resolved conversation. A campaign that drives 10× the traffic should drive 10× the spend without paging anyone at three in the morning.

The dashboard order matters

Step count and per-tool error rate go top-left. That is where on-call looks first. Cost sits top-right, because cost is the question the CFO asks and you want it answered before they ask. Latency and handoff sit in the middle row. Schema failures and injection signals go bottom, because if either of those is moving, you are already triaging.

None of this observability work is exotic. It is Prometheus, Grafana, a counter on every meaningful function call, and a histogram on every duration. The discipline is wiring it before launch, not after the first 2 a.m. incident.

When we built the email-agent for a Dutch logistics customer earlier this year, step count was the signal that saved us: cost and latency looked fine for two days while p99 steps crept up, because a vendor had silently changed how their tracking API returned multi-leg shipments. We wire the same seven panels into every AI agent we ship.

Smallest thing to do today: open your existing agent dashboard. Count the panels. If you have fewer than seven, the missing ones are the ones that will wake you up.

Key takeaway

Cost is a lagging indicator. Step count is where loops show up first, which is why it goes top-left on every agent dashboard we ship.

FAQ

What's a sensible upper bound on step count for a production agent?

We cap at 15 with a hard kill at 20. The interesting threshold is p99 above 12, which usually means something has degraded and the agent is working harder for the same answer.

Do you record metrics on every conversation or sample?

Every conversation produces metrics. Only full OpenTelemetry traces are sampled (10% in steady state, 100% whenever an alert is firing). Counters are cheap. Full traces aren't.

Which alerting backend sits on top of Grafana?

Grafana Alerting routes to PagerDuty for engineering signals and to a separate Slack-driven rotation for security signals. Different urgency, different on-call rota, different runbooks.

How do you avoid alert fatigue with seven signals?

Page on the rate of change against baseline, not absolute thresholds. Review every page in a weekly meeting. An alert nobody acted on gets deleted or retuned that week.

ai agentsoperationsarchitecturetoolingautomation

Building something?

Start a project