
Agent observability: picking a stack at 4M spans a month

An operations lead pings you at 22:47: the agent's tool-call success rate dropped to 86% after dinner and nobody got paged. The stack you pick decides whether that ever happens again.

Jacob Molkenboer · Founder · A Brand New Company · 6 May 2025 · 7 min


It is 22:47 on a Tuesday. An operations lead at a 28-person Dutch HR-tech firm pings us in Slack. Their procurement agent has run a single tool call 4.1 million times this week. Sometime after dinner the success rate dropped from 99.3% to 86%. Nobody got paged, because their agent observability is a Grafana dashboard that nobody opens after 18:00. The agent kept burning API credits until a finance bot flagged the spend at 22:30.

This is the boring failure mode of agent software. Not a jailbreak, not a hallucination. A retry loop that nobody is watching, in a system that produces millions of structured events a month. The question we walk into the next morning is always the same: what observability stack should we install so this never happens twice, given a team of four engineers and a real budget?

The three stacks on our shortlist

We have shipped fourteen production agents at this point. For a Dutch B2B SaaS doing between €2M and €20M in revenue, three options actually clear the bar:

Langfuse, self-hosted on Hetzner. Open-source core, with enterprise features split out under a separate EE license. Postgres for metadata, ClickHouse for traces since v3. Has prompts, evals, datasets, and a UI a non-engineer can read.
Arize Phoenix Cloud. Hosted version of the open-source Phoenix project. OpenTelemetry-native, OpenInference semantic conventions, evals built in. You pay per ingested span and per seat.
Hand-rolled OpenTelemetry collector into ClickHouse, with Grafana on top. No vendor in the loop. You own the schema, the dashboards, and every alert rule.

We have not seen LangSmith land for a Dutch buyer once in the last twelve months. The DPO conversation ends it before pricing matters.

Trace-replay cost at four million spans a month

Replay is the part most teams forget to price. When a prompt or tool definition changes, you want to re-run the last 30 days of production traces against the new version and compare outputs. At 4M spans a month, a 30-day window is around 16 to 24 GB of compressed trace data, depending on tool-call payload size. The cost is not storage. The cost is the model spend during replay, plus the cost of the system that can actually iterate over those spans fast enough to be useful.
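The 16 to 24 GB range is easy to sanity-check. The sketch below assumes 4 to 6 KB of compressed data per span, which is simply what the figures above imply rather than a measured value; `window_size_gb` is a hypothetical helper.

```python
SPANS_PER_MONTH = 4_000_000


def window_size_gb(spans: int, compressed_bytes_per_span: int) -> float:
    """Size of a replay window in decimal gigabytes."""
    return spans * compressed_bytes_per_span / 1e9


# Assumed per-span sizes, back-calculated from the 16 to 24 GB estimate.
low = window_size_gb(SPANS_PER_MONTH, 4_000)
high = window_size_gb(SPANS_PER_MONTH, 6_000)
print(f"30-day window: {low:.0f} to {high:.0f} GB")
```

Swap in your own measured bytes per span; large tool-call payloads push the figure up quickly.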

The math, rounded to what we have actually paid:

Langfuse self-hosted. One Hetzner CCX33 (about €48 a month) handles 4M spans without breathing hard. Replay runs inside Langfuse against your prompt versions, so the only real cost is the model spend itself. Backups to a Hetzner Storage Box add about €5.
Phoenix Cloud. Span ingestion fees stack up quickly past the free tier. You also pay per seat. Replay is well integrated, but you cannot move the data out without paying for export. Budget €400 to €900 a month at 4M spans, depending on retention.
Hand-rolled OTel and ClickHouse. Same Hetzner box plus a second one for Grafana, around €70 a month. But replay is whatever you build. The first time someone needs to re-run 200k spans against a new prompt, you discover that nobody wrote that script yet. Allow two engineer-weeks for tooling that Langfuse ships out of the box.
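For the hand-rolled option, the missing replay script has roughly the shape below. This is a minimal sketch, not a finished tool: `Span`, `batched`, and `replay` are hypothetical names, the spans are assumed to be already loaded from the trace store, and `run_new_version` stands in for whatever calls the model with the new prompt.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Span:
    span_id: str
    tool_input: str
    recorded_output: str


def batched(spans: Iterable[Span], size: int) -> Iterator[list[Span]]:
    """Yield lists of at most `size` spans so memory stays bounded."""
    it = iter(spans)
    while batch := list(islice(it, size)):
        yield batch


def replay(
    spans: Iterable[Span],
    run_new_version: Callable[[str], str],
    batch_size: int = 500,
) -> list[str]:
    """Return the IDs of spans whose output changes under the new version."""
    changed: list[str] = []
    for batch in batched(spans, batch_size):
        for span in batch:
            if run_new_version(span.tool_input) != span.recorded_output:
                changed.append(span.span_id)
    return changed
```

An exact string comparison is the crudest possible diff. A real replay would likely need a tolerance or a judge, plus rate limiting and a model-spend cap per batch.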

DPO-friendliness

Every sub-€20M Dutch B2B SaaS we work with has a part-time DPO, sometimes a fractional one shared with three other startups. They are not adversarial. They are tired. The stack that requires the least explanation wins.

Self-hosted Langfuse on a Hetzner box in Falkenstein or Helsinki is a five-line addition to the existing data-processing register. The processor is the same one already used for Postgres. No new sub-processor, no new transfer impact assessment.

Phoenix Cloud runs on AWS, primarily US-east. That means an EU-US Data Privacy Framework justification and, for any customer doing payroll or health data, a separate Autoriteit Persoonsgegevens conversation that nobody wants to have. The vendor will sign a DPA. The customer's procurement still needs six weeks to approve it.

Hand-rolled wins on paper. It also forces you to be the one who deletes a user's trace history when an Article 17 request comes in. We have written that script three times now. Budget a day for it.
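The core of such an erasure script can be sketched as follows. The table names and the `user_id` column are assumptions about a hand-rolled schema, and `build_erasure_statements` is a hypothetical helper. It only builds the statements; executing them with your ClickHouse client, passing `user_id` as a query parameter, is left out.

```python
def build_erasure_statements(
    tables: list[str], user_column: str = "user_id"
) -> list[str]:
    """Build one ClickHouse DELETE mutation per table.

    The user ID stays a server-side query parameter ({user_id:String}),
    so it is never interpolated into the SQL text.
    """
    statements = []
    for identifier in [user_column, *tables]:
        # Identifiers cannot be parameterized, so restrict them strictly.
        if not identifier.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {identifier!r}")
    for table in tables:
        statements.append(
            f"ALTER TABLE {table} DELETE WHERE {user_column} = {{user_id:String}}"
        )
    return statements


# Hypothetical table names for a hand-rolled trace schema.
for statement in build_erasure_statements(["agent_spans", "agent_tool_payloads"]):
    print(statement)
```

ClickHouse runs these deletes as asynchronous mutations by default, so the script should confirm completion before the request is closed. Backups taken earlier still contain the rows and need their own retention answer.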

Warning

If your agent calls third-party APIs that return personal data inside tool responses (a CRM lookup, a payslip endpoint), those payloads end up in your trace store verbatim. Configure span redaction at the SDK layer, not in the dashboard. Once it lands in ClickHouse, it is in your backups for years.
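A minimal sketch of redaction before a payload is attached to a span, assuming tool responses are JSON-like dicts. The key list and the function names are hypothetical; in practice this would be called from whatever hook your tracing SDK offers for setting span attributes.

```python
import json
from typing import Any

# Hypothetical list of keys that carry personal data in tool responses.
SENSITIVE_KEYS = frozenset({"email", "bsn", "iban", "salary", "name"})
MASK = "[REDACTED]"


def redact(payload: Any) -> Any:
    """Return a copy of the payload with values under sensitive keys masked."""
    if isinstance(payload, dict):
        return {
            key: MASK if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return payload


def span_attribute_value(tool_response: dict) -> str:
    """Serialize the redacted response for use as a span attribute."""
    return json.dumps(redact(tool_response), sort_keys=True)
```

Key-based masking misses personal data inside free-text fields. Treat it as a floor, and consider dropping whole payloads for tools like a payslip endpoint.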

Who builds the 92% alert

The alert that would have caught the Tuesday incident is simple to write and easy to forget. Tool-call success below 92% for five minutes, page someone. The interesting question is who owns it.

In Langfuse, you set it up in the UI in about ten minutes. Evaluators run on a schedule, a Slack webhook fires, done. An operations lead can maintain it without filing a ticket.

In Phoenix, evals are first-class, but alerting depends on you wiring the cloud's evaluation hooks to PagerDuty or Opsgenie yourself. Doable in a sprint. Not maintainable by anyone who is not an engineer.

In the hand-rolled stack, you write the rule yourself. The basic shape, as a Prometheus-style alerting rule, once your OTel collector is exporting tool-call spans with a status attribute and deriving an agent_tool_calls_total counter from them:

groups:
  - name: agent-tool-calls
    rules:
      - alert: AgentToolCallSuccessLow
        # Share of tool calls with status="ok", smoothed over a 10-minute rate window.
        expr: |
          sum(rate(agent_tool_calls_total{status="ok"}[10m]))
            /
          sum(rate(agent_tool_calls_total[10m]))
            < 0.92
        # The ratio must stay below 0.92 for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Tool-call success rate below 92% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/agent-toolcall-low"

That rule is fifteen lines of YAML. The hidden cost is the runbook URL. The first time the alert fires at 02:00, the on-call engineer wants the trace IDs of the failing calls in the alert body. That requires templating the Grafana link with the right time window and the right filter. Allow a day for that polish.
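The link templating can be sketched like this. Grafana dashboards accept `from` and `to` as epoch milliseconds and `var-<name>` for dashboard variables; the base URL, the `tool` and `status` variable names, and the 15-minute lookback are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def failing_calls_link(
    base_url: str,
    fired_at: datetime,
    tool: str,
    lookback: timedelta = timedelta(minutes=15),
) -> str:
    """Build a dashboard URL scoped to the window before the alert fired."""
    start = fired_at - lookback
    params = {
        "from": int(start.timestamp() * 1000),  # epoch milliseconds
        "to": int(fired_at.timestamp() * 1000),
        "var-tool": tool,        # hypothetical dashboard variable
        "var-status": "error",   # hypothetical dashboard variable
    }
    return f"{base_url}?{urlencode(params)}"


link = failing_calls_link(
    "https://grafana.example/d/agent-tool-calls",  # placeholder URL
    datetime.now(timezone.utc),
    "crm_lookup",
)
print(link)
```

The resulting URL goes into the alert annotation or the paging payload, so the on-call engineer lands on the failing spans rather than on a blank dashboard.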

The scoring sheet we actually use

We run this matrix on every agent project. Numbers are 1 to 5, higher is better for the buyer.

Criterion	Langfuse self-host	Phoenix Cloud	OTel + ClickHouse
Trace-replay cost at 4M spans	5	2	3
DPO-friendliness (EU data, sub-processor count)	5	2	5
Time to first useful alert	4	3	2
Who can maintain it after we leave	4	3	2
Ceiling at 50M spans a month	3	4	5
Eval primitives included	5	5	1

The interesting row is the last one. If your evaluation logic is a panel of three model judges and a regex check, both Langfuse and Phoenix give you that out of the box. If your evaluation logic involves replaying a tool call against a sandbox of your own ERP, none of the three helps. You build that yourself in all three worlds.
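The matrix can be totalled in a few lines, which makes it easy to see how the ranking shifts when one criterion matters more. The scores below are copied from the table; the weights are hypothetical and not part of the scoring sheet itself.

```python
CRITERIA = [
    "Trace-replay cost at 4M spans",
    "DPO friction",
    "Time to first useful alert",
    "Who can maintain it after we leave",
    "Ceiling at 50M spans a month",
    "Eval primitives included",
]

# Scores per stack, in the same order as CRITERIA.
SCORES: dict[str, list[int]] = {
    "Langfuse self-host": [5, 5, 4, 4, 3, 5],
    "Phoenix Cloud": [2, 2, 3, 3, 4, 5],
    "OTel + ClickHouse": [3, 5, 2, 2, 5, 1],
}


def rank(weights: list[int]) -> list[tuple[str, int]]:
    """Return (stack, weighted total) pairs, best first."""
    if len(weights) != len(CRITERIA):
        raise ValueError("one weight per criterion is required")
    totals = {
        stack: sum(score * weight for score, weight in zip(scores, weights))
        for stack, scores in SCORES.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(rank([1, 1, 1, 1, 1, 1]))  # equal weights
print(rank([1, 1, 1, 1, 3, 1]))  # assumed: scale ceiling counts triple
```

With equal weights the totals are 26, 19, and 18. Tripling the weight on the scale ceiling lifts OTel + ClickHouse past Phoenix Cloud (28 against 27), while Langfuse still leads on 32.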

What we usually pick

For a Dutch B2B SaaS in the €2M to €20M band, our default is Langfuse self-hosted on a Hetzner CCX33 in Helsinki, with backups to a Storage Box, and a SOPS-encrypted secrets file in the repo. We move them to a managed Postgres only when they hit Series A and the procurement story tightens up.

We move to the hand-rolled OTel and ClickHouse stack in two cases. One: span volume passes 20M a month and Langfuse's evaluator scheduling starts to lag. Two: the customer's data lives in a regulated environment (a Dutch zorgverzekeraar, or health insurer, or a bank) where every dependency needs an internal security review, and adding an OTel collector and a ClickHouse pod to the existing cluster is cheaper than taking Langfuse through that review.

We have recommended Phoenix Cloud exactly twice in the last year. Both were teams with no platform engineer and a US parent company who already had an AWS DPA in place. Their constraint was not technology. It was that nobody on the team had ever SSHed into a server.

The failure modes of agents in 2026 are dominated by retries, loops, and silent degradation, not by exotic safety incidents. The team that catches them in time is the team that paged someone at the 92% threshold, not the team that ran a board-level red-team workshop in March.

When we built the procurement agent for a Dutch HR-tech client this spring, the tool-call success rate looked fine in aggregate but was 60% on one customer tenant whose API quota had been silently halved. We shipped per-tenant evaluators in Langfuse as part of an AI agents rollout and paged on a 90% floor, and the regression was caught the day it started.
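The per-tenant check is simple to express outside any particular tool. This is a sketch of the idea, not the Langfuse evaluator itself: `tenants_below_floor` and the `min_calls` guard are hypothetical, and the call records are assumed to be `(tenant, succeeded)` pairs pulled from your trace store.

```python
from collections import defaultdict
from typing import Iterable


def tenants_below_floor(
    calls: Iterable[tuple[str, bool]],
    floor: float = 0.90,
    min_calls: int = 50,
) -> dict[str, float]:
    """Return the success rate of every tenant that falls below the floor."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for tenant, succeeded in calls:
        totals[tenant] += 1
        successes[tenant] += int(succeeded)
    return {
        tenant: successes[tenant] / totals[tenant]
        for tenant in totals
        # Skip tenants with too few calls to give a meaningful rate.
        if totals[tenant] >= min_calls
        and successes[tenant] / totals[tenant] < floor
    }


# Illustrative data: the aggregate is about 95%, one tenant sits at 60%.
calls = (
    [("acme", True)] * 990 + [("acme", False)] * 10
    + [("globex", True)] * 60 + [("globex", False)] * 40
)
print(tenants_below_floor(calls))
```

In this made-up sample the aggregate rate would never trip a 92% alert, yet the smaller tenant is failing four calls in ten.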

The smallest thing you can do today: open your agent's most-called tool, write down the success rate over the last seven days from your existing logs, and decide which row of the matrix you sit in. Ten minutes once you have that one number.

Key takeaway

Default to Langfuse on a Hetzner box for sub-€20M Dutch SaaS. Move to OTel and ClickHouse only past 20M spans a month or strict regulator review.

FAQ

How long should I keep agent trace data?

Thirty to ninety days for replay, seven to fourteen for hot dashboards. Drop full tool-call payloads after fourteen days unless an active eval still needs them.

Does Langfuse self-host really need ClickHouse?

Since v3, yes. The single-box Docker compose ships ClickHouse alongside Postgres. At four million spans a month, one CCX33 handles it without sharding.

Can I migrate from Phoenix Cloud to self-host later?

Yes, but plan for it. Export costs apply, schema differences exist, and your prompt and eval objects move by hand. Easier to start self-hosted if you have the choice.

What is a sensible tool-call success alert threshold?

Start at 92% over a five-minute window. Tighten to 95% once you trust your eval set. Looser for tools known to be flaky, like third-party CRMs with rate limits.
