Voice agent incident: 1,140 stale tariff quotes at 13:00
Wednesday, 13:04 CET. A voice agent on a Groningen energy supplier's SIP trunk had just quoted the wrong variabel tarief to 1,140 huishoudens in 41 minutes.

It was 13:04 on a Wednesday in February. The shift supervisor at a 24-person energieleverancier in Groningen watched the call dashboard tick up: 1,140 outbound voice-agent calls had completed in the last 41 minutes, every one of them confirming a variabel-tariff offer to a household whose contract was up for renewal. She had a coffee in one hand and a knot in her stomach, because the EPEX day-ahead price feed had just refreshed at 13:00 and the new ceiling was 11 cents per kWh higher than the figure the agent had been quoting since 12:23.
The agent was ours. The 90-second TTL on the cached feed was ours. The Claude tool-use loop that re-fetched the cache instead of invalidating it was ours. This is the post-mortem.
The pipeline before the incident
We had built the supplier a Dutch-language outbound voice agent six months earlier. Twilio SIP trunk, our own ASR + TTS routing, Claude Sonnet 4.5 as the brain, and a small toolset for tariff lookup, contract verification, and hand-off to a human if the customer said anything that smelled like a complaint. Outbound load was modest: roughly 300 calls a day, mostly renewal nudges for contracts hitting their twelve-month mark.
The tariff lookup was the only tool that mattered. The customer would ask, in some form, "wat ga ik betalen". The agent would call get_current_tariff(postal_code, contract_type). That tool hit our internal pricing service, which itself read from a Redis cache populated by a worker that polled the EPEX SPOT day-ahead feed every minute.
Cache TTL: 90 seconds.
Why 90 seconds? Because the upstream feed was rate-limited to 60 requests per minute across the whole supplier, and three of their sister products shared the same quota. Ninety seconds gave us breathing room and, we assumed, meant a customer would never hear a price more than 90 seconds stale. We had been running this way for five months without a complaint.
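For orientation, the lookup tool described above might be declared to the model roughly as follows. This is a sketch in the name / description / input_schema shape used for Claude tool definitions. The description text, the example postcode, and the contract_type enum values are illustrative assumptions, not the production schema.

```python
# Hypothetical declaration of the tariff tool. Only the tool name and its two
# parameters come from the post; every other value here is illustrative.
GET_CURRENT_TARIFF_TOOL = {
    "name": "get_current_tariff",
    "description": (
        "Return the current tariff in c/kWh for a postal code and contract type."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "postal_code": {
                "type": "string",
                "description": "Dutch postcode, for example 9711 AB",
            },
            "contract_type": {
                "type": "string",
                "enum": ["variabel", "vast"],  # assumed values
            },
        },
        "required": ["postal_code", "contract_type"],
    },
}
```

Nothing in a declaration like this tells the model how old the returned price is, which is the gap the rest of this post is about.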
What happens at 13:00 in the Dutch power market
Anyone who works in Dutch energy knows the rhythm. The EPEX day-ahead auction closes at 12:00 CET. Clearing prices for the following delivery day are published shortly after. Until those prices land, suppliers run on the previous day's curve. Once they land, usually between 12:45 and 13:05, the curve shifts and any variabel-tariff offer needs to reflect the new band.
The ACM, the Dutch competition regulator, has guidance on how clearly and how promptly that shift must be reflected in customer-facing offers. It does not prescribe a millisecond SLA, but the spirit is plain: if you quote a price, it should be the price you can actually deliver at that moment.
Our voice agent was quoting a price from 12:23.
The mechanics of the failure
Three things compounded.
First, the price-feed worker had been logging a stream of 429 Too Many Requests responses from the EPEX endpoint between 12:55 and 13:02. The supplier's marketing team had kicked off a daily reporting job that hammered the same upstream. The worker retried, hit the per-supplier quota ceiling, and skipped its 13:00 and 13:01 refresh windows.
Second, the Redis cache key was scoped per postal-code-band, not per publication-window. The 12:23 fetch had populated 412 distinct keys, one per active postal band, and each of those keys had its own 90-second TTL ticking independently. As outbound calls landed and the agent looked up tariffs for different postal codes, the agent was getting cache hits on whichever keys were still warm. The warm ones were extending themselves because of an aggressive read-through-refresh pattern we had inherited from a sister service.
Third, and this is the one that hurts, the Claude tool-use loop was misbehaving in a way we had not anticipated. The agent's system prompt said, in effect, "if the tariff feels uncertain, verify by calling get_current_tariff again". When EPEX rate-limit warnings briefly leaked into the tool response payload (we surfaced upstream metadata for debugging and forgot to strip it on the production path), the model re-called the tool. The re-call hit the same warm cache. The model interpreted "same answer twice" as "verified" and committed to quoting it on the line.
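The second failure, a cache entry that keeps itself alive, is easy to reproduce in isolation. The sketch below is a toy in-memory cache, not our Redis setup: the class names, the key, and the price are made up, and the only thing it models is "a hit pushes the expiry forward without re-fetching".

```python
from __future__ import annotations

from dataclasses import dataclass

TTL_S = 90.0


@dataclass
class Entry:
    value: float
    fetched_at: float  # when the value was actually fetched upstream
    expires_at: float  # when the cache will evict it


class Cache:
    def __init__(self, extend_on_read: bool) -> None:
        self.extend_on_read = extend_on_read
        self.entries: dict[str, Entry] = {}

    def put(self, key: str, value: float, now: float) -> None:
        self.entries[key] = Entry(value, fetched_at=now, expires_at=now + TTL_S)

    def get(self, key: str, now: float) -> Entry | None:
        entry = self.entries.get(key)
        if entry is None or now >= entry.expires_at:
            self.entries.pop(key, None)
            return None
        if self.extend_on_read:
            # The bug pattern: a hit resets the TTL but keeps the old value.
            entry.expires_at = now + TTL_S
        return entry


def age_after_steady_reads(
    extend_on_read: bool, read_every_s: float, reads: int
) -> float | None:
    """Read one key repeatedly. Return the value's true age in seconds at the
    last read, or None if the entry expired along the way."""
    cache = Cache(extend_on_read)
    cache.put("band-9711", 28.4, now=0.0)  # hypothetical key and price
    entry = None
    for i in range(1, reads + 1):
        entry = cache.get("band-9711", now=i * read_every_s)
        if entry is None:
            return None
    return reads * read_every_s - entry.fetched_at
```

With one read a minute for 41 minutes, `age_after_steady_reads(True, 60.0, 41)` reports a value that is 2,460 seconds old behind a nominal 90-second TTL, while the same call with `extend_on_read=False` returns `None` because the entry expires after the first read.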
By the time the supervisor saw the dashboard, 1,140 households had been quoted a tariff that was 11.2 c/kWh under the post-13:00 band on average. The economic exposure, if every household took the offer, sat north of €380,000 over the contract year. They didn't all take it. About 7% did. The rest opted to call back later or were forwarded to a human.
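The exposure figure is simple arithmetic: households times under-quote times annual consumption. The post's numbers give the first two. The annual consumption of 3,000 kWh per household below is an assumption chosen for illustration, not a figure from the incident data.

```python
HOUSEHOLDS_QUOTED = 1_140
UNDERQUOTE_EUR_PER_KWH = 0.112  # 11.2 c/kWh average under-quote
ASSUMED_ANNUAL_KWH = 3_000      # assumption: typical household consumption


def exposure_eur(households: int, acceptance_rate: float = 1.0) -> float:
    """Annual revenue shortfall if the stale price is honoured."""
    return round(
        households * acceptance_rate * UNDERQUOTE_EUR_PER_KWH * ASSUMED_ANNUAL_KWH,
        2,
    )
```

Under that assumption, `exposure_eur(HOUSEHOLDS_QUOTED)` comes to about €383,040, and the same formula applied to 78 accepting households comes to about €26,208.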
The first hour
We killed outbound tariff calls at 13:07, three minutes after the supervisor flagged it, by triggering the kill-switch on the SIP trunk that we had built precisely for this kind of moment. We had used it exactly once before, to test it. Every customer who had been quoted the stale tariff got a callback the next day from a human, with an apology and the corrected price. The supplier honoured the lower of the two prices for any household that had already verbally accepted. That was about 78 households, and the cost was absorbed as a goodwill gesture.
The kill-switch worked. The detection didn't. Three minutes is a long time when you are placing 28 calls a minute.
The freshness-check gate
The fix is not "lower the TTL". A 30-second TTL would have reduced exposure but not eliminated it. The upstream rate-limit was the harder constraint. The fix was to stop trusting the cache for tariff-bearing outbound calls, and instead require a positive freshness assertion before any sentence containing a price leaves the SIP trunk.
We built a small middleware layer between the LLM's tool-call response and the TTS engine. Before the synthesizer reads a number that the response classifier flags as a tariff figure, the layer makes a synchronous call to a tariff_freshness endpoint. That endpoint returns three fields: the cache age in seconds, the most recent EPEX publication timestamp seen by the worker, and a boolean safe_to_quote. It also returns a reason string, which we log when the gate trips.
The boolean is the gate. If safe_to_quote is false, the synthesizer reads a fallback line instead: "Mag ik u een ogenblik later terugbellen met de actuele prijs?" ("May I call you back in a moment with the current price?")
async def gate_tariff_speech(utterance: str, tariff_value: float | None) -> str:
    if tariff_value is None:
        return utterance
    freshness = await freshness_client.check(
        tariff_value=tariff_value,
        max_cache_age_s=45,
        require_post_publication_window=True,
    )
    if not freshness.safe_to_quote:
        logger.warning(
            "tariff.quote.gated",
            cache_age_s=freshness.cache_age_s,
            last_publication=freshness.last_publication_ts,
            reason=freshness.reason,
        )
        return FALLBACK_DUTCH_CALLBACK_LINE
    return utterance
The require_post_publication_window=True flag is the part that catches a 13:00 EPEX rollover. The endpoint refuses to clear a quote between 12:55 and 13:10 unless it can confirm the worker has ingested a publication-stamped EPEX response inside that window. If it cannot, the gate trips, the agent says the callback line, and the call drops the tariff segment cleanly.
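The decision behind safe_to_quote can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the production endpoint: the function name, the reason strings, and the use of naive local wall-clock datetimes are ours for this example. Only the 45-second limit, the 12:55 to 13:10 window, and the field names come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, time

# Guard window from the post. Times are naive local (Amsterdam) wall-clock
# times, which is an assumption made to keep the sketch dependency-free.
WINDOW_START = time(12, 55)
WINDOW_END = time(13, 10)


@dataclass(frozen=True)
class Freshness:
    cache_age_s: float
    last_publication_ts: datetime | None
    safe_to_quote: bool
    reason: str


def evaluate_freshness(
    now: datetime,
    fetched_at: datetime,
    last_publication_ts: datetime | None,
    max_cache_age_s: float = 45.0,
    require_post_publication_window: bool = True,
) -> Freshness:
    # Age is measured from the upstream fetch, not from the last cache read,
    # so a TTL that was extended on read cannot make old data look young.
    cache_age_s = (now - fetched_at).total_seconds()
    if cache_age_s > max_cache_age_s:
        return Freshness(cache_age_s, last_publication_ts, False, "cache_too_old")

    in_window = WINDOW_START <= now.time() < WINDOW_END
    if require_post_publication_window and in_window:
        window_open = datetime.combine(now.date(), WINDOW_START)
        # Inside the window, insist on a publication ingested since it opened.
        if last_publication_ts is None or last_publication_ts < window_open:
            return Freshness(
                cache_age_s, last_publication_ts, False, "no_publication_in_window"
            )

    return Freshness(cache_age_s, last_publication_ts, True, "ok")
```

At 13:04 with a ten-second-old fetch but yesterday's publication timestamp, this returns safe_to_quote as false with reason "no_publication_in_window". A stricter variant could keep refusing after 13:10 until today's publication has been seen.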
A cache that can extend its own TTL is a cache that lies during the exact windows you most need it to tell the truth. Read-through-refresh is fine for a product catalogue. It is not fine for anything regulated.
Three runbook changes that stuck
The upstream rate-limit budget moved off the shared supplier quota and onto a dedicated key for the voice agent. This cost the supplier €180 a month and bought guaranteed headroom during publication windows. Cheap.
The Claude tool-use loop got a stricter contract on what counts as "verified". We added a verification_token returned by the freshness service that ties a quote to a specific cache fetch and publication window. The model is instructed not to treat a re-call as confirmation unless the token matches. The upstream metadata leak was patched. Production tool responses now strip everything that is not a price and a freshness payload.
The 13:00 window is now a maintenance window. Outbound calls are paused between 12:55 and 13:08 by default. The supplier's marketing team agreed without much argument once we showed them what the 41 minutes had cost.
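The verification_token in the second change can be illustrated with an HMAC over the fetch and the publication window. This is one plausible construction, assumed for the example: the secret, the fetch_id field, and the window labels are hypothetical, and the post does not specify how the real token is built.

```python
import hashlib
import hmac

SECRET = b"demo-only-secret"  # hypothetical; a real key would come from a secret store


def make_token(fetch_id: str, publication_window: str) -> str:
    """Bind a quote to one cache fetch and one publication window."""
    msg = f"{fetch_id}|{publication_window}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def token_matches(tool_response: dict, current_publication_window: str) -> bool:
    """A response counts as verified only if its token was issued for the
    publication window that is current right now."""
    expected = make_token(tool_response["fetch_id"], current_publication_window)
    return hmac.compare_digest(expected, tool_response["verification_token"])


# A response served from a fetch made under the previous window.
stale_response = {
    "price_c_per_kwh": 28.4,  # made-up value
    "fetch_id": "f-001",
    "verification_token": make_token("f-001", "window-previous"),
}
```

Calling the tool twice and getting `stale_response` both times proves nothing: `token_matches(stale_response, "window-current")` is false on both calls, so "same answer twice" no longer reads as confirmation.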
The moment of truth is the moment of speech
The freshness-check middleware added roughly 80 ms of synchronous latency to any utterance containing a tariff. The agent's average response latency went from 1.4 s to about 1.5 s. Customers do not notice. The supplier has since run three weeks of post-fix calls across the daily EPEX rollovers. There has not been another incident.
The generalisable bit is this: voice agents that quote regulated numbers need to treat the moment of speech as the moment of truth. Not the moment of tool-call. Not the moment of LLM completion. The moment the SIP trunk turns text into audio is the last point at which you can refuse to lie. Build the gate there.
When we built the voice agent for this energieleverancier, the failure class we ran into was one that does not show up in any unit test. A quiet, slow drift between cache state and reality, weaponised by a polite LLM that took the cache at its word. We ended up solving it by making the speech path the audit boundary. If you want that kind of gate on your own outbound calls, it is the work our AI agents practice does.
The five-minute audit
If you run any outbound voice or chat agent that quotes a regulated number, grep your tool responses for upstream metadata that should never reach the model. Then grep your system prompt for the phrase "verify by calling again". Those two findings cover the failure mode in this post. If you have either, you have a 13:04 in your future.
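Both checks fit in a few lines. The sketch below assumes tool responses are plain dicts and that an allowlist of two keys is what should reach the model. The key names and the phrase list are hypothetical and should be adapted to your own payloads and prompt.

```python
from __future__ import annotations

# Hypothetical allowlist: a price and a freshness payload, nothing else.
ALLOWED_KEYS = {"price_c_per_kwh", "freshness"}

# Prompt wording that invites "ask again and trust a repeat".
RISKY_PROMPT_PHRASES = ("verify by calling", "call again", "re-call")


def leaked_keys(tool_response: dict) -> set[str]:
    """Top-level keys that would reach the model but are not on the allowlist."""
    return set(tool_response) - ALLOWED_KEYS


def sanitize(tool_response: dict) -> dict:
    """Drop everything that is not on the allowlist."""
    return {k: v for k, v in tool_response.items() if k in ALLOWED_KEYS}


def risky_phrases(system_prompt: str) -> list[str]:
    lowered = system_prompt.lower()
    return [phrase for phrase in RISKY_PROMPT_PHRASES if phrase in lowered]
```

Run `leaked_keys` over a sample of logged production tool responses and `risky_phrases` over the system prompt. A non-empty result from either is worth a closer look.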
Key takeaway
Voice agents that quote regulated numbers should treat the synthesizer, not the tool call, as the audit boundary. That is the last point where you can refuse to lie.
FAQ
Why not just lower the cache TTL to a few seconds?
Because the upstream EPEX feed was rate-limited and the cache existed to respect that limit. A shorter TTL would have multiplied 429s, not eliminated stale quotes. The gate has to live at the speech path, not in the cache.
Why put the freshness check between the LLM and the TTS engine instead of inside the tool?
Because the LLM can re-call a tool, reason over its output, and still commit to a stale number. Gating at the synthesizer is the last layer the model cannot route around. It is the only point where speech is final.
Did the model actually believe a duplicated cache hit counted as verification?
Effectively yes. Our prompt invited a re-call on uncertainty and our tool response did not include a verification token tying the answer to a publication window. Same payload twice read as confirmation. We fixed both sides.
How did you stop outbound calls in three minutes?
A pre-built kill-switch on the SIP trunk that drops new outbound legs and lets in-flight calls complete. We had built it on day one and tested it once. Build yours before you need it.
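The behaviour described in that last answer, refuse new legs and let in-flight calls finish, can be sketched as a small in-process gate. This is an illustration only: the class and method names are hypothetical, it does not use the Twilio API, and a real switch would sit in the dialer or on the trunk itself.

```python
from __future__ import annotations

import threading


class OutboundKillSwitch:
    """Blocks new outbound calls once tripped; calls already in flight may finish."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tripped = False
        self._in_flight: set[str] = set()

    def trip(self) -> None:
        with self._lock:
            self._tripped = True

    def try_start(self, call_id: str) -> bool:
        # The dialer asks before opening each new leg.
        with self._lock:
            if self._tripped:
                return False
            self._in_flight.add(call_id)
            return True

    def finish(self, call_id: str) -> None:
        with self._lock:
            self._in_flight.discard(call_id)

    def in_flight_count(self) -> int:
        with self._lock:
            return len(self._in_flight)
```

Tripping the switch does not touch the in-flight set, so calls that are mid-sentence end normally while `try_start` returns false for everything new.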