Coding agent throttling: a Haarlem agency's nine hours

A 22-person Haarlem agency's coding agent slowed to a crawl for nine hours on a Friday in May. The throttle was upstream. The bill was theirs. Here is what we changed.

Jacob Molkenboer · Founder · A Brand New Company · 19 Jul 2025 · 9 min

It is 09:14 on a Friday in May. A 22-person product agency in Haarlem opens its sprint board. Five engineers, three designers, two ops staff. The coding agent embedded in their IDEs has been running quietly for four months. By 10:30 the same agent takes forty seconds to finish a function that took six seconds yesterday. By 13:00 it has stopped completing larger refactors at all, but it still answers. It still feels alive. That is the worst kind of outage.

We were brought in the following Monday to write the post-mortem. This post walks through nine hours of silent degradation, what was actually happening upstream, and the three failover patterns we now build into every coding agent we ship.

First guess: the network

09:14, baseline. Engineer A finishes a Stripe webhook handler in roughly the time anyone would expect. Two diff suggestions, three iterations, done by coffee.

10:30. Engineer B opens a ticket in the agency's internal Linear. "Agent is slow today, is the office VPN doing something weird?" Three other engineers reply that it feels fine to them. They are looking at autocomplete, which has not degraded yet. The longer-context tasks have.

11:00. Engineer A's second refactor of the morning stalls. The IDE shows the spinner. After ninety seconds it returns half an answer. They retry. It returns the same half. They blame the branch.

11:30. Engineer C, working on a different repo, sees the same thing. A small chorus forms in the Slack channel. The senior engineer, head-down on a client demo, looks up and says "check the vendor status page." It is green. Everything is green.

12:00. Lunch. The IDE still works, but nobody trusts it. The team starts pasting code into a chat window instead of letting the agent finish. Same model. Different surface. It is slower than usual but it works. Hypothesis: the IDE plugin is broken. They will reinstall after lunch.

Six hours of false alarms

This is the part of the day that costs the most money. Not the throttle itself. The hunt.

By 14:00 the team has reinstalled two extensions, swapped to a different SSH key, restarted three Docker daemons, and rotated one API key. The senior engineer has read the vendor status page four times. Still green. They open a support ticket. The ticket gets an automated reply. They will not hear back until Tuesday.

15:00, the demo for a longstanding client. The lead engineer wants to walk through a clean refactor live. The agent stalls for seventy seconds before producing a response that has dropped half the input context. They pivot to slides.

16:00. The lead finally calls the throttle. They downgrade the team to the second-best model for the rest of the day. It is faster. It is also worse. Two engineers stop using the agent entirely and revert to typing by hand. The agency's internal pricing assumes a 28% velocity multiplier from the agent. That multiplier evaporates between 09:14 and 17:30.

The damage that Friday, before billing or rework, was nine hours times five engineers at a blended rate near 95 euro per hour. Roughly 4,275 euro of capacity, plus a client demo that did not ship, plus a sprint that slipped to the following Wednesday.
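
The capacity figure is plain multiplication. As a minimal sketch, using only the numbers quoted above and a helper name that is ours for illustration:

```typescript
// Hypothetical helper: lost capacity in euro = hours x engineers x blended hourly rate.
function lostCapacityEuro(hours: number, engineers: number, ratePerHour: number): number {
  return hours * engineers * ratePerHour;
}

// 9 hours x 5 engineers x 95 euro per hour
const fridayLoss = lostCapacityEuro(9, 5, 95);
```

The missed demo and the slipped sprint sit outside this number.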

The upstream story

Some of this is public. The rest we pieced together from the agency's logs.

That week the team's primary model provider had quietly adjusted tier behavior on a slice of accounts. We never got a public explanation. Capacity pressure, a soft rollout of new throttles, a misconfigured load balancer, a noisy-neighbor account on the same shard. Pick one. What matters for an operations lead is not the cause. It is the shape of the change.

The throttle did not return 429 errors. It returned degraded responses. Slower time-to-first-token. Shorter effective context windows. Higher refusal rates on tool calls. A user could not tell from the IDE whether a slow response was an unlucky slow response or a quietly downshifted one.

The vendor's status page showed no incident. The vendor's published rate-limit documentation still listed the same tier numbers. The change was real. The signal was not.
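
Because none of these signals arrives as an error code, the client has to compare what it observes against its own baseline. The sketch below is one hedged way to do that. The field names, the baseline shape, and the factor of 2 are all assumptions for illustration, and how you detect a truncated reply is left open.

```typescript
// Hypothetical per-call measurements collected on the client side.
interface CallSample {
  ttftMs: number;       // time to first token, in milliseconds
  toolRefused: boolean; // the model declined a tool call
  truncated: boolean;   // the reply ignored part of the input (detection method is up to you)
}

// What "normal" looked like for this account, measured on a healthy day.
interface Baseline { ttftMs: number; refusalRate: number; truncationRate: number; }

type Signal = "slow-first-token" | "more-refusals" | "more-truncation";

function median(xs: number[]): number {
  if (xs.length === 0) return 0;
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function rate(samples: CallSample[], pick: (s: CallSample) => boolean): number {
  return samples.length ? samples.filter(pick).length / samples.length : 0;
}

// Flags each signal that is worse than baseline by more than `factor`.
// factor = 2 is an arbitrary illustrative threshold, not a recommendation.
function degradationSignals(samples: CallSample[], base: Baseline, factor = 2): Signal[] {
  const out: Signal[] = [];
  if (samples.length === 0) return out;
  if (median(samples.map(s => s.ttftMs)) > base.ttftMs * factor) out.push("slow-first-token");
  if (rate(samples, s => s.toolRefused) > base.refusalRate * factor) out.push("more-refusals");
  if (rate(samples, s => s.truncated) > base.truncationRate * factor) out.push("more-truncation");
  return out;
}
```

Any non-empty result is a reason to look, even while the status page stays green.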

Three failover patterns we now ship

After the Haarlem post-mortem we rewrote the routing layer in every coding agent and email agent we maintain. Three patterns. All cheap. All assume the upstream will lie at some point.

Active-passive routing with health beacons

The first pattern is a router that picks a provider based on observed health, not a config flag. A small in-process function records p95 latency and error rate over a rolling 60-second window. Every call updates the beacon. The router reads the beacon before each request.

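// router/health.ts
export type Provider = "anthropic" | "openrouter" | "openai" | "local";

interface Sample { ms: number; ok: boolean; at: number; }
const WINDOW_MS = 60_000;
// Rolling per-provider sample log, oldest sample first.
const log = new Map<Provider, Sample[]>();

export function record(p: Provider, ms: number, ok: boolean, now: number) {
  const arr = log.get(p) ?? [];
  arr.push({ ms, ok, at: now });
  // Drop samples that have aged out of the window.
  while (arr.length && now - arr[0].at > WINDOW_MS) arr.shift();
  log.set(p, arr);
}

export function beacon(p: Provider, now: number = Date.now()) {
  // Ignore stale samples on read as well. Otherwise a provider that tripped the
  // ceiling is never called again, never records again, and stays benched forever.
  const arr = (log.get(p) ?? []).filter(s => now - s.at <= WINDOW_MS);
  // Latency is measured over successful calls only, sorted ascending.
  const ok  = arr.filter(s => s.ok).map(s => s.ms).sort((a, b) => a - b);
  const p95 = ok[Math.floor(ok.length * 0.95)] ?? 0;
  const errorRate = arr.length ? 1 - ok.length / arr.length : 0;
  return { provider: p, p95, errorRate, samples: arr.length };
}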

The thresholds matter. We set p95 latency ceilings per task type, not per provider. A multi-file refactor on a Friday at 14:00 has a different latency expectation than a seven-line autocomplete. When the primary beacon trips the ceiling, the next provider in the list takes over for the next request, not the next minute. The IDE does not wait.

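// router/route.ts
import { beacon, record } from "./health";
import type { Provider } from "./health";

// Provider client wrapper, not shown in this post. The signature is assumed.
declare function call(p: Provider, prompt: string): Promise<string>;

const CHAIN: Provider[] = ["anthropic", "openrouter", "openai", "local"];
const CEILINGS = { p95Ms: 8_000, errorRate: 0.10 };

export async function complete(prompt: string) {
  for (const p of CHAIN) {
    const b = beacon(p);
    // Try a provider when it has too few samples to judge, or sits under both ceilings.
    if (b.samples < 3 || (b.p95 < CEILINGS.p95Ms && b.errorRate < CEILINGS.errorRate)) {
      const t0 = Date.now();
      try {
        const out = await call(p, prompt);
        record(p, Date.now() - t0, true, Date.now());
        return out;
      } catch {
        // Count the failure against this provider, then fall through to the next one.
        record(p, Date.now() - t0, false, Date.now());
      }
    }
  }
  throw new Error("all providers degraded");
}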

In Haarlem's case, with this in place, the team would have flipped from the primary endpoint to a secondary at 10:42 instead of 17:30. The secondary was the same model family on a different host. They would have lost ninety minutes, not nine hours.
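
The router snippet above uses one global ceiling for brevity, while the text describes ceilings per task type. A minimal sketch of that lookup follows. The task names match the ones used elsewhere in this post. Only the 8,000 ms and 0.10 values come from the snippet above. The other numbers are placeholders.

```typescript
type TaskType = "autocomplete" | "explain" | "refactor";

interface Ceiling { p95Ms: number; errorRate: number; }

// Illustrative values. Tune them against your own healthy-day measurements.
const TASK_CEILINGS: Record<TaskType, Ceiling> = {
  autocomplete: { p95Ms: 1_000, errorRate: 0.10 },
  explain:      { p95Ms: 4_000, errorRate: 0.10 },
  refactor:     { p95Ms: 8_000, errorRate: 0.10 },
};

// Same shape as the object a health beacon returns, minus the provider name.
interface BeaconReading { p95: number; errorRate: number; samples: number; }

// A provider is usable for a task when there is too little data to judge it,
// or when it sits under both ceilings for that task type.
function isHealthyFor(task: TaskType, b: BeaconReading, minSamples = 3): boolean {
  if (b.samples < minSamples) return true;
  const c = TASK_CEILINGS[task];
  return b.p95 < c.p95Ms && b.errorRate < c.errorRate;
}
```

For this to be meaningful, the health log would also need to be keyed by task type, so that slow refactors do not push autocomplete over its much tighter ceiling.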

Per-task token budgets with graceful degradation

The second pattern is a budget per task type. Autocomplete gets a 2,000-token input budget and a small model. Refactors get 16,000 input tokens and a large model. Explain-this-code gets 8,000 and a medium model. The budget is enforced before the request is sent.

// budgets.ts
export const BUDGETS = {
  autocomplete: { inputTokens: 2_000,  outputTokens: 200,   tier: "small"  },
  explain:      { inputTokens: 8_000,  outputTokens: 800,   tier: "medium" },
  refactor:     { inputTokens: 16_000, outputTokens: 4_000, tier: "large"  },
} as const;

export function downshift(tier: "small" | "medium" | "large") {
  if (tier === "large")  return "medium";
  if (tier === "medium") return "small";
  return "small";
}

This sounds like premature optimization until you watch what happens during a throttle. When the large tier slows to a crawl, the router downshifts refactor jobs to the medium tier instead of failing them or timing out. The output is worse. It still ships. The engineer is in the loop, sees the downshift in a small status bar, and decides whether to accept.

We borrowed the idea from CDN failover, where a degraded experience beats a missing one. A reader still wants the article when the image CDN is down.
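
Putting the budget and the downshift together might look like the sketch below. It is an illustration under assumptions: `tierHealthy` is a hypothetical callback (it could be backed by a health beacon), and clamping the input to the budget stands in for whatever trimming strategy you actually use.

```typescript
type Tier = "small" | "medium" | "large";
type Task = "autocomplete" | "explain" | "refactor";

interface Budget { inputTokens: number; outputTokens: number; tier: Tier; }

// Same numbers as the budgets table in this post.
const TASK_BUDGETS: Record<Task, Budget> = {
  autocomplete: { inputTokens: 2_000,  outputTokens: 200,   tier: "small"  },
  explain:      { inputTokens: 8_000,  outputTokens: 800,   tier: "medium" },
  refactor:     { inputTokens: 16_000, outputTokens: 4_000, tier: "large"  },
};

function stepDown(tier: Tier): Tier {
  return tier === "large" ? "medium" : "small";
}

interface RequestPlan {
  tier: Tier;
  inputTokens: number;     // what will actually be sent
  maxOutputTokens: number;
  downshifted: boolean;    // surface this in the status bar
}

// Decide tier and token limits before the request is sent.
function planRequest(task: Task, promptTokens: number, tierHealthy: (t: Tier) => boolean): RequestPlan {
  const budget = TASK_BUDGETS[task];
  let tier: Tier = budget.tier;
  // Walk down one tier at a time until a healthy tier is found or we hit the floor.
  while (tier !== "small" && !tierHealthy(tier)) tier = stepDown(tier);
  return {
    tier,
    inputTokens: Math.min(promptTokens, budget.inputTokens),
    maxOutputTokens: budget.outputTokens,
    downshifted: tier !== budget.tier,
  };
}
```

The `downshifted` flag is what lets the engineer see the change and decide whether to accept the result.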

Local fallback for completion-grade tasks

The third pattern is the one most teams skip and then regret. For autocomplete-grade tasks, a small local model running on the engineer's laptop is good enough. Not for a refactor across eight files. For a thirty-token completion or a single-line suggestion, yes.

We ship every coding agent with a local model option preloaded. The router treats it as the last entry in the chain. It also runs on a one-second budget. If a remote provider has not responded inside that budget for an autocomplete, the local model takes over for the next call.
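
The one-second rule can be modelled as a small gate that the router consults before each autocomplete. This is a sketch, not the shipped code. The names are ours, and the cool-down that decides when to try the remote provider again is an assumption, since the rule above only covers the next call.

```typescript
type Target = "remote" | "local";

// budgetMs: how long a remote autocomplete may take (one second, per the text).
// coolDownMs: hypothetical period to stay on the local model before retrying remote.
function createAutocompleteGate(budgetMs = 1_000, coolDownMs = 60_000) {
  let localUntil = 0; // timestamp (ms) until which calls go to the local model

  return {
    // Report how long the last remote autocomplete took.
    // Pass a value above budgetMs when the call timed out.
    observeRemote(elapsedMs: number, now: number): void {
      if (elapsedMs > budgetMs) localUntil = now + coolDownMs;
    },
    // Where the next autocomplete should go.
    next(now: number): Target {
      return now < localUntil ? "local" : "remote";
    },
  };
}
```

Passing `now` in explicitly keeps the gate easy to test. In the router you would call it with `Date.now()`.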

The Haarlem agency was running Apple Silicon laptops with 32 GB of RAM. A 7B-parameter coding model fit comfortably. None of their engineers had ever opened the local fallback before the Friday outage because they had never needed to. After the outage, the senior engineer set it as the default for the first thirty minutes of every workday, so the team verifies daily that it still works.

Takeaway

The provider you depend on most is the one most likely to fail invisibly. Build the fallback while you are not bleeding.

The Monday post-mortem

The full audit produced six action items. Two are worth naming here because every team we have looked at since has had the same gap.

First, no synthetic health check. The agency monitored their own production services with synthetic traffic. They did not monitor their developer tools. A short cron job that fires a known prompt at the coding agent every five minutes and alerts on a p95 above five seconds would have flagged the throttle at 10:30, not 17:30.

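#!/usr/bin/env bash
# Run every 5 min via cron. Exits non-zero when the p95 of the last 20 runs exceeds 5 s.
# Assumes GNU date (%3N = milliseconds) and that your cron wrapper alerts on a non-zero exit.
log=/var/log/agent-latency.log
start=$(date +%s%3N)
if curl -sf --max-time 30 "$CODING_AGENT_URL/complete" \
  -H "content-type: application/json" \
  -d '{"task":"healthcheck","prompt":"return the string OK"}' \
  -o /tmp/agent.out; then
  end=$(date +%s%3N)
  echo "$((end - start))" >> "$log"
else
  # A failed or timed-out call is logged at the 30 s timeout so it still counts as slow.
  echo 30000 >> "$log"
fi
# With 20 sorted samples, the 19th is the nearest-rank p95.
tail -20 "$log" | sort -n | awk 'NR==19 { if ($1 > 5000) exit 1 }'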

Second, no exit ramp for the team. Once the agent looked broken, every engineer made their own call about how to work around it. Some kept retrying. Some stopped using it entirely. Some pasted into a chat window. The shape of that decision belongs in a one-page runbook, not in five separate heads. The runbook fits on an index card: which provider goes next, which task type to suspend, who declares the all-clear.

The agency now runs both. The synthetic check is the short script above. The runbook is one paragraph pinned in the dev channel.

Where to start on Monday

The pattern is not hard. Most of the cost of an incident like Haarlem's is the gap between when the problem started and when someone named it. A five-minute health check that fires a known prompt and alerts when response time crosses a ceiling will get you most of the way there. Add a secondary provider in the same chain. Pin a one-paragraph runbook in the dev channel. Build the rest in the weeks that follow.

When we built the failover layer for the Haarlem agency, the existing IDE plugin gave us no hook to swap providers per request, so we shipped a tiny local proxy that the plugin pointed at and let the proxy make every routing decision. That kind of plumbing is most of what we do inside our AI agents practice.
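
The routing decision inside such a proxy can be very small. The sketch below shows only that decision, with the HTTP plumbing left out. Every host name is a placeholder, and `healthy` is a hypothetical callback standing in for whatever health signal the proxy keeps.

```typescript
interface Upstream { name: string; baseUrl: string; }

// Placeholder endpoints in priority order. None of these are real hosts.
const UPSTREAMS: Upstream[] = [
  { name: "primary",   baseUrl: "https://primary.example.invalid" },
  { name: "secondary", baseUrl: "https://secondary.example.invalid" },
  { name: "local",     baseUrl: "http://127.0.0.1:8081" },
];

// The plugin always calls the proxy on one fixed address. For each request the
// proxy picks the first healthy upstream and forwards the same path to it.
// If nothing looks healthy, it falls back to the last entry in the chain.
function resolveTarget(path: string, healthy: (name: string) => boolean): string {
  let pick = UPSTREAMS[UPSTREAMS.length - 1];
  for (const u of UPSTREAMS) {
    if (healthy(u.name)) { pick = u; break; }
  }
  return pick.baseUrl + path;
}
```

Because the decision is made per request, the plugin never needs to know that the provider behind it changed.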

FAQ

How did the throttle look different from a normal slow response?

Slower time-to-first-token, shorter effective context, and higher refusal rates on tool calls. No HTTP error codes returned and no status-page incident posted, so the IDE looked alive.

What is a synthetic health check for a coding agent?

A small cron job that fires a known prompt at the coding agent every few minutes and alerts when the p95 response time crosses a ceiling. A dozen or so lines of bash are enough to start.

Is a local model fallback realistic on a normal laptop?

For autocomplete and single-line suggestions on Apple Silicon laptops with 32 GB of RAM, yes. For multi-file refactors, no. Treat it as the last tier in the chain, not a daily driver.

Why not just use multiple providers from day one?

Most teams should. The friction is usually the IDE plugin layer, which exposes one endpoint. We often ship a small local proxy that the plugin points at and put the routing logic inside the proxy.
