Case study

LLM provider routing: the day a demo froze for 40 minutes

A live demo froze for forty minutes when our client's rate limits compressed. Here is how we rebuilt the KYC summariser onto a four-route LLM fallback chain.

Jacob Molkenboer · Founder · A Brand New Company · 28 Sept 2025 · 9 min

11:40 on a Tuesday morning. The Rotterdam team is forty minutes into a procurement call with a Belgian retail bank. Three people on the prospect side, two on theirs, and the product is doing what it always does. It pulls a passport scan, two utility bills and a transaction CSV into a structured KYC summary in under nine seconds. Then the screen goes still.

Their CTO refreshes. Nothing. The progress bar sits at 12%. The prospect's compliance officer makes a polite joke about the demo gods. Forty minutes later, the call ends with "send us a recording later." The CTO opens Anthropic's status page and sees the green dot. He opens his own logs and sees the truth: every call to claude-sonnet-4 has been returning 429s since 11:38.

They are a seventeen-person compliance-tech SaaS. Twelve thousand KYC summarisation calls a week. One LLM provider. They had been on direct Anthropic API for fourteen months and had never lost a demo to it. Until now.

The morning of the capacity cliff

You can have an opinion about the cause. Anthropic ships an increasing share of its capacity through cloud partners now, with AWS Bedrock running cross-region inference for Claude models and Google Vertex AI running its own. The math on a direct-API-only architecture has been getting worse since late 2024, which is observable from the tier limits and the response headers, even if no vendor has put it in writing. We will not pretend to know which side of the meeting set the new numbers. What we know is this: a single-vendor LLM dependency, even on the best model on the market, is a procurement risk that compounds every quarter you ignore it.

The next morning, the founders called us. They had two options and they knew it. Negotiate higher direct-API limits and hope, or split the dependency across multiple providers and route. They picked the second. We had three weeks before the next demo cycle.

The router shape we landed on

Four routes. The same Claude family on three of them, a different model on the fourth for hard fallback.

Primary. Anthropic direct API, claude-sonnet-4. The fastest path when it is healthy.
Secondary. AWS Bedrock, claude-sonnet-4 with cross-region inference across eu-central-1 and us-east-1. Independent rate limits and an independent control plane.
Tertiary. Vertex AI in europe-west4, same Claude family, billed through the team's existing Google Workspace contract.
Hard fallback. Self-hosted Qwen3-32B on a single H100 in a Hetzner Helsinki rack, served by vLLM. Lower quality, lower latency floor, never rate-limited.

The router sits between the application and any of the four. It runs a circuit breaker on each route, a five-second budget per attempt, and a strict total chain budget of twenty-two seconds. The shape, in TypeScript:

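// router.ts
// breaker, withTimeout, callRoute, metrics, KycPayload and AllRoutesFailed
// are defined elsewhere in the router and are not shown here.
type Route = "anthropic" | "bedrock" | "vertex" | "qwen3";

const CHAIN: Route[] = ["anthropic", "bedrock", "vertex", "qwen3"];
const ATTEMPT_BUDGET_MS = 5_000; // per-attempt timeout
const CHAIN_BUDGET_MS = 22_000; // hard ceiling for the whole chain

export async function summariseKyc(payload: KycPayload) {
  const errors: Partial<Record<Route, unknown>> = {};
  const started = performance.now();

  for (const route of CHAIN) {
    // Skip routes whose circuit breaker is currently open.
    if (breaker[route].isOpen()) continue;

    // Stop once the chain budget is spent, and cap each attempt at what is
    // left so a late attempt cannot push the total past 22 seconds.
    const remaining = CHAIN_BUDGET_MS - (performance.now() - started);
    if (remaining <= 0) break;
    const attemptBudget = Math.min(ATTEMPT_BUDGET_MS, remaining);

    try {
      const res = await withTimeout(attemptBudget, callRoute(route, payload));
      breaker[route].recordSuccess();
      metrics.record({ route, ok: true, ms: res.elapsed });
      return res.summary;
    } catch (err) {
      // Record the failure and fall through to the next route.
      breaker[route].recordFailure(err);
      errors[route] = err;
      metrics.record({ route, ok: false, err: String(err) });
    }
  }
  // Every route was skipped, failed or timed out.
  throw new AllRoutesFailed(errors);
}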

The whole router is two hundred lines. The hard part was never the router.
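The router listing leans on two helpers it does not show: `withTimeout` and the per-route `breaker`. A minimal sketch of what they might look like follows. The failure threshold of three, the thirty-second cooldown and the injectable clock are assumptions made for illustration, not the team's actual settings.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
// It does not cancel the underlying request; it only stops waiting for it.
function withTimeout<T>(ms: number, work: Promise<T>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Hypothetical breaker: opens after N consecutive failures, then lets a single
// probe attempt through once the cooldown has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number = 3, // assumed value
    cooldownMs: number = 30_000, // assumed value
    now: () => number = () => Date.now(), // injectable clock, handy in tests
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown over: close again, but one more failure re-opens immediately.
      this.openedAt = null;
      this.failures = this.failureThreshold - 1;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(_err?: unknown): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}
```

With one `CircuitBreaker` instance per route, a run of 429s on the primary stops costing five seconds per request: after the threshold is hit, the router skips straight to the next route until the cooldown expires.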

Scoring twelve thousand calls a week

For three weeks we mirror-ran every production call through all four routes in shadow mode. Production used the existing direct-API path. The shadow runner recorded latency, token cost and a deterministic quality score on each parallel response.
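A shadow runner of this kind can be quite small. The sketch below fans one payload out to every route and records one sample per route. The `RouteCaller` signature, the `ShadowSample` shape and the string payload are assumptions for illustration; token cost and quality scoring are left out.

```typescript
type Route = "anthropic" | "bedrock" | "vertex" | "qwen3";

// Hypothetical sample shape: one row per route per mirrored call.
interface ShadowSample {
  route: Route;
  ok: boolean;
  ms: number;
  output?: string;
  error?: string;
}

// Hypothetical provider adapter: takes a route and a payload, returns raw output.
type RouteCaller = (route: Route, payload: string) => Promise<string>;

const SHADOW_ROUTES: Route[] = ["anthropic", "bedrock", "vertex", "qwen3"];

// Calls every route in parallel. It never rejects: a failing route becomes
// a sample with ok = false, so one bad provider cannot hide the others.
async function shadowRun(payload: string, call: RouteCaller): Promise<ShadowSample[]> {
  const attempts = SHADOW_ROUTES.map(async (route): Promise<ShadowSample> => {
    const started = Date.now();
    try {
      const output = await call(route, payload);
      return { route, ok: true, ms: Date.now() - started, output };
    } catch (err) {
      return { route, ok: false, ms: Date.now() - started, error: String(err) };
    }
  });
  return Promise.all(attempts);
}
```

The production request should not wait on this. One way to keep it off the hot path is to start `shadowRun` without awaiting it and write the samples to storage when it settles.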

KYC outputs cannot be graded by a model. A model graded by a model is just a recursion with extra steps. The team built a strict grader: two hundred hand-labelled cases run nightly against each route, plus a structural validator that fails any output missing a required field or carrying a boolean field with a value outside its allowed set. Anything that drifted more than two standard deviations below the production baseline auto-paged the on-call.
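The two-standard-deviation rule is simple to express. A minimal sketch, assuming the baseline is a list of recent nightly accuracy scores for the production route and that the population standard deviation is used (the article does not say which variant the team chose):

```typescript
// Mean and population standard deviation of a non-empty list of scores.
function meanAndStd(xs: number[]): { mean: number; std: number } {
  if (xs.length === 0) throw new Error("baseline must not be empty");
  const mean = xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const variance = xs.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / xs.length;
  return { mean, std: Math.sqrt(variance) };
}

// True when tonight's score sits more than `k` standard deviations
// below the baseline mean. Scores above the baseline never alert.
function hasDrifted(baseline: number[], tonight: number, k: number = 2): boolean {
  const { mean, std } = meanAndStd(baseline);
  return tonight < mean - k * std;
}
```

A paging hook would then call `hasDrifted` once per route after the nightly run and alert on `true`. With a very stable baseline the standard deviation gets small, so a floor on `std` may be worth adding to avoid paging on noise.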

Across roughly thirty-six thousand shadow calls per route, the numbers settled into a stable shape:

Route	p50 latency	p95 latency	boolean accuracy	avg cost / call
Anthropic direct	2.1s	4.7s	99.4%	€0.018
Bedrock (cross-region)	2.6s	5.9s	99.3%	€0.016
Vertex (europe-west4)	3.1s	7.2s	99.1%	€0.017
Qwen3-32B (Helsinki)	1.4s	3.2s	96.8%	€0.004

These are the team's measurements on their specific KYC payload, not universal numbers. Your workload will move them. The shape, however, is consistent with every similar router rebuild we have shipped: cross-region Bedrock is roughly half a second slower than direct API at p50 and meaningfully cheaper at volume. Vertex is the slowest of the three Claude routes and the most variable, mostly because European traffic still gets routed to us-central1 a non-trivial share of the time.

The Qwen3 quality gap

The Helsinki box was the cheapest line in the chain by a factor of four, and the fastest when warm. It was also the worst at the four booleans that matter. On the two hundred hand-graded cases, Qwen3-32B agreed with the human reviewer on 96.8% of boolean fields. Claude on any of the three managed routes hit 99.1% or better. A gap of about two and a half percentage points sounds small. If all twelve thousand weekly calls ran on Qwen3, it would mean roughly three hundred extra wrong decisions a week, even counting only one boolean per call.

So we did not use Qwen3 as a quality-equivalent fallback. We used it as degraded mode. When the chain falls all the way through, the output gets wrapped in a "verify before approval" flag and pushed to the human review queue regardless of what the booleans say. The summary is still useful. It is just no longer trusted.

Warning

If you treat a smaller open-weights model as a drop-in fallback for a frontier model on a structured-output task, you will silently ship a quality regression. Mark degraded outputs explicitly and route them through a different downstream path. The router is the easy part. The contract on the output is the part that breaks.
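One way to make the degraded contract explicit is to carry the flag in the type that leaves the router, so downstream code cannot ignore it. This is a sketch under assumptions: the `RoutedSummary` shape, the `standard_flow` label and the two boolean field names (borrowed from the prompt example in this article) are illustrative, not the team's schema.

```typescript
type Route = "anthropic" | "bedrock" | "vertex" | "qwen3";

interface KycSummary {
  politically_exposed: boolean;
  sanctions_hit: boolean;
  narrative: string;
}

// Hypothetical envelope returned by the router instead of a bare summary.
interface RoutedSummary {
  route: Route;
  summary: KycSummary;
  verifyBeforeApproval: boolean; // true when the hard fallback produced the summary
}

// Routes whose output is useful but not trusted.
const DEGRADED_ROUTES: ReadonlySet<Route> = new Set<Route>(["qwen3"]);

function wrapSummary(route: Route, summary: KycSummary): RoutedSummary {
  return { route, summary, verifyBeforeApproval: DEGRADED_ROUTES.has(route) };
}

type NextStep = "human_review" | "standard_flow";

// Degraded output goes to human review regardless of what the booleans say.
function nextStep(result: RoutedSummary): NextStep {
  return result.verifyBeforeApproval ? "human_review" : "standard_flow";
}
```

The point of the design is that the decision depends only on which route answered, never on the content of the summary, so a confident-looking Qwen3 output still lands in the review queue.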

The recurring Hacker News thread about replacing Claude or GPT with a local model for daily coding is interesting for the same reason. Most people who try it do not come back, and the ones who do usually mean "I switched the auto-complete to a 7B model and kept the heavy reasoning on the API." That is a router, not a replacement. The honest version of "go local" is almost always a tiered chain with a small clear job for the local model.

The prompt portability tax

The expensive part of the rebuild was not the router. It was the prompts.

The summariser was thirteen months of compounding prompt tweaks against Claude. Most of those tweaks were XML-style structural cues that Claude reads beautifully and Qwen3 reads as decoration. Vertex's Claude was nearly identical to direct API. Bedrock's was nearly identical, with two small differences in how stop sequences interact with structured output and how Bedrock surfaces tool-use errors. Qwen3 needed a separate prompt, a separate output parser, and a separate eval threshold.

A concrete example. The original prompt asked Claude for output in this shape:

<kyc_summary>
  <risk_flags>
    <politically_exposed>false</politically_exposed>
    <sanctions_hit>false</sanctions_hit>
  </risk_flags>
  <narrative>...</narrative>
</kyc_summary>

Claude obliged 99.9% of the time. Qwen3 frequently emitted the closing tag inside attributes or dropped it entirely on long outputs. We moved everything to JSON with a strict schema validator on the way out, and accepted that Claude was now about 1% slower at producing it. The router would not work without the schema. The schema would not work without the rewrite. That sequence took longer than the router, the IAM rotation and the Vertex onboarding combined.
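A strict validator for the JSON version might look like the sketch below. It assumes the JSON shape mirrors the XML example above, with only the two flags that example names; the real schema has more fields, and the error strings are illustrative.

```typescript
interface KycSummary {
  risk_flags: { politically_exposed: boolean; sanctions_hit: boolean };
  narrative: string;
}

type ParseResult =
  | { ok: true; value: KycSummary }
  | { ok: false; errors: string[] };

const REQUIRED_FLAGS = ["politically_exposed", "sanctions_hit"] as const;

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parses raw model output and collects every structural violation it finds.
function parseKycSummary(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (!isRecord(data)) return { ok: false, errors: ["top level must be an object"] };

  const errors: string[] = [];
  const flags = data["risk_flags"];
  if (!isRecord(flags)) {
    errors.push("risk_flags: missing or not an object");
  } else {
    for (const name of REQUIRED_FLAGS) {
      // Strict on purpose: the string "false" is rejected, only JSON true/false pass.
      if (typeof flags[name] !== "boolean") errors.push(`risk_flags.${name}: must be true or false`);
    }
  }
  const narrative = data["narrative"];
  if (typeof narrative !== "string" || narrative.length === 0) {
    errors.push("narrative: missing or empty");
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: data as unknown as KycSummary };
}
```

Running the same function on every route's output is what makes the routes interchangeable: a response either satisfies the contract or counts as a failure for that route, whichever model produced it.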

The lesson is unglamorous. Every prompt you write to one provider becomes a private contract with that provider. The more idiosyncratic the prompt, the higher the switching cost.

What changed in production

Three things, in order of how much they mattered.

First, the router became the only path. No application code calls a provider directly. The router enforces the timeout budget, the circuit breaker, the eval contract on outputs, and the cost ledger. Removing the bypass took a week of grep and a stern code review.

Second, the team moved their billing relationships. Direct API stayed. Bedrock got an Enterprise Discount Program quote and a separate AWS account for the workload, which is the cheapest way to negotiate Bedrock rates at this volume. Vertex went on a committed-use discount through the existing Google Workspace contract. The Helsinki box was a one-off, depreciated over twenty-four months. Total infrastructure spend on LLM inference fell by about eighteen percent at the new blended mix.

Third, the demo environment got a kill switch. Before any sales call, the SDR can force the router into "Bedrock primary" mode for the next two hours. The reasoning is brutal but honest. If a tier compression hits Anthropic direct again, the demo will not be the place we find out.
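The kill switch amounts to a time-boxed reordering of the chain. A minimal sketch, assuming the override is stored as a small record with an expiry timestamp; the names `DemoOverride`, `startDemoOverride` and `effectiveChain` are hypothetical.

```typescript
type Route = "anthropic" | "bedrock" | "vertex" | "qwen3";

const DEFAULT_CHAIN: Route[] = ["anthropic", "bedrock", "vertex", "qwen3"];
const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// Hypothetical override record, e.g. kept in a feature-flag store.
interface DemoOverride {
  primary: Route;
  expiresAt: number; // epoch milliseconds
}

function startDemoOverride(primary: Route, now: number): DemoOverride {
  return { primary, expiresAt: now + TWO_HOURS_MS };
}

// Returns the chain the router should walk right now. While an override is
// active its route moves to the front; the others keep their relative order.
function effectiveChain(override: DemoOverride | null, now: number): Route[] {
  if (override === null || now >= override.expiresAt) return [...DEFAULT_CHAIN];
  const primary = override.primary;
  return [primary, ...DEFAULT_CHAIN.filter((route) => route !== primary)];
}
```

Because the override expires on its own, nobody has to remember to switch it off after the call. The direct API stays in the chain as the second route, so the demo still has four routes behind it.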

The numbers six weeks in

Since the rebuild went live on 5 June, the chain has failed all the way through to Qwen3 exactly four times in seventy-three thousand four hundred calls. Three of those were a Vertex region outage that GCP eventually posted to its status page. One was a Bedrock IAM rotation we caused ourselves.

The blended cost per KYC summary dropped from €0.018 to €0.014. The p95 latency went up by 0.4 seconds, because the timeout budget on the first attempt now lives inside a chain, not on a single call. Nobody on the sales team has noticed the latency. Two prospects have asked, in their security review, whether the product has a single-vendor LLM dependency. The answer is now no, and the team can hand over a diagram.

If you are running anything in production against one LLM provider, spend an hour this week wiring a second route behind a feature flag. You will not need it most days. The day you do, you will already have the integration written and the eval runner pointed at it. When we built this for the Rotterdam team, the part that took the longest was not the router or the providers. It was getting the prompts to behave the same way across two model families and four routes. If that is where you are stuck on your own AI agents, we have done it enough times now to know which shortcuts are real.

Key takeaway

Single-vendor LLM dependency is a procurement risk. Build the router before the demo freezes, not after. The prompt portability work is what takes the time.

FAQ

Why not just negotiate higher direct-API limits with Anthropic?

They did, in parallel. The router is the complement, not the replacement. Higher limits help on a normal day. A multi-provider chain helps on the day the limits change without warning.

Did Qwen3 replace Claude for any production calls?

No. Qwen3 only handles cases that fail through all three Claude routes, and those outputs are flagged for mandatory human review regardless of what the model says.

How long did the full rebuild take?

Three weeks for the router, the shadow runner and the rollout. The prompt portability work to make outputs interchangeable across providers took longer than the routing layer itself.

What did blended cost per call do after the rebuild?

It dropped from €0.018 to €0.014. Bedrock EDP rates and Vertex committed-use are slightly cheaper than direct API at twelve thousand weekly calls. The Qwen3 floor is cheaper still, but with only four fall-throughs in six weeks it barely moves the average.

Is a self-hosted model worth it for an LLM-dependent SaaS?

Only as a degraded-mode safety net, not a quality-equivalent fallback. Single H100 economics work if it is the third or fourth route in the chain, not the primary.

ai agents · architecture · integrations · saas · case study · operations
