LLM dependency audit: a six-line checklist for SaaS ops

After a rate-limit cliff hit a client mid-sprint, we wrote down the six-line audit we now run on the LLM dependency chain of every sub-€25M Dutch SaaS client.

Jacob Molkenboer · Founder · A Brand New Company · 9 Aug 2025 · 9 min

Tuesday, 14:40 Amsterdam time. The customer-success agent stops mid-reply to a paying user. Slack lights up. The Bedrock console shows green, but the model behind the curtain is returning 429s on three of four production routes. The ops lead pings the on-call backend engineer. The backend engineer is in a campervan somewhere north of Bergen, no signal until Friday.

This is the moment a vendor-chain audit pays for itself, or doesn't exist and costs you a day of revenue.

We run a six-line audit on every Dutch SaaS client under €25M before we ship an AI feature into their stack. It started as a Miro sticky-note checklist after a rate-limit cliff hit a client mid-sprint last quarter. It has since caught two near-outages, one quietly overpriced primary that no one had revisited in nine months, and one vendor whose "proprietary" model turned out to be a finetune of something we already paid for elsewhere.

Here is what is on the checklist, why each line matters, and the artifact we hand back to the ops lead at the end.

The cliff and why audits became urgent

If you missed it: Amazon's hosted offering of frontier models hit a rate-limit ceiling that a lot of mid-market SaaS apps had built quiet dependencies on. The signs were subtle. P95 latency on /chat routes crept from 1.8s to 4.2s. Structured-output endpoints started returning malformed JSON in about 1 in 80 calls. Some teams patched it route by route, often without telling the rest of the company that the primary model had silently rotated to a smaller sibling.

The deeper lesson is not "have a fallback." Most teams have a fallback. The lesson is that most SaaS teams under €25M cannot tell you, in writing, today, which routes call which model, what the response schema looks like on each side, who is allowed to flip the switch, and what the per-route failure budget actually is. The audit closes that gap.

It does not need a model gateway, a vendor, or a procurement cycle. One folder in your repo and one afternoon.

Line one: provider surface count

We start by counting providers and surfaces. A surface is a distinct way to reach a model. The vendor's own API is one surface. A managed cloud reseller (AWS Bedrock, Google Vertex) is another. A local proxy is a third. An internal gateway with its own queue is a fourth.

Most clients answer "we use two providers" before the audit and "we use five surfaces" after. Five is fine. Five undocumented is not. We deliver a one-page table with provider, surface, model id, region, per-minute token ceiling, and a link to the actual vendor rate-limit docs the number came from. Numbers without source links rot inside a quarter.

The table also surfaces something teams rarely notice. Two of your surfaces almost certainly share a backend. If your primary is one vendor via direct API and your fallback is the same vendor via Bedrock, you have one provider with two front doors, not two providers. That is concentration risk in a costume.
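The shared-backend check can be run mechanically once the surface table exists. The sketch below is a minimal illustration, not our actual tooling: the `Surface` fields, the vendor names, and the model ids are all placeholders, and it assumes you already know which vendor sits behind each surface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    name: str          # how the app reaches the model
    model_vendor: str  # who actually serves the model behind that surface
    model_id: str


def shared_backends(surfaces):
    """Group surfaces by underlying vendor and keep groups of two or more."""
    by_vendor = defaultdict(list)
    for s in surfaces:
        by_vendor[s.model_vendor].append(s.name)
    return {vendor: names for vendor, names in by_vendor.items() if len(names) > 1}


# Hypothetical inventory; replace with the rows from your own surface table.
SURFACES = [
    Surface("direct-api", "vendor-a", "model-large"),
    Surface("bedrock", "vendor-a", "model-large"),
    Surface("local-proxy", "vendor-b", "model-small"),
]

for vendor, names in shared_backends(SURFACES).items():
    print(f"{vendor}: {len(names)} front doors ({', '.join(names)})")
```

Any vendor that shows up with more than one front door counts as a single provider when you assess concentration risk.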

Line two: per-route fallback latency

For every production route that calls a model, we measure two numbers: the median latency on the primary, and the median latency on the fallback path with a forced primary failure.

The forced-failure measurement is the one teams skip. It is also the one that catches the 9-second timeout cascade where the primary times out, the SDK retries twice with exponential backoff, and only then does the fallback kick in. The user has already closed the tab.

We run it with a small harness that injects a 503 on the primary and watches the route end-to-end:

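#!/usr/bin/env bash
# Usage: ./fallback-bench.sh <route> [samples]
# Measures end-to-end latency on a route while the primary is forced to fail.
set -euo pipefail

ROUTE="${1:?usage: $0 <route> [samples]}"
SAMPLES="${2:-50}"

for _ in $(seq 1 "$SAMPLES"); do
  # -w prints only the total request time; the response body is discarded.
  curl -s -o /dev/null -w "%{time_total}\n" \
    -H "X-Force-Provider-Fail: primary" \
    -H "Content-Type: application/json" \
    -d @"fixtures/$ROUTE.json" \
    "https://api.your-app.local/$ROUTE"
done | sort -n | awk '
  # Nearest-rank percentile: round the rank up and never index below 1.
  function rank(p,  r) { r = int(NR * p); if (r < NR * p) r++; return (r < 1 ? 1 : r) }
  { a[NR] = $1 }
  END {
    if (NR == 0) exit 1
    print "p50:", a[rank(0.5)]
    print "p95:", a[rank(0.95)]
  }'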

The X-Force-Provider-Fail header is a development-only switch your gateway needs to honour. If it doesn't, that is finding number one. You cannot test a fallback path you can't trigger on demand.
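The timeout cascade mentioned above is easy to put into numbers before you measure anything. The sketch below is a worked calculation under stated assumptions: the 2.5-second timeout, the two retries, and the 0.5-second initial backoff are illustrative values chosen to land on nine seconds, not settings taken from any client or SDK.

```python
def time_to_fallback(timeout_s, retries, base_backoff_s, factor=2.0):
    """Worst-case seconds spent on the primary before the fallback is tried.

    Assumes every attempt runs to its full timeout and the SDK sleeps
    base_backoff_s * factor**n between attempt n and attempt n + 1.
    """
    attempts = retries + 1
    waiting = sum(base_backoff_s * factor**n for n in range(retries))
    return attempts * timeout_s + waiting


# Assumed values: 2.5 s timeout, 2 retries, 0.5 s initial backoff.
# 3 attempts * 2.5 s = 7.5 s, plus 0.5 s + 1.0 s of backoff.
print(time_to_fallback(2.5, 2, 0.5))  # 9.0
```

If the result is larger than the route can tolerate, the usual levers are a shorter primary timeout or fewer SDK retries before the fallback takes over.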

Line three: schema drift between providers

If your route returns structured output, the audit asks one question. When you swap the primary, does the response bind to the same Pydantic, Zod, or JSON Schema definition, or does the integration layer translate?

The honest answer for most stacks is "translates, but the translation is a function nobody has tested in six months." Schema drift between providers is real and quiet. OpenAI's Structured Outputs offers a strict mode that constrains the response to the JSON Schema you supply. Anthropic's tool use comes close, but handles optional fields and empty arrays slightly differently. Google's function-calling shape differs again on enums.

If your code path assumes one and gets another, you get silent corruption, not a 500. Silent corruption is worse than a 500. A 500 wakes someone up. A drifted field returns "Unknown" for a customer's country and no one notices for three weeks.

The audit artifact is a fixture file per route, with the canonical expected output and a small script that diffs each provider's actual response against the schema:

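"""Validate one provider response (stdin) against a route's JSON Schema.

Usage: python validate-drift.py schema/chat.live.schema.json < response.json
Requires the third-party jsonschema package.
"""
import json
import sys
from pathlib import Path

import jsonschema

schema = json.loads(Path(sys.argv[1]).read_text())

try:
    sample = json.loads(sys.stdin.read())
    jsonschema.validate(sample, schema)
    print("ok")
except json.JSONDecodeError as e:
    # Malformed JSON is drift too; report it instead of crashing.
    print(f"drift: response is not valid JSON ({e.msg})")
    sys.exit(1)
except jsonschema.ValidationError as e:
    # e.path is the location of the offending value inside the response.
    print(f"drift: {e.message}")
    print(f"at: {'.'.join(str(p) for p in e.path) or '<root>'}")
    sys.exit(1)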

Pipe the live response from each provider into the script. The drift surfaces in under a minute per route. We run it monthly against five frozen fixtures per provider, which doubles as a cheap "did the model rotate without telling us" detector.

A real example from one of those checks two months ago. The same product-extraction prompt sent to the primary returned a JSON array of three line items. The same prompt against the fallback returned a single string that concatenated the three items with semicolons. The downstream consumer expected a list, so the second response silently produced an order with one fictitious product whose name was three real products glued together. A human noticed before our monitoring did, which is the wrong order of operations. The schema check would have caught it in five seconds.
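For illustration, here is a stdlib-only sketch of the kind of shape check that flags the array-versus-string case. The `line_items` field name and the product names are invented for the example; the real payload had a different shape, and a full JSON Schema check covers more than this.

```python
import json


def check_line_items(raw_response):
    """Return a list of problems; an empty list means the shape is as expected."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return [f"top level is {type(payload).__name__}, expected object"]
    items = payload.get("line_items")
    if not isinstance(items, list):
        return [f"line_items is {type(items).__name__}, expected list"]
    return [
        f"line_items[{i}] is {type(x).__name__}, expected str"
        for i, x in enumerate(items)
        if not isinstance(x, str)
    ]


# Hypothetical responses mirroring the incident: a list versus one glued string.
primary = '{"line_items": ["pallet wrap", "label roll", "tape gun"]}'
fallback = '{"line_items": "pallet wrap; label roll; tape gun"}'

print(check_line_items(primary))   # []
print(check_line_items(fallback))  # ['line_items is str, expected list']
```

The point of the design is that the check fails loudly on a type mismatch instead of letting the consumer coerce a string into a one-item order.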

Line four: swap permissions

This is the line that surprises clients the most. We ask: if your primary model rate-limits at 03:00 on a Saturday, who in your team can swap it without a backend engineer touching the repo?

The right answer is "the on-call ops lead, via a config change that takes effect in under 90 seconds." The common answer is "nobody, we'd need to ship a PR, run CI, and redeploy." The middle answer is "our CTO, but only she has the key, and she is in a meeting."

The fix is mechanical. A model-routing config lives in a key-value store, gated by a feature flag, with a small admin UI that shows the current primary, the fallback chain, and a one-click rotate-to-fallback button. Two days of work and one sober rule: ops can flip it, engineers own the canary checks afterward, and every flip writes a row to an audit log.

The audit log itself turns into a side-effect worth having. Every flip writes the timestamp, the ops lead's identity, before-and-after model ids, and the short reason text the operator typed. Three months in, you have a vendor-reliability dataset of your own. The next procurement conversation stops being about marketing claims and starts being about your own logs, which is a much shorter argument.
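A minimal sketch of the flip itself, assuming the routing config is a plain dictionary standing in for the key-value store and the audit log is an in-memory list. The function name, the config layout, and the field names are hypothetical; a real version would write to your store and to an append-only table.

```python
from datetime import datetime, timezone


def rotate_to_fallback(config, route, operator, reason, audit_log):
    """Promote the next model in the route's chain and record who did it and why."""
    chain = config[route]["chain"]
    current = config[route]["active"]
    position = chain.index(current)
    if position + 1 >= len(chain):
        raise ValueError(f"{route}: no fallback left after {current}")
    new = chain[position + 1]
    config[route]["active"] = new
    # One row per flip: timestamp, identity, before and after, typed reason.
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "route": route,
        "before": current,
        "after": new,
        "reason": reason,
    })
    return new


# Hypothetical config and operator, for illustration only.
config = {"chat.live": {"chain": ["model-a", "model-b"], "active": "model-a"}}
audit_log = []
rotate_to_fallback(config, "chat.live", "ops-lead@example.com", "429s on primary", audit_log)
print(config["chat.live"]["active"], len(audit_log))
```

Keeping the log write inside the same function as the flip means there is no code path that changes the active model without leaving a row behind.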

Takeaway

The vendor risk is not the model going down. The vendor risk is that swapping it is locked behind one person who isn't in the building.

Line five: failure-mode budget per route

Not every route deserves the same fallback machinery. A weekly digest email tolerates a six-minute retry loop. A live customer-success chat does not. An invoice classifier sitting in front of an accounting workflow tolerates a short rule-based fallback while the model recovers.

The audit asks the product owner, in writing, what the failure budget is per route, in seconds and in user-facing words. The artifact is a small YAML file that lives next to the routing config:

routes:
  chat.live:
    max_total_latency_s: 6
    fallback_chain: [primary, secondary, cached_canned_reply]
    user_message_on_full_failure: |
      We're catching up on a busy moment. Your message is saved
      and a teammate will pick it up within the hour.
  digest.weekly:
    max_total_latency_s: 360
    fallback_chain: [primary, secondary, queue_for_human]
    user_message_on_full_failure: null
  invoice.classify:
    max_total_latency_s: 12
    fallback_chain: [primary, secondary, rule_based]
    user_message_on_full_failure: null

Now the ops lead has a config they can read. The product owner has a budget they signed off on. The engineer has a contract. When something breaks at 03:00, the question is no longer "what is acceptable here" but "did we stay inside the budget the product owner agreed to."
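To show how a router might consume that budget, here is a small sketch that walks a fallback chain until one step succeeds or the time runs out. The `run_with_budget` function and the handler signature are assumptions made for the example, and each handler is trusted to respect the remaining time it is given.

```python
import time


def run_with_budget(chain, handlers, budget_s, clock=time.monotonic):
    """Try each step of the chain in order until one succeeds or the budget runs out.

    handlers maps a step name to a callable that takes the remaining seconds.
    Returns (step_name, result), or (None, None) on full failure.
    """
    deadline = clock() + budget_s
    for step in chain:
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            return step, handlers[step](remaining)
        except Exception:
            continue  # a real router would log the failure here
    return None, None


def failing(remaining_s):
    raise TimeoutError("primary rate-limited")


# Hypothetical handlers standing in for real provider calls.
handlers = {
    "primary": failing,
    "secondary": lambda remaining_s: "reply from secondary",
    "cached_canned_reply": lambda remaining_s: "canned reply",
}
chain = ["primary", "secondary", "cached_canned_reply"]
print(run_with_budget(chain, handlers, budget_s=6))
# ('secondary', 'reply from secondary')
```

On a `(None, None)` result the caller would show the route's `user_message_on_full_failure`, if one is set.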

Line six: vendor disclosure check

The last line came after a story that hit Hacker News this month: a city government's "homegrown" municipal LLM that, on closer inspection, looked a great deal like a published merge of an existing open-weights checkpoint. We do not assume malice. We do assume vendors describe themselves optimistically, and that the gap between marketing copy and the actual model is where the surprises live.

The check is two questions. One: does the vendor publish, in writing, which model family powers each endpoint, and does it match what they market? Two: have they had a model rotation in the last six months that they did not announce?

You can answer the second question by hashing five fixture outputs per route and re-running them monthly. If the hash changes and there was no release note, you have a quiet rotation. That is not always bad. It is always a thing to know.
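A sketch of the hashing step, with two caveats stated up front. It only means something if the fixture calls are as reproducible as the provider allows (fixed prompt, temperature 0, pinned parameters), and even then some providers are not bit-for-bit deterministic, so a changed hash is a prompt to look rather than proof of a rotation. The function names and the fixture name are hypothetical.

```python
import hashlib
import json


def fingerprint(output):
    """Stable SHA-256 of a parsed JSON output (key order and whitespace ignored)."""
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_fixtures(baseline, current_outputs):
    """Return the fixture names whose output hash differs from the stored baseline."""
    return sorted(
        name
        for name, output in current_outputs.items()
        if baseline.get(name) != fingerprint(output)
    )


# Hypothetical fixture output from last month, and a drifted one from this month.
last_month = {"invoice-01": {"category": "freight", "vat": 21}}
baseline = {name: fingerprint(out) for name, out in last_month.items()}
this_month = {"invoice-01": {"category": "shipping", "vat": 21}}

print(changed_fixtures(baseline, last_month))  # []
print(changed_fixtures(baseline, this_month))  # ['invoice-01']
```

Serialising with sorted keys before hashing keeps harmless differences in key order or whitespace from showing up as false rotations.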

If a vendor calls itself "proprietary" and its benchmarks match a known open-weights checkpoint to four decimal places, that is not a deal-breaker. It is a thing to know before you sign, and especially before you build a route that assumes a particular reasoning style or refusal pattern.

Running the audit in an afternoon

The full artifact for a sub-€25M SaaS fits in one repo, one folder:

llm-audit/
  providers.md           # line one: surface table
  fallback-bench.sh      # line two: latency harness
  fixtures/
    chat.live.json
    digest.weekly.json
    invoice.classify.json
  schema/
    chat.live.schema.json
    digest.weekly.schema.json
    invoice.classify.schema.json
  validate-drift.py      # line three: drift checker
  routing.yaml           # lines four and five: who + budget
  vendor-disclosure.md   # line six: vendor claims vs reality
  run.sh                 # ties them together

An ops lead can read every file. An engineer can extend every script. The audit is not a Notion page that ages into untruth. It is code that runs on Monday and the Monday after that.

We re-run the full pass once per quarter and a smaller pass monthly: the drift checker against frozen fixtures, the surface table against the live config. Anything more frequent stops being an audit and starts being monitoring, which is a different folder.

One client, one cliff, one weekend

When we built the inbox-triage agent for a Dutch logistics SaaS around €8M ARR, the thing that hit mid-sprint was exactly the rate-limit cliff, and their fallback config lived in a YAML file only their lead engineer could deploy. We wrote the first version of this checklist that weekend, shipped the routing config to a key-value store on Monday, and trained their ops lead to swap models from a small admin panel by Wednesday. The audit format above is the version we now reuse for every AI agent build.

The total work, including our time and two afternoons of their ops lead's attention, cost them roughly what a single internal all-hands day would have. The next time the primary cliffed, three weeks later, the swap took 70 seconds, the ops lead handled it without paging the engineer in the campervan, and the only Slack noise was one line in the routing audit log. That is the shape of a working audit. It is boring on the day it earns its keep.

The five-minute version you can do today, before anything else: open a new doc, list every production route in your app that calls a model, and write the provider and model id next to each. If there is any cell you cannot fill in, that is your line one finding, and the rest of the audit follows from there.

FAQ

How often should we re-run the full audit?

Once per quarter on a clean afternoon, plus any time you add a new model or route. Run the drift checker monthly against frozen fixtures. More often than that and it stops being an audit and starts being monitoring.

Do we need a model gateway to do this?

Not strictly. A YAML config in a key-value store works fine under ten routes. Above that, a gateway pays for itself in observability and the per-route fail-injection switch you need for line two.

What if our vendor will not disclose which model powers each endpoint?

That is a finding, not always a deal-breaker. Combine it with no SLA and a single billing surface and you have concentration risk you did not price into the contract.

Can an ops lead really swap a primary model safely?

Yes, if the swap is a config change gated by a feature flag with a 60-second auto-rollback and a canary check. Engineers own the canary; ops own the lever and the audit log.
