
Local coding LLMs: how a vet chain killed its €1,840 bill

The pager went off at 02:14. The RTX 5090 in the broom cupboard had OOMed in the middle of a PR review, and the on-call dev was on a train to Utrecht with eight pending reviews behind her.

Jacob Molkenboer · Founder · A Brand New Company · 17 Oct 2025 · 8 min


The pager at 02:14

At 02:14 on a Tuesday in March, the on-call pager woke Lieke. The single RTX 5090 sitting in a converted broom cupboard at her employer's Nijmegen HQ had run out of memory (OOMed) partway through a PR review, with eight reviews pending. She is one of three developers maintaining the Animana integrations for a chain of nine veterinary clinics across Gelderland. The runbook said: kill the container, wait for the watchdog, re-queue. It took four minutes. She filed an incident note from a train platform and went back to sleep.

Six months earlier those eight pending reviews would have been served by Claude. Nobody would have woken up. What follows is the arithmetic behind the clinic chain's decision that waking up was the better trade.

A €1,840 line item

The clinic chain has 22 employees. Three are developers. The dev team owns a roughly 140,000-line PHP and TypeScript codebase that wraps the Animana practice-management system from IDEXX with a custom booking front end, a SOAP-to-REST shim for lab results, an invoicing path, and a stack of cron jobs that reconcile no-shows against the calendar. The Animana side is eleven years old. Some of the cron jobs are older.

From October 2025 they had been running an LLM as a code-review assistant on every PR plus an interactive sidecar in their editors. Their monthly Claude spend stabilised at €1,840. For a chain whose total IT budget runs about €11,000 per month including loaded developer salary, that was roughly 17% of the budget. The bookkeeper raised it twice. The dev lead promised to look at it.

"Looking at it" turned into a procurement question the afternoon the front page of Hacker News surfaced an Ask HN about replacing Claude with local models for daily coding, and one of the developers forwarded it on the team Slack.

The hardware shortlist

The shortlist came down to three options. A second-hand pair of RTX 3090s on a workstation board. A single RTX 5090 with 32 GB of GDDR7 on a Threadripper box. Or a rack-mount H100 in a Frankfurt colo. The H100 was twice the cost of a year of Claude. It was deleted from the spreadsheet within ten minutes.

The 3090 pair was cheap and would have run a 30B model at decent throughput. It was also two consumer GPUs in a metal case in a building that occasionally loses A/C in summer. They went with the 5090 for two reasons: a single-card thermal envelope they could trust the broom cupboard to handle, and enough VRAM headroom to fit a 30B model at 4-bit with comfortable KV cache for the team's longest review prompts.
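The VRAM reasoning can be checked on the back of an envelope. The sketch below is a rough estimate, not the team's sizing sheet: the 30B parameter count, the 4-bit width and the 32 GB card come from the text, and the 0.92 budget fraction is borrowed from the service configuration later in the article. The layer count, KV-head count, head dimension, 16-bit cache precision and 2 GiB runtime overhead are illustrative assumptions that should be replaced with the values from the model's own config.

```python
GIB = 1024 ** 3


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights (ignores scales and zero points)."""
    return params * bits_per_weight / 8


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of keys plus values cached for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_capacity(vram_gib: float, budget_fraction: float, weights: float,
                      per_token: int, overhead_gib: float = 2.0) -> int:
    """Tokens of KV cache that fit once weights and runtime overhead are paid for."""
    budget = vram_gib * GIB * budget_fraction
    free = budget - weights - overhead_gib * GIB
    return max(0, int(free // per_token))


# From the article: 30B parameters, 4-bit weights, 32 GB card, 0.92 budget.
weights = weight_bytes(30e9, 4)

# ASSUMED architecture numbers, for illustration only.
per_token = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128)

capacity = kv_token_capacity(32, 0.92, weights, per_token)
print(f"weights ~ {weights / GIB:.1f} GiB, KV cache room for ~ {capacity:,} tokens")
```

Under these assumptions the cache has room for several times the 32,768-token limit the team configured, which is what "comfortable" means in practice. The same arithmetic on a 24 GB card leaves far less slack.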

Total spend, including the box, the GPU, a small UPS, and a one-off afternoon of electrician work to get a dedicated 16 A circuit into the cupboard, was just under €4,900. Payback against the Claude line was 2.7 months on paper.
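The payback figure is simple division, sketched here so the assumptions are visible. The €4,900 and €1,840 figures are from the text; the second calculation subtracts the €120 monthly API budget the team kept, which is mentioned later in the article. Electricity is left out because the article gives no figure for it.

```python
def payback_months(one_off_cost: float, monthly_saving: float) -> float:
    """Months until a one-off purchase is covered by a recurring saving."""
    if monthly_saving <= 0:
        raise ValueError("monthly_saving must be positive")
    return one_off_cost / monthly_saving


# "On paper": the whole Claude line disappears.
gross = payback_months(4900, 1840)

# Net of the API budget that was retained for hard problems.
net = payback_months(4900, 1840 - 120)

print(f"gross payback: {gross:.1f} months, net payback: {net:.1f} months")
```

The gross number matches the 2.7 months quoted above; the net number is slightly longer but still under a quarter.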

Why Qwen3-Coder 30B, and not something larger

They auditioned three open-weights candidates against a sample of fifty real PRs from the previous quarter. The candidates were a 70B general-purpose model at 4-bit, a 30B code-specialised model, and a 14B model at full precision. The metric was: would a senior engineer accept the review as useful, with no false flags on Animana-specific patterns the model could not have seen during training.

The 70B model was the most articulate. It was also the slowest and missed the highest number of Animana idioms. The 14B model was fast and confidently wrong about SOAP envelopes. The 30B code-specialised model, in this case Qwen3-Coder 30B, sat in the place that mattered: it read the legacy PHP without complaining, it caught the kind of off-by-one cron mistakes the team actually makes, and it ran fast enough on the 5090 that the round trip from "open PR" to "first comment" stayed under fifteen seconds for typical diffs.

It is not the best coding model in the world. Across all of our work we still reach for the frontier APIs on the harder architecture conversations. For the eight to twenty PR reviews this team ships per day against a stable legacy surface, the 30B was good enough that they stopped noticing the difference inside a fortnight.

Per-PR review latency, measured honestly

We tracked four numbers per PR for ninety days before and after the switch.

Time from PR open to first model comment, cold cache.
Time from PR open to first model comment, warm cache.
Number of comments the human reviewer agreed with.
Number of comments the human reviewer marked as noise.

The headline result is dull, which is the point. Cold-cache p50 latency went from 6.1 s on the API to 11.4 s on local. Warm cache went from 3.2 s to 4.8 s. Useful comments per PR moved from 4.7 to 4.1. Noise comments per PR moved from 0.6 to 0.9. The dev lead's read of the numbers was that the local model added about five seconds of staring-at-the-tab time per cold-cache PR and roughly one extra "no, that is fine" click every three PRs. In exchange they got their €1,840 back.
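For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of how the four numbers could be rolled up from per-PR records. The record shape and field names are hypothetical, and the sample values in the usage lines are invented for illustration; they are not the team's data.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReviewRecord:
    first_comment_s: float   # time from PR open to first model comment
    warm_cache: bool         # whether the cache was warm for this PR
    useful_comments: int     # comments the human reviewer agreed with
    noise_comments: int      # comments the human reviewer marked as noise


def p50(values: list) -> Optional[float]:
    """Median latency, or None when there are no samples."""
    return median(values) if values else None


def summarise(records: Iterable[ReviewRecord]) -> dict:
    """Roll per-PR records up into the four headline numbers."""
    rows = list(records)
    if not rows:
        raise ValueError("no review records to summarise")
    return {
        "cold_p50_s": p50([r.first_comment_s for r in rows if not r.warm_cache]),
        "warm_p50_s": p50([r.first_comment_s for r in rows if r.warm_cache]),
        "useful_per_pr": mean(r.useful_comments for r in rows),
        "noise_per_pr": mean(r.noise_comments for r in rows),
    }


# Invented sample data, for illustration only.
sample = [
    ReviewRecord(10.0, False, 4, 1),
    ReviewRecord(12.0, False, 5, 0),
    ReviewRecord(14.0, False, 3, 1),
    ReviewRecord(4.0, True, 4, 1),
    ReviewRecord(5.0, True, 4, 2),
]
summary = summarise(sample)
```

Running `summarise` once on the hosted-API period and once on the local period gives the before-and-after pairs directly.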

Takeaway

The right local-model question is not "is it as smart as the frontier?" It is "is it good enough that the team stops noticing within a fortnight?" Measure both numbers; trust the second one.

The 02:00 wedge, and the runbook that came out of it

In the first six weeks the box wedged three times. Twice on a long PR that overran the configured context, once on what looked like a real driver hang. The fix in all three cases was the same: restart the inference container under systemd, wait for the watchdog to mark it healthy, re-queue the pending reviews. The CI pipeline was already idempotent so re-queueing was free.

The runbook fits on one printed page. The unit file is the load-bearing part.

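# /etc/systemd/system/qwen-coder.service
[Unit]
Description=Qwen3-Coder 30B local inference (vLLM)
# After= only orders startup; Wants= is what pulls network-online.target in.
Wants=network-online.target
After=network-online.target

[Service]
# --max-model-len caps prompt plus completion tokens per request.
# --gpu-memory-utilization is the fraction of VRAM vLLM may claim.
# --enable-prefix-caching reuses KV cache across prompts with a shared prefix.
ExecStart=/usr/local/bin/vllm serve Qwen/Qwen3-Coder-30B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --served-model-name qwen3-coder
Restart=on-failure
RestartSec=15
# WatchdogSec expects WATCHDOG=1 keep-alives via sd_notify within each window;
# if none arrive, systemd kills the service and Restart=on-failure brings it back.
WatchdogSec=120
TimeoutStopSec=30
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target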

Two details worth stealing. The 120-second watchdog catches a stuck NVIDIA driver before the developers do. Capping vLLM at 92% of VRAM leaves the remaining 8% as headroom for allocations outside its own budget, so the longest prompts no longer flirt with OOM. They learned the second one the hard way after Lieke's 02:14.
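One caveat on the watchdog: systemd's WatchdogSec= only works if something inside the service sends a WATCHDOG=1 keep-alive over the notification socket within each window. The article does not say how the team does this, so the following is a sketch under assumptions rather than their implementation: a small helper that polls a health endpoint and pings systemd only while the server answers. The URL in the usage note assumes vLLM's OpenAI-compatible server on its default port with its /health route, and the function names are hypothetical. For systemd to accept pings from a helper rather than from the main process, the unit would also need NotifyAccess=all.

```python
import http.client
import os
import socket
import time
import urllib.request
from typing import Optional


def sd_notify(message: str, socket_path: Optional[str] = None) -> bool:
    """Send one datagram to systemd's notification socket; False if it was not sent."""
    path = socket_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not running under systemd with notifications enabled
    # A leading "@" denotes a Linux abstract-namespace socket.
    address = b"\0" + path[1:].encode() if path.startswith("@") else path
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
            sock.sendto(message.encode(), address)
    except OSError:
        return False
    return True


def is_healthy(url: str, timeout_s: float = 5.0) -> bool:
    """True only if the health endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException, ValueError):
        return False


def run_watchdog(health_url: str, interval_s: float = 30.0,
                 iterations: Optional[int] = None) -> int:
    """Ping systemd after each successful health check; returns pings sent.

    With iterations=None the loop runs until the process is stopped.
    """
    pings = 0
    done = 0
    while iterations is None or done < iterations:
        if is_healthy(health_url) and sd_notify("WATCHDOG=1"):
            pings += 1
        done += 1
        time.sleep(interval_s)
    return pings
```

A wrapper script would call `run_watchdog("http://127.0.0.1:8000/health")`. The 30-second interval is an assumption chosen to sit well inside the 120-second window, so one slow health check does not trigger a restart by itself.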

The AVG audit log that survives a visitatie

Veterinary clinics in the Netherlands are inspected by the KNMvD on a roughly annual cadence, and any practice that touches client data has to be able to explain its data flows to the Autoriteit Persoonsgegevens under the AVG, the Dutch name for the GDPR. The clinic chain's data protection officer had been politely sceptical of the Claude line item from day one. Sending patient-adjacent debugging context to an American API was not technically a breach, but it was the kind of thing the DPO did not want to have to explain in a visitatie, the periodic inspection visit.

Going local turned that conversation around. Every inference request the box answers is logged to an append-only file on a separate disk, rotated daily, and shipped to cold storage every Sunday. Each entry is one JSON line, shown here pretty-printed for readability.

{
  "ts": "2026-03-11T02:14:07Z",
  "request_id": "pr-2841-review-3",
  "actor": "ci-bot@clinic.local",
  "model": "qwen3-coder-30b",
  "input_sha256": "7af9c1...e22b",
  "patient_data_detected": false,
  "redactions_applied": [],
  "tokens_in": 4812,
  "tokens_out": 1166,
  "latency_ms": 8740,
  "exit": "ok"
}

Two things matter here. The hash of the input means they can prove, after the fact, what was sent without storing the prompt itself: anyone who can reproduce a prompt can recompute its SHA-256 and compare it with the logged value. The patient_data_detected flag is the output of a small regex-and-NER pass that runs before the prompt hits the model; any positive hit gets the matching span redacted and the redaction recorded. During the May 2026 visitatie the inspector asked for one week of logs picked at random. The DPO exported the JSONL, ran a one-line jq over it to show the redaction rate, and the conversation moved on to whether the chemical storage cabinet was locked.
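A minimal sketch of how an entry like the one above could be produced and later summarised. Several things here are assumptions: the patient-ID pattern is hypothetical (the real Animana formats are not given), the NER pass is omitted, the hash is taken over the redacted prompt on the reasoning that this is what the model actually receives, and the labels stored in redactions_applied are invented. The field names follow the sample entry.

```python
import hashlib
import json
import re
from datetime import datetime, timezone
from pathlib import Path

# HYPOTHETICAL patient-ID format, e.g. "PAT-0012345".
PATIENT_ID = re.compile(r"\bPAT-\d{7}\b")


def redact(prompt: str) -> tuple:
    """Mask patient IDs; returns the cleaned prompt and a list of redaction labels."""
    labels = []

    def _mask(match: re.Match) -> str:
        labels.append("patient_id")
        return "[REDACTED]"

    return PATIENT_ID.sub(_mask, prompt), labels


def audit_entry(request_id: str, actor: str, model: str, prompt: str,
                tokens_in: int, tokens_out: int, latency_ms: int,
                exit_status: str = "ok") -> tuple:
    """Build one log entry; returns (prompt to send to the model, entry dict)."""
    clean, labels = redact(prompt)
    entry = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "request_id": request_id,
        "actor": actor,
        "model": model,
        # Hash of what is sent, so the prompt itself never has to be stored.
        "input_sha256": hashlib.sha256(clean.encode("utf-8")).hexdigest(),
        "patient_data_detected": bool(labels),
        "redactions_applied": labels,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "exit": exit_status,
    }
    return clean, entry


def append_entry(log_path: Path, entry: dict) -> None:
    """Append the entry as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def redaction_rate(log_path: Path) -> float:
    """Share of logged requests in which patient data was detected."""
    total = hits = 0
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                total += 1
                hits += bool(json.loads(line)["patient_data_detected"])
    return hits / total if total else 0.0
```

Opening the file in append mode does not make it tamper-proof by itself; the append-only guarantee described above has to come from the filesystem or storage layer.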

What still goes to the API

The team kept a Claude budget of €120 per month for what they call "the hard ones". Architecture conversations that need a model that has read more of the world. The occasional one-off when the local box is being patched. Two of the three developers also pay personally for a chat subscription, which is none of the company's business.

The line item is no longer raised in finance meetings. The dev lead has stopped getting asked about it, which is the truest measure of whether a piece of internal tooling is working.

When we built the local-inference rollout for this clinic chain, the thing we kept hitting was the redaction step: it had to be fast enough not to double per-PR latency but thorough enough to satisfy the DPO. We solved it with a small NER model on the CPU plus a regex pass for Animana-specific patient ID formats, which is the kind of legacy-aware glue most of our AI-agent work boils down to.

The smallest thing you can do this week

Before you cost out a GPU, run a fortnight of telemetry. Log every prompt your team sends to a hosted API, hash it, count the tokens, mark the ones that touched data your DPO would care about. At the end of the fortnight you will know two things you do not currently know: how much of your spend is on work a 30B local model could plausibly do, and how much is on the genuinely hard problems that justify the API line item. The decision becomes obvious from the histogram.
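A fortnight of telemetry along these lines needs very little code. The sketch below is one possible shape, not a prescribed tool: the token count is a crude characters-divided-by-four estimate (swap in a real tokenizer for billing-grade numbers), the bucket edges are arbitrary, and the sensitive flag is assumed to come from whatever check your DPO signs off on.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# Illustrative bucket edges in tokens; tune them to your own prompt sizes.
BUCKETS = (1_000, 4_000, 16_000)


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming about four characters per token."""
    return max(1, len(text) // 4)


@dataclass(frozen=True)
class PromptRecord:
    sha256: str      # hash only, so the prompt text is never stored
    tokens: int
    sensitive: bool  # touched data the DPO would care about


def record_prompt(text: str, sensitive: bool) -> PromptRecord:
    """Reduce a prompt to the three facts the fortnight of telemetry needs."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return PromptRecord(digest, estimate_tokens(text), sensitive)


def bucket_label(tokens: int) -> str:
    """Name of the histogram bucket a token count falls into."""
    for edge in BUCKETS:
        if tokens < edge:
            return f"<{edge}"
    return f">={BUCKETS[-1]}"


def histogram(records: Iterable[PromptRecord]) -> Counter:
    """Count prompts per (size bucket, sensitive) pair."""
    return Counter((bucket_label(r.tokens), r.sensitive) for r in records)
```

Short, non-sensitive prompts are the obvious candidates for a local model; the long or sensitive tail is where the decision needs a closer look.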

Key takeaway

A €1,840 monthly Claude line became a €4,900 one-off box and three wedges in six weeks. Measure honestly before you switch; keep an API budget for the hard problems.

FAQ

How much VRAM do you actually need to run Qwen3-Coder 30B?

At 4-bit quantisation a 30B coder model fits in about 20 GB with room for prompt cache. The RTX 5090's 32 GB lets you keep long contexts warm without tuning.

Is sending source code to a hosted LLM a GDPR/AVG problem?

Not by itself. It becomes a problem when prompts contain personal data and your DPO cannot show what was sent. A hashed, append-only audit log changes the conversation.

What does the box do when nobody is reviewing PRs?

Nothing. It idles. Power draw at idle is around 30 W on the 5090. No queue, no inference, no log entries, no heat in the cupboard.

Did the team miss Claude after the switch?

Not for routine PR review, within two weeks. They did for deeper architecture work, which is why they kept a small monthly API budget for the hard problems.

What broke first, and how was it fixed?

The box OOMed on a long PR that overran the configured context. The fix was to lower GPU memory utilisation to 92% and let the systemd watchdog restart the container.

