
Local, hosted, or Sonnet: scoring a coding LLM stack

A scoring method we use to decide between a local Qwen3-Coder 30B box, a hosted DeepSeek V3 endpoint, and a Claude Sonnet 4.5 subscription, for a 14-engineer Dutch agency.

Jacob Molkenboer · Founder · A Brand New Company · 27 Oct 2025 · 9 min

It is a Tuesday afternoon in late May. A sub-€15M Amsterdam agency you advise has had three things land on the same engineering-lead desk inside an hour. A CVE on the open-weights model behind their internal code-review bot. An AVG audit request from a client whose codebase touches health data. And a quarterly forecast where the Claude subscription line, billed per seat across 14 engineers, has just overtaken the AWS bill.

This post is the method we use, at our studio, to decide between a local Qwen3-Coder 30B box, a hosted DeepSeek V3 endpoint, and a Claude Sonnet 4.5 subscription per developer. It is opinionated. It is shaped by the size of the agency: small enough that there is no dedicated ML ops person, large enough that per-PR cost actually shows up on the P&L.

The three boxes on the table

Strip away the marketing and most agencies are choosing between three architectures.

Local box. Qwen3-Coder 30B (or whichever open-weights model has the best HumanEval score that week) on a workstation in the office, or a rack box at a Dutch colo. Inference runs on your GPUs. Prompts never leave your network.

Hosted endpoint. The same kind of open-weights model, but rented from a third party that hosts it. DeepSeek V3 on Fireworks, Together, or Groq. You pay per token. The vendor handles the GPUs and the infrastructure patches. Your prompts and completions cross their network.

Subscription. Claude Sonnet 4.5 (or whichever frontier model is current) on a per-seat plan or a per-API-key quota. The model is closed weights. Same data-flow shape as hosted, but the provider designed and trained the model and runs the only endpoint that serves it.

The right answer is rarely obvious in the room because the numbers and the risks live in three different mental models. Cost is per-token or per-seat. AVG is yes-or-no with footnotes. CVE response is a vibe. We force all three into one scorecard so the conversation can actually finish.

Cost per PR is the only honest unit

Tokens are not a budget unit. Seats are not a budget unit. The smallest piece of work that produces value at an agency is a merged pull request, so that is what we cost.

Our reference shop: 14 engineers, roughly 3 merged PRs per developer per week. Call it 170 PRs per week, 680 per month. Each PR runs through a review pass, an "explain this diff" pass, and occasional doc generation. We measure the actual token volume across a representative two-week window. For a typical front-and-back agency mix we see somewhere around 80k to 150k tokens of context per PR, with 5k to 20k of output.

At mid-2026 list prices, the math comes out roughly as follows. We round, but the orders of magnitude are real.

Per-PR cost (≈120k in / 10k out, list price, mid-2026)

Claude Sonnet 4.5 API     $0.36 + $0.15        ≈ $0.51 / PR
Claude Sonnet 4.5 sub     $20 seat / 49 PRs    ≈ $0.41 / PR
DeepSeek V3 hosted        $0.032 + $0.011      ≈ $0.043 / PR
Qwen3-Coder 30B local     amortised box        ≈ $0.02 to $0.06 / PR

For 680 PRs a month, that lands around $345 on Sonnet API, $280 on the seat plan, $30 on DeepSeek hosted, and somewhere between $15 and $40 on the local box once you amortise the capex over three years and add power.

So: hosted DeepSeek and local Qwen are 8 to 15 times cheaper than the subscription path. That is real money. But for a 14-dev agency the absolute difference is roughly €3,000 to €4,000 per year. Sonnet is not the expensive line. The expensive line is the engineer-hour the model burns when it loops.
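The per-PR arithmetic above as a small script, added here as an example rather than the studio's own tooling, so the same audit can be run on another team's PR count and token volume. It uses the post's reference shop (14 engineers, 680 PRs a month, 120k in / 10k out with a measured band of 80k to 150k in and 5k to 20k out); the per-million-token rates are back-derived from the table's line items and are this example's assumption, not a vendor quote.

```python
from dataclasses import dataclass

# Reference shop from the post: 14 engineers, ~680 merged PRs a month,
# ~120k input / 10k output tokens per PR, measured band 80k-150k in, 5k-20k out.
ENGINEERS = 14
PRS_PER_MONTH = 680
REF_IN, REF_OUT = 120_000, 10_000
BAND_IN = (80_000, 150_000)
BAND_OUT = (5_000, 20_000)


@dataclass(frozen=True)
class TokenPriced:
    """A per-token option. Rates are USD per million tokens, back-derived
    here from the post's line items (an assumption, not a vendor quote)."""
    name: str
    usd_per_m_in: float
    usd_per_m_out: float

    def per_pr(self, tokens_in: int, tokens_out: int) -> float:
        """List-price cost of one PR's worth of prompts and completions."""
        return (tokens_in * self.usd_per_m_in
                + tokens_out * self.usd_per_m_out) / 1_000_000

    def band(self) -> tuple:
        """Low and high per-PR cost across the measured token band; the
        spread shows how much context size, not just model choice, drives the bill."""
        return (self.per_pr(BAND_IN[0], BAND_OUT[0]),
                self.per_pr(BAND_IN[1], BAND_OUT[1]))


SONNET_API = TokenPriced("Claude Sonnet 4.5 API", 3.00, 15.00)
DEEPSEEK_HOSTED = TokenPriced("DeepSeek V3 hosted", 0.27, 1.10)


def seat_plan_per_pr(usd_per_seat: float, engineers: int = ENGINEERS,
                     prs_per_month: int = PRS_PER_MONTH) -> float:
    """A flat per-seat plan costed per PR: seat price over each engineer's
    monthly merged PRs (680 / 14 is ~49 PRs, matching the post's table)."""
    return usd_per_seat / (prs_per_month / engineers)


def monthly(per_pr_usd: float, prs_per_month: int = PRS_PER_MONTH) -> float:
    """Scale a per-PR figure up to the monthly bill."""
    return per_pr_usd * prs_per_month


if __name__ == "__main__":
    for opt in (SONNET_API, DEEPSEEK_HOSTED):
        ref, (lo, hi) = opt.per_pr(REF_IN, REF_OUT), opt.band()
        print(f"{opt.name:24s} ${ref:.3f}/PR (band ${lo:.3f}..${hi:.3f}) "
              f"${monthly(ref):.0f}/month")
    seat = seat_plan_per_pr(20.0)
    print(f"{'Claude Sonnet 4.5 seat':24s} ${seat:.3f}/PR ${monthly(seat):.0f}/month")
```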

AVG defensibility beyond a vendor checkbox

Every vendor will tell you they are "GDPR-compliant." What an Autoriteit Persoonsgegevens auditor actually asks is more specific. Where does the prompt physically land? Who is on the subprocessor list? What does the DPA say about training on customer data? Can you show me the logs?

For coding agents the awkward truth is that prompts are not "just code." They contain test fixtures with real customer names, database schemas that reveal business logic, environment variable references, and increasingly, the contents of Jira tickets stuffed into context by an IDE plugin. That is processing of personal data, and the agency is the processor.

The three boxes score very differently here.

  • Local box. Trivial. The data does not leave the network. The DPIA is short and boring. There is no subprocessor list to chase. Your AVG defence fits on a one-page diagram.
  • Hosted DeepSeek on a Western host. Workable. Fireworks, Together, Groq and similar will sign a DPA, name their subprocessors, and let you pin region. The model creator (DeepSeek, based in Hangzhou) is one layer back. Whether that defence holds depends on the client. Health-tech and government work, often no. Marketing-tech, usually yes.
  • Subscription. Anthropic publishes a DPA and a subprocessor list and lets you pick an EU-resident processing region. Their trust portal is one of the easier ones to send to a procurement team. The audit story is good. The "but you are sending our code to a US company" story is not different in shape from your existing GitHub footprint.

The shortcut: if your client is in health, legal, or defence, AVG pressure pushes you to local. Outside those verticals, the subscription path usually passes procurement with a DPA addendum and a region pin.

Patch ownership when a CVE lands on Tuesday

This is the question nobody asks until the first CVE actually lands. And one will land. Open-weights supply chains have already produced compromised checkpoints on Hugging Face, prompt-injection escalations through tool calls, and at least one case where a model card pointed at a different SHA than the served weights.

Patch ownership scales differently across the three options.

Subscription. Anthropic patches. You may not even hear about it. The endpoint changes, the model card gets a new version note, and your code-review bot is on the patched model by lunchtime. The trade-off: you do not get to say "no, I am not upgrading today, I am in the middle of a release."

Hosted endpoint. The hosting provider patches the infrastructure. Model-side patches depend on whoever publishes the open-weights checkpoint. The lag can be days. For a CVE on a serialisation library, you are usually fine. For a model-weights-level issue, you are riding the open-source clock.

Local box. You patch. On a Tuesday afternoon. The exact play is: your ops engineer reads the CVE record, decides whether it applies to your inference stack, pulls the new weights or library version, runs a regression suite against last week's PRs, and rolls forward. For a 14-engineer agency without a dedicated ML platform person, this is a real cost that does not show up in the per-PR math.

Warning

If your headcount math says "the most senior developer will handle CVE response," your real cost for the local option is whatever that developer would otherwise have shipped. It is rarely under €200 per hour of opportunity cost, and CVEs do not arrive on a schedule that respects your sprint.

The scorecard we hand the client

Six dimensions. Score each option 0 to 3. Multiply by the weight that this specific client actually cares about. The weighting matters more than the scoring.

Dimension                   Weight    Local   Hosted   Subscription
─────────────────────────────────────────────────────────────────────
Per-PR cost                    1         3        3         1
AVG defensibility              3         3        1         2
CVE patch ownership            2         0        2         3
p95 latency under load         1         2        2         3
Context-window ceiling         1         2        2         3
Vendor lock-in risk            1         3        2         1
─────────────────────────────────────────────────────────────────────
Weighted total                          19       16        19

The example weights above are typical for a Dutch agency serving regulated clients. The dimensions to argue about in the room are AVG defensibility and CVE ownership, because those are the axes where the cheap options are actually expensive.
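The scorecard as code, added here as an example and not the studio's own tool: the totals are computed from the table's scores and weights rather than hand-entered, and a weight profile can be swapped per client, which is the point the section makes about weighting mattering more than scoring. The two extra profiles (a health client funding its own model on-call, a marketing client with nobody spare to own CVE response) are this example's assumptions, not figures from the post.

```python
# Scores 0..3 per dimension, in DIMENSIONS order, taken from the post's table.
DIMENSIONS = ("per_pr_cost", "avg_defensibility", "cve_patch_ownership",
              "p95_latency", "context_ceiling", "lockin_risk")
SCORES = {
    "local":        (3, 3, 0, 2, 2, 3),
    "hosted":       (3, 1, 2, 2, 2, 2),
    "subscription": (1, 2, 3, 3, 3, 1),
}

# The post's example weights for a Dutch agency serving regulated clients.
REGULATED = {"per_pr_cost": 1, "avg_defensibility": 3, "cve_patch_ownership": 2,
             "p95_latency": 1, "context_ceiling": 1, "lockin_risk": 1}
# Hypothetical profiles (this example's assumption, not from the post).
HEALTH_WITH_ONCALL = {**REGULATED, "avg_defensibility": 3, "cve_patch_ownership": 1}
MARKETING_NO_OPS = {**REGULATED, "avg_defensibility": 1, "cve_patch_ownership": 3}


def weighted_total(option: str, weights: dict) -> int:
    """Sum of score x weight over all six dimensions. Refuses an incomplete
    weight profile so a forgotten axis cannot silently score as zero."""
    missing = [d for d in DIMENSIONS if d not in weights]
    if missing:
        raise KeyError(f"no weight for {missing}")
    return sum(s * weights[d] for d, s in zip(DIMENSIONS, SCORES[option]))


def rank(weights: dict) -> list:
    """Options ordered best first as (name, total); ties broken by name."""
    totals = {o: weighted_total(o, weights) for o in SCORES}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def gap_by_dimension(a: str, b: str, weights: dict) -> list:
    """Per-dimension contribution to total(a) - total(b), largest swing first:
    the axes that actually decide between two options under these weights."""
    rows = [(d, (sa - sb) * weights[d])
            for d, sa, sb in zip(DIMENSIONS, SCORES[a], SCORES[b])]
    return sorted(rows, key=lambda r: -abs(r[1]))


if __name__ == "__main__":
    for label, w in (("regulated", REGULATED), ("health+oncall", HEALTH_WITH_ONCALL),
                     ("marketing, no ops", MARKETING_NO_OPS)):
        print(f"{label:18s} {rank(w)}")
    print("local vs subscription, regulated weights:",
          gap_by_dimension("local", "subscription", REGULATED)[:2])
```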

The pattern that usually wins

After roughly thirty of these conversations in the last twelve months, the distribution of outcomes at a sub-€15M Dutch agency looks something like this.

Around 60% land on the subscription plan, almost always Claude Sonnet 4.5, because procurement is willing to sign once they see the DPA and because the Tuesday CVE risk is not theirs. The per-PR cost difference is real but it is a rounding error on the senior-engineer salary line.

Around 25% pick a local Qwen3-Coder 30B box, always for a client in an AVG-sensitive vertical that is willing to fund the capex. The honest version of this option includes one named engineer on a "model on-call" rota and a documented patch runbook.

The remaining 15% end up hybrid. A local model handles the "read the codebase, give me embeddings, answer about the diff" path where context never leaves the network. A subscription model handles the "write me a migration plan from Drupal 7" path where the heavy reasoning is worth the seat cost. Routing happens in your control plane, not in the model.

The recurring "has anyone replaced Claude with a local model for daily coding" thread on Hacker News is worth reading, but read the comments more carefully than the upvoted top answers. The people who switched fully to local and stayed there almost all have a dedicated ML engineer, or are a solo developer with a Mac Studio at home. Neither of those matches the agency on the desk in front of you.

What this is not

This is not a benchmark post. We do not care which model scores half a percent higher on SWE-bench this month. Benchmark numbers move week to week, and your client's code is not in the benchmark anyway. The method above is durable because the cost, compliance, and ops axes do not move every week.

When we built a code-review agent for a Dutch marketing agency earlier this year, the thing we did not expect was how much of the decision came down to the patch-ownership question. We ended up shipping them a Claude Sonnet subscription pattern with a thin local retrieval layer for the AVG-sensitive client codebases, the same shape we use for our other AI agents, so the heavy reasoning stays on the subscription and the regulated context never leaves their VPC.

If you have not run the per-PR cost math for your own team, that is the five-minute audit worth doing today. Pull last month's merged PR count, divide your current model bill by that number, and look at the result honestly. Most of the time the number that comes out makes the rest of the decision obvious.

Key takeaway

Score the coding LLM choice on per-PR cost, AVG defensibility, and who patches the model on a Tuesday. The model in the column header changes. The columns do not.

FAQ

Is Qwen3-Coder 30B as good as Claude Sonnet 4.5 for code review?

Close on isolated benchmarks. Not close on long-context reasoning across a large codebase. For a 14-engineer agency, the gap shows up most in cross-file refactor suggestions and in following non-obvious instructions in a system prompt.

Can we sign a DPA with DeepSeek directly for AVG defence?

Not with DeepSeek, the model creator, in any practical way. You sign a DPA with whoever hosts the open-weights model, like Fireworks or Together, and inherit their subprocessor list. Whether that is acceptable to your client is a procurement question.

How much does the local-box option actually cost up front?

A workstation with an RTX 4090 plus enough memory to run Qwen3-Coder 30B comfortably lands around €5k to €7k. A rack box at a Dutch colo with two A6000s or an L40S is €12k to €20k. Power runs €300 to €600 per year at Dutch rates.

Does a hybrid local-plus-subscription setup actually work in practice?

Yes, and it is the pattern about 15% of our clients end up on. The local model handles retrieval and embedding over private code. The subscription handles heavy reasoning. The router lives in your control plane, not inside either model.
