AI compute for Dutch SaaS: a three-path scoring method
It's a Tuesday morning in Utrecht. Your inference bill was €41k last month. HN says self-host. Your CFO says cut. Here is the method we use to decide.

It's a Tuesday morning in Utrecht. You run a B2B SaaS doing €4.2M ARR, your inference bill last month was €41,000, and someone on your engineering Slack just dropped a link to the HN thread about the AI Compute Extensions specification. The thread is 380 comments deep. Half of it is people saying per-token billing is a trap and you should self-host a Llama derivative on a box in Falkenstein. The other half is people saying that's a fantasy and you should sign an enterprise contract with Anthropic before quotas tighten. Your CTO wants your read by Thursday.
This decision has three paths and one weekend in December that breaks two of them. Here is the method we use with clients to score it.
The three paths, stripped of marketing
Pay-per-token API. You keep doing what you're doing — call OpenAI, Anthropic, Google, or some mix, get billed for every input and output token, and your CFO sees one line item that scales linearly with usage.
Enterprise contract. You commit to twelve or twenty-four months of a minimum spend with a frontier-model vendor in exchange for committed throughput, a zero-data-retention rider, EU region routing, and a named technical account manager who answers email on Saturday.
Self-host. You rent a GPU box, pull down a Llama, Mistral, or DeepSeek checkpoint, run it behind vLLM or TGI, and own the whole stack from kernel to inference endpoint. The cloud-egress line on your invoice goes to zero. The on-call rotation gets one more pageable service.
Each one looks attractive on a whiteboard and brutal in production. The method below is how we decide between them for a specific revenue and call-volume profile.
The cost math at 12.4M calls per month
Round numbers, because real ones depend on your prompt shape, but useful as a frame. At 12.4M monthly calls with an average of 600 input tokens and 250 output tokens — typical for an inbox-triage or invoice-classifier workload — you're moving roughly 7.4B input and 3.1B output tokens a month.
At Sonnet-class list price (~$3/MTok in, $15/MTok out at the time of writing), that's about $22k plus $46k, so roughly $68k per month, or €63k. A Haiku-class model drops that by 4–5x for many of those calls. GPT-4o-mini and Gemini Flash sit in a similar neighbourhood.
An enterprise commit typically lands you 15–35% off list, plus throughput guarantees and an EU rider. Call it €45k per month with a 12-month minimum: €540k committed.
Self-hosting a 70B-class open-weights model on a single GPU box big enough to serve your peak QPS — think a Hetzner GEX or a Lambda Labs node with an H100 or two — runs you somewhere between €1,800 and €4,500 per month in hardware rental, plus your ops time. If your model fits, this is genuinely cheaper. If it doesn't, you're buying a second box and a load balancer and the spreadsheet starts to look different.
The self-hosting cost model only holds while you're at steady state. A 3x traffic spike on a Tuesday afternoon costs you nothing on the API; on your own box it costs you a 504.
Where the bytes actually sit
The AVG question — the Dutch implementation of the GDPR, supervised by the Autoriteit Persoonsgegevens — isn't "is the vendor compliant." Every serious vendor has a DPA. The question your DPO has to answer in writing is: where does the prompt physically execute, who can subpoena it, and what's the contractual recourse if a sub-processor changes.
Per-token API without a zero-retention rider: prompts and completions are processed in whichever region your account is routed to, may be retained for abuse-monitoring on a short window, and you have a standard DPA. Defensible for most B2B workloads, hard to defend if you're processing health data, legal correspondence, or payroll detail.
Enterprise contract with EU routing and zero retention: bytes execute in Frankfurt or Dublin, nothing is retained, sub-processors are pinned in the contract. Anthropic's commercial terms and the equivalents from OpenAI and Google all support this shape; you have to ask for it, not opt in via a checkbox.
Self-host on Hetzner in Falkenstein or Helsinki: bytes never leave the EU, no sub-processor exists below you, you are the controller and the processor. Maximally defensible. Also maximally yours.
The Christmas weekend test
This is the test that kills most self-hosting plans for sub-€16M companies. We run it as a thought experiment with every client considering the move.
It's December 27th. DeepSeek ships a new open-weights model that benchmarks eight points above your current checkpoint on the evals that matter to your customers. A competitor's CTO has already swapped it in by Sunday evening. Three of your largest customers are mid-pilot. Your ML engineer is in Brabant with her family and has WhatsApp on Do Not Disturb.
Who pulls the weights, rewrites the prompts, re-runs your eval set, redeploys the inference server, and watches the error rate on the dashboard?
On the API path: nobody. You read the release notes in January, run an eval in week 2, switch in week 3. Your on-call rotation never gets paged.
On the enterprise contract: your TAM emails you the migration guide. You file a ticket. Same January timeline, lower stakes.
On self-host: it's you. Or it's the same one engineer who set the box up and who you can't afford to lose. The opportunity cost of not upgrading is real — that's how your competitor took the deal — but the cost of upgrading is one engineer's holiday weekend, every time the open-source frontier moves.
A scoring sheet you can copy
Here is the matrix we walk clients through. Score each path 0–5 on each axis, weight by what your business actually cares about, sum.
axes:
  cost_at_12_4M_calls:    # €/month at projected volume
    weight: 3
  cost_at_3x_spike:       # what happens when traffic triples on a Tuesday
    weight: 2
  avg_defensibility:      # would your DPO sign this without notes?
    weight: 4
  time_to_model_swap:     # hours from new SOTA released to deployed
    weight: 3
  ops_burden:             # incremental on-call surface area
    weight: 4
  vendor_lock:            # cost in weeks to swap providers
    weight: 2
  capability_ceiling:     # can you serve your hardest workload?
    weight: 3
paths:
  per_token_api:
    cost_at_12_4M_calls: 2
    cost_at_3x_spike: 5
    avg_defensibility: 3
    time_to_model_swap: 5
    ops_burden: 5
    vendor_lock: 4
    capability_ceiling: 5
  enterprise_contract:
    cost_at_12_4M_calls: 3
    cost_at_3x_spike: 4
    avg_defensibility: 5
    time_to_model_swap: 4
    ops_burden: 5
    vendor_lock: 2
    capability_ceiling: 5
  self_host_llama:
    cost_at_12_4M_calls: 5
    cost_at_3x_spike: 1
    avg_defensibility: 5
    time_to_model_swap: 2
    ops_burden: 1
    vendor_lock: 5
    capability_ceiling: 3
For the founder profile in the question — sub-€16M ARR Dutch B2B SaaS, 12.4M calls per month, AVG-sensitive but not health- or legal-tier — that matrix typically lands enterprise contract first, per-token API second, self-host third. Reverse the weights for a regulated-data shop and self-host moves up. Reverse them for a company at €1M ARR still finding product-market fit and per-token API wins outright.
The ACE spec changes one variable, not the answer
The reason the HN ACE-spec thread matters: if compute-quota standards become real and portable, the vendor-lock score on the enterprise path goes from a 2 to a 3 or 4. You can move a committed spend to a different provider mid-contract without rewriting your inference layer. That is meaningful. It doesn't make self-hosting cheaper, faster to patch, or easier to run. It just makes the enterprise contract less scary to sign. Don't let one HN thread move you off a decision the cost math and your DPO already agree on.
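The scoring sheet and the ACE what-if above, worked through as an added Python example (not the authors' own tooling): it computes the weighted totals, re-scores the enterprise path with vendor_lock at 3 or 4, and lists the single-point weight changes that erase the winner's lead. The helper names (`lead`, `with_score`, `fragile_axes`) are illustrative; the weights and scores are copied from the sheet.

```python
# Weights and 0-5 scores copied from the scoring sheet on this page.
WEIGHTS = {
    "cost_at_12_4M_calls": 3, "cost_at_3x_spike": 2, "avg_defensibility": 4,
    "time_to_model_swap": 3, "ops_burden": 4, "vendor_lock": 2,
    "capability_ceiling": 3,
}
AXES = list(WEIGHTS)
PATHS = {
    "per_token_api":       dict(zip(AXES, (2, 5, 3, 5, 5, 4, 5))),
    "enterprise_contract": dict(zip(AXES, (3, 4, 5, 4, 5, 2, 5))),
    "self_host_llama":     dict(zip(AXES, (5, 1, 5, 2, 1, 5, 3))),
}

def totals(weights=WEIGHTS, paths=PATHS):
    """Weighted sum of the 0-5 scores for each path."""
    return {p: sum(weights[a] * s[a] for a in weights) for p, s in paths.items()}

def lead(weights=WEIGHTS, paths=PATHS):
    """(winner, margin over runner-up); margin 0 means a tie for first place."""
    t = totals(weights, paths)
    order = sorted(t, key=lambda p: (-t[p], p))  # ties broken by name
    return order[0], t[order[0]] - t[order[1]]

def with_score(path, axis, score, paths=PATHS):
    """Copy of the matrix with one cell changed, for what-if analysis."""
    new = {p: dict(s) for p, s in paths.items()}
    new[path][axis] = score
    return new

def fragile_axes(step=1):
    """Weight nudges of +/-step that tie or flip the current winner."""
    base, _ = lead()
    out = []
    for axis in AXES:
        for delta in (-step, step):
            w = dict(WEIGHTS)
            w[axis] += delta
            winner, margin = lead(w)
            if winner != base or margin == 0:
                out.append((axis, delta, winner, margin))
    return out

if __name__ == "__main__":
    print("totals:", totals(), "lead:", lead())
    for v in (3, 4):  # ACE what-if: enterprise vendor_lock moves from 2 to 3 or 4
        print("vendor_lock =", v, lead(paths=with_score("enterprise_contract", "vendor_lock", v)))
    print("fragile at +/-1:", fragile_axes(1))
```

On the page's weights the enterprise path leads per-token API by two points, so lowering the avg_defensibility weight or raising the vendor_lock weight by one is enough to tie them; the ACE what-if widens that lead instead.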
When self-hosting is actually the right call
Three signals say self-host regardless of company size:
- Your workload is narrow and stable. You are doing one thing — classify, extract, summarise — and the prompt hasn't changed in six months. Fine-tune once, serve forever.
- Your data is genuinely radioactive. Patient records, raw legal discovery, source code from clients under aggressive NDA. The AVG conversation isn't "defensible," it's "non-negotiable."
- You already have an ML engineer who wants to do this and has done it before. Not "could learn it." Has done it. Twice.
Two of those three: maybe. One: don't.
Our default for this profile
For a sub-€16M Dutch B2B SaaS at 12.4M calls a month today, we default our clients to a 12-month enterprise contract with EU routing and zero-retention, with a small fine-tuned open-source model on a Hetzner box for the 5–10% of traffic where you've identified a narrow, stable, AVG-sensitive workload. You get the cost advantage where it actually pays and the ops cover where the holiday risk lives.
When we built the inbox-triage agent for a Rotterdam logistics client running about 9M calls a month, we landed on exactly that hybrid: a tuned 8B model on our own hardware for the narrow extractor pulling reference numbers off scanned bills of lading, everything else on the frontier API under an enterprise rider. The longer write-up of how that split holds up in production is in our AI agents work.
The five-minute version: pull last month's invoice, divide it by your monthly call count to get your real per-call cost, then ask your DPO one question — "would you sign this DPA today without notes?" If the answer is yes and the per-call cost is below €0.006, you don't have a decision to make this quarter.
Key takeaway
For sub-€16M B2B SaaS at 12.4M monthly calls, an EU-routed enterprise contract with one tuned open-source model on the side beats both extremes.
FAQ
Which path actually has EU data residency?
An enterprise contract with explicit EU routing and a zero-retention rider does. Self-hosting on a Hetzner box in Falkenstein or Helsinki also does. Standard per-token API does not by default.
What does self-hosting Llama on Hetzner cost?
Roughly €1,800 to €4,500 per month in hardware rental for a single GPU box capable of serving a 70B-class model at modest QPS, before your engineer's time and on-call burden.
Is per-token API ever the long-term answer?
Yes — for variable traffic, fast-moving prompts, and workloads where the standard DPA is already defensible for your data class. It's the right default below about €1M ARR.
Does the ACE specification change the decision?
Only at the margin. It softens vendor lock on enterprise contracts. It does not make self-hosting easier to patch or cheaper to run at small scale.
When should we revisit the decision?
On any 3x change in call volume, a new data class entering the system, or a contract renewal window. Otherwise once every twelve months is enough.