Strategy

On-premise AI agents: why Dutch law firms can't use OpenAI

A Dutch advocaat can't upload a client file to ChatGPT. AVG, NOvA rules and Schrems II make sure of that. Here is the on-prem stack we ship instead, and what it actually costs.

Jacob Molkenboer· Founder · A Brand New Company· 7 Mar 2024· 6 min

Leather dossier with wax seal, brass bell, ribboned document, green bookmark on ivory desk by canal window.

The partner at a 40-lawyer firm in Amsterdam pulled up the OpenAI privacy policy on a Tuesday morning. He read the data-processing addendum twice, scrolled to the sub-processor list, and closed the tab. The next email he sent was to us: "Can we run something local? Same idea, our hardware, no American backend."

That conversation has happened five times in the last year. The firms vary in size, the practice areas vary, but the answer to "why not just use ChatGPT Enterprise" is always the same shape. It is not paranoia. It is the AVG, the NOvA rules, and Schrems II doing their job.

The constraint, in plain terms

A Dutch advocaat handling a confidential matter sits under three overlapping obligations. The Algemene Verordening Gegevensbescherming (the EU's GDPR, implemented locally) restricts where personal data may be processed. The Nederlandse Orde van Advocaten imposes a beroepsgeheim, the professional confidentiality that survives every contractual disclaimer a vendor offers. And since Schrems II in 2020, transferring personal data to US-based processors requires supplementary measures that, in most readings, encryption in transit cannot satisfy on its own when the processor is subject to FISA 702.

OpenAI processes data in the US. So does Anthropic. So does Google when you use Vertex in most configurations. The Dutch DPA (Autoriteit Persoonsgegevens) has been more vocal than most national regulators about what that means in practice: a law firm cannot upload a deposition. It cannot upload a contract under negotiation. It cannot use a US-hosted assistant to summarise a case file involving a named natural person. The blast radius is the entire core workflow.

What on-premise actually means in 2026

The term is doing some work. We use it to cover three concrete deployments, in descending order of how often we ship them.

The first is a dedicated EU instance. The model runs on a machine in Falkenstein or Helsinki, on Hetzner or Scaleway. No US sub-processor sits in the request path. Data leaves the firm, but it stays in the EU on hardware leased by the firm. We sign a DPA. The firm sleeps.

The second is a colocated rack. Real on-prem. The firm already has a server room because they kept a Citrix farm running through three IT directors. We add two boxes with H100s or, more often, with L40S cards because the model fits in 48GB and the budget is not infinite. The model runs in the building. The model does not phone home.

The third is air-gapped. No internet at all. A workstation, a model on local SSD, and a chat interface that runs on the firm's intranet. We have built this twice. Both times for matters involving state security review. It is a different product from the other two and we price it like one.

The model stack

We do not pick OpenAI-compatible models because OpenAI compatibility is the goal. We pick them because vLLM speaks that protocol and our orchestration code does not need to care which weights are loaded. In practice, the shortlist for Dutch legal work is short.

Mistral Large 2 sits at the top for any task touching Dutch language reasoning. It was trained by a French company, it handles Dutch case-law citations without choking on the abbreviations, and the licence permits commercial use under Mistral's terms. Mistral is also, conveniently, in the EU.

Llama 3.3 70B is the workhorse for English-language summarisation and structured extraction. Open weights, runs on a single 8xL40S box at acceptable latency, and the licence is permissive enough for commercial deployment under 700 million MAU.

Qwen 2.5 72B is the dark horse. Strong on long-context tasks, weak on Dutch idioms, but we have used it inside RAG pipelines where retrieval does the heavy lifting and the model just has to be a good reader.

For embeddings, we standardise on BGE-M3. It reads Dutch and English well, runs on CPU at acceptable throughput for nightly indexing jobs, and the licence is MIT.

# Minimal vLLM serve config we ship for a 70B model on 4xL40S
model: meta-llama/Llama-3.3-70B-Instruct
tensor-parallel-size: 4
max-model-len: 32768
gpu-memory-utilization: 0.92
dtype: bfloat16
served-model-name: firm-llm
port: 8000

That is the entire serving layer. The rest of the work is in retrieval, evaluation, and the boring parts of running a service.

The trade-offs nobody mentions in the keynote

The benchmarks that make it into the press almost always test the frontier API models. GPT-4o, Claude Opus, Gemini Ultra. A Dutch firm cannot run those. So the relevant question is not "can AI outperform a senior associate" but "can the model we are allowed to run outperform the alternative, which is no AI at all."

The honest answer is: yes, but the gap is narrower than the keynote suggests. Llama 3.3 70B on a well-tuned RAG pipeline lands within reach of GPT-4o on extraction and summarisation, lags meaningfully on complex multi-hop reasoning, and beats nothing on creative drafting. We tell partners this in the first meeting. The ones who keep meeting with us are the ones who needed extraction and summarisation in the first place.

Warning

An "EU-hosted" badge from a US vendor is not the same thing as EU sovereignty. If the parent company is subject to the US CLOUD Act, the data is reachable regardless of where the server sits. Read the corporate structure, not the marketing page.

The second trade-off is operational. A managed API is one vendor invoice and a status page. A self-hosted stack is GPU drivers, vLLM upgrades, model swaps when a new release lands, and the inevitable Tuesday morning where the firm's IT manager calls because nvidia-smi is returning ERR. We bake monitoring and a runbook into every delivery, and we still get the call.

The third is cost. People assume on-prem is cheaper. At the small end it is not. Below roughly two million tokens per day across all users, the API beats the rack on pure economics. Above it, the rack wins. Below it, the firm is paying for sovereignty, not savings, and we make sure they know that before they sign.

The five-minute audit

If you run technology at a firm and you are wondering whether any of this applies to you, here is the test. Open your AI vendor list. For each one, find the sub-processor page. Note the jurisdiction. Then open a representative case file in your DMS and ask whether you would be comfortable emailing that file to a colleague in that jurisdiction. If the answer is no, the AI vendor is not a fit for that workload. That is the whole audit. It takes five minutes and tells you everything.

When we built the first on-prem agent for a mid-size Dutch litigation practice last year, the thing we ran into was not the model, the GPUs, or the RAG pipeline. It was the firm's document management system, which had been running on a Drupal 7 backend nobody had touched since 2017. We ended up doing the AI agent build and a quiet legacy migration in the same engagement. That is usually how these projects go.

Key takeaway

On-prem AI for Dutch law firms is not a paranoia tax. It is the AVG, the NOvA beroepsgeheim and Schrems II doing their job. Build accordingly.

FAQ

Can a Dutch law firm legally use ChatGPT for client work?

Not for personal data covered by the AVG and the NOvA beroepsgeheim. The Schrems II ruling makes US-based processing of confidential client data hard to justify without supplementary measures most firms cannot meet.

What does on-premise mean for an AI deployment in 2026?

Three flavours: a dedicated EU cloud instance, a colocated rack in the firm's own server room, or a fully air-gapped workstation. The right pick depends on the matters being handled, not on the budget.

Is self-hosted AI cheaper than the OpenAI API?

Below roughly two million tokens per day across the firm, the API wins on cost. Above that, the rack wins. Firms run on-prem for sovereignty and confidentiality, not for savings.

Which open model handles Dutch legal language best?

Mistral Large 2 leads on Dutch case-law citations and idioms. Llama 3.3 70B is stronger for English summarisation. Both run cleanly under vLLM on EU GPU infrastructure.

ai agentsstrategyarchitecturesecurityoperationsbusiness

Building something?

Start a project