Ollama vs vLLM vs llama.cpp: on-prem inference for AVG (GDPR) work
A Dutch municipal-services vendor in Enschede runs 2,600 burger-intake (citizen-intake) requests a week on one RTX 6000 Ada. We compared three inference stacks on cost, logs and on-call hours.

It is 07:42 on a Sunday in Enschede. The RTX 6000 Ada in the half-rack at a 23-person municipal-services vendor has stopped responding. nvidia-smi hangs. Tomorrow at 08:30 the burger-intake agent has to start triaging the Monday batch: roughly 520 applications (aanvragen) between 08:30 and 12:00, then another 280 in the afternoon. The on-call engineer is cycling back from her mother's house in Lonneker. The CISO is reading the kernel log on her phone over coffee.
This is the moment your choice of inference stack actually matters. Not the throughput benchmark on a Mistral-Nemo blog post. Not the tokens-per-second curve at batch size 64. What matters is: can you reboot, who reads the logs, and what does the Functionaris Gegevensbescherming (the data protection officer) say when a citizen files a Woo-verzoek, a request under the Wet open overheid, next month asking what the model saw.
We have spent the last eight months running this exact shape of workload for a Dutch vendor that sits between municipalities and citizens. Two thousand six hundred weekly burger-intake conversations. One GPU box. A small ops team. Below is what we learned about the three inference stacks that everyone in this space ends up shortlisting: Ollama, vLLM, and LM Studio paired with llama.cpp.
The workload, in numbers
The agent answers structured intake questions: parking-permit changes, WMO follow-up, waste-collection schedules, and the long tail of "I got this letter, what does it mean." Average conversation: eight turns, around 1,400 input tokens, around 600 output tokens. That works out to about 5.2 million tokens a week if you count both directions, or 270 million per year before holidays and storms.
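The volume arithmetic is easy to check. A quick sketch using only the figures quoted above; the constant names are ours, chosen for illustration:

```python
# Weekly and yearly token volume from the figures quoted in the text.
CONVERSATIONS_PER_WEEK = 2_600
INPUT_TOKENS_PER_CONVERSATION = 1_400   # average, all turns combined
OUTPUT_TOKENS_PER_CONVERSATION = 600    # average, all turns combined
WEEKS_PER_YEAR = 52

tokens_per_conversation = (
    INPUT_TOKENS_PER_CONVERSATION + OUTPUT_TOKENS_PER_CONVERSATION
)
weekly_tokens = CONVERSATIONS_PER_WEEK * tokens_per_conversation
yearly_tokens = weekly_tokens * WEEKS_PER_YEAR

print(f"{weekly_tokens:,} tokens per week")
print(f"{yearly_tokens:,} tokens per year")
```

That gives 5.2 million tokens a week and 270.4 million a year, which matches the rounded figures above.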
One RTX 6000 Ada, 48 GB VRAM, sits in a rented half-rack in a Dutch datacentre with audited physical access. Mistral-Nemo-Instruct-2407 at 8-bit quant for the front-line agent. A smaller fine-tune for category classification. The whole thing has to be AVG-defensible and Woo-loggable: which prompts went in, which outputs came out, who saw them, when they were deleted. Nothing leaves the building.
Ollama: the one that ships in a weekend
Ollama is the easiest sell to a four-person dev team. Install, pull a model, point your code at http://localhost:11434, ship. The Modelfile abstraction lets a non-Python developer set a system prompt and a temperature without touching CUDA.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral-nemo:12b-instruct-2407-q8_0
sudo systemctl enable --now ollama
curl http://localhost:11434/api/chat -d '{
  "model": "mistral-nemo:12b-instruct-2407-q8_0",
  "messages": [{"role": "user", "content": "Wat is een WMO-aanvraag?"}],
  "stream": false
}'
Where Ollama earns its keep is the operations side. The systemd unit it installs is clean. journalctl -u ollama shows you what is happening. When the RTX 6000 wedges, a reboot brings it back without a 40-line YAML reload. Default logging is request-level: timestamp, model, duration, token counts. Not the prompts themselves; if you need those, you put a logging proxy in front of Ollama.
The honest cost of Ollama at our workload is throughput. Single-stream KV cache, batched only across explicit concurrent requests, no PagedAttention. At 2,600 weekly aanvragen we sat around 38 tokens per second on Mistral-Nemo at int8, with peak concurrency around six. Monday morning queued. It cleared by 09:15. Per-1k-token amortised cost over a three-year GPU lease worked out to roughly €0.00021, the same for input and output. Cheaper than any commercial API tier, ignoring the labour to maintain it.
vLLM: the one that scales and bites you
vLLM is the production choice if you are honest with yourself about throughput. PagedAttention and continuous batching mean the same RTX 6000 will push four to seven times more tokens per second at concurrency above four. For our workload that meant 180 to 220 tokens per second on the same Mistral-Nemo, with the queue never building past two requests deep. The vLLM OpenAI-compatible server docs walk through the configuration surface.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-Nemo-Instruct-2407 \
  --quantization fp8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 16 \
  --disable-log-requests
The cost story is better on paper: roughly €0.00008 per 1k tokens at our concurrency, against the same hardware lease. You will spend that saving back on Sunday mornings. vLLM is a Python process with a CUDA-version-pinned wheel chain. When an NVIDIA driver upgrade lands through unattended-upgrades and takes effect on a Saturday-night reboot, vLLM does not start. The error message is RuntimeError: CUDA error: no kernel image is available for execution on the device, and the fix is either a wheel rebuild or a driver downgrade. Either is fine on a Tuesday. Neither is fine at 07:42 on a Sunday.
If you choose vLLM, pin the NVIDIA driver version explicitly. Hold the meta-package in apt, configure unattended-upgrades to skip nvidia-*, and put the running driver version into your monitoring. A silent driver bump is the single most common cause of a Sunday-morning wedge in our incident log.
Logging on vLLM is structured if you turn it on. The --disable-log-requests flag exists for a reason: on the versions we ran, request bodies hit stdout by default. That is the wrong default for any deployment that touches BSN numbers, intake forms, or citizen surnames. Newer vLLM releases have reworked this flag, so confirm the default on the version you pin. We send vLLM's structured access log to a separate aggregator, redact at the proxy layer before the request reaches vLLM, and keep the model server at INFO with no request bodies.
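To make "redact at the proxy layer" concrete, here is a minimal sketch of one redaction rule, for BSN numbers only. It is not our production redaction pass. It assumes a BSN appears as nine consecutive digits and uses the public 11-check (elfproef) to cut false positives; the function names and the [BSN] placeholder are ours for illustration.

```python
import re

# Nine consecutive digits that are not part of a longer digit run.
_BSN_CANDIDATE = re.compile(r"(?<!\d)\d{9}(?!\d)")
# Weights of the BSN 11-check: the last digit counts as -1.
_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def passes_elfproef(digits: str) -> bool:
    """Return True if a 9-digit string satisfies the BSN 11-check."""
    total = sum(int(d) * w for d, w in zip(digits, _WEIGHTS))
    return total % 11 == 0


def redact_bsn(text: str, placeholder: str = "[BSN]") -> str:
    """Replace every 9-digit number that passes the 11-check."""
    def _swap(match: re.Match) -> str:
        value = match.group(0)
        return placeholder if passes_elfproef(value) else value

    return _BSN_CANDIDATE.sub(_swap, text)


print(redact_bsn("Mijn BSN is 123456782, zaaknummer 123456789."))
```

A case number that happens to pass the check is redacted too, which is the safe direction to fail in. Surnames and addresses need a different approach; a regular expression will not find them.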
LM Studio with llama.cpp: the desktop-class stack that surprised us
The third stack started as a developer toy. LM Studio gives you a desktop UI for picking a quant, loading it into llama.cpp's server mode, and pointing an OpenAI-compatible client at it. It is the stack the homelab threads on Hacker News keep recommending, and it is the one our client's CTO had already used for a weekend prototype before we arrived.
For production, you drop LM Studio's UI and run llama-server from llama.cpp directly. The argument surface is honest:
./llama-server \
  --model models/Mistral-Nemo-Instruct-2407-Q8_0.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --parallel 8 \
  --cont-batching \
  --metrics
The llama.cpp server is a single static binary. No Python. No virtualenv. No wheel pinning. When CUDA wedges, you rebuild the binary with the new CUDA version in twelve minutes from a known commit. You check out a tag, run cmake -B build -DGGML_CUDA=ON, run cmake --build build --config Release, copy the binary, restart the systemd unit. The same Sunday wedge that costs vLLM two hours costs llama.cpp twenty minutes. We measured it twice this year.
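Throughput sits between Ollama and vLLM: roughly 90 to 110 tokens per second on Mistral-Nemo at the same quant, with continuous batching enabled. That is enough headroom for 2,600 weekly aanvragen, with room for the volume to double before we have to rethink. Per-1k-token amortised cost: about €0.00015 across both directions. One caveat on the flags: llama-server has historically divided --ctx-size across the --parallel slots, so 8192 with eight slots can leave each conversation about 1,024 tokens. Check the per-slot context on your build against your longest conversation.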
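A rough headroom check shows why these rates are enough. This is a sketch under two assumptions of ours: the tokens-per-second figures are aggregate output rates, and requests arrive evenly across the Monday-morning window, which real traffic does not. The midpoints for llama.cpp and vLLM are taken from the ranges quoted in the text.

```python
# Rough headroom check for the Monday-morning window.
MORNING_CONVERSATIONS = 520
OUTPUT_TOKENS_PER_CONVERSATION = 600
WINDOW_SECONDS = 3.5 * 3600  # 08:30 to 12:00

STACK_TOKENS_PER_SECOND = {
    "ollama": 38,
    "llama.cpp": 100,  # midpoint of 90 to 110
    "vllm": 200,       # midpoint of 180 to 220
}


def required_rate(conversations: int) -> float:
    """Average output tokens per second needed to keep up."""
    return conversations * OUTPUT_TOKENS_PER_CONVERSATION / WINDOW_SECONDS


def utilisation(stack: str, conversations: int = MORNING_CONVERSATIONS) -> float:
    """Fraction of a stack's measured rate that the window would consume."""
    return required_rate(conversations) / STACK_TOKENS_PER_SECOND[stack]


for name in STACK_TOKENS_PER_SECOND:
    today = utilisation(name)
    doubled = utilisation(name, 2 * MORNING_CONVERSATIONS)
    print(f"{name:10s} today {today:5.0%}   doubled {doubled:5.0%}")
```

The window needs about 25 output tokens per second on average. That puts Ollama near 65 percent of its measured rate, which is why bursts queue, and llama.cpp near a quarter, or roughly half if the volume doubles.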
The logging story is the cleanest of the three. llama-server exposes a /v1/chat/completions endpoint that mirrors the OpenAI shape, which means your AVG-defensible logging lives in the proxy in front of it, not in the model server. That is the right place for it. The model server should not know who the citizen is.
What we actually shipped
For this client, the decision was llama.cpp with a thin FastAPI proxy in front. Three reasons.
First, the on-call engineer can rebuild from source on a Sunday with a runbook that fits on one A4. vLLM cannot promise that. The wheel chain assumes you have a working development environment, the right CUDA toolkit, and an hour. None of those things are true at 07:42.
Second, the logging surface is in our code, not in a third-party server. When a Woo-verzoek arrives, we know exactly which fields were captured, which were redacted, and which were never written. Run without verbose logging, the model server does not write prompt bodies into journalctl.
Third, throughput is sufficient. We did not need vLLM's higher ceiling, which on this box is roughly double llama.cpp's rate. We needed predictable Mondays.
If your weekly volume is closer to 26,000 aanvragen instead of 2,600, the calculation flips. vLLM's per-token cost advantage compounds, and the operational risk is justified because you have a dedicated platform engineer instead of a CTO who also writes the SLA. Below 5,000 weekly conversations, llama.cpp wins. Above 20,000, vLLM wins. In between, you are choosing between Sunday hours and throughput, and you should pick the one your team can actually carry.
AVG and Wet open overheid: the part the benchmark posts skip
The benchmark posts compare tokens per second. The auditor compares retention windows, processing-purpose declarations, and the log line that proves you did not leak a BSN into a prompt.
The Autoriteit Persoonsgegevens guidance on processing logging is opinionated about what counts as a defensible audit trail. What worked for us was to log the following per request: timestamp, anonymised session id, intake category, input token count, output token count, model version, redaction-pass version. We do not log the prompt body or the model output in the same store. The prompt body, after PII redaction, goes to a separate seven-day rolling store with a documented purpose. The model output goes to a thirty-day store tied to the case file.
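The per-request fields listed above map naturally onto one JSON line per request. A minimal sketch, not our production schema: the class and function names are illustrative, and deriving the session id with a keyed hash (HMAC-SHA256) is an assumption of ours about how the anonymisation could be done.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """One audit line per request; no prompt body, no model output."""
    timestamp: str
    session_id: str          # keyed hash, never the raw session token
    intake_category: str
    input_tokens: int
    output_tokens: int
    model_version: str
    redaction_version: str


def anonymise_session(raw_session_id: str, secret: bytes) -> str:
    """Keyed hash so the same session always maps to the same opaque id."""
    digest = hmac.new(secret, raw_session_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def build_record(raw_session_id, category, input_tokens, output_tokens,
                 model_version, redaction_version, secret, now=None):
    """Assemble the record; `now` can be injected to make tests repeatable."""
    now = now or datetime.now(timezone.utc)
    return AuditRecord(
        timestamp=now.isoformat(),
        session_id=anonymise_session(raw_session_id, secret),
        intake_category=category,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        model_version=model_version,
        redaction_version=redaction_version,
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialise with sorted keys so log lines diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because the dataclass has no field for the prompt or the output, a careless caller cannot write either into this store. Strictly speaking a keyed hash is pseudonymisation rather than anonymisation, so the secret needs its own access control and rotation policy.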
This separation is easier with llama.cpp because the model server is dumb. It does not log a prompt unless you ask it to. With Ollama, the same is true but the log format is less structured. With vLLM, you must remember to set --disable-log-requests or you have just written a citizen's WMO query to systemd-journald, and now the retention clock on that journal file is your problem too.
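The retention windows described above, seven days for redacted prompts and thirty for model outputs, reduce to a small purge rule. A sketch under assumptions: the store names and the written_at field are hypothetical, and a real store would delete in place rather than filter a list.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the text; the store names are illustrative.
RETENTION = {
    "redacted_prompts": timedelta(days=7),
    "model_outputs": timedelta(days=30),
}


def is_expired(store: str, written_at: datetime, now: datetime) -> bool:
    """True once a row has reached the end of its store's window."""
    return now - written_at >= RETENTION[store]


def purge(store: str, rows: list[dict], now: datetime) -> list[dict]:
    """Return only the rows still inside the store's retention window."""
    return [
        row for row in rows
        if not is_expired(store, row["written_at"], now)
    ]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
rows = [
    {"id": 1, "written_at": now - timedelta(days=6)},
    {"id": 2, "written_at": now - timedelta(days=8)},
]
print([row["id"] for row in purge("redacted_prompts", rows, now)])
print([row["id"] for row in purge("model_outputs", rows, now)])
```

Keeping the windows in one table means the number the auditor reads in the processing register is the same number the purge job uses.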
The Wet open overheid angle is the one that municipal vendors often miss. A citizen can request the record of how their intake was processed. "The model decided" is not a defensible answer. "Here is the timestamped log of which model version handled the request, which redaction pass ran first, and what category the agent assigned" is. The choice of inference stack is upstream of that capability, but only if you build the proxy that captures it.
The smallest thing you can do today
Open a terminal on whatever GPU box you have. Pin the NVIDIA driver version in your package manager, write the current driver version into a file at /etc/abn/driver.lock, and add a single line to your monitoring that alerts if the running driver version differs from the lockfile. That one check would have caught three of the five Sunday wedges we have seen across municipal-services clients this year.
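The drift check itself is a few lines. A minimal sketch, assuming the lockfile holds nothing but the version string and that nvidia-smi is on the PATH; the function names are ours, and how the result reaches your monitoring is left open.

```python
import subprocess
from pathlib import Path

LOCKFILE = Path("/etc/abn/driver.lock")  # path from the text above


def read_locked_version(path: Path = LOCKFILE) -> str:
    """Read the pinned driver version from the lockfile."""
    return path.read_text(encoding="utf-8").strip()


def read_running_version(timeout_s: float = 10.0) -> str:
    """Ask nvidia-smi for the driver version; raises if it hangs or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    # nvidia-smi prints one line per GPU; use the first.
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError("nvidia-smi returned no driver version")
    return lines[0].strip()


def driver_drifted(locked: str, running: str) -> bool:
    """True when the running driver differs from the pinned one."""
    return locked.strip() != running.strip()
```

In a monitoring hook you would call driver_drifted(read_locked_version(), read_running_version()) and alert on True. Treat an exception from read_running_version as an alert as well: a timeout is the wedge itself, and a non-zero exit often means the userspace library and the kernel module no longer match.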
When we built the burger-intake agent that runs this stack, the gap we kept hitting was between what the model server logs and what the auditor wants to see. We ended up solving it by putting the entire AVG-defensible layer in a FastAPI proxy in front of the inference layer, so the model server stays dumb and the audit trail lives where we can reason about it.
Key takeaway
Pick the inference stack your Sunday on-call can rebuild from a one-page runbook. Throughput is Tuesday's problem; driver wedges are Sunday's.
FAQ
Which on-prem LLM stack is cheapest per 1k tokens?
On hardware alone: vLLM at roughly €0.00008 per 1k tokens, llama.cpp at about €0.00015, Ollama at roughly €0.00021. The absolute differences are small, and once on-call labour is counted vLLM only pulls ahead above roughly 20,000 weekly conversations.
Is Ollama AVG-compliant out of the box?
No runtime is. Ollama does not log prompt bodies by default, which helps, but compliance lives in the proxy, the retention policy, and the redaction pass. Pick the stack, then build the audit layer.
Can llama.cpp match vLLM throughput?
Not at high concurrency. With continuous batching it reaches about half of vLLM's tokens per second on the same hardware. At a few thousand weekly conversations the gap does not matter; above 20,000 it does.
What causes a Sunday CUDA wedge?
Most often an unattended NVIDIA driver upgrade that takes effect at a maintenance reboot; that accounted for three of the five wedges we saw this year. Pin the nvidia-* packages, hold the meta-package, and alert when the running driver version drifts from a lockfile.