AI agents

Self-hosting support LLMs: Llama 3, Mistral, Qwen on one A100

It is Tuesday morning. The support inbox has 41 open tickets, last month's API bill came in at €3,400, and ops is asking whether we could run this ourselves.

Jacob Molkenboer· Founder · A Brand New Company· 3 Jun 2026· 9 min

Brass relay switch, wooden patch-bay block, folded paper docket, chartreuse sticky note, wax seal on ivory paper.

It is Tuesday morning. The support inbox has 41 open tickets, last month's API bill came in at €3,400, and the head of operations is asking, in a tone that is not quite a question, whether we could just run this thing ourselves.

We get this conversation roughly once a quarter. Sometimes the trigger is the bill. More often it is data residency, a customer who refuses to have their tickets touch a US endpoint, or a hard latency target the API cannot hit from where the customer is sitting. Whatever the trigger, the answer always starts the same way: pick a model, benchmark it on hardware you can actually afford, and find out whether the math works before you commit.

This post is the writeup of a benchmark we ran in May for a client who handles tier-1 product support in three languages (English, Dutch, German). One NVIDIA A100 80GB, three open-weight model families, the same workload run on each. The numbers below are from our setup; your prompts and your traffic will move them around, but the shape is reliable.

The hardware budget

A single A100 80GB is the boring middle of the self-hosting market. You can rent one on-demand on Lambda or RunPod for roughly $1.50 to $2.50 per hour, or reserve one for around $0.80 to $1.20. At reserved pricing that is €700 to €1,000 per month, all-in. The same client was spending about €3,400 per month on a frontier API for the same support volume.

That €700-vs-€3,400 gap is the only reason this exercise is interesting. If your support volume is a few hundred tickets a month, stop reading. The API wins on every axis at that scale. If you are pushing tens of thousands of tickets a month and the conversations are long, keep reading.

The contenders

We picked the three families a small ops team is most likely to consider in mid-2026:

Llama 3.1 8B Instruct, the workhorse default. Strong English, decent Dutch, weaker on German nuance.
Mistral 7B Instruct v0.3, the European pick. Fast, small, well documented. We also tested Mistral Nemo 12B as a stretch option.
Qwen 2.5 7B Instruct, the multilingual surprise. Stronger out-of-the-box behaviour on non-English tickets than either of the above in our tests.

For each family we also ran the largest quantised variant that fits comfortably on 80GB of HBM2e: Llama 3.1 70B in AWQ-4bit, Qwen 2.5 32B in AWQ-4bit, and Mixtral 8x7B in AWQ-4bit. The smaller models all ran in BF16 with room to spare.

The serving stack

vLLM 0.6.x on a single GPU, behind a thin FastAPI gateway that handled auth, logging, and the prompt template per channel. We used continuous batching with default settings and turned on prefix caching for the system prompt and the static portion of the retrieval context.

If you have not used vLLM before, the short version is: it is the open-source serving layer that turns a Hugging Face model into something that can hold its own under real concurrency. PagedAttention plus continuous batching is what makes the throughput numbers below look reasonable on a single card.

We did not fine-tune. The point of this benchmark was to find out what these models do out of the box with a strong system prompt and good retrieval, because that is what 90% of teams will actually deploy on day one.

The workload

Two runs per model. The first was synthetic: 100 simulated agents each sending support-shaped prompts, with input lengths between 200 and 800 tokens (system prompt + retrieved KB chunks + the ticket) and output lengths between 80 and 300. The second was a replay of 500 anonymised historical tickets from the client's CRM, graded by two senior support agents on a 5-point scale for tone, factual accuracy, and "would I send this".

The retrieval layer was identical across models: a 4,000-document knowledge base, embedded with bge-m3, served from Qdrant, top-5 chunks per query. Worth saying out loud that there is real engineering between a chat model and a working support agent, and most of it lives in the retrieval layer. The retriever's recall determined more of the final answer quality than the choice of model did, every single time we measured.

Throughput and latency, batch 1

Batch 1 is the boring single-stream number. It is also the one your product manager will ask about, because it sets the floor on perceived latency for one user.

Model                    TTFT (ms)   Output tok/s
Llama 3.1 8B (BF16)         ~80          ~110
Mistral 7B v0.3 (BF16)      ~70          ~125
Qwen 2.5 7B (BF16)          ~85          ~115
Llama 3.1 70B AWQ-4bit      ~210         ~30
Qwen 2.5 32B AWQ-4bit       ~150         ~45
Mixtral 8x7B AWQ-4bit       ~130         ~50

At those rates, a 200-token answer from any of the small models lands in under two seconds. The 70B model lands in around seven. For chat in a sidebar widget, two seconds is fine. Seven is not.

Throughput under load

Batch 1 is not how you run a support agent. You run it with continuous batching, hold a pool of concurrent conversations, and measure aggregate throughput and p95 latency under load. With 32 concurrent streams:

Model                    Aggregate tok/s   p95 TTFT (ms)
Llama 3.1 8B (BF16)         ~2,400            ~340
Mistral 7B v0.3 (BF16)      ~2,800            ~290
Qwen 2.5 7B (BF16)          ~2,500            ~360
Llama 3.1 70B AWQ-4bit      ~620              ~1,100
Qwen 2.5 32B AWQ-4bit       ~900              ~780
Mixtral 8x7B AWQ-4bit       ~700              ~920

Mistral 7B wins the throughput race at the small end. That tracks: it is the smallest and the most efficient at the kernel level for this generation of GPUs. The 70B-class models are workable but only if your interface is async (an email reply, a ticket triage, a summarisation job) rather than synchronous chat.

Quality, scored by humans

Throughput is the easy half. The harder half is whether the answer is good enough to ship to a customer. Two senior support agents graded 100 randomly sampled replies per model on a 5-point scale, blind to which model produced which answer.

Model                    Tone   Accuracy   "Would send"
Llama 3.1 8B                4.0     3.7        67%
Mistral 7B v0.3             3.8     3.5        58%
Qwen 2.5 7B                 4.1     4.0        72%
Llama 3.1 70B AWQ           4.4     4.3        84%
Qwen 2.5 32B AWQ            4.5     4.4        86%
Mixtral 8x7B AWQ            4.2     4.0        74%

Qwen 2.5 was the small-model winner for us, by a clear margin. On the German and Dutch tickets in particular, the gap over Llama 3.1 8B was visible to both graders without prompting. The Qwen 2.5 model card describes multilingual training as a first-class goal; our results match that.

At the larger end, Qwen 2.5 32B AWQ matched Llama 3.1 70B AWQ on quality at roughly 1.5x the throughput. If you have only one A100, that is the model we would deploy in production today.

A note on quantisation, because someone always asks. AWQ-4bit shaved roughly 0.2 points off the human-graded accuracy score for the 70B and 32B models compared with BF16 runs on a temporarily-rented 2xA100. For tier-1 support that gap is invisible. For legal, financial, or medical replies, retest before you ship.

Takeaway

On a single A100 80GB in mid-2026, Qwen 2.5 32B in AWQ-4bit is the sweet spot for multilingual support. Mistral 7B is the throughput champion at the small end. Llama 3.1 8B is fine, but not the model we would pick.

The break-even math

The reserved A100 at €900 per month sits below the client's previous API spend of €3,400 per month. That sounds like a clear win, but the GPU is only one line item. The full self-hosted bill includes:

The GPU itself (€700 to €1,200 per month).
An on-call rotation, because the GPU host is now in your incident scope.
Engineering time to maintain the serving stack, the eval harness, and the prompt templates.
A second GPU for redundancy if the support agent is customer-facing and synchronous.

Honestly priced, self-hosting starts saving money around the €2,500-per-month API mark and starts being obviously cheaper around €5,000. Below that, the API wins and you should not be reading benchmark posts.

What we would change next time

Three things, in order of how much they would move the needle.

First, fine-tune. Every number above is out-of-the-box. A short LoRA on a few thousand of the client's own resolved tickets would push the "would send" rate well past 90% on the 7B models, based on prior projects. We did not include that here because we wanted clean base-model numbers, but for production we would always fine-tune.

Second, speculative decoding. vLLM now ships with a usable speculative decoding path. With Qwen 2.5 1.5B as the draft model paired with the 32B target, we saw roughly 1.7x speedup on the same A100, with no measurable quality loss. We left those numbers out of the table because we ran them on a follow-up box, but they are real and they matter once you push past 50 concurrent streams.

Third, the retrieval layer. The single biggest determinant of whether the support agent shipped good answers was not the model. It was the quality of the retrieved KB chunks. A bad retriever paired with a 70B model loses to a good retriever paired with a 7B model, every time.

The shortest path to a decision

If you are sitting on a six-figure annual API bill for support, the one-day experiment is this: spin up one A100 on a per-hour box, point vLLM at Qwen 2.5 7B and 32B, replay 200 historical tickets through both, and have your two best support agents grade the outputs blind. That is a single billable day of GPU rental. It will tell you whether the conversation with your CFO is worth having.

When we built the support agent for a Dutch SaaS client last quarter, the thing that surprised us was not the GPU economics. It was how much of the win came from the retrieval pipeline and the eval harness, not from the model choice. If you are weighing a similar move, that is where we would put your first week of engineering. We do this work as part of our AI agents practice, and the retrieval question is the one we spend the most time on.

Open a terminal today and run nvidia-smi on whatever GPU you can borrow for an hour. If the number you see makes you wince, the API is still the right answer for now.

Key takeaway

On a single A100 80GB in 2026, Qwen 2.5 32B AWQ is the sweet spot for multilingual support; Mistral 7B wins on raw throughput at the small end.

FAQ

When does self-hosting actually beat the API on cost?

Roughly above €2,500 per month in sustained API spend, and obviously above €5,000. Below that, the API wins once you price in on-call, redundancy, and engineering time.

Can a single A100 80GB run a 70B model?

Yes, in AWQ-4bit quantisation. Expect about 30 tokens per second single-stream and 600 to 700 aggregate under load. Use it for async work, not synchronous chat.

Which model handled multilingual support best?

Qwen 2.5 (7B and 32B) led on Dutch and German tickets in our blind grading, by a margin both senior support agents flagged without being told what they were scoring.

Do you need to fine-tune to ship a support agent?

For production, yes. A short LoRA on a few thousand resolved tickets typically pushes the would-send rate well past 90 percent on the 7B models. Retrieval quality matters even more.

ai agentschat agentsarchitecturetoolingoperationsrag

Building something?

Start a project