Triage classifiers benchmarked: Mistral, Llama, Haiku 4.5

Mistral Small 3, Llama 3.3 70B on Groq, and Claude Haiku 4.5 ran through 1,928 anonymised Dutch citizen tickets. Real per-ticket costs, dialect floors, and the metrics public leaderboards never measure.

Jacob Molkenboer · Founder · A Brand New Company · 9 Jun 2026 · 9 min

It is Monday at a Dutch municipality of 180,000 residents. The shared citizen-mail inbox has 412 unread messages. Some report a broken streetlight on Kerkstraat. Some complain about a neighbour's barbecue. Two are addressed to the mayor personally, and one is a structured-help request from a parent who lost their job. The triage clerk is on holiday until Thursday.

This is the job we were asked to automate. Not generation. Not chat. Just routing, urgency, and a one-line summary for the case worker who eventually opens the ticket. Twenty-eight thousand of these a month, every month, with peaks on Monday morning when the weekend complaints land at once. We ran the same dataset through three classifier candidates over three weeks in May 2026: Mistral Small 3, Llama 3.3 70B served by Groq, and Claude Haiku 4.5. None of the public leaderboards told us what we needed to know to choose.

Classification is not generation

The model reads one Dutch-language email, picks one of fourteen departments, picks one of three urgency tiers, and writes a sentence the case worker reads first. That is the whole job.

Three things matter. Did it pick the right department? Did it pick the right urgency? Did it refuse, hallucinate a department that does not exist, or invent a fact the citizen never wrote?

What does not matter. MMLU score. Long-context recall over 200k tokens. Coding ability. Reasoning chain length. The leaderboards we read measured none of the things our procurement officer would actually be invoiced for.

The benchmark we ran

We took 2,000 tickets from the live inbox, anonymised them, and had two municipal staff label each one with the correct department, the correct urgency, and a one-sentence reference summary. Inter-rater agreement on the department label was 96.4%. We dropped the 72 tickets the labellers disagreed on, leaving 1,928 valid cases.
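The agreement figure and the size of the final set follow from the same two numbers. A quick check, assuming "agreement" here means simple percent agreement on the department label rather than a chance-corrected statistic:

```typescript
// Assumption: agreement is the share of tickets on which both labellers
// picked the same department (not Cohen's kappa or similar).
const totalTickets = 2000;
const disagreements = 72;

const validTickets = totalTickets - disagreements;
const agreementPct = (validTickets / totalTickets) * 100;

console.log(validTickets);            // 1928
console.log(agreementPct.toFixed(1)); // "96.4"
```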

The prompt was identical for all three models: an eight-shot Dutch system message naming the fourteen departments with one example per department, then the raw email. Output was a JSON object with department, urgency, and summary_nl. No retries. No retrieval. No tool calls. We wanted to know what each model could do unaided.

We ran each model over the 1,928 tickets three times and took the median of the three runs. Costs are the billed amounts in euros at the rate each provider charged us in May 2026, projected to a 28,000-ticket month and including prompt caching where the provider offered it. Latency is the p95 wall-clock time from sending the request to having the JSON parsed and validated.
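For readers who want to reproduce the two aggregates, here is a minimal sketch. It assumes a nearest-rank percentile, which is one common definition; the post does not say which method the harness used, and the sample numbers are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Other definitions interpolate.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Median across runs. With three runs this is simply the middle value.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Made-up latencies in milliseconds, one slow outlier included.
const latenciesMs = [280, 290, 300, 310, 950, 295, 305, 285, 299, 301];
const p95Ms = percentile(latenciesMs, 95); // 950: the outlier sets the p95

// Made-up department accuracies from three hypothetical runs.
const runAccuracies = [0.941, 0.943, 0.946];
const reportedAccuracy = median(runAccuracies); // 0.943
```

With only ten samples the p95 is the single slowest request, which is why a p95 is only meaningful over a run of realistic size.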

{
  "department": "Klantcontactcentrum",
  "urgency": "normal",
  "summary_nl": "Bewoner meldt kapotte lantaarn op de Kerkstraat ter hoogte van nummer 14."
}
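A strict check on that output shape can be sketched as follows. This is an illustration, not the validator from the project: only "Klantcontactcentrum" is named in the post, so the other department names are placeholders, and "low" is an assumed third urgency tier next to the "normal" and "high" that the post mentions.

```typescript
// Hypothetical routing table. Only "Klantcontactcentrum" comes from the post;
// the other two entries are placeholders for the remaining departments.
const DEPARTMENTS: readonly string[] = [
  "Klantcontactcentrum",
  "Openbare verlichting",
  "Sociaal domein",
];
// "normal" and "high" appear in the post; "low" is an assumed third tier.
const URGENCIES: readonly string[] = ["low", "normal", "high"];

interface Classification {
  department: string;
  urgency: string;
  summary_nl: string;
}

type ParseResult =
  | { status: "ok"; value: Classification }
  | { status: "rejected"; reason: string };

function parseClassification(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { status: "rejected", reason: "not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { status: "rejected", reason: "not a JSON object" };
  }
  const { department, urgency, summary_nl } = data as Record<string, unknown>;
  // Reject any label outside the allowed set instead of routing it.
  if (typeof department !== "string" || DEPARTMENTS.indexOf(department) === -1) {
    return { status: "rejected", reason: "unknown department" };
  }
  if (typeof urgency !== "string" || URGENCIES.indexOf(urgency) === -1) {
    return { status: "rejected", reason: "unknown urgency" };
  }
  if (typeof summary_nl !== "string" || summary_nl.trim() === "") {
    return { status: "rejected", reason: "missing summary" };
  }
  return { status: "ok", value: { department, urgency, summary_nl } };
}

// An invented label is rejected with reason "unknown department".
const invented = parseClassification(
  '{"department":"Bureau Klantcontact","urgency":"normal","summary_nl":"Test."}'
);
```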

Why these three and not others? Mistral Small 3 because the procurement officer needed an EU-incorporated provider option for the GDPR posture. Llama 3.3 70B on Groq because the speed claim sounded too good to be true and we wanted to verify it. Haiku 4.5 because it was the smallest Anthropic model with the new pricing that put it in budget range. We screened GPT-4o-mini in a smaller pilot, found it tied Mistral on Dutch and cost more, and dropped it.

Mistral Small 3, the cheap one

Mistral Small 3 (24B parameters, Apache-licensed, served via Mistral's own API) hit 89.2% department accuracy and 84.1% urgency accuracy. It cost €18 per month to classify all 28,000 tickets. It refused 0.1% of tickets, almost always when the citizen wrote something self-harm-adjacent that tripped its safety filter.

Where it lost was Dutch dialect. A ticket written in Limburgs or Twents got misrouted roughly twice as often as one written in Standard Dutch. It also hallucinated department names eight times in 1,928 tickets. The municipality has a "Klantcontactcentrum". Mistral kept inventing "Bureau Klantcontact", which would route those tickets into a folder that does not exist.

The rate-limit incident is worth a sentence. On the second Monday of testing, Mistral capped us at 60 requests per minute on the Small 3 endpoint without warning. We had not exceeded their published tier. Support resolved it in 48 hours, which is fine for a benchmark and not fine for production. Anthropic and Groq publish per-tier limits with a self-serve upgrade path. Mistral does not, at least not for our organisation.

Mistral's own release notes for Small 3 describe it as "a small model at the latency-and-quality frontier". That is fair on aggregate. For Dutch civic mail it is not the frontier. It is the floor we stayed above.

Llama 3.3 70B on Groq, the fast one

Groq serves Llama 3.3 70B at speeds that make you triple-check the timer. Our p95 was 310 milliseconds end-to-end, including TLS and JSON parsing. That is roughly four times faster than Mistral and three times faster than Haiku in our test.

The model hit 91.7% department accuracy and 86.4% urgency accuracy. Better than Mistral, worse than Haiku. The cost was €34 per month for 28,000 tickets at Groq's published rate for Llama 3.3 70B.

The catch was hallucinated departments. Llama 3.3 70B invented a department label 27 times across the run. Six of those labels sounded plausible but were not in our routing table. Twenty-one were department names from other Dutch municipalities the model had presumably seen in training. "Wijkteam Zuid" exists in Utrecht but not at our client. Catching all of them needed a strict JSON-schema validator with a retry, and the retry ate a chunk of Groq's latency advantage.
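Why the retry costs latency is easy to see in a sketch. The function names below are hypothetical: `callModel` stands in for whatever provider client is in use, and `isAllowed` for a lookup against the routing table. The worst-case figure treats every attempt as costing about one p95, which is a simplification.

```typescript
// Hypothetical signature: wraps the provider call and returns the
// department label the model produced.
type CallModel = (email: string) => Promise<string>;

async function classifyWithRetry(
  email: string,
  callModel: CallModel,
  isAllowed: (department: string) => boolean,
  maxAttempts = 2
): Promise<{ department: string | null; attempts: number }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const department = await callModel(email);
    if (isAllowed(department)) return { department, attempts: attempt };
    // Label is outside the allowed set: ask again rather than route it.
  }
  // Out of attempts: return null so the caller can hand the ticket to a human.
  return { department: null, attempts: maxAttempts };
}

// Each extra attempt adds roughly one more model call to the wall-clock time.
function worstCaseLatencyMs(perCallMs: number, maxAttempts: number): number {
  return perCallMs * maxAttempts;
}

// One retry on top of a 310 ms call gives 620 ms in the worst case.
const worstCase = worstCaseLatencyMs(310, 2);
```

A retry that repeats an identical request may well return the same invalid label, so in practice the second attempt would usually restate the allowed set or feed the rejection reason back to the model.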

Speed is not free if you have to spend it on guardrails.

Claude Haiku 4.5, the boring one that won

Haiku 4.5 hit 94.3% department accuracy and 91.2% urgency accuracy. It cost €58 per month for the full 28,000-ticket volume at Anthropic's published Haiku rate with prompt caching on the eight-shot system message. P95 latency was 920 ms. It hallucinated a department twice in 1,928 tickets, both times marked low-confidence in its own reasoning trace.

It handled Dutch dialect more reliably than the other two. A ticket in Twents about a "kapotte lantern" routed to public lighting cleanly. Mistral routed the same email to general complaints. Llama routed it to "Wijkbeheer Oost", a department that does not exist.

Haiku 4.5 is not the cheapest model. It is the model where, three months in, we have not had to write a single guardrail beyond the JSON-schema validator. That matters when the alternative is a case worker discovering a misrouted urgent ticket on day five.
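Putting the three monthly figures side by side makes the per-ticket cost concrete. The sketch below only divides the reported monthly costs by the 28,000-ticket volume; the variable names are illustrative.

```typescript
const TICKETS_PER_MONTH = 28000;

// Monthly billed cost in euros, as reported for each model above.
const monthlyCostEur = {
  mistralSmall3: 18,
  llama33OnGroq: 34,
  haiku45: 58,
};

function perTicketEur(monthlyEur: number): number {
  return monthlyEur / TICKETS_PER_MONTH;
}

// Difference between the most and least expensive option per month.
const haikuPremiumEur = monthlyCostEur.haiku45 - monthlyCostEur.mistralSmall3;

console.log(perTicketEur(monthlyCostEur.mistralSmall3).toFixed(4)); // "0.0006"
console.log(perTicketEur(monthlyCostEur.llama33OnGroq).toFixed(4)); // "0.0012"
console.log(perTicketEur(monthlyCostEur.haiku45).toFixed(4));       // "0.0021"
console.log(haikuPremiumEur);                                       // 40
```

Even the most expensive of the three comes to about a fifth of a euro cent per ticket, so the comparison turns on misroute cleanup rather than on the invoice.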

Takeaway

For narrow classification at municipal scale, the model that costs €40 more per month saves roughly two operations-days per week in misroute cleanup. The cheap model is not cheap once a human has to fix it.

The numbers the leaderboards skip

Public leaderboards report MMLU, GSM8K, HumanEval, sometimes a Dutch benchmark from 2023. What they do not report is what shows up on your invoice and in your incident channel.

Hallucinated-class rate. The frequency at which a classifier invents a label outside the allowed set. Across our 1,928-ticket runs, the rates were 0.4% for Mistral, 1.4% for Llama on Groq, and 0.1% for Haiku. None of the three vendors publishes this metric on a benchmark page.
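The metric is cheap to compute once predictions are logged. A minimal sketch, with hypothetical function names, that also reproduces the three percentages from the raw counts reported earlier (8, 27, and 2 invented labels in 1,928 tickets):

```typescript
// Share of predictions whose label is not in the allowed set.
function hallucinatedClassRate(predicted: string[], allowed: string[]): number {
  if (predicted.length === 0) return 0;
  let outside = 0;
  predicted.forEach((label) => {
    if (allowed.indexOf(label) === -1) outside++;
  });
  return outside / predicted.length;
}

// Formats a count out of a total as a percentage with one decimal.
function asPercent(count: number, total: number): string {
  return ((count / total) * 100).toFixed(1);
}

console.log(asPercent(8, 1928));  // "0.4"  Mistral Small 3
console.log(asPercent(27, 1928)); // "1.4"  Llama 3.3 70B on Groq
console.log(asPercent(2, 1928));  // "0.1"  Haiku 4.5
```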

Cost at burst. Our peak hour is Monday 09:00 to 10:00, when about 1,400 weekend complaints land in a sixty-minute window. Per-1k-token cost is meaningless at burst. What matters is whether the provider rate-limits you, whether failover to a second region works, and whether p99 stays under your SLA. Groq had the best burst behaviour we measured. Anthropic had the cleanest graceful-degrade when we tripped a soft limit. Mistral capped us at a rate we could only get past by phoning support.

Refusal pattern on emotional tickets. Municipal mail includes a non-trivial number of citizens in crisis. A model that refuses to classify a self-harm-adjacent ticket as "social services, urgency high" because of its safety filter is worse than useless. It hides the urgent ticket. All three vendors have improved here in the last year, but the failure modes differ, and you have to test them on your real data, not a sanitised benchmark.

Dutch dialect floor. There is no public benchmark for Limburgs, Twents, or Brabants written Dutch. Public benchmarks test Standard Dutch on Wikipedia-style prose. Citizen mail is neither of those things. The only way to know the dialect floor for a given model is to label a few hundred dialect tickets and measure. Three months in, we are still adding dialect examples whenever a misroute reveals a new pattern.
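Measuring the floor amounts to splitting the misroute rate by dialect tag. A minimal sketch, assuming each labelled ticket carries a `dialect` field added by the labellers (a hypothetical field name); the sample rows are made up.

```typescript
interface LabelledTicket {
  dialect: string;   // hypothetical tag, e.g. "standard" or "twents"
  expected: string;  // department chosen by the labellers
  predicted: string; // department chosen by the model
}

// Misroute rate per dialect: wrong departments divided by tickets seen.
function misrouteRateByDialect(tickets: LabelledTicket[]): Record<string, number> {
  const totals: Record<string, number> = {};
  const misses: Record<string, number> = {};
  tickets.forEach((t) => {
    totals[t.dialect] = (totals[t.dialect] || 0) + 1;
    if (t.predicted !== t.expected) {
      misses[t.dialect] = (misses[t.dialect] || 0) + 1;
    }
  });
  const rates: Record<string, number> = {};
  Object.keys(totals).forEach((dialect) => {
    rates[dialect] = (misses[dialect] || 0) / totals[dialect];
  });
  return rates;
}

// Made-up sample: one miss in four standard tickets, one miss in two Twents.
const sampleRates = misrouteRateByDialect([
  { dialect: "standard", expected: "A", predicted: "A" },
  { dialect: "standard", expected: "B", predicted: "B" },
  { dialect: "standard", expected: "A", predicted: "B" },
  { dialect: "standard", expected: "C", predicted: "C" },
  { dialect: "twents", expected: "A", predicted: "A" },
  { dialect: "twents", expected: "B", predicted: "C" },
]);
// sampleRates.standard is 0.25 and sampleRates.twents is 0.5
```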

What we shipped

We shipped Haiku 4.5 as the primary classifier with Llama 3.3 70B on Groq as the fallback when Anthropic returns a 529 overloaded. The JSON-schema validator runs on every output regardless of which model produced it. Misroutes per week dropped from 38 (the human triage clerk's baseline) to 11 over the first month on Haiku. Average citizen wait-time to first response dropped from 38 hours to 6.
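The primary-and-fallback arrangement can be sketched with two injected functions. This is an illustration under stated assumptions: `primary` and `fallback` are hypothetical wrappers around the two provider clients, the error is assumed to expose the HTTP status on a `status` property, and treating 529 as the only fallback trigger is a choice of this sketch.

```typescript
type Classify = (email: string) => Promise<string>;

// True only for errors carrying HTTP status 529 (overloaded).
// Assumes the client surfaces the status as `err.status`.
function shouldFallBack(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 529
  );
}

async function classifyWithFallback(
  email: string,
  primary: Classify,
  fallback: Classify
): Promise<{ raw: string; servedBy: "primary" | "fallback" }> {
  try {
    return { raw: await primary(email), servedBy: "primary" };
  } catch (err) {
    // Any other failure surfaces unchanged so it is not silently masked.
    if (!shouldFallBack(err)) throw err;
    return { raw: await fallback(email), servedBy: "fallback" };
  }
}
```

Returning `servedBy` alongside the raw output keeps it visible in logs which model produced each routing decision, while the same validator still runs on either result.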

The actual pipeline is small. A Cloudflare Worker receives the citizen email, strips PII it does not need, calls Haiku with the eight-shot prompt, validates the JSON against the routing schema, and writes the routed ticket to the municipality's case-management system. The whole hot path is under 200 lines of TypeScript. The hard work was the labelled benchmark, not the code.
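The shape of that hot path, reduced to its four steps, might look like the sketch below. It is not the project's code: every dependency is injected and hypothetical, and the Worker entry point, the provider call, and the case-management client are deliberately left out.

```typescript
interface RoutedTicket {
  department: string;
  urgency: string;
  summary_nl: string;
}

// All four steps are hypothetical stand-ins, injected so the flow
// can be exercised without any network access.
interface PipelineDeps {
  stripPii: (email: string) => string;
  classify: (email: string) => Promise<string>;
  validate: (raw: string) => RoutedTicket | null;
  writeTicket: (ticket: RoutedTicket) => Promise<void>;
}

async function handleEmail(
  email: string,
  deps: PipelineDeps
): Promise<"routed" | "rejected"> {
  const redacted = deps.stripPii(email);    // 1. drop PII the model does not need
  const raw = await deps.classify(redacted); // 2. model call with the few-shot prompt
  const ticket = deps.validate(raw);         // 3. check against the routing schema
  if (ticket === null) return "rejected";    //    caller decides what happens next
  await deps.writeTicket(ticket);            // 4. hand off to case management
  return "routed";
}
```

Keeping the steps behind an interface is what lets a labelled benchmark reuse the same flow: the harness swaps `classify` for each candidate model and `writeTicket` for a recorder.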

The recent press around frontier models converging matters here. When the top of the leaderboard flattens, the question stops being "which model is smartest" and starts being "which model costs least to put in front of citizens without a human watching". That answer is no longer obvious from a benchmark page.

When we built this citizen-mail agent for the municipality, the gap between the vendor benchmark and the metric our client actually got invoiced for was the entire project. We rebuilt the benchmark in two weeks and have reused the same harness on the next three AI agents we shipped. If you are picking a classifier today, label a thousand tickets and run them through three candidates before you sign a contract. The leaderboard cannot do that for you.

Key takeaway

Pick a classifier on your own labelled data, not a public leaderboard. The cheapest model is the one whose misroutes you do not have to fix.

FAQ

Why not just use a Sonnet-tier model and skip the comparison?

Sonnet is overkill for single-pass classification, and the cost at 28,000 tickets a month is hard to justify when Haiku 4.5 hit 94.3% on this task. Save the expensive models for jobs that need their capability.

How did you handle GDPR with US-hosted providers?

Mistral is an EU-incorporated provider. Anthropic offers EU data residency on enterprise contracts. Groq does not, at time of writing, which is why it sits as the fallback and not the primary in our deployment.

Why not fine-tune an open-weights model on your own GPUs?

We costed it. Below roughly 200,000 tickets a month it does not pay back the labelling, hosting, and ops overhead. Above that volume it starts to win. This municipality sits below the line.

Can a smaller model handle Dutch regional dialects?

Not at acceptable accuracy in our test. Mistral Small 3 misrouted Limburgs and Twents tickets about twice as often as Standard Dutch. Haiku 4.5 closed most of that gap. The dialect floor is real and worth measuring.
