Support ticket routing: build a classifier or pay per token

Three numbers decide whether you should train a 50-line classifier for your support inbox or send every ticket to a frontier model. Volume, drift, and who retrains it Monday.

Jacob Molkenboer · Founder · A Brand New Company · 9 Jun 2026 · 6 min

It is Tuesday afternoon. The ops lead at a fourteen-person Dutch SaaS opens her routing dashboard. Six hundred tickets came in last week. About sixty went to the wrong queue. The CTO has asked whether they should “just use AI” to fix it. Two proposals sit on her desk: build a tiny classifier in-house, or pipe every ticket through a frontier model. Neither answer is obvious. The wrong one will cost roughly €18k in the first year.

This is a question we get pulled into about once a quarter. It reads like a build-versus-buy decision. It is not. It is a three-variable scoring problem, and the variables usually live in a spreadsheet the company already has.

The two roads

Road one is a small supervised classifier. fastText, logistic regression on TF-IDF, or a distilled encoder. Train it on six months of past tickets that already carry a category. Run it on a single CPU. Cost per inference is a fraction of a cent in compute, plus ops time to retrain when the data shifts.

Road two is a frontier model called per ticket. Hand it the ticket text and a prompt that lists your queues. Receive a queue name back. Cost per inference lands somewhere between €0.001 and €0.02 depending on context size and tier. No training. No retraining. You inherit the vendor’s roadmap.
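A minimal sketch of what road two looks like in code, assuming a closed list of queues. `call_model` is a hypothetical stand-in for whatever vendor API you use, and the `triage` fallback queue is our assumption, not something the vendor provides. The point is the validation step: never trust the reply to be a real queue name.

```python
from typing import Callable, Sequence


def build_prompt(ticket_text: str, queues: Sequence[str]) -> str:
    """Ask for exactly one queue name from a closed list."""
    options = "\n".join(f"- {q}" for q in queues)
    return (
        "Route the support ticket below to exactly one queue.\n"
        f"Allowed queues:\n{options}\n"
        "Reply with the queue name only.\n\n"
        f"Ticket:\n{ticket_text}"
    )


def route_with_model(
    ticket_text: str,
    queues: Sequence[str],
    call_model: Callable[[str], str],  # hypothetical wrapper around your vendor's API
    fallback: str = "triage",          # assumed catch-all queue for invalid replies
) -> str:
    """Return a queue from the list, or the fallback if the reply is not on it."""
    reply = call_model(build_prompt(ticket_text, queues)).strip().lower()
    by_lower = {q.lower(): q for q in queues}
    return by_lower.get(reply, fallback)
```

Anything that lands in the fallback queue is worth counting. A rising share is an early sign that the prompt or the queue list needs attention.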

Both work on day one. The interesting question is which one stays cheap and correct in eighteen months.

The three axes

We score the decision on three numbers the company already knows.

Volume. Tickets per day. Easy to pull from any helpdesk.

Drift risk. How often does the queue structure or the language of incoming tickets change? Did marketing rename a product line last quarter? Did the company just open Germany? Does a new feature ship every month that nobody categorises correctly the first week?

Retrain capacity. Who on the ops team can actually retrain the classifier when accuracy slides to 84%? If the honest answer is “we’d file a ticket with our dev shop,” that is a real number. It is the number of weeks between drift detection and a fix.

Notice what we do not score on: accuracy on day one. Both roads start north of 90% on a well-defined queue set. The fight is months three through twelve.

The scoring rubric

Give each axis a score from 1 to 3 for each road, then add the columns.

Volume. Under 200 tickets per day: 1 toward classifier, 3 toward frontier. Per-ticket cost is rounding noise and you save the training labour. 200 to 2,000 per day: 2 each. Over 2,000 per day: 3 toward classifier, 1 toward frontier. At 2,000 tickets a day, €0.003 per call is €6 per day, about €180 per month, and the bill scales linearly with both volume and price: at the €0.02 end of the range the same volume costs about €1,200 per month. A classifier on a single t4g.small is a small flat cost by comparison.

Drift risk. Stable taxonomy, no major language shift expected: 3 toward classifier, 1 toward frontier. You retrain twice a year and move on. Moderate drift, product expanding: 2 each. High drift, new queues every quarter, multilingual rollout under way: 1 toward classifier, 3 toward frontier. The classifier’s accuracy will collapse and the retrain cycle will not keep up.

Retrain capacity. Someone in-house can run a Jupyter notebook and read a confusion matrix: 3 toward classifier, 1 toward frontier. Ops can label new examples but needs help with the actual retrain: 2 each. Nobody can: 1 toward classifier, 3 toward frontier. Do not build something that will rot.

Add the columns. Higher wins.
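The rubric is small enough to write down as a lookup. This is a sketch, not a tool we ship. The band names (`stable`, `in_house`, and so on) are our labels for illustration. Where the rubric gives only the classifier score for a band, the sketch assumes the frontier score mirrors it, and it reports a tie when the columns are equal, which the rubric does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    classifier: int
    frontier: int


def score_volume(tickets_per_day: int) -> Scores:
    """Volume bands from the rubric: under 200, 200 to 2,000, over 2,000."""
    if tickets_per_day < 200:
        return Scores(1, 3)
    if tickets_per_day <= 2000:
        return Scores(2, 2)
    return Scores(3, 1)  # frontier score of 1 is assumed by symmetry


# Hypothetical band names; map your own judgement onto one of them.
DRIFT = {"stable": Scores(3, 1), "moderate": Scores(2, 2), "high": Scores(1, 3)}
RETRAIN = {"in_house": Scores(3, 1), "label_only": Scores(2, 2), "nobody": Scores(1, 3)}


def decide(tickets_per_day: int, drift: str, retrain: str) -> tuple[int, int, str]:
    """Add the columns and name the higher one."""
    parts = [score_volume(tickets_per_day), DRIFT[drift], RETRAIN[retrain]]
    classifier = sum(p.classifier for p in parts)
    frontier = sum(p.frontier for p in parts)
    if classifier == frontier:
        winner = "tie"  # assumption: the rubric does not define a tie-break
    else:
        winner = "classifier" if classifier > frontier else "frontier"
    return classifier, frontier, winner
```

A team at 450 tickets a day with a stable taxonomy and someone in-house who can retrain scores `decide(450, "stable", "in_house")`, which returns `(8, 4, "classifier")`.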

Takeaway

The frontier model is the right answer for low-volume, high-drift teams with no ML capacity. The classifier wins everywhere else, often by a wide margin.

A worked example

A nine-person Dutch HR-tech SaaS, roughly €1.2M ARR, 450 support tickets a day across six queues. Stable product. Slow taxonomy change. Their head of customer ops once shipped a Streamlit dashboard for fun.

Scores, classifier versus frontier. Volume: 2 versus 2. Drift: 3 versus 1. Retrain: 3 versus 1. Classifier wins 8 to 4.

We trained a logistic regression on TF-IDF features over Dutch and English combined, on eighteen months of historical tickets. Twenty-two minutes of compute. Accuracy on a holdout set: 91.4%. Cost to run: one €6 per month VPS. Cost to retrain quarterly: about three hours of the ops lead’s time, plus a checklist.

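from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# tickets_train and tickets_test are assumed to be pandas DataFrames
# with a `text` column (ticket body) and a `queue` column (label).

pipe = Pipeline([
    # Word unigrams and bigrams; drop terms seen in fewer than 3 tickets.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_features=50_000)),
    # Balanced class weights so small queues are not drowned out by large ones.
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

pipe.fit(tickets_train.text, tickets_train.queue)
# Mean accuracy on the holdout set.
print(pipe.score(tickets_test.text, tickets_test.queue))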

Counter-example. A four-person legal-tech startup, 60 tickets a day, three new product surfaces planned this year, no ML person. Scores, classifier versus frontier. Volume: 1 versus 3. Drift: 1 versus 3. Retrain: 1 versus 3. Frontier wins 9 to 3. At 60 tickets a day, paying €0.004 per classification is about €7 per month. A classifier would have been the slower, more expensive answer.
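The per-month figure is plain multiplication, and it is worth running for your own volume before anyone argues about it. A sketch, assuming a 30-day month and a flat price per call; the three prices are the ends of the range quoted earlier plus the €0.004 used here.

```python
def monthly_api_cost(tickets_per_day: float, eur_per_call: float, days: int = 30) -> float:
    """Frontier-model spend per month at a flat per-call price (30-day month assumed)."""
    return tickets_per_day * eur_per_call * days


# 60 tickets a day at the low end, the example price, and the high end.
for price in (0.001, 0.004, 0.02):
    print(f"60/day at EUR {price}: EUR {monthly_api_cost(60, price):.2f} per month")
```

Even at the top of the price range, 60 tickets a day stays well under the cost of an hour of anyone's time per month.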

Where the method breaks

Two failure modes show up often enough to name.

First, people score volume on today’s number and forget projected growth. A SaaS at 200 tickets a day this month and tripling year over year is past 1,000 a day in eighteen months and past 2,000 in about two years. That is a classifier company in waiting. Score it accordingly.

Second, they overestimate ops capacity. “Someone can retrain it” turns out to mean “the CTO will do it on a Saturday, twice.” That is not a maintenance plan. If the retrain capacity score is uncertain, round it down.
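The growth caveat above is easy to check with compound growth. A sketch, assuming the growth rate holds constant, which it rarely does; treat the output as a rough horizon, not a forecast.

```python
import math


def projected_volume(current_per_day: float, yearly_growth_factor: float, months: float) -> float:
    """Tickets per day after `months` of constant compound growth."""
    return current_per_day * yearly_growth_factor ** (months / 12)


def months_to_reach(current_per_day: float, target_per_day: float, yearly_growth_factor: float) -> float:
    """Months until volume reaches the target; infinity if volume is not growing."""
    if current_per_day >= target_per_day:
        return 0.0
    if yearly_growth_factor <= 1:
        return math.inf
    years = math.log(target_per_day / current_per_day) / math.log(yearly_growth_factor)
    return 12 * years


print(round(projected_volume(200, 3, 18)))    # volume after eighteen months
print(round(months_to_reach(200, 2000, 3)))   # months until 2,000 a day
```

For 200 tickets a day tripling yearly, this gives roughly 1,040 a day at eighteen months and about 25 months to reach 2,000.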

Warning

A classifier you cannot retrain is worse than no classifier. It will lose accuracy quietly while everyone believes the routing is fine.

The hybrid the rubric hides

There is a third option the scoring obscures: route the confident 80% with a classifier and escalate the low-confidence tail to a frontier model. The classifier handles volume. The frontier model handles drift and edge cases. We have shipped this pattern twice. It only works if your team can read a confidence histogram and tune a threshold. If they cannot, pick a lane.
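A sketch of the hybrid, assuming a classifier that exposes `predict_proba` and `classes_` the way scikit-learn estimators and pipelines do. `escalate` is a hypothetical callable that wraps the frontier model. The `0.8` threshold is a placeholder, not a recommendation: it is a probability cut-off, not the share of tickets routed, and the right value comes from your own confidence histogram.

```python
from typing import Callable, Protocol, Sequence


class ProbabilisticClassifier(Protocol):
    """Anything with scikit-learn style class probabilities."""
    classes_: Sequence[str]

    def predict_proba(self, X: Sequence[str]) -> Sequence[Sequence[float]]: ...


def route_hybrid(
    text: str,
    clf: ProbabilisticClassifier,
    escalate: Callable[[str], str],  # hypothetical wrapper around the frontier model
    threshold: float = 0.8,          # placeholder; tune on held-out tickets
) -> tuple[str, str]:
    """Return (queue, source), where source says which road made the call."""
    probs = list(clf.predict_proba([text])[0])
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return clf.classes_[best], "classifier"
    return escalate(text), "frontier"
```

Logging the `source` value per ticket gives you the escalation rate for free, which is the number to watch when tuning the threshold.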

Hacker News spent this week arguing that frontier capability gains are slowing. Whether that is true at the very top or not, the implication for ticket routing is small. Per-token prices keep falling. Classifier costs are already near zero. The decision is still about your data and your team, not about which way the capability curves point this quarter.

What to do this week

When we built the support-router for a Dutch HR-tech SaaS this spring, the thing we ran into was not accuracy. It was getting the ops lead a retrain notebook she could run without us. We ended up writing a 40-line script that pulls last month’s tickets, retrains the model, and produces a confusion matrix as a PDF. That kind of unglamorous detail decides whether a small AI agent survives its second year.
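This is not the script we wrote for that client. It is a standard-library sketch of the reporting half only: given the true queues and the predicted queues for last month’s tickets, it prints a confusion matrix as plain text. Rows are the true queue, columns the predicted one. The PDF step and the retrain itself are left out.

```python
from collections import Counter
from typing import Sequence


def confusion_table(y_true: Sequence[str], y_pred: Sequence[str]) -> str:
    """Plain-text confusion matrix: one row per true queue, one column per predicted queue."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    width = max(len(label) for label in labels) + 2
    lines = [" " * width + "".join(label.rjust(width) for label in labels)]
    for true_label in labels:
        cells = "".join(str(counts[(true_label, pred)]).rjust(width) for pred in labels)
        lines.append(true_label.ljust(width) + cells)
    return "\n".join(lines)


print(confusion_table(
    ["billing", "billing", "technical", "account"],
    ["billing", "technical", "technical", "account"],
))
```

Off-diagonal cells are the misroutes. If one cell keeps growing month over month, that pair of queues is where the labelling effort should go.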

Smallest thing you can do today: pull last month’s tickets from your helpdesk, count them, count how many landed in the wrong queue, and score your own team honestly on the three axes. The answer is usually clear inside thirty minutes.

Key takeaway

The frontier model wins when volume is low, drift is high, and nobody can retrain. The classifier wins everywhere else, often by a wide margin.

FAQ

What accuracy should we expect from a small classifier?

A tuned logistic regression on TF-IDF over six months of labelled tickets usually lands between 88% and 93% on a holdout set. A distilled encoder can push that to 94 or 95%.

How often do we need to retrain it?

For a stable product, once a quarter is normal. For a product expanding into new markets or features, every six to eight weeks. Tie the schedule to confusion-matrix drift, not a calendar.
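One way to tie retraining to drift instead of the calendar, sketched with the standard library. It assumes someone hand-labels a small sample of recent tickets each month, and that you stored per-queue recall at the last retrain. The 5-point drop that triggers a retrain is an arbitrary placeholder.

```python
from collections import Counter
from typing import Mapping, Sequence


def recall_by_queue(y_true: Sequence[str], y_pred: Sequence[str]) -> dict[str, float]:
    """Share of each true queue's tickets that the model routed correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {queue: hits[queue] / n for queue, n in totals.items()}


def queues_needing_retrain(
    baseline: Mapping[str, float],
    recent: Mapping[str, float],
    max_drop: float = 0.05,  # placeholder threshold; pick your own
) -> list[str]:
    """Queues that are new since the baseline, or whose recall fell by more than max_drop."""
    return sorted(
        queue
        for queue, recall in recent.items()
        if queue not in baseline or baseline[queue] - recall > max_drop
    )
```

An empty list means the calendar can wait. A queue that shows up two months running is the signal to retrain.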

Can we use a frontier model only for low-confidence cases?

Yes. Route the confident classifier predictions directly and escalate the bottom 10 to 20% to the larger model. It works when your ops team can read a confidence histogram.

Does latency matter for ticket routing?

Rarely. A classifier returns in under 50ms. A frontier model returns in 500ms to 3 seconds. The human reading the ticket is almost always the bottleneck.
