Chat agent for hosting support: 62% of tier-1 in 4 minutes

A 24-person Haarlem managed-hosting reseller closes 62% of tier-1 tickets in under four minutes. The trick: a chat agent that refuses any action it cannot snapshot first.

Jacob Molkenboer · Founder · A Brand New Company · 23 Feb 2025 · 9 min

It is 21:14 on a Saturday in Haarlem. The on-call engineer at a 24-person managed-hosting reseller is two beers into a barbecue when his pager buzzes. A WordPress site that ships outbound order confirmations is bouncing every message with a 550 reject. The ticket has been open for nine minutes.

The chat agent on their support panel has already done the first four steps he would have done. It pulled the affected domain, checked SPF and DKIM, found the broken record, and drafted the fix. It is waiting for him to approve the DNS change. He approves it from his phone. The ticket closes at minute eleven.

Three months earlier, that ticket would have sat in the queue until Monday morning.

This post is the architecture and the honest results of the chat agent now running tier-1 for that reseller. They prefer not to be named, so we will call them the Haarlemmer. The headline number is real: 62% of their tier-1 hosting tickets close inside four minutes without an engineer touching them. The interesting part is what the agent refuses to do.

The before picture

The Haarlemmer runs roughly 9,000 sites for Dutch SMEs and a long tail of one-person agencies. Their stack is DirectAdmin on bare metal in two Dutch data centres, billed through WHMCS. Customers call the hosting panel "cPanel" because that is what they have always called it. Most of them migrated off cPanel two years ago, when the licence pricing climbed.

Their tier-1 ticket mix, sampled over a normal week before we touched anything:

  • Roughly a third was email delivery: DNS, SPF, DKIM, blacklist hits.
  • A fifth was "site is down" that turned out to be PHP version mismatches or silently failed SSL renewals.
  • About one in six was mailbox quota or password resets.
  • One in seven was database connection limits.
  • The rest was everything else, including the genuinely hard ones.

Average first-response time was 47 minutes during business hours and more than six hours on weekends. Their support lead, who has been doing this for nine years, could close most tier-1 tickets in under five minutes when she got to them. The bottleneck was never skill. It was queue depth.

What we built

Three pieces, stitched together with care about which one is allowed to do what.

WHMCS as the spine

Every ticket lives in WHMCS. Every customer has a client_id there. We did not move the ticket queue to a new tool. We hung the agent off WHMCS via its public API, so the source of truth for "did this ticket close" stays where the support lead already looks. Anything the agent does shows up in the ticket as a reply with a clear robot signature.

A read-only DirectAdmin mirror

The agent never logs into the live DirectAdmin API. It reads from a mirror we sync every 90 seconds: domain list, mailbox status, current PHP handler, last backup timestamp, current zone file, quota usage. Everything the agent needs to diagnose, nothing it could use to break things.

This sounds paranoid. It is. The point is that the agent can answer "what is the SPF record on this domain right now" in 40 milliseconds without ever holding a credential that could write to production.
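
A minimal sketch of what a diagnosis against the mirror could look like, assuming the mirror exposes TXT records per domain. The `Map` shape, the sample domain, and the names `findSpf` and `spfIncludes` are illustrative assumptions, not the reseller's code.

```javascript
// Hypothetical mirror contents: domain -> array of TXT record strings.
const sampleMirror = new Map([
  ['example.nl', ['some-verification=abc', 'v=spf1 include:_spf.example-mail.test ~all']],
])

// Return the SPF record for a domain, or null if none is published.
function findSpf(mirror, domain) {
  const txt = mirror.get(domain) || []
  return txt.find((record) => /^v=spf1(\s|$)/i.test(record)) || null
}

// True if the SPF record has an include: mechanism for the given host.
// An optional qualifier (+, -, ~, ?) in front of the mechanism is ignored.
function spfIncludes(spf, host) {
  if (!spf) return false
  const wanted = `include:${host.toLowerCase()}`
  return spf
    .toLowerCase()
    .split(/\s+/)
    .some((term) => term.replace(/^[+\-~?]/, '') === wanted)
}
```

Both functions only read. Nothing in this path needs a credential, which is the point of the mirror.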

A Claude-routed escalation gate

When the agent decides it wants to change something (rewrite a zone file, restart a process, reset a mailbox password), it does not just do it. It calls a gate. The gate has one rule: every action that mutates state must produce a snapshot first, scoped to exactly what is about to change, with a one-hour TTL. If the snapshot fails, the action does not happen. The ticket escalates to a human with the diagnosis attached.

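// snapshot-gate.js
// Refuses any action the agent cannot atomically undo.
// `directadmin` and `audit` are assumed to be the gate's own clients,
// set up elsewhere in this module. The agent itself never holds them.

const SNAPSHOT_TTL_MINUTES = 60

async function gateAction(action, ctx) {
  // Read-only actions have nothing to undo, so they pass straight through.
  if (!action.mutates_state) {
    return { allow: true, snapshot: null }
  }

  // Snapshot only the scope that is about to change.
  // A thrown error is treated the same as a failed snapshot.
  let snap
  try {
    snap = await directadmin.snapshot({
      domain: ctx.domain,
      scope: action.scope, // 'zone' | 'mailbox' | 'site' | 'db'
    })
  } catch (err) {
    snap = { ok: false, error: err.message }
  }

  // No snapshot, no action: hand the ticket to a human instead.
  if (!snap.ok) {
    return {
      allow: false,
      reason: `no_snapshot: ${snap.error}`,
      escalate_to: 'human',
    }
  }

  // Record which snapshot backs this action before allowing it.
  await audit.write({
    ticket: ctx.ticket_id,
    action: action.name,
    snapshot_id: snap.id,
    ttl_minutes: SNAPSHOT_TTL_MINUTES,
  })

  return { allow: true, snapshot: snap.id }
}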

The escalation routing itself is a small Claude call that reads the diagnosis, the proposed action, and the customer's plan tier, then decides whether this goes to tier-1 review, tier-2, or straight to the on-call. The model is not the thing closing tickets. The model is the thing deciding which human should not have to look at this one.
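
A rough sketch of the plumbing around that routing call, with the model call itself left out. The route names, the prompt wording, and the choice to fall back to the on-call when the reply is not a known route are all assumptions made for illustration.

```javascript
// Hypothetical route identifiers for the three destinations named above.
const ROUTES = ['tier1_review', 'tier2', 'on_call']

// Build the text sent to the model from the three inputs the router reads.
function buildRoutingPrompt({ diagnosis, proposedAction, planTier }) {
  return [
    'Pick exactly one route for this escalated hosting ticket.',
    `Allowed routes: ${ROUTES.join(', ')}`,
    `Plan tier: ${planTier}`,
    `Diagnosis: ${diagnosis}`,
    `Proposed action: ${proposedAction}`,
    'Reply with the route name only.',
  ].join('\n')
}

// Accept the reply only if it is one of the known routes.
// Anything else goes to the fallback route (an assumption: the on-call).
function parseRoute(reply, fallback = 'on_call') {
  const route = String(reply).trim().toLowerCase()
  return ROUTES.includes(route) ? route : fallback
}
```

Validating the reply against a fixed list keeps a malformed answer from turning into an unrouted ticket.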

Takeaway

The agent's job is not to be right. It is to be reversible. Everything else falls out of that.

How the 62% breaks down

The headline is "62% of tier-1 closed in under four minutes." That number is honest but it is boring without the shape.

  • Email delivery: roughly four in five auto-resolve. Almost all of these are missing SPF includes after a new mail provider, or a DKIM selector mismatch. Both are mechanically diagnosable from the read-only mirror.
  • "Site is down": a little over half auto-resolve. PHP version bumps, SSL renewals that failed silently, and one specific WordPress plugin that consumes all available PHP-FPM workers.
  • Mailbox issues: three in four auto-resolve. Quota raises within plan, password resets after identity verification through WHMCS.
  • Database limits: only about one in five auto-resolves. Most of these need a human because the fix is "your code has a leak," and the agent is not allowed to be the one telling a customer their developer made a mistake.
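
To see the shape as arithmetic, the sketch below weights each auto-resolve rate by that category's share of the ticket mix from the before picture. Every input is a rounded fraction taken from this post, and 0.55 is an assumed stand-in for "a little over half", so read the output as a rough shape check rather than a reconciliation of the 62%.

```javascript
// Share of tier-1 volume and auto-resolve rate per named category.
// Fractions are the post's rounded figures; 0.55 is an assumption.
const categories = [
  { name: 'email delivery', share: 1 / 3, autoResolve: 0.8 },
  { name: 'site is down', share: 1 / 5, autoResolve: 0.55 },
  { name: 'mailbox', share: 1 / 6, autoResolve: 0.75 },
  { name: 'database limits', share: 1 / 7, autoResolve: 0.2 },
]

// Sum of share * rate: the fraction of all tier-1 tickets these categories close.
function contribution(cats) {
  return cats.reduce((sum, c) => sum + c.share * c.autoResolve, 0)
}

const namedShare = categories.reduce((sum, c) => sum + c.share, 0)
const named = contribution(categories)

console.log(`named categories cover about ${(namedShare * 100).toFixed(0)}% of volume`)
console.log(`and contribute about ${(named * 100).toFixed(0)} points of auto-resolution`)
```

With these inputs the four named categories supply most of the headline figure. The remainder comes from the "everything else" bucket, and the exact split moves with the rounding.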

The 38% the agent does not close fall into two camps: tickets where the diagnosis is clear but the fix needs a human signature (renaming a primary mailbox, restoring a database, anything touching billing), and tickets where the agent's confidence is below a threshold we set deliberately high.

The escalation gate, in practice

The gate is the part of the system that took the longest to get right. The first version was a confidence score. We threw that out after three weeks, because confidence scores trained on past tickets do not predict the specific way a new ticket will break.

The version we shipped uses three checks, in this order:

  1. Can this be snapshotted? If not, escalate. No exceptions.
  2. Is the action on the agent's allowlist for this customer's plan? Plan tiers have different blast radii. A €9 per month customer's agent cannot restart MariaDB. A €290 per month customer's agent can, after snapshot.
  3. Has a human approved this action class for this customer in the last 90 days? If yes, the agent acts and posts the reply. If no, the agent prepares the reply and the action, and asks a human to approve once. After that, this class of action is pre-approved for the customer for 90 days.
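
The three checks above can be sketched as one function. The plan names, action class names, and allowlist contents are hypothetical, and treating an off-allowlist action as an escalation is an assumption. Only the order of the checks and the 90-day window come from the list.

```javascript
const APPROVAL_WINDOW_DAYS = 90
const DAY_MS = 24 * 60 * 60 * 1000

// Hypothetical allowlist: plan name -> action classes the agent may run.
const ALLOWLIST = {
  basic: ['dns_edit', 'mailbox_password_reset'],
  premium: ['dns_edit', 'mailbox_password_reset', 'restart_mariadb'],
}

// Returns 'escalate', 'ask_human_once', or 'act'.
// lastApprovedAt is a millisecond timestamp, or null if never approved.
function decide({ canSnapshot, plan, actionClass, lastApprovedAt, now = Date.now() }) {
  // 1. No snapshot, no action.
  if (!canSnapshot) return 'escalate'

  // 2. The action class must be on the allowlist for this plan.
  const allowed = ALLOWLIST[plan] || []
  if (!allowed.includes(actionClass)) return 'escalate'

  // 3. Consent: a human approved this class for this customer within the window.
  const fresh =
    lastApprovedAt != null && now - lastApprovedAt <= APPROVAL_WINDOW_DAYS * DAY_MS
  return fresh ? 'act' : 'ask_human_once'
}
```

Keeping the order fixed means a missing snapshot always wins, even for an action the customer has already approved.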

That third check is the one the support lead asked for after week two. Her instinct was right. Most of the friction was not "can the agent do this," it was "has this customer agreed to the agent doing this." Treating consent as a per-customer, per-action-class, time-boxed thing made the agent feel less like an intrusion and more like a junior who learned each customer's preferences.

What broke

Two things, both worth telling honestly.

The first broke in week four. A customer's mail had been routing through a third-party filter for two years. The agent diagnosed a "missing MX record" and proposed adding the DirectAdmin default. The snapshot saved us. The rollback was clean. But the agent had been confidently wrong, and its wrongness looked exactly like its rightness in the ticket. We fixed it by adding a check for any third-party MX, SPF include, or DKIM selector older than 30 days, and treating those as "human only" by default. The agent now writes a paragraph explaining what it sees and asks a human to confirm intent.
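
A minimal sketch of that check, assuming the mirror can say when each mail-related record was first seen. The record shape, the `firstSeen` field, and the `reseller.example` domain are hypothetical.

```javascript
const ESTABLISHED_DAYS = 30
const DAY_MS = 24 * 60 * 60 * 1000

// Hypothetical list of domains the reseller itself operates.
const OWN_DOMAINS = ['reseller.example']

// A host is third-party if it is not one of our domains or a subdomain of one.
function isThirdParty(host) {
  const h = host.toLowerCase().replace(/\.$/, '')
  return !OWN_DOMAINS.some((d) => h === d || h.endsWith(`.${d}`))
}

// records: [{ type: 'MX' | 'SPF_INCLUDE' | 'DKIM', host, firstSeen }]
// firstSeen is a millisecond timestamp.
// True when any third-party record has been in place for more than 30 days.
function requiresHuman(records, now = Date.now()) {
  return records.some(
    (r) => isThirdParty(r.host) && now - r.firstSeen > ESTABLISHED_DAYS * DAY_MS
  )
}
```

A long-standing third-party record is read as deliberate setup, so the agent describes what it sees instead of proposing a default.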

The second broke in week seven and was funnier. A customer's WordPress site had been hacked. The agent, asked "why is my site slow," correctly identified abnormally high PHP-FPM worker usage, suggested a worker count bump, and snapshotted the config. It was not wrong about the worker count. It was wrong about the question. The site was slow because it was mining cryptocurrency in the background. We added a class of pattern checks (process names that do not belong, outbound connections to known mining pools, file modification spikes in wp-content/uploads) that always escalate, regardless of how mechanical the surface symptom looks.
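
The always-escalate pattern checks could look roughly like this. The signal lists and the spike threshold are placeholders chosen for illustration. A real deployment would source them from its own baselines and threat feeds.

```javascript
// Placeholder signal lists. `xmrig` is a widely known miner binary name;
// the pool hostname is a made-up example.
const SUSPICIOUS_PROCESS_NAMES = ['xmrig']
const MINING_POOL_HOSTS = ['pool.miner.example']
const UPLOADS_SPIKE_PER_HOUR = 200 // assumed threshold

// Collect every compromise signal present on a host snapshot.
function compromiseSignals({ processes, outboundHosts, uploadsModifiedLastHour }) {
  const signals = []
  if (processes.some((p) => SUSPICIOUS_PROCESS_NAMES.includes(p.toLowerCase()))) {
    signals.push('unexpected_process')
  }
  if (outboundHosts.some((h) => MINING_POOL_HOSTS.includes(h.toLowerCase()))) {
    signals.push('mining_pool_connection')
  }
  if (uploadsModifiedLastHour > UPLOADS_SPIKE_PER_HOUR) {
    signals.push('uploads_modification_spike')
  }
  return signals
}

// Any single signal forces escalation, whatever the surface symptom is.
function mustEscalate(hostState) {
  return compromiseSignals(hostState).length > 0
}
```

Run before the diagnosis, this keeps a worker-count fix from being proposed on a host that shows signs of compromise.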

Warning

An agent that is confidently solving the wrong problem will still close tickets. Watch your reopen rate, not just your first-close rate.
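
A small sketch of tracking both rates side by side, assuming each ticket record carries a `closedByAgent` and a `reopened` flag (hypothetical field names).

```javascript
// tickets: [{ closedByAgent: boolean, reopened: boolean }]
function agentRates(tickets) {
  const closed = tickets.filter((t) => t.closedByAgent)
  const reopened = closed.filter((t) => t.reopened)
  return {
    // Share of all tickets the agent closed on its own.
    firstCloseRate: tickets.length ? closed.length / tickets.length : 0,
    // Share of those agent-closed tickets that came back.
    reopenRate: closed.length ? reopened.length / closed.length : 0,
  }
}
```

A rising first-close rate with a rising reopen rate is the signature of the week-seven failure: tickets closing on the wrong problem.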

Why the snapshot rule matters more than the model

A German court ruling in the news this month concerns whether Google is liable for false answers in its AI Overviews. The specifics will get litigated for years. The general direction is what every hosting company should already assume. If your platform answers a customer's question, the answer is yours. "The AI said so" is not a defence, and will not be one.

The snapshot gate is how you make that survivable. Not because the agent will never be wrong (it will), but because every wrong thing it does to a server can be undone in 60 seconds by a junior on-call, as long as it is caught inside the snapshot's one-hour window. The model is replaceable. The reversibility is the moat.

This is also the part that, in our experience, makes hosting teams comfortable enough to actually deploy an agent at all. The Haarlemmer's support lead told us in the kickoff that her real objection to chat agents was not that they would be dumb. It was that they would be dumb in ways that took her three hours to clean up. The gate took that off the table.

What this changes, and what it does not

The Haarlemmer did not fire anyone. They did not plan to. Their tier-1 queue used to eat about 18 engineer-hours a week. It now eats about seven. Those eleven hours went into a project they had been wanting to start for two years, migrating their oldest customers off a PHP 7.4 cluster onto a current build. That is what the agent bought them. Not a smaller team. A team doing work that had been impossible to staff.

The framing matters. There is a popular argument right now that CEOs who think AI replaces their employees are the ones who never understood what those employees were actually doing. We agree. The reseller's tier-1 engineers were never the bottleneck. The queue was. Removing the queue made the engineers more valuable, not less.

If you run a hosting business and want to try this

Before you build anything, do one audit. Pull your last 200 tier-1 tickets. For each, write down whether the diagnosis was mechanical, whether the fix was mechanical, and whether it could have been snapshotted before the fix. The tickets where all three are yes are your starting allowlist. Everything else is the human's job for now.
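
Once the audit is in a spreadsheet, tallying it takes a few lines. This sketch assumes one row per ticket with a `category` label and the three audit answers as booleans. The field names are illustrative.

```javascript
// rows: [{ category, mechanicalDiagnosis, mechanicalFix, snapshottable }]
// Returns [category, count] pairs for tickets where all three answers are yes,
// most frequent category first.
function startingAllowlist(rows) {
  const counts = new Map()
  for (const row of rows) {
    if (row.mechanicalDiagnosis && row.mechanicalFix && row.snapshottable) {
      counts.set(row.category, (counts.get(row.category) || 0) + 1)
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1])
}
```

The categories at the top of the result are the candidates for the agent's first allowlist. Anything that never appears stays with a human.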

When we built the chat agent for the Haarlemmer, the thing we kept running into was that "tier-1" had been a queue label, not a description of the work. We solved it by splitting the queue into "mechanically reversible" and "everything else" before the AI agent ever saw a ticket. That split is the audit you can do today, before any code gets written.

Key takeaway

Build chat agents that refuse to touch what they cannot snapshot first. Reversibility is the moat, not the model.

FAQ

What does the snapshot gate actually snapshot?

Only the scope of the proposed change: a zone file before a DNS edit, a mailbox before a password reset, a database before a restart. Not the whole server. Snapshots expire after 60 minutes.

Why DirectAdmin if customers call it cPanel?

The reseller migrated off cPanel two years ago when licence costs climbed, but customers still call the hosting panel cPanel by habit. The agent reads from the read-only DirectAdmin mirror and replies in customer-friendly language.

Did the chat agent replace support engineers?

No. The tier-1 queue used to eat about 18 engineer-hours a week; it now eats about 7. The roughly 11 hours freed up went into a long-overdue PHP migration project. Headcount did not change.

What happens when the agent cannot snapshot?

The action does not happen. The ticket escalates to a human with the full diagnosis attached, the proposed action visible, and a note explaining which snapshot failed and why.

How is consent handled per customer?

The first time an action class touches a customer, a human approves it once. After that, it is pre-approved for that customer for 90 days. New action classes always need fresh approval.

Tags: chat agents, ai agents, integrations, case study, automation, operations
