
Chat agents in accounting: a Zwolle firm's 60-second SLA

It is 17:42 on a Tuesday in Zwolle. A 27-person accountantskantoor has eighteen minutes before the SBR-loket closes. The chat agent has 60 seconds to park the awkward ones.

Jacob Molkenboer · Founder · A Brand New Company · 21 Jun 2026 · 9 min


It is 17:42 on a Tuesday in March in Zwolle. An RA-controller — call her M. — has eighteen minutes before the SBR-loket at KvK closes its 18:00 deponeer-window. On her screen: 47 jaarrekening-vragen the chat agent flagged in the last hour, each tagged with a probable NV COS 4410 samenstellings-uitzondering. She works the queue at 22 seconds per item, signs eleven jaarrekeningen, sends three back to the dossierhouder for herclassificatie, and parks the rest until tomorrow. At 17:59:40 the last filing clears. She closes the laptop and goes home.

That cadence is what we built for. Here is what the agent does, and what it took to wire it into a fourteen-year-old stack without breaking anything.

The firm and the stack

Twenty-seven people. About 1,400 SME clients on the books — mostly retail, horeca, kleine bouw. Two RA-controllers, four AA-accountants, eleven dossierhouders, the rest in salaris en BTW. One IT-coördinator who is also the office manager. No DevOps. No data team.

The dossier side runs on CaseWare Cloud, which the firm bought in 2012 and has customised in seven different ways since. CaseWare holds the working papers, the trial balance, and the cliëntacceptatie. Almost nothing else lives there.

The actual klant-dossier-archief — every signed jaarrekening, every assurance-rapport, every concept-deponeerset since 2009 — sits in a homegrown SQL Server 2017 instance on a Hyper-V host in the serverruimte. The schema has 184 tables. There is one dbo.Documents table with 2.4M rows. It is not pretty, but it works, and the partner who built it still works at the firm. The chat agent has to read from both.

What the queue actually looks like

The chat lives on the firm's klantportaal. Clients drop in jaarrekening-vragen all day: "klopt deze afschrijvingstabel?" (is this depreciation schedule correct?), "waarom staat mijn DGA-lening nu in box 1?" (why is my director-shareholder loan now in box 1?), "kunnen jullie de deponeerset vóór vrijdag inschieten?" (can you submit the filing set before Friday?). We measured 1,520 berichten per week on average across the last full quarter. The peak week was 2,180, in the run-up to the 31 May filing deadline for medium-sized bv's.

Most are routine. A trial balance lookup, a dossier-status check, a "wanneer is de RC-rekening getoetst" question. The agent answers those by querying CaseWare's REST API for the working file and joining against the SQL Server archive on KvK-nummer. Median round trip: 1.8 seconds.

The interesting ones are the NV COS 4410 cases — the samenstellings-uitzonderingen. NV COS 4410 is the Dutch standard for samenstellingsopdrachten. If the agent suspects a client question implies a discontinuïteit, a stelselwijziging, or a materiële schattingswijziging that the dossierhouder missed in the conceptfase, the agent is forbidden from answering. It has to park the message, raise a flag, and route it to a registeraccountant.

The 60-second SLA

The rule we agreed with the firm: any message with a probable 4410 exception is in the RA-controller queue within 60 seconds, with the relevant excerpt from the dossier inline.

Sixty seconds is not a technical limit. It is a behavioural one. The RA-controllers have a 17:00–18:00 deponeer-sprint on filing days. If the queue updates every hour, they miss things. If it updates every five minutes, they miss the conditions at 17:55 when the SBR adapter starts to misbehave under load. If it updates within sixty seconds, they trust it. Trust is what determines whether they work the queue at all.

Hitting sixty seconds for a model call with two database joins, a CaseWare API hop, and a reasoning step is not free. We had to be careful about three things: the classifier, the archive read, and what to do when the model says "maybe".

Classifying a 4410 exception

We did not start with a model. We started with a checklist the firm's senior RA wrote on the back of a print of the standard. It had seventeen rules. Some were keyword-trivial ("the word 'discontinuïteit' in the bericht"). Most were context-dependent ("the client's omzet dropped >30% YoY and the dossierhouder has not updated the going-concern note in the last 90 days").

The agent runs the checklist first, as deterministic Python. Roughly 38% of messages trip at least one of those rules on their own. Those go straight to the controller queue, with no model call, in under four seconds. The classifier never sees them.
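
A minimal sketch of what "checklist first, as deterministic Python" can look like, using the two example rules quoted above. The `Message` fields, the rule IDs, and the route names are hypothetical and stand in for whatever the real dossier lookup provides; this is not the firm's actual ruleset.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Message:
    text: str
    revenue_change_yoy: float            # hypothetical field: -0.35 means a 35% drop
    days_since_going_concern_note: int   # hypothetical field from the dossier


@dataclass(frozen=True)
class Rule:
    rule_id: str
    fires: Callable[[Message], bool]


# Two of the checklist rules quoted in the text; the IDs are invented here.
RULES: List[Rule] = [
    Rule("R01-keyword-discontinuiteit",
         lambda m: "discontinuïteit" in m.text.lower()),
    Rule("R02-revenue-drop-stale-note",
         lambda m: m.revenue_change_yoy < -0.30
         and m.days_since_going_concern_note > 90),
]


def first_rule_hit(message: Message, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the ID of the first rule that fires, or None if none do."""
    for rule in rules:
        if rule.fires(message):
            return rule.rule_id
    return None


def route(message: Message) -> str:
    """Park on any deterministic hit; only the rest reach the model."""
    if first_rule_hit(message) is not None:
        return "controller_queue"
    return "model_classifier"
```

Keeping each rule as a small named predicate makes it cheap to append the rules that come out of a weekly review, and the rule ID can travel with the parked message.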

The remaining 62% go to a Claude Sonnet call with the message, the prior six months of the client's dossier-headers, and the seventeen rules in the system prompt. Output is a forced JSON object, typed so the call site can validate it before any routing decision:

from typing import Literal, Optional, TypedDict


class ClassifierOutput(TypedDict):
    is_4410_exception: bool
    rule_hit: Optional[str]                        # which of the 17, or None
    confidence: Literal["low", "medium", "high"]
    excerpt_to_show_ra: str                        # max 400 chars from the dossier
    fallback_reply_to_client: str                  # what the agent would have said
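
A `TypedDict` only helps static type checkers; it does not check anything at runtime. Validating the model's raw JSON before routing therefore needs an explicit step. The sketch below is one way to do that with the standard library. The function name and error messages are assumptions, and the 400-character limit is taken from the comment in the type above.

```python
import json

REQUIRED_FIELDS = {
    "is_4410_exception": bool,
    "rule_hit": (str, type(None)),
    "confidence": str,
    "excerpt_to_show_ra": str,
    "fallback_reply_to_client": str,
}
CONFIDENCE_LEVELS = {"low", "medium", "high"}
MAX_EXCERPT_CHARS = 400


def parse_classifier_output(raw: str) -> dict:
    """Parse and validate the model's JSON; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    if data["confidence"] not in CONFIDENCE_LEVELS:
        raise ValueError("confidence must be low, medium or high")
    if len(data["excerpt_to_show_ra"]) > MAX_EXCERPT_CHARS:
        raise ValueError("excerpt exceeds 400 characters")
    return data
```

What the caller does with a `ValueError` is a design choice. In a regulated triage flow, treating an unparseable output as "park for a human" is the conservative option.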

On a high-confidence hit, the message is parked, the controller gets a notification, and the client sees a one-liner: "We laten dit even nakijken door een RA-collega. Je krijgt vandaag nog antwoord." (We are having an RA colleague check this. You will get an answer today.)

On low or medium the agent still answers the client, but copies the controller queue with the fallback_reply_to_client for post-hoc review. Roughly 6% of those get a "this should have been parked" correction in the weekly review, and the rule that should have fired is added to the deterministic checklist. The checklist is now at 31 rules and growing about one a week.
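
The routing described in the last two paragraphs can be sketched as a small pure function. The action names are hypothetical. The branch for a high-confidence "not an exception" result is also an assumption, since the text only spells out the high-confidence hit and the low or medium paths.

```python
from typing import List

HOLDING_REPLY = (
    "We laten dit even nakijken door een RA-collega. "
    "Je krijgt vandaag nog antwoord."
)


def route_classified(output: dict) -> List[str]:
    """Map a validated classifier output to the actions described above."""
    is_hit = output["is_4410_exception"]
    confidence = output["confidence"]

    if is_hit and confidence == "high":
        # Parked: the client only sees the holding one-liner.
        return ["park_message", "notify_controller", "send_holding_reply"]
    if confidence in ("low", "medium"):
        # The client is answered, and the controller queue gets a copy for review.
        return ["send_fallback_reply", "copy_to_controller_queue"]
    # Assumption: a high-confidence non-exception is answered without a copy.
    return ["send_fallback_reply"]
```

Returning a list of actions rather than performing them keeps the decision testable on its own, separate from the queue and notification plumbing.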

Takeaway

The moat is the weekly review of the medium-confidence path. The classifier improves not because the model gets smarter but because the deterministic ruleset absorbs everything the model got wrong.

Reading 14 years of dossier-archief in 200ms

The 2.4M-row dbo.Documents table was the second bottleneck. The original schema had a clustered index on InsertedAt and nothing on KvK-nummer. A lookup by client took 1.6 seconds on a cold cache, which blew the SLA before the model even ran.

We did not migrate the table. If we had, the partner who built it would have left the firm, and his goodwill is worth more than the seek time. We added a covering nonclustered index on (KvK_Nummer, DocumentType) INCLUDE (DocumentId, SignedAt, Title). Lookups dropped to 18ms. The original queries from CaseWare's custom integration still run unchanged.
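
For reference, the index described above would look roughly like the T-SQL string in the sketch below; the index name is hypothetical. Because a SQL Server instance is not something a reader can spin up in a snippet, the runnable part uses SQLite as a stand-in. SQLite has no `INCLUDE` clause, so the extra columns are appended as trailing key columns, which gives the same "answer the query from the index alone" effect for this lookup. The demo table is a cut-down assumption, not the real schema.

```python
import sqlite3

# T-SQL for the index described in the text (index name is hypothetical).
TSQL_COVERING_INDEX = """
CREATE NONCLUSTERED INDEX IX_Documents_KvK_DocType
ON dbo.Documents (KvK_Nummer, DocumentType)
INCLUDE (DocumentId, SignedAt, Title);
"""

LOOKUP = (
    "SELECT DocumentId, SignedAt, Title FROM Documents "
    "WHERE KvK_Nummer = ? AND DocumentType = ?"
)


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory SQLite analogue of the table and the index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Documents ("
        "DocumentId INTEGER PRIMARY KEY, KvK_Nummer TEXT, DocumentType TEXT, "
        "SignedAt TEXT, Title TEXT, InsertedAt TEXT, Body BLOB)"
    )
    # No INCLUDE in SQLite: SignedAt and Title become trailing key columns.
    # DocumentId is the rowid, which every SQLite index already carries.
    conn.execute(
        "CREATE INDEX IX_Documents_KvK_DocType ON Documents "
        "(KvK_Nummer, DocumentType, SignedAt, Title)"
    )
    return conn


def query_plan(conn: sqlite3.Connection, kvk: str, doc_type: str) -> str:
    """Return SQLite's plan description for the client lookup."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, (kvk, doc_type)).fetchall()
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print(query_plan(build_demo_db(), "12345678", "jaarrekening"))
```

The plan should report a search using a covering index, meaning the lookup never touches the base table rows. That is the property that matters when the table also carries large document columns.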

We also added a small read-replica — actually just a SQL Server 2017 instance on a second Hyper-V host with snapshot replication every 60 seconds. The agent reads from the replica. If the replica is more than 90 seconds behind, the agent fails closed and parks the message for a controller with a note that the archive is stale. In the four months since go-live, that has happened twice. Both times because of a Windows Update reboot we should have scheduled outside business hours.
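
The fail-closed rule is simple enough to sketch in a few lines. How the agent learns the time of the last applied snapshot is not described here, so `last_snapshot_at` is an assumed input (for example a heartbeat row on the replica), and the return values are hypothetical labels.

```python
from datetime import datetime, timedelta, timezone

MAX_REPLICA_LAG = timedelta(seconds=90)


def replica_is_fresh(last_snapshot_at: datetime, now: datetime) -> bool:
    """True if the replica is at most 90 seconds behind."""
    return now - last_snapshot_at <= MAX_REPLICA_LAG


def read_or_park(last_snapshot_at: datetime, now: datetime) -> str:
    """Fail closed: a stale archive means a human looks at the message."""
    if replica_is_fresh(last_snapshot_at, now):
        return "read_from_replica"
    return "park_for_controller: archive is stale"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(read_or_park(now - timedelta(seconds=30), now))
    print(read_or_park(now - timedelta(seconds=120), now))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check deterministic in tests.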

SBR and the 17:55 problem

The Standard Business Reporting layer at KvK has a known quirk that anyone who has ever filed a jaarrekening at 17:58 will recognise: between roughly 17:50 and 18:00 the deponeer-endpoint sometimes returns a 502 on the first POST and accepts on the retry. It is not documented. It is just true.

The agent does not file. The dossierhouder does. But the agent is asked, twenty times a day during deponeerweek, "is mijn jaarrekening al gedeponeerd?" (has my annual report been filed yet?). If the agent answers from the SBR poll-endpoint at 17:54 and the response is stale by 17:56, the client thinks they are filed when they are not, and the controller spends Wednesday morning explaining a boete to a panicked DGA.

So the agent stops answering deponeer-status questions at exactly 17:45 on filing days. The client sees: "Tussen 17:45 en 18:15 controleren we deponeerstatussen handmatig. Geen zorgen — een collega bevestigt vóór 18:30 of je jaarrekening binnen is." It is a feature flag, not a model decision. The model is not allowed to override it. The system prompt does not even know the flag exists.
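
A sketch of such a gate, sitting in front of the model rather than inside it. The 17:45 start comes from the text; the 18:15 end is an assumption read off the client-facing message, and the function names are hypothetical. `now_local` is expected to be Europe/Amsterdam wall-clock time.

```python
from datetime import datetime, time
from typing import Callable

FREEZE_START = time(17, 45)
FREEZE_END = time(18, 15)  # assumption: taken from the client-facing message

# First sentence of the client message quoted above.
FROZEN_REPLY = "Tussen 17:45 en 18:15 controleren we deponeerstatussen handmatig."


def status_path_frozen(now_local: datetime, is_filing_day: bool) -> bool:
    """True while deponeer-status questions must not be answered live."""
    return is_filing_day and FREEZE_START <= now_local.time() < FREEZE_END


def answer_status_question(
    now_local: datetime,
    is_filing_day: bool,
    live_answer: Callable[[], str],
) -> str:
    """Return the canned reply during the freeze; otherwise call the live path."""
    if status_path_frozen(now_local, is_filing_day):
        return FROZEN_REPLY
    return live_answer()
```

Because `live_answer` is only called outside the window, neither the SBR poll nor the model is touched while the flag is active, which is what keeps this a flag and not a model decision.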

What the firm shut off

Three things we built and turned off, all at the firm's request, all correctly.

The first was an auto-reply for cliëntacceptatie-vragen. The firm decided, correctly, that anything touching acceptatie has to go through a partner, not a chat. We left the classification in (so partners get a faster queue) but killed the auto-reply. The partner now sees a triaged queue of acceptatie-signals on Monday morning instead of a stack of unsorted email.

The second was a CaseWare write path. The agent was going to mark dossier-velden as "controle gewenst" directly in the working file. The compliance officer pointed out that any write to a working file is an audit-trail event and the agent does not have an AA-nummer. We made it write to a separate annotations table that the dossierhouder reads as a sidebar in CaseWare. Same end-state, no audit problem.

The third was a Claude-driven SBR-adapter that would have generated the deponeer-XBRL on the fly. We were enthusiastic about it. The RA-controllers were not. The XBRL has to be byte-identical to what the firm signs off on, and a model that is "almost always" correct is the wrong tool for a filing the bestuurder is legally liable for. We shipped the existing template-driven adapter instead.

The numbers, four months in

Messages handled by the agent without controller touch: 1,182 of 1,520 weekly, about 78%.

Messages parked in the RA queue: 264 weekly, about 17%. The remaining 5% are escalated to the dossierhouder, not the controller — usually because the question is administrative rather than vaktechnisch.

Messages where the model was wrong and the weekly review corrected it: 91 over four months, averaging five to six per week and dropping as the deterministic ruleset grew.

Median time from "client hits send" to "parked in controller queue with dossier excerpt inline": 41 seconds. P95: 58 seconds. P99: 71 seconds, with every reading above 60 seconds landing on cold replicas after the daily snapshot rotation. That is inside the SLA at P95, with one known failure mode in the tail.
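
For readers who want to run the same check on their own latency log, here is a minimal sketch using the nearest-rank percentile method. The method choice and the function names are assumptions; the text does not say how the percentiles above were computed.

```python
from typing import Dict, List, Sequence

SLA_SECONDS = 60.0


def percentile_nearest_rank(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def latency_report(samples: Sequence[float]) -> Dict[str, float]:
    """Median, P95 and P99 of the send-to-parked latency in seconds."""
    return {f"p{p}": percentile_nearest_rank(samples, p) for p in (50, 95, 99)}


def sla_breaches(report: Dict[str, float], sla: float = SLA_SECONDS) -> List[str]:
    """Names of the percentiles that exceed the SLA."""
    return [name for name, value in report.items() if value > sla]
```

Fed the three figures quoted above (41, 58 and 71 seconds), `sla_breaches` returns only `p99`.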

Hours given back to the two RA-controllers per week, measured by them, not by us: about nine. They use the hours to do more samenstellingen, not to leave earlier. That is the firm's choice.

What verified agents have to do with this

Two items from the front page this month are worth a line.

Anthropic announced on 19 June that some Claude API capabilities will require organisational ID verification from 8 July. For a firm that runs an agent in a regulated context — and "regulated" is putting it gently for an accountantskantoor — that is a feature, not a hurdle. The compliance officer was happier on Friday than she was on Thursday.

Cloudflare is shipping temporary, scoped accounts for AI agents. That maps to something we already do badly: provisioning a per-agent SQL login on the read-replica with a 24-hour manual rotation. The Cloudflare model is cleaner, and we will probably move to it for the next firm we build this for. The pattern is right: short-lived credentials scoped to one job.

The smallest thing you can do tomorrow morning

If you run an accountantskantoor — or any back-office that has a 60-minute window of stress at the end of each day — run this audit before you buy any tooling. For one week, log every message your team replies to with the timestamp of arrival and the timestamp of reply. Sort the rows. Look at the messages where the reply landed in the last fifteen minutes of the window. Those are the messages that should not have made it that far. Build for those first.
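
The audit fits in a short script once the week's log is in a list. The sketch below assumes naive local timestamps, a window that ends at 18:00, and a hypothetical `LoggedMessage` record; adjust all three to your own office.

```python
from datetime import datetime, time, timedelta
from typing import List, NamedTuple, Sequence


class LoggedMessage(NamedTuple):
    arrived_at: datetime
    replied_at: datetime


def late_replies(
    log: Sequence[LoggedMessage],
    window_end: time = time(18, 0),
    tail: timedelta = timedelta(minutes=15),
) -> List[LoggedMessage]:
    """Messages whose reply landed in the final `tail` of that day's window."""
    hits = []
    for msg in log:
        end = datetime.combine(msg.replied_at.date(), window_end)
        if end - tail <= msg.replied_at <= end:
            hits.append(msg)
    # Oldest arrivals first: these waited longest before the last-minute reply.
    return sorted(hits, key=lambda m: m.arrived_at)


if __name__ == "__main__":
    week = [
        LoggedMessage(datetime(2026, 3, 31, 9, 0), datetime(2026, 3, 31, 17, 50)),
        LoggedMessage(datetime(2026, 3, 31, 16, 0), datetime(2026, 3, 31, 16, 30)),
    ]
    for msg in late_replies(week):
        waited = msg.replied_at - msg.arrived_at
        print(msg.arrived_at, "waited", waited)
```

Sorting by arrival time puts the messages that sat longest at the top, which are usually the clearest candidates for earlier triage.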

When we built the agent for the Zwolle firm, the part that took longest was not the classifier. It was learning that "under 60 seconds" and "correct or parked" are the same SLA, and that the wrong move is to let the model guess into a regulated outcome. If you want to see how that turns into wiring, our chat agents page has the architecture diagram.

Key takeaway

For regulated triage, 'under 60 seconds' and 'correct or parked' are the same SLA — every uncertain answer goes to a human, never a guess.

FAQ

Why park NV COS 4410 exceptions instead of letting the agent answer them?

Because a samenstellings-uitzondering is a vaktechnisch oordeel, not a lookup. The agent has no AA- or RA-nummer and the firm's bestuurder is legally liable for the filing. Parking is the only correct move.

Why not migrate the SQL Server 2017 archive to something modern?

Because the partner who built it still works at the firm and the schema is load-bearing across seven CaseWare customisations. A covering index and a 60-second snapshot replica got us inside the SLA without touching a working system.

What happens if the model is wrong about a 4410 exception?

Medium- and low-confidence answers are copied into the controller queue with the fallback reply. The weekly review catches misclassifications and the rule that should have fired is added to the deterministic checklist, which now sits at 31 rules.

Why does the agent stop answering deponeer-status questions at 17:45?

Because the SBR endpoint goes flaky between 17:50 and 18:00 and a stale 'yes' answer turns into a KvK boete the next morning. A feature flag, not a model decision, freezes that path during the deponeer-sprint.
