Chat agent for accountants: shipping inside an M365 tenant
A 29-person Mechelen firm had partners triaging portal questions after dinner. We shipped a chat agent that now answers 71% of the firm's 980 weekly questions without ever leaking a Wwft flag.

It is 21:47 in Mechelen. The on-call partner at a 29-person accountancy firm watches her phone vibrate again. The client portal queue holds 41 unanswered questions from today. Most are about VAT timing on Q2 receipts. A few concern year-end provisions. One mentions a cash deposit from a director's brother-in-law that she does not want to misread before bed. The chat agent we built for that firm now handles 71% of those questions before she ever sees them.
We are not naming the firm in this post (the brief was direct: write up the build, leave the brand off the screenshot). What we can describe is the shape of the work, because the architecture is portable to any mid-size accountancy firm running Visma eAccounting and Yuki side by side on Microsoft 365.
The intake before we wrote a line of code
The firm was not asking for a chatbot. They were asking for a way to clear a portal backlog that had crossed 980 questions a week. Three associates were burning their evenings on the same six question types. Partners were skimming for Wwft flags after the kids went to bed. Turnover among second-year associates was high enough that two of the four had quit in the past year, citing the portal queue specifically.
We spent the first two weeks reading. Not configuring tools, not benchmarking models. Reading. Eight weeks of historical portal conversations, anonymised on a partner's laptop, exported as JSONL. We tagged 1,640 of them by hand. Six categories accounted for 84% of the volume:
- VAT timing and quarterly declarations (31%)
- Receipt and invoice upload questions (22%)
- Year-end provisions and depreciation (12%)
- Payroll edge cases (9%)
- Director's current-account questions (6%)
- Bank reconciliation mismatches (4%)
The remaining 16% was the long tail. The agent was never going to handle it, and we told the partners that on day one. The agent's job was to clear the 84%, not everything. Founders and ops leads underestimate how much of an agent's success depends on this conversation. Scope it small enough that the senior partner can name the question types from memory.
Why the M365 tenant constraint shaped everything
The firm operates under Belgian and European data rules. Client conversations cover beneficial ownership, salary slips, sometimes medical-leave context for sick-pay calculations. They had already moved everything onto Microsoft 365 with EU-resident tenancy. The hard constraint at kick-off was that no client message could leave the tenant.
That ruled out a SaaS chat widget that ships transcripts to a US-hosted vendor. It also ruled out the path of least resistance, which would have been a generic ChatGPT-style integration with no audit trail. We needed the agent loop, the conversation history, and the embedding index all to live inside their Azure subscription.
The shape we landed on:
- Teams as the surface for internal staff. Outlook plus a portal embed for clients.
- Azure OpenAI in the firm's West Europe region for inference. Conversation logs in their own Cosmos DB, retention configurable per client.
- Embeddings indexed in Azure AI Search, fed from a curated knowledge base the senior partner maintains.
- Visma eAccounting and Yuki accessed as read-mostly tools. Write actions gated behind explicit human approval.
The Azure OpenAI data, privacy and security docs were our reference for what stays inside the tenant. The short version: prompts and completions are not used for training, and with abuse monitoring opted out (which the firm qualifies for) nothing leaves the customer's Azure region.
The Visma and Yuki tool surface
Both accounting platforms expose a web API. Visma eAccounting's is OAuth-scoped per company. Yuki uses a session token plus an administration ID. We wrapped each platform behind a small adapter, then registered tools the agent could call. The point was to keep the agent ignorant of which platform a given client used. If the client's books were in Yuki, the agent called get_open_invoices and the adapter routed it. Same for Visma.
The tool definitions, slimmed to three examples:
// agent/tools.ts
export const tools = [
  {
    name: "get_open_invoices",
    description: "Return the client's open sales invoices, oldest first.",
    input_schema: {
      type: "object",
      properties: {
        client_id: { type: "string" },
        as_of: { type: "string", format: "date" },
      },
      required: ["client_id"],
    },
  },
  {
    name: "get_vat_position",
    description: "Return VAT collected vs. paid for the current quarter.",
    input_schema: {
      type: "object",
      properties: {
        client_id: { type: "string" },
        quarter: { type: "string", pattern: "^\\d{4}-Q[1-4]$" },
      },
      required: ["client_id", "quarter"],
    },
  },
  {
    name: "flag_for_partner",
    description: "Escalate the thread to the partner-on-call. Use for Wwft, dispute, or any answer the model is less than 80% confident on.",
    input_schema: {
      type: "object",
      properties: {
        reason: { type: "string", enum: ["wwft", "dispute", "low_confidence", "out_of_scope"] },
        summary: { type: "string", maxLength: 400 },
      },
      required: ["reason", "summary"],
    },
  },
];
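The routing behind these tools can be sketched roughly as follows. This is an illustrative Python sketch, not the firm's code: the AccountingAdapter protocol, the two stub adapter classes, the Invoice shape and the CLIENT_PLATFORM lookup are all hypothetical names, and the stubs return canned data instead of calling Visma or Yuki.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Invoice:
    number: str
    issue_date: str  # ISO 8601 date, assumed already normalised by the adapter
    amount_eur: str  # decimal string, assumed already normalised by the adapter


class AccountingAdapter(Protocol):
    """Hypothetical interface that every platform adapter implements."""

    def get_open_invoices(self, client_id: str) -> list[Invoice]: ...


class VismaAdapter:
    """Stub: a real adapter would call the Visma eAccounting API here."""

    def get_open_invoices(self, client_id: str) -> list[Invoice]:
        return [
            Invoice("V-002", "2024-05-01", "1234.56"),
            Invoice("V-001", "2024-03-10", "80.00"),
        ]


class YukiAdapter:
    """Stub: a real adapter would call the Yuki API here."""

    def get_open_invoices(self, client_id: str) -> list[Invoice]:
        return [Invoice("Y-001", "2024-04-15", "250.00")]


# Hypothetical lookup: which platform holds each client's books.
CLIENT_PLATFORM = {"client-a": "visma", "client-b": "yuki"}
ADAPTERS: dict[str, AccountingAdapter] = {
    "visma": VismaAdapter(),
    "yuki": YukiAdapter(),
}


def get_open_invoices(client_id: str) -> list[Invoice]:
    """Tool entry point: the agent never learns which platform answered."""
    platform = CLIENT_PLATFORM[client_id]
    invoices = ADAPTERS[platform].get_open_invoices(client_id)
    # The tool description promises oldest first, so sort here, not in the agent.
    return sorted(invoices, key=lambda inv: inv.issue_date)
```

With this shape, supporting another platform means adding one class and one ADAPTERS entry; the tool definitions the agent sees stay unchanged.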
The adapter behind each tool is boring code. It is also where 60% of the bugs lived in the first month. Visma returns currency as a string with a comma separator if the company locale is Dutch. Yuki returns it as a float. Date formats disagree across endpoints inside the same platform. We wrote a normalisation layer and tests for it before the agent ever saw the data. If you skip this, the agent will tell a client their VAT position is wrong by a factor of 1,000 because someone parsed "1.234,56" as 1.23456. We have seen that exact bug.
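A minimal sketch of what the amount half of such a normalisation layer might look like, in Python. The function name parse_amount is hypothetical, and the sketch assumes amounts arrive either as numbers or as strings in Dutch style ("1.234,56") or plain style ("1234.56"); anything that looks like English-style grouping is rejected instead of guessed at.

```python
from decimal import Decimal
from typing import Union


def parse_amount(value: Union[str, int, float]) -> Decimal:
    """Normalise a platform amount to Decimal (hypothetical helper)."""
    if isinstance(value, (int, float)):
        # Go through str() so 1234.56 does not pick up binary float noise.
        return Decimal(str(value))
    text = value.strip()
    if "," in text:
        # A dot after the comma ("1,234.56") is not Dutch style: refuse to guess.
        if "." in text and text.rfind(".") > text.rfind(","):
            raise ValueError(f"ambiguous amount format: {value!r}")
        # Dutch locale: dots group thousands, the comma marks decimals.
        text = text.replace(".", "").replace(",", ".")
    return Decimal(text)
```

Under these assumptions, parse_amount("1.234,56") and parse_amount(1234.56) both yield Decimal("1234.56"), and the string "1,234.56" raises an error instead of silently becoming 1.23456.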
The Wwft router
Belgium's anti-money-laundering rules (the law of 18 September 2017, which transposes the EU AML directives; this post uses the shorthand Wwft, the name of the equivalent Dutch law) require accountants to flag specific situations: unusual cash movements, beneficial-ownership changes, certain politically-exposed-person interactions, and a handful of typology patterns the firm tracks per their own risk appetite. The FOD Financiën AML overview covers the framework.
The hard rule from the firm's compliance lead was: if anything in the conversation could touch a Wwft obligation, no automated reply goes out. The partner-on-call sees it first.
We implemented the router in two layers. Layer one is a deterministic regex pass over the inbound message: amounts of €3,000 or more, mentions of specific jurisdictions, beneficial-ownership keywords in Dutch and French. It is cheap and fast, has no model in the loop, and is easy for the compliance officer to audit and edit. Layer two is the agent's own judgement: we instructed it to call flag_for_partner with reason wwft whenever the deterministic pass had missed something but the model still smelled risk.
# router/wwft_pre_filter.py
import re

CASH_THRESHOLD_EUR = 3000

# Matches amounts with thousands separators ("3.000", "12,500") or any bare
# run of four or more digits, with an optional euro sign or "EUR" suffix.
CASH_PATTERN = re.compile(
    r"(?:\b|\u20ac\s?)([0-9]{1,3}(?:[.,][0-9]{3})+|[0-9]{4,})(?:\s?\u20ac| EUR)?",
    re.IGNORECASE,
)

# Dutch and French keywords for cash, beneficial ownership and PEPs.
KEYWORD_PATTERNS = [
    re.compile(r"\bcontant(?:e)?\b", re.IGNORECASE),
    re.compile(r"\besp[\u00e8e]ces\b", re.IGNORECASE),
    re.compile(r"\buiteindelijk\s+begunstig", re.IGNORECASE),
    re.compile(r"\bb[\u00e9e]n[\u00e9e]ficiaire\s+effectif", re.IGNORECASE),
    re.compile(r"\bPEP\b"),
]


def wwft_pre_flag(message: str) -> dict | None:
    """Return an escalation dict if the message trips a rule, else None."""
    for amount_match in CASH_PATTERN.finditer(message):
        # Strip thousands separators before comparing against the threshold.
        raw = amount_match.group(1).replace(".", "").replace(",", "")
        try:
            amount = int(raw)
        except ValueError:
            continue
        if amount >= CASH_THRESHOLD_EUR:
            return {"reason": "wwft", "trigger": "cash_threshold", "amount": amount}
    for pattern in KEYWORD_PATTERNS:
        if pattern.search(message):
            return {"reason": "wwft", "trigger": pattern.pattern}
    return None
If the pre-filter returns a hit, the agent never runs. The message goes straight to the partner-on-call channel in Teams with a one-line summary. The conversation stays open. The client sees "your accountant is reviewing this personally," which is true.
Do not let the agent decide alone whether a Wwft trigger applies. A deterministic pre-filter is auditable. A model is not. The compliance officer should be able to read the full regex list in a single sitting.
The agent loop, with proactivity throttled
The worry we hear most often from founders looking at agents in production is the one about a loop that quietly runs up bills or takes actions behind its operator's back. An agent loop without explicit limits will find creative ways to spend tokens and trust.
We capped this one hard. The loop runs a maximum of six tool calls per user message. Beyond that, it flags for partner with reason low_confidence. Inference costs are tracked per conversation and per client, so the senior partner can see which client questions cost the most to answer. The cap was never hit in production, but it was hit twice in load testing, which is how we discovered a bug where Yuki was returning a paginated response the agent kept trying to walk to page 99.
// agent/loop.ts
// wwftPreFlag, model, runTool, escalate and reply live elsewhere in the agent package.
const MAX_TOOL_CALLS = 6;

async function run(thread: Thread, message: string) {
  // The deterministic Wwft pre-filter runs before any model call.
  if (await wwftPreFlag(message)) return escalate(thread, "wwft");

  let calls = 0;
  let response = await model.respond(thread, message);

  while (response.tool_calls.length) {
    // Enforce the cap before running a batch, so one response with many
    // parallel tool calls cannot push the total past the limit.
    if (calls + response.tool_calls.length > MAX_TOOL_CALLS) {
      return escalate(thread, "low_confidence");
    }
    const results = await Promise.all(
      response.tool_calls.map((c) => runTool(c, thread.clientId)),
    );
    calls += results.length;
    response = await model.respond(thread, message, results);
  }

  if (response.confidence < 0.8) return escalate(thread, "low_confidence");
  return reply(thread, response.text);
}
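The page-99 bug suggests a second, smaller cap inside the adapter itself. One way to guard against runaway pagination is sketched below in Python; it is not necessarily the fix the team shipped. The names fetch_all and PageLimitExceeded are hypothetical, fetch_page stands in for whatever call returns one page of results, and the default limit of five pages is an arbitrary assumption.

```python
from typing import Callable, Iterator


class PageLimitExceeded(RuntimeError):
    """Raised when a listing needs more pages than the adapter allows."""


def fetch_all(fetch_page: Callable[[int], list], max_pages: int = 5) -> Iterator:
    """Yield items page by page, stopping at the first empty page.

    fetch_page(n) is a hypothetical callable that returns the items on
    page n (1-based) and an empty list once the data runs out.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            return
        yield from items
    # Probe one page past the limit; raise only if more data really exists.
    if fetch_page(max_pages + 1):
        raise PageLimitExceeded(f"more than {max_pages} pages")
```

Raising instead of silently truncating lets the tool call fail loudly, so the loop can escalate the thread instead of answering from partial data.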
Eight weeks in, the numbers
The agent went live for the first ten clients in week six of the build. Full rollout was week ten. At eight weeks post-rollout, the firm reported:
- 980 weekly portal questions, of which 71% answered by the agent without partner touch.
- 22% escalated to associates (mostly long-tail or low confidence).
- 7% escalated to partner-on-call (Wwft, disputes, complex provisions).
- Average first-response time fell from 9.4 hours to 11 minutes.
- Two associates moved from portal triage onto advisory work full time.
The number we were most nervous about was the Wwft escalation rate. The senior partner had told us flatly: if the agent silently answers one cash-deposit question wrong, the build is a failure regardless of any other metric. In production the pre-filter has caught every Wwft-relevant message a partner has gone back and audited. The agent's own flag_for_partner with reason wwft has fired 19 times, of which the partner agreed with 14. The five false positives went into the prompt and the regex list as negative examples.
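For readers who want the figures above as absolute numbers, the arithmetic works out as in this short Python sketch. It uses only numbers quoted in this post; the variable names are ours, and the weekly counts are rounded estimates derived from the percentages, not figures the firm reported.

```python
WEEKLY_QUESTIONS = 980
SHARES = {"agent": 71, "associate": 22, "partner": 7}  # percentages reported above

# Approximate weekly message counts behind each percentage.
weekly_counts = {
    who: round(WEEKLY_QUESTIONS * pct / 100) for who, pct in SHARES.items()
}

# Agent-raised Wwft flags: 19 fired, and the partner agreed with 14 of them.
flags_fired = 19
flags_confirmed = 14
agent_flag_precision = flags_confirmed / flags_fired
false_positives = flags_fired - flags_confirmed

print(weekly_counts)
print(f"agent flag precision: {agent_flag_precision:.0%}")
print(f"false positives: {false_positives}")
```

That is roughly 696 questions a week answered by the agent, 216 passed to associates and 69 to the partner-on-call, with the agent's own Wwft flag correct about 74% of the time.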
What we would do differently
Two things. First, we should have built the partner dashboard before we built the agent. For the first two weeks of production the partners had to read raw Teams threads to audit what the agent had done. A simple table of "today's escalations, today's auto-answers, today's flagged-and-resolved" would have saved them a lot of time. We shipped it in week three. It should have been week zero.
Second, we underestimated how much the agent's tone mattered. The first version answered like a textbook. Clients did not love it. The senior partner spent a Saturday with us rewriting the system prompt around how he actually talks to clients (short sentences, no jargon, ask one clarifying question before answering anything ambiguous). The acceptance rate climbed eleven points the following week.
An accountancy chat agent is two systems in a trench coat: a deterministic compliance router that never lets a Wwft message reach the model, and a confidence-capped agent loop that escalates anything it is not sure about.
The five-minute audit you can run on Monday
If you run a firm with a portal queue and you are thinking about an agent, do this before you talk to anyone (us or otherwise). Pull a week of portal conversations. Tag them by hand into five to eight categories. Calculate what share of the volume the top three categories represent. If it is above 60%, an agent has somewhere to start. If it is below 40%, the queue is a symptom of something else (probably client onboarding) and an agent will not fix it. When we built this chat agent for the Mechelen firm the answer was 65% (the 31%, 22% and 12% categories listed earlier), and that was the number that made the project go ahead.
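Once the tagging is done, the calculation takes a few lines. The Python sketch below is one way to do it; the function names and the sample tags are made up for illustration, and the "judgement call" band between 40% and 60% is our own placeholder, since the rule of thumb above only speaks to the two ends.

```python
from collections import Counter


def top_three_share(tags: list[str]) -> float:
    """Share of tagged conversations covered by the three largest categories."""
    if not tags:
        raise ValueError("no tagged conversations")
    counts = Counter(tags)
    top = sum(n for _, n in counts.most_common(3))
    return top / len(tags)


def verdict(share: float) -> str:
    if share > 0.60:
        return "an agent has somewhere to start"
    if share < 0.40:
        return "look upstream first (probably client onboarding)"
    return "judgement call"  # assumption: the post gives no rule for 40-60%


# Made-up sample: one tag per conversation, as produced by hand-tagging.
sample = ["vat"] * 8 + ["upload"] * 5 + ["payroll"] * 3 + ["misc_a"] * 2 + ["misc_b"] * 2
share = top_three_share(sample)
print(f"{share:.0%}: {verdict(share)}")
```

One caution when tagging: do not lump the long tail into a single "other" category, or it may land in the top three and inflate the share.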
Key takeaway
An accountancy chat agent is a deterministic Wwft router and a confidence-capped agent loop in one trench coat. Build the router first.
FAQ
Why keep everything inside the firm's own Microsoft 365 tenant?
Belgian and European data rules, plus the fact that portal messages cover beneficial ownership, salaries and sometimes medical context. A SaaS chat widget shipping transcripts to a US vendor is not defensible to the firm's compliance lead.
Why a deterministic regex pre-filter for Wwft instead of letting the model decide?
A regex list is auditable by a compliance officer in one sitting. A model is not. The pre-filter runs before the model, so a flagged message never reaches inference at all.
How do you stop the agent from looping on tool calls forever?
Cap tool calls per user message (we used six) and require a confidence threshold on the final answer. Anything that exceeds the cap or falls under the threshold escalates to a human, not back into the loop.
Does this only work with Visma eAccounting and Yuki?
No. The agent calls platform-agnostic tools like get_open_invoices. Adapters route to whichever platform the client uses. Adding Exact Online or Twinfield is an adapter and a credential, not an agent rewrite.